git on MacOSX and files with decomposed utf-8 file names

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* git on MacOSX and files with decomposed utf-8 file names
@ 2008-01-16 15:17 Mark Junker
  2008-01-16 15:34 ` Johannes Schindelin
  2008-01-17  4:43 ` Jay Soffian
  0 siblings, 2 replies; 260+ messages in thread
From: Mark Junker @ 2008-01-16 15:17 UTC (permalink / raw)
  To: git

Hi,

I have some files like "Lüftung.txt" in my repository. The strange thing 
is that I can pull / add / commit / push those files without problem but 
git-status always complains that thoes files are untraced (but not 
missing). My assumption is that it's a problem with the way MacOSX 
stores the file names (decomposed UTF-8). So something like 
"Lüftung.txt" becomes "Lüftung.txt".

It seems that git-status does two things:
1. Find files under version control (i.e. search for missing files)
2. Find files not under version control (i.e. search for untracked files)

I guess that the first look-up succeeds because MacOS X converts 
composed UTF-8 to decomposed UTF-8 when searching for a file. But it 
seems that the second look-up takes the file names as-is (decomposed) 
without converting them to composed UTF-8.

Is there an easy way to fix this behaviour? It's really annoying to see 
all those "untracked" files that are already under version control when 
executing a git-status.

Regards,
Mark

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:17 git on MacOSX and files with decomposed utf-8 file names Mark Junker
@ 2008-01-16 15:34 ` Johannes Schindelin
  2008-01-16 15:43   ` Kevin Ballard
  2008-01-17  4:43 ` Jay Soffian
  1 sibling, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-16 15:34 UTC (permalink / raw)
  To: Mark Junker; +Cc: git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 590 bytes --]

Hi,

On Wed, 16 Jan 2008, Mark Junker wrote:

> I have some files like "Lüftung.txt" in my repository. The strange thing is
> that I can pull / add / commit / push those files without problem but
> git-status always complains that thoes files are untraced (but not missing).

This is a known problem.  Unfortunately, noone has implemented a fix, 
although if you're serious about it, I can point you to threads where it 
has been hinted how to solve the issue.

FWIW the issue is that Mac OS X decides that it knows better how to encode 
your filename than you could yourself.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:34 ` Johannes Schindelin
@ 2008-01-16 15:43   ` Kevin Ballard
  2008-01-16 16:32     ` Johannes Schindelin
                       ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-16 15:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 977 bytes --]

On Jan 16, 2008, at 10:34 AM, Johannes Schindelin wrote:

> On Wed, 16 Jan 2008, Mark Junker wrote:
>
>> I have some files like "Lüftung.txt" in my repository. The strange  
>> thing is
>> that I can pull / add / commit / push those files without problem but
>> git-status always complains that thoes files are untraced (but not  
>> missing).
>
> This is a known problem.  Unfortunately, noone has implemented a fix,
> although if you're serious about it, I can point you to threads  
> where it
> has been hinted how to solve the issue.
>
> FWIW the issue is that Mac OS X decides that it knows better how to  
> encode
> your filename than you could yourself.


More like, Mac OS X has standardized on Unicode and the rest of the  
world hasn't caught up yet. Git is the only tool I've ever heard of  
that has a problem with OS X using Unicode.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:43   ` Kevin Ballard
@ 2008-01-16 16:32     ` Johannes Schindelin
  2008-01-16 16:46       ` Jakub Narebski
  2008-01-16 22:37       ` Eyvind Bernhardsen
  2008-01-16 23:03     ` Wincent Colaiuta
  2008-01-17  7:29     ` Miles Bader
  2 siblings, 2 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-16 16:32 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Mark Junker, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1367 bytes --]

Hi,

On Wed, 16 Jan 2008, Kevin Ballard wrote:

> On Jan 16, 2008, at 10:34 AM, Johannes Schindelin wrote:
> 
> > On Wed, 16 Jan 2008, Mark Junker wrote:
> > 
> > > I have some files like "Lüftung.txt" in my repository. The strange 
> > > thing is that I can pull / add / commit / push those files without 
> > > problem but git-status always complains that thoes files are 
> > > untraced (but not missing).
> > 
> > This is a known problem.  Unfortunately, noone has implemented a fix, 
> > although if you're serious about it, I can point you to threads where 
> > it has been hinted how to solve the issue.
> > 
> > FWIW the issue is that Mac OS X decides that it knows better how to 
> > encode your filename than you could yourself.
> 
> More like, Mac OS X has standardized on Unicode and the rest of the 
> world hasn't caught up yet. Git is the only tool I've ever heard of that 
> has a problem with OS X using Unicode.

No.  That's not at all the problem.  Mac OS X insists on storing _another_ 
encoding of your filename.  Both are UTF-8.  Both encode the _same_ 
string.  Yet they are different, bytewise.  For no good reason.

Stop spreading FUD.  Git can handle Unicode just fine.  In fact, Git does 
not _care_ how the filename is encoded, it _respects_ the user's choice, 
not only of the encoding _type_, but the _encoding_, too.

Okay?

Hth,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 16:32     ` Johannes Schindelin
@ 2008-01-16 16:46       ` Jakub Narebski
  2008-01-16 20:39         ` Kevin Ballard
  2008-01-16 22:37       ` Eyvind Bernhardsen
  1 sibling, 1 reply; 260+ messages in thread
From: Jakub Narebski @ 2008-01-16 16:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kevin Ballard, Mark Junker, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> > On Jan 16, 2008, at 10:34 AM, Johannes Schindelin wrote:
> > 
> > > On Wed, 16 Jan 2008, Mark Junker wrote:
> > > 
> > > > I have some files like "Lüftung.txt" in my repository. The strange 
> > > > thing is that I can pull / add / commit / push those files without 
> > > > problem but git-status always complains that thoes files are 
> > > > untraced (but not missing).
> > > 
> > > This is a known problem.  Unfortunately, noone has implemented a fix, 
> > > although if you're serious about it, I can point you to threads where 
> > > it has been hinted how to solve the issue.
> > > 
> > > FWIW the issue is that Mac OS X decides that it knows better how to 
> > > encode your filename than you could yourself.
> > 
> > More like, Mac OS X has standardized on Unicode and the rest of the 
> > world hasn't caught up yet. Git is the only tool I've ever heard of that 
> > has a problem with OS X using Unicode.
> 
> No.  That's not at all the problem.  Mac OS X insists on storing _another_ 
> encoding of your filename.  Both are UTF-8.  Both encode the _same_ 
> string.  Yet they are different, bytewise.  For no good reason.

To be more exact encoding used to _create_ file differs from encoding
returned when _reading directory_... 
 
> Stop spreading FUD.  Git can handle Unicode just fine.  In fact, Git does 
> not _care_ how the filename is encoded, it _respects_ the user's choice, 
> not only of the encoding _type_, but the _encoding_, too.

...which means that sequence of bytes differ. And Git by design is
(both for filenames and for blob contents) encoding agnostic.

HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
filesystems (e.g. case insensitive filesystems) well.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 16:46       ` Jakub Narebski
@ 2008-01-16 20:39         ` Kevin Ballard
  2008-01-16 21:51           ` Jakub Narebski
  2008-01-16 23:52           ` Dmitry Potapov
  0 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-16 20:39 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Johannes Schindelin, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 3763 bytes --]

On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:

>>> More like, Mac OS X has standardized on Unicode and the rest of the
>>> world hasn't caught up yet. Git is the only tool I've ever heard  
>>> of that
>>> has a problem with OS X using Unicode.
>>
>> No.  That's not at all the problem.  Mac OS X insists on storing  
>> _another_
>> encoding of your filename.  Both are UTF-8.  Both encode the _same_
>> string.  Yet they are different, bytewise.  For no good reason.
>
> To be more exact encoding used to _create_ file differs from encoding
> returned when _reading directory_...
>
>> Stop spreading FUD.  Git can handle Unicode just fine.  In fact,  
>> Git does
>> not _care_ how the filename is encoded, it _respects_ the user's  
>> choice,
>> not only of the encoding _type_, but the _encoding_, too.
>
> ...which means that sequence of bytes differ. And Git by design is
> (both for filenames and for blob contents) encoding agnostic.
>
> HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
> filesystems (e.g. case insensitive filesystems) well.

There's two different ways to do filesystem encodings. One is to have  
the fs simply not care about encoding, which is what the linux world  
seems to prefer. Sure, this is great in that what you create the file  
with is what you get back, but on the other hand, given an arbitrary  
non-ASCII file on disk, you have absolutely no idea what the encoding  
should be and you can't display it without making assumptions (yes you  
can use heuristics, but you're still making assumptions). Filesystems  
like HFS+ that standardize the encoding, on the other hand, make it  
such that you always know what the encoding of a file should be, so  
you can always display and use the filename intelligently. It also  
means it plays much nicer in a non-ASCII world, since you don't have  
to worry about different normalizations of a given string referring to  
different files (it's one thing to be case-sensitive, but claiming  
that "föo" and "föo" are different files just because one uses a  
composed character and the other doesn't is extremely user- 
unfriendly). On the other hand, what you create the file with may not  
be what you read back later, since the name has been standardized.  
It's hard to say one is better than the other, they're just different  
ways of doing it. However, I have noticed that everybody who's voiced  
an opinion on this list in favor of the encoding-agnostic approach  
seem to be unwilling to accept that any other approach might have  
validity, to the extent of calling an OS/filesystem that does things  
different stupid or insane. This strikes me as extremely elitist and  
risks alienating what I expect to be a fast-growing group of users  
(i.e. OS X users).

I'm willing to give Linus a free pass on calling other OS's stupid and  
insane, as I don't think Linux would exist as it does today without  
his strong opinions, but I don't think this should give carte blanche  
to the rest of the community for this inflammatory behavior.

I should note that I'm only taking the time to discuss this because,  
despite the fact that I'm new to git, I really like it and I want it  
to work better. And one area that it has a problem with is the de- 
facto filesystem on my OS of choice. However, attempts to discuss the  
problem invariable end up with multiple people calling my OS stupid  
and insane simply because it differs in a particular design decision.  
This is not a good way to build a community or to build a better  
product, and I hope it can be improved.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 20:39         ` Kevin Ballard
@ 2008-01-16 21:51           ` Jakub Narebski
  2008-01-16 22:06             ` Kevin Ballard
  2008-01-16 23:52           ` Dmitry Potapov
  1 sibling, 1 reply; 260+ messages in thread
From: Jakub Narebski @ 2008-01-16 21:51 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:
>>>> More like, Mac OS X has standardized on Unicode and the rest of the
>>>> world hasn't caught up yet. Git is the only tool I've ever heard  
>>>> which has a problem with OS X using Unicode.
>>>
>>> No.  That's not at all the problem.  Mac OS X insists on storing  
>>> _another_  encoding of your filename.  Both are UTF-8.  Both encode
>>> the _same_ string.  Yet they are different, bytewise.  For no good
>>> reason. 
>>
>> To be more exact encoding used to _create_ file differs from encoding
>> returned when _reading directory_...
>>
>>> Stop spreading FUD.  Git can handle Unicode just fine.  In fact,  
>>> Git does not _care_ how the filename is encoded, it _respects_ the
>>> user's choice, not only of the encoding _type_, but the _encoding_,
>>> too. 
>>
>> ...which means that sequence of bytes differ. And Git by design is
>> (both for filenames and for blob contents) encoding agnostic.
>>
>> HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
>> filesystems (e.g. case insensitive filesystems) well.

By the way, calling HFS+ stupid, or rather calling at least two 
different normalizations of UTF-8 (two different encodings) used for 
writing and reading filenames stupid is wrong _for me_. I have quoted 
Linus here, when I think I should use other description.
 
> There's two different ways to do filesystem encodings. One is to have  
> the fs simply not care about encoding, which is what the linux world  
> seems to prefer. Sure, this is great in that what you create the file  
> with is what you get back, but on the other hand, given an arbitrary  
> non-ASCII file on disk, you have absolutely no idea what the encoding  
> should be and you can't display it without making assumptions (yes you  
> can use heuristics, but you're still making assumptions). Filesystems  
> like HFS+ that standardize the encoding, on the other hand, make it  
> such that you always know what the encoding of a file should be, so  
> you can always display and use the filename intelligently. It also  
> means it plays much nicer in a non-ASCII world, since you don't have  
> to worry about different normalizations of a given string referring to  
> different files (it's one thing to be case-sensitive, but claiming  
> that "föo" and "föo" are different files just because one uses a  
> composed character and the other doesn't is extremely user- 
> unfriendly).

For me it looks like a layering violation... but my knowledge about 
filesystem is cluse to nil. IMHO it is VFS and libc which should do the 
translating.

> On the other hand, what you create the file with may not   
> be what you read back later, since the name has been standardized.  
> It's hard to say one is better than the other, they're just different  
> ways of doing it.

But using one encoding to create file, and another when reding filenames 
is strange. It is IMHO better to simply refuse creating filenames which 
are outside chosen encoding / normalization. But having different 
encodings used for reading and writing on the level of filesystem 
access (not on level of UI) is strange.

> However, I have noticed that everybody who's voiced   
> an opinion on this list in favor of the encoding-agnostic approach  
> seem to be unwilling to accept that any other approach might have  
> validity, to the extent of calling an OS/filesystem that does things  
> different stupid or insane. This strikes me as extremely elitist and  
> risks alienating what I expect to be a fast-growing group of users  
> (i.e. OS X users).

First, it is Git philosophy and very core of design to be encoding 
agnostic (to be "content tracker"). Second, using the same sequence of 
bytes on filesystem, in the index, and in 'tree' objects ensures good 
performance... this is something to think about if you want to add 
patches which would deal with HFS+ API/UI quirks.

[cut]
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 21:51           ` Jakub Narebski
@ 2008-01-16 22:06             ` Kevin Ballard
  2008-01-16 22:23               ` Johannes Schindelin
  2008-01-16 22:32               ` Linus Torvalds
  0 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-16 22:06 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Johannes Schindelin, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 2687 bytes --]

On Jan 16, 2008, at 4:51 PM, Jakub Narebski wrote:

>> On the other hand, what you create the file with may not
>> be what you read back later, since the name has been standardized.
>> It's hard to say one is better than the other, they're just different
>> ways of doing it.
>
> But using one encoding to create file, and another when reding  
> filenames
> is strange. It is IMHO better to simply refuse creating filenames  
> which
> are outside chosen encoding / normalization. But having different
> encodings used for reading and writing on the level of filesystem
> access (not on level of UI) is strange.

It's not using different encodings, it's all Unicode. However, it  
accepts different normalization variants of Unicode, since it can read  
them all and it would be folly to require everybody to conform to its  
own special internal variant. But it does have to normalize them,  
otherwise how would it detect the same filename using different  
normalizations? Also, it may seem strange to have different names  
between reading and writing, but that's only if you think of the name  
as a sequence of bytes - when treated as a sequence of characters, you  
get the same result. In other words, you're used to filenames as  
bytes, HFS+ treats filenames as strings.

>> However, I have noticed that everybody who's voiced
>> an opinion on this list in favor of the encoding-agnostic approach
>> seem to be unwilling to accept that any other approach might have
>> validity, to the extent of calling an OS/filesystem that does things
>> different stupid or insane. This strikes me as extremely elitist and
>> risks alienating what I expect to be a fast-growing group of users
>> (i.e. OS X users).
>
> First, it is Git philosophy and very core of design to be encoding
> agnostic (to be "content tracker"). Second, using the same sequence of
> bytes on filesystem, in the index, and in 'tree' objects ensures good
> performance... this is something to think about if you want to add
> patches which would deal with HFS+ API/UI quirks.

Sure, it makes sense from a performance perspective, but it causes  
problems with HFS+ and any other filesystem that behaves the same way.  
In the previous discussion about case-sensitivity, somebody suggested  
using a lookup table to map between git's internal representation and  
the name the filesystem returns, which seems like a decent idea and  
one that could be enabled with a config parameter to avoid penalizing  
repos on other filesystems. But I don't know enough about the  
internals of git to even think of trying to implement it myself.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 22:06             ` Kevin Ballard
@ 2008-01-16 22:23               ` Johannes Schindelin
  2008-01-16 23:16                 ` Kevin Ballard
  2008-01-16 22:32               ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-16 22:23 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Mark Junker, git

Hi,

On Wed, 16 Jan 2008, Kevin Ballard wrote:

> It's not using different encodings, it's all Unicode.

But that's the _point_!  It _is_ Unicode, yet it uses _different_ 
encodings of the _same_ string.

Now, this discussion gets really annoying.  The real question is: will you 
do something about it, or reply with another 500-line email?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 22:06             ` Kevin Ballard
  2008-01-16 22:23               ` Johannes Schindelin
@ 2008-01-16 22:32               ` Linus Torvalds
  2008-01-16 22:52                 ` Linus Torvalds
  2008-01-16 23:11                 ` Kevin Ballard
  1 sibling, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-16 22:32 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> It's not using different encodings, it's all Unicode. However, it accepts
> different normalization variants of Unicode, since it can read them all and it
> would be folly to require everybody to conform to its own special internal
> variant. But it does have to normalize them, otherwise how would it detect the
> same filename using different normalizations?

That's a singularly *stupid* argument.

Here, let me rephrase that same idiotic argument:

  "But it does have to uppercase them, otherwise how would it detect the 
   same filename using different cases?"

..and if you don't see how that's *exactly* the same argument, you really 
are stupid.

The fact is, normalization is wrong.

It's wrong when you normalize upper/lower case (no, the word "Polish" is 
not the same as "polish"), and it's equally wrong when you normalize for 
"looks similar".

> In other words, you're used to filenames as bytes, HFS+ treats filenames 
> as strings.

No. HFS+ treats users as idiots and thinks that it should "fix" the 
filename for them. And it causes problems.

It causes problems for exactly the same reasons case-independence causes 
problems, because it's EXACTLY THE SAME ISSUE. People may think that "but 
they are the same", but they aren't. Case matters. And so does "single 
character" vs "two character overlay". 

Does it always matter? Hell no. But the problem with a filesystem that 
thinks it knows better is that when it *sometimes* matters, the filesystem 
simply DOES THE WRONG THING.

Can't you understand that?

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 16:32     ` Johannes Schindelin
  2008-01-16 16:46       ` Jakub Narebski
@ 2008-01-16 22:37       ` Eyvind Bernhardsen
  1 sibling, 0 replies; 260+ messages in thread
From: Eyvind Bernhardsen @ 2008-01-16 22:37 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kevin Ballard, Mark Junker, git

On 16. jan.. 2008, at 17.32, Johannes Schindelin wrote:

>>> FWIW the issue is that Mac OS X decides that it knows better how to
>>> encode your filename than you could yourself.
>>
>> More like, Mac OS X has standardized on Unicode and the rest of the
>> world hasn't caught up yet. Git is the only tool I've ever heard of  
>> that
>> has a problem with OS X using Unicode.
>
> No.  That's not at all the problem.  Mac OS X insists on storing  
> _another_
> encoding of your filename.  Both are UTF-8.  Both encode the _same_
> string.  Yet they are different, bytewise.  For no good reason.
>
> Stop spreading FUD.  Git can handle Unicode just fine.  In fact, Git  
> does
> not _care_ how the filename is encoded, it _respects_ the user's  
> choice,
> not only of the encoding _type_, but the _encoding_, too.

"FUD" is a bit strong, don't you think?  HFS+ is the way it is and it  
would be nice if Git could deal with it.

The problem is that HFS+ normalizes filenames to avoid multiple files  
that appear to have the same name (eg "M<A WITH UMLAUT>rchen" vs  
"Ma<UMLAUT MODIFIER>rchen", in gitweb/test).  This is sort of like  
case sensitivity, but filenames are normalized when a file is  
_created_.  Git, not unreasonably, expects a file to keep the name it  
was created with.

As far as I can tell, as long as you add all your internationally  
becharactered files to git from an HFS+ file system using a gui or  
command-line completion, you'll be okay; trouble starts when you check  
in a file with the composed form of a character, by typing the name on  
the command line (I'm not sure about this one) or committing on  
another OS.  Git will store the filename in composed form, but the  
Mac's filesystem will decompose the filename when you check the file  
out.

The result looks like this:

vredefort:[git]% git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to  
track)

(this is directly after checking out git.git @ v1.5.4-rc3)

There are two things to note here.  One is that Git thinks that there  
is a new file called "gitweb/test/Märchen" (decomposed) when it's  
"really" just the same "gitweb/test/Märchen" (precomposed) that's in  
the repository.  The other is that git _thinks_ that the "gitweb/test/ 
Märchen" (precomposed) it's expecting is still there, because the  
filesystem, when asked for "gitweb/test/Märchen" in any form will  
return the file "gitweb/test/Märchen" (decomposed).

Trying to check out the "next" branch at this point is a pain since  
next's "Märchen" would overwrite the untracked "Märchen".

I can't provide links to any previous discussions about this, but  
here's Apple's Technical Q&A on the subject:

http://developer.apple.com/qa/qa2001/qa1235.html

Finding a sane way of allowing git to handle this behaviour is left as  
an exercise for the reader.

Eyvind Bernhardsen

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 22:32               ` Linus Torvalds
@ 2008-01-16 22:52                 ` Linus Torvalds
  2008-01-16 23:11                 ` Kevin Ballard
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-16 22:52 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Linus Torvalds wrote:
> 
> Does it always matter? Hell no. But the problem with a filesystem that 
> thinks it knows better is that when it *sometimes* matters, the filesystem 
> simply DOES THE WRONG THING.
> 
> Can't you understand that?

Side note: there are ways to do it right.

You can:

 - not do conversion at all (which is always right). Not corrupting the 
   user data means that the user never gets something back that he didn't 
   put in

   (And, btw, the "security" argument is total BS. The fact that two 
   characters look the same does not mean that they should act the same, 
   and it is *not* a security feature. Quite the reverse. Having programs 
   that get different results back from what they actually wrote, *that* 
   tends to be a security issue, because now you have a confused program, 
   and I guarantee that there are more bugs in unexpected cases than in 
   the expected ones)

 - Not accept data in formats that you don't like. This is also always 
   right, but can be rather impolite.

 - Not accept data in formats that you don't like, and give people 
   explicit conversion and comparison routines so that they can then make 
   their own decisions and they are *aware* of the conversion (so that 
   they don't come back to the problem of being confused)

So there are certainly many ways to handle things like this.

The one thing you shouldn't do is to silently convert data behind the 
programs back, without even giving any way to disable it (and that disable 
has to be on a use-by-use casis, not some "disable/enable for all users of 
this filesystem", because you can - and do - have different programs that 
have different expectations).

And finally: all of the above is true at *all* levels. It doesn't matter 
one whit whether the automatic conversion conversion is in the kernel or 
in a library. Doing it on a library level has advantages (namely the whole 
"disable/enable" thing tends to get *much* easier to do, and applications 
can decide to link against a particular version to get the behaviour 
*they* want, for example).

So doing it inside the kernel is just about the worst possible case, 
exactly because it makes it really hard to do a "on a case-by-case" basis. 

Yes, Linux does it too, but it does it only for filesystems that are 
*defined* to be insane. OS X really should have known better. Especially 
since they already fixed the applications (ie they do allow for 
case-sensitive filesystems).

I can understand normalization when it's about case-insensitivity (there 
are lots of _technical_ reasons to do it there), but once you let the 
case-insensitivity go, there just isn't any excuse any more.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:43   ` Kevin Ballard
  2008-01-16 16:32     ` Johannes Schindelin
@ 2008-01-16 23:03     ` Wincent Colaiuta
  2008-01-17  7:29     ` Miles Bader
  2 siblings, 0 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-16 23:03 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, Mark Junker, git

El 16/1/2008, a las 16:43, Kevin Ballard escribió:

> On Jan 16, 2008, at 10:34 AM, Johannes Schindelin wrote:
>
>> On Wed, 16 Jan 2008, Mark Junker wrote:
>>
>>> I have some files like "Lüftung.txt" in my repository. The strange  
>>> thing is
>>> that I can pull / add / commit / push those files without problem  
>>> but
>>> git-status always complains that thoes files are untraced (but not  
>>> missing).
>>
>> This is a known problem.  Unfortunately, noone has implemented a fix,
>> although if you're serious about it, I can point you to threads  
>> where it
>> has been hinted how to solve the issue.
>>
>> FWIW the issue is that Mac OS X decides that it knows better how to  
>> encode
>> your filename than you could yourself.
>
> More like, Mac OS X has standardized on Unicode and the rest of the  
> world hasn't caught up yet. Git is the only tool I've ever heard of  
> that has a problem with OS X using Unicode.

As far as I know, Subversion has basically exactly the same problem,  
and any time you consume/produce files on Mac OS X that are be  
consumed/produced on other platforms you will run into this kind of  
issue, with any software.

Tell Mac OS X to write a file with "ó" in the file name ("\xc3\xb3" in  
UTF-8), and it will "normalize" it prior to writing by converting it  
into a decomposed form (that is, ASCII "o" followed by "\xcc\x81", or  
"combining acute accent"). So they're both valid Unicode, both valid  
UTF-8, and they encode exactly the same characters but the byte stream  
is different.

If you only work on Mac OS X then this will never be a problem because  
all the files you create and therefore all the files you add to your  
Git repository will have their names in decomposed UTF-8. But when you  
start cloning repositories containing files added on other systems,  
systems which might use precomposed rather than decomposed UTF-8 then  
you'll run into exactly this kind of problem. The git.git repo has one  
such file itself (gitweb/test/Märchen, if I remember correctly, which  
Git reports as untracked).

Now, Mac OS X's behaviour is not entirely "insane" as some would  
claim; there is indeed a rationale behind it even if you don't agree  
with it, but it *does* produce some unfortunate teething problems for  
people wanting to use Mac OS X in a cross-platform environment.

Here are some Apple docs on the subject:

http://developer.apple.com/qa/qa2001/qa1173.html

http://developer.apple.com/qa/qa2001/qa1235.html

I personally wish that UTF-8 didn't allow different normalization  
forms; then this kind of problem wouldn't arise. But it has arisen and  
we have to live with it. Some workarounds have been proposed for Git,  
but I haven't seen any convincing proposals yet.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 22:32               ` Linus Torvalds
  2008-01-16 22:52                 ` Linus Torvalds
@ 2008-01-16 23:11                 ` Kevin Ballard
  2008-01-16 23:38                   ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-16 23:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 5125 bytes --]

On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>>
>> It's not using different encodings, it's all Unicode. However, it  
>> accepts
>> different normalization variants of Unicode, since it can read them  
>> all and it
>> would be folly to require everybody to conform to its own special  
>> internal
>> variant. But it does have to normalize them, otherwise how would it  
>> detect the
>> same filename using different normalizations?
>
> That's a singularly *stupid* argument.
>
> Here, let me rephrase that same idiotic argument:
>
>  "But it does have to uppercase them, otherwise how would it detect  
> the
>   same filename using different cases?"
>
> ..and if you don't see how that's *exactly* the same argument, you  
> really
> are stupid.

You're right, it doesn't actually have to store the normalized form.  
And yes, it's possible to compare without normalizing them.  
Admittedly, I don't know much about the implementation details of  
unicode, but I would assume that the easiest way to compare two  
strings is to normalize them first. But in the case of the filesystem,  
normalization actually is important if you're thinking about filenames  
in terms of characters rather than bytes. When I feed the filesystem a  
given unicode string, it has to find the file I'm talking about -  
should it do a relatively expensive unicode-sensitive comparison of  
all the filenames with the one I gave it, or should it just normalize  
all names and do the much cheaper lookup that way? I don't know about  
you, but I'd prefer to let my filesystem normalize the name and run  
faster.

> The fact is, normalization is wrong.
>
> It's wrong when you normalize upper/lower case (no, the word  
> "Polish" is
> not the same as "polish"), and it's equally wrong when you normalize  
> for
> "looks similar".

There's a difference between "looks similar" as in "Polish" vs  
"polish", and actually is the same string as in "Ma<UMLAUT  
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid  
semantic meaning, normalization doesn't. The only way to argue that  
normalization is wrong is by providing a good reason to preserve the  
exact byte sequence, and so far the only reason I've seen is to help  
git. Applications in general don't care one whit about the byte  
sequence of the filename, they care about the underlying file the name  
represents. Additionally, it would be a terrible experience for a user  
to enter "Märchen" and have the application say "sorry, I can't find  
this file" simply because the application used decomposed characters  
and the filename used composed characters. Unless the user is  
knowledgeable about the OS, filesystems, and unicode, they wouldn't  
have a hope of figuring out what the problem was.

>
>> In other words, you're used to filenames as bytes, HFS+ treats  
>> filenames
>> as strings.
>
> No. HFS+ treats users as idiots and thinks that it should "fix" the
> filename for them. And it causes problems.

How do you figure? When I type "Märchen", I'm typing a string, not a  
byte sequence. I have no control over the normalization of the  
characters. Therefore, depending on what program I'm typing the name  
in, I might use the same normalization as the filename, or I might  
miss. It's completely out of my control. This is why the filesystem  
has to step in and say "You composed that character differently, but I  
know you were trying to specify this file".

> It causes problems for exactly the same reasons case-independence  
> causes
> problems, because it's EXACTLY THE SAME ISSUE. People may think that  
> "but
> they are the same", but they aren't. Case matters. And so does "single
> character" vs "two character overlay".

There are valid reasons for case to matter, but what reason is there  
for "single character" vs" two character overlay" to matter in  
filenames? They're different representations of the exact same string,  
and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user  
cares about the byte sequence that represents the filename, which is  
wrong. The user has no idea what the byte sequence is - the user cares  
about the string. Normalization is meant to help computers, not users,  
and claiming that different normalizations of the same string produces  
different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them  
called "Märchen", but one precomposed and one decomposed, how would  
you specify which one you wanted? Unless Linux has a special text  
input system which gives the user control over the normalization of  
their typed characters, you'd have to write out the UTF-8 bytes  
manually.

I just don't understand this insistence on treating the specific byte  
sequence that makes up the filename as significant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 22:23               ` Johannes Schindelin
@ 2008-01-16 23:16                 ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-16 23:16 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jakub Narebski, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 934 bytes --]

On Jan 16, 2008, at 5:23 PM, Johannes Schindelin wrote:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>
>> It's not using different encodings, it's all Unicode.
>
> But that's the _point_!  It _is_ Unicode, yet it uses _different_
> encodings of the _same_ string.
>
> Now, this discussion gets really annoying.  The real question is:  
> will you
> do something about it, or reply with another 500-line email?

I wish I could do something about it. But right now I'm a full-time  
student trying to do contracting jobs on the side, and I don't believe  
I have the time to learn enough about the guts of git to try and make  
any changes to something as core as index filename handling. I just  
want people here to recognize that this is a valid problem instead of  
simply dismissing it as "HFS+ is insane, lets just ignore this issue".

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:11                 ` Kevin Ballard
@ 2008-01-16 23:38                   ` Linus Torvalds
  2008-01-16 23:57                     ` Pedro Melo
                                       ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-16 23:38 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> There's a difference between "looks similar" as in "Polish" vs "polish", and
> actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH
> UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization
> doesn't. 

That simply isn't true.

Normalization actually has real semantic meaning. If it didn't, there 
would never ever be a reason why you'd use the non-normalized form in the 
first place.

Others have argued the exact same thing for capitalization. "A" is the 
same letter as "a". Except there is a distinction.

The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's 
the same "chacter" in either case. Except when there is a distinction.

And there *are* cases where there are distinctions. Especially inside 
computers. For one thing, you may not be talking about "characters on 
screen", but you may be talking about "key sequences". And suddenly 
"a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a 
single-key sequence, and THEY ARE DIFFERENT.

See?

"a" and "A" are the same letter. But sometimes case matters.

Multi-character UTF-8 sequences may be the same character. But sometimes 
the sequence matters.

Same exact thing.

>	The only way to argue that normalization is wrong is by providing a
> good reason to preserve the exact byte sequence, and so far the only reason
> I've seen is to help git.

Git doesn't care. Just use the *same* sequence everywhere. Make sure 
something doesn't change it. Because if something changes it, git will 
track it.

> How do you figure? When I type "Märchen", I'm typing a string, not a byte
> sequence. I have no control over the normalization of the characters.
> Therefore, depending on what program I'm typing the name in, I might use the
> same normalization as the filename, or I might miss. It's completely out of my
> control. This is why the filesystem has to step in and say "You composed that
> character differently, but I know you were trying to specify this file".

Pure and utter garbage.

What you are describing is an *input method* issue, not a filesystem 
issue.

The fact that you think this has anything what-so-ever to do with 
filesystems, I cannot understand.

Here's an example: I can type Märchen two different ways on my keyboard: I 
can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I 
could press the '¨' key and the 'a' key.

See: I get 'ä' and 'ä' respectively.

And as I send this email off, those characters never *ever* got written as 
filenames to any filesystem. But they *did* get written as part of 
text-files to the disk using "write()", yes.

And according to your *insane* logic, that write() call should have 
converted them to the same representation, no?

Hell no! That conversion has absolutely nothing to do with the filesystem. 
It's done at a totally different layer that actually knows what it is 
doing, and turned them both into \xc3\xa4 (and then, the email client 
probably will turn this into Latin1, and send it out as a single-byte 
'\xe4' character).

See? Putting the conversion in the filesystem IS INSANE. You wouldn't make 
the filesystem convert the characters in the data stream (because it would 
cause strange data conversion issues) AND FOR EXACTLY THE SAME REASON it 
shouldn't do it for filenames either!

And your claim that "you have no control over the normalization of 
characters" is simply insane. Of course you have. It's just not supposed 
to be at the filesystem level - whether it's a write() call or a creat() 
call!

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 20:39         ` Kevin Ballard
  2008-01-16 21:51           ` Jakub Narebski
@ 2008-01-16 23:52           ` Dmitry Potapov
  1 sibling, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-16 23:52 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, Jan 16, 2008 at 03:39:36PM -0500, Kevin Ballard wrote:
> On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:
> 
> >
> >HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
> >filesystems (e.g. case insensitive filesystems) well.
> 
> There's two different ways to do filesystem encodings. One is to have  
> the fs simply not care about encoding, which is what the linux world  
> seems to prefer. 

There is no technical reason for *kernel* to care about file name
encoding. It is something that can be and should be dealt with in
the user space (except some special cases like smbfs).

> Sure, this is great in that what you create the file  
> with is what you get back,

And also because a user space program can deal with it much more
gracefully...

> but on the other hand, given an arbitrary  
> non-ASCII file on disk, you have absolutely no idea what the encoding  
> should be and you can't display it without making assumptions (yes you  
> can use heuristics, but you're still making assumptions).

Wrong. If you have a policy that all file names are stored in UTF-8
encoding then there is no problem here. It should not be a kernel
problem to care about encoding, besides you cannot fully solve it
in the kernel space anyway...

> Filesystems  
> like HFS+ that standardize the encoding,

Yeah, right... Like Microsoft likes to "standardize" everything, which
in practice means forcing on others something fundamentally broken and
that does not follow any existing standard precisely:

===
IMPORTANT:
The terms used in this Q&A, decomposed and precomposed, roughly
correspond to Unicode Normal Forms D and C, respectively. However, most
volume formats do not follow the exact specification for these normal
forms.
===
http://developer.apple.com/qa/qa2001/qa1173.html

Not to mention that the use of decomposed Unicode as the standard is
outright silly -- no sane person writes in "decomposed" Unicode...

> on the other hand, make it  
> such that you always know what the encoding of a file should be, so  
> you can always display and use the filename intelligently.

Somehow I have no problem with displaying non-ASCII names on Linux.
I can see both Unicode Normal Forms C and D encoded symbols without
any problem, though the kernel is completely unaware about them.

> It also  
> means it plays much nicer in a non-ASCII world, since you don't have  
> to worry about different normalizations of a given string referring to  
> different files (it's one thing to be case-sensitive, but claiming  
> that "föo" and "föo" are different files

As you typed them, they both are exactly the same, and both of them are
in the Normal Forms C (which Mac calls as precomposed). So why do you
use one encoding in your writings and the other in your file names?

> just because one uses a  
> composed character and the other doesn't is extremely user- 
> unfriendly). On the other hand, what you create the file with may not  
> be what you read back later, since the name has been standardized.  
> It's hard to say one is better than the other, they're just different  
> ways of doing it. However, I have noticed that everybody who's voiced  
> an opinion on this list in favor of the encoding-agnostic approach  
> seem to be unwilling to accept that any other approach might have  
> validity, to the extent of calling an OS/filesystem that does things  
> different stupid or insane. This strikes me as extremely elitist and  
> risks alienating what I expect to be a fast-growing group of users  
> (i.e. OS X users).

I am sure everyone here is scared to death... I mean we have used to
hear such threats from some MS salespeople, but from a Mac guy? It is
really scare....

Wake up, and stop shooting this nonsense at us. If you have technical
reasons why your solution is better, let us know. So far, you do not
sound very convincing here. Why do think that the issue of encoding can
not be dealt with in the user space? Why does Mac OS X uses so-called
decomposed Unicode, which even does not follow any standard precisely?
Why does Mac OS X chose to decompose characters while it does not
solve any real issue?

> And one area that it has a problem with is the de- 
> facto filesystem on my OS of choice.

I suppose it would be much better a subject for discussion...
At least, it would be more likely to result in that Git working
better on your OS.

> However, attempts to discuss the  
> problem invariable end up with multiple people calling my OS stupid  
> and insane simply because it differs in a particular design decision.  

First, no one called Mac OS X insane, but case insensitive filesystems,
and there are good reasons to think so, because no one has demonstrated
so far any advantage of that approach, but disadvantages are quite 
obvious to anyone -- comparison of a stored file list with readdir()
is much more problematic, and you cannot say that you have solved the
problem with encoding if you force other people to *duplicate* some
logic that Mac OS X does in its kernel just to get things working...
So, no one thinks it is insane because it is different, but because it
requires much more efforts to do the same thing -- compare two file
lists, and this operation is important for Git to work properly...

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:38                   ` Linus Torvalds
@ 2008-01-16 23:57                     ` Pedro Melo
  2008-01-17  0:16                       ` Linus Torvalds
  2008-01-16 23:58                     ` David Kastrup
  2008-01-17  0:09                     ` Kevin Ballard
  2 siblings, 1 reply; 260+ messages in thread
From: Pedro Melo @ 2008-01-16 23:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git


On Jan 16, 2008, at 11:38 PM, Linus Torvalds wrote:
> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>> 	The only way to argue that normalization is wrong is by providing a
>> good reason to preserve the exact byte sequence, and so far the  
>> only reason
>> I've seen is to help git.
>
> Git doesn't care. Just use the *same* sequence everywhere. Make sure
> something doesn't change it. Because if something changes it, git will
> track it.

The problem is that you don't control the sequence that everybody uses.

See this example:

melo@speed(~)$ uname -a
Linux speed.simplicidade.org 2.6.9-55.ELsmp #1 SMP Wed May 2 14:28:44  
EDT 2007 i686 i686 i386 GNU/Linux
melo@speed(~)$ set | grep LANG
LANG=en_US.UTF-8
melo@speed(~)$ mkdir t
melo@speed(~)$ cd t
melo@speed(~/t)$ git init
Initialized empty Git repository in .git/
melo@speed(~/t)$ touch á
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in utf8"
Created initial commit 7a473a2: added a in utf8
  0 files changed, 0 insertions(+), 0 deletions(-)
  create mode 100644 "\303\241"
melo@speed(~/t)$ export LANG=en_US
melo@speed(~/t)$ touch á
melo@speed(~/t)$ ls -la
total 12
drwxrwxr-x   3 melo melo 4096 Jan 16 23:44 .
drwx--x--x  31 melo melo 4096 Jan 16 23:43 ..
-rw-rw-r--   1 melo melo    0 Jan 16 23:44 á
-rw-rw-r--   1 melo melo    0 Jan 16 23:43 Ã¡
drwxrwxr-x   8 melo melo 4096 Jan 16 23:43 .git
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in iso-latin-1"
Created commit 4282fca: OlÃ¡x!
  0 files changed, 0 insertions(+), 0 deletions(-)
  create mode 100644 "\341"

So two (simulated in this test) users who use different LANG settings  
will be in trouble in no time.

What I take from this conversation is that I have to specify, for  
each project I work on, which encoding we should use, across all  
users, before they start using git with files with accented chars.

The difference I see between us is that if I tell my filesystem that  
I want to name my file with a particular string encoded in X, users  
using encoding Y will be able to read it correctly. I  like my  
filesystem to make that work for me.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:38                   ` Linus Torvalds
  2008-01-16 23:57                     ` Pedro Melo
@ 2008-01-16 23:58                     ` David Kastrup
  2008-01-17  0:19                       ` Linus Torvalds
  2008-01-17  0:09                     ` Kevin Ballard
  2 siblings, 1 reply; 260+ messages in thread
From: David Kastrup @ 2008-01-16 23:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>> 
>> There's a difference between "looks similar" as in "Polish" vs "polish", and
>> actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH
>> UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization
>> doesn't. 
>
> That simply isn't true.
>
> Normalization actually has real semantic meaning. If it didn't, there
> would never ever be a reason why you'd use the non-normalized form in
> the first place.

Actually, there is no good reason for non-normalized forms (deficient
software not able to deal with some of the normalized forms is not a
good reason: such software should be fixed).

It is just that the file system is a rather quirky place for enforcing
the normalization.  One should not be able to get unnormalized forms
created easily in the first place, be it command line or script.

> And there *are* cases where there are distinctions. Especially inside
> computers. For one thing, you may not be talking about "characters on
> screen", but you may be talking about "key sequences". And suddenly
> "a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a
> single-key sequence, and THEY ARE DIFFERENT.
>
> See?

No.  Input methods are not the same as their resulting string.  I can
even produce some ASCII characters on my keyboard in more than one way
and would not expect them to lead to different codes.

>> How do you figure? When I type "Märchen", I'm typing a string, not a
>> byte sequence. I have no control over the normalization of the
>> characters.  Therefore, depending on what program I'm typing the name
>> in, I might use the same normalization as the filename, or I might
>> miss. It's completely out of my control. This is why the filesystem
>> has to step in and say "You composed that character differently, but
>> I know you were trying to specify this file".
>
> Pure and utter garbage.
>
> What you are describing is an *input method* issue, not a filesystem
> issue.
>
> The fact that you think this has anything what-so-ever to do with
> filesystems, I cannot understand.

How nice.  We are actually in agreement here.

> See? Putting the conversion in the filesystem IS INSANE. You wouldn't
> make the filesystem convert the characters in the data stream (because
> it would cause strange data conversion issues) AND FOR EXACTLY THE
> SAME REASON it shouldn't do it for filenames either!

Yup.  But that does not mean that normalization is a bad idea.  It is
just that the filesystem is not the right place for it.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:38                   ` Linus Torvalds
  2008-01-16 23:57                     ` Pedro Melo
  2008-01-16 23:58                     ` David Kastrup
@ 2008-01-17  0:09                     ` Kevin Ballard
  2008-01-17  0:25                       ` Linus Torvalds
  2008-01-17  1:16                       ` Linus Torvalds
  2 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17  0:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 6389 bytes --]

On Jan 16, 2008, at 6:38 PM, Linus Torvalds wrote:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>>
>> There's a difference between "looks similar" as in "Polish" vs  
>> "polish", and
>> actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs  
>> "M<A WITH
>> UMLAUT>rchen". Capitalization has a valid semantic meaning,  
>> normalization
>> doesn't.
>
> That simply isn't true.
>
> Normalization actually has real semantic meaning. If it didn't, there
> would never ever be a reason why you'd use the non-normalized form  
> in the
> first place.

My understanding is that normalization is there to help the computer.  
That doesn't give it any semantic meaning, because all normal forms of  
a given string still represent the exact same string to the user.

> Others have argued the exact same thing for capitalization. "A" is the
> same letter as "a". Except there is a distinction.

The argument for case insensitivity is different than the argument for  
normalization. I certainly hope you understand why they are different  
arguments, or there's really no point in going further.

> The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes,  
> it's
> the same "chacter" in either case. Except when there is a distinction.
>
> And there *are* cases where there are distinctions. Especially inside
> computers. For one thing, you may not be talking about "characters on
> screen", but you may be talking about "key sequences". And suddenly
> "a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a
> single-key sequence, and THEY ARE DIFFERENT.
>
> See?
>
> "a" and "A" are the same letter. But sometimes case matters.
>
> Multi-character UTF-8 sequences may be the same character. But  
> sometimes
> the sequence matters.
>
> Same exact thing.

You're right, sometimes the sequence matters. As in key sequences. But  
we're not talking about key sequences, we're talking about strings.  
Just because it matters sometimes doesn't mean it matters all the time.

>> 	The only way to argue that normalization is wrong is by providing a
>> good reason to preserve the exact byte sequence, and so far the  
>> only reason
>> I've seen is to help git.
>
> Git doesn't care. Just use the *same* sequence everywhere. Make sure
> something doesn't change it. Because if something changes it, git will
> track it.

And how am I supposed to use the same sequence everywhere? When I type  
"Märchen", I don't know which form I'm typing, nor should I. It's not  
something that I, as a user, should have to know. Especially if I pass  
this name through various other utilities before using it - I have no  
idea if another utility is going to end up normalizing the name, and  
it shouldn't matter, as they are equivalent strings.

>> How do you figure? When I type "Märchen", I'm typing a string, not  
>> a byte
>> sequence. I have no control over the normalization of the characters.
>> Therefore, depending on what program I'm typing the name in, I  
>> might use the
>> same normalization as the filename, or I might miss. It's  
>> completely out of my
>> control. This is why the filesystem has to step in and say "You  
>> composed that
>> character differently, but I know you were trying to specify this  
>> file".
>
> Pure and utter garbage.
>
> What you are describing is an *input method* issue, not a filesystem
> issue.
>
> The fact that you think this has anything what-so-ever to do with
> filesystems, I cannot understand.
>
> Here's an example: I can type Märchen two different ways on my  
> keyboard: I
> can press the 'ä' key (yes, I have one, I have a Swedish keyboard),  
> or I
> could press the '¨' key and the 'a' key.
>
> See: I get 'ä' and 'ä' respectively.

On a US keyboard I only have one way of typing ä, and I have no idea  
whether it ends up precomposed or decomposed in the resulting byte  
stream. And I don't care. Because I'm typing characters, not bytes. I  
could be typing in a file in ISO-Latin-1 and I still wouldn't care,  
because it looks the same to me. If my filesystem did make a  
distinction between the normal forms, and I see that I have a file  
named "Märchen", how am I supposed to type that at my keyboard? I  
don't know which normal form it's using.

The fact that you think the normalization of the string matters, I  
don't understand.

> And as I send this email off, those characters never *ever* got  
> written as
> filenames to any filesystem. But they *did* get written as part of
> text-files to the disk using "write()", yes.
>
> And according to your *insane* logic, that write() call should have
> converted them to the same representation, no?
>
>
> Hell no! That conversion has absolutely nothing to do with the  
> filesystem.
> It's done at a totally different layer that actually knows what it is
> doing, and turned them both into \xc3\xa4 (and then, the email client
> probably will turn this into Latin1, and send it out as a single-byte
> '\xe4' character).
>
> See? Putting the conversion in the filesystem IS INSANE. You  
> wouldn't make
> the filesystem convert the characters in the data stream (because it  
> would
> cause strange data conversion issues) AND FOR EXACTLY THE SAME  
> REASON it
> shouldn't do it for filenames either!

What a fabulous straw man argument you just put together. I hope you  
don't need me to point out why this argument is fundamentally flawed.

> And your claim that "you have no control over the normalization of
> characters" is simply insane. Of course you have. It's just not  
> supposed
> to be at the filesystem level - whether it's a write() call or a  
> creat()
> call!

I'm speaking as a user, and as such, I shouldn't even have to know  
that it's possible to write the same character in multiple different  
ways. As a user, HFS+ behaves exactly the way I want it to. You were  
talking earlier about not messing with the "user data", but what is  
the "user data"? It's the string, not the byte sequence. That's all I  
care about - the string. That's all the OS cares about, that's all any  
application I use cares about, and that's all git should care about.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:57                     ` Pedro Melo
@ 2008-01-17  0:16                       ` Linus Torvalds
  2008-01-17  0:27                         ` Pedro Melo
  2008-01-18  8:29                         ` Peter Karlsson
  0 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  0:16 UTC (permalink / raw)
  To: Pedro Melo
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git



On Wed, 16 Jan 2008, Pedro Melo wrote:
> 
> The difference I see between us is that if I tell my filesystem that I want to
> name my file with a particular string encoded in X, users using encoding Y
> will be able to read it correctly. I  like my filesystem to make that work for
> me.

The difference I see between us is that when I tell you that this is 
exactly the same thing as your file *contents*, you don't seem to get it.

An OS that silently changes the contents of your files is *crap*.

Get it?

An OS that silently changes the contents of your directories is *crap*.

Get it now?

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 23:58                     ` David Kastrup
@ 2008-01-17  0:19                       ` Linus Torvalds
  0 siblings, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  0:19 UTC (permalink / raw)
  To: David Kastrup
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git



On Thu, 17 Jan 2008, David Kastrup wrote:
> 
> Actually, there is no good reason for non-normalized forms (deficient
> software not able to deal with some of the normalized forms is not a
> good reason: such software should be fixed).

I'd actually agree, and it then boils down to the second sane choice I 
gave earlier:

 - don't accept data you don't like

if you don't like non-normalized names, don't create them. That's fine.

But don't go normalizing them behind the users back.

> Yup.  But that does not mean that normalization is a bad idea.  It is
> just that the filesystem is not the right place for it.

Oh, absolutely. You can - and often should - normalize in the application 
(or have libraries to do it for you). 

Not silently and behind peoples backs.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:09                     ` Kevin Ballard
@ 2008-01-17  0:25                       ` Linus Torvalds
  2008-01-17  0:33                         ` Johannes Schindelin
  2008-01-17  1:16                       ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  0:25 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> My understanding is that normalization is there to help the computer. That
> doesn't give it any semantic meaning, because all normal forms of a given
> string still represent the exact same string to the user.

THAT IS NOT TRUE!

How the hell does the computer know what the string means?

Hint: it does not.

The fact is, the user may use a non-normalized string on purpose. It's not 
your place to say that the user is wrong. Your "undestanding" is simply 
wrong. Two strings are *different* if they are [un]normalized differently.

Really.

The exact same way the word Polish and polish are different, just because 
they are capitalized differently.

> The argument for case insensitivity is different than the argument for
> normalization. I certainly hope you understand why they are different
> arguments, or there's really no point in going further.

You do not understand.

In *order* to do case-insensitivity, you generally need to normalize (and 
do other things too - normalization is just *one* of the things you need 
to do).

So if you are a case-insensitive filesystem, then normalization is sane.

But if you aren't, then there is no reason to normalize.

> You're right, sometimes the sequence matters. As in key sequences. But we're
> not talking about key sequences, we're talking about strings.

You define "string" to be something totally made-up.

In your world "string" means "normalized". BUT IT'S NOT TRUE!

You define normalization to be a property of strings, without any actual 
backing for why that would be.

The fact is, *looks the same* is very very different from *is the same*.

But you seem to be too stupid to undestand the differce.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:16                       ` Linus Torvalds
@ 2008-01-17  0:27                         ` Pedro Melo
  2008-01-17  0:32                           ` David Kastrup
  2008-01-17  0:35                           ` Johannes Schindelin
  2008-01-18  8:29                         ` Peter Karlsson
  1 sibling, 2 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17  0:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git

Hi,

On Jan 17, 2008, at 12:16 AM, Linus Torvalds wrote:
> On Wed, 16 Jan 2008, Pedro Melo wrote:
>>
>> The difference I see between us is that if I tell my filesystem  
>> that I want to
>> name my file with a particular string encoded in X, users using  
>> encoding Y
>> will be able to read it correctly. I  like my filesystem to make  
>> that work for
>> me.
>
> The difference I see between us is that when I tell you that this is
> exactly the same thing as your file *contents*, you don't seem to  
> get it.

I get that you think its the same thing.

What I don't get is why a user should be forced to know what type of  
encoding he and the other users are using on all the layers going  
down to the filesystem. If two users on different systems or in  
different configurations, choose the same unicode string as the name,  
why do we need to make it harder for things to just work out?

The content of the file is sacred, we both agree on that. We disagree  
on the filename, because for me it's more important that equal  
strings, even if encoded to different byte sequences, should be  
treated as the same file.

> An OS that silently changes the contents of your files is *crap*.
>
> Get it?

I was not talking about content of files, those are sacred. I was  
talking about filenames. Those *for me* are not, but are for you. No  
problem, we just have different values: I want my computer to work  
for me, not me working for the computer. I'm willing to accept a file  
system or other layer that normalizes encoding of filenames if that  
makes the end-user life easier, specially in a tool distributed by  
nature.

> An OS that silently changes the contents of your directories is  
> *crap*.
>
> Get it now?

As I said before, we disagree on file meta-data, not on file  
contents. For you, byte in must be the same byte out. For me string  
in must be the same string out.

And as I said in the previous email, what I learned today is that in  
a distributed project using git, and if you need to use accented  
characters, I need to tell all the users to use the same LANG settings.

It's important information, at least for me.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:27                         ` Pedro Melo
@ 2008-01-17  0:32                           ` David Kastrup
  2008-01-17  0:40                             ` Pedro Melo
  2008-01-17  0:35                           ` Johannes Schindelin
  1 sibling, 1 reply; 260+ messages in thread
From: David Kastrup @ 2008-01-17  0:32 UTC (permalink / raw)
  To: Pedro Melo
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git

Pedro Melo <melo@simplicidade.org> writes:

> On Jan 17, 2008, at 12:16 AM, Linus Torvalds wrote:
>> On Wed, 16 Jan 2008, Pedro Melo wrote:
>>>
>>> The difference I see between us is that if I tell my filesystem that
>>> I want to name my file with a particular string encoded in X, users
>>> using encoding Y will be able to read it correctly. I like my
>>> filesystem to make that work for me.
>>
>> The difference I see between us is that when I tell you that this is
>> exactly the same thing as your file *contents*, you don't seem to get
>> it.
>
> I get that you think its the same thing.
>
> What I don't get is why a user should be forced to know what type of
> encoding he and the other users are using on all the layers going down
> to the filesystem. If two users on different systems or in different
> configurations, choose the same unicode string as the name, why do we
> need to make it harder for things to just work out?

If you do the normalization in the right place, things will just work
out.  The file system is not the right place.

> I'm willing to accept a file system or other layer that normalizes
> encoding of filenames if that makes the end-user life easier,
> specially in a tool distributed by nature.

Well, as the issue shows it does not make life for the end-user easier.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:25                       ` Linus Torvalds
@ 2008-01-17  0:33                         ` Johannes Schindelin
  2008-01-17  0:43                           ` Pedro Melo
  2008-01-17  1:06                           ` Linus Torvalds
  0 siblings, 2 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17  0:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Ballard, Jakub Narebski, Mark Junker, git

Hi,

On Wed, 16 Jan 2008, Linus Torvalds wrote:

> So if you are a case-insensitive filesystem, then normalization is sane.

Actually, no.  Even an case-challenged filesystem should keep the 
_original_ name around, if only for the exact same argument you used 
earlier: if the user chooses to capitalise some letters, but not others, 
it is not the filesystem's place to "correct" that.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:27                         ` Pedro Melo
  2008-01-17  0:32                           ` David Kastrup
@ 2008-01-17  0:35                           ` Johannes Schindelin
  2008-01-17  0:45                             ` Pedro Melo
  1 sibling, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17  0:35 UTC (permalink / raw)
  To: Pedro Melo
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski, Mark Junker, git

Hi,

On Thu, 17 Jan 2008, Pedro Melo wrote:

> The content of the file is sacred, we both agree on that. We disagree on 
> the filename, because for me it's more important that equal strings, 
> even if encoded to different byte sequences, should be treated as the 
> same file.

Why should the filename be _stored_ normalised?  I agree on the lookup, 
yes, but not the storage.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:32                           ` David Kastrup
@ 2008-01-17  0:40                             ` Pedro Melo
  2008-01-17  0:54                               ` Wincent Colaiuta
  0 siblings, 1 reply; 260+ messages in thread
From: Pedro Melo @ 2008-01-17  0:40 UTC (permalink / raw)
  To: David Kastrup
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git

Hello,

On Jan 17, 2008, at 12:32 AM, David Kastrup wrote:
> Pedro Melo <melo@simplicidade.org> writes:
>> On Jan 17, 2008, at 12:16 AM, Linus Torvalds wrote:
>>> On Wed, 16 Jan 2008, Pedro Melo wrote:
>>>>
>>>> The difference I see between us is that if I tell my filesystem  
>>>> that
>>>> I want to name my file with a particular string encoded in X, users
>>>> using encoding Y will be able to read it correctly. I like my
>>>> filesystem to make that work for me.
>>>
>>> The difference I see between us is that when I tell you that this is
>>> exactly the same thing as your file *contents*, you don't seem to  
>>> get
>>> it.
>>
>> I get that you think its the same thing.
>>
>> What I don't get is why a user should be forced to know what type of
>> encoding he and the other users are using on all the layers going  
>> down
>> to the filesystem. If two users on different systems or in different
>> configurations, choose the same unicode string as the name, why do we
>> need to make it harder for things to just work out?
>
> If you do the normalization in the right place, things will just work
> out.  The file system is not the right place.

No problem, but don't you think that git should to it?

Don't you think its important in a distributed tool that no matter  
what system they use, be it linux or solaris, they are able to talk  
about a file with non-ascii chars and be the same file to both of them?

That's the point I'm making. The fact that I need to set LANG across  
all users of a project is insane...

>> I'm willing to accept a file system or other layer that normalizes
>> encoding of filenames if that makes the end-user life easier,
>> specially in a tool distributed by nature.
>
> Well, as the issue shows it does not make life for the end-user  
> easier.

I'm assuming you are talking about HFS+ and the strange normalization  
it does.

I'm sorry but that was not the problem I sent. I sent a scenario, in  
which two users, using the same linux system but with different LANG  
settings cannot use git reliably.

Although this thread started because of HFS+ "choices", the problem  
is not really related to HFS+ given that you can have the same issues  
even on the same physical <insert flavor here> POSIX system.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:33                         ` Johannes Schindelin
@ 2008-01-17  0:43                           ` Pedro Melo
  2008-01-17  0:57                             ` Johannes Schindelin
  2008-01-17  1:06                           ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Pedro Melo @ 2008-01-17  0:43 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski, Mark Junker, git

Hi,

On Jan 17, 2008, at 12:33 AM, Johannes Schindelin wrote:

> On Wed, 16 Jan 2008, Linus Torvalds wrote:
>
>> So if you are a case-insensitive filesystem, then normalization is  
>> sane.
>
> Actually, no.  Even an case-challenged filesystem should keep the
> _original_ name around, if only for the exact same argument you used
> earlier: if the user chooses to capitalise some letters, but not  
> others,
> it is not the filesystem's place to "correct" that.

For the record, HFS+ is case-insensitive but case-preserving so I  
believe they keep the original filename around. I don't have the spec  
in front of me, but from memory I believe that this is what they do.

But I think that focusing on HFS+ is loosing sight of the real  
problem. It's not about encoding at the filesystem, but encoding  
inside the git structures.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:35                           ` Johannes Schindelin
@ 2008-01-17  0:45                             ` Pedro Melo
  0 siblings, 0 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17  0:45 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski, Mark Junker, git


On Jan 17, 2008, at 12:35 AM, Johannes Schindelin wrote:
> On Thu, 17 Jan 2008, Pedro Melo wrote:
>
>> The content of the file is sacred, we both agree on that. We  
>> disagree on
>> the filename, because for me it's more important that equal strings,
>> even if encoded to different byte sequences, should be treated as the
>> same file.
>
> Why should the filename be _stored_ normalised?  I agree on the  
> lookup,
> yes, but not the storage.

Personally I don't care how you store it. It's an implementation  
detail, and you should choose the best one for your use cases. If  
that means that you store the original version and a normalized  
version just for lookups, fine.

What I think its important is that if two users use different  
encodings for the same string in a filename, git should treat that as  
the same file.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:40                             ` Pedro Melo
@ 2008-01-17  0:54                               ` Wincent Colaiuta
  2008-01-17  1:08                                 ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17  0:54 UTC (permalink / raw)
  To: Pedro Melo
  Cc: David Kastrup, Linus Torvalds, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git

El 17/1/2008, a las 1:40, Pedro Melo escribió:

> That's the point I'm making. The fact that I need to set LANG across  
> all users of a project is insane...

I don't think I'd call that "insane" (in fact, I think these  
discussions would be much less irritating for all involved if we  
didn't use that word so often, even when it's not called for). It's  
not that different than the whole LF/CRLF line-ending thing.

The real problem is that setting LANG won't help you on Mac OS X; set  
LANG to whatever you want and there is *nothing* that you can do to  
stop your filenames being normalized into decomposed UTF-8, short of  
dropping HFS+. You can use an alternative filesystem, but support for  
basically everything except HFS+ is suboptimal in Mac OS X at the  
moment.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:43                           ` Pedro Melo
@ 2008-01-17  0:57                             ` Johannes Schindelin
  0 siblings, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17  0:57 UTC (permalink / raw)
  To: Pedro Melo
  Cc: Linus Torvalds, Kevin Ballard, Jakub Narebski, Mark Junker, git

Hi,

On Thu, 17 Jan 2008, Pedro Melo wrote:

> On Jan 17, 2008, at 12:33 AM, Johannes Schindelin wrote:
> 
> > On Wed, 16 Jan 2008, Linus Torvalds wrote:
> > 
> > > So if you are a case-insensitive filesystem, then normalization is 
> > > sane.
> > 
> > Actually, no.  Even an case-challenged filesystem should keep the 
> > _original_ name around, if only for the exact same argument you used 
> > earlier: if the user chooses to capitalise some letters, but not 
> > others, it is not the filesystem's place to "correct" that.
> 
> For the record, HFS+ is case-insensitive but case-preserving so I 
> believe they keep the original filename around.

For the record, that's only the default setting.  AFAIK you can configure 
it to care about case, too.

Also for the record, the whole thread was about HFS+ _not_ keeping the 
original filename around, but _only_ a normalised version of it.

> But I think that focusing on HFS+ is loosing sight of the real problem. 
> It's not about encoding at the filesystem, but encoding inside the git 
> structures.

So far I have not seen anyone talking _seriously_ about this issue.  Only 
a few shouts "you should support", and a few shouts back "I don't care 
about insane filesystems".

Therefore, I fully agree with you that we're losing sight of the real 
problem.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:33                         ` Johannes Schindelin
  2008-01-17  0:43                           ` Pedro Melo
@ 2008-01-17  1:06                           ` Linus Torvalds
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  1:06 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kevin Ballard, Jakub Narebski, Mark Junker, git



On Thu, 17 Jan 2008, Johannes Schindelin wrote:
> 
> On Wed, 16 Jan 2008, Linus Torvalds wrote:
> 
> > So if you are a case-insensitive filesystem, then normalization is sane.
> 
> Actually, no.  Even an case-challenged filesystem should keep the 
> _original_ name around

You're right. The normalization only really needs to happen as part of the 
name comparison itself.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:54                               ` Wincent Colaiuta
@ 2008-01-17  1:08                                 ` Johannes Schindelin
  2008-01-17  1:41                                   ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17  1:08 UTC (permalink / raw)
  To: Wincent Colaiuta
  Cc: Pedro Melo, David Kastrup, Linus Torvalds, Kevin Ballard,
	Jakub Narebski, Mark Junker, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 456 bytes --]

Hi,

On Thu, 17 Jan 2008, Wincent Colaiuta wrote:

> El 17/1/2008, a las 1:40, Pedro Melo escribió:
> 
> > That's the point I'm making. The fact that I need to set LANG across 
> > all users of a project is insane...

FWIW if you use another filesystem, such as reiserfs or ext[2-4], the 
filenames will be _unaffected_ by your particular setting of LANG.  They 
will be stored byte-wise exactly like asked for.  That's why I call them 
"sane".

Hth,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:09                     ` Kevin Ballard
  2008-01-17  0:25                       ` Linus Torvalds
@ 2008-01-17  1:16                       ` Linus Torvalds
  2008-01-17  3:52                         ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  1:16 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Jakub Narebski, Johannes Schindelin, Mark Junker, git

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> I'm speaking as a user, and as such, I shouldn't even have to know that it's
> possible to write the same character in multiple different ways.

The thing is, you seem to argue that what OS X does helps you as the user.

But you are arguing based on incorrect assumptions.

First off, we've had years and years and years of usage of non-corrupting 
filesystems (pretty much every UNIX OS around since day 1, and many other 
OS's too), and it's simply not true that it's a problem. You see the 
filename in the file dialog, and you open it, and you're done. OS X isn't 
any "easier" in this regard.

In fact, this whole thread comes from the fact that the OS X choice that 
you *think* is easier, is in fact not easier at all. It's not easier for 
the user, it's not easier for the application programmer, and the really 
sad part is that it's very much *not* easier for OS X itself either (ie 
they had to literally write extra code with nasty tables to do it, and it 
really does hurt them in performance and complexity).

And _that_ is why the OS X situation is so sad. Apple literally added 
extra code to make things slower and more complex *and* harder to use 
reliably.

Does it show up in normal behaviour? Of course not. You'd probably never 
see it in real life outside of test-suites. People simply don't even tend 
to use filenames outside of US-ASCII, and when they do use them, input 
methods really *do* tend to do the normalization for you.

But when it comes to automation (which is what computers are all about), 
the OS X choice is literally the wrong one. And there's no _upside_. It's 
all downside. Which is why it's so stupid.

I bet it only exists because OS X engineers didn't really even think about 
it, and they just assumed that "normalization is helpful". They took your 
stance - thinking it was worth it, without ever really thinking it 
through.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  1:08                                 ` Johannes Schindelin
@ 2008-01-17  1:41                                   ` Linus Torvalds
  2008-01-17  4:07                                     ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  1:41 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Wincent Colaiuta, Pedro Melo, David Kastrup, Kevin Ballard,
	Jakub Narebski, Mark Junker, git

On Thu, 17 Jan 2008, Johannes Schindelin wrote:
> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
> 
> > El 17/1/2008, a las 1:40, Pedro Melo escribió:
> > 
> > > That's the point I'm making. The fact that I need to set LANG across 
> > > all users of a project is insane...
> 
> FWIW if you use another filesystem, such as reiserfs or ext[2-4], the 
> filenames will be _unaffected_ by your particular setting of LANG.  They 
> will be stored byte-wise exactly like asked for.  That's why I call them 
> "sane".

One of the advantages (the biggest one, in fact, apart from the obvious 
US-ASCII down-compatibility and the fact that you can do C-compatible 
NUL-terminated strings) of UTF-8 is that it's locale-independent, and 
doesn't care about LANG, because it's valid in all languages.

And that's really important. It's important for a very simple reason: 
there is almost never such a thing as "a locale" except for US-ASCII. Once 
you move away from US-ASCII, it actually tends to be much more common that 
you have a *mixture* of locales - often in the same "document" - than to 
have one single locale.

It very much happens even in filenames - people "mix" locales in trivial 
ways even within a single pathname component (non-US-ASCII filename, but 
with a regular file extension), but much more interestingly they do so 
within a directory tree (ie you have have translation subdirectories where 
the filenames themselves are in another language, and you can have full 
pathnames where different components are in different languages, for 
example).

And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and 
cannot matter, and thus mixing isn't a problem.

Of course, you can screw it up. Locales still can change things like sort 
order and capitalization etc, so even if you use UTF-8, you sure can get 
into trouble with LANG and thinking that a per-session locale makes sense.

So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine 
choice, and has no issues with LANG in itself. Limiting it to strictly 
valid UTF-8 encodings is also fine. Limiting it (further) to only 
character normalized UTF-8 is also fine.

Most Linux filesystems don't limit it in any way, so you can make 
filenames that aren't valid UTF-8 at all, much less normalizing 
multi-character sequences.

I personally think that's the best option, but I probably do so mostly 
because I know some people still use Latin1 as their only locale (and I 
suspect Asia will take decades before it has converted to UTF-8 and will 
also have cases where they use other non-UTF locales).

But enforcing clean UTF-8 is not a bad idea per se. Not allowing byte 
sequences that aren't a valid UTF-8 encoding (eg \xc0\xc0 is not a valid 
UTF-8 character) is fine.

I wouldn't call people crazy for doing that, although it does mean that 
you cannot, for example, decide to write a Latin1 filename (which is not 
necessarily a *good* idea in this day and age, but I think there's a 
difference between "that's not a good idea" and "you cannot do that").

And even limiting the UTF-8 charset further to only the minimal 
representation of one particular glyph (ie not allowing multi-character 
sequences that can be represented more simply) may be even *more* 
big-brother, but would at least not cause the technical aliasing issues. I 
personally think that's so controlling as to be stupid (and has no real 
advantage), but hey, at least it doesn't *corrupt* anything silently.

So I think that using UTF-8 as a character encoding is a *good* thing to 
do, and that automatically means that LANG shouldn't matter for filenames, 
but within that choice of UTF-8 there are still mistakes that you can 
make. Notably multi-character normalization and case-insensitivity.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  1:16                       ` Linus Torvalds
@ 2008-01-17  3:52                         ` Kevin Ballard
  2008-01-17  4:08                           ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17  3:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

On Jan 16, 2008, at 8:16 PM, Linus Torvalds <torvalds@linux-foundation.org 
 > wrote:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>>
>> I'm speaking as a user, and as such, I shouldn't even have to know  
>> that it's
>> possible to write the same character in multiple different ways.
>
> The thing is, you seem to argue that what OS X does helps you as the  
> user.
>
> But you are arguing based on incorrect assumptions.
>
> First off, we've had years and years and years of usage of non- 
> corrupting
> filesystems (pretty much every UNIX OS around since day 1, and many  
> other
> OS's too), and it's simply not true that it's a problem. You see the
> filename in the file dialog, and you open it, and you're done. OS X  
> isn't
> any "easier" in this regard.
>
> In fact, this whole thread comes from the fact that the OS X choice  
> that
> you *think* is easier, is in fact not easier at all. It's not easier  
> for
> the user, it's not easier for the application programmer, and the  
> really
> sad part is that it's very much *not* easier for OS X itself either  
> (ie
> they had to literally write extra code with nasty tables to do it,  
> and it
> really does hurt them in performance and complexity).
>
> And _that_ is why the OS X situation is so sad. Apple literally added
> extra code to make things slower and more complex *and* harder to use
> reliably.
>
> Does it show up in normal behaviour? Of course not. You'd probably  
> never
> see it in real life outside of test-suites. People simply don't even  
> tend
> to use filenames outside of US-ASCII, and when they do use them, input
> methods really *do* tend to do the normalization for you.
>
> But when it comes to automation (which is what computers are all  
> about),
> the OS X choice is literally the wrong one. And there's no _upside_.  
> It's
> all downside. Which is why it's so stupid.
>
> I bet it only exists because OS X engineers didn't really even think  
> about
> it, and they just assumed that "normalization is helpful". They took  
> your
> stance - thinking it was worth it, without ever really thinking it
> through.
>
>            Linus
>

I believe it exists because HFS+ was created at a time when the Mac  
was moving from a multi-encoding world (which was a nightmare) to a  
Unicode world and they wanted to remove ambiguity in filenames. But I  
wasn't around when they made this decision so this is just a guess.

-Kevin Ballard

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  1:41                                   ` Linus Torvalds
@ 2008-01-17  4:07                                     ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17  4:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Wincent Colaiuta, Pedro Melo, David Kastrup,
	Jakub Narebski, Mark Junker, git

[-- Attachment #1: Type: text/plain, Size: 4941 bytes --]

On Jan 16, 2008, at 8:41 PM, Linus Torvalds wrote:

> On Thu, 17 Jan 2008, Johannes Schindelin wrote:
>> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>>
>>> El 17/1/2008, a las 1:40, Pedro Melo escribió:
>>>
>>>> That's the point I'm making. The fact that I need to set LANG  
>>>> across
>>>> all users of a project is insane...
>>
>> FWIW if you use another filesystem, such as reiserfs or ext[2-4], the
>> filenames will be _unaffected_ by your particular setting of LANG.   
>> They
>> will be stored byte-wise exactly like asked for.  That's why I call  
>> them
>> "sane".
>
> One of the advantages (the biggest one, in fact, apart from the  
> obvious
> US-ASCII down-compatibility and the fact that you can do C-compatible
> NUL-terminated strings) of UTF-8 is that it's locale-independent, and
> doesn't care about LANG, because it's valid in all languages.
>
> And that's really important. It's important for a very simple reason:
> there is almost never such a thing as "a locale" except for US- 
> ASCII. Once
> you move away from US-ASCII, it actually tends to be much more  
> common that
> you have a *mixture* of locales - often in the same "document" -  
> than to
> have one single locale.
>
> It very much happens even in filenames - people "mix" locales in  
> trivial
> ways even within a single pathname component (non-US-ASCII filename,  
> but
> with a regular file extension), but much more interestingly they do so
> within a directory tree (ie you have have translation subdirectories  
> where
> the filenames themselves are in another language, and you can have  
> full
> pathnames where different components are in different languages, for
> example).
>
> And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and
> cannot matter, and thus mixing isn't a problem.
>
> Of course, you can screw it up. Locales still can change things like  
> sort
> order and capitalization etc, so even if you use UTF-8, you sure can  
> get
> into trouble with LANG and thinking that a per-session locale makes  
> sense.
>
> So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine
> choice, and has no issues with LANG in itself. Limiting it to strictly
> valid UTF-8 encodings is also fine. Limiting it (further) to only
> character normalized UTF-8 is also fine.
>
> Most Linux filesystems don't limit it in any way, so you can make
> filenames that aren't valid UTF-8 at all, much less normalizing
> multi-character sequences.
>
> I personally think that's the best option, but I probably do so mostly
> because I know some people still use Latin1 as their only locale  
> (and I
> suspect Asia will take decades before it has converted to UTF-8 and  
> will
> also have cases where they use other non-UTF locales).
>
> But enforcing clean UTF-8 is not a bad idea per se. Not allowing byte
> sequences that aren't a valid UTF-8 encoding (eg \xc0\xc0 is not a  
> valid
> UTF-8 character) is fine.
>
> I wouldn't call people crazy for doing that, although it does mean  
> that
> you cannot, for example, decide to write a Latin1 filename (which is  
> not
> necessarily a *good* idea in this day and age, but I think there's a
> difference between "that's not a good idea" and "you cannot do that").
>
> And even limiting the UTF-8 charset further to only the minimal
> representation of one particular glyph (ie not allowing multi- 
> character
> sequences that can be represented more simply) may be even *more*
> big-brother, but would at least not cause the technical aliasing  
> issues. I
> personally think that's so controlling as to be stupid (and has no  
> real
> advantage), but hey, at least it doesn't *corrupt* anything silently.
>
> So I think that using UTF-8 as a character encoding is a *good*  
> thing to
> do, and that automatically means that LANG shouldn't matter for  
> filenames,
> but within that choice of UTF-8 there are still mistakes that you can
> make. Notably multi-character normalization and case-insensitivity.
>
> 			Linus

Alright, you've made your point, and I'm willing to concede at least  
some of what you've said. So perhaps we can now move onto the more  
relevant and practical issue of: HFS+, despite how stupid it may or  
may not be, normalizes filenames (and is case-insensitive, which is a  
related issue). This causes a problem with git. How can this be solved?

I'm more than willing to do work to solve it, my biggest issue is I  
don't believe I actually have the free time to learn the git internals  
well enough to actually do proper work on what I would assume is a  
fairly performance-critical section of git's code. However, I would be  
happy to work with others who are perhaps more knowledgeable in this  
area.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  3:52                         ` Kevin Ballard
@ 2008-01-17  4:08                           ` Linus Torvalds
  2008-01-17  4:30                             ` Kevin Ballard
  2008-01-17 10:08                             ` Wincent Colaiuta
  0 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  4:08 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> I believe it exists because HFS+ was created at a time when the Mac was moving
> from a multi-encoding world (which was a nightmare) to a Unicode world and
> they wanted to remove ambiguity in filenames. But I wasn't around when they
> made this decision so this is just a guess.

I do agree. And I think starting out case-insensitive (something they must 
really hate by now) also made it less of an issue. When you're 
case-insensitive, the issues with any UTF-8 normalization are simply 
swamped by all the issues of case, so you probably don't even think about 
it very much.

The big problem with any name rewriting is that I can open file 'xyz', and 
I literally have a very hard time knowing whether that file I know I 
opened and created has anything to do with the file 'Xyz' that I see when 
I do a readdir().

Are they the same? Maybe. But it's literally hard to tell on OS X. I can 
do an fstat() on my file descriptor and on the directory entry, and if 
they get the same d_ino they *probably are the same entry, but even then 
it actually could have been a hardlink (and my 'xyz' is really *another* 
name for it entirely, and the filesystem is actually case-sensitive and 
'Xyz' was a *different* name that somebody else did!).

See? If you're creating a content tracker, these kinds of issues are not 
"idle chatter". It's really *really* important. Was that file the one I 
was told to track? Or was it a temporary file that was just hardlinked? 

This is why case-insensitivity is so hard: you have a very real "aliasing" 
on the filesystem level, where all those really *different* pathnames end 
up being the same thing.

And all the same issues show up with utf-8 rewriting, so if you normalize 
utf-8 names, you actually end up having almost all the same problems that 
a case-insensitive filesystem has. They're just much rarer in practice, so 
you just won't hit them as often - but when you do, they are equally 
painful!

(In fact, they can be a whole lot *more* painful, because now they are 
really rare, and really confusing when they happen!)

But if you come from a case-insensitive background, all the UTF-8 
rewriting really looks like such a small problem compared to all the 
horrid problems that you had with different locales and cases, so I 
suspect they didn't even realize what a big mistake they did!

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:08                           ` Linus Torvalds
@ 2008-01-17  4:30                             ` Kevin Ballard
  2008-01-17  4:51                               ` Martin Langhoff
  2008-01-17 10:08                             ` Wincent Colaiuta
  1 sibling, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17  4:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4645 bytes --]

On Jan 16, 2008, at 11:08 PM, Linus Torvalds wrote:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>>
>> I believe it exists because HFS+ was created at a time when the Mac  
>> was moving
>> from a multi-encoding world (which was a nightmare) to a Unicode  
>> world and
>> they wanted to remove ambiguity in filenames. But I wasn't around  
>> when they
>> made this decision so this is just a guess.
>
> I do agree. And I think starting out case-insensitive (something  
> they must
> really hate by now) also made it less of an issue. When you're
> case-insensitive, the issues with any UTF-8 normalization are simply
> swamped by all the issues of case, so you probably don't even think  
> about
> it very much.

Those of us who grew up on a case-insensitive filesystem don't find  
there to be any problem with it. I can count on one hand the number of  
times I've run into a problem caused by a case-insensitive filesystem.  
That number is 1. And that 1 time is when git screwed up trying to  
track CS4536 and cs4536 in the same directory (see earlier thread).

> The big problem with any name rewriting is that I can open file  
> 'xyz', and
> I literally have a very hard time knowing whether that file I know I
> opened and created has anything to do with the file 'Xyz' that I see  
> when
> I do a readdir().

That's only true if you don't know what type of filesystem you're on.  
And, in the vast majority of cases (in fact, a content tracker is the  
only exception I can think of), it doesn't matter. If the user said  
'xyz' and you can stat() it, great, that's what the user wanted! Just  
because it's really called 'Xyz' on the filesystem doesn't make any  
difference.

> Are they the same? Maybe. But it's literally hard to tell on OS X. I  
> can
> do an fstat() on my file descriptor and on the directory entry, and if
> they get the same d_ino they *probably are the same entry, but even  
> then
> it actually could have been a hardlink (and my 'xyz' is really  
> *another*
> name for it entirely, and the filesystem is actually case-sensitive  
> and
> 'Xyz' was a *different* name that somebody else did!).
>
> See? If you're creating a content tracker, these kinds of issues are  
> not
> "idle chatter". It's really *really* important. Was that file the  
> one I
> was told to track? Or was it a temporary file that was just  
> hardlinked?

But git is a content tracker, so even if it's really a different  
hardlink that shouldn't matter, it's still referencing the same  
content. Go ahead and track whatever name the user specified  
originally, as long as it maps to a file on disk with the expected  
content you're set. If the file is really called 'foo' and I told git  
to track 'Foo', I'm perfectly happy with it continuing to think 'foo'  
is 'Foo' until I use 'git mv Foo foo'.

> This is why case-insensitivity is so hard: you have a very real  
> "aliasing"
> on the filesystem level, where all those really *different*  
> pathnames end
> up being the same thing.

I don't see that as being a problem. Think of it, if you will, as if  
every single file simply had an implicit hardlink for every possible  
case or normalization variant. The whole point of the filename is that  
it is meta-information, used as an identifier and not as actual  
content, and thus it is perfectly fine for it to be a real string,  
subject to interpretation, rather than treated as a sacred binary blob  
like content is. The whole purpose of the name is to identify the  
inode in question, and case and normalization aren't particularly  
relevant here. As long as we can identify the file, we're happy.

> And all the same issues show up with utf-8 rewriting, so if you  
> normalize
> utf-8 names, you actually end up having almost all the same problems  
> that
> a case-insensitive filesystem has. They're just much rarer in  
> practice, so
> you just won't hit them as often - but when you do, they are equally
> painful!
>
> (In fact, they can be a whole lot *more* painful, because now they are
> really rare, and really confusing when they happen!)
>
> But if you come from a case-insensitive background, all the UTF-8
> rewriting really looks like such a small problem compared to all the
> horrid problems that you had with different locales and cases, so I
> suspect they didn't even realize what a big mistake they did!

Again, as someone who grew up in a case-insensitive world, there's no  
problems here. I wish I could tell you that it causes problems, I wish  
I could agree with you, but I can't.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:17 git on MacOSX and files with decomposed utf-8 file names Mark Junker
  2008-01-16 15:34 ` Johannes Schindelin
@ 2008-01-17  4:43 ` Jay Soffian
  2008-01-17  4:59   ` Jay Soffian
  2008-01-17  5:11   ` Linus Torvalds
  1 sibling, 2 replies; 260+ messages in thread
From: Jay Soffian @ 2008-01-17  4:43 UTC (permalink / raw)
  To: git

FWIW, here's Sun's take on the issue of filesystems and i18n:

http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf

j.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:30                             ` Kevin Ballard
@ 2008-01-17  4:51                               ` Martin Langhoff
  2008-01-17  5:23                                 ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-17  4:51 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

On Jan 17, 2008 5:30 PM, Kevin Ballard <kevin@sb.org> wrote:
> Those of us who grew up on a case-insensitive filesystem don't find
> there to be any problem with it. I can count on one hand the number of

I guess you haven't used unix tools much. The ever-popular HEAD perl
utility (which does an HTTP HEAD against a URL), when installed,
silently overwrites the head shell utility, which is used for all
sorts of things, some even in startup scripts. Ooops! I've been hit by
this more than once - and if you google for it, it hurt a lot of
people.

> That's only true if you don't know what type of filesystem you're on.
> And, in the vast majority of cases (in fact, a content tracker is the
> only exception I can think of), it doesn't matter. If the user said

Hmmm. Many important tools - that I wouldn't want to ever fail! - have
similar needs to git. Backup/restore and file replication tools for
example.

> > This is why case-insensitivity is so hard: you have a very real
> > "aliasing"
> > on the filesystem level, where all those really *different*
> > pathnames end
> > up being the same thing.
>
> I don't see that as being a problem. Think of it, if you will, as if
> every single file simply had an implicit hardlink for every possible
> case or normalization variant. The whole point of the filename is that

Ok - but how do you track the directory then (in git's terms, the
tree). There's no way to tell what the user wants. Does the user want
a copy of the file with different capitalization, or is the OS playing
games?
> it is meta-information, used as an identifier and not as actual
> content, and thus it is perfectly fine for it to be a real string,
> subject to interpretation,

I don't think you *actually* want it subject to interpretation.

> Again, as someone who grew up in a case-insensitive world, there's no
> problems here. I wish I could tell you that it causes problems, I wish
> I could agree with you, but I can't.

Probably because you have been surrounded by tools that have a lot of
extra code to cope with the case insensitive way of life, and learned
to not do things that are completely valid, just to avoid trouble.
Which is ok, but I don't think it makes the OS design decision
defensible.

cheers,


m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:43 ` Jay Soffian
@ 2008-01-17  4:59   ` Jay Soffian
  2008-01-17  5:15     ` Junio C Hamano
  2008-01-17  5:11   ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Jay Soffian @ 2008-01-17  4:59 UTC (permalink / raw)
  To: git

So here's what I can see as being useful additions to git:

* Allowing a repo to be *optionally* configured to disallow two files
in a directory that can cause aliasing problems, with options for
unicode normalization aliasing and/or case-insensitivity aliasing. Can
this already be done via hooks and someone just needs to write the
appropriate hooks?

* Having git warn during checkout if there are files which alias in
the working copy filesystem. I guess it might be interesting if there
were a mechanism in this situation for telling git which of the
aliases you want checked out, though that doesn't seem like a very
good feature.

Thoughts (besides "patches welcomed")?

j.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:43 ` Jay Soffian
  2008-01-17  4:59   ` Jay Soffian
@ 2008-01-17  5:11   ` Linus Torvalds
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17  5:11 UTC (permalink / raw)
  To: Jay Soffian; +Cc: git

On Wed, 16 Jan 2008, Jay Soffian wrote:
>
> FWIW, here's Sun's take on the issue of filesystems and i18n:

Pretty sane, from a quick read-through, although most of it seems to not 
be about general issue, as about "let's emulate others correctly on their 
filesystems" (ie the rules are different for NTFS and HFS+, little enough 
discussion about "native" preferred logic).

However, while they don't consider normalization on file creates to be the 
"preferred solution", they *do* consider filename comparison with 
canonical equivalence to be that. Which means that you can get the same 
odd problems:

	fd = open(filename, O_CREAT);
	+
	readdir()

can actually return a *different* filename than the one we just created, 
if it already existed in the directory under the different normalization.

So it's basically "normalization-preserving, but normalization-ignoring" 
(the same way many filesystems are case-preserving, but case-ignoring). I 
don't much like it either, but as with case, the "preserving" behaviour is 
probably the nicer one.

I'd guess the problems are harder to trigger in practice, but you can 
still get some pretty hairy cases. It's just painful when readdir() and 
your own file creation doesn't have any obvious 1:1 relationship.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:59   ` Jay Soffian
@ 2008-01-17  5:15     ` Junio C Hamano
  2008-01-17 10:28       ` Wincent Colaiuta
  0 siblings, 1 reply; 260+ messages in thread
From: Junio C Hamano @ 2008-01-17  5:15 UTC (permalink / raw)
  To: Jay Soffian; +Cc: git

"Jay Soffian" <jaysoffian+git@gmail.com> writes:

> So here's what I can see as being useful additions to git:
> ...
> Thoughts (besides "patches welcomed")?

I think we already discussed a plan to store normalization
mapping in the index extension section and use it to avoid
getting confused by readdir(3) that lies to us.  Is there any
more thing that need to be discussed?

I would presume that we would still add _new_ paths using the
pathname we receive from the user (there is no need for us to be
similarly insane as broken "normalizing" filesystems), but when
deciding if a path is new or we already have it in the index
would be done by seeing if an entry already exists in the index
whose "normalized" form is the same as the "normalized" form of
the given path --- that way we would not add two paths to the
index that would "normalize" to the same string.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:51                               ` Martin Langhoff
@ 2008-01-17  5:23                                 ` Kevin Ballard
  2008-01-17  6:13                                   ` Geert Bosch
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17  5:23 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 6634 bytes --]

On Jan 16, 2008, at 11:51 PM, Martin Langhoff wrote:

> On Jan 17, 2008 5:30 PM, Kevin Ballard <kevin@sb.org> wrote:
>> Those of us who grew up on a case-insensitive filesystem don't find
>> there to be any problem with it. I can count on one hand the number  
>> of
>
> I guess you haven't used unix tools much. The ever-popular HEAD perl
> utility (which does an HTTP HEAD against a URL), when installed,
> silently overwrites the head shell utility, which is used for all
> sorts of things, some even in startup scripts. Ooops! I've been hit by
> this more than once - and if you google for it, it hurt a lot of
> people.

I can imagine. However, I've never been hit by such a situation. This  
doesn't mean a case-insensitive filesystem is a problem per se, it  
means interactions between a case-insensitive and a case-sensitive  
filesystem can be a problem. That doesn't mean either way is "correct"  
it just means both don't work well together.

I like ice cream, and I like steak, but I sure don't think a mixture  
of steak and ice cream would go well together. Do you?

>> That's only true if you don't know what type of filesystem you're on.
>> And, in the vast majority of cases (in fact, a content tracker is the
>> only exception I can think of), it doesn't matter. If the user said
>
> Hmmm. Many important tools - that I wouldn't want to ever fail! - have
> similar needs to git. Backup/restore and file replication tools for
> example.

Both of which would be replicating the directory contents, not a  
listing of files specified by the user. If, as a user, I were to say  
"please replicate file FOO" and the file was really called "foo", I  
wouldn't be in the least surprised to see the tool take me at my word  
and produce a file called "FOO" with the contents of "foo". But in  
general, things like this operate on the filesystem, not on the user  
args.

>>> This is why case-insensitivity is so hard: you have a very real
>>> "aliasing"
>>> on the filesystem level, where all those really *different*
>>> pathnames end
>>> up being the same thing.
>>
>> I don't see that as being a problem. Think of it, if you will, as if
>> every single file simply had an implicit hardlink for every possible
>> case or normalization variant. The whole point of the filename is  
>> that
>
> Ok - but how do you track the directory then (in git's terms, the
> tree). There's no way to tell what the user wants. Does the user want
> a copy of the file with different capitalization, or is the OS playing
> games?

If I say "track FOO", I probably mean it. So go ahead and track "FOO",  
even if you end up tracking the contents of file "foo". I certainly  
won't blame the tool for doing what I told it.

>> it is meta-information, used as an identifier and not as actual
>> content, and thus it is perfectly fine for it to be a real string,
>> subject to interpretation,
>
> I don't think you *actually* want it subject to interpretation.

Sure I do. I find it  very convenient, for example, to say "cd  
documents/school" when I really want to go to "Documents/School".  
Similarly, if I'm trying to reference gitweb/tests/Märchen, I'm quite  
happy to not have to figure out what normalization the filename is  
using and attempt to replicate that (especially as I have no idea  
which normalization my input mechanism uses - unlike Linus, I don't  
have a key dedicated to ä, and even if I did I wouldn't necessarily  
expect it to use precomposed vs decomposed). I can't think of a single  
reason why I'd want to be able to have 2 different files named  
"Märchen" on my disk. On the other hand, treating unicode  
normalization as significant can pose security risks - how am I to  
know that the file that is named "foo.txt" is really the same file  
"foo.txt" that I last saw? Someone I know on IRC sent me this  
image[1], which shows 6 files all apparently named "foo.txt" on a disk  
image. This is possible because on a case-sensitive HFS+ volume, the  
file system doesn't ignore ignorables when comparing filenames (it  
does on a case-insensitive HFS+ system), and so all of those filenames  
look identical up until you actually pipe their names through xxd and  
look at the byte sequence. When this sort of tomfoolery is possible, I  
simply cannot trust the names of any of my files anymore.

[1]: http://sailor月.com/imgs/ignorable.png

>> Again, as someone who grew up in a case-insensitive world, there's no
>> problems here. I wish I could tell you that it causes problems, I  
>> wish
>> I could agree with you, but I can't.
>
> Probably because you have been surrounded by tools that have a lot of
> extra code to cope with the case insensitive way of life, and learned
> to not do things that are completely valid, just to avoid trouble.
> Which is ok, but I don't think it makes the OS design decision

Extra code? I don't think so. The only reason I'd need extra code is  
if I were attempting to explicitly detect the "real" filename for a  
user-supplied argument, by scanning the directory contents until I  
found a file that was equivalent to the given argument. But there's no  
reason to do that. None of the code I've ever written, or any of the  
code I've ever seen, has had to do any extra work because it was on a  
case-insensitive filesystem. I contribute to a packaging system for  
the Mac called MacPorts, and I've never seen any patches on any of the  
4000+ ports to handle case insensitivity (granted, I haven't looked at  
every port, but I've looked at a significant fraction). It's a  
complete non-issue.

The content of files is sacred. The filename is only there to provide  
a handle to locate the contents. I don't see any problem with  
expanding the equivalency scope of the filename to accept multiple  
encodings and cases. The only arguments I can see that have any  
validity at all are the ones that sound like "we use case-sensitive  
filesystems, and your case-insensitivity and normalization are causing  
problems with our tools! Conform to our world!". As I said above, this  
isn't a problem of case-insensitivity or normalization, it's a problem  
of interaction between two incompatible viewpoints. All I want to do  
is make git play nicer in an HFS+ world, and this would be far easier  
if you guys were willing to admit this is a problem that should be  
solved in the tool rather than a problem with the system.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  5:23                                 ` Kevin Ballard
@ 2008-01-17  6:13                                   ` Geert Bosch
  2008-01-17  7:11                                     ` Mitch Tishmack
  2008-01-17 14:02                                     ` Andrew Heybey
  0 siblings, 2 replies; 260+ messages in thread
From: Geert Bosch @ 2008-01-17  6:13 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Martin Langhoff, Linus Torvalds, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git@vger.kernel.org

For those on Mac OS X: it is possible to create a case-sensitive HFS+  
partition and
use it with git. You even can just create a disk image and mount it.  
However,
I wouldn't quite try to use it as startup filesystem...

   -Geert

PS. I'm working on a proposal/patch for addressing the UFS/case  
sensitivity issues.
     Will try to mail something later this week.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  6:13                                   ` Geert Bosch
@ 2008-01-17  7:11                                     ` Mitch Tishmack
  2008-01-17 10:22                                       ` Wincent Colaiuta
  2008-01-17 14:02                                     ` Andrew Heybey
  1 sibling, 1 reply; 260+ messages in thread
From: Mitch Tishmack @ 2008-01-17  7:11 UTC (permalink / raw)
  To: git

I was going to post this earlier, but wanted to search the archives  
first. Here are the commands assuming you don't want to or can't  
partition a drive and format as ufs (I don't care for HFS+ much). I  
can't believe I didn't find the command in the git list archives, so  
voilà:

$ hdiutil create -size 300m -fs UFS foo.dmg
...............................................................................
created: /Users/mitch/foo.dmg
$ hdiutil attach foo.dmg
/dev/disk2          	GUID_partition_scheme          	
/dev/disk2s1        	Apple_UFS                      	/Volumes/untitled
$ cd /Volumes/untitled && git clone git://git.kernel.org/pub/scm/git/ 
git.git
... snipped ...
$ cd git && git status
# On branch master
nothing to commit (working directory clean)

After git clone in HFS+ land...
$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/MaÌˆrchen
nothing added to commit but untracked files present (use "git add" to  
track)

Should I just add this to the wiki? Then we can all go back to  
ignoring the insane filesystems.

Mitch

On Jan 17, 2008, at 12:13 AM, Geert Bosch wrote:

> For those on Mac OS X: it is possible to create a case-sensitive HFS 
> + partition and
> use it with git. You even can just create a disk image and mount it.  
> However,
> I wouldn't quite try to use it as startup filesystem...
>
>  -Geert
>
> PS. I'm working on a proposal/patch for addressing the UFS/case  
> sensitivity issues.
>    Will try to mail something later this week.
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-16 15:43   ` Kevin Ballard
  2008-01-16 16:32     ` Johannes Schindelin
  2008-01-16 23:03     ` Wincent Colaiuta
@ 2008-01-17  7:29     ` Miles Bader
  2 siblings, 0 replies; 260+ messages in thread
From: Miles Bader @ 2008-01-17  7:29 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, Mark Junker, git

Kevin Ballard <kevin@sb.org> writes:
> More like, Mac OS X has standardized on Unicode and the rest of the
> world hasn't caught up yet. Git is the only tool I've ever heard of
> that has a problem with OS X using Unicode.

Apple's decision[*] to use _decomposed_ unicode causes all sorts of
little problems because other tools aren't expecting to see strings
changed behind their backs.

I know little about the gritty details, but I see the bug reports...

-Miles

-- 
Any man who is a triangle, has thee right, when in Cartesian Space, to
have angles, which when summed, come to know more, nor no less, than
nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]
.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  4:08                           ` Linus Torvalds
  2008-01-17  4:30                             ` Kevin Ballard
@ 2008-01-17 10:08                             ` Wincent Colaiuta
  2008-01-17 16:43                               ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 10:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

El 17/1/2008, a las 5:08, Linus Torvalds escribió:

> On Wed, 16 Jan 2008, Kevin Ballard wrote:
>>
>> I believe it exists because HFS+ was created at a time when the Mac  
>> was moving
>> from a multi-encoding world (which was a nightmare) to a Unicode  
>> world and
>> they wanted to remove ambiguity in filenames. But I wasn't around  
>> when they
>> made this decision so this is just a guess.
>
> I do agree. And I think starting out case-insensitive (something  
> they must
> really hate by now) also made it less of an issue.

I hope you're right (about them hating it), but we'll see. They've  
just opened the source for the ZFS port they're working on. By the  
time it goes final and becomes the default FS, replacing HFS+,  
probably within a couple of years, we'll see if they make the same two  
design decisions which cause the kinds of problems being discussed  
here (case-insensitivity, and ubiquitous FS-level UTF-8 normalization).

I've done a dumb search in the ZFS source code for "CASE" and see that  
it can in theory support case-insensitivity as an optional feature.  
The potential is there for Apple to use this. I personally hope that  
they don't, because as has already been pointed out, these little  
tricks tend to make life more difficult for users rather than helping  
them (the day I have two files in the same directory called "Märchen"  
and want to specify one of them on the command line I'll worry about  
that when I come to it).

http://fuzzy.wordpress.com/2007/06/09/zfsandfilesystemoptions/

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  7:11                                     ` Mitch Tishmack
@ 2008-01-17 10:22                                       ` Wincent Colaiuta
  2008-01-17 13:44                                         ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 10:22 UTC (permalink / raw)
  To: Mitch Tishmack; +Cc: git

El 17/1/2008, a las 8:11, Mitch Tishmack escribió:

> I was going to post this earlier, but wanted to search the archives  
> first. Here are the commands assuming you don't want to or can't  
> partition a drive and format as ufs (I don't care for HFS+ much). I  
> can't believe I didn't find the command in the git list archives, so  
> voilà:
>
> $ hdiutil create -size 300m -fs UFS foo.dmg
> ...............................................................................
> created: /Users/mitch/foo.dmg
> $ hdiutil attach foo.dmg
> /dev/disk2          	GUID_partition_scheme          	
> /dev/disk2s1        	Apple_UFS                      	/Volumes/untitled
> $ cd /Volumes/untitled && git clone git://git.kernel.org/pub/scm/git/ 
> git.git
> ... snipped ...
> $ cd git && git status
> # On branch master
> nothing to commit (working directory clean)
>
> After git clone in HFS+ land...
> $ git status
> # On branch master
> # Untracked files:
> #   (use "git add <file>..." to include in what will be committed)
> #
> #	gitweb/test/MaÌˆrchen
> nothing added to commit but untracked files present (use "git add"  
> to track)
>
> Should I just add this to the wiki?

Definitely.

> Then we can all go back to ignoring the insane filesystems.

While it's a nice workaround, it really is just that (a workaround)  
because performance will be suboptimal in a repository running on a  
disk image (and many of switched to Git because of its speed).

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  5:15     ` Junio C Hamano
@ 2008-01-17 10:28       ` Wincent Colaiuta
  2008-01-17 11:10         ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 10:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jay Soffian, git

El 17/1/2008, a las 6:15, Junio C Hamano escribió:

> "Jay Soffian" <jaysoffian+git@gmail.com> writes:
>
>> So here's what I can see as being useful additions to git:
>> ...
>> Thoughts (besides "patches welcomed")?
>
> I think we already discussed a plan to store normalization
> mapping in the index extension section and use it to avoid
> getting confused by readdir(3) that lies to us.  Is there any
> more thing that need to be discussed?
>
> I would presume that we would still add _new_ paths using the
> pathname we receive from the user (there is no need for us to be
> similarly insane as broken "normalizing" filesystems), but when
> deciding if a path is new or we already have it in the index
> would be done by seeing if an entry already exists in the index
> whose "normalized" form is the same as the "normalized" form of
> the given path --- that way we would not add two paths to the
> index that would "normalize" to the same string.

And what do we do when asked to check out a tree which has two  
different files in it whose normalized forms are the same (ie. a clone  
of a repo created on a non-HFS+ filesystem)?

We either have to fail catastrophically, preventing the user from  
working with that tree on HFS+, or arbitrarily pick one of the files  
as the "winner" which gets written out into the work tree. None of the  
options is particularly attractive, although luckily this exact  
situation is unlikely to come up in practice.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 10:28       ` Wincent Colaiuta
@ 2008-01-17 11:10         ` Johannes Schindelin
  2008-01-17 11:23           ` Pedro Melo
  2008-01-17 11:46           ` Wincent Colaiuta
  0 siblings, 2 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 11:10 UTC (permalink / raw)
  To: Wincent Colaiuta; +Cc: Junio C Hamano, Jay Soffian, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2222 bytes --]

Hi,

[Jay, don't cull Cc: lists on vger.kernel.org.  I consider it rude.]

On Thu, 17 Jan 2008, Wincent Colaiuta wrote:

> El 17/1/2008, a las 6:15, Junio C Hamano escribió:
> 
> > "Jay Soffian" <jaysoffian+git@gmail.com> writes:
> > 
> > > So here's what I can see as being useful additions to git:
> > > ...
> > > Thoughts (besides "patches welcomed")?
> > 
> > I think we already discussed a plan to store normalization mapping in 
> > the index extension section and use it to avoid getting confused by 
> > readdir(3) that lies to us.  Is there any more thing that need to be 
> > discussed?

Yes, and I think that a lot of time would have more wisely spent on 
reading that, and trying to implement it, than writing a number of long 
mails, repeating the _same_ (refuted) points over and over again.

> > I would presume that we would still add _new_ paths using the pathname 
> > we receive from the user (there is no need for us to be similarly 
> > insane as broken "normalizing" filesystems), but when deciding if a 
> > path is new or we already have it in the index would be done by seeing 
> > if an entry already exists in the index whose "normalized" form is the 
> > same as the "normalized" form of the given path --- that way we would 
> > not add two paths to the index that would "normalize" to the same 
> > string.

Agree.

> And what do we do when asked to check out a tree which has two different 
> files in it whose normalized forms are the same (ie. a clone of a repo 
> created on a non-HFS+ filesystem)?
> 
> We either have to fail catastrophically, preventing the user from 
> working with that tree on HFS+, or arbitrarily pick one of the files as 
> the "winner" which gets written out into the work tree. None of the 
> options is particularly attractive, although luckily this exact 
> situation is unlikely to come up in practice.

Anything else but failure would be Not What You Want.  You might want a 
special mode where you use a _different_ name on-disk (something like the 
infamous short names on FAT), but that _must_ be turned off by default: 
think of Martin's HEAD example.  Sometimes, it's just not possible to 
check such a tree out on a less-than-nice system.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:10         ` Johannes Schindelin
@ 2008-01-17 11:23           ` Pedro Melo
  2008-01-17 11:51             ` Wincent Colaiuta
  2008-01-17 13:05             ` Johannes Schindelin
  2008-01-17 11:46           ` Wincent Colaiuta
  1 sibling, 2 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17 11:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Wincent Colaiuta, Junio C Hamano, Jay Soffian, git

Hi,

On Jan 17, 2008, at 11:10 AM, Johannes Schindelin wrote:
> [Jay, don't cull Cc: lists on vger.kernel.org.  I consider it rude.]
>
> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>
>> El 17/1/2008, a las 6:15, Junio C Hamano escribió:
>>
>>> "Jay Soffian" <jaysoffian+git@gmail.com> writes:
>>>
>>>> So here's what I can see as being useful additions to git:
>>>> ...
>>>> Thoughts (besides "patches welcomed")?
>>>
>>> I think we already discussed a plan to store normalization  
>>> mapping in
>>> the index extension section and use it to avoid getting confused by
>>> readdir(3) that lies to us.  Is there any more thing that need to be
>>> discussed?
>
> Yes, and I think that a lot of time would have more wisely spent on
> reading that, and trying to implement it, than writing a number of  
> long
> mails, repeating the _same_ (refuted) points over and over again.

I searched the archives for the posts about normalization and I could  
not find them, sorry.

Is stringprep (RFC 3454) being proposed as an optional normalization  
step before lookups in the index?

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:10         ` Johannes Schindelin
  2008-01-17 11:23           ` Pedro Melo
@ 2008-01-17 11:46           ` Wincent Colaiuta
  1 sibling, 0 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 11:46 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Jay Soffian, git

El 17/1/2008, a las 12:10, Johannes Schindelin escribió:

> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>
>> And what do we do when asked to check out a tree which has two  
>> different
>> files in it whose normalized forms are the same (ie. a clone of a  
>> repo
>> created on a non-HFS+ filesystem)?
>>
>> We either have to fail catastrophically, preventing the user from
>> working with that tree on HFS+, or arbitrarily pick one of the  
>> files as
>> the "winner" which gets written out into the work tree. None of the
>> options is particularly attractive, although luckily this exact
>> situation is unlikely to come up in practice.
>
> Anything else but failure would be Not What You Want.  You might  
> want a
> special mode where you use a _different_ name on-disk (something  
> like the
> infamous short names on FAT), but that _must_ be turned off by  
> default:
> think of Martin's HEAD example.  Sometimes, it's just not possible to
> check such a tree out on a less-than-nice system.

Such a special mode would be mostly useless in most contexts, where  
Git is used to track source code. It would enable you to check out the  
tree for inspection, but you probably couldn't build anything from it  
seeing as at least one of the filenames specified in your Makefile  
wouldn't be present in the work tree.

As such, in that kind of situation I'd rather see a big red warning  
printed out that the checkout failed because a particular file  
couldn't be written out, and perhaps an instruction to the user that  
they can use "git show" if they want to see the blob/s which wasn't/ 
weren't written.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:23           ` Pedro Melo
@ 2008-01-17 11:51             ` Wincent Colaiuta
  2008-01-17 12:53               ` Johannes Schindelin
  2008-01-17 17:58               ` Junio C Hamano
  2008-01-17 13:05             ` Johannes Schindelin
  1 sibling, 2 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 11:51 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Johannes Schindelin, Junio C Hamano, Jay Soffian, git

El 17/1/2008, a las 12:23, Pedro Melo escribió:

> On Jan 17, 2008, at 11:10 AM, Johannes Schindelin wrote:
>> [Jay, don't cull Cc: lists on vger.kernel.org.  I consider it rude.]
>>
>> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>>
>>> El 17/1/2008, a las 6:15, Junio C Hamano escribió:
>>>
>>>> "Jay Soffian" <jaysoffian+git@gmail.com> writes:
>>>>
>>>>> So here's what I can see as being useful additions to git:
>>>>> ...
>>>>> Thoughts (besides "patches welcomed")?
>>>>
>>>> I think we already discussed a plan to store normalization  
>>>> mapping in
>>>> the index extension section and use it to avoid getting confused by
>>>> readdir(3) that lies to us.  Is there any more thing that need to  
>>>> be
>>>> discussed?
>>
>> Yes, and I think that a lot of time would have more wisely spent on
>> reading that, and trying to implement it, than writing a number of  
>> long
>> mails, repeating the _same_ (refuted) points over and over again.
>
> I searched the archives for the posts about normalization and I  
> could not find them, sorry.
>
> Is stringprep (RFC 3454) being proposed as an optional normalization  
> step before lookups in the index?

If this is really just a platform-specific hack, can we use platform- 
specific code to do the normalization?

On Mac OS X we have (unfortunately only 10.4 and up):

CFStringCreateWithFileSystemRepresentation()
CFStringGetFileSystemRepresentation()
CFStringGetMaximumSizeOfFileSystemRepresentation()

If we were to use those you'd at least know that you're getting the  
true normalized form as the system defines it.

> The terms used in this Q&A, decomposed and precomposed, roughly  
> correspond to Unicode Normal Forms D and C, respectively. However,  
> most volume formats do not follow the exact specification for these  
> normal forms. For example, HFS Plus uses a variant of Normal Form D  
> in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800  
> through U+2FAFF are not decomposed (this avoids problems with round  
> trip conversions from old Mac text encodings). It's likely that your  
> volume format has similar oddities.


http://developer.apple.com/qa/qa2001/qa1173.html

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:51             ` Wincent Colaiuta
@ 2008-01-17 12:53               ` Johannes Schindelin
  2008-01-17 13:40                 ` Wincent Colaiuta
  2008-01-17 17:58               ` Junio C Hamano
  1 sibling, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 12:53 UTC (permalink / raw)
  To: Wincent Colaiuta; +Cc: Pedro Melo, Junio C Hamano, Jay Soffian, git

Hi,

On Thu, 17 Jan 2008, Wincent Colaiuta wrote:

> On Mac OS X we have (unfortunately only 10.4 and up):

That remark about the version raises my eyebrows.  Where I live, 10.2.8 is 
_still_ quite common.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:23           ` Pedro Melo
  2008-01-17 11:51             ` Wincent Colaiuta
@ 2008-01-17 13:05             ` Johannes Schindelin
  1 sibling, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 13:05 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Wincent Colaiuta, Junio C Hamano, Jay Soffian, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1657 bytes --]

Hi,

On Thu, 17 Jan 2008, Pedro Melo wrote:

> On Jan 17, 2008, at 11:10 AM, Johannes Schindelin wrote:
> > [Jay, don't cull Cc: lists on vger.kernel.org.  I consider it rude.]
> > 
> > On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
> > 
> > > El 17/1/2008, a las 6:15, Junio C Hamano escribió:
> > > 
> > > > "Jay Soffian" <jaysoffian+git@gmail.com> writes:
> > > > 
> > > > > So here's what I can see as being useful additions to git:
> > > > > ...
> > > > > Thoughts (besides "patches welcomed")?
> > > > 
> > > > I think we already discussed a plan to store normalization mapping 
> > > > in the index extension section and use it to avoid getting 
> > > > confused by readdir(3) that lies to us.  Is there any more thing 
> > > > that need to be discussed?
> > 
> > Yes, and I think that a lot of time would have more wisely spent on 
> > reading that, and trying to implement it, than writing a number of 
> > long mails, repeating the _same_ (refuted) points over and over again.
> 
> I searched the archives for the posts about normalization and I could 
> not find them, sorry.

Here's my pointer:

http://thread.gmane.org/gmane.comp.gnu.make.devel/387/focus=61073

FWIW I searched by the term "readdir", and then browsed the thread to find 
a more interesting post than the first hit.

> Is stringprep (RFC 3454) being proposed as an optional normalization 
> step before lookups in the index?

I don't know.  I'd probably prefer something using iconv (which we use 
already if it's available), so that the same system can be used for 
case-insensitivity, UTF-8 normalisation, but also other transformations 
you might wish to perform.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 12:53               ` Johannes Schindelin
@ 2008-01-17 13:40                 ` Wincent Colaiuta
  0 siblings, 0 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-17 13:40 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Pedro Melo, Junio C Hamano, Jay Soffian, git

El 17/1/2008, a las 13:53, Johannes Schindelin escribió:

> Hi,
>
> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>
>> On Mac OS X we have (unfortunately only 10.4 and up):
>
> That remark about the version raises my eyebrows.  Where I live,  
> 10.2.8 is
> _still_ quite common.

There may be alternatives that I don't know about.

All the way back to 10.0 you have -[NSString  
fileSystemRepresentation], which does the same thing but that's  
Objective-C. I wouldn't be surprised if that's just a wrapper for the  
CF functions; that's often the way it is on Mac OS X. And often, the  
CF functions *are* present on older systems, but they're just not  
declared in public headers. I wouldn't actually recommend using a  
private SPI, but they are often there.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 10:22                                       ` Wincent Colaiuta
@ 2008-01-17 13:44                                         ` Kevin Ballard
  2008-01-17 15:57                                           ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17 13:44 UTC (permalink / raw)
  To: Wincent Colaiuta; +Cc: Mitch Tishmack, git

[-- Attachment #1: Type: text/plain, Size: 2315 bytes --]

On Jan 17, 2008, at 5:22 AM, Wincent Colaiuta wrote:

> El 17/1/2008, a las 8:11, Mitch Tishmack escribió:
>
>> I was going to post this earlier, but wanted to search the archives  
>> first. Here are the commands assuming you don't want to or can't  
>> partition a drive and format as ufs (I don't care for HFS+ much). I  
>> can't believe I didn't find the command in the git list archives,  
>> so voilà:
>>
>> $ hdiutil create -size 300m -fs UFS foo.dmg
>> ...............................................................................
>> created: /Users/mitch/foo.dmg
>> $ hdiutil attach foo.dmg
>> /dev/disk2          	GUID_partition_scheme          	
>> /dev/disk2s1        	Apple_UFS                      	/Volumes/ 
>> untitled
>> $ cd /Volumes/untitled && git clone git://git.kernel.org/pub/scm/ 
>> git/git.git
>> ... snipped ...
>> $ cd git && git status
>> # On branch master
>> nothing to commit (working directory clean)
>>
>> After git clone in HFS+ land...
>> $ git status
>> # On branch master
>> # Untracked files:
>> #   (use "git add <file>..." to include in what will be committed)
>> #
>> #	gitweb/test/MaÌˆrchen
>> nothing added to commit but untracked files present (use "git add"  
>> to track)
>>
>> Should I just add this to the wiki?
>
> Definitely.
>
>> Then we can all go back to ignoring the insane filesystems.
>
> While it's a nice workaround, it really is just that (a workaround)  
> because performance will be suboptimal in a repository running on a  
> disk image (and many of switched to Git because of its speed).

Not only is it suboptimal, it's also not acceptable, plain and simple.  
If an individual wants to do that, sure, but it's simply not an  
appropriate solution in general for this problem. I certainly don't  
want to have to attach a disk image every time I want access to  
anything I keep in a git repo, nor do I want to be restricted to  
keeping everything within a certain filesystem on disk. Additionally,  
while I'm not certain it's impossible, it's certainly very difficult  
to attach a disk image without anybody logged into the system at the  
GUI, as diskarbitrationd won't be running.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  6:13                                   ` Geert Bosch
  2008-01-17  7:11                                     ` Mitch Tishmack
@ 2008-01-17 14:02                                     ` Andrew Heybey
  2008-01-17 15:04                                       ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Andrew Heybey @ 2008-01-17 14:02 UTC (permalink / raw)
  To: Geert Bosch
  Cc: Kevin Ballard <kevin@sb.org>Martin Langhoff, Linus Torvalds,
	Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

Geert Bosch <bosch@adacore.com> writes:

> For those on Mac OS X: it is possible to create a case-sensitive HFS+
> partition and
> use it with git. You even can just create a disk image and mount it.
> However,
> I wouldn't quite try to use it as startup filesystem...

This is starting to stray far afield, but the first thing I did when I
got a Macbook was to reinstall it with case-sensitive HFS as the boot
file system.  Works fine, including with git.  The only problem I have
had is that FileVault does not work.  There are rumored to be some
third-part apps that do not work but I do not use that many of those
anyway.

andrew

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 14:02                                     ` Andrew Heybey
@ 2008-01-17 15:04                                       ` Kevin Ballard
  2008-01-19 19:29                                         ` Kyle Moffett
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17 15:04 UTC (permalink / raw)
  To: Andrew Heybey
  Cc: Geert Bosch, Kevin Ballard <kevin@sb.org>Martin Langhoff,
	Linus Torvalds, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1037 bytes --]

On Jan 17, 2008, at 9:02 AM, Andrew Heybey wrote:

> Geert Bosch <bosch@adacore.com> writes:
>
>> For those on Mac OS X: it is possible to create a case-sensitive HFS+
>> partition and
>> use it with git. You even can just create a disk image and mount it.
>> However,
>> I wouldn't quite try to use it as startup filesystem...
>
> This is starting to stray far afield, but the first thing I did when I
> got a Macbook was to reinstall it with case-sensitive HFS as the boot
> file system.  Works fine, including with git.  The only problem I have
> had is that FileVault does not work.  There are rumored to be some
> third-part apps that do not work but I do not use that many of those
> anyway.
>
> andrew

The main problem with this approach is you know for certain that using  
HFSX as the boot partition is barely tested by Apple, and certainly  
untested by third-party apps. This means the potential for breakage is  
extremely high.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 13:44                                         ` Kevin Ballard
@ 2008-01-17 15:57                                           ` Johannes Schindelin
  2008-01-17 16:53                                             ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 15:57 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Wincent Colaiuta, Mitch Tishmack, git

Hi,

On Thu, 17 Jan 2008, Kevin Ballard wrote:

> On Jan 17, 2008, at 5:22 AM, Wincent Colaiuta wrote:
> 
> > While it's a nice workaround, it really is just that (a workaround) 
> > because performance will be suboptimal in a repository running on a 
> > disk image (and many of switched to Git because of its speed).
> 
> Not only is it suboptimal, it's also not acceptable, plain and simple.

If it's not acceptable, do something about it (and I don't mean writing 50 
emails). If you don't want to do something about it, I have to assume that 
you accept it as-is.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 10:08                             ` Wincent Colaiuta
@ 2008-01-17 16:43                               ` Linus Torvalds
  2008-01-17 18:09                                 ` Mark Junker
  2008-01-17 22:01                                 ` JM Ibanez
  0 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17 16:43 UTC (permalink / raw)
  To: Wincent Colaiuta
  Cc: Kevin Ballard, Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>
> (the day I have two files in the same directory called "Märchen" and 
> want to specify one of them on the command line I'll worry about that 
> when I come to it).

Side note: the thing is, the reason people shouldn't worry about it is 
that this is a *trivial* thing to handle. You really don't even need to 
know what you're doing. And you can test it today, easily.

Having two (differently encoded) files like that is really no different 
from the traditional UNIX FAQ of "how do I remove a file starting with 
'-'" or even more closely "how do I remove a file that has a character in 
it that I cannot get at the keyboard".

In other words, on a bog-standard UNIX (and yes, in this case, I bet OS X 
works fine too for this test), just try this

	filename1=$(echo -e "hello\002there")
	filename2=$(echo -e "hello\003there")
	echo Odd file > "$filename1"
	echo Another odd file > "$filename2"

and now you have a filename that is actually rather hard to type on the 
command line. In fact, for me they even *look* the same:

	[torvalds@woody ~]$ ll hello*
	-rw-rw-r-- 1 torvalds torvalds  9 2008-01-17 08:23 hello?there
	-rw-rw-r-- 1 torvalds torvalds 17 2008-01-17 08:23 hello?there

See?

Even in my graphical browser, those two filenames look 100% *identical*. I 
could give you a screen-shot, but I'm lazy. Just take my word for it, or 
just fire up konqueror on Linux (but it may well depend on the particular 
font you're using).

[ And yes, for other browsers, you might have something that shows them as 
  different characters - depending on the font, it might show up as a 
  small box with [00 02] vs [00 03] in it, for example. But that's also 
  actually 100% true of the two different encodings of 'ä' - you could 
  easily have a file broswer that shows the multi-character as a 
  multi-character, exactly to distinguish them and show that one of them 
  isn't "normalized"!

  The point is, once the filesystem doesn't corrupt the data, it's always 
  easy to get at, and there is never any ambiguity. ]

How is this different from "Märchen" spelled with two different encodings 
for that "ä"?

I'll tell you: it's not at all different. It's 100% the exact same issue.

And does that make you perhaps go "Hunh? How do I remove it, or open it?"

And the fact is, those "idential looking" filenames (and thus they must be 
the same, and something should have normalized them to the same thing, 
no?) are obviously two different files, and they are *really*easy* to edit 
and look at.

Fire up that graphical browser again, and it doesn't even matter whether 
the filename looks identical or not, it shows up as two different files, 
and you can drag them around independently, rename them there, and at 
least my file browser shows clearly which is which, because I get a small 
icon with a preview in it, so I directly see which one is the "Odd file" 
and which one is the "Another odd file".

So the whole "but they _look_ the same" argument is just total BS. In just 
about all character encodings there has always been unique and different 
"characters" that _look_ the same on screen, and it has never really made 
them actually *be* the same, and it has never been a valid argument for 
them being considered the same.

Because even when they *look* the same, that file browser that didn't show 
the difference in names visually, still showed them correctly as two 
separate files, and I could still just rename them by hand by 
right-clicking on them and picking "rename". 

So "look the same" is really not a new thing, nor is it even a really hard 
thing. Yes, people can get confused by it, but hey, people can get 
confused by *anything*. People get confused by filenames starting with a 
"-", yet nobody sane really says that filenames cannot start with a dash.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 15:57                                           ` Johannes Schindelin
@ 2008-01-17 16:53                                             ` Kevin Ballard
  2008-01-18  0:44                                               ` Robin Rosenberg
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-17 16:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Wincent Colaiuta, Mitch Tishmack, git

[-- Attachment #1: Type: text/plain, Size: 1528 bytes --]

On Jan 17, 2008, at 10:57 AM, Johannes Schindelin wrote:

> On Thu, 17 Jan 2008, Kevin Ballard wrote:
>
>> On Jan 17, 2008, at 5:22 AM, Wincent Colaiuta wrote:
>>
>>> While it's a nice workaround, it really is just that (a workaround)
>>> because performance will be suboptimal in a repository running on a
>>> disk image (and many of switched to Git because of its speed).
>>
>> Not only is it suboptimal, it's also not acceptable, plain and  
>> simple.
>
> If it's not acceptable, do something about it (and I don't mean  
> writing 50
> emails). If you don't want to do something about it, I have to  
> assume that
> you accept it as-is.

I never said I don't want to do anything about it. However, I do  
believe that it will take a significant investment of time and energy  
to learn all the gooey details of how git handles filenames and how  
the index works and all that jazz, which is knowledge that other  
people already have. I believe that, for me to solve this problem  
independently, it may require so much time that it never gets done  
(after all, I am fairly busy). However, if other people who already  
have this knowledge are willing to help, that would make this task far  
easier, especially given that if nobody else even acknowledges that  
this is a problem I don't have much hope of getting a patch accepted.

So again, I'm certainly going to try, but working by myself it simply  
may never get done.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 11:51             ` Wincent Colaiuta
  2008-01-17 12:53               ` Johannes Schindelin
@ 2008-01-17 17:58               ` Junio C Hamano
  2008-01-17 18:22                 ` Johan Herland
  1 sibling, 1 reply; 260+ messages in thread
From: Junio C Hamano @ 2008-01-17 17:58 UTC (permalink / raw)
  To: Wincent Colaiuta; +Cc: Pedro Melo, Johannes Schindelin, Jay Soffian, git

Wincent Colaiuta <win@wincent.com> writes:

> If this is really just a platform-specific hack, can we use platform- 
> specific code to do the normalization?

Unfortunately, I do not think this can be a platform-specific
hack.

If a project wants to be usable on both sane and insane
filesystems, people on platforms whose filesystems treat "foo"
and "Foo" as two distinct pathnames (and "Ma<UMLAUT>rchen" and
"M<A-with-UMLAUT>rchen" as two distinct ones) need to be
prevented from creating both in their tree objects at the same
time.  Once you create two pathnames xt_connmark.c and
xt_CONNMARK.c in the same tree object in your project, people on
case insensitive filesystems cannot work with your project (you
cannot check out the kernel source tree and work on it on vfat).

This is exactly the same logic as making autocrlf=safe (or at
least 'input') the default for projects that people need to work
both on UNIX and Windows, which Steffen Prohaska has been
adovocating in another thread.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 16:43                               ` Linus Torvalds
@ 2008-01-17 18:09                                 ` Mark Junker
  2008-01-17 18:12                                   ` Pedro Melo
                                                     ` (2 more replies)
  2008-01-17 22:01                                 ` JM Ibanez
  1 sibling, 3 replies; 260+ messages in thread
From: Mark Junker @ 2008-01-17 18:09 UTC (permalink / raw)
  To: git@vger.kernel.org

Linus Torvalds schrieb:

> In other words, on a bog-standard UNIX (and yes, in this case, I bet OS X 
> works fine too for this test), just try this
> 
> 	filename1=$(echo -e "hello\002there")
> 	filename2=$(echo -e "hello\003there")
> 	echo Odd file > "$filename1"
> 	echo Another odd file > "$filename2"
> 
> and now you have a filename that is actually rather hard to type on the 
> command line. In fact, for me they even *look* the same:
> 
> 	[torvalds@woody ~]$ ll hello*
> 	-rw-rw-r-- 1 torvalds torvalds  9 2008-01-17 08:23 hello?there
> 	-rw-rw-r-- 1 torvalds torvalds 17 2008-01-17 08:23 hello?there
> 
> See?

Sorry, but you're using different characters that look the same. But 
Kevins point was that it's a different thing if you use two characters 
that look the same or the same character with different encodings. This 
makes this HFS-specific problem different from the "look the same"- or 
the "case-insensitivity"-issues.

BTW: I also read about your argument that you wouldn't convert file data 
to normalized UTF-8 (I agree with you that this would be nonsense) and 
therefore filenames shouldn't be converted too. This is something where 
I have to disagree because a filename (like ctime, mtime, atime, ...) 
are meta data (while file contents isn't) and - until now - I would've 
guessed that you agree on this point because git doesn't care about 
filenames but contents.

IMHO it would be the best solution when git stores all string meta data 
in UTF-8 and converts it to the target systems file system encoding. 
That would fix all those problems with different locales and file system 
encodings ...

However, I have to agree that the enforced character set conversion 
causes more problems than it solves.

Regards,
Mark

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:09                                 ` Mark Junker
@ 2008-01-17 18:12                                   ` Pedro Melo
  2008-01-17 18:18                                     ` Johannes Schindelin
  2008-01-17 18:44                                     ` Linus Torvalds
  2008-01-17 18:42                                   ` Linus Torvalds
  2008-01-17 21:27                                   ` Dmitry Potapov
  2 siblings, 2 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17 18:12 UTC (permalink / raw)
  To: Mark Junker; +Cc: git@vger.kernel.org

Hi,

On Jan 17, 2008, at 6:09 PM, Mark Junker wrote:
> Linus Torvalds schrieb:
> IMHO it would be the best solution when git stores all string meta  
> data in UTF-8 and converts it to the target systems file system  
> encoding. That would fix all those problems with different locales  
> and file system encodings ...

+1.

And I would suggest the use of RFC 3454 as the guidelines for UTF-8  
normalization.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:12                                   ` Pedro Melo
@ 2008-01-17 18:18                                     ` Johannes Schindelin
  2008-01-17 18:36                                       ` Mark Junker
  2008-01-17 18:38                                       ` Pedro Melo
  2008-01-17 18:44                                     ` Linus Torvalds
  1 sibling, 2 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 18:18 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Mark Junker, git@vger.kernel.org

Hi,

On Thu, 17 Jan 2008, Pedro Melo wrote:

> On Jan 17, 2008, at 6:09 PM, Mark Junker wrote:
>
> > IMHO it would be the best solution when git stores all string meta 
> > data in UTF-8 and converts it to the target systems file system 
> > encoding. That would fix all those problems with different locales and 
> > file system encodings ...
> 
> +1.

-1.

It's just too arrogant to force your particular preferences down the 
throat of every git user.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 17:58               ` Junio C Hamano
@ 2008-01-17 18:22                 ` Johan Herland
  0 siblings, 0 replies; 260+ messages in thread
From: Johan Herland @ 2008-01-17 18:22 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Wincent Colaiuta, Pedro Melo, Johannes Schindelin,
	Jay Soffian

On Thursday 17 January 2008, Junio C Hamano wrote:
> Wincent Colaiuta <win@wincent.com> writes:
> 
> > If this is really just a platform-specific hack, can we use platform- 
> > specific code to do the normalization?
> 
> Unfortunately, I do not think this can be a platform-specific
> hack.
> 
> If a project wants to be usable on both sane and insane
> filesystems, people on platforms whose filesystems treat "foo"
> and "Foo" as two distinct pathnames (and "Ma<UMLAUT>rchen" and
> "M<A-with-UMLAUT>rchen" as two distinct ones) need to be
> prevented from creating both in their tree objects at the same
> time.  Once you create two pathnames xt_connmark.c and
> xt_CONNMARK.c in the same tree object in your project, people on
> case insensitive filesystems cannot work with your project (you
> cannot check out the kernel source tree and work on it on vfat).

IMHO, support for insane filesystems should be split into two parts:

1. A git config setting (probably in .gitattributes) that is enabled
   by the project to prevent anyone from committing files that would
   cause problems on insane filesystems. This setting must be enabled
   for everybody in the project (which is why it cannot easily be
   solved by the current hooks infrastructure which is per-repo only).

2. A platform-specific hack that detects whenever you're about to
   check out a problematic filename on an insane filesystem.
   The hack should either warn or (probably better) FAIL to check out
   the problematic file(s) (with an appropriate error message
   pointing at the setting in (1)).

AFAICS, _both_ are needed in order to solve this problem properly.

Have fun!

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:18                                     ` Johannes Schindelin
@ 2008-01-17 18:36                                       ` Mark Junker
  2008-01-17 18:38                                       ` Pedro Melo
  1 sibling, 0 replies; 260+ messages in thread
From: Mark Junker @ 2008-01-17 18:36 UTC (permalink / raw)
  To: git@vger.kernel.org

Johannes Schindelin schrieb:

> It's just too arrogant to force your particular preferences down the 
> throat of every git user.

It's not arrogant to make a suggestion. Where is your alternative solution?

However, what about storing an additional information like the file 
system encoding (for every file)? This would result in the same 
behaviour (and speed) as today as long as the file system encoding is 
the same. Conversion will only be done when the targets file system 
encoding is different.

BTW: This reminds me of the code page switching stuff back in the times 
of MS-DOS 4/5. This really wasn't funny.

Regards,
Mark

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:18                                     ` Johannes Schindelin
  2008-01-17 18:36                                       ` Mark Junker
@ 2008-01-17 18:38                                       ` Pedro Melo
  1 sibling, 0 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17 18:38 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Mark Junker, git@vger.kernel.org

Hi,

On Jan 17, 2008, at 6:18 PM, Johannes Schindelin wrote:
> On Thu, 17 Jan 2008, Pedro Melo wrote:
>> On Jan 17, 2008, at 6:09 PM, Mark Junker wrote:
>>
>>> IMHO it would be the best solution when git stores all string meta
>>> data in UTF-8 and converts it to the target systems file system
>>> encoding. That would fix all those problems with different  
>>> locales and
>>> file system encodings ...
>>
>> +1.
>
> -1.
>
> It's just too arrogant to force your particular preferences down the
> throat of every git user.

Do you agree that you need to store or at least calculate a  
normalized version of each filename to see if you are already  
tracking the file, to take in account all the the filesystems out  
there who are not case-preserving, case-sensitive?

If so, do you think those rules should be an option? Or a preference?

Should I specify in my config file that I want my filenames to be  
normalized?

Ignoring encoding, and case-sensitive issues in the git index creates  
problems for those people who want/need to use non-ascii chars in  
their filenames, and have some change of being able to collaborate  
with other users on different operating systems.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:09                                 ` Mark Junker
  2008-01-17 18:12                                   ` Pedro Melo
@ 2008-01-17 18:42                                   ` Linus Torvalds
  2008-01-17 18:50                                     ` Mark Junker
  2008-01-17 18:52                                     ` Pedro Melo
  2008-01-17 21:27                                   ` Dmitry Potapov
  2 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17 18:42 UTC (permalink / raw)
  To: Mark Junker; +Cc: git@vger.kernel.org

On Thu, 17 Jan 2008, Mark Junker wrote:
> 
> Sorry, but you're using different characters that look the same. But Kevins
> point was that it's a different thing if you use two characters that look the
> same or the same character with different encodings.

But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that: 
different strings (not even characters: the second is actually a 
multi-character) that just look the same.

You try to twist the argument by just claiming that they are the same 
"character". They aren't, unless you *define* character to be the same as 
"glyph". Of course, if you claim that, then you can always support your 
argument, but I claim that is a bogus and incorrect axiom to start with!

Too many people confuse "character" and "glyph". They are different.

See, for example

	http://en.wikipedia.org/wiki/Unicode

and notice the *many* places where they try to make that distinction 
between "character" and "glyph" clear (and also "code values", which are 
the actual bytes that encode a character).

See also

	http://en.wikipedia.org/wiki/Unicode_normalization

and realize that a Unicode sequence is a sequence of *characters* even if 
it is not normalized! Those things are still characters, when they are the 
"simpler" non-combined characters.

You are trying to make a totally BOGUS argument, and you base it on the 
INCORRECT basis that the TWO characters 'a'+'¨' somehow aren't independent 
characters. They *are*. They are *different* characters from 'ä', even 
though they may be "Canonically equivalent" as a sequence.

The fact is that "equivalent" does not mean "same". Why cannot people 
accept that?

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:12                                   ` Pedro Melo
  2008-01-17 18:18                                     ` Johannes Schindelin
@ 2008-01-17 18:44                                     ` Linus Torvalds
  2008-01-17 19:02                                       ` Pedro Melo
  1 sibling, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17 18:44 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Mark Junker, git@vger.kernel.org



On Thu, 17 Jan 2008, Pedro Melo wrote:
> 
> On Jan 17, 2008, at 6:09 PM, Mark Junker wrote:
>
> > IMHO it would be the best solution when git stores all string meta data in
> > UTF-8 and converts it to the target systems file system encoding. That would
> > fix all those problems with different locales and file system encodings ...
> 
> +1.
> 
> And I would suggest the use of RFC 3454 as the guidelines for UTF-8
> normalization.

The problem is that there is no way to know what the "target system 
encoding" is.

And it wouldn't actually solve the bigger problem on OS X anyway: as long 
as you are case-insensitive, you'll have all the same problems (ie the 
insane OS X filesystem presumably thinks that "MÄRCHEN" and "Märchen" are 
also identical, because they are "equivalent" names).

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:42                                   ` Linus Torvalds
@ 2008-01-17 18:50                                     ` Mark Junker
  2008-01-17 18:52                                     ` Pedro Melo
  1 sibling, 0 replies; 260+ messages in thread
From: Mark Junker @ 2008-01-17 18:50 UTC (permalink / raw)
  To: git@vger.kernel.org

Linus Torvalds schrieb:

> You try to twist the argument by just claiming that they are the same 
> "character". They aren't, unless you *define* character to be the same as 
> "glyph". Of course, if you claim that, then you can always support your 
> argument, but I claim that is a bogus and incorrect axiom to start with!

Ahhhh ... now I understand.

Regards,
Mark

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:42                                   ` Linus Torvalds
  2008-01-17 18:50                                     ` Mark Junker
@ 2008-01-17 18:52                                     ` Pedro Melo
       [not found]                                       ` <alpine.LFD.1.00.0801 171100330.14959@woody.linux-foundation.org>
                                                         ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17 18:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Junker, git@vger.kernel.org

Hi,

On Jan 17, 2008, at 6:42 PM, Linus Torvalds wrote:
> Too many people confuse "character" and "glyph". They are different.

This is very true.

> The fact is that "equivalent" does not mean "same". Why cannot people
> accept that?

I'll shut up now if you can answer me one question,  because it  
really is a problem for my team.

We have people using windows, people using Macs, and people using  
several flavors of Linux desktops. They all have different settings  
and if I add a file like áéióú that happens to be UTF-8 encoded, it  
will reach a iso-latin-1 user as visual garbage. git will track the  
file perfectly, we know that, because the sequence of bytes that my  
system used to create the file will be the same on all "sane"  
systems, but the file will look "funny" to some users, and we get  
complaints for some less enlightened ones.

The answer is that users should not create filenames with non-ascii  
characters if they want a consistent experience, right?

This is just so that I can write a best practices document to them...

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:52                                     ` Pedro Melo
       [not found]                                       ` <alpine.LFD.1.00.0801 171100330.14959@woody.linux-foundation.org>
@ 2008-01-17 19:01                                       ` Theodore Tso
  2008-01-17 19:11                                       ` Linus Torvalds
  2 siblings, 0 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-17 19:01 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Linus Torvalds, Mark Junker, git@vger.kernel.org

On Thu, Jan 17, 2008 at 06:52:57PM +0000, Pedro Melo wrote:
> The answer is that users should not create filenames with non-ascii 
> characters if they want a consistent experience, right?
>
> This is just so that I can write a best practices document to them...

That's the easist thing to do if you want to assure that things will
mostly work across multiple different OS's, with different levels of
sanity.  You might also want to include that it's a bad idea to create
two filenames that are identical on case-insensitive filesystems,
i.e., "makefile" and "Makefile", or "foo.H" and "foo.h" which even
though it works Just Fine on Linux, will likely cause problems on
Windows and MacOS filesystems, and other systems that are insane with
respect to case insensitivity.

							- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:44                                     ` Linus Torvalds
@ 2008-01-17 19:02                                       ` Pedro Melo
  0 siblings, 0 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-17 19:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Junker, git@vger.kernel.org

Hi,

On Jan 17, 2008, at 6:44 PM, Linus Torvalds wrote:
> On Thu, 17 Jan 2008, Pedro Melo wrote:
>>
>> On Jan 17, 2008, at 6:09 PM, Mark Junker wrote:
>>
>>> IMHO it would be the best solution when git stores all string  
>>> meta data in
>>> UTF-8 and converts it to the target systems file system encoding.  
>>> That would
>>> fix all those problems with different locales and file system  
>>> encodings ...
>>
>> +1.
>>
>> And I would suggest the use of RFC 3454 as the guidelines for UTF-8
>> normalization.
>
> The problem is that there is no way to know what the "target system
> encoding" is.

Correct. Storing or using a normalized version of the filename is  
only part of the problem.

The full problem is:

User A <-> filesystem A <-#-> git < ...... > git <-#-> filesystem B <- 
 > user B.

You have to encode/decode/normalize on all the <-#-> and there is no  
magic bullet. Each user would have to tell git "Hey I'm using utf-8"  
or "Hey, I'm a masochist using HFS+".

But I think its important for git to store the filenames in something  
that at least permits this kind of scenario.

All encoding/decoding/normalization is of course optional, and for  
git, it still is a sequence of bytes.

> And it wouldn't actually solve the bigger problem on OS X anyway:  
> as long
> as you are case-insensitive, you'll have all the same problems (ie the
> insane OS X filesystem presumably thinks that "MÄRCHEN" and  
> "Märchen" are
> also identical, because they are "equivalent" names).

Correct. HFS+ has bigger problems. I'm not sure if this is enough to  
solve it.

But it would solve two linux users using different encodings.

And given that the filtering layers are optional, you have to  
configure them, it wont bite nobody.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:52                                     ` Pedro Melo
       [not found]                                       ` <alpine.LFD.1.00.0801 171100330.14959@woody.linux-foundation.org>
  2008-01-17 19:01                                       ` Theodore Tso
@ 2008-01-17 19:11                                       ` Linus Torvalds
  2008-01-18  0:18                                         ` Kevin Ballard
                                                           ` (2 more replies)
  2 siblings, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17 19:11 UTC (permalink / raw)
  To: Pedro Melo; +Cc: Mark Junker, git@vger.kernel.org

On Thu, 17 Jan 2008, Pedro Melo wrote:
>
> We have people using windows, people using Macs, and people using several
> flavors of Linux desktops. They all have different settings and if I add a
> file like áéióú that happens to be UTF-8 encoded, it will reach a iso-latin-1
> user as visual garbage.

Yes.

> git will track the file perfectly, we know that, because the sequence of 
> bytes that my system used to create the file will be the same on all 
> "sane" systems, but the file will look "funny" to some users, and we get 
> complaints for some less enlightened ones.

I can't really suggest anything else than trying to make everybody use 
UTF-8.

[ Not just for filenames, by the way - this is one of the reasons I think
  it is so *important* to not corrupt filenames, exactly because this is 
  in no way filename-specific at all, and filenames are generally "textual 
  data" exactly the same way a text-file is.

  But only totally insane people think that you should force-normalize 
  text-files, even though all the issues are obviously all the same 
  regardless of whether it's a filename or a word in textfile. ]

And yes, I also realize that it's not going to be realistic. We're 
probably *closer* to that than we used to be, but I don't think you can 
even make Windows think FAT is UTF-8.

I don't know how NTFS works (I know it is Unicode-aware, and I think it 
encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1 
translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
actually uses that unambiguous translation for any filenames).

Under modern Linux and OS X, UTF-8 is basically the only way (older Linux 
distros may be set up for Latin1, but at least the newer ones seem to all 
default to a UTF-8 locale).

> The answer is that users should not create filenames with non-ascii characters
> if they want a consistent experience, right?

Oh, absolutely. That takes care of 99.9% of all source projects. Even then 
you can have problems with case insensitivity (the Linux kernel sources 
are all US-ASCII filenames, for example, but *literally* has many files 
that are identical if you ignore case, and that's not unheard of).

So yes, to a first approximation, the answer is to simply avoid using 
anything but US-ASCII. It's seldom a big limitation when talking about 
filenames.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 18:09                                 ` Mark Junker
  2008-01-17 18:12                                   ` Pedro Melo
  2008-01-17 18:42                                   ` Linus Torvalds
@ 2008-01-17 21:27                                   ` Dmitry Potapov
  2 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-17 21:27 UTC (permalink / raw)
  To: Mark Junker; +Cc: git@vger.kernel.org

On Thu, Jan 17, 2008 at 07:09:43PM +0100, Mark Junker wrote:
> 
> Sorry, but you're using different characters that look the same. But 
> Kevins point was that it's a different thing if you use two characters 
> that look the same or the same character with different encodings.

No, the encoding was the same -- UTF-8. MacOSX converts one sequence of
Unicode characters to *another* sequence, which are canonical equivalent,
but being canonical equivalent does not mean they are the same characters.
In the same way, as being compatible equivalent does not mean being the
same. As well as, being case-insensitive equivalent does not mean being
the same... Do you remember DOS? It stored all filenames in upper-case,
so they original and stored names are case-insensitive equivalent, but
they are not the same!

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 16:43                               ` Linus Torvalds
  2008-01-17 18:09                                 ` Mark Junker
@ 2008-01-17 22:01                                 ` JM Ibanez
  2008-01-17 22:09                                   ` Johannes Schindelin
                                                     ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: JM Ibanez @ 2008-01-17 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wincent Colaiuta, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git@vger.kernel.org

Linus Torvalds <torvalds@linux-foundation.org> writes:
> So the whole "but they _look_ the same" argument is just total BS. In just 
> about all character encodings there has always been unique and different 
> "characters" that _look_ the same on screen, and it has never really made 
> them actually *be* the same, and it has never been a valid argument for 
> them being considered the same.

With the exception of Unicode. If you check the standard, two Unicode
codepoints (i.e. the numeric value that gets stored on disk) *can* map
to the same character, hence they are the same. They don't just look the
same, they are the same character -- even if the codepoints are
different (i.e. precomposed vs. decomposed characters). In fact, part of
the Unicode standard deals with that. (Technically, Unicode calls it
equivalence, but what the hey).

In other words, Unicode treats e.g. both U+0065 and U+00E9 as
fundamentally the same character. This comes even more into play in such
alphabets as Hangul (Korean) and the Japanese Kana.

-- 
JM Ibanez
Software Architect
Orange & Bronze Software Labs, Ltd. Co.

jm@orangeandbronze.com
http://software.orangeandbronze.com/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 22:01                                 ` JM Ibanez
@ 2008-01-17 22:09                                   ` Johannes Schindelin
  2008-01-18  1:27                                     ` Robin Rosenberg
  2008-01-17 23:05                                   ` Linus Torvalds
  2008-01-17 23:10                                   ` Dmitry Potapov
  2 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-17 22:09 UTC (permalink / raw)
  To: JM Ibanez
  Cc: Linus Torvalds, Wincent Colaiuta, Kevin Ballard, Jakub Narebski,
	Mark Junker, git@vger.kernel.org

Hi,

On Fri, 18 Jan 2008, JM Ibanez wrote:

> If you check the standard, two Unicode codepoints (i.e. the numeric 
> value that gets stored on disk) *can* map to the same character, hence 
> they are the same.

As Linus _already_ pointed out, you are confusing characters with glyphs.

Hth,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 22:01                                 ` JM Ibanez
  2008-01-17 22:09                                   ` Johannes Schindelin
@ 2008-01-17 23:05                                   ` Linus Torvalds
  2008-01-17 23:10                                   ` Dmitry Potapov
  2 siblings, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-17 23:05 UTC (permalink / raw)
  To: JM Ibanez
  Cc: Wincent Colaiuta, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git@vger.kernel.org

On Fri, 18 Jan 2008, JM Ibanez wrote:
>
> With the exception of Unicode. If you check the standard, two Unicode
> codepoints (i.e. the numeric value that gets stored on disk) *can* map
> to the same character, hence they are the same.

But if you want to make it clear, you can use "encoded character" or yes, 
"code point". 

But the thing is, even the unicode standard tends to just say "character", 
and a unicode string (for example) is defined to be a sequence of "code 
units" which in turn is about those *encoded* characters, which is all 
about the code points.

So you'll find that they are very careful in some technical definition 
parts to talk about "code points", but then in other sequences they talk 
about "character" even though they are referring to the actual code point 
(ie the figure literally has the unicode number in it!)

In fact, they sometimes even talk about "characters" in the totally 
non-encoding meaning of "glyph".

So yes, "character" is often ambiguous. It would be good to never use the 
word at all, and only talk about "code point" and "glyph" and one of the 
well-defined special terms like "combining character" or "replacement 
character".

But to take a representative example from The Unicode Standard, Chapter 2: 
"Unicode Design Principles":

  Characters are represented by code points that reside only in a memory 
  representation, as strings in memory, on disk, or in data transmission. 
  The Unicode Standard deals only with character codes.

(any speling mistakes mine). In other words, from the very beginning of 
the standard, very basic design principles chapter, it starts talking 
about characters being represented by code points and explicitly says that 
it really only deals with CHARACTER CODES.

Yes, I'm sure you can argue ad infinitum that all the "equivalences" and 
other crap means that a "character" can sometimes mean just about 
anything, but I'd say that it's pretty damn reasonable to equate "unicode 
character" with "code point" or "character code".

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 22:01                                 ` JM Ibanez
  2008-01-17 22:09                                   ` Johannes Schindelin
  2008-01-17 23:05                                   ` Linus Torvalds
@ 2008-01-17 23:10                                   ` Dmitry Potapov
  2 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-17 23:10 UTC (permalink / raw)
  To: JM Ibanez
  Cc: Linus Torvalds, Wincent Colaiuta, Kevin Ballard, Jakub Narebski,
	Johannes Schindelin, Mark Junker, git@vger.kernel.org

On Fri, Jan 18, 2008 at 06:01:13AM +0800, JM Ibanez wrote:
> 
> With the exception of Unicode.

Nice exception...

> If you check the standard,

The standard of what? Could you provide the exact reference?

> two Unicode
> codepoints (i.e. the numeric value that gets stored on disk)

Does the standard say something about disk storage?

> *can* map to the same character, 

So what?

> hence they are the same.

non-sequitor.

> They don't just look the
> same, they are the same character

Because?

> -- even if the codepoints are
> different (i.e. precomposed vs. decomposed characters).

And where exactly does the standard says so?

> In fact, part of
> the Unicode standard deals with that. (Technically, Unicode calls it
> equivalence, but what the hey).

So they are not the same after all? It is just you don't care
about what it actually says, right? How about this: Unicode
provides a unique number for every character. So, if numbers
are not the same then by definition of the Unicode standard
those characters are different.

> 
> In other words, Unicode treats e.g. both U+0065 and U+00E9 as
> fundamentally the same character.

There is no notion "fundamentally the same character" in the Unicode
standard as far as I know, and the characters you mentioned are very
different in Unicode:
http://www.fileformat.info/info/unicode/char/0065/index.htm
http://www.fileformat.info/info/unicode/char/00e9/index.htm
There have different names, they have different glyphs, and they
are functional different.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 19:11                                       ` Linus Torvalds
@ 2008-01-18  0:18                                         ` Kevin Ballard
  2008-01-18  0:35                                           ` Linus Torvalds
  2008-01-18  1:05                                         ` Robin Rosenberg
  2008-01-18 10:19                                         ` Peter Karlsson
  2 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-18  0:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pedro Melo, Mark Junker, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 714 bytes --]

On Jan 17, 2008, at 2:11 PM, Linus Torvalds wrote:

> [ Not just for filenames, by the way - this is one of the reasons I  
> think
>  it is so *important* to not corrupt filenames, exactly because this  
> is
>  in no way filename-specific at all, and filenames are generally  
> "textual
>  data" exactly the same way a text-file is.

I just don't understand why you insist that the filename is data, when  
it is clearly metadata. The filename has two purposes: the identify  
the file to the user, and to provide a handle with which to reference  
the file contents. The specific byte sequence is in no way sacred.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  0:18                                         ` Kevin Ballard
@ 2008-01-18  0:35                                           ` Linus Torvalds
  0 siblings, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-18  0:35 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Pedro Melo, Mark Junker, git@vger.kernel.org

On Thu, 17 Jan 2008, Kevin Ballard wrote:
> 
> I just don't understand why you insist that the filename is data, when it is
> clearly metadata.

Uhh. And exactly how do you know the difference, and why should it matter?

A lot of data is metadata. Look at the git index file. It's *all* 
metadata. Does that mean that the OS has the right to corrupt it?

IOW, why do you seem to argue that metadata something you can corrupt, but 
not then "regular" data?

Why is it ok to change a filename, when that same filename may *also* be 
encoded by the user in a regular data file (think about MD5SUM files, for 
example, that include the pathname, but now the pathname is part of the 
file data, not on a filesystem). 

So filenames are data, they're metadata, they're whatever. None of that 
means that it's acceptable to corrupt them, or gives the OS any reason to 
say that it "knows better" than the user in how users use them. It's still 
the *users* metadata, not the filesystems own metadata!

In many cases, users use filenames *as* data, ie the filename actually has 
a meaning in itself, not just as a handle to get the file contents.

If this was truly metadata that isn't visible to the user, and not under 
the users control (ie indirect block numbers etc), then you'd have a good 
point. At that point, it's obviously entirely up to the filesystem how the 
heck it encodes it.

But that's not what filenames are. Filenames are an index specified by the 
user, not by the computer. 

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 16:53                                             ` Kevin Ballard
@ 2008-01-18  0:44                                               ` Robin Rosenberg
  0 siblings, 0 replies; 260+ messages in thread
From: Robin Rosenberg @ 2008-01-18  0:44 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, Wincent Colaiuta, Mitch Tishmack, git

torsdagen den 17 januari 2008 skrev Kevin Ballard:
> On Jan 17, 2008, at 10:57 AM, Johannes Schindelin wrote:
> 
> > On Thu, 17 Jan 2008, Kevin Ballard wrote:
> >
> >> On Jan 17, 2008, at 5:22 AM, Wincent Colaiuta wrote:
> >>
> >>> While it's a nice workaround, it really is just that (a workaround)
> >>> because performance will be suboptimal in a repository running on a
> >>> disk image (and many of switched to Git because of its speed).
> >>
> >> Not only is it suboptimal, it's also not acceptable, plain and  
> >> simple.
> >
> > If it's not acceptable, do something about it (and I don't mean  
> > writing 50
> > emails). If you don't want to do something about it, I have to  
> > assume that
> > you accept it as-is.
> 
> I never said I don't want to do anything about it. However, I do  
> believe that it will take a significant investment of time and energy  
> to learn all the gooey details of how git handles filenames and how  
> the index works and all that jazz, which is knowledge that other  
> people already have. I believe that, for me to solve this problem  
> independently, it may require so much time that it never gets done  
> (after all, I am fairly busy). However, if other people who already  
> have this knowledge are willing to help, that would make this task far  
> easier, especially given that if nobody else even acknowledges that  
> this is a problem I don't have much hope of getting a patch accepted.
> 
> So again, I'm certainly going to try, but working by myself it simply  
> may never get done.

(This is only for those that think the problem should be solved somehow. The
rest can move on - nothing to see here)

You may look at http://rosenberg.homelinux.net/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=log;h=i18n
for inspiration. It's pretty obsolete by now and only a "proof of concept", i.e.
it can be done, not that it necessarily should be done exactly this way.

Basically it intercepts the user's access to git, i.e. certain commands
and how files are named (since those names represent a user interface). Then
it assumes the internal encoding is UTF-8 (or garbage) converting to and
from the user's local encoding. The heuristics is based on the assumption that
a string (even random onesthat looks like UTF-8, with a very high probablity
actually is UTF-8 encoded.

The test cases might be usable almost as is.

-- robin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 19:11                                       ` Linus Torvalds
  2008-01-18  0:18                                         ` Kevin Ballard
@ 2008-01-18  1:05                                         ` Robin Rosenberg
  2008-01-18  1:24                                           ` Linus Torvalds
  2008-01-18 10:19                                         ` Peter Karlsson
  2 siblings, 1 reply; 260+ messages in thread
From: Robin Rosenberg @ 2008-01-18  1:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pedro Melo, Mark Junker, git@vger.kernel.org

torsdagen den 17 januari 2008 skrev Linus Torvalds:
> And yes, I also realize that it's not going to be realistic. We're 
> probably *closer* to that than we used to be, but I don't think you can 
> even make Windows think FAT is UTF-8.
It's UTF-16 (when needed). I think it's all in the Linux kernel for you
to see.

> I don't know how NTFS works (I know it is Unicode-aware, and I think it 
> encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1 
UTF-16 (was UCS-2 until MS did a s/UCS-2/UTF-16/ on the documentation).

> translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
> actually uses that unambiguous translation for any filenames).

It uses the local 8-bit codepage, which is not UTF-8, often some latin-inspired
thingy, but in Asia multi-byte encodings are used. In western Europe it is
Windows-1252, which is almost, but not exactly iso-8859-1. Oh, and then we
have the cmd prompt which has another encoding in 8-bit mode.

I think there is a cygwin patch that converts to and from UTF-8. An application
can choose to use the "A" or "W" interfaces. The W-API's are the real ones and 
the others' are just wrappers that convert to and from UTF-16 before anything
happens (i.e. CreateFileA is slower than CreateFileW and so on). 

-- robin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  1:05                                         ` Robin Rosenberg
@ 2008-01-18  1:24                                           ` Linus Torvalds
  2008-01-18  4:08                                             ` Brian Dessent
                                                               ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-18  1:24 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Pedro Melo, Mark Junker, git@vger.kernel.org

On Fri, 18 Jan 2008, Robin Rosenberg wrote:

> torsdagen den 17 januari 2008 skrev Linus Torvalds:
> > And yes, I also realize that it's not going to be realistic. We're 
> > probably *closer* to that than we used to be, but I don't think you can 
> > even make Windows think FAT is UTF-8.
>
> It's UTF-16 (when needed). I think it's all in the Linux kernel for you
> to see.

.. well, FAT certainly wasn't. But yes, VFAT probably is.  Not that I want 
to look at it ;)

> > translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
> > actually uses that unambiguous translation for any filenames).
> 
> It uses the local 8-bit codepage, which is not UTF-8, often some latin-inspired
> thingy, but in Asia multi-byte encodings are used. In western Europe it is
> Windows-1252, which is almost, but not exactly iso-8859-1. Oh, and then we
> have the cmd prompt which has another encoding in 8-bit mode.

Well, if it uses a 8-bit codepage, then that means that as far as the 
POSIX filename interface is concerned, it has nothing what-so-ever to do 
with Unicode (ie unicode is just a totally invisible internal encoding 
issue, not externally visible).

I assume you have to use some insane Windows-only UCS-2 filename function 
to actually see any Unicode behaviour.

Sad. Because there really is no reason to use a local 8-bit codepage when 
you could just use UTF-8.

> I think there is a cygwin patch that converts to and from UTF-8. An application
> can choose to use the "A" or "W" interfaces. The W-API's are the real ones and 
> the others' are just wrappers that convert to and from UTF-16 before anything
> happens (i.e. CreateFileA is slower than CreateFileW and so on). 

So the CreateFileW() is the "native UTF-16 interface", and CreateFileA() 
is the 8-bit codepage one that has nothing to do with Unicode and is 
purely some local thing.

But for a UNIX interface layer, the most logical thing would probably be 
to map "open()" and friends not to CreateFileA(), but to 
CreateFileW(utf8_to_utf16(filename)). 

Once you do that, then it sounds like Windows would basically be Unicode, 
and hopefully without any crazy normalization (but presumably all the 
crazy case-insensitivity cannot be fixed ;^).

So it probably really only depends on whether you choose to use the insane 
8-bit code page translation or whether you just use a sane and trivial 
UTF8<->UTF16 conversion.

Anybody know which one cygwin/mingw does?

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 22:09                                   ` Johannes Schindelin
@ 2008-01-18  1:27                                     ` Robin Rosenberg
  0 siblings, 0 replies; 260+ messages in thread
From: Robin Rosenberg @ 2008-01-18  1:27 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: JM Ibanez, Linus Torvalds, Wincent Colaiuta, Kevin Ballard,
	Jakub Narebski, Mark Junker, git@vger.kernel.org

torsdagen den 17 januari 2008 skrev Johannes Schindelin:
> Hi,
> 
> On Fri, 18 Jan 2008, JM Ibanez wrote:
> 
> > If you check the standard, two Unicode codepoints (i.e. the numeric 
> > value that gets stored on disk) *can* map to the same character, hence 
> > they are the same.
> 
> As Linus _already_ pointed out, you are confusing characters with glyphs.
> 
Someone is. 

He is refering to the unicode definition of an (abstract) character.

Ch3.4 D11 - "A single abstract character may also be represented by a sequence
of code points—for example, latin capital letter g with acute may be represented
by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>, 
rather than being mapped to a single code point.


-- robin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  1:24                                           ` Linus Torvalds
@ 2008-01-18  4:08                                             ` Brian Dessent
  2008-01-18  8:49                                             ` Dmitry Potapov
  2008-01-18  9:42                                             ` Robin Rosenberg
  2 siblings, 0 replies; 260+ messages in thread
From: Brian Dessent @ 2008-01-18  4:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robin Rosenberg, Pedro Melo, Mark Junker, git@vger.kernel.org

Linus Torvalds wrote:

> But for a UNIX interface layer, the most logical thing would probably be
> to map "open()" and friends not to CreateFileA(), but to
> CreateFileW(utf8_to_utf16(filename)).
> 
> Once you do that, then it sounds like Windows would basically be Unicode,
> and hopefully without any crazy normalization (but presumably all the
> crazy case-insensitivity cannot be fixed ;^).
> 
> So it probably really only depends on whether you choose to use the insane
> 8-bit code page translation or whether you just use a sane and trivial
> UTF8<->UTF16 conversion.
> 
> Anybody know which one cygwin/mingw does?

Cygwin does not yet support doing the smart thing.  At the moment you
can only open() files in the current 8 bit codepage.  There is a patch
floating around to allow using UTF-8, but it was rejected for inclusion
because it was considered too hackish.  Instead work has been ongoing
for some time to replumb the internal representation of Windows
filenames to use UTF-16 instead of plain chars, so that conversion
overhead can be held at a minimum.  In conjuction with dropping Win9x/ME
support this also means the Native APIs like NtCreateFile() can be used
directly, as they are more low level than the Win32 -A and -W functions
and expose more flexibility, such as the ability to implement the
openat() family of functions natively (no pun intended) without
emulation.  These two items (unicode and dropping non-NT windows) are
the big features for 1.7.

Of course since a lot of what Cygwin does is translate paths in
sometimes unobvious and complicated ways, there's a lot of path handling
code to adapt, so it's taking a while.

Incidently, the ridiculously short MAX_PATH of 260 on Windows comes from
the Win32 -A version of the functions.  The -W API and the Native API
can cope with paths of up to 32k wide chars, so a side benefit of this
should be the ability to finally stop running into length limits.  Of
course there's always a catch: when using long filenames with the Win32
-W API or the Native API you can only use absolute paths, so either you
have to live with the 260 limitation for relative paths or you keep
track of the current directory and always do a rel->abs conversion.  Or
better, if you stick to the Native API you can do a directory handle
relative openat-type thing which I suppose starts to sound relatively
sane.  However, there's another catch here: For some time Cygwin has
maintained a separate and private value of CWD behind Windows' back, and
only synced the two when spawning a non-Cygwin binary.  This allows
Windows to happly think the process' CWD is always C:\ or whatever, and
not hold an open handle to the actual CWD.  In turn Cygwin uses this to
allow POSIX filesystem behavior of being able to unlink the current dir,
which some programs or build systems assume they can do but is not
possible in straight Win32.  This is a roundabout way of saying that
going back to actually having to keep a handle to CWD open again in
order to do relative paths might be complicated.

Brian

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17  0:16                       ` Linus Torvalds
  2008-01-17  0:27                         ` Pedro Melo
@ 2008-01-18  8:29                         ` Peter Karlsson
  2008-01-18 11:16                           ` Jakub Narebski
  1 sibling, 1 reply; 260+ messages in thread
From: Peter Karlsson @ 2008-01-18  8:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pedro Melo, Kevin Ballard, Jakub Narebski, Johannes Schindelin,
	Mark Junker, git

Linus Torvalds:

> The difference I see between us is that when I tell you that this is
> exactly the same thing as your file *contents*,

This is the same issue as the CRLF issue I posted on earlier, and it
all stems from that git also sees file names as a stream of bytes, not
a string of characters, just as it does text.

> An OS that silently changes the contents of your files is *crap*.
> Get it?

A program that silently ignores the conventions of the platform it runs
on is *crap*, no matter if the conventions are not the same as for
other platforms.

> An OS that silently changes the contents of your directories is *crap*.
> Get it now?

A program that silently ignores the conventions of the file system it
tries to store its files on is *crap* :-)

In my perfect world, file names would be stored as a string of characters,
so if I save a file with an å in it, that å would be preserved no
matter if I run Linux on ext2 with my locale is set to latin-1 (which
stores it as byte 0xE5), on Windows with NTFS (which stores it as the
UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the
byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose
byte sequence I don't know at the top of my head). If that was just
stored as U+00E5 in whatever encoding in the filename index, the local
implementation of git can just check it out in the form needed.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  1:24                                           ` Linus Torvalds
  2008-01-18  4:08                                             ` Brian Dessent
@ 2008-01-18  8:49                                             ` Dmitry Potapov
  2008-01-18  9:42                                             ` Robin Rosenberg
  2 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-18  8:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Robin Rosenberg, Pedro Melo, Mark Junker, git@vger.kernel.org

On Thu, Jan 17, 2008 at 05:24:01PM -0800, Linus Torvalds wrote:
> 
> On Fri, 18 Jan 2008, Robin Rosenberg wrote:
> 
> > It uses the local 8-bit codepage, which is not UTF-8, often some latin-inspired
> > thingy, but in Asia multi-byte encodings are used. In western Europe it is
> > Windows-1252, which is almost, but not exactly iso-8859-1. Oh, and then we
> > have the cmd prompt which has another encoding in 8-bit mode.

Yes, the default code page for the command prompt uses so-called OEM
encoding, and GUI programs uses another one, which MS calls as "ANSI"
encoding. However, if you use Cygwin, then you have ANSI encoding in
the command prompt. So, in the same command prompt window, you can have
Cygwin programs using one encoding and other window console programs
using a different encoding.

> 
> Well, if it uses a 8-bit codepage, then that means that as far as the 
> POSIX filename interface is concerned, it has nothing what-so-ever to do 
> with Unicode (ie unicode is just a totally invisible internal encoding 
> issue, not externally visible).

Some people tried to set the current code page to 65001, which is
the Microsoft code page for UTF-8. However, it seems that does not
work very well.

http://support.microsoft.com/kb/175392
http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

It seems to me that Win32 API functions work correctly with
UTF-8 (after all, they are just wrappers over UTF-16 functions),
but Microsoft's C library cannot handle UTF-8 (or any other
encoding that requires more than two bytes per character).

> Anybody know which one cygwin/mingw does?

There is a patch for Cygwin that adds UTF-8 support for it, however,
Cygwin maintainers do not like it, so it is not integrated. I think
Cygwin 1.7 will support UTF-8, but I have no idea how soon it will be
released.

I don't know much about mingw, but if I am not mistaken, mingw relies
on Microsoft's C library, so I suppose it uses an "OEM" code page for
console programs by default.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  1:24                                           ` Linus Torvalds
  2008-01-18  4:08                                             ` Brian Dessent
  2008-01-18  8:49                                             ` Dmitry Potapov
@ 2008-01-18  9:42                                             ` Robin Rosenberg
  2008-01-18 10:30                                               ` Dmitry Potapov
  2 siblings, 1 reply; 260+ messages in thread
From: Robin Rosenberg @ 2008-01-18  9:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pedro Melo, Mark Junker, git@vger.kernel.org

fredagen den 18 januari 2008 skrev Linus Torvalds:
> > > translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
> > > actually uses that unambiguous translation for any filenames).
> > 
> > It uses the local 8-bit codepage, which is not UTF-8, often some latin-inspired
> > thingy, but in Asia multi-byte encodings are used. In western Europe it is
> > Windows-1252, which is almost, but not exactly iso-8859-1. Oh, and then we
> > have the cmd prompt which has another encoding in 8-bit mode.
> 
> Well, if it uses a 8-bit codepage, then that means that as far as the 
> POSIX filename interface is concerned, it has nothing what-so-ever to do 
> with Unicode (ie unicode is just a totally invisible internal encoding 
> issue, not externally visible).

I just had to investigate this a bit, so on a Vista machine I started a cmd
prompt and typed mode con: cp select=65001, selected the lucida font and then
echo å >x.txt and opened it in notepad and it was UTF-8 encoded. So there might
be some hope after all. I don't know how to change the encoding for non-console
apps. I leave that as an excercise for the list.

-- robin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 19:11                                       ` Linus Torvalds
  2008-01-18  0:18                                         ` Kevin Ballard
  2008-01-18  1:05                                         ` Robin Rosenberg
@ 2008-01-18 10:19                                         ` Peter Karlsson
  2008-01-18 10:50                                           ` Dmitry Potapov
  2008-01-18 17:11                                           ` Linus Torvalds
  2 siblings, 2 replies; 260+ messages in thread
From: Peter Karlsson @ 2008-01-18 10:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Junker, Pedro Melo, git@vger.kernel.org

Linus Torvalds:

> But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that: 
> different strings (not even characters: the second is actually a 
> multi-character) that just look the same.

But they are not different strings, they are canonically equivalent as
far as Unicode is concerned. They're even supposed to map to the same
glyph (if the font has an "ä", it should display it in both cases, if
it has an "a" and a combining diaeresis, it should make up one).

You cannot do a binary comparison of text to see if two strings are
equivalent.

> You try to twist the argument by just claiming that they are the same
> "character". They aren't, unless you *define* character to be the
> same as "glyph".

Whereas you are confusing characters and code points.

"ä" and "a¨" use different code points, but they encode the same
character, and from the user's perspective it is the *character* that
is interesting (although he might confuse it with the glyph).

> I don't know how NTFS works (I know it is Unicode-aware, and I think
> it encodes filenames in UCS-2 or possibly UTF-16,

Actually, NTFS is a bit broken. It sees file names as a string of
16-bit words. It doesn't check that it is valid UTF-16, or even valid
UCS-2, it allows almost anything.

Apple made Mac OS X handle filenames properly, by seeing that file
names are a string of characters, not code points, so they use a
canonical form for all characters (personally, I would have preferred
the pre-composed form, though).

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  9:42                                             ` Robin Rosenberg
@ 2008-01-18 10:30                                               ` Dmitry Potapov
  2008-01-18 15:37                                                 ` Peter Karlsson
  0 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-18 10:30 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Linus Torvalds, Pedro Melo, Mark Junker, git@vger.kernel.org

On Fri, Jan 18, 2008 at 10:42:36AM +0100, Robin Rosenberg wrote:
> 
> I just had to investigate this a bit, so on a Vista machine I started a cmd
> prompt and typed mode con: cp select=65001, selected the lucida font and then
> echo å >x.txt and opened it in notepad and it was UTF-8 encoded. 

Yes, but have you tried to run any batch file? At least, on WinXP
all batch files silently stopped working after choosing 65001, and
I don't know what else gets broken, because Microsoft C library
does not work with encoding that requires more than two bytes per
character.

> So there might
> be some hope after all. I don't know how to change the encoding for non-console
> apps. I leave that as an excercise for the list.

It is not difficult to change the current encoding in any Windows
application, the real issue is that neither Microsoft C library nor
Cygwin library does not work correctly with UTF-8. There is a patch
for Cygwin though...

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 10:19                                         ` Peter Karlsson
@ 2008-01-18 10:50                                           ` Dmitry Potapov
  2008-01-18 15:30                                             ` Peter Karlsson
  2008-01-18 17:11                                           ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-18 10:50 UTC (permalink / raw)
  To: Peter Karlsson
  Cc: Linus Torvalds, Mark Junker, Pedro Melo, git@vger.kernel.org

On Fri, Jan 18, 2008 at 11:19:21AM +0100, Peter Karlsson wrote:
> Linus Torvalds:
> 
> > But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that: 
> > different strings (not even characters: the second is actually a 
> > multi-character) that just look the same.
> 
> But they are not different strings, they are canonically equivalent as
> far as Unicode is concerned.

There are canonically equivalent, but they are different sequences
of characters as Unicode is concerned. In one case, we have one
character in the other case, we have two characters that canonically
equivalent to the first one.

> They're even supposed to map to the same
> glyph (if the font has an "ä", it should display it in both cases, if
> it has an "a" and a combining diaeresis, it should make up one).

By defition, sequences of characters that are canonically equivalent
are both visual and functional equivalent...

> You cannot do a binary comparison of text to see if two strings are
> equivalent.

Of course, you can't. Who argues otherwise?

> > You try to twist the argument by just claiming that they are the same
> > "character". They aren't, unless you *define* character to be the
> > same as "glyph".
> 
> Whereas you are confusing characters and code points.

I am afraid it is you who confuses "characters" with "abstract
characters", there is no place in the standard saying that
"characters" are "abstract characters" only. On contrary, the
term "characters" is used to refer non abstract characters.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18  8:29                         ` Peter Karlsson
@ 2008-01-18 11:16                           ` Jakub Narebski
  0 siblings, 0 replies; 260+ messages in thread
From: Jakub Narebski @ 2008-01-18 11:16 UTC (permalink / raw)
  To: Peter Karlsson
  Cc: Linus Torvalds, Pedro Melo, Kevin Ballard, Johannes Schindelin,
	Mark Junker, git

Peter Karlsson wrote:
> Linus Torvalds wrote:
> 
> > The difference I see between us is that when I tell you that this is
> > exactly the same thing as your file *contents*,
> 
> This is the same issue as the CRLF issue I posted on earlier, and it
> all stems from that git also sees file names as a stream of bytes, not
> a string of characters, just as it does text.

You have to be careful about CRLF conversion, lest you corrupt your
binary files. CRLF conversion is off by default.

> > An OS that silently changes the contents of your files is *crap*.
> > Get it?
> 
> A program that silently ignores the conventions of the platform it runs
> on is *crap*, no matter if the conventions are not the same as for
> other platforms.
> 
> > An OS that silently changes the contents of your directories is *crap*.
> > Get it now?
> 
> A program that silently ignores the conventions of the file system it
> tries to store its files on is *crap* :-)

Git philosophy to see the contents of files and "contents" of directories
(filenames) as stream of bytes, i.e. to use 'native' encoding works
perfectly well and _fast_ if all developers work in the same environment.
Troubles start if you are working across operating systems, and across
filesystems.

> In my perfect world, file names would be stored as a string of characters,
> so if I save a file with an å in it, that å would be preserved no
> matter if I run Linux on ext2 with my locale is set to latin-1 (which
> stores it as byte 0xE5), on Windows with NTFS (which stores it as the
> UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the
> byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose
> byte sequence I don't know at the top of my head). If that was just
> stored as U+00E5 in whatever encoding in the filename index, the local
> implementation of git can just check it out in the form needed.

Git has for a long time i18n.commitEncoding, and from some time it
saves it in 'encoding' header in commit object (if different from
'uft-8') and has also i18n.logOutputEncoding.

For dealing with different filesystem encodings you would also have
to have both: encoding used in 'tree' objects (by repository) for
filenames saved somewhere in repository, either in tree object (argh!)
or in some kind of .gitconfig file; encoding used by filesystem in
repository config as i18n.filesystemEncoding or something like that.
And think what to put in the on disk index, and in memory index.

NOTE, NOTE, NOTE! If filename is used somewherein the file contents
(manifest-like file, include-like statement), and this filename uses
characters which are differently encoded in different encoding you
are screwed with this fancy system, badly, anyway.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 10:50                                           ` Dmitry Potapov
@ 2008-01-18 15:30                                             ` Peter Karlsson
  0 siblings, 0 replies; 260+ messages in thread
From: Peter Karlsson @ 2008-01-18 15:30 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Mark Junker, Pedro Melo, git@vger.kernel.org

Dmitry Potapov:

> I am afraid it is you who confuses "characters" with "abstract
> characters", there is no place in the standard saying that
> "characters" are "abstract characters" only. On contrary, the term
> "characters" is used to refer non abstract characters.

Perhaps it's just a case of confusion about naming conventions. I tend
to use "character" as a "grapheme cluster", i.e a "user character" (to
the end user, "ä" and "a"+diaeresis is the same character, no matter if
they would display as different glyphs), whereas some people use
"character" as a "code point", which would be more of a "programmer
character". And then there are some people that still use "character"
interchangibly for "bytes" or "code units" (for UTF-16; a pair of
surrogate code units is still only one "code point").

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 10:30                                               ` Dmitry Potapov
@ 2008-01-18 15:37                                                 ` Peter Karlsson
  2008-01-18 17:24                                                   ` Jakub Narebski
  0 siblings, 1 reply; 260+ messages in thread
From: Peter Karlsson @ 2008-01-18 15:37 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Robin Rosenberg, Linus Torvalds, Pedro Melo, Mark Junker,
	git@vger.kernel.org

Dmitry Potapov:

> because Microsoft C library does not work with encoding that requires
> more than two bytes per character.

Indeed. On Windows, you should avoid using UTF-8 and instead use UTF-16
everywhere. That usually works better, and if you run on an NT-based
system it will convert all the data to WinAPI to UTF-16 anyway.

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 10:19                                         ` Peter Karlsson
  2008-01-18 10:50                                           ` Dmitry Potapov
@ 2008-01-18 17:11                                           ` Linus Torvalds
  2008-01-18 20:24                                             ` Kevin Ballard
                                                               ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-18 17:11 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: Mark Junker, Pedro Melo, git@vger.kernel.org

On Fri, 18 Jan 2008, Peter Karlsson wrote:
> 
> But they are not different strings, they are canonically equivalent as
> far as Unicode is concerned.

Fuck me with a spoon.

Why the hell cannot people see that "equivalent" and "same" are two 
totally different meanings.

> You cannot do a binary comparison of text to see if two strings are
> equivalent.

.. and this is relevant how? They are different strings. Not the same.

Equivalence doesn't matter. Equivalence is *evil*. Equivalence is what 
gives us case-insensitive filesystems ("because the names are 
equivalent").

Filesystems don't *want* equivalence. They want a much stronger exactness 
guarantee. Exactly because sometimes the differences matter.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 15:37                                                 ` Peter Karlsson
@ 2008-01-18 17:24                                                   ` Jakub Narebski
  0 siblings, 0 replies; 260+ messages in thread
From: Jakub Narebski @ 2008-01-18 17:24 UTC (permalink / raw)
  To: git

Peter Karlsson wrote:

> Dmitry Potapov:
> 
>> because Microsoft C library does not work with encoding that requires
>> more than two bytes per character.
> 
> Indeed. On Windows, you should avoid using UTF-8 and instead use UTF-16
> everywhere. That usually works better, and if you run on an NT-based
> system it will convert all the data to WinAPI to UTF-16 anyway.

Errr... doesn't UTF-16 (as compared to USC-2) sometimes (for some exotic
characters) require more than two bytes per character?

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 17:11                                           ` Linus Torvalds
@ 2008-01-18 20:24                                             ` Kevin Ballard
  2008-01-19  8:48                                               ` Dmitry Potapov
  2008-01-18 20:28                                             ` Junio C Hamano
  2008-01-21 14:14                                             ` Peter Karlsson
  2 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-18 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1431 bytes --]

As far as I can tell, the only time you ever run into the problems  
you've described on a filesystem which treats filenames as unicode  
strings (and therefore is free to normalize), are when you're trying  
to interact with a filesystem that treats filenames as sequences of  
bytes.

This doesn't mean treating filenames as unicode strings is wrong, it  
just means that the world would be much better if every filesystem had  
the same behaviour here. It's kinda like the endian issue, except  
there's no simple solution here.

-Kevin Ballard

On Jan 18, 2008, at 12:11 PM, Linus Torvalds wrote:

> On Fri, 18 Jan 2008, Peter Karlsson wrote:
>>
>> But they are not different strings, they are canonically equivalent  
>> as
>> far as Unicode is concerned.
>
> Fuck me with a spoon.
>
> Why the hell cannot people see that "equivalent" and "same" are two
> totally different meanings.
>
>> You cannot do a binary comparison of text to see if two strings are
>> equivalent.
>
> .. and this is relevant how? They are different strings. Not the same.
>
> Equivalence doesn't matter. Equivalence is *evil*. Equivalence is what
> gives us case-insensitive filesystems ("because the names are
> equivalent").
>
> Filesystems don't *want* equivalence. They want a much stronger  
> exactness
> guarantee. Exactly because sometimes the differences matter.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 17:11                                           ` Linus Torvalds
  2008-01-18 20:24                                             ` Kevin Ballard
@ 2008-01-18 20:28                                             ` Junio C Hamano
  2008-01-18 20:50                                               ` Johannes Schindelin
  2008-01-23  2:46                                               ` Eric W. Biederman
  2008-01-21 14:14                                             ` Peter Karlsson
  2 siblings, 2 replies; 260+ messages in thread
From: Junio C Hamano @ 2008-01-18 20:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, 18 Jan 2008, Peter Karlsson wrote:
>> 
>> But they are not different strings, they are canonically equivalent as
>> far as Unicode is concerned.
>
> Fuck me with a spoon.
>
> Why the hell cannot people see that "equivalent" and "same" are two 
> totally different meanings.

Could people _please_ stop this already?

I think the sane people see the difference between equivalence
and sameness, and we established that a filesystem that mangles
the filenames behind user's back is a bad design.  Anybody who
followed the thread and still does not agree with you is, eh,
"ugly-and-stupid", as you might say ;-).  You cannot educate
them all.

The thing is, even if you mange to educate them all, that broken
filesystem, and other filesystems with similar brokenness, do
not go away.  If your ultimate objective is to declare that it
is the right thing for git not to support such broken
filesystems, and to make everybody agree to it, that is fine.
Please keep pouring fuel to the fire.  But if that is not the
case, we would need to devise a way to help lives easier for the
unfortunate people who are stuck on such filesystems.  They may
not even realize that they are unfortunate now, and I agree that
some education is justified, but this thread has raged on long
enough to salvage any salvageable lost souls (the remaining ones
may be beyond salvation but let's not waste time on them).

I'd rather see our mental bandwidth spent on coming up with a
workable workaround for such broken filesystems, while not
hurting use of git on sane platforms.

I fear it might have to end up to be very messy and slow,
though.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 20:28                                             ` Junio C Hamano
@ 2008-01-18 20:50                                               ` Johannes Schindelin
  2008-01-23  2:46                                               ` Eric W. Biederman
  1 sibling, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-18 20:50 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

Hi,

On Fri, 18 Jan 2008, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Fri, 18 Jan 2008, Peter Karlsson wrote:
> >> 
> >> But they are not different strings, they are canonically equivalent as
> >> far as Unicode is concerned.
> >
> > Fuck me with a spoon.
> >
> > Why the hell cannot people see that "equivalent" and "same" are two 
> > totally different meanings.
> 
> Could people _please_ stop this already?

Welcome, voice of reason.

> I think the sane people see the difference between equivalence
> and sameness, and we established that a filesystem that mangles
> the filenames behind user's back is a bad design.  Anybody who
> followed the thread and still does not agree with you is, eh,
> "ugly-and-stupid", as you might say ;-).  You cannot educate
> them all.

Actually, I see some value in calling them names, see 
http://video.google.nl/videoplay?docid=-4216011961522818645 for why.

> The thing is, even if you mange to educate them all, that broken 
> filesystem, and other filesystems with similar brokenness, do not go 
> away.

I was almost starting with hacking on this, but then the discussion 
annoyed me too much, and I asked myself for who I think I'd do this.

IMHO those people should ask "how could I begin to work on this".

Instead, they started a useless flamewar.

Now, back to the issue: Robin posted a link to his UTF-8 work.  While it 
is way too intrusive, and not limited to filenames at all, I think it has 
a few good pointers.

Ciao,
Dscho "who needs to calm down now"

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 20:24                                             ` Kevin Ballard
@ 2008-01-19  8:48                                               ` Dmitry Potapov
  2008-01-19 14:55                                                 ` Kevin Ballard
  2008-01-19 18:58                                                 ` Linus Torvalds
  0 siblings, 2 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-19  8:48 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

Hi,

[please do not top post. Just delete everything you do not reply to]

On Fri, Jan 18, 2008 at 03:24:34PM -0500, Kevin Ballard wrote:
> As far as I can tell, the only time you ever run into the problems  
> you've described on a filesystem which treats filenames as unicode  
> strings (and therefore is free to normalize), are when you're trying  
> to interact with a filesystem that treats filenames as sequences of  
> bytes.

If you read the first message in this thread, you would probably know
that the problem exists on Mac even without any other filesystem being
involved. My understanding of it is that is caused by HFS+ converting
one sequence of Unicode characters (generated by Mac keyboard driver)
to another sequence using "fast decomposed" conversion.

[And please stop calling by normalization what is not. Mac does NOT
normalize Unicode strings, it uses some sub-standard conversion,
which neither produce a normalized string nor is guaranteed to be
stable across versions of Unicode.]

> This doesn't mean treating filenames as unicode strings is wrong, it  
> just means that the world would be much better if every filesystem had  
> the same behaviour here. It's kinda like the endian issue, except  
> there's no simple solution here.

Actually, there is, if you care to do something. You can write a wrapper
around readdir(3) that will recodes filenames in Unicode Normal Forms C.
This does not require much knowledge of Git -- what it requires the
desire to do something to solve the problem. Of course, this step alone
is not a complete solution (it does not solve case-insensitive issue),
but the first step in the right direction...

BTW, Git is far from being only software that ran into this problem with
Mac. But not being first, we can benefit from other people experiences:
http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19  8:48                                               ` Dmitry Potapov
@ 2008-01-19 14:55                                                 ` Kevin Ballard
  2008-01-19 21:17                                                   ` Dmitry Potapov
  2008-01-19 18:58                                                 ` Linus Torvalds
  1 sibling, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-19 14:55 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3064 bytes --]

On Jan 19, 2008, at 3:48 AM, Dmitry Potapov wrote:

> On Fri, Jan 18, 2008 at 03:24:34PM -0500, Kevin Ballard wrote:
>> As far as I can tell, the only time you ever run into the problems
>> you've described on a filesystem which treats filenames as unicode
>> strings (and therefore is free to normalize), are when you're trying
>> to interact with a filesystem that treats filenames as sequences of
>> bytes.
>
> If you read the first message in this thread, you would probably know
> that the problem exists on Mac even without any other filesystem being
> involved. My understanding of it is that is caused by HFS+ converting
> one sequence of Unicode characters (generated by Mac keyboard driver)
> to another sequence using "fast decomposed" conversion.

In this case, git's index counts as a filesystem that treats filenames  
as sequences of bytes. But yes, it is possible, though somewhat  
difficult, to produce this problem on just HFS+. It's far more common  
when the file was originally added on a different filesystem

> [And please stop calling by normalization what is not. Mac does NOT
> normalize Unicode strings, it uses some sub-standard conversion,
> which neither produce a normalized string nor is guaranteed to be
> stable across versions of Unicode.]

 From what the HFS+ technote says, it produces a variant of Normal  
Form D. This variant, while not guaranteed to be stable across  
versions of HFS+, but in practice it is stable.

What would you prefer I call it?

>> This doesn't mean treating filenames as unicode strings is wrong, it
>> just means that the world would be much better if every filesystem  
>> had
>> the same behaviour here. It's kinda like the endian issue, except
>> there's no simple solution here.
>
> Actually, there is, if you care to do something. You can write a  
> wrapper
> around readdir(3) that will recodes filenames in Unicode Normal  
> Forms C.
> This does not require much knowledge of Git -- what it requires the
> desire to do something to solve the problem. Of course, this step  
> alone
> is not a complete solution (it does not solve case-insensitive issue),
> but the first step in the right direction...

I'm not sure how that would solve anything. Sure, it would provide a  
stable, known encoding for git to compare filenames against, but that  
would only work if the filename is known to be Unicode, and as it has  
been pointed out on other filesystems the filename can be whatever  
encoding the user chooses (which, IMHO, is a flaw).

> BTW, Git is far from being only software that ran into this problem  
> with
> Mac. But not being first, we can benefit from other people  
> experiences:
> http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html

It looks like their problem was binary compatibility with strings from  
other clients that were using Normal Form C instead of Normal Form D.  
git's problem is that it's only even using a known encoding on HFS+.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19  8:48                                               ` Dmitry Potapov
  2008-01-19 14:55                                                 ` Kevin Ballard
@ 2008-01-19 18:58                                                 ` Linus Torvalds
  2008-01-19 20:39                                                   ` Mark Junker
                                                                     ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-19 18:58 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Sat, 19 Jan 2008, Dmitry Potapov wrote:
> 
> Actually, there is, if you care to do something. You can write a wrapper
> around readdir(3) that will recodes filenames in Unicode Normal Forms C.

If somebody wants to do this, then readdir() isn't the only place, but 
yes, readdir() is one of the places.

I suspect that if we were to just do the "turn into NFC on readdir() on OS 
X", that might actually be good enough to hide most of the problems. The 
issue isn't just that OS X mangles the filenames, it's that it picks a 
particularly *stupid* way to mangle them (the decomposed forms), which 
means that OS X will actually not just corrupt "odd cases" of Unicode, but 
will corrupt the obvious and *common* Latin1 translations of Unicode.

I don't know if NFC is better for other locales, but I doubt it. Usually 
people want to do the *composite* forms, not the *de*composed forms.

A trivial example of this for some cross-OS issue:

 - let's say that you have a file "Märchen" on just about *any* other OS 
   than OS X. It could be Latin1 or it could be Unicode, but even if it is 
   Unicode, I can almost guarantee that the 'ä' is going to be the 
   *single* Unicode character U+00e4 (utf-8: "\xc3\xa4", latin1: "\xe4")

   So from a cross-OS standpoint, that's the *common* representation, and 
   yes, you can create the file that way (I don't know what happens if you 
   actually create it with the Latin1 encoding, but I would not be 
   surprised if OS X notices that it's not a valid UTF sequence and 
   assumes it's Latin1 and converts it to Unicode)

 - But on OS X, because of Apples *insane* choice of normal form, it will 
   then be turned into "a¨". I doubt *anybody* else does that. If you have 
   to normalize it, NFD is just about the *worst* choice.

So yeah, even just re-coding it as NFC on readdir() would at least mean 
that any OS X git client would be MORE LIKELY to pick the same 
representation as git clients on other OS's.

It wouldn't solve all problems (and it would almost certainly create a few 
new ones), but it would likely at least increase compatibility between 
systems.

So doing the NFC conversion on readdir() on OS X is probably a good idea, 
and probably is the simplest way to make it interact better with other 
OS's. And it's definitely safe on OS X, since OS X _already_ corrupted the 
name, so we're not losing any information (in contrast, on other systems, 
doing a NFC conversion would possibly lose encoding detail _and_ might be 
incorrect simply because they might not use Unicode in the first place).

Anybody want to creat a compat layer around "readdir()" that does that NFC 
conversion on OS X but not elsewhere?

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-17 15:04                                       ` Kevin Ballard
@ 2008-01-19 19:29                                         ` Kyle Moffett
  2008-01-19 19:57                                           ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Kyle Moffett @ 2008-01-19 19:29 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Andrew Heybey, Geert Bosch,
	Kevin Ballard <kevin@sb.org>Martin Langhoff, Linus Torvalds,
	Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

On Jan 17, 2008, at 10:04, Kevin Ballard wrote:
> The main problem with this approach is you know for certain that  
> using HFSX as the boot partition is barely tested by Apple, and  
> certainly untested by third-party apps. This means the potential for  
> breakage is extremely high.

No, actually, HFSX boot partitions are fairly well tested by Apple and  
most 3rd-party programs.  I had one for a while and the only problems  
I encountered were with programs ported from Windows without Mac  
versions, such as "Microsoft Office for Mac" and "World of Warcraft".   
"Quake 4" has a few quirks which are easily worked around.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 19:29                                         ` Kyle Moffett
@ 2008-01-19 19:57                                           ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-19 19:57 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Andrew Heybey, Geert Bosch,
	Kevin Ballard <kevin@sb.org>Martin Langhoff, Linus Torvalds,
	Jakub Narebski, Johannes Schindelin, Mark Junker,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1574 bytes --]

On Jan 19, 2008, at 2:29 PM, Kyle Moffett wrote:

> On Jan 17, 2008, at 10:04, Kevin Ballard wrote:
>> The main problem with this approach is you know for certain that  
>> using HFSX as the boot partition is barely tested by Apple, and  
>> certainly untested by third-party apps. This means the potential  
>> for breakage is extremely high.
>
> No, actually, HFSX boot partitions are fairly well tested by Apple  
> and most 3rd-party programs.  I had one for a while and the only  
> problems I encountered were with programs ported from Windows  
> without Mac versions, such as "Microsoft Office for Mac" and "World  
> of Warcraft".  "Quake 4" has a few quirks which are easily worked  
> around.

Perhaps the big name companies might do some testing on HFSX, but I  
can guarantee most third-party programs will not be tested under HFSX.

Also, World of Warcraft isn't a ported program. It was developed for  
the Mac concurrently with the Windows version. Same with MS Office -  
it's an entirely different team (the Mac BU) developing MS Office for  
Mac independently of the Windows version, not a porting job. However,  
if you're saying these two big-name programs had problems, I wouldn't  
be surprised to see many more problems on various other third-party  
apps from smaller companies.

In any case, "just use HFSX" is still not an appropriate solution to  
the problem, especially since that will only take care of case  
sensitivity and not the utf-8 stuff.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 18:58                                                 ` Linus Torvalds
@ 2008-01-19 20:39                                                   ` Mark Junker
  2008-01-19 22:58                                                   ` Johannes Schindelin
  2008-01-20  0:11                                                   ` Wincent Colaiuta
  2 siblings, 0 replies; 260+ messages in thread
From: Mark Junker @ 2008-01-19 20:39 UTC (permalink / raw)
  To: git

Linus Torvalds schrieb:

>  - let's say that you have a file "Märchen" on just about *any* other OS 
>    than OS X. It could be Latin1 or it could be Unicode, but even if it is 
>    Unicode, I can almost guarantee that the 'ä' is going to be the 
>    *single* Unicode character U+00e4 (utf-8: "\xc3\xa4", latin1: "\xe4")
> 
>    So from a cross-OS standpoint, that's the *common* representation, and 
>    yes, you can create the file that way (I don't know what happens if you 
>    actually create it with the Latin1 encoding, but I would not be 
>    surprised if OS X notices that it's not a valid UTF sequence and 
>    assumes it's Latin1 and converts it to Unicode)

FWIW: I just made a test and it seems that MacOS X refuses the creation 
of a file with this invalid name.

> Anybody want to creat a compat layer around "readdir()" that does that NFC 
> conversion on OS X but not elsewhere?

Maybe I'll try it.

Regards,
Mark

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 14:55                                                 ` Kevin Ballard
@ 2008-01-19 21:17                                                   ` Dmitry Potapov
  0 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-19 21:17 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Sat, Jan 19, 2008 at 09:55:45AM -0500, Kevin Ballard wrote:
> 
> >[And please stop calling by normalization what is not. Mac does NOT
> >normalize Unicode strings, it uses some sub-standard conversion,
> >which neither produce a normalized string nor is guaranteed to be
> >stable across versions of Unicode.]
> 
> From what the HFS+ technote says, it produces a variant of Normal  
> Form D. 

There is no such thing in the standard as a variant of NFD. Moreover,
even if this conversion were described in the standard, it would never
called as normalization, because normalization means conversion that
makes all equivalent strings having identical binary representations.
HFS+ conversion does not met this criterion, so it not normalization.

> This variant, while not guaranteed to be stable across  
> versions of HFS+, but in practice it is stable.
> 
> What would you prefer I call it?

Apple calls it as decomposition, which is correct even if it is not full
decomposition as stated in the technote.

> 
> >>This doesn't mean treating filenames as unicode strings is wrong, it
> >>just means that the world would be much better if every filesystem  
> >>had
> >>the same behaviour here. It's kinda like the endian issue, except
> >>there's no simple solution here.
> >
> >Actually, there is, if you care to do something. You can write a  
> >wrapper
> >around readdir(3) that will recodes filenames in Unicode Normal  
> >Forms C.
> >This does not require much knowledge of Git -- what it requires the
> >desire to do something to solve the problem. Of course, this step  
> >alone
> >is not a complete solution (it does not solve case-insensitive issue),
> >but the first step in the right direction...
> 
> I'm not sure how that would solve anything. Sure, it would provide a  
> stable, known encoding for git to compare filenames against, but that  
> would only work if the filename is known to be Unicode, and as it has  
> been pointed out on other filesystems the filename can be whatever  
> encoding the user chooses (which, IMHO, is a flaw).

I believe that Git internally should use only UTF-8 for encoding file
names, commit messages, etc. The problem with some other filesystems
should be addressed separately (by those who work on those systems or
at least have access to them). Regardless interoperability with other
systems, this change alone should solve the issue that was described
in the first message of this thread.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 18:58                                                 ` Linus Torvalds
  2008-01-19 20:39                                                   ` Mark Junker
@ 2008-01-19 22:58                                                   ` Johannes Schindelin
  2008-01-20  6:14                                                     ` Dmitry Potapov
  2008-01-20  0:11                                                   ` Wincent Colaiuta
  2 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-19 22:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Hi,

On Sat, 19 Jan 2008, Linus Torvalds wrote:

> On Sat, 19 Jan 2008, Dmitry Potapov wrote:
> > 
> > Actually, there is, if you care to do something. You can write a 
> > wrapper around readdir(3) that will recodes filenames in Unicode 
> > Normal Forms C.
> 
> If somebody wants to do this, then readdir() isn't the only place, but 
> yes, readdir() is one of the places.
> 
> I suspect that if we were to just do the "turn into NFC on readdir() on 
> OS X", that might actually be good enough to hide most of the problems.

I think a better approach would be to try to match the name to what we 
have in the index.  Then we could implement case-insensitivity and MacOSX 
workaround at the same time.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 18:58                                                 ` Linus Torvalds
  2008-01-19 20:39                                                   ` Mark Junker
  2008-01-19 22:58                                                   ` Johannes Schindelin
@ 2008-01-20  0:11                                                   ` Wincent Colaiuta
  2008-01-20  1:04                                                     ` Linus Torvalds
  2 siblings, 1 reply; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-20  0:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

El 19/1/2008, a las 19:58, Linus Torvalds escribió:

> I suspect that if we were to just do the "turn into NFC on readdir()  
> on OS
> X", that might actually be good enough to hide most of the problems.  
> The
> issue isn't just that OS X mangles the filenames, it's that it picks a
> particularly *stupid* way to mangle them (the decomposed forms), which
> means that OS X will actually not just corrupt "odd cases" of  
> Unicode, but
> will corrupt the obvious and *common* Latin1 translations of Unicode.


For what it's worth, their choice wasn't entirely "insane" ie. it did  
have an element of rationality: that decomposed forms are a little bit  
simpler to sort.

Of course, this doesn't excuse them for creating a file system that  
interacts so horridly with basically everything else out there.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  0:11                                                   ` Wincent Colaiuta
@ 2008-01-20  1:04                                                     ` Linus Torvalds
  2008-01-20  5:27                                                       ` Mike Hommey
  2008-01-20  9:34                                                       ` Wincent Colaiuta
  0 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-20  1:04 UTC (permalink / raw)
  To: Wincent Colaiuta
  Cc: Dmitry Potapov, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

On Sun, 20 Jan 2008, Wincent Colaiuta wrote:
> 
> For what it's worth, their choice wasn't entirely "insane" ie. it did have an
> element of rationality: that decomposed forms are a little bit simpler to
> sort.

No they are *not*.

In many languages, 'ä' does *not* sort like 'a' at all, and if you think 
it does, you'll sort at least Finnish and Swedish totally wrong (åäö are 
real letters, and they sort at the *end* of the alphabet, they have 
nothing what-so-ever to do with the letters 'a' or 'o').

The fact that in *some* languages the decomposed forms sort as the base 
letter is immaterial. It's only true in some cases.

So no, sort order is not it. To sort right, you need to use the a real 
Unicode sort (and the decomposed form is *not* going to help you one bit, 
quite the reverse).

It may be that a case compare is easier in NFD (ie you basically only do 
the case-compare on the base letter).

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  1:04                                                     ` Linus Torvalds
@ 2008-01-20  5:27                                                       ` Mike Hommey
  2008-01-20  5:45                                                         ` Linus Torvalds
  2008-01-20  9:34                                                       ` Wincent Colaiuta
  1 sibling, 1 reply; 260+ messages in thread
From: Mike Hommey @ 2008-01-20  5:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wincent Colaiuta, Dmitry Potapov, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sat, Jan 19, 2008 at 05:04:09PM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 20 Jan 2008, Wincent Colaiuta wrote:
> > 
> > For what it's worth, their choice wasn't entirely "insane" ie. it did have an
> > element of rationality: that decomposed forms are a little bit simpler to
> > sort.
> 
> No they are *not*.
> 
> In many languages, 'ä' does *not* sort like 'a' at all, and if you think 
> it does, you'll sort at least Finnish and Swedish totally wrong (åäö are 
> real letters, and they sort at the *end* of the alphabet, they have 
> nothing what-so-ever to do with the letters 'a' or 'o').

But there is no way to know whether 'ä' in a document is the Finnish 'ä'
or a 'ä' from, say, German, that sorts after 'a'.

> The fact that in *some* languages the decomposed forms sort as the base 
> letter is immaterial. It's only true in some cases.
> 
> So no, sort order is not it. To sort right, you need to use the a real 
> Unicode sort (and the decomposed form is *not* going to help you one bit, 
> quite the reverse).

Unicode sort is not enough, there is no language indicator in an Unicode
document, which is why Unicode, while solving a bunch of problems, has
its very own, cf. the infamous CJK problem.

But that's all very OT.

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  5:27                                                       ` Mike Hommey
@ 2008-01-20  5:45                                                         ` Linus Torvalds
  2008-01-20  7:00                                                           ` Mike Hommey
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-20  5:45 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Wincent Colaiuta, Dmitry Potapov, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sun, 20 Jan 2008, Mike Hommey wrote:
> 
> But there is no way to know whether 'ä' in a document is the Finnish 'ä'
> or a 'ä' from, say, German, that sorts after 'a'.

... without knowing the locale. Correct.

That's why sorting is locale-dependent, even in Unicode. And why you 
should always sort using the *combined* character, not think that you can 
sort by decompsed sequence.

That said, even then you get the wrong thing. Some things cannot be sorted 
character by character at all, and have semantical sorting at a higher 
level entirely. I think most European family names are traditionally 
sorted by effectively using the prefixes (ie d', von, etc) as a secondary 
sort key (so even though they are in front, they sort as if they were 
at the _end_ of the name).

So unicode doesn't help with sorting, and you shouldn't even try to find 
sort rules in the Unicode spec or tech reports. But in general, 
decomposing the characters just makes things worse, not better. To sort 
well, you tend to need the bigger picture, not the details.

Of course, for something like git, we sort by binary value, because we 
also require the sort to be not just well-defined, but *stable*. A sort 
based on any kind of unicode rule is rather likely to change over time.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-19 22:58                                                   ` Johannes Schindelin
@ 2008-01-20  6:14                                                     ` Dmitry Potapov
  2008-01-20  6:53                                                       ` Linus Torvalds
  2008-01-20 13:15                                                       ` Johannes Schindelin
  0 siblings, 2 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-20  6:14 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

On Sat, Jan 19, 2008 at 10:58:08PM +0000, Johannes Schindelin wrote:
> 
> I think a better approach would be to try to match the name to what we 
> have in the index.  Then we could implement case-insensitivity and MacOSX 
> workaround at the same time.

I thought about that, but the problem is that HFS+ _already_ mangled
names from what the user entered (and what is used by anyone else)
to some sub-standard form, which no one outside of Mac likes or uses.
Thus, bringing filenames back to the NFC form (which is what almost
anyone uses) is the only sane thing do, because no one outside of Mac
really needs to know about this HFS+ specific craziness.

So I really dislike the idea that due to some HFS+ specific conversion,
we may end up having some strangely encoded names in a Git repository.
Sane people enter names only in NFC, so why should they suffer because
of some insane conversation made by filesystem behind everyone's back?
And I am not entertaining the idea of having this Mac OS/X specific
workaround outside of Mac OS/X.

Besides, writing a wrapper around readdir() is not difficult. We
already have git-compat-util.h, which redefines some functions for
some platforms, so I don't see any problem with writing a wrapper
around readdir().

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  6:14                                                     ` Dmitry Potapov
@ 2008-01-20  6:53                                                       ` Linus Torvalds
  2008-01-20 13:15                                                       ` Johannes Schindelin
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-20  6:53 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Johannes Schindelin, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

On Sun, 20 Jan 2008, Dmitry Potapov wrote:
>
> On Sat, Jan 19, 2008 at 10:58:08PM +0000, Johannes Schindelin wrote:
> > 
> > I think a better approach would be to try to match the name to what we 
> > have in the index.  Then we could implement case-insensitivity and MacOSX 
> > workaround at the same time.
> 
> I thought about that, but the problem is that HFS+ _already_ mangled
> names from what the user entered (and what is used by anyone else)
> to some sub-standard form, which no one outside of Mac likes or uses.

Well, more importantly, most of the important cases actually don't have an 
index entry yet.

For example, what about "git add"? That's when it really matters that you 
add things in a sane format, and by definition, you don't have an index 
entry to try to match to. 

So once you aim for NFC in "git add", now the index will generally be in 
NFC anyway (since I agree that that's what you'd normally get on non-OSX 
systems), so there is little point in then matching the index.

But no, it won't fix all problems. I do suspect it would make them less 
obvious in practice, though.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  5:45                                                         ` Linus Torvalds
@ 2008-01-20  7:00                                                           ` Mike Hommey
  2008-01-20  7:26                                                             ` Linus Torvalds
  2008-01-20  8:00                                                             ` Dmitry Potapov
  0 siblings, 2 replies; 260+ messages in thread
From: Mike Hommey @ 2008-01-20  7:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Wincent Colaiuta, Dmitry Potapov, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sat, Jan 19, 2008 at 09:45:40PM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 20 Jan 2008, Mike Hommey wrote:
> > 
> > But there is no way to know whether 'ä' in a document is the Finnish 'ä'
> > or a 'ä' from, say, German, that sorts after 'a'.
> 
> ... without knowing the locale. Correct.
> 
> That's why sorting is locale-dependent, even in Unicode. And why you 
> should always sort using the *combined* character, not think that you can 
> sort by decompsed sequence.

That said, the locale doesn't necessarily express the language in which
the document is written. It's easy enough to read documents that are not
written in your native language on the net. That's already what we are both
doing right now. Fortunately, HTTP and HTML have ways to indicate the
language in which a document is written in, but that leaves out plain
mail, for instance. 

That said, the "decomposed" version of UTF-8 has nice side effects on
OSX, with UTF-8 encoded RockRidge ISO-9660 volumes (with or without
Joliet ; OSX will use RockRidge by default when it's there), for instance.

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  7:00                                                           ` Mike Hommey
@ 2008-01-20  7:26                                                             ` Linus Torvalds
  2008-01-20  8:00                                                             ` Dmitry Potapov
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-20  7:26 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Wincent Colaiuta, Dmitry Potapov, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sun, 20 Jan 2008, Mike Hommey wrote:
> 
> That said, the locale doesn't necessarily express the language in which
> the document is written.

.. and quite commonly, there are multiple languages per document.

The good news is that sorting is almost never relevant or done over 
general documents. You sort almost only well-behaved data, and quite often 
the exact order is less than important: and when it is, you have very 
specific rules (which probably seldom have anything what-so-ever to do 
with general unicode ;).

> It's easy enough to read documents that are not
> written in your native language on the net. That's already what we are both
> doing right now. Fortunately, HTTP and HTML have ways to indicate the
> language in which a document is written in, but that leaves out plain
> mail, for instance. 

Well, Unicode already handles the "reading" part, just not the sorting.

> That said, the "decomposed" version of UTF-8 has nice side effects on
> OSX, with UTF-8 encoded RockRidge ISO-9660 volumes (with or without
> Joliet ; OSX will use RockRidge by default when it's there), for instance.

I think Unicode in general (and UTF-8 in particular) is a great thing. I 
do not argue against Unicode at all.  It's what I use myself.

The thing I argue against is that they force normalization (and then, as a 
secondary complaint, their insane choice of target format).

Linux is generally UTF-8 too, and does all of this much better. No forced 
normalization, and it uses UTF-8 everywhere as the encoding model. Joliet 
and RR works beautifully.

(I don't think RR is NFD, btw. It's the standard microsoft UTF-16 without 
normalization, afaik. I think you can happily generate a Rock Ridge disk 
that has two _different_ filenames that OS X cannot tell apart, but that 
both Linux and Windows can see peoperly)

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  7:00                                                           ` Mike Hommey
  2008-01-20  7:26                                                             ` Linus Torvalds
@ 2008-01-20  8:00                                                             ` Dmitry Potapov
  2008-01-20  8:12                                                               ` Dmitry Potapov
  1 sibling, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-20  8:00 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Linus Torvalds, Wincent Colaiuta, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sun, Jan 20, 2008 at 08:00:18AM +0100, Mike Hommey wrote:
> 
> That said, the "decomposed" version of UTF-8 has nice side effects on
> OSX, with UTF-8 encoded RockRidge ISO-9660 volumes (with or without
> Joliet ; OSX will use RockRidge by default when it's there), for instance.

AFAIK, the RockRidge standard prescribes to use the portable character
set, and it has nothing to do with Unicode. Basically, it is a subset of
ASCII.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html

So, I don't think UTF-8 encoded filenames are valid regardless whether
they are decomposed or not.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  8:00                                                             ` Dmitry Potapov
@ 2008-01-20  8:12                                                               ` Dmitry Potapov
  0 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-20  8:12 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Linus Torvalds, Wincent Colaiuta, Kevin Ballard, Peter Karlsson,
	Mark Junker, Pedro Melo, git@vger.kernel.org

On Sun, Jan 20, 2008 at 11:00:56AM +0300, Dmitry Potapov wrote:
> On Sun, Jan 20, 2008 at 08:00:18AM +0100, Mike Hommey wrote:
> > 
> > That said, the "decomposed" version of UTF-8 has nice side effects on
> > OSX, with UTF-8 encoded RockRidge ISO-9660 volumes (with or without
> > Joliet ; OSX will use RockRidge by default when it's there), for instance.
> 
> AFAIK, the RockRidge standard prescribes to use the portable character
> set, 

Actually, it prescribes to use the portable *filename* character set,
which is even more restrictive than just portable character set.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_276

Anyway, there is no place for UTF-8 in it.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  1:04                                                     ` Linus Torvalds
  2008-01-20  5:27                                                       ` Mike Hommey
@ 2008-01-20  9:34                                                       ` Wincent Colaiuta
  1 sibling, 0 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-20  9:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Potapov, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

El 20/1/2008, a las 2:04, Linus Torvalds escribió:

> On Sun, 20 Jan 2008, Wincent Colaiuta wrote:
>>
>> For what it's worth, their choice wasn't entirely "insane" ie. it  
>> did have an
>> element of rationality: that decomposed forms are a little bit  
>> simpler to
>> sort.
>
> No they are *not*.
>
> In many languages, 'ä' does *not* sort like 'a' at all, and if you  
> think
> it does, you'll sort at least Finnish and Swedish totally wrong (åäö  
> are
> real letters, and they sort at the *end* of the alphabet, they have
> nothing what-so-ever to do with the letters 'a' or 'o').
>
> The fact that in *some* languages the decomposed forms sort as the  
> base
> letter is immaterial. It's only true in some cases.
>
> So no, sort order is not it. To sort right, you need to use the a real
> Unicode sort (and the decomposed form is *not* going to help you one  
> bit,
> quite the reverse).

That's what I get for believing Wikipedia ("This makes sorting far  
simpler"):

http://en.wikipedia.org/wiki/UTF-8#Mac_OS_X

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-20  6:14                                                     ` Dmitry Potapov
  2008-01-20  6:53                                                       ` Linus Torvalds
@ 2008-01-20 13:15                                                       ` Johannes Schindelin
  1 sibling, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-20 13:15 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Hi,

On Sun, 20 Jan 2008, Dmitry Potapov wrote:

> On Sat, Jan 19, 2008 at 10:58:08PM +0000, Johannes Schindelin wrote:
> > 
> > I think a better approach would be to try to match the name to what we 
> > have in the index.  Then we could implement case-insensitivity and 
> > MacOSX workaround at the same time.
> 
> I thought about that, but the problem is that HFS+ _already_ mangled 
> names from what the user entered (and what is used by anyone else) to 
> some sub-standard form, which no one outside of Mac likes or uses.

So?  That's why I said "match", not "compare for identity".

To be a little bit more precise: I think a viable plan would be to

- have a config switch which determines what type of filename mangling we
  allow the host OS to perform (Unicode "normalisation", case mongering),
  and leave _everybody_ alone who left that switch unset,

- "overload" readdir() (by the famous git_X(); #define X git_X trick),

- have the overloaded readdir() _know_ which is the current prefix, and
  load the index if it has not yet been loaded (but probably into a static
  variable to avoid reloading, and to avoid interfering with the global
  "cache" instance).

It _could_ be wise to store the "normalised" forms at one stage (instead 
of the index) to speed up comparison -- the prefix has to be normalised 
for readdir()s purposes, too, then.

This is possible with the HFS+ problem, since we know exactly how HFS+ 
tries to "help", and for case insensitivity too, I think.  But it may be 
restricting ourselves for other filename "equivalences" we might want to 
handle one day.

BTW: I cannot think of anything else than readdir() which should have the 
"problem" of reading back a name that the user did not specify.  What am I 
missing?

> Thus, bringing filenames back to the NFC form (which is what almost 
> anyone uses) is the only sane thing do, because no one outside of Mac 
> really needs to know about this HFS+ specific craziness.

No.  I think that would be a serious mistake.  If you add a file on MacOSX 
(with a _mangled_ filename, think of "git add ."), git should not try to 
be as clever as HFS+ and "remangle" it.

> So I really dislike the idea that due to some HFS+ specific conversion, 
> we may end up having some strangely encoded names in a Git repository.

It _is_ UTF-8, so what's the problem?

As for the HFS+ specfic conversion: like the CRLF issue, I am opposed to 
have a "solution" affecting other people than those on broken system.  So 
I very much _want_ it to be an HFS+ specific conversion.

> Besides, writing a wrapper around readdir() is not difficult. We already 
> have git-compat-util.h, which redefines some functions for some 
> platforms, so I don't see any problem with writing a wrapper around 
> readdir().

Exactly.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 17:11                                           ` Linus Torvalds
  2008-01-18 20:24                                             ` Kevin Ballard
  2008-01-18 20:28                                             ` Junio C Hamano
@ 2008-01-21 14:14                                             ` Peter Karlsson
  2008-01-21 16:43                                               ` Kevin Ballard
  2008-01-21 18:16                                               ` Linus Torvalds
  2 siblings, 2 replies; 260+ messages in thread
From: Peter Karlsson @ 2008-01-21 14:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mark Junker, Pedro Melo, git@vger.kernel.org

Linus Torvalds:

> Fuck me with a spoon.

I'd prefer not to.

> .. and this is relevant how? They are different strings. Not the same.

It is relevant because the Mac OS file system stores file names as a
sequence of Unicode code points, in a (apparently slightly modified)
normalized form, whereas Git prefers to see file systems that store
file names as a sequence of octets, which may, or may not, actually map
to something that the user would call characters.

I happen to prefer the text-as-string-of-characters (or code points,
since you use the other meaning of characters in your posts), since I
come from the text world, having worked a lot on Unicode text
processing.

You apparently prefer the text-as-sequence-of-octets, which I tend to
dislike because I would have thought computer engineers would have
evolved beyond this when we left the 1900s.

But the real issue is that Git cannot use it's filenames as string of
octets on Mac OS X, since the file system doesn't handle it. So Git
needs to do something sensible. That's part of porting. Preferrably
that would involve supporting real Unicode file names, which would also
work on Windows (through it's UTF-16 file APIs), and in part on other
systems (through conversion to the systems' locale encoding).

-- 
\\// Peter - http://www.softwolves.pp.se/

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 14:14                                             ` Peter Karlsson
@ 2008-01-21 16:43                                               ` Kevin Ballard
  2008-01-21 16:48                                                 ` David Kastrup
                                                                   ` (4 more replies)
  2008-01-21 18:16                                               ` Linus Torvalds
  1 sibling, 5 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 16:43 UTC (permalink / raw)
  To: Peter Karlsson
  Cc: Linus Torvalds, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:

> I happen to prefer the text-as-string-of-characters (or code points,
> since you use the other meaning of characters in your posts), since I
> come from the text world, having worked a lot on Unicode text
> processing.
>
> You apparently prefer the text-as-sequence-of-octets, which I tend to
> dislike because I would have thought computer engineers would have
> evolved beyond this when we left the 1900s.

I agree. Every single problem that I can recall Linus bringing up as a  
consequence of HFS+ treating filenames as strings is in fact only a  
problem if you then think of the filename as octets at some point. If  
you stick with UTF-8 equivalence comparison the entire time, then  
everything just works.

Granted, this is a problem when you have to operate on a filesystem  
that thinks of filenames as octets, but as I said before, this doesn't  
mean the HFS+ approach is wrong, it just means it's incompatible with  
Linus's approach.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:43                                               ` Kevin Ballard
@ 2008-01-21 16:48                                                 ` David Kastrup
  2008-01-21 16:59                                                   ` Kevin Ballard
  2008-01-21 16:53                                                 ` Jeff King
                                                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 260+ messages in thread
From: David Kastrup @ 2008-01-21 16:48 UTC (permalink / raw)
  To: git

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
>
>> I happen to prefer the text-as-string-of-characters (or code points,
>> since you use the other meaning of characters in your posts), since I
>> come from the text world, having worked a lot on Unicode text
>> processing.
>>
>> You apparently prefer the text-as-sequence-of-octets, which I tend to
>> dislike because I would have thought computer engineers would have
>> evolved beyond this when we left the 1900s.
>
> I agree. Every single problem that I can recall Linus bringing up as a
> consequence of HFS+ treating filenames as strings is in fact only a
> problem if you then think of the filename as octets at some point. If
> you stick with UTF-8 equivalence comparison the entire time, then
> everything just works.

git calculates hashes over filenames and sorts them.  This is not a mere
question of "UTF-8 equivalence comparison".

> Granted, this is a problem when you have to operate on a filesystem
> that thinks of filenames as octets,

It also is a problem when operating on a filesystem that considers "ä" a
single utf-8 character instead of decomposing it.

> but as I said before, this doesn't mean the HFS+ approach is wrong, it
> just means it's incompatible with Linus's approach.

It is not the business of a file system to juggle with filename
representations.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:43                                               ` Kevin Ballard
  2008-01-21 16:48                                                 ` David Kastrup
@ 2008-01-21 16:53                                                 ` Jeff King
  2008-01-21 17:08                                                 ` Nicolas Pitre
                                                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 260+ messages in thread
From: Jeff King @ 2008-01-21 16:53 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Linus Torvalds, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 11:43:54AM -0500, Kevin Ballard wrote:

> I agree. Every single problem that I can recall Linus bringing up as a  
> consequence of HFS+ treating filenames as strings is in fact only a  
> problem if you then think of the filename as octets at some point. If you 
> stick with UTF-8 equivalence comparison the entire time, then everything 
> just works.

Git's data model relies on SHA-1 hashing of data, including filenames.
So at some level, git _has_ to treat data as octets, and "equivalent"
strings must be the same at the octet level (or else you lose all of the
useful properties that the hashing data model provides). You can argue
about where in the program conversion and normalization occur, but I
don't think you can get around the fact that you're going to need
to think of the "filename as octets at some point."

-Peff

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:48                                                 ` David Kastrup
@ 2008-01-21 16:59                                                   ` Kevin Ballard
  2008-01-21 20:43                                                     ` Dmitry Potapov
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 16:59 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 2086 bytes --]

On Jan 21, 2008, at 11:48 AM, David Kastrup wrote:

> Kevin Ballard <kevin@sb.org> writes:
>
>> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
>>
>>> I happen to prefer the text-as-string-of-characters (or code points,
>>> since you use the other meaning of characters in your posts),  
>>> since I
>>> come from the text world, having worked a lot on Unicode text
>>> processing.
>>>
>>> You apparently prefer the text-as-sequence-of-octets, which I tend  
>>> to
>>> dislike because I would have thought computer engineers would have
>>> evolved beyond this when we left the 1900s.
>>
>> I agree. Every single problem that I can recall Linus bringing up  
>> as a
>> consequence of HFS+ treating filenames as strings is in fact only a
>> problem if you then think of the filename as octets at some point. If
>> you stick with UTF-8 equivalence comparison the entire time, then
>> everything just works.
>
> git calculates hashes over filenames and sorts them.  This is not a  
> mere
> question of "UTF-8 equivalence comparison".

No, it's a question of hashing algorithm. And it's one that's fairly  
easily solved simply by picking a specific nonambiguous UTF-8 encoding  
before hashing.

>> Granted, this is a problem when you have to operate on a filesystem
>> that thinks of filenames as octets,
>
> It also is a problem when operating on a filesystem that considers  
> "ä" a
> single utf-8 character instead of decomposing it.

What makes you say that?

>> but as I said before, this doesn't mean the HFS+ approach is wrong,  
>> it
>> just means it's incompatible with Linus's approach.
>
> It is not the business of a file system to juggle with filename
> representations.

You're right, that probably belongs in the VFS layer, but the behavior  
is the same either way. You can't leave it up to user-space libraries  
to enforce a filesystem encoding, because you can't rely on all  
clients to behave properly.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:43                                               ` Kevin Ballard
  2008-01-21 16:48                                                 ` David Kastrup
  2008-01-21 16:53                                                 ` Jeff King
@ 2008-01-21 17:08                                                 ` Nicolas Pitre
  2008-01-21 17:25                                                   ` Kevin Ballard
  2008-01-21 20:32                                                   ` David Kastrup
  2008-01-21 18:12                                                 ` Linus Torvalds
  2008-01-21 20:30                                                 ` Dmitry Potapov
  4 siblings, 2 replies; 260+ messages in thread
From: Nicolas Pitre @ 2008-01-21 17:08 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Linus Torvalds, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:

> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
> 
> > I happen to prefer the text-as-string-of-characters (or code points,
> > since you use the other meaning of characters in your posts), since I
> > come from the text world, having worked a lot on Unicode text
> > processing.
> > 
> > You apparently prefer the text-as-sequence-of-octets, which I tend to
> > dislike because I would have thought computer engineers would have
> > evolved beyond this when we left the 1900s.
> 
> I agree. Every single problem that I can recall Linus bringing up as a
> consequence of HFS+ treating filenames as strings is in fact only a problem if
> you then think of the filename as octets at some point. If you stick with
> UTF-8 equivalence comparison the entire time, then everything just works.
> 
> Granted, this is a problem when you have to operate on a filesystem that
> thinks of filenames as octets, but as I said before, this doesn't mean the
> HFS+ approach is wrong, it just means it's incompatible with Linus's approach.

Linus' approach is _FAST_.

Why do you think Git has now acquired a reputation of kicking asses all 
around the SCM scene?

The HFS+ approach might be fine if you think of it in terms of "the user 
will be awfully confused if two file names are shown identically in the 
File Open dialog box".  But it otherwise sucks big time when it comes to 
high performance applications needing to deal with a huge amount of file 
names at once.

Normalization will always hurt performances.  This is an overhead.  
Sometimes that overhead might be insignificant and not be perceptible, 
but sometimes it is.  And Git is clearly in the later case. Performances 
will be hurt big time the day it is made aware of that normalization. 
This is why there is so much resistance about it, especially when the 
benefits of normalizing file names are not shown to be worth their cost 
in performance and complexity, as other systems do rather fine without 
it.

Nicolas

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 17:08                                                 ` Nicolas Pitre
@ 2008-01-21 17:25                                                   ` Kevin Ballard
  2008-01-21 20:35                                                     ` David Kastrup
  2008-01-21 20:32                                                   ` David Kastrup
  1 sibling, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 17:25 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Peter Karlsson, Linus Torvalds, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3222 bytes --]

On Jan 21, 2008, at 12:08 PM, Nicolas Pitre wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>
>> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
>>
>>> I happen to prefer the text-as-string-of-characters (or code points,
>>> since you use the other meaning of characters in your posts),  
>>> since I
>>> come from the text world, having worked a lot on Unicode text
>>> processing.
>>>
>>> You apparently prefer the text-as-sequence-of-octets, which I tend  
>>> to
>>> dislike because I would have thought computer engineers would have
>>> evolved beyond this when we left the 1900s.
>>
>> I agree. Every single problem that I can recall Linus bringing up  
>> as a
>> consequence of HFS+ treating filenames as strings is in fact only a  
>> problem if
>> you then think of the filename as octets at some point. If you  
>> stick with
>> UTF-8 equivalence comparison the entire time, then everything just  
>> works.
>>
>> Granted, this is a problem when you have to operate on a filesystem  
>> that
>> thinks of filenames as octets, but as I said before, this doesn't  
>> mean the
>> HFS+ approach is wrong, it just means it's incompatible with  
>> Linus's approach.
>
> Linus' approach is _FAST_.
>
> Why do you think Git has now acquired a reputation of kicking asses  
> all
> around the SCM scene?
>
> The HFS+ approach might be fine if you think of it in terms of "the  
> user
> will be awfully confused if two file names are shown identically in  
> the
> File Open dialog box".  But it otherwise sucks big time when it  
> comes to
> high performance applications needing to deal with a huge amount of  
> file
> names at once.
>
> Normalization will always hurt performances.  This is an overhead.
> Sometimes that overhead might be insignificant and not be perceptible,
> but sometimes it is.  And Git is clearly in the later case.  
> Performances
> will be hurt big time the day it is made aware of that normalization.
> This is why there is so much resistance about it, especially when the
> benefits of normalizing file names are not shown to be worth their  
> cost
> in performance and complexity, as other systems do rather fine without
> it.

I agree, Linus's approach is indeed fast. And if speed is more  
important than treating filenames as text instead of octets, then so  
be it. This is a trade-off. But a trade-off doesn't mean one approach  
is "wrong", it just means the authors of HFS+ thought it was an  
acceptable trade-off. HFS+ wasn't designed to be a high-performance  
filesystem that deals with lots of files, it was designed to be a  
filesystem used by regular people on the Mac, and I believe treating  
filenames as text is a good choice in this scenario. Unfortunately,  
this does mean git has to do extra work to behave correctly on this  
system.

Now, to move on to actually coming up with a solution. Unfortunately I  
don't know enough about the internals of git to really evaluate the  
proposed ideas myself, or to write a patch. Hopefully I'll come up  
with the time to acquire the necessary knowledge, but until then I can  
only participate in these higher-level discussions.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:43                                               ` Kevin Ballard
                                                                   ` (2 preceding siblings ...)
  2008-01-21 17:08                                                 ` Nicolas Pitre
@ 2008-01-21 18:12                                                 ` Linus Torvalds
  2008-01-21 19:05                                                   ` Kevin Ballard
                                                                     ` (3 more replies)
  2008-01-21 20:30                                                 ` Dmitry Potapov
  4 siblings, 4 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 18:12 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
> > 
> > I happen to prefer the text-as-string-of-characters (or code points,
> > since you use the other meaning of characters in your posts), since I
> > come from the text world, having worked a lot on Unicode text
> > processing.
> > 
> > You apparently prefer the text-as-sequence-of-octets, which I tend to
> > dislike because I would have thought computer engineers would have
> > evolved beyond this when we left the 1900s.
> 
> I agree. Every single problem that I can recall Linus bringing up as a
> consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

The fact is, text-as-string-of-codepoints (let's make the "codepoints" 
obvious, so that there is no ambiguity, but I'd also like to make it clear 
that a codepoint *is* how a Unicode character is defined, and a Unicode 
"string" is actually *defined* to be a sequence of codepoints, and totally 
independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those 
codepoints, since we create hashes etc, so in fact, as far ass git is 
concerned, names have to be in some particular encoding to be hashed, and 
UTF-8 is the only sane encoding for Unicode. People can blather about 
UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is 
simply technically superior in so many ways that I don't even understand 
why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all.

But that is *entirely* a separate issue from "normalization". 

Kevin, you seem to think that normalization is somehow forced on you by 
the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. 
Normalization is a totally separate decision, and it's a STUPID one, 
because it breaks so many of the _nice_ properties of using UTF-8.

And THAT is where we differ. It has nothing to do with "octets". It has 
nothing to do with not liking Unicode. It has nothing to do with 
"strings". 

In short:

 - normalization is by no means required or even a good feature. It's 
   something you do when you want to know if two strings are equivalent, 
   but that doesn't actually mean that you should keep the strings 
   normalized all the time!

 - normalization has *nothing* to do with "treating text as octets". 
   That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point, 
   since you need that to even compute a SHA1 in the first place, but that 
   has *nothing* to do with normalization or the lack of it.

Got it? Forced normalization is stupid, because it changes the data and 
removes information, and unless you know that change is safe, it's the 
wrong thing to do.

One reason _not_ to do normalization is that if you don't, you can still 
interact with no ambiguity with other non-Unicode locales. You can do the 
1:1 Latin1<->Unicode translation, and you *never* get into trouble. In 
cotnrast, if you normalize, it's no longer a 1:1 translation any more, and 
you can get into a situation where the translation from Latin1 to Unicode 
and back results in a *different* filename than the one you started with!

See? That's a *serious*problem*. A system that forces normalization BY 
DEFINITION cannot work with people who use a Latin1 filesystem, because it 
will corrupt the filenames!

But you are apparently too damn stupid to understand that "data 
corruption" == "bad", and too damn stupid to see that "Unicode" does not 
mean "Forced normalization".

But I'll try one more time. Let's say that I work on a project where there 
are some people who use Latin1, and some people who use UTF-8, and we use 
special characters. It should all work, as long as we use only the common 
subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting 
filenames is ok, that doesn't work! It works wonderfully well if you do 
the obvious 1:1 translation and you do *not* normalize, but the moment you 
start normalizing, you actually corrupt the filenames!

And yes, the character sequence 'a¨' is exactly one such sequence. It's 
perfectly representable in both Latin1 and in UTF-8: in latin1 it is a 
two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it 
becomes '\x61\xc2\xa8', and you can convert back and forth between those 
two forms an infinite amount of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up. 
Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD) 
'\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now '\xe4', 
ie that filename hass been corrupted!

See? Normalization in the face of working together with others is a total 
and utter mistake, and yes, it really *does* corrupt data. It makes it 
fundamentally impossible to reliably work together with other encodings - 
even when you do converstion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much 
  be used as a "generic encoding" for just about all locales - if you know 
  the locale you convert from and to, you can generally use UTF-8 as an 
  internal format, knowing that you can always get the same result back in 
  the original encoding. Normalization literally breaks that wonderful 
  generic capability of Unicode.

  And the fact that Unicode is such a "generic replacement" for any locale 
  is exactly what makes it so wonderful, and allows you to fairly 
  seamlessly convert piece-meal from some particular locale to Unicode: 
  even if you have some programs that still work in the original locale, 
  you know that you can convert back to it without loss of information.

  Except if you normalize. In that case, you *do* lose information, and 
  suddenly one of the best things about Unicode simply disappears.

  As a result, people who force-normalize are idiots. But they seem to 
  also be stupid enough that they don't understand that they are idiots.
  Sad. 

  It's a bit like whitespace. Whitespace "doesn't matter" in text (== is 
  equivalent), but an email client that force-normalizes whitespace in 
  text is a really *broken* email client, because it turns out that 
  sometimes even the "equivalent" forms simply do matter. Patches are 
  text, but whitespace is meaningful there. 

  Same exact deal: it's good to have the *ability* to normalize 
  whitespace (in email, we call this "text=flowed" or similar), and in 
  some ceses you might even want to make it the default action, but 
  *forcing* normalization is total idiocy and actually makes the system 
  less useful! ]

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 14:14                                             ` Peter Karlsson
  2008-01-21 16:43                                               ` Kevin Ballard
@ 2008-01-21 18:16                                               ` Linus Torvalds
  1 sibling, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 18:16 UTC (permalink / raw)
  To: Peter Karlsson; +Cc: Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Peter Karlsson wrote:
> 
> It is relevant because the Mac OS file system stores file names as a
> sequence of Unicode code points, in a (apparently slightly modified)
> normalized form, whereas Git prefers to see file systems that store
> file names as a sequence of octets, which may, or may not, actually map
> to something that the user would call characters.

No. The *only* issue is that git doesn't normalize.

You can think of git as a UTF-8 namespace all you want, and it will work 
together wonderfully with OS X. 

Git just doesn't force-normalize the names.

> You apparently prefer the text-as-sequence-of-octets, which I tend to
> dislike because I would have thought computer engineers would have
> evolved beyond this when we left the 1900s.

Some of us just know what we're doing, and have been working with UTF-8 
for a long time. It's not about sequence-of-octets, it's about not 
corrupting the data.

You think data should be changed behind peoples backs, potentially causing 
corruption due to unintended conversions. And I don't.

You can call me "left behind in the 1900s", but that's apparently because 
you don't understand the issues. Data corruption wasn't something that 
magically became ok just because we switched into a new century.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 18:12                                                 ` Linus Torvalds
@ 2008-01-21 19:05                                                   ` Kevin Ballard
  2008-01-21 19:41                                                     ` Linus Torvalds
                                                                       ` (2 more replies)
  2008-01-21 19:44                                                   ` Mike Hommey
                                                                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 19:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 10310 bytes --]

On Jan 21, 2008, at 1:12 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
>>>
>>> I happen to prefer the text-as-string-of-characters (or code points,
>>> since you use the other meaning of characters in your posts),  
>>> since I
>>> come from the text world, having worked a lot on Unicode text
>>> processing.
>>>
>>> You apparently prefer the text-as-sequence-of-octets, which I tend  
>>> to
>>> dislike because I would have thought computer engineers would have
>>> evolved beyond this when we left the 1900s.
>>
>> I agree. Every single problem that I can recall Linus bringing up  
>> as a
>> consequence of HFS+ treating filenames as strings [..]
>
> You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS  
> GOING ON.

I could say the same thing about you.

> The fact is, text-as-string-of-codepoints (let's make the "codepoints"
> obvious, so that there is no ambiguity, but I'd also like to make it  
> clear
> that a codepoint *is* how a Unicode character is defined, and a  
> Unicode
> "string" is actually *defined* to be a sequence of codepoints, and  
> totally
> independent of normalization!) is fine.
>
> That was never the issue at all. Unicode codepoints are wonderful.
>
> Now, git _also_ heavily depends on the actual encoding of those
> codepoints, since we create hashes etc, so in fact, as far ass git is
> concerned, names have to be in some particular encoding to be  
> hashed, and
> UTF-8 is the only sane encoding for Unicode. People can blather about
> UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
> simply technically superior in so many ways that I don't even  
> understand
> why anybody ever uses anything else.
>
> So I would not disagree with using UTF-8 at all.
>
> But that is *entirely* a separate issue from "normalization".
>
> Kevin, you seem to think that normalization is somehow forced on you  
> by
> the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
> Normalization is a totally separate decision, and it's a STUPID one,
> because it breaks so many of the _nice_ properties of using UTF-8.

I'm not saying it's forced on you, I'm saying when you treat filenames  
as text, it DOESN'T MATTER if the string gets normalized. As long as  
the string remains equivalent, YOU DON'T CARE about the underlying  
byte stream.

> And THAT is where we differ. It has nothing to do with "octets". It  
> has
> nothing to do with not liking Unicode. It has nothing to do with
> "strings".
>
> In short:
>
> - normalization is by no means required or even a good feature. It's
>   something you do when you want to know if two strings are  
> equivalent,
>   but that doesn't actually mean that you should keep the strings
>   normalized all the time!

Alright, fine. I'm not saying HFS+ is right in storing the normalized  
version, but I do believe the authors of HFS+ must have had a reason  
to do that, and I also believe that it shouldn't make any difference  
to me since it remains equivalent.

> - normalization has *nothing* to do with "treating text as octets".
>   That's entirely an encoding issue.

Sure it does. Normalizing a string produces an equivalent string, and  
so unless I look at the octets the two strings are, for all intents  
and purposes, the same.

> - of *course* git has to treat things as a binary stream at some  
> point,
>   since you need that to even compute a SHA1 in the first place, but  
> that
>   has *nothing* to do with normalization or the lack of it.

You're right, but it doesn't have to treat it as a binary stream at  
the level I care about. I mean, no matter what you do at some level  
the string is evaluated as a binary stream. For our purposes, just  
redefine the hashing algorithm to hash all equivalent strings the  
same, and you can implement that by using SHA1 on a particular  
encoding of the string.

> Got it? Forced normalization is stupid, because it changes the data  
> and
> removes information, and unless you know that change is safe, it's the
> wrong thing to do.

Decomposing and recomposing shouldn't lose any information we care  
about - when treating filenames as text, a<COMBINING DIARESIS> and <A  
WITH DIARESIS> are equivalent, and thus no distinction is made between  
them. I'm not sure what other information you might be considering  
lost in this case.

> One reason _not_ to do normalization is that if you don't, you can  
> still
> interact with no ambiguity with other non-Unicode locales. You can  
> do the
> 1:1 Latin1<->Unicode translation, and you *never* get into trouble. In
> cotnrast, if you normalize, it's no longer a 1:1 translation any  
> more, and
> you can get into a situation where the translation from Latin1 to  
> Unicode
> and back results in a *different* filename than the one you started  
> with!

I don't believe you. See below.

> See? That's a *serious*problem*. A system that forces normalization BY
> DEFINITION cannot work with people who use a Latin1 filesystem,  
> because it
> will corrupt the filenames!
>
> But you are apparently too damn stupid to understand that "data
> corruption" == "bad", and too damn stupid to see that "Unicode" does  
> not
> mean "Forced normalization".

When have I ever said that Unicode meant Forced normalization?

> But I'll try one more time. Let's say that I work on a project where  
> there
> are some people who use Latin1, and some people who use UTF-8, and  
> we use
> special characters. It should all work, as long as we use only the  
> common
> subset, and we teach git to convert to UTF-8 as a common base. Right?
>
> In your *idiotic* world, where you have to normalize and corrupting
> filenames is ok, that doesn't work! It works wonderfully well if you  
> do
> the obvious 1:1 translation and you do *not* normalize, but the  
> moment you
> start normalizing, you actually corrupt the filenames!

Wrong.

> And yes, the character sequence 'a¨' is exactly one such sequence.  
> It's
> perfectly representable in both Latin1 and in UTF-8: in latin1 it is a
> two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion,  
> it
> becomes '\x61\xc2\xa8', and you can convert back and forth between  
> those
> two forms an infinite amount of times, and you never corrupt it.
>
> But the moment you add normalization to the mix, you start screwing  
> up.
> Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD)
> '\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now  
> '\xe4',
> ie that filename hass been corrupted!

Wrong. '\x61\x18' in Latin1, when converted to UTF-8 (NFD) is still  
'\x61\xc2\xa8'. You're mixing up DIARESIS (U+00A8) and COMBINING  
DIARESIS (U+0308).

I suspect this is why you've been yelling so much - you have a  
fundamental misunderstanding about what normalization is actually doing.

> See? Normalization in the face of working together with others is a  
> total
> and utter mistake, and yes, it really *does* corrupt data. It makes it
> fundamentally impossible to reliably work together with other  
> encodings -
> even when you do converstion between the two!
>
> [ And that's the really sad part. Non-normalized Unicode can pretty  
> much
>  be used as a "generic encoding" for just about all locales - if you  
> know
>  the locale you convert from and to, you can generally use UTF-8 as an
>  internal format, knowing that you can always get the same result  
> back in
>  the original encoding. Normalization literally breaks that wonderful
>  generic capability of Unicode.
>
>  And the fact that Unicode is such a "generic replacement" for any  
> locale
>  is exactly what makes it so wonderful, and allows you to fairly
>  seamlessly convert piece-meal from some particular locale to Unicode:
>  even if you have some programs that still work in the original  
> locale,
>  you know that you can convert back to it without loss of information.
>
>  Except if you normalize. In that case, you *do* lose information, and
>  suddenly one of the best things about Unicode simply disappears.

See above as to why you're not losing the information you so fervently  
believe you are.

>  As a result, people who force-normalize are idiots. But they seem to
>  also be stupid enough that they don't understand that they are  
> idiots.
>  Sad.

People who insult others run the risk of looking like a fool when  
shown to be wrong.

>  It's a bit like whitespace. Whitespace "doesn't matter" in text (==  
> is
>  equivalent), but an email client that force-normalizes whitespace in
>  text is a really *broken* email client, because it turns out that
>  sometimes even the "equivalent" forms simply do matter. Patches are
>  text, but whitespace is meaningful there.
>
>  Same exact deal: it's good to have the *ability* to normalize
>  whitespace (in email, we call this "text=flowed" or similar), and in
>  some ceses you might even want to make it the default action, but
>  *forcing* normalization is total idiocy and actually makes the system
>  less useful! ]

Sure, it all depends on what level you need to evaluate text. If we're  
talking about english paragraphs, then whitespace can be messed with.  
When we're talking about unicode strings, then specific encoding can  
be messed with. When talking about byte sequence, nothing can be  
messed with.

In our case, when working on an HFS+ filesystem all you have to care  
about is the unicode string level. The specific encoding can be messed  
with, and the client shouldn't care. Problems only arise when  
attempting to interoperate with filesystems that work at the byte  
sequence level.

The only information you lose when doing canonical normalization is  
what the original byte sequence was. Sure, this is a problem when  
working on a filesystem that cares about byte sequence, but it's not a  
problem when working on a filesystem that cares about the unicode  
string.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:05                                                   ` Kevin Ballard
@ 2008-01-21 19:41                                                     ` Linus Torvalds
  2008-01-21 19:58                                                       ` Kevin Ballard
  2008-01-21 19:57                                                     ` Theodore Tso
  2008-01-21 20:56                                                     ` Dmitry Potapov
  2 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 19:41 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I'm not saying it's forced on you, I'm saying when you treat filenames as
> text, it DOESN'T MATTER if the string gets normalized. As long as the string
> remains equivalent, YOU DON'T CARE about the underlying byte stream.

Sure I do, because it matters a lot for things like - wait for it - things 
like checksumming it.

> Alright, fine. I'm not saying HFS+ is right in storing the normalized version,
> but I do believe the authors of HFS+ must have had a reason to do that, and I
> also believe that it shouldn't make any difference to me since it remains
> equivalent.

I've already told you the reason: they did the mistake of wanting to be 
case-independent, and a (bad) case compare is easier in NFD.

Once you give strings semantic meaning (and "case independent" implies 
that semantic meaning), suddenly normalization looks like a good idea, and 
since you're going to corrupt the data *anyway*, who cares? You just 
created a file like "Hello", and readdir() returns "hello" (because there 
was an old file under that name), and it's a lot more obviously corrupt 
than just due to normalization.

> Sure it does. Normalizing a string produces an equivalent string, and so
> unless I look at the octets the two strings are, for all intents and purposes,
> the same.

.. but you *have* to look at the octets at some point. They're kind of 
what the string is built up of. They never went away, even if you chose to 
ignore them. The encoding is really quite important, and is visible both 
in memory and on disk.

It's what shows up when you sha1sum, but it's also as simple as what shows 
up when you do an "ls -l" and look at a file size.

It doesn't matter if the text is "equivalent", when you then see the 
differences in all these small details.

You can shut your eyes as much as you want, and say that you don't care, 
but the differences are real, and they are visible.

> Decomposing and recomposing shouldn't lose any information we care about -
> when treating filenames as text, a<COMBINING DIARESIS> and <A WITH DIARESIS>
> are equivalent, and thus no distinction is made between them. I'm not sure
> what other information you might be considering lost in this case.

You're right, I messed up. I used a non-combining diaeresis, and you're 
right, it doesn't get corrupted. And I think that means that if Apple had 
used NFC, we'd not have this problem with Latin1 systems (because then the 
UTF-8 representation would be the same).

So I still think that normalization is totally idiotic, but the thing that 
actually causes most problems for people on OS X is that they chose the 
really inconvenient one.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 18:12                                                 ` Linus Torvalds
  2008-01-21 19:05                                                   ` Kevin Ballard
@ 2008-01-21 19:44                                                   ` Mike Hommey
  2008-01-21 20:36                                                   ` Dmitry Potapov
  2008-01-21 21:06                                                   ` Martin Langhoff
  3 siblings, 0 replies; 260+ messages in thread
From: Mike Hommey @ 2008-01-21 19:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 10:12:01AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 21 Jan 2008, Kevin Ballard wrote:
> > On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
> > > 
> > > I happen to prefer the text-as-string-of-characters (or code points,
> > > since you use the other meaning of characters in your posts), since I
> > > come from the text world, having worked a lot on Unicode text
> > > processing.
> > > 
> > > You apparently prefer the text-as-sequence-of-octets, which I tend to
> > > dislike because I would have thought computer engineers would have
> > > evolved beyond this when we left the 1900s.
> > 
> > I agree. Every single problem that I can recall Linus bringing up as a
> > consequence of HFS+ treating filenames as strings [..]
> 
> You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.
> 
> The fact is, text-as-string-of-codepoints (let's make the "codepoints" 
> obvious, so that there is no ambiguity, but I'd also like to make it clear 
> that a codepoint *is* how a Unicode character is defined, and a Unicode 
> "string" is actually *defined* to be a sequence of codepoints, and totally 
> independent of normalization!) is fine.
> 
> That was never the issue at all. Unicode codepoints are wonderful.
> 
> Now, git _also_ heavily depends on the actual encoding of those 
> codepoints, since we create hashes etc, so in fact, as far ass git is 
> concerned, names have to be in some particular encoding to be hashed, and 
> UTF-8 is the only sane encoding for Unicode. People can blather about 
> UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is 
> simply technically superior in so many ways that I don't even understand 
> why anybody ever uses anything else.

Maybe because it's 1.5 times bigger for any text in chinese, japanese or
korean ?

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:05                                                   ` Kevin Ballard
  2008-01-21 19:41                                                     ` Linus Torvalds
@ 2008-01-21 19:57                                                     ` Theodore Tso
  2008-01-21 20:01                                                       ` Kevin Ballard
  2008-01-21 20:56                                                     ` Dmitry Potapov
  2 siblings, 1 reply; 260+ messages in thread
From: Theodore Tso @ 2008-01-21 19:57 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
> You're right, but it doesn't have to treat it as a binary stream at the 
> level I care about. I mean, no matter what you do at some level the string 
> is evaluated as a binary stream. For our purposes, just redefine the 
> hashing algorithm to hash all equivalent strings the same, and you can 
> implement that by using SHA1 on a particular encoding of the string.

That's horribly broken, for a couple of reasons.  First of all,
changing the hash algorithm breaks compatibility with existing
repositories; sure, you can try to guess what will least likely break
existing repository (which won't be the native MacOSX normalization
algorithm, since it's more likely the combined character will likely
be used on other environments), but there's still no guarantee there
aren't filenames that use some other form of byte-string for the
filename.

Secondly, the hash algorithm would not be stable.  Unicode is not
static, and new characters can get added that may be composable, and
thus would be normalized differently.  This is one of the reasons why
Unicode is so horribly broken as a standard.  It was originally
created by representatives from the printing world that were horribly
clueless about what was needed with respect to canonicalization
representation, so they compromised allowed both forms, not realizing
what a massive f*ckup this would cause later on.  So people have over
the years piled kludges on top of kludges in order to make Unicode
"work".  

So we can't blame all of the craziness on the MacOS designers,
although they have seen to have been very creative about how to take a
bad situation and make it worse....

					- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:41                                                     ` Linus Torvalds
@ 2008-01-21 19:58                                                       ` Kevin Ballard
  2008-01-21 20:33                                                         ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5143 bytes --]

On Jan 21, 2008, at 2:41 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>> I'm not saying it's forced on you, I'm saying when you treat  
>> filenames as
>> text, it DOESN'T MATTER if the string gets normalized. As long as  
>> the string
>> remains equivalent, YOU DON'T CARE about the underlying byte stream.
>
> Sure I do, because it matters a lot for things like - wait for it -  
> things
> like checksumming it.

I believe I already responded to the issue of hashing. In summary,  
just re-define your hash function to convert the string to a specific  
encoding. Sure, you'll lose some speed, but we're already assuming  
that it's worth taking a speed hit in order to treat filenames as  
strings (please don't argue this point, it's an opinion, not a factual  
statement, and I'm not necessarily saying I agree with it, I'm just  
saying it's valid).

>> Alright, fine. I'm not saying HFS+ is right in storing the  
>> normalized version,
>> but I do believe the authors of HFS+ must have had a reason to do  
>> that, and I
>> also believe that it shouldn't make any difference to me since it  
>> remains
>> equivalent.
>
> I've already told you the reason: they did the mistake of wanting to  
> be
> case-independent, and a (bad) case compare is easier in NFD.
>
> Once you give strings semantic meaning (and "case independent" implies
> that semantic meaning), suddenly normalization looks like a good  
> idea, and
> since you're going to corrupt the data *anyway*, who cares? You just
> created a file like "Hello", and readdir() returns "hello" (because  
> there
> was an old file under that name), and it's a lot more obviously  
> corrupt
> than just due to normalization.

Perhaps that is the reason, I don't know (neither do you, you're just  
guessing). However, my point still stands - as long as the string  
stays canonically equivalent, it doesn't matter to me if the  
filesystem changes the encoding, since I'm working at the string level.

>> Sure it does. Normalizing a string produces an equivalent string,  
>> and so
>> unless I look at the octets the two strings are, for all intents  
>> and purposes,
>> the same.
>
> .. but you *have* to look at the octets at some point. They're kind of
> what the string is built up of. They never went away, even if you  
> chose to
> ignore them. The encoding is really quite important, and is visible  
> both
> in memory and on disk.

Someone has to look at the octets, but it doesn't have to be me. As  
long as I use unicode-aware libraries and such, I can let the  
underlying system care about the byte order and my code will be clean.

> It's what shows up when you sha1sum, but it's also as simple as what  
> shows
> up when you do an "ls -l" and look at a file size.

It does? Why on earth should it do that? Filename doesn't contribute  
to the listed filesize on OS X.

kevin@KBLAPTOP:~> echo foo > foo; echo foo > foobar
kevin@KBLAPTOP:~> ls -l foo*
-rw-r--r--  1 kevin  kevin  4 Jan 21 14:50 foo
-rw-r--r--  1 kevin  kevin  4 Jan 21 14:50 foobar

It would be singularly stupid for the filesize to reflect the  
filename, especially since this means you would report different  
filesizes for hardlinks.

> It doesn't matter if the text is "equivalent", when you then see the
> differences in all these small details.
>
> You can shut your eyes as much as you want, and say that you don't  
> care,
> but the differences are real, and they are visible.

Visible at some level, sure, but not visible at the level my code  
works on. And thus, I don't have to care about it.

>> Decomposing and recomposing shouldn't lose any information we care  
>> about -
>> when treating filenames as text, a<COMBINING DIARESIS> and <A WITH  
>> DIARESIS>
>> are equivalent, and thus no distinction is made between them. I'm  
>> not sure
>> what other information you might be considering lost in this case.
>
> You're right, I messed up. I used a non-combining diaeresis, and  
> you're
> right, it doesn't get corrupted. And I think that means that if  
> Apple had
> used NFC, we'd not have this problem with Latin1 systems (because  
> then the
> UTF-8 representation would be the same).

I'm not sure what you mean. The byte sequence is different from Latin1  
to UTF-8 even if you use NFC, so I don't think, in this case, it makes  
any difference whether you use NFC or NFD. Yes, the codepoints are the  
same in Latin1 and UTF-8 if you use NFC, but that's hardly relevant.  
Please correct me if I'm wrong, but I believe Latin1->UTF-8->Latin1  
conversion will always produce the same Latin1 text whether you use  
NFC or NFD.

> So I still think that normalization is totally idiotic, but the  
> thing that
> actually causes most problems for people on OS X is that they chose  
> the
> really inconvenient one.

The only reason it's particularly inconvenient is because it's  
different from what most other systems picked. And if you want to  
blame someone for that, blame Unicode for having so many different  
normalization forms.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:57                                                     ` Theodore Tso
@ 2008-01-21 20:01                                                       ` Kevin Ballard
  2008-01-21 20:15                                                         ` Theodore Tso
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 20:01 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

On Jan 21, 2008, at 2:57 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
>> You're right, but it doesn't have to treat it as a binary stream at  
>> the
>> level I care about. I mean, no matter what you do at some level the  
>> string
>> is evaluated as a binary stream. For our purposes, just redefine the
>> hashing algorithm to hash all equivalent strings the same, and you  
>> can
>> implement that by using SHA1 on a particular encoding of the string.
>
> That's horribly broken, for a couple of reasons.  First of all,
> changing the hash algorithm breaks compatibility with existing
> repositories; sure, you can try to guess what will least likely break
> existing repository (which won't be the native MacOSX normalization
> algorithm, since it's more likely the combined character will likely
> be used on other environments), but there's still no guarantee there
> aren't filenames that use some other form of byte-string for the
> filename.
>
> Secondly, the hash algorithm would not be stable.  Unicode is not
> static, and new characters can get added that may be composable, and
> thus would be normalized differently.  This is one of the reasons why
> Unicode is so horribly broken as a standard.  It was originally
> created by representatives from the printing world that were horribly
> clueless about what was needed with respect to canonicalization
> representation, so they compromised allowed both forms, not realizing
> what a massive f*ckup this would cause later on.  So people have over
> the years piled kludges on top of kludges in order to make Unicode
> "work".
>
> So we can't blame all of the craziness on the MacOS designers,
> although they have seen to have been very creative about how to take a
> bad situation and make it worse..

You seem to be under the impression that I'm advocating that git treat  
all filenames as unicode strings, and thus change its hashing  
algorithm as described. I am not. I am saying that, if git only had to  
deal with HFS+, then it could treat all filenames as strings, etc.  
However, since git does not only have to deal with HFS+, this will not  
work. What I am describing is an ideal, not a practicality.

In other words, what I'm saying is that treating filenames as strings  
works perfectly fine, *provided you can do that 100% of the time*. git  
cannot do that 100% of the time, therefore it's not appropriate here.  
The purpose of this argument is to illustrate that treating filenames  
as strings isn't wrong, it's simply incompatible with treating  
filenames as byte sequences.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:01                                                       ` Kevin Ballard
@ 2008-01-21 20:15                                                         ` Theodore Tso
  2008-01-21 20:31                                                           ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Theodore Tso @ 2008-01-21 20:15 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 03:01:43PM -0500, Kevin Ballard wrote:
>
> You seem to be under the impression that I'm advocating that git treat all 
> filenames as unicode strings, and thus change its hashing algorithm as 
> described. I am not. I am saying that, if git only had to deal with HFS+, 
> then it could treat all filenames as strings, etc. However, since git does 
> not only have to deal with HFS+, this will not work. What I am describing 
> is an ideal, not a practicality.

Well, why are you arguing on the git list about precisely that (when
you reponsed to Linus), then?

> In other words, what I'm saying is that treating filenames as strings works 
> perfectly fine, *provided you can do that 100% of the time*. git cannot do 
> that 100% of the time, therefore it's not appropriate here. The purpose of 
> this argument is to illustrate that treating filenames as strings isn't 
> wrong, it's simply incompatible with treating filenames as byte sequences.

No, it's still broken, because of the Unicode-is-not-static problem.
What happens when you start adding more composable characters, which
some future version of HFS+ will start breaking apart? 

Presumably the whole *reason* why HFS+ was corrupting strings was so
that "stupid applications" that only did byte comparisons would work
correctly.  But when you upgrade from Mac OS 10.5 to 10.6, and it adds
support for new composable characters, and you now take a USB hard
drive that was hooked up to a MacBook Air, running one version of
MacOS, and hook it up to another Macintosh, running another version of
MacOS, the normalization algorithm will be different, so the byte
comparisons won't work.  

So all of this extra work which MacOS put in to corrupt filenames
behind our back doesn't actually do any good; applications still need
to be smart, or there will be rare, hard to reproduce bugs
nevertheless.  So if MacOS wants to supply Unicode libraries that
compare strings keeping in mind Unicode "equivalences" it can be our
guest (although how they deal with different versions of Unicode with
different equivalence classes will be their cross to bear).  BUT MacOS
X SHOULD NOT BE CORRUPTING FILENAMES.  TO DO SO IS BROKEN.

Even Microsoft got this right; its filesystem is case-preserving, but
it has case-insensitive lookups.  Hence, it is not corrupting
filenames behind the application's back, unlike MacOS.

						- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:43                                               ` Kevin Ballard
                                                                   ` (3 preceding siblings ...)
  2008-01-21 18:12                                                 ` Linus Torvalds
@ 2008-01-21 20:30                                                 ` Dmitry Potapov
  4 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 20:30 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Linus Torvalds, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 11:43:54AM -0500, Kevin Ballard wrote:
> 
> I agree. Every single problem that I can recall Linus bringing up as a  
> consequence of HFS+ treating filenames as strings is in fact only a  
> problem if you then think of the filename as octets at some point.

At *some* point everything stored in computers is a sequence of octets.
In fact, the whole point of the Unicode standard is to define characters
and how to map each character to a unique number (code points) and then
how to encode this number into sequence of octets.

> If  
> you stick with UTF-8 equivalence comparison the entire time, then  
> everything just works.

There are more than one equivalence comparison. The unicode standard
defines at least two, and for some other purpose you may want to use
some others, but for some reason you are trying to present that to
work with text means to follow only one type of equivalence the entire
time...

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:15                                                         ` Theodore Tso
@ 2008-01-21 20:31                                                           ` Kevin Ballard
  2008-01-21 20:46                                                             ` Theodore Tso
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 20:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3940 bytes --]

On Jan 21, 2008, at 3:15 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 03:01:43PM -0500, Kevin Ballard wrote:
>>
>> You seem to be under the impression that I'm advocating that git  
>> treat all
>> filenames as unicode strings, and thus change its hashing algorithm  
>> as
>> described. I am not. I am saying that, if git only had to deal with  
>> HFS+,
>> then it could treat all filenames as strings, etc. However, since  
>> git does
>> not only have to deal with HFS+, this will not work. What I am  
>> describing
>> is an ideal, not a practicality.
>
> Well, why are you arguing on the git list about precisely that (when
> you reponsed to Linus), then?

Because of the way in which an argument evolves. This started out as  
"HFS+ is stupid because it normalizes", and I was arguing that said  
normalization wasn't stupid. This turned into an argument as to why HFS 
+ wasn't stupid for normalization, which is basically this argument of  
the ideal. Yes, I realize that it's not producing any practical  
results, but I'm stubborn (as, apparently, are most of you), and I  
believe that if the official stance of the git project is "HFS+ is  
stupid" then there's a lower chance of a patch being accepted then if  
people accept that "HFS+ is different in an incompatible fashion".

>> In other words, what I'm saying is that treating filenames as  
>> strings works
>> perfectly fine, *provided you can do that 100% of the time*. git  
>> cannot do
>> that 100% of the time, therefore it's not appropriate here. The  
>> purpose of
>> this argument is to illustrate that treating filenames as strings  
>> isn't
>> wrong, it's simply incompatible with treating filenames as byte  
>> sequences.
>
> No, it's still broken, because of the Unicode-is-not-static problem.
> What happens when you start adding more composable characters, which
> some future version of HFS+ will start breaking apart?

If you need a static representation, you normalize to a specific form.  
And in fact, adding new composable characters doesn't matter, since if  
they didn't exist before, you couldn't have possibly used them. Unless  
you mean adding new composed forms of existing simpler characters, at  
which point you seem to be arguing for NFD instead of NFC.

> Presumably the whole *reason* why HFS+ was corrupting strings was so
> that "stupid applications" that only did byte comparisons would work
> correctly.  But when you upgrade from Mac OS 10.5 to 10.6, and it adds
> support for new composable characters, and you now take a USB hard
> drive that was hooked up to a MacBook Air, running one version of
> MacOS, and hook it up to another Macintosh, running another version of
> MacOS, the normalization algorithm will be different, so the byte
> comparisons won't work.

I doubt that HFS+ normalized so that "stupid applications" could do  
byte comparisons. But even if that were the case, see previous  
paragraph.

> So all of this extra work which MacOS put in to corrupt filenames
> behind our back doesn't actually do any good; applications still need
> to be smart, or there will be rare, hard to reproduce bugs
> nevertheless.  So if MacOS wants to supply Unicode libraries that
> compare strings keeping in mind Unicode "equivalences" it can be our
> guest (although how they deal with different versions of Unicode with
> different equivalence classes will be their cross to bear).  BUT MacOS
> X SHOULD NOT BE CORRUPTING FILENAMES.  TO DO SO IS BROKEN.

Your entire argument is based on the assumption that HFS+ "corrupts"  
filenames in order to allow dumb clients to do byte comparisons, and I  
don't believe that to be the case. In fact, it's only considered a  
corruption if you care about the byte sequence of filenames, and my  
argument is that, on HFS+, you aren't supposed to care about the byte  
sequence.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 17:08                                                 ` Nicolas Pitre
  2008-01-21 17:25                                                   ` Kevin Ballard
@ 2008-01-21 20:32                                                   ` David Kastrup
  1 sibling, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-21 20:32 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Kevin Ballard, Peter Karlsson, Linus Torvalds, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Nicolas Pitre <nico@cam.org> writes:

> Normalization will always hurt performances.  This is an overhead.
> Sometimes that overhead might be insignificant and not be perceptible,
> but sometimes it is.  And Git is clearly in the later
> case. Performances will be hurt big time the day it is made aware of
> that normalization.  This is why there is so much resistance about it,
> especially when the benefits of normalizing file names are not shown
> to be worth their cost in performance and complexity, as other systems
> do rather fine without it.

Normalization is cheap if you normalize user input.  The user will
always be quite slower than any reasonable normalization algorithm.  But
in the filesystem, one is normalizing the same stuff over and over.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:58                                                       ` Kevin Ballard
@ 2008-01-21 20:33                                                         ` Linus Torvalds
  2008-01-21 20:53                                                           ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 20:33 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> > It's what shows up when you sha1sum, but it's also as simple as what shows
> > up when you do an "ls -l" and look at a file size.
> 
> It does? Why on earth should it do that? Filename doesn't contribute to the
> listed filesize on OS X.

Umm. What's this inability to see that data is data is data?

Why do you think Unicode has anything in particular to do with filenames?

Those same unicode strings are often part of the file data itself, and 
then that encoding damn well is visible in "ls -l".

Doing

	echo ä > file
	ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much. 
My point was that those underlying octets are always there, and they do 
matter. The fact that the differences may not be visible when you compare 
the normalized forms doesn't make it any less true.

You can choose to put blinders on and try to claim that normalization is 
invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE 
IT.

But that doesn't change the fact that a lot of things *do* see it. There 
are very few things that are "Unicode specific", and a *lot* of tools that 
are just "general data tools".

And git tries to be a general data tool, not a Unicode-specific one.

> I'm not sure what you mean. The byte sequence is different from Latin1 to
> UTF-8 even if you use NFC, so I don't think, in this case, it makes any
> difference whether you use NFC or NFD.
>
> Yes, the codepoints are the same in Latin1 and UTF-8 if you use NFC, but 
> that's hardly relevant. Please correct me if I'm wrong, but I believe 
> Latin1->UTF-8->Latin1 conversion will always produce the same Latin1 
> text whether you use NFC or NFD.

The problem is that the UTF-8 form is different, so if you save things in 
UTF-8 (which we hopefully agree is a sane thing to do), then you should 
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing to 
something else, you actually de-normalize as far as those other people are 
concerned.

So if you have to normalize, at least use the normal form!

> The only reason it's particularly inconvenient is because it's different from
> what most other systems picked. And if you want to blame someone for that,
> blame Unicode for having so many different normalization forms.

I blame them for encouraging normalization at all.

It's stupid.

You don't need it.

The people who care about "are these strings equivalent" shouldn't do a 
"memcmp()" on them in the first place. And if you don't do a memcmp() on 
things, then you don't need to normalize. 

So you have two cases:
 (a) the cases that care about *identity*. They don't want normalization
 (b) the cases that care about *equivalence*. And they shouldn't do 
      octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither case 
is normalization the right thing to do (except as *possibly* an internal 
part of the comparison, but there are actually better ways to check for 
equivalence than the brute-force "normalize both and compare results 
bitwise").

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 17:25                                                   ` Kevin Ballard
@ 2008-01-21 20:35                                                     ` David Kastrup
  0 siblings, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-21 20:35 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Nicolas Pitre, Peter Karlsson, Linus Torvalds, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Kevin Ballard <kevin@sb.org> writes:

> I agree, Linus's approach is indeed fast. And if speed is more
> important than treating filenames as text instead of octets, then so
> be it. This is a trade-off. But a trade-off doesn't mean one approach
> is "wrong", it just means the authors of HFS+ thought it was an
> acceptable trade-off. HFS+ wasn't designed to be a high-performance
> filesystem that deals with lots of files, it was designed to be a
> filesystem used by regular people on the Mac, and I believe treating
> filenames as text is a good choice in this scenario.

Regular people have brains, not filesystems.  HFS+ is employed by
computers, and computers can produce or query or process lots of data in
very short time spans, in their own pace.  And if Mac users did not want
to make use of that, they would still be using Mac classics.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 18:12                                                 ` Linus Torvalds
  2008-01-21 19:05                                                   ` Kevin Ballard
  2008-01-21 19:44                                                   ` Mike Hommey
@ 2008-01-21 20:36                                                   ` Dmitry Potapov
  2008-01-21 21:06                                                   ` Martin Langhoff
  3 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 10:12:01AM -0800, Linus Torvalds wrote:
> 
> The fact is, text-as-string-of-codepoints (let's make the "codepoints" 
> obvious, so that there is no ambiguity, but I'd also like to make it clear 
> that a codepoint *is* how a Unicode character is defined, and a Unicode 
> "string" is actually *defined* to be a sequence of codepoints, and totally 
> independent of normalization!) is fine.

Code point is a unique numerical value assigned to every Unicode character.
Also, every Unicode character has a uniqie name assigned to it. There are
some other non-unique properties that every Unicode has. So, to say that
a Unicode character is just a code point is not exactly correct, because
the code point is one of properties of a unicode character. But, yes, any
Unicode character can be identified by its code point. So, it is one to
one relation.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 16:59                                                   ` Kevin Ballard
@ 2008-01-21 20:43                                                     ` Dmitry Potapov
  2008-01-21 20:53                                                       ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 20:43 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: David Kastrup, git

On Mon, Jan 21, 2008 at 11:59:24AM -0500, Kevin Ballard wrote:
> 
> No, it's a question of hashing algorithm. And it's one that's fairly  
> easily solved simply by picking a specific nonambiguous UTF-8 encoding  
> before hashing.

UTF-8 is a *single* encoding, and it maps every Unicode character to
a unique binary representation. So, it is completely nonambiguous.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:31                                                           ` Kevin Ballard
@ 2008-01-21 20:46                                                             ` Theodore Tso
  2008-01-21 20:59                                                               ` Kevin Ballard
       [not found]                                                               ` <6E303071-82A4-4D69-AA0C-EC41168B9AFE@sb.org>
  0 siblings, 2 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-21 20:46 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 03:31:02PM -0500, Kevin Ballard wrote:
>> No, it's still broken, because of the Unicode-is-not-static problem.
>> What happens when you start adding more composable characters, which
>> some future version of HFS+ will start breaking apart?
>
> If you need a static representation, you normalize to a specific form. And 
> in fact, adding new composable characters doesn't matter, since if they 
> didn't exist before, you couldn't have possibly used them. 

Sure you can.  Suppose you unpack the same tar file or zip file that
contains one of these new-fangled characters, one on a MacOS 10.5
system, and one on a MacOS 10.9 system.  How HFS+ will corrupt that
filename will differ depending which version of MacOS you are running.
Hence, normalizing the filename when you store it is stupid and
broken.  MacOS and its applications and libraries want to do
normalization in the privacy of its own address space, that's it's
business.  It can pursue any fetish it wants, among consenting adults.
Safe, sane and consensual, and all that... well, consensual, anyway.
I'm not sure about "safe" and "sane"....

My arguement is basically is that there is absolutely no value in what
HFS+ is doing, by corrupting filenames --- if you want to call it
"normalizing" them, fine, but since Unicode is not static, so you
can't even call it a "canonical" form.  It's just some random
corruption of what was passed in at open(2) time, that can and will
change depending on what version of MacOS you are running.

If you want to play the insane Unicode game of "equivalent"
characters, you have to do it at comparison time, so there's no point
trying to "normalize" them when you store them.  It doesn't buy you
anything, and it causes all sorts of pain.

> Your entire argument is based on the assumption that HFS+ "corrupts" 
> filenames in order to allow dumb clients to do byte comparisons, and I 
> don't believe that to be the case. 

OK, what's your reason for why HFS+ corrupts filenames?  What do you
think is its excuse?  What problem does it solve?  If the answer is
"no reason at all, but because it *can*", according to the Great God
Unicode, then that's really not very impressive....

						- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:43                                                     ` Dmitry Potapov
@ 2008-01-21 20:53                                                       ` Kevin Ballard
  2008-01-21 21:05                                                         ` David Kastrup
  2008-01-21 23:01                                                         ` Dmitry Potapov
  0 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 20:53 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: David Kastrup, git

[-- Attachment #1: Type: text/plain, Size: 794 bytes --]

On Jan 21, 2008, at 3:43 PM, Dmitry Potapov wrote:

> On Mon, Jan 21, 2008 at 11:59:24AM -0500, Kevin Ballard wrote:
>>
>> No, it's a question of hashing algorithm. And it's one that's fairly
>> easily solved simply by picking a specific nonambiguous UTF-8  
>> encoding
>> before hashing.
>
> UTF-8 is a *single* encoding, and it maps every Unicode character to
> a unique binary representation. So, it is completely nonambiguous.

In this case, encoding refers to normalization form, as other people  
have used it in the conversation besides me.

I suggest you stop trying to find inconsequential stuff to argue  
about, especially when a tiny bit of critical thinking would reveal  
the answer.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:33                                                         ` Linus Torvalds
@ 2008-01-21 20:53                                                           ` Kevin Ballard
       [not found]                                                             ` <alpine.LFD.1.0! 0.0801211323120.2957@woody.linux-foundation.org>
                                                                               ` (3 more replies)
  0 siblings, 4 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 20:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5116 bytes --]

On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>>> It's what shows up when you sha1sum, but it's also as simple as  
>>> what shows
>>> up when you do an "ls -l" and look at a file size.
>>
>> It does? Why on earth should it do that? Filename doesn't  
>> contribute to the
>> listed filesize on OS X.
>
> Umm. What's this inability to see that data is data is data?

I'm not sure what you mean. I stated a fact - at least on OS X, the  
filename does not contribute to the listed filesize, so changing the  
encoding of the filename doesn't change the filesize. This isn't a  
philosophical point, it's a factual statement.

> Why do you think Unicode has anything in particular to do with  
> filenames?

I don't, but I do think this discussion revolves around filenames,  
therefore it should not surprise you when I talk about filenames.

> Those same unicode strings are often part of the file data itself, and
> then that encoding damn well is visible in "ls -l".
>
> Doing
>
> 	echo ä > file
> 	ls -l file
>
> sure shows that "underlying octet" thing that you wanted to avoid so  
> much.
> My point was that those underlying octets are always there, and they  
> do
> matter. The fact that the differences may not be visible when you  
> compare
> the normalized forms doesn't make it any less true.

Yes, I am well aware that the encoding of the *file contents* affects  
filesize. But when did I suggest changing the encoding of filenames  
inside file contents? If you treat filenames as strings, there's no  
requirement to change the encoding of filenames inside file contents.  
I'm talking specifically about the filenames, not about file contents,  
so stop trying to argue against that which is irrelevant.

> You can choose to put blinders on and try to claim that  
> normalization is
> invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT  
> TO SEE
> IT.

Don't want to, or don't need to? It's not a matter of ignoring  
encoding because I don't want to deal with it, it's ignoring encoding  
because it's simply not relevant if I treat filenames as strings.

> But that doesn't change the fact that a lot of things *do* see it.  
> There
> are very few things that are "Unicode specific", and a *lot* of  
> tools that
> are just "general data tools".
>
> And git tries to be a general data tool, not a Unicode-specific one.

Yes, I realize that. See my previous message about discussing ideal vs  
practicality.

>> I'm not sure what you mean. The byte sequence is different from  
>> Latin1 to
>> UTF-8 even if you use NFC, so I don't think, in this case, it makes  
>> any
>> difference whether you use NFC or NFD.
>>
>> Yes, the codepoints are the same in Latin1 and UTF-8 if you use  
>> NFC, but
>> that's hardly relevant. Please correct me if I'm wrong, but I believe
>> Latin1->UTF-8->Latin1 conversion will always produce the same Latin1
>> text whether you use NFC or NFD.
>
> The problem is that the UTF-8 form is different, so if you save  
> things in
> UTF-8 (which we hopefully agree is a sane thing to do), then you  
> should
> try to use a representation that people agree on.
>
> And NFC is the more common normalization form by far, so by  
> normalizing to
> something else, you actually de-normalize as far as those other  
> people are
> concerned.
>
> So if you have to normalize, at least use the normal form!

Was NFC the common normalization form back in 1998? My understanding  
is Unicode was still in the process of being adopted back then, so  
there was no one common standard that was obvious for everyone to use.

>> The only reason it's particularly inconvenient is because it's  
>> different from
>> what most other systems picked. And if you want to blame someone  
>> for that,
>> blame Unicode for having so many different normalization forms.
>
> I blame them for encouraging normalization at all.
>
> It's stupid.
>
> You don't need it.
>
> The people who care about "are these strings equivalent" shouldn't  
> do a
> "memcmp()" on them in the first place. And if you don't do a  
> memcmp() on
> things, then you don't need to normalize.
>
> So you have two cases:
> (a) the cases that care about *identity*. They don't want  
> normalization
> (b) the cases that care about *equivalence*. And they shouldn't do
>      octet-by-octet comparison.
>
> See? Either you want to see equivalence, or you don't. And in  
> neither case
> is normalization the right thing to do (except as *possibly* an  
> internal
> part of the comparison, but there are actually better ways to check  
> for
> equivalence than the brute-force "normalize both and compare results
> bitwise").

I could argue against this, but frankly, I'm really tired of arguing  
this same point. I suggest we simply agree to disagree, and move on to  
actually fixing the problem.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 19:05                                                   ` Kevin Ballard
  2008-01-21 19:41                                                     ` Linus Torvalds
  2008-01-21 19:57                                                     ` Theodore Tso
@ 2008-01-21 20:56                                                     ` Dmitry Potapov
  2008-01-21 21:07                                                       ` Kevin Ballard
  2 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 20:56 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
> >
> >But that is *entirely* a separate issue from "normalization".
> >
> >Kevin, you seem to think that normalization is somehow forced on you  
> >by
> >the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
> >Normalization is a totally separate decision, and it's a STUPID one,
> >because it breaks so many of the _nice_ properties of using UTF-8.
> 
> I'm not saying it's forced on you, I'm saying when you treat filenames  
> as text,

to treat as text could mean different for different people. Some
may prefer to fi and fi_ligature to be treated as same in some
context.

> it DOESN'T MATTER if the string gets normalized. As long as  
> the string remains equivalent,

As matter of fact it does, otherwise characters would be the
same and we would not have this conversation at all. String
can be equivalent and not equivalent at the time, because there
are different equivalent relations. Finally, what HFS+ does
is even not normalization. In the technote, Apple explains
that they decompose some characters but not others for better
compatibility. So, you see, there is a PROBLEM here.

> YOU DON'T CARE about the underlying  
> byte stream.

It is not about byte stream. After all, if it were UTF-16 instead
of UTF-8, it would be one to one conversion for each character.
So, what gets corrupted by HFS+ are Unicode *characters*.

> 
> Alright, fine. I'm not saying HFS+ is right in storing the normalized  
> version, but I do believe the authors of HFS+ must have had a reason  
> to do that,

I don't say they do that without *any* reason, but I suppose all
Apple developers in the Copland project had some reasons for they
did, but the outcome was not very good...

> The only information you lose when doing canonical normalization is  
> what the original byte sequence was. 

Not true. You lose the original sequence of *characters*.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:53                                                           ` Kevin Ballard
       [not found]                                                             ` <alpine.LFD.1.0! 0.0801211323120.2957@woody.linux-foundation.org>
@ 2008-01-21 20:58                                                             ` David Kastrup
  2008-01-21 21:17                                                             ` Martin Langhoff
  2008-01-21 21:33                                                             ` Linus Torvalds
  3 siblings, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-21 20:58 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:
>
>> On Mon, 21 Jan 2008, Kevin Ballard wrote:

>>> It does? Why on earth should it do that? Filename doesn't
>>> contribute to the
>>> listed filesize on OS X.
>>
>> Umm. What's this inability to see that data is data is data?
>
> I'm not sure what you mean. I stated a fact - at least on OS X, the
> filename does not contribute to the listed filesize, so changing the
> encoding of the filename doesn't change the filesize. This isn't a
> philosophical point, it's a factual statement.

Changing the encoding of the file name most certainly changes the
file size of the directory.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:46                                                             ` Theodore Tso
@ 2008-01-21 20:59                                                               ` Kevin Ballard
       [not found]                                                               ` <6E303071-82A4-4D69-AA0C-EC41168B9AFE@sb.org>
  1 sibling, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 20:59 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

Note: resent to list due to bounce.
Original CC list: tytso@MIT.EDU, torvalds@linux-foundation.org, peter@softwolves.pp.se 
, mjscod@web.de, melo@simplicidade.org

On Jan 21, 2008, at 3:46 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 03:31:02PM -0500, Kevin Ballard wrote:
>>> No, it's still broken, because of the Unicode-is-not-static problem.
>>> What happens when you start adding more composable characters, which
>>> some future version of HFS+ will start breaking apart?
>>
>> If you need a static representation, you normalize to a specific  
>> form. And
>> in fact, adding new composable characters doesn't matter, since if  
>> they
>> didn't exist before, you couldn't have possibly used them.
>
> Sure you can.  Suppose you unpack the same tar file or zip file that
> contains one of these new-fangled characters, one on a MacOS 10.5
> system, and one on a MacOS 10.9 system.  How HFS+ will corrupt that
> filename will differ depending which version of MacOS you are running.
> Hence, normalizing the filename when you store it is stupid and
> broken.  MacOS and its applications and libraries want to do
> normalization in the privacy of its own address space, that's it's
> business.  It can pursue any fetish it wants, among consenting adults.
> Safe, sane and consensual, and all that... well, consensual, anyway.
> I'm not sure about "safe" and "sane"....

You're making the huge assumption that the HFS+ normalization  
algorithms will change. As the technote states:

"Platform algorithms tend to evolve with the Unicode standard. The HFS  
Plus algorithms cannot evolve because such evolution would invalidate  
existing HFS Plus volumes."

> My arguement is basically is that there is absolutely no value in what
> HFS+ is doing, by corrupting filenames --- if you want to call it
> "normalizing" them, fine, but since Unicode is not static, so you
> can't even call it a "canonical" form.  It's just some random
> corruption of what was passed in at open(2) time, that can and will
> change depending on what version of MacOS you are running.

Again with the huge assumptions.

> If you want to play the insane Unicode game of "equivalent"
> characters, you have to do it at comparison time, so there's no point
> trying to "normalize" them when you store them.  It doesn't buy you
> anything, and it causes all sorts of pain.

It must have bought somebody something, or they never would have done  
it.

>> Your entire argument is based on the assumption that HFS+ "corrupts"
>> filenames in order to allow dumb clients to do byte comparisons,  
>> and I
>> don't believe that to be the case.
>
> OK, what's your reason for why HFS+ corrupts filenames?  What do you
> think is its excuse?  What problem does it solve?  If the answer is
> "no reason at all, but because it *can*", according to the Great God
> Unicode, then that's really not very impressive....

I have no idea why HFS+ stores filenames in a normalized form, and  
further I am smart enough to know that speculating is completely  
pointless. I assume the authors had a good reason (which should be a  
safe assumption, filesystem authors are a smart bunch). The reason may  
not be valid anymore, but if it was valid back in 1998, then I can  
accept it without complaining.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:53                                                       ` Kevin Ballard
@ 2008-01-21 21:05                                                         ` David Kastrup
  2008-01-21 23:01                                                         ` Dmitry Potapov
  1 sibling, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-21 21:05 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Dmitry Potapov, git

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 3:43 PM, Dmitry Potapov wrote:
>
>> On Mon, Jan 21, 2008 at 11:59:24AM -0500, Kevin Ballard wrote:
>>>
>>> No, it's a question of hashing algorithm. And it's one that's fairly
>>> easily solved simply by picking a specific nonambiguous UTF-8
>>> encoding before hashing.
>>
>> UTF-8 is a *single* encoding, and it maps every Unicode character to
>> a unique binary representation. So, it is completely nonambiguous.
>
> In this case, encoding refers to normalization form, as other people
> have used it in the conversation besides me.

There exists more than one "normalization form" (even across MacOS
platforms), and git is cross-platform.  And people can't be made to
agree on normalization forms, anyway.  You are aware that Unicode code
points are shared between some Chinese and Japanese signs, and that
stroked forms might be composed differently in different languages?  We
don't need to go to the Far East, anyway: in Turkish, İ and i are
equivalent, as are I and ı, whereas in other European languages, I is
instead equivalent to i.  In the Netherlands, ÿ is IIRC equivalent to
ij.  And so on.

> I suggest you stop trying to find inconsequential stuff to argue
> about, especially when a tiny bit of critical thinking would reveal
> the answer.

Now that you have established that you are the only person on the list
capable of critical thinking, how about going elsewhere where you can
find similarly sharp critical thinkers?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 18:12                                                 ` Linus Torvalds
                                                                     ` (2 preceding siblings ...)
  2008-01-21 20:36                                                   ` Dmitry Potapov
@ 2008-01-21 21:06                                                   ` Martin Langhoff
  2008-01-21 21:09                                                     ` David Kastrup
  2008-01-21 21:42                                                     ` Linus Torvalds
  3 siblings, 2 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 7:12 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Now, git _also_ heavily depends on the actual encoding of those
> codepoints, since we create hashes etc, so in fact, as far ass git is
> concerned, names have to be in some particular encoding to be hashed, and
> UTF-8 is the only sane encoding for Unicode. People can blather about
> UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
> simply technically superior in so many ways that I don't even understand
> why anybody ever uses anything else.
>
> So I would not disagree with using UTF-8 at all.

Linus,

(slightly offtopic) are you praising UTF-8 as storage format (for disk
and network) or in general? UTF-8-aware string ops like counting
characters seem to me a horrendous thing at the ASM level.

More on topic, I suspect Kevin's experience is more on end-user apps,
where input sanitization and even canonicalisation are common
practice. From a kernel and filesystems POV, a filename is data as
sacred as file data. On the webapp world, we "corrupt" user input
liberally to avoid XSS attacks and the like. In some cases, these
practices are stupid and can be replaced with escaping data properly,
but in other cases, the web platform is so broken that there's no
option.

At least in Moodle we store *exactly*  what the user POSTed and
cleanup^Wcorrupt it when displaying it, so that if it does happen that
the cleanup was buggy, we never corrupted the data.

So no point in calling eachother stupid this much. Once is enough ;-)
And no point in arguing that something that is ok for an end-user app
is a good design decision for an OS.

martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:56                                                     ` Dmitry Potapov
@ 2008-01-21 21:07                                                       ` Kevin Ballard
  2008-01-21 22:41                                                         ` Dmitry Potapov
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 21:07 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2672 bytes --]

On Jan 21, 2008, at 3:56 PM, Dmitry Potapov wrote:

> On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
>>>
>>> But that is *entirely* a separate issue from "normalization".
>>>
>>> Kevin, you seem to think that normalization is somehow forced on you
>>> by
>>> the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
>>> Normalization is a totally separate decision, and it's a STUPID one,
>>> because it breaks so many of the _nice_ properties of using UTF-8.
>>
>> I'm not saying it's forced on you, I'm saying when you treat  
>> filenames
>> as text,
>
> to treat as text could mean different for different people. Some
> may prefer to fi and fi_ligature to be treated as same in some
> context.

Those people can use NFKC/NFKD (compatibility equivalence). As I've  
said before, I'm talking about canonical equivalence, because that  
doesn't lose information like compatibility equivalence does (ex. the  
fi ligature gets turned into fi in compatibility equivalence, but not  
canonical equivalence).

>> it DOESN'T MATTER if the string gets normalized. As long as
>> the string remains equivalent,
>
> As matter of fact it does, otherwise characters would be the
> same and we would not have this conversation at all. String
> can be equivalent and not equivalent at the time, because there
> are different equivalent relations. Finally, what HFS+ does
> is even not normalization. In the technote, Apple explains
> that they decompose some characters but not others for better
> compatibility. So, you see, there is a PROBLEM here.

Again, I've specified many times that I'm talking about canonical  
equivalence.

And yes, HFS+ does normalization, it just doesn't use NFD. It uses a  
custom variant. I fail to see how this is a problem.

>> Alright, fine. I'm not saying HFS+ is right in storing the normalized
>> version, but I do believe the authors of HFS+ must have had a reason
>> to do that,
>
> I don't say they do that without *any* reason, but I suppose all
> Apple developers in the Copland project had some reasons for they
> did, but the outcome was not very good...

Stupid engineers don't get to work on developing new filesystems. And  
Copland didn't fail because of stupid engineers anyway. If I had to  
blame someone, I'd blame management.

>> The only information you lose when doing canonical normalization is
>> what the original byte sequence was.
>
> Not true. You lose the original sequence of *characters*.

Which is only a problem if you care about the byte sequence, which is  
kinda the whole point of my argument.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:06                                                   ` Martin Langhoff
@ 2008-01-21 21:09                                                     ` David Kastrup
  2008-01-21 21:42                                                     ` Linus Torvalds
  1 sibling, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-21 21:09 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Kevin Ballard, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> (slightly offtopic) are you praising UTF-8 as storage format (for disk
> and network) or in general? UTF-8-aware string ops like counting
> characters seem to me a horrendous thing at the ASM level.

Huh?  Why?  Just count all characters in the range 00-bf.  That's the
exact character count of utf-8 characters.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:53                                                           ` Kevin Ballard
       [not found]                                                             ` <alpine.LFD.1.0! 0.0801211323120.2957@woody.linux-foundation.org>
  2008-01-21 20:58                                                             ` David Kastrup
@ 2008-01-21 21:17                                                             ` Martin Langhoff
  2008-01-21 21:28                                                               ` Kevin Ballard
  2008-01-21 21:33                                                             ` Linus Torvalds
  3 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 21:17 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 9:53 AM, Kevin Ballard <kevin@sb.org> wrote:
> On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:
> > Umm. What's this inability to see that data is data is data?
>
> I'm not sure what you mean. I stated a fact - at least on OS X, the
> filename does not contribute to the listed filesize, so changing the
> encoding of the filename doesn't change the filesize. This isn't a
> philosophical point, it's a factual statement.

Kevin,

as you might know, Linus' "other hobby" is to write kernels ;-) From
taht POV, a filename is as much data as the data in the file. Doing
odd things like sorting it, searching through it, etc, is all work for
code higher in the stack that is free to mangle the data in any way it
wants, including creating nice case-insensitive indexes, and
who-knows-what for ideogram-based languages. In contrast, the core OS
treats user data a sacred stuff, and I'm thankful it does.

And from a kernel/filesystem POV, a directory is also a file. So if a
filename has a different number of octets, the directory will be
different.

For all the searching and matching, it really makes sense to have
something like locate or SpotLight or whatever to index user files
that should be easy to find and match, because all the locale rules
for matching are hideously expensive to apply. Even today, most UTF-8
aware (and supposedly collation-smart) applications have trouble
matching MARTÍN when asked for martín in a case-insensitive search.
That pesky latin í trips them up everytime.

cheers,

martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
       [not found]                                                               ` <6E303071-82A4-4D69-AA0C-EC41168B9AFE@sb.org>
@ 2008-01-21 21:18                                                                 ` Theodore Tso
  2008-01-21 21:43                                                                   ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Theodore Tso @ 2008-01-21 21:18 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 03:58:03PM -0500, Kevin Ballard wrote:
> You're making the huge assumption that the HFS+ normalization algorithms 
> will change. As the technote states:
>
> "Platform algorithms tend to evolve with the Unicode standard. The HFS Plus 
> algorithms cannot evolve because such evolution would invalidate existing 
> HFS Plus volumes."

Great, so even worse.  Does the tech note then specify exactly what
version of Unicode HFS+ is using to do its "normalization"?  Or
exactly what characters it will normalize?  After all, Unicode has
added all sorts of characters since 1998, and I'm sure some of them
were combining characters.

And you *really* want to continue argue that a sane thing for a
cross-platform system to do is to pervert its hash algorithm to take
into account *one* particular OS that happened to freeze a
normalization algorithm at some arbitrary point in time, approximately
nine years ago?  Talk about the tail wagging the dog!!  Especially
when you can't even justify why it was done nine years ago!

> It must have bought somebody something, or they never would have done it.

Your faith in the HFS+ designers is touching.

> I have no idea why HFS+ stores filenames in a normalized form, and further 
> I am smart enough to know that speculating is completely pointless. I 
> assume the authors had a good reason (which should be a safe assumption, 
> filesystem authors are a smart bunch). The reason may not be valid anymore, 
> but if it was valid back in 1998, then I can accept it without complaining.

Well, I *AM* a filesystem designer (ext2/ext3/ext4), and well before
1998, I knew that trying to do anything with Unicode normalization was
a fool's errand.  So if you're going to blindly trust filesystme
designers (not something I would recommend, actually :-), trust me.
What HFS+ is doing is dumb, dumb, dumb.

And even if *you* can accept it, why should the git designers pervert
any core part of git's design to support this behaviour?  Especially
if it's legacy behaviour which will hopefully be going away, say when
MacOS adopts ZFS --- there's an opportunity for them to start afresh,
and not make the same mistakes they made nine years ago!

So why don't you suggest some kind of sane fix in the Mac specific
code that doesn't impact any core part of git, such as its hash
algorithm?  It would be far more productive than trying to defend a
bad design decision made nine years ago....   :-}

					- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:17                                                             ` Martin Langhoff
@ 2008-01-21 21:28                                                               ` Kevin Ballard
  2008-01-21 21:43                                                                 ` Martin Langhoff
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 21:28 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2469 bytes --]

On Jan 21, 2008, at 4:17 PM, Martin Langhoff wrote:

> On Jan 22, 2008 9:53 AM, Kevin Ballard <kevin@sb.org> wrote:
>> On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:
>>> Umm. What's this inability to see that data is data is data?
>>
>> I'm not sure what you mean. I stated a fact - at least on OS X, the
>> filename does not contribute to the listed filesize, so changing the
>> encoding of the filename doesn't change the filesize. This isn't a
>> philosophical point, it's a factual statement.
>
> Kevin,
>
> as you might know, Linus' "other hobby" is to write kernels ;-) From
> taht POV, a filename is as much data as the data in the file. Doing
> odd things like sorting it, searching through it, etc, is all work for
> code higher in the stack that is free to mangle the data in any way it
> wants, including creating nice case-insensitive indexes, and
> who-knows-what for ideogram-based languages. In contrast, the core OS
> treats user data a sacred stuff, and I'm thankful it does.

That's certainly a reasonable POV. However, it's not the only one. As  
evidenced by the Mac, treating filenames as strings rather than bytes  
is a viable alternative POV - you can't argue that it doesn't work,  
because OS X proves it does.

However, it is a trade-off.

> And from a kernel/filesystem POV, a directory is also a file. So if a
> filename has a different number of octets, the directory will be
> different.

Sure, that makes sense. That's why, if you are going to mangle  
filenames, you need to pick a stable form to always use, which HFS+  
does.

> For all the searching and matching, it really makes sense to have
> something like locate or SpotLight or whatever to index user files
> that should be easy to find and match, because all the locale rules
> for matching are hideously expensive to apply. Even today, most UTF-8
> aware (and supposedly collation-smart) applications have trouble
> matching MARTÍN when asked for martín in a case-insensitive search.
> That pesky latin í trips them up everytime.


Perhaps you should try OS X. Every single Cocoa app should do the  
search properly. In fact, I just checked using 3 different text  
engines (WebKit, Cocoa's text engine, and ATSUI) and all 3 did the  
case-insensitive search properly. That said, this isn't particularly  
relevant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:53                                                           ` Kevin Ballard
                                                                               ` (2 preceding siblings ...)
  2008-01-21 21:17                                                             ` Martin Langhoff
@ 2008-01-21 21:33                                                             ` Linus Torvalds
  2008-01-21 21:49                                                               ` Kevin Ballard
  3 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 21:33 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I'm not sure what you mean. I stated a fact - at least on OS X, the filename
> does not contribute to the listed filesize, so changing the encoding of the
> filename doesn't change the filesize. This isn't a philosophical point, it's a
> factual statement.

And my point was that your *whole* argument boils down to "normalization 
is invisible".

When it isn't. It's not invisible for filenames, it's not invisible for 
file contents.

You're trying to claim that normalization cannot matter. I'm just pointing 
out that it sure as hell can. Exactly because lots of things don't 
actually look at data other than as just a Unicode string. They do look at 
the raw format.

And that's true both of file contents and file names.

> I don't, but I do think this discussion revolves around filenames, therefore
> it should not surprise you when I talk about filenames.

I'm surprised that you make generalized sweeping statements about how it's 
ok to normalize because normalization is "invisible", and then when I 
point out that that isn't true, you try to limit it.

And no, that normalization is not invisible EVEN IN FILENAMES. If it was, 
git wouldn't ever have noticed it, would it?

> > And git tries to be a general data tool, not a Unicode-specific one.
> 
> Yes, I realize that. See my previous message about discussing ideal vs
> practicality.

I don't know which argument you're talking about. Git (and, btw, Linux) 
does the "ideal" thing (don't screw up peoples data), and it turns out to 
be the "practical" thing too (it can handle a wider range of cases than OS 
X can).

So no, this is not "ideal" vs "practical". They aren't in any conflict 
here.

> I could argue against this, but frankly, I'm really tired of arguing this same
> point. I suggest we simply agree to disagree, and move on to actually fixing
> the problem.

.. and people have even suggested how. Hide the idiotic OS X choices by 
making a OS X-specific wrapper around readdir() that turns it into NFC.

That's just about the best we can do. We can't *fix* the thing that OS X 
loses information, but a least we can then show the lost information in 
the same form it _probably_ was in originally.

But no, it won't "fix" git on OS X. 

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:06                                                   ` Martin Langhoff
  2008-01-21 21:09                                                     ` David Kastrup
@ 2008-01-21 21:42                                                     ` Linus Torvalds
  2008-01-21 22:45                                                       ` Martin Langhoff
  1 sibling, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 21:42 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Tue, 22 Jan 2008, Martin Langhoff wrote:
> 
> (slightly offtopic) are you praising UTF-8 as storage format (for disk
> and network) or in general? UTF-8-aware string ops like counting
> characters seem to me a horrendous thing at the ASM level.

I'm praising UTF-8 (without normalization) as a wonderful format where you 
can do 99.9% of everything without ever caring about all the expensive 
stuff.

But in order to do that, you really need to avoid normalization, and you 
also need to accept mis-formed UTF-8 strings (because even if it is real 
UTF-8, the string may actually be just a fragment of some larger string).

Once you do that (and _only_ if you do that), then UTF-8 is actually a 
wonderful thing. You can consider it to be a traditional "everything is a 
stream of bytes", and everything that only cares about a stream of byte 
will work wonderfully well.

And then, the (actually relatively few) things that want to do things like 
show things on the screen, or check for equivalence, or worry about width 
of the characters, *those* can still do so. 

So the beauty of UTF-8 is that you can switch between thinking of it like 
just a binary blob and thinking of it like text, and everythign works 
(including the traditional C null-termination).

And yes, that was obviously the explicit design goal. It's a good thing.

> More on topic, I suspect Kevin's experience is more on end-user apps,
> where input sanitization and even canonicalisation are common
> practice.

Sure. And I'm not arguing against them. Knowing the rules for combining 
characters is really important for input and output. 

> At least in Moodle we store *exactly*  what the user POSTed and
> cleanup^Wcorrupt it when displaying it, so that if it does happen that
> the cleanup was buggy, we never corrupted the data.

Absolutely. It's what the kernel does, and I think that's what perl does 
too for their "strings". It works really well. It also allows you to 
handle binary data (ie data that *really* isn't text) with shared routines 
etc etc.

And that's the beauty of non-normalized (and possibly badly formed) UTF-8.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:28                                                               ` Kevin Ballard
@ 2008-01-21 21:43                                                                 ` Martin Langhoff
  0 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 21:43 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 10:28 AM, Kevin Ballard <kevin@sb.org> wrote:
> That's certainly a reasonable POV. However, it's not the only one. As
> evidenced by the Mac, treating filenames as strings rather than bytes
> is a viable alternative POV - you can't argue that it doesn't work,
> because OS X proves it does.

With its own slew of bugs. See Ted's reply earlier for a mouthful of
woe in HFS+ that is not easy to workaround.

> Perhaps you should try OS X. Every single Cocoa app should do the

OSX has given me enough grief with other filesystem and general OS
problems that I have definitely abandoned it after 2 years trying to
use it part-time. It has been back to linux for me ;-)

cheers,



martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:18                                                                 ` Theodore Tso
@ 2008-01-21 21:43                                                                   ` Kevin Ballard
  2008-01-21 21:49                                                                     ` Martin Langhoff
  2008-01-21 22:38                                                                     ` David Kastrup
  0 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 21:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3996 bytes --]

On Jan 21, 2008, at 4:18 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 03:58:03PM -0500, Kevin Ballard wrote:
>> You're making the huge assumption that the HFS+ normalization  
>> algorithms
>> will change. As the technote states:
>>
>> "Platform algorithms tend to evolve with the Unicode standard. The  
>> HFS Plus
>> algorithms cannot evolve because such evolution would invalidate  
>> existing
>> HFS Plus volumes."
>
> Great, so even worse.  Does the tech note then specify exactly what
> version of Unicode HFS+ is using to do its "normalization"?  Or
> exactly what characters it will normalize?  After all, Unicode has
> added all sorts of characters since 1998, and I'm sure some of them
> were combining characters.
>
> And you *really* want to continue argue that a sane thing for a
> cross-platform system to do is to pervert its hash algorithm to take
> into account *one* particular OS that happened to freeze a
> normalization algorithm at some arbitrary point in time, approximately
> nine years ago?  Talk about the tail wagging the dog!!  Especially
> when you can't even justify why it was done nine years ago!

I suggest you go back and read the emails where I specifically stated  
that I'm *not* suggesting this.

>> It must have bought somebody something, or they never would have  
>> done it.
>
> Your faith in the HFS+ designers is touching.

And your arrogance is troubling. Do you really believe you are so  
smart you can claim the HFS+ designers had no reason for this decision?

>> I have no idea why HFS+ stores filenames in a normalized form, and  
>> further
>> I am smart enough to know that speculating is completely pointless. I
>> assume the authors had a good reason (which should be a safe  
>> assumption,
>> filesystem authors are a smart bunch). The reason may not be valid  
>> anymore,
>> but if it was valid back in 1998, then I can accept it without  
>> complaining.
>
> Well, I *AM* a filesystem designer (ext2/ext3/ext4), and well before
> 1998, I knew that trying to do anything with Unicode normalization was
> a fool's errand.  So if you're going to blindly trust filesystme
> designers (not something I would recommend, actually :-), trust me.
> What HFS+ is doing is dumb, dumb, dumb.

Again, I'm not saying that they necessarily did the "correct" thing,  
as I can't evaluate that without knowing their reason. I'm just saying  
there must have been a reason.

> And even if *you* can accept it, why should the git designers pervert
> any core part of git's design to support this behaviour?  Especially
> if it's legacy behaviour which will hopefully be going away, say when
> MacOS adopts ZFS --- there's an opportunity for them to start afresh,
> and not make the same mistakes they made nine years ago!

And why do you believe MacOS is going to adopt ZFS? Sure, they might,  
but assuming stuff about the future is just as bad as assuming stuff  
about the past. And git should "pervert" itself because of the simple  
fact that git has a problem on HFS+. Keeping your code "pure" is all  
well and good, except it's not particularly practical. If the git  
project has any interest in being a viable system on OS X, it really  
should behave properly. I'm sure you have various "perversions" for  
other cases.

> So why don't you suggest some kind of sane fix in the Mac specific
> code that doesn't impact any core part of git, such as its hash
> algorithm?  It would be far more productive than trying to defend a
> bad design decision made nine years ago....   :-}

How many times must I say I never suggested actually changing git's  
hashing algorithm? And if you want me to suggest a fix to git that  
works, first you have to wait for me to learn how git's internals  
work, and frankly, I have too much work on my plate right now to  
devote the time necessary to learning git's internals well enough to  
fix this problem.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:43                                                                   ` Kevin Ballard
@ 2008-01-21 21:49                                                                     ` Martin Langhoff
  2008-01-21 21:57                                                                       ` Kevin Ballard
  2008-01-21 22:38                                                                     ` David Kastrup
  1 sibling, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 21:49 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Theodore Tso, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

On Jan 22, 2008 10:43 AM, Kevin Ballard <kevin@sb.org> wrote:
> How many times must I say I never suggested actually changing git's
> hashing algorithm? And if you want me to suggest a fix to git that
> works, first you have to wait for me to learn how git's internals
> work, and frankly, I have too much work on my plate right now to
> devote the time necessary to learning git's internals well enough to
> fix this problem.

LOL! Spare us the flamefesting and you will have plenty of time for
learning git internals. You might even learn something.

cheers,


martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:33                                                             ` Linus Torvalds
@ 2008-01-21 21:49                                                               ` Kevin Ballard
  2008-01-21 22:34                                                                 ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 21:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3626 bytes --]

On Jan 21, 2008, at 4:33 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>> I'm not sure what you mean. I stated a fact - at least on OS X, the  
>> filename
>> does not contribute to the listed filesize, so changing the  
>> encoding of the
>> filename doesn't change the filesize. This isn't a philosophical  
>> point, it's a
>> factual statement.
>
> And my point was that your *whole* argument boils down to  
> "normalization
> is invisible".
>
> When it isn't. It's not invisible for filenames, it's not invisible  
> for
> file contents.
>
> You're trying to claim that normalization cannot matter. I'm just  
> pointing
> out that it sure as hell can. Exactly because lots of things don't
> actually look at data other than as just a Unicode string. They do  
> look at
> the raw format.
>
> And that's true both of file contents and file names.
>
>> I don't, but I do think this discussion revolves around filenames,  
>> therefore
>> it should not surprise you when I talk about filenames.
>
> I'm surprised that you make generalized sweeping statements about  
> how it's
> ok to normalize because normalization is "invisible", and then when I
> point out that that isn't true, you try to limit it.
>
> And no, that normalization is not invisible EVEN IN FILENAMES. If it  
> was,
> git wouldn't ever have noticed it, would it?

I'm really surprised that, after all of this, you're still horribly  
misunderstanding my argument. I never said it was invisible. NEVER.

I'm also surprised that you seem to care more about this argument then  
my offer to stop arguing and work towards fixing the problem.

>>> And git tries to be a general data tool, not a Unicode-specific one.
>>
>> Yes, I realize that. See my previous message about discussing ideal  
>> vs
>> practicality.
>
> I don't know which argument you're talking about. Git (and, btw,  
> Linux)
> does the "ideal" thing (don't screw up peoples data), and it turns  
> out to
> be the "practical" thing too (it can handle a wider range of cases  
> than OS
> X can).
>
> So no, this is not "ideal" vs "practical". They aren't in any conflict
> here.

You misunderstand my point. In a previous email I specifically used  
the words "ideal" and "practical" to describe arguments, which is what  
I was referring to here.

>> I could argue against this, but frankly, I'm really tired of  
>> arguing this same
>> point. I suggest we simply agree to disagree, and move on to  
>> actually fixing
>> the problem.
>
> .. and people have even suggested how. Hide the idiotic OS X choices  
> by
> making a OS X-specific wrapper around readdir() that turns it into  
> NFC.

And I've responded to that suggestion, multiple times, saying that  
this doesn't actually fix the problem, it only hides it.

> That's just about the best we can do. We can't *fix* the thing that  
> OS X
> loses information, but a least we can then show the lost information  
> in
> the same form it _probably_ was in originally.
>
> But no, it won't "fix" git on OS X.

Quite a while ago it was suggested that git uses a table that maps the  
original byte sequence as seen in the index to the form returned by  
readdir(). So far this has sounded like the best solution, but as I've  
said before I don't know git's internals enough (or, really, at all)  
to be able to work on this myself.

This solution should only "lose" information in the case where the  
index has 2 filenames that HFS+ treats as a single filename.

Is there some reason this won't work?

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:49                                                                     ` Martin Langhoff
@ 2008-01-21 21:57                                                                       ` Kevin Ballard
  2008-01-22  0:36                                                                         ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 21:57 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1594 bytes --]

Again with the bouncing. I gotta figure out how to fix this.
Original CC list: martin.langhoff@gmail.com, tytso@mit.edu, torvalds@linux-foundation.org 
, peter@softwolves.pp.se, mjscod@web.de, melo@simplicidade.org

On Jan 21, 2008, at 4:49 PM, Martin Langhoff wrote:

> On Jan 22, 2008 10:43 AM, Kevin Ballard <kevin@sb.org> wrote:
>> How many times must I say I never suggested actually changing git's
>> hashing algorithm? And if you want me to suggest a fix to git that
>> works, first you have to wait for me to learn how git's internals
>> work, and frankly, I have too much work on my plate right now to
>> devote the time necessary to learning git's internals well enough to
>> fix this problem.
>
> LOL! Spare us the flamefesting and you will have plenty of time for
> learning git internals. You might even learn something.

Ah, so I'm flaming while you are providing a well-reasoned and  
articulate argument? Glad to know the difference.

In any case, you should be very familiar with the fact that writing  
emails and learning code are two vastly different activities that  
require vastly different amounts of concentration and time. I'm  
responding to email while doing other things - were I to replace the  
time spent writing email with learning git's internals, I would be  
pulled away so frequently that I would end up not having learned  
anything. Therefore, I write emails now, and I leave learning git's  
internals until later, when I have the undisturbed time to devote.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:49                                                               ` Kevin Ballard
@ 2008-01-21 22:34                                                                 ` Linus Torvalds
  2008-01-21 22:46                                                                   ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 22:34 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I'm really surprised that, after all of this, you're still horribly
> misunderstanding my argument. I never said it was invisible. NEVER.

You said it was invisible when you treat things "as text". Here's the 
quote:

	.. when you treat filenames as text, it DOESN'T MATTER if the 
 	string gets normalized ..

Without ever apparently realizing that "as text" is part of the problem in 
itself. What is "text" to one person is gibberish to another.

In particular, the biggest reason to not normalize is that you don't know 
it's text or Unicode in the first place. Which is why git doesn't do it.

And no, even with filenames you don't know that they are "text". People 
encode stuff in them. And people don't always use UTF-8. 

Of course, you could ask everybody to create OS X-only programs that know 
that under OS X, you only have a subset of filenames. If so, you're 
complaining about the wrong tool. Especially when the whole point of the 
tool was to be distributed (not to mention coming from an environment that 
simply doesn't have the same silly limitations OS X has).

So here's a few clues:

 - "as text" isn't "as unicode": it may well be Latin1 or EUC-JP or
   something. Yes, it's still used. Git doesn't care, and very consciously 
   has avoided forcing character sets, even if the *default* (and notice 
   how it's overridable) commit message encoding may be utf-8.

 - In fact, even in unicode, the difference between "identical" and 
   "equivalent" strings exists, and even in the standard, unicode 
   strings are very much defined to be arbitrary codepoint sequences, not 
   normalized.

So even for the very specific case of unicode text, it's simply not true 
that "it doesn't matter if the string gets normalized". The unicode spec 
itself talks about cases where even canonical normalization makes a 
difference.

Search for this quote:

  "Not all processes are required to respect canonical equivalence. For 
   example:

    * A function that collects a set of the General_Category values 
      present in a string will and should produce a different value for 
      <angstrom sign, semicolon> than for <A, combining ring above, greek 
      question mark>, even though they are canonically equivalent.
    * A function that does a binary comparison of strings will also find 
      these two sequences different."

and notice that first case. Even things that are *very*much* aware of 
Unicode text do actually have cases where canonical equivalence doesn't 
mean crud.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:43                                                                   ` Kevin Ballard
  2008-01-21 21:49                                                                     ` Martin Langhoff
@ 2008-01-21 22:38                                                                     ` David Kastrup
  2008-01-22  2:34                                                                       ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: David Kastrup @ 2008-01-21 22:38 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Theodore Tso, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 4:18 PM, Theodore Tso wrote:
>
>> Your faith in the HFS+ designers is touching.
>
> And your arrogance is troubling. Do you really believe you are so
> smart you can claim the HFS+ designers had no reason for this
> decision?

No reason?  Don't say where he says that.  No sane reason?  Certainly.
If the visibility of the upsides is not in the same order of magnitude
as that of the downsides (and your "I trust they must have had good
reason" is implicating exactly that), then yes, this appears like a
misdesign, however well-intended.  Because its cleverness hinges on an
what amounts to an arbitrary historic point of stability with only
fleeting convenience.

It reminds me of the self-defeatingproblem haunting the MIPS
(microprocessor without interlocked pipeline stages) architecture: for
pipelined processors, one has to add logic that prevents one command
from working before the results from other commands arrive.  Now the
ingenious idea of the MIPS architecture was to move that logic into the
compiler instead of the hardware.  But then the implications of that
idea got intermingled with binary compatibility and the result was that
the advantages lasted for one processor generation, and afterwards, the
m-stage pipelines needed logic that simulated the n-stage pipeline of
the first MIPS processor rather then a comparatively simple 1-stage
pipeline of a conceptual sequential processor.  Rendering the whole
original idea completely absurd and requiring rather more complicated
rather than simpler hardware as originally envisioned.

The road to hell is paved with good intentions.

>>> The reason may not be valid anymore, but if it was valid back in
>>> 1998, then I can accept it without complaining.

There is no shortage in short-sighted decisions to repeat.  Some
political parties bank on it.

> Again, I'm not saying that they necessarily did the "correct" thing,
> as I can't evaluate that without knowing their reason. I'm just saying
> there must have been a reason.

Jumping to blind faith-based conclusions is never a good move.  You
don't end up improving the work of your predecessors that way.

> And why do you believe MacOS is going to adopt ZFS? Sure, they might,
> but assuming stuff about the future is just as bad as assuming stuff
> about the past. And git should "pervert" itself because of the simple
> fact that git has a problem on HFS+. Keeping your code "pure" is all
> well and good, except it's not particularly practical. If the git
> project has any interest in being a viable system on OS X, it really
> should behave properly.

If OS X has any interest in being a viable system, perios, it really
should behave properly.

> How many times must I say I never suggested actually changing git's
> hashing algorithm? And if you want me to suggest a fix to git that
> works, first you have to wait for me to learn how git's internals
> work, and frankly, I have too much work on my plate right now to
> devote the time necessary to learning git's internals well enough to
> fix this problem.

Then please understand that you have too much work on your plate right
now to devote the time necessary to provide any constructive criticism.
A smart person in this situation would shut up until he has the time.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:07                                                       ` Kevin Ballard
@ 2008-01-21 22:41                                                         ` Dmitry Potapov
  2008-01-21 22:53                                                           ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 22:41 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 04:07:27PM -0500, Kevin Ballard wrote:
> 
> Again, I've specified many times that I'm talking about canonical  
> equivalence.
> 
> And yes, HFS+ does normalization, it just doesn't use NFD. It uses a  
> custom variant. I fail to see how this is a problem.

If you think that HFS+ does normalization then you apparently have no
idea of what the term "normalization" means. Have you? But if you
don't know what is "normalization" then you cannot really know what
canonical equivalence means.

> >
> >I don't say they do that without *any* reason, but I suppose all
> >Apple developers in the Copland project had some reasons for they
> >did, but the outcome was not very good...
> 
> Stupid engineers don't get to work on developing new filesystems.

Assigning someone to work on a new filesystem does not make him
suddenly smart. As to that stupid engineers don't get to work,
it is like saying there is no stupid engineers at all. There are
plenty evidence to contrary. And when management is disastrous
then most idiots with big mouth and little capacity to produce
any useful does get assignment to develop new features, while
those who can actually solve problems are assigned to fix the
next build, because the only thing that this management worries
about how to survive another year or another months...

> And  
> Copland didn't fail because of stupid engineers anyway. If I had to  
> blame someone, I'd blame management.

But if the code was so good then why was most of that code thrown away
later when management was changed? Still bad management?

> 
> >>The only information you lose when doing canonical normalization is
> >>what the original byte sequence was.
> >
> >Not true. You lose the original sequence of *characters*.
> 
> Which is only a problem if you care about the byte sequence, which is  
> kinda the whole point of my argument.

Byte sequences are not an issue here. If the filesystem used UTF-16 to
store filenames, that would NOT cause this problem, because characters
would be the same even though bytes stored on the disk were different.
So, what you actually lose here is the original sequence of *characters*.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:42                                                     ` Linus Torvalds
@ 2008-01-21 22:45                                                       ` Martin Langhoff
  0 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 22:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kevin Ballard, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 10:42 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> I'm praising UTF-8 (without normalization) as a wonderful format where you
> can do 99.9% of everything without ever caring about all the expensive
> stuff.

*thanks* for these notes. Very useful, and...

...
> And then, the (actually relatively few) things that want to do things like
> show things on the screen, or check for equivalence, or worry about width
> of the characters, *those* can still do so.

I find the above amusing -- different worlds we live in. Programming
webapps means that 90% of the code deals with a bit of metaprogramming
(with lots of string manipulation) to talk SQL to a backend, and then
doing lots of string manipulation on the data the DB returns, which
ends up in humongous strings of goop otherwise known as HTML+CSS+JS.
After waiting for the DB to return data, over 50% of cpu time is spent
in regexes, concatenations, counting words, array ops, etc. So it is
pretty significant.

So now I have to worry about cost and correctness of stuff that I took
for granted in the pre-unicode days - strtolower() can be quite
expensive and... buggy! But that's mainly due to Unicode, not UTF8. I
think the only slowdown I can pin on UTF-8 is in counting chars, and
probably slower regexes. Not that I deal with the C implementation of
any of this stuff -- and so happy about it! ;-)

</offtopic>

(...)

> And that's the beauty of non-normalized (and possibly badly formed) UTF-8.

I had a few issues with Perl v5.6's utf-8 handling that wasn't binary
safe (fread() to a fixed-length buffer would break the input if a
unicode char landed across the boundary - ouch!) -- made me think that
you couldn't do this in binary safe ways. So I tend to tell Perl to
treatfiles as binary, and switch to utf-8 in specially chosen spots. I
suspect that 5.8 is a bit saner about this, but I'm not taking
chances.

cheers,

martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:34                                                                 ` Linus Torvalds
@ 2008-01-21 22:46                                                                   ` Kevin Ballard
  2008-01-21 22:56                                                                     ` Martin Langhoff
                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 4001 bytes --]

On Jan 21, 2008, at 5:34 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>> I'm really surprised that, after all of this, you're still horribly
>> misunderstanding my argument. I never said it was invisible. NEVER.
>
> You said it was invisible when you treat things "as text". Here's the
> quote:
>
> 	.. when you treat filenames as text, it DOESN'T MATTER if the
> 	string gets normalized ..
>
> Without ever apparently realizing that "as text" is part of the  
> problem in
> itself. What is "text" to one person is gibberish to another.

Which is actually a good argument as to why filenames should be  
enforced as UTF-8.

> In particular, the biggest reason to not normalize is that you don't  
> know
> it's text or Unicode in the first place. Which is why git doesn't do  
> it.

Sure, I understand why git doesn't do it. I'm saying in a system which  
uses unicode top-to-bottom, which you can create if you're using HFS+  
only, can do it. On HFS+ you know the filename is unicode.

> And no, even with filenames you don't know that they are "text".  
> People
> encode stuff in them. And people don't always use UTF-8.

Again, I was talking about a system that used unicode top-to-bottom.  
On HFS+ you have to use UTF-8 for your filename or it simply won't work.

> Of course, you could ask everybody to create OS X-only programs that  
> know
> that under OS X, you only have a subset of filenames. If so, you're
> complaining about the wrong tool. Especially when the whole point of  
> the
> tool was to be distributed (not to mention coming from an  
> environment that
> simply doesn't have the same silly limitations OS X has).
>
> So here's a few clues:
>
> - "as text" isn't "as unicode": it may well be Latin1 or EUC-JP or
>   something. Yes, it's still used. Git doesn't care, and very  
> consciously
>   has avoided forcing character sets, even if the *default* (and  
> notice
>   how it's overridable) commit message encoding may be utf-8.
>
> - In fact, even in unicode, the difference between "identical" and
>   "equivalent" strings exists, and even in the standard, unicode
>   strings are very much defined to be arbitrary codepoint sequences,  
> not
>   normalized.
>
> So even for the very specific case of unicode text, it's simply not  
> true
> that "it doesn't matter if the string gets normalized". The unicode  
> spec
> itself talks about cases where even canonical normalization makes a
> difference.
>
> Search for this quote:
>
>  "Not all processes are required to respect canonical equivalence. For
>   example:
>
>    * A function that collects a set of the General_Category values
>      present in a string will and should produce a different value for
>      <angstrom sign, semicolon> than for <A, combining ring above,  
> greek
>      question mark>, even though they are canonically equivalent.
>    * A function that does a binary comparison of strings will also  
> find
>      these two sequences different."
>
> and notice that first case. Even things that are *very*much* aware of
> Unicode text do actually have cases where canonical equivalence  
> doesn't
> mean crud.

I find it amusing that you keep arguing against having git treat  
filenames as unicode when, if you had actually taken my advice and  
read my previous email talking about "ideal" vs "practical", you'd  
realize that I was not suggesting git should. I was simply describing  
why having the filesystem specifically treat filenames as utf-8 isn't  
a problem when the entire system is unicode-aware, and thus showing  
how the problems that are cropping up in git aren't because the  
filesystem treats filenames as unicode, but rather because the  
filesystem treats filenames differently than other filesystems. In  
other words, I was trying to illustrate that HFS+ isn't wrong, it's  
just different, and the difference is causing the problem.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:41                                                         ` Dmitry Potapov
@ 2008-01-21 22:53                                                           ` Kevin Ballard
  2008-01-21 23:21                                                             ` Dmitry Potapov
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 22:53 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3551 bytes --]

On Jan 21, 2008, at 5:41 PM, Dmitry Potapov wrote:

> On Mon, Jan 21, 2008 at 04:07:27PM -0500, Kevin Ballard wrote:
>>
>> Again, I've specified many times that I'm talking about canonical
>> equivalence.
>>
>> And yes, HFS+ does normalization, it just doesn't use NFD. It uses a
>> custom variant. I fail to see how this is a problem.
>
> If you think that HFS+ does normalization then you apparently have no
> idea of what the term "normalization" means. Have you? But if you
> don't know what is "normalization" then you cannot really know what
> canonical equivalence means.

I would go look up specifics to back me up, but my DNS is screwing up  
right now so I can't access most of the internet. In any case, there  
are 4 standard normalization forms - NFC, NFD, NFKC, NFKD. If there  
are others, they aren't notable enough to be listed in the resource I  
was reading. HFS+ uses a variant on NFD - it's a well-defined variant,  
and thus can safely be called its own normalization form. I fail to  
see how this means it's not "normalization".

>>> I don't say they do that without *any* reason, but I suppose all
>>> Apple developers in the Copland project had some reasons for they
>>> did, but the outcome was not very good...
>>
>> Stupid engineers don't get to work on developing new filesystems.
>
> Assigning someone to work on a new filesystem does not make him
> suddenly smart. As to that stupid engineers don't get to work,
> it is like saying there is no stupid engineers at all. There are
> plenty evidence to contrary. And when management is disastrous
> then most idiots with big mouth and little capacity to produce
> any useful does get assignment to develop new features, while
> those who can actually solve problems are assigned to fix the
> next build, because the only thing that this management worries
> about how to survive another year or another months...

I'm not talking about assigning engineers, I'm saying developing a new  
filesystem, especially one that's proven itself to be usable and  
extendable for the last decade, is something that only smart engineers  
would be capable of doing.

>> And
>> Copland didn't fail because of stupid engineers anyway. If I had to
>> blame someone, I'd blame management.
>
> But if the code was so good then why was most of that code thrown away
> later when management was changed? Still bad management?

Yes. Even the best of engineers will produce crap code when overworked  
and required to implement new features instead of fixing bugs and  
stabilizing the system. Copland is well-known to have suffered from  
featuritis, to the extent that it was practically impossible to test  
in any sane fashion. Bad management can kill any project regardless of  
how good the engineers are.

>>>> The only information you lose when doing canonical normalization is
>>>> what the original byte sequence was.
>>>
>>> Not true. You lose the original sequence of *characters*.
>>
>> Which is only a problem if you care about the byte sequence, which is
>> kinda the whole point of my argument.
>
> Byte sequences are not an issue here. If the filesystem used UTF-16 to
> store filenames, that would NOT cause this problem, because characters
> would be the same even though bytes stored on the disk were different.
> So, what you actually lose here is the original sequence of  
> *characters*.

I've already talked about that, but you are apparently incapable of  
understanding.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:46                                                                   ` Kevin Ballard
@ 2008-01-21 22:56                                                                     ` Martin Langhoff
       [not found]                                                                       ` <53C76BEA-2232-4940-8776-9DF1880089A4@sb.org>
  2008-01-21 23:00                                                                     ` Theodore Tso
  2008-01-21 23:44                                                                     ` Linus Torvalds
  2 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 22:56 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 11:46 AM, Kevin Ballard <kevin@sb.org> wrote:
> Again, I was talking about a system that used unicode top-to-bottom.
> On HFS+ you have to use UTF-8 for your filename or it simply won't work.

Hmmm. I m pretty sure HFS+ has a lot of problems if you run OSX as an
NFS server with clients in different encodings. It would never work in
real life. The "envelope" OSs have to work in is hugely varied -- much
more so than any other apps. You should try writing one someday ;-)

> other words, I was trying to illustrate that HFS+ isn't wrong, it's
> just different, and the difference is causing the problem.

Did you spot the rather nasty issues that Ted mentioned earlier in the
thread? I would say HFS+ is a bit "special" rather than "different".

cheers,



m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:46                                                                   ` Kevin Ballard
  2008-01-21 22:56                                                                     ` Martin Langhoff
@ 2008-01-21 23:00                                                                     ` Theodore Tso
  2008-01-21 23:09                                                                       ` Kevin Ballard
  2008-01-21 23:44                                                                     ` Linus Torvalds
  2 siblings, 1 reply; 260+ messages in thread
From: Theodore Tso @ 2008-01-21 23:00 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 05:46:27PM -0500, Kevin Ballard wrote:
> I find it amusing that you keep arguing against having git treat filenames 
> as unicode when, if you had actually taken my advice and read my previous 
> email talking about "ideal" vs "practical"...

If by "ideal" you mean a world where 100% of all computers were
designed by Steve Jobs, you might have a point.  But trying to argue
for such a state of idealism seems to be stupid, and certainly a
complete waste of everyone's time on the git mailing list.  It's
simply not reality.  It's like with the infamous resource forks, which
would have worked fine if all the world were MacOS, but which had a
tendency to get stripped off whenver you used a program that wasn't
resource fork aware, like zip, or a protocol that wasn't resource fork
aware, like FTP.  And so people had to put in all sorts of kludges
like BinHex to work around MacOS's "if only the entire world was like
*me*, no one would get hurt" attitude.  In some ways, the MacOS
designers are even worse than Microsoft in terms of having the "the
world revolves around us" attitude.

> In other words, I was trying to illustrate that 
> HFS+ isn't wrong, it's just different, and the difference is causing the 
> problem.

And if you want to interoperate with the rest of the world, where at
least count over 92% of computers are NOT running HFS+, then "Thinking
Different" is indeed causing the problem, yes.  And whose fault is that?

The whole point of interoperability is that when we communicate, we
have to do so in a uniform and predictable way.  If we can't, the next
best thing is to have protocol translators; but in order to do that,
we must avoid lossy transformations, such as HFS+'s
pseudo-normalization.  (Why, by the way, will not result in a "normal"
form for any glyph which can be encoded with and without a combining
character if said glyph was introduced into Unicode after 1988.  So
you can't even call it a "normalization" algorithm, but just a
pseudo-normalization transformation which is lossy and which DESTROYS
filename information in an irrecoverable way.)

	      	     	     	 	  	    - Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 20:53                                                       ` Kevin Ballard
  2008-01-21 21:05                                                         ` David Kastrup
@ 2008-01-21 23:01                                                         ` Dmitry Potapov
  1 sibling, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 23:01 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: David Kastrup, git

On Mon, Jan 21, 2008 at 03:53:10PM -0500, Kevin Ballard wrote:
> On Jan 21, 2008, at 3:43 PM, Dmitry Potapov wrote:
> 
> >On Mon, Jan 21, 2008 at 11:59:24AM -0500, Kevin Ballard wrote:
> >>
> >>No, it's a question of hashing algorithm. And it's one that's fairly
> >>easily solved simply by picking a specific nonambiguous UTF-8  
> >>encoding
> >>before hashing.
> >
> >UTF-8 is a *single* encoding, and it maps every Unicode character to
> >a unique binary representation. So, it is completely nonambiguous.
> 
> In this case, encoding refers to normalization form,

I thought we spoke about HFS+, and it does not use any normalization
form, because normalization should produce binary identitical strings
for equivalent strings and HFS+ conversion does not. So, it looks
like you redefine both words "encoding" and "normalization" here.

> as other people  
> have used it in the conversation besides me.

All your arguments based on confusion and the fact that some other
people were probably confused does not make your arguments any more
valid.

> I suggest you stop trying to find inconsequential stuff to argue  
> about, especially when a tiny bit of critical thinking would reveal  
> the answer.

IMHO, most of your arguments are inconsequential stuff, so I am not
sure what I am supposed to do about your writings. Probably, it does
not make sense to respond your mails anymore...

As to critical thinking, it definitely reveals that Apple's choice
was far from being. Is it so difficult to accept?

Anyway, if you think that you know better than other how to properly
deal with the problem, why don't you try to actually *do* something
and write some code that works as your propose.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
       [not found]                                                                       ` <53C76BEA-2232-4940-8776-9DF1880089A4@sb.org>
@ 2008-01-21 23:05                                                                         ` Kevin Ballard
  2008-01-21 23:16                                                                         ` Martin Langhoff
  1 sibling, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 23:05 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2026 bytes --]

Yet another bounce.
Original CC: martin.langhoff@gmail.com, torvalds@linux-foundation.org, peter@softwolves.pp.se 
, mjscod@web.de, melo@simplicidade.org

On Jan 21, 2008, at 6:02 PM, Kevin Ballard wrote:

> On Jan 21, 2008, at 5:56 PM, Martin Langhoff wrote:
>
>> On Jan 22, 2008 11:46 AM, Kevin Ballard <kevin@sb.org> wrote:
>>> Again, I was talking about a system that used unicode top-to-bottom.
>>> On HFS+ you have to use UTF-8 for your filename or it simply won't  
>>> work.
>>
>> Hmmm. I m pretty sure HFS+ has a lot of problems if you run OSX as an
>> NFS server with clients in different encodings. It would never work  
>> in
>> real life. The "envelope" OSs have to work in is hugely varied --  
>> much
>> more so than any other apps. You should try writing one someday ;-)
>
> I'd imagine writing an OS to be a horrifically complicated task. And  
> yes, I can certainly imagine HFS+ might have issues when used to  
> back an NFS server with other clients, but that still leads back to  
> the original point, which is that all these problems stem from the  
> differences between HFS+ and other filesystems, not any inherent  
> problem with HFS+ itself.
>
>>> other words, I was trying to illustrate that HFS+ isn't wrong, it's
>>> just different, and the difference is causing the problem.
>>
>> Did you spot the rather nasty issues that Ted mentioned earlier in  
>> the
>> thread? I would say HFS+ is a bit "special" rather than "different".
>
> IIRC, the biggest problem he talked about was the changing unicode  
> standard, but since the technote appears to state that HFS+ will not  
> be changing its normalization algorithms to preserve backwards  
> compatibility with existing volumes, that doesn't appear to be a  
> nasty issue after all. Is there another issue I've failed to address  
> in this thread?
>
> -Kevin Ballard
>
> -- 
> Kevin Ballard
> http://kevin.sb.org
> kevin@sb.org
> http://www.tildesoft.com
>
>

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 23:00                                                                     ` Theodore Tso
@ 2008-01-21 23:09                                                                       ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-21 23:09 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2536 bytes --]

On Jan 21, 2008, at 6:00 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 05:46:27PM -0500, Kevin Ballard wrote:
>> I find it amusing that you keep arguing against having git treat  
>> filenames
>> as unicode when, if you had actually taken my advice and read my  
>> previous
>> email talking about "ideal" vs "practical"...
>
> If by "ideal" you mean a world where 100% of all computers were
> designed by Steve Jobs, you might have a point.

NO NO NO NO NO. READ MY EMAIL. STOP MAKING ASSUMPTIONS ABOUT WHAT I'M  
TALKING ABOUT.

The most frustrating thing about this thread is everybody keeps  
arguing about what they *assume* I'm talking about without actually  
bothering to read what I'm saying.

>> In other words, I was trying to illustrate that
>> HFS+ isn't wrong, it's just different, and the difference is  
>> causing the
>> problem.
>
> And if you want to interoperate with the rest of the world, where at
> least count over 92% of computers are NOT running HFS+, then "Thinking
> Different" is indeed causing the problem, yes.  And whose fault is  
> that?

And if you want to interoperate with the rest of the world, where at  
least count over 92% of computers are running Windows, then using  
another OS is stupid, right? Right? I mean, if everyone else is doing  
it, we should too, shouldn't we?

> The whole point of interoperability is that when we communicate, we
> have to do so in a uniform and predictable way.  If we can't, the next
> best thing is to have protocol translators; but in order to do that,
> we must avoid lossy transformations, such as HFS+'s
> pseudo-normalization.  (Why, by the way, will not result in a "normal"
> form for any glyph which can be encoded with and without a combining
> character if said glyph was introduced into Unicode after 1988.  So
> you can't even call it a "normalization" algorithm, but just a
> pseudo-normalization transformation which is lossy and which DESTROYS
> filename information in an irrecoverable way.)

Sure it's normalization, it's just not using one of the standard  
forms. But the form is well-defined.

And yes, protocol translators are a good idea. That's why I thought  
the original suggestion of using a table to map index filenames <-> HFS 
+ filenames sounded like it could work. The only time that should fail  
is if the index contains multiple filenames that HFS+ will treat as a  
single filename. Is there a problem with this approach?

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
       [not found]                                                                       ` <53C76BEA-2232-4940-8776-9DF1880089A4@sb.org>
  2008-01-21 23:05                                                                         ` Kevin Ballard
@ 2008-01-21 23:16                                                                         ` Martin Langhoff
  2008-01-22  0:30                                                                           ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-21 23:16 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 12:02 PM, Kevin Ballard <kevin@sb.org> wrote:
> I'd imagine writing an OS to be a horrifically complicated task. And yes, I
> can certainly imagine HFS+ might have issues when used to back an NFS server
> with other clients, but that still leads back to the original point, which
> is that all these problems stem from the differences between HFS+ and other
> filesystems, not any inherent problem with HFS+ itself.

Right. If you are defining the requirements for a new FS on a new OS,
would you not include a requirement that says "must not add any funny
rule that prevents clean interoperation with other filesystems or
OSs"? Forgetting that requirement is... a big one! And if someone asks
"how do we do nice user-friendly filename matching with these
technical differences that users mostly don't care about"... wouldn't
you say "do it in the GUI facilities, changing the FS to handle this
is wrong because it will break the OS as a server, as a reliable file
storage"?

FSs have pretty hard requirements these days -- all the modern FS
you've heard about respect the requirement above, and a ton more that
you have to be in the FS business to be aware of. Mostly anyway,
wherever they don't, users have all sorts of trouble.

> IIRC, the biggest problem he talked about was the changing unicode standard,
> but since the technote appears to state that HFS+ will not be changing its
> normalization algorithms to preserve backwards compatibility with existing
> volumes, that doesn't appear to be a nasty issue after all. Is there another
> issue I've failed to address in this thread?

Well, Ted answered that part, noting that then the "normalisation" is
patchy, and everyone else is left to guess what chars are normalised
and what chars aren't so being HFS+ compatible becomes a very weird
game indeed. You didn't reply to his explanation -- you called him
arrogant instead. Did you manage to read to the end if his email?

The HFS+ designers mucked it up -- and then papered over it with the
OSX libraries. But a good chunk of the world does not use them, they
forgot about the little "interop" requirement.

cheers,

m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:53                                                           ` Kevin Ballard
@ 2008-01-21 23:21                                                             ` Dmitry Potapov
  0 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-21 23:21 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Mon, Jan 21, 2008 at 05:53:50PM -0500, Kevin Ballard wrote:
> 
> I would go look up specifics to back me up, but my DNS is screwing up  
> right now so I can't access most of the internet.

Then you are lucky that your mails reach this ML without problem.

> In any case, there  
> are 4 standard normalization forms - NFC, NFD, NFKC, NFKD. If there  
> are others, they aren't notable enough to be listed in the resource I  
> was reading. HFS+ uses a variant on NFD - it's a well-defined variant,  
> and thus can safely be called its own normalization form. I fail to  
> see how this means it's not "normalization".

The defining property of normalization is producing binary identitical
strings for equivalent strings, IOW, normalization allows you to tell
what strings are equivalent and what are not just by binary comparision.
HFS+ decomposition lacks that property, because strings are not fully
decomposed thus being comparision of equivalent strings may give false
result.

> 
> I'm not talking about assigning engineers, I'm saying developing a new  
> filesystem, especially one that's proven itself to be usable and  
> extendable for the last decade, is something that only smart engineers  
> would be capable of doing.

You know, many people still use FAT, but somehow I don't think that
FAT is good despite of it being extendable for more than a decade...
Apparently, HFS+ was not worst part of the Copland project, but I
see no evidence to think that it was developed by the best engineers.

> >>And
> >>Copland didn't fail because of stupid engineers anyway. If I had to
> >>blame someone, I'd blame management.
> >
> >But if the code was so good then why was most of that code thrown away
> >later when management was changed? Still bad management?
> 
> Yes. Even the best of engineers will produce crap code when overworked  
> and required to implement new features instead of fixing bugs and  
> stabilizing the system. 

I don't think that anyone asked them to implement so much new features.
AFAIK, it was very difficult (nearly impossible) to get anyone to work
on stabilizing existing software and fixing existing bugs in it.

> Copland is well-known to have suffered from  
> featuritis, to the extent that it was practically impossible to test  
> in any sane fashion.

Exactly. IMHO, both management and developers are equally responsible
for that feature-mania.

> Bad management can kill any project regardless of  
> how good the engineers are.

Sure.

> >Byte sequences are not an issue here. If the filesystem used UTF-16 to
> >store filenames, that would NOT cause this problem, because characters
> >would be the same even though bytes stored on the disk were different.
> >So, what you actually lose here is the original sequence of  
> >*characters*.
> 
> I've already talked about that, but you are apparently incapable of  
> understanding.

Well, it is *you* who is incapable of understanding anything, even
basic terms as encoding and normalization...

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:46                                                                   ` Kevin Ballard
  2008-01-21 22:56                                                                     ` Martin Langhoff
  2008-01-21 23:00                                                                     ` Theodore Tso
@ 2008-01-21 23:44                                                                     ` Linus Torvalds
  2008-01-22  0:47                                                                       ` Kevin Ballard
  2 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-21 23:44 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I find it amusing that you keep arguing against having git treat filenames as
> unicode when

NO I DO NOT!

Dammit, stop this idiocy.

I think it's fine having git treat filenames "as unicode", as long as you 
don't do any munging on it.

Why? Because if it's utf-8, then treating them "as unicode" means exactly 
the same as treating them "as a user-specified string".

So stop lying about this whole thing. I have never *ever* argued against 
unicode per se.

All my complaints - every single one of them - comes down to making the 
idiotic choice of trying to munge those strings (not even strictly 
"normalize") into something they are not.

And what you don't seem to understand is that once you accept _unmodified_ 
raw UTF-8 as a good unicode transport mechanism, suddenly other encodings 
are possible. I'm not out to force my world-view on users. If they are 
using legacy encodings (whether in filenames *or* in commit texts or in 
their file contents), that's *their* choice.

I actually personally happen to use UTF-8-encoded unicode.

I'm just not stupid enough to think that (a) corrupting it is a good idea, 
*or* (b) that I should force every Asian installation of git to also force 
people to use unicode (or even having all the conversion libraries and 
overheads!)

So stop this idiotic "unicode == normalization" crap. 

I'm a huge fan of UTF-8. But that does not mean that I think normalization 
is a good idea.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 23:16                                                                         ` Martin Langhoff
@ 2008-01-22  0:30                                                                           ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  0:30 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3053 bytes --]


On Jan 21, 2008, at 6:16 PM, Martin Langhoff wrote:

> On Jan 22, 2008 12:02 PM, Kevin Ballard <kevin@sb.org> wrote:
>> I'd imagine writing an OS to be a horrifically complicated task.  
>> And yes, I
>> can certainly imagine HFS+ might have issues when used to back an  
>> NFS server
>> with other clients, but that still leads back to the original  
>> point, which
>> is that all these problems stem from the differences between HFS+  
>> and other
>> filesystems, not any inherent problem with HFS+ itself.
>
> Right. If you are defining the requirements for a new FS on a new OS,
> would you not include a requirement that says "must not add any funny
> rule that prevents clean interoperation with other filesystems or
> OSs"? Forgetting that requirement is... a big one! And if someone asks
> "how do we do nice user-friendly filename matching with these
> technical differences that users mostly don't care about"... wouldn't
> you say "do it in the GUI facilities, changing the FS to handle this
> is wrong because it will break the OS as a server, as a reliable file
> storage"?

Sure, but you have to remember, HFS+ was developed back for Mac OS 8,  
which really wasn't a very good server machine.

> FSs have pretty hard requirements these days -- all the modern FS
> you've heard about respect the requirement above, and a ton more that
> you have to be in the FS business to be aware of. Mostly anyway,
> wherever they don't, users have all sorts of trouble.
>
>> IIRC, the biggest problem he talked about was the changing unicode  
>> standard,
>> but since the technote appears to state that HFS+ will not be  
>> changing its
>> normalization algorithms to preserve backwards compatibility with  
>> existing
>> volumes, that doesn't appear to be a nasty issue after all. Is  
>> there another
>> issue I've failed to address in this thread?
>
> Well, Ted answered that part, noting that then the "normalisation" is
> patchy, and everyone else is left to guess what chars are normalised
> and what chars aren't so being HFS+ compatible becomes a very weird
> game indeed. You didn't reply to his explanation -- you called him
> arrogant instead. Did you manage to read to the end if his email?

I've read every single email in this thread, all the way through. Ted  
was arguing against calling it "normalization". If you want to argue  
that it's using a non-standard normal form, go ahead, but surely you  
can figure out that can you simply re-normalize it to whatever form  
you want.

> The HFS+ designers mucked it up -- and then papered over it with the
> OSX libraries. But a good chunk of the world does not use them, they
> forgot about the little "interop" requirement.

Sure, maybe they did forget about interop. Or maybe they developed  
this back on Mac OS 8 where the only real competitor was Windows, and  
they didn't have to worry about the Mac being used as an NFS server,  
and thus interop wasn't even a requirement.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 21:57                                                                       ` Kevin Ballard
@ 2008-01-22  0:36                                                                         ` Johannes Schindelin
  2008-01-22  0:42                                                                           ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-22  0:36 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

Hi,

On Mon, 21 Jan 2008, Kevin Ballard wrote:

> On Jan 21, 2008, at 4:49 PM, Martin Langhoff wrote:
> 
> > LOL! Spare us the flamefesting and you will have plenty of time for 
> > learning git internals. You might even learn something.
> 
> Ah, so I'm flaming while you are providing a well-reasoned and 
> articulate argument? Glad to know the difference.

ENOUGH ALREADY!

Yes, you are flaming.  You sent easily over 30 totally useless mails in 
this thread.  Over a couple of days.

And Martin is right, in that same amount of time, you could have learnt 
the internals of git _easily_.  Especially since I provided an own chapter 
in the manual for people like you.

So instead of _PESTERING_ us with things that we do _NOT CARE_ about, you 
could _DO SOMETHING USEFUL_ instead.

For example, read up that chapter, and not type any _POINTLESS_ mails in 
that _UTTERLY POINTLESS_ thread anymore.

I hoped that subtle _HINTS_ would give you an _IDEA_, but _EVIDENTLY_ I 
have to use _ALL-CAPS_ on you.

So _EITHER_ read up _OR_ go away, but _DO NOT BOTHER_ to post _ANY_ 
response without patches in this _DARNED_ thread.

Sheesh,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  0:36                                                                         ` Johannes Schindelin
@ 2008-01-22  0:42                                                                           ` Kevin Ballard
  2008-01-22  0:48                                                                             ` David Kastrup
                                                                                               ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  0:42 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1769 bytes --]

On Jan 21, 2008, at 7:36 PM, Johannes Schindelin wrote:

> Hi,
>
> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>
>> On Jan 21, 2008, at 4:49 PM, Martin Langhoff wrote:
>>
>>> LOL! Spare us the flamefesting and you will have plenty of time for
>>> learning git internals. You might even learn something.
>>
>> Ah, so I'm flaming while you are providing a well-reasoned and
>> articulate argument? Glad to know the difference.
>
> ENOUGH ALREADY!
>
> Yes, you are flaming.  You sent easily over 30 totally useless mails  
> in
> this thread.  Over a couple of days.

And so has EVERYONE ELSE. You cannot hold me to a standard which you  
yourself do not apply.

> And Martin is right, in that same amount of time, you could have  
> learnt
> the internals of git _easily_.  Especially since I provided an own  
> chapter
> in the manual for people like you.

As I said before, I've been responding to emails in the midst of doing  
other things. So no, I can't learn an entirely new system in my off- 
time between other tasks, but I can respond to emails.

> So instead of _PESTERING_ us with things that we do _NOT CARE_  
> about, you
> could _DO SOMETHING USEFUL_ instead.

You sure argue a lot for something you don't care about.

I'm especially annoyed since I have made MANY offerings to stop this  
argument and work towards a solution, but NOBODY ELSE seems to care  
enough to actually accept that offer.

And treating me like a moron (using all caps?) just shows how bad you  
are at actually reading my emails. I've been giving you and everyone  
else the courtesy of actually reading your emails, glad to know nobody  
else has bothered to read to the end of mine.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 23:44                                                                     ` Linus Torvalds
@ 2008-01-22  0:47                                                                       ` Kevin Ballard
  2008-01-22  1:01                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  0:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 3197 bytes --]

Please read to the bottom of this email. As near as I can figure out,  
you haven't done that on any of my previous emails.

On Jan 21, 2008, at 6:44 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>> I find it amusing that you keep arguing against having git treat  
>> filenames as
>> unicode when
>
> NO I DO NOT!
>
> Dammit, stop this idiocy.
>
> I think it's fine having git treat filenames "as unicode", as long  
> as you
> don't do any munging on it.

When I say "treat filenames as unicode" I'm implying the equivalence  
comparisons and everything else that we've been talking about.

> Why? Because if it's utf-8, then treating them "as unicode" means  
> exactly
> the same as treating them "as a user-specified string".

If that's what "as unicode" meant, then the phrase "as unicode" has  
zero meaning.

> So stop lying about this whole thing. I have never *ever* argued  
> against
> unicode per se.

No, you've argued against unicode equivalency in filenames. Can't you  
figure out, when the entire time I've been talking about equivalency,  
that I'm *still* talking about equivalency?

> All my complaints - every single one of them - comes down to making  
> the
> idiotic choice of trying to munge those strings (not even strictly
> "normalize") into something they are not.

Yes, I understand quite well that you are against munging strings.

> And what you don't seem to understand is that once you accept  
> _unmodified_
> raw UTF-8 as a good unicode transport mechanism, suddenly other  
> encodings
> are possible. I'm not out to force my world-view on users. If they are
> using legacy encodings (whether in filenames *or* in commit texts or  
> in
> their file contents), that's *their* choice.

You're not using raw UTF-8, you're just using raw bytes. Calling it  
UTF-8 doesn't mean anything, since you don't actually know that's what  
it is. But this is fairly irrelevant.

> I actually personally happen to use UTF-8-encoded unicode.
>
> I'm just not stupid enough to think that (a) corrupting it is a good  
> idea,
> *or* (b) that I should force every Asian installation of git to also  
> force
> people to use unicode (or even having all the conversion libraries and
> overheads!)
>
> So stop this idiotic "unicode == normalization" crap.
>
> I'm a huge fan of UTF-8. But that does not mean that I think  
> normalization
> is a good idea.

How many times must I say the same thing over and over? I'm not  
arguing that forced normalization is a good thing. I'm arguing that,  
in a system which is unicode-aware top to bottom, forced normalization  
is irrelevant to the user, since they don't care about the exact byte  
sequence. And I'm also arguing that git should have some solution to  
this problem. I find it interesting that you're perfectly happy to  
rant and rail against your misperception of my argument, and yet you  
consistently and repeatedly ignore my offers to stop this argument and  
work towards a solution, as well as my comments on existing proposed  
solutions.

Are you even reading to the end of my emails?

- Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  0:42                                                                           ` Kevin Ballard
@ 2008-01-22  0:48                                                                             ` David Kastrup
  2008-01-22  1:06                                                                             ` Martin Langhoff
  2008-01-22  1:34                                                                             ` Johannes Schindelin
  2 siblings, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-22  0:48 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, git

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 7:36 PM, Johannes Schindelin wrote:
>
>> And Martin is right, in that same amount of time, you could have
>> learnt the internals of git _easily_.  Especially since I provided an
>> own chapter in the manual for people like you.
>
> As I said before, I've been responding to emails in the midst of doing
> other things. So no, I can't learn an entirely new system in my off-
> time between other tasks, but I can respond to emails.

But there is no point in responding to emails as long as you don't have
a clue what you are actually talking about.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  0:47                                                                       ` Kevin Ballard
@ 2008-01-22  1:01                                                                         ` Linus Torvalds
  2008-01-22  1:13                                                                           ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-22  1:01 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:

> > I think it's fine having git treat filenames "as unicode", as long as you
> > don't do any munging on it.
> 
> When I say "treat filenames as unicode" I'm implying the equivalence
> comparisons and everything else that we've been talking about.

Yes, because you're an idiot.

I've told you over and over again that equivalence is stupid.

It's stupid when it's "equivalent except for case", and it's stupid when 
it's "canonically equivalent".

> No, you've argued against unicode equivalency in filenames. Can't you figure
> out, when the entire time I've been talking about equivalency, that I'm
> *still* talking about equivalency?

I agree: normalization and equivalency is idiotic.

But the two actually go hand in hand:

> > All my complaints - every single one of them - comes down to making the
> > idiotic choice of trying to munge those strings (not even strictly
> > "normalize") into something they are not.
> 
> Yes, I understand quite well that you are against munging strings.

You don't seem to.

The thing is, the two are inexorably intertwined. Any filename equivalence 
(except for the trivial "identity" equivalence) INVARIABLY means that 
filenames get munged.

Why?

Think about the file name "Abc", and think about what happens when you 
create it.

Now, think about what happens if that filename is considered equivalent in 
case..

See? The filesystem has to *corrupt* the filename.

Can you not UNDERSTAND this? Equivalence and normalization is STUPID. It's 
just two sides of the exact same coin. They both INVARIABLY cause the 
filename to be munged.

And changing user data is not acceptable.

Do you get it now?

			Linus "probably not" Torvalds

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  0:42                                                                           ` Kevin Ballard
  2008-01-22  0:48                                                                             ` David Kastrup
@ 2008-01-22  1:06                                                                             ` Martin Langhoff
  2008-01-22  1:34                                                                             ` Johannes Schindelin
  2 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-22  1:06 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, git

On Jan 22, 2008 1:42 PM, Kevin Ballard <kevin@sb.org> wrote:
> And so has EVERYONE ELSE. You cannot hold me to a standard which you
> yourself do not apply.

Hi Kevin,

not sure if you are just joking, but perhaps you have not noticed that
in technical lists like these, you get karma to voice strong opinions
once you've contributed lots of good code. *That* is the standard -
and Johannes has earned his karma points, as Ted and Linus have, by
learning the slow way and writing tons of code. Alas, you don't seem
to have time for such a thing!

Strong opinions without working code tend to not get much respect. So
that is the standard that is applied to all of us, including you.

cheers,

martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  1:01                                                                         ` Linus Torvalds
@ 2008-01-22  1:13                                                                           ` Linus Torvalds
  2008-01-22  2:33                                                                             ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-22  1:13 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Linus Torvalds wrote:
> 
> Think about the file name "Abc", and think about what happens when you 
> create it.
> 
> Now, think about what happens if that filename is considered equivalent in 
> case..
> 
> See? The filesystem has to *corrupt* the filename.

Let me make this really clear, because I'm afraid that you won't get it 
when I leave out any steps of the way.

Let us say that there is a filename "xyz" that is equivalent to a filename 
"abc" in *any* way. It does not matter if xyz/abc is Hello/hello, or 
whether it's two canonically equivalent strings.

So now, do

	close(open(xyz, O_WRONLY | O_CREAT, 0666));
	close(open(abc, O_WRONLY | O_CREAT, 0666));

and then look at the directory contents afterwards.

There are two, and only two, choices here (*):
 - the filesystem created both files, and they show up as created
 - the filesystem decided they were equivalent, and munged one (or both) 
   of them

Now, let's go back to my claim:
 - munging user data is unacceptable
and realize that equivalence BY DEFINITION must do it.

So no, you do *not* get to have your cake and eat it too. You simply 
fundamentally *cannot* have both filename equivalence and a non-munging 
filesystem. See above why.

		Linus

(*) Actually, there is third choice above, which is:

 - the filesystem created the first file, and errored out on the second 
   because it noticed it was equivalent - but not identical - to one it 
   already had

   This one is actually a perfectly fine choice, but it's not "your" kind 
   of equivalence, since it actually makes a difference between two 
   equivalent but non-identical names. So the filenames aren't actually 
   interchangable, and this case is really more of a "the filesystem has 
   some very specific limitations on what it allows".

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  0:42                                                                           ` Kevin Ballard
  2008-01-22  0:48                                                                             ` David Kastrup
  2008-01-22  1:06                                                                             ` Martin Langhoff
@ 2008-01-22  1:34                                                                             ` Johannes Schindelin
  2008-01-22  1:53                                                                               ` Martin Langhoff
  2 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-22  1:34 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

Hi,

On Mon, 21 Jan 2008, Kevin Ballard wrote:

> On Jan 21, 2008, at 7:36 PM, Johannes Schindelin wrote:
> 
> > On Mon, 21 Jan 2008, Kevin Ballard wrote:
> > 
> > > On Jan 21, 2008, at 4:49 PM, Martin Langhoff wrote:
> > > 
> > > > LOL! Spare us the flamefesting and you will have plenty of time 
> > > > for learning git internals. You might even learn something.
> > > 
> > > Ah, so I'm flaming while you are providing a well-reasoned and 
> > > articulate argument? Glad to know the difference.
> > 
> > ENOUGH ALREADY!
> > 
> > Yes, you are flaming.  You sent easily over 30 totally useless mails 
> > in this thread.  Over a couple of days.
> 
> And so has EVERYONE ELSE. You cannot hold me to a standard which you 
> yourself do not apply.

While I was playing chess, unsuspectingly, there were 40 mails in this 
useless thread.  You sent 15 of them.

Now, just making a _conservative_ guess, git@vger.kernel.org has about 
50000 subscribers (I know for a fact this figure is too low, so it is 
conservative).

I estimate all except for 10 of them, let's be conservative, 100, did not 
care about your lognorrhea.  They needed at least 0.5 seconds to delete 
those mails _you_ are responsible for.

Congratulations, you cost at least 500 seconds today.  And many nerves.

The sad thing: I read about two people the other day, Singh and McKinstry, 
who I'd rather see writing mails here.

I don't care if you honour my contributions to git.  I don't care about 
you.

You must get a kick out of annoying people.  Why don't you try to play 
chicken with a train?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  1:34                                                                             ` Johannes Schindelin
@ 2008-01-22  1:53                                                                               ` Martin Langhoff
  2008-01-22  2:03                                                                                 ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-22  1:53 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Kevin Ballard, git

On Jan 22, 2008 2:34 PM, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:

Hey guys. Let's stop right here -- Kevin has perhaps been annoying but
this is a *technical* argument, so let's go back to working code.

Anyone send a patch to clear the air? *please*?





m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  1:53                                                                               ` Martin Langhoff
@ 2008-01-22  2:03                                                                                 ` Johannes Schindelin
  0 siblings, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-22  2:03 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Kevin Ballard, git

Hi,

On Tue, 22 Jan 2008, Martin Langhoff wrote:

> On Jan 22, 2008 2:34 PM, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> 
> Hey guys. Let's stop right here -- Kevin has perhaps been annoying but 
> this is a *technical* argument, so let's go back to working code.
> 
> Anyone send a patch to clear the air? *please*?

Not me.  This thread was not _my_ fault.

And I'll be _damned_ if I encourage people to be a pain in the rear end, 
in order to get other people to write code/patches for them.

I think it is clear who has the obligation of contributing some _code_, 
for a change.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  1:13                                                                           ` Linus Torvalds
@ 2008-01-22  2:33                                                                             ` Kevin Ballard
  2008-01-22  2:50                                                                               ` Linus Torvalds
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  2:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2769 bytes --]

Linus, have you even bothered to read my arguments, or do you just get  
a kick out of building these straw man arguments? You have  
consistently failed to actually address what I'm talking about, and  
instead persist in explaining stuff I already know, as if that was the  
answer to anything I've been talking about. You are clearly incapable  
of understanding my basic point, no matter how simple I break it down.  
I suspect it's because you've been working low-level so long you can't  
think high-level, and so you manage to misinterpret my high-level  
arguments as boneheaded low-level mistakes.

Anyway, please see my countless former emails where I ask to work  
towards a solution instead of just arguing.

-Kevin Ballard

On Jan 21, 2008, at 8:13 PM, Linus Torvalds wrote:

>
>
> On Mon, 21 Jan 2008, Linus Torvalds wrote:
>>
>> Think about the file name "Abc", and think about what happens when  
>> you
>> create it.
>>
>> Now, think about what happens if that filename is considered  
>> equivalent in
>> case..
>>
>> See? The filesystem has to *corrupt* the filename.
>
> Let me make this really clear, because I'm afraid that you won't get  
> it
> when I leave out any steps of the way.
>
> Let us say that there is a filename "xyz" that is equivalent to a  
> filename
> "abc" in *any* way. It does not matter if xyz/abc is Hello/hello, or
> whether it's two canonically equivalent strings.
>
> So now, do
>
> 	close(open(xyz, O_WRONLY | O_CREAT, 0666));
> 	close(open(abc, O_WRONLY | O_CREAT, 0666));
>
> and then look at the directory contents afterwards.
>
> There are two, and only two, choices here (*):
> - the filesystem created both files, and they show up as created
> - the filesystem decided they were equivalent, and munged one (or  
> both)
>   of them
>
> Now, let's go back to my claim:
> - munging user data is unacceptable
> and realize that equivalence BY DEFINITION must do it.
>
> So no, you do *not* get to have your cake and eat it too. You simply
> fundamentally *cannot* have both filename equivalence and a non- 
> munging
> filesystem. See above why.
>
> 		Linus
>
> (*) Actually, there is third choice above, which is:
>
> - the filesystem created the first file, and errored out on the second
>   because it noticed it was equivalent - but not identical - to one it
>   already had
>
>   This one is actually a perfectly fine choice, but it's not "your"  
> kind
>   of equivalence, since it actually makes a difference between two
>   equivalent but non-identical names. So the filenames aren't actually
>   interchangable, and this case is really more of a "the filesystem  
> has
>   some very specific limitations on what it allows".
>

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-21 22:38                                                                     ` David Kastrup
@ 2008-01-22  2:34                                                                       ` Kevin Ballard
  2008-01-22  7:51                                                                         ` David Kastrup
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  2:34 UTC (permalink / raw)
  To: David Kastrup
  Cc: Theodore Tso, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 862 bytes --]

On Jan 21, 2008, at 5:38 PM, David Kastrup wrote:

>> How many times must I say I never suggested actually changing git's
>> hashing algorithm? And if you want me to suggest a fix to git that
>> works, first you have to wait for me to learn how git's internals
>> work, and frankly, I have too much work on my plate right now to
>> devote the time necessary to learning git's internals well enough to
>> fix this problem.
>
> Then please understand that you have too much work on your plate right
> now to devote the time necessary to provide any constructive  
> criticism.
> A smart person in this situation would shut up until he has the time.

A smart person would not join the conversation late and respond to  
points that have already been exhausted ages ago.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  2:33                                                                             ` Kevin Ballard
@ 2008-01-22  2:50                                                                               ` Linus Torvalds
  2008-01-22  3:04                                                                                 ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-22  2:50 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> Anyway, please see my countless former emails where I ask to work towards a
> solution instead of just arguing.

We know what the solution is:

 - The OS X filesystem _is_ crap (and you seem to have almost admitted as 
   much by your comment that the HFS+ designers did it back in the dark 
   ages and didn't mean for it to ever be a server filesystem anyway)

 - But we can at least make a wrapper around readdir() return the NFC form 
   on OS X, and effectively hide much of the fallout from the crap.

There is no way around it. Your "solutions" all seem to boil down to 
asking git to do the same idiotic crap that OS X does, taking all the 
same performance hits, and just generally doing crap just to work around 
crap in your favourite OS.

And no, making git be stupid just to suit a stupid filesystem simply isn't 
going to happen.

So how about you see _my_ point instead: OS X may have an inferior 
filesystem, but we don't have to make git inferior just for that. The fact 
that OS X does case independence is *its* problem, not git's.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  2:50                                                                               ` Linus Torvalds
@ 2008-01-22  3:04                                                                                 ` Kevin Ballard
  2008-01-22  3:17                                                                                   ` Linus Torvalds
                                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  3:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2135 bytes --]

On Jan 21, 2008, at 9:50 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:
>>
>> Anyway, please see my countless former emails where I ask to work  
>> towards a
>> solution instead of just arguing.
>
> We know what the solution is:
>
> - The OS X filesystem _is_ crap (and you seem to have almost  
> admitted as
>   much by your comment that the HFS+ designers did it back in the dark
>   ages and didn't mean for it to ever be a server filesystem anyway)

I agree that HFS+ isn't well suited for tasks which it is being asked  
to do. I was never arguing that it was the perfect filesystem. But  
that hardly matters now, I know nobody's going to bother understanding  
my argument so I may as well just stop trying.

> - But we can at least make a wrapper around readdir() return the NFC  
> form
>   on OS X, and effectively hide much of the fallout from the crap.

Again, I don't think that's the correct solution. What about the  
translation table that was suggested back at the beginning of the  
thread? That would solve the case insensitivity issue as well, whereas  
this NFC "solution" does nothing for that.

> There is no way around it. Your "solutions" all seem to boil down to
> asking git to do the same idiotic crap that OS X does, taking all the
> same performance hits, and just generally doing crap just to work  
> around
> crap in your favourite OS.

No, I am not asking git to do the same thing HFS+ does. You just  
persist in misinterpreting my arguments, no matter how many times I  
protest that this is not what I am saying.

> And no, making git be stupid just to suit a stupid filesystem simply  
> isn't
> going to happen.
>
> So how about you see _my_ point instead: OS X may have an inferior
> filesystem, but we don't have to make git inferior just for that.  
> The fact
> that OS X does case independence is *its* problem, not git's.

So, what, you're saying git shouldn't do any work at all to try and  
behave nicer on OS X? Because OS X sure as hell can't change to suit  
git.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  3:04                                                                                 ` Kevin Ballard
@ 2008-01-22  3:17                                                                                   ` Linus Torvalds
  2008-01-22  3:21                                                                                   ` Martin Langhoff
       [not found]                                                                                   ` <20080122133427.GB17804@mit.edu>
  2 siblings, 0 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-22  3:17 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Peter Karlsson, Mark Junker, Pedro Melo, git@vger.kernel.org

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> No, I am not asking git to do the same thing HFS+ does. You just persist in
> misinterpreting my arguments, no matter how many times I protest that this is
> not what I am saying.

Sure you do. You continue to say that unicode is the only choice, and you 
continue to say that unicode requires that equivalent names be considered 
the same.

What part of that was I mis-interpreting?

> So, what, you're saying git shouldn't do any work at all to try and behave
> nicer on OS X? Because OS X sure as hell can't change to suit git.

Umm. Git works perfectly fine on OS X, and it's not like we can do a whole 
lot more about it, exactly because we cannot fix the real problem. We can 
hide some of the fallout (idiotic choice of normalization), but the bigger 
issues we can hardly even do anything about (case independence).

And quite frankly, you've also made sure that I have absolutely zero 
interest in even trying to help people with it.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  3:04                                                                                 ` Kevin Ballard
  2008-01-22  3:17                                                                                   ` Linus Torvalds
@ 2008-01-22  3:21                                                                                   ` Martin Langhoff
  2008-01-22  4:22                                                                                     ` Kevin Ballard
       [not found]                                                                                   ` <20080122133427.GB17804@mit.edu>
  2 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-22  3:21 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

On Jan 22, 2008 4:04 PM, Kevin Ballard <kevin@sb.org> wrote:
> Again, I don't think that's the correct solution. What about the
> translation table that was suggested back at the beginning of the
> thread? That would solve the case insensitivity issue as well, whereas
> this NFC "solution" does nothing for that.

Kevin,

you seem to know the problem fairly well. Could you write up a set of
testcases that show the bug? See the "t" directory in the git sources
-- you don't need to learn much about git internals, they are just
shell scripts (mostly, I think there's some perl there too). That
could lead to a good contribution to the project.

... and keep you from telling everyone else that you know better how
to hack a project that you know nothing about ;-)

(...)
> So, what, you're saying git shouldn't do any work at all to try and
> behave nicer on OS X?

Kevin - for your edification, that question is usually referred to as
"trolling" in this place we call the internet. Linus outlined what his
technical plan is, so git will probably do something designed by
someone who knows a thing or two about git's internals. So when you
pretend that he is saying the opposite of what he is saying... well,
people do get upset.

cheers,


m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  3:21                                                                                   ` Martin Langhoff
@ 2008-01-22  4:22                                                                                     ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-22  4:22 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 1015 bytes --]

On Jan 21, 2008, at 10:21 PM, Martin Langhoff wrote:

> On Jan 22, 2008 4:04 PM, Kevin Ballard <kevin@sb.org> wrote:
>> Again, I don't think that's the correct solution. What about the
>> translation table that was suggested back at the beginning of the
>> thread? That would solve the case insensitivity issue as well,  
>> whereas
>> this NFC "solution" does nothing for that.
>
> Kevin,
>
> you seem to know the problem fairly well. Could you write up a set of
> testcases that show the bug? See the "t" directory in the git sources
> -- you don't need to learn much about git internals, they are just
> shell scripts (mostly, I think there's some perl there too). That
> could lead to a good contribution to the project.

See now this is actually a very good suggestion. I probably should  
have done this long ago. Thank you very much for actually responding  
about the problem. You are the first person to do so.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-22  2:34                                                                       ` Kevin Ballard
@ 2008-01-22  7:51                                                                         ` David Kastrup
  0 siblings, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-22  7:51 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Theodore Tso, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Kevin Ballard <kevin@sb.org> writes:

> On Jan 21, 2008, at 5:38 PM, David Kastrup wrote:
>
>>> How many times must I say I never suggested actually changing git's
>>> hashing algorithm? And if you want me to suggest a fix to git that
>>> works, first you have to wait for me to learn how git's internals
>>> work, and frankly, I have too much work on my plate right now to
>>> devote the time necessary to learning git's internals well enough to
>>> fix this problem.
>>
>> Then please understand that you have too much work on your plate right
>> now to devote the time necessary to provide any constructive
>> criticism.
>> A smart person in this situation would shut up until he has the time.
>
> A smart person would not join the conversation late and respond to
> points that have already been exhausted ages ago.

Find somebody willing to explain to you the difference between Email and
IRC, and how to read "Date:" headers.  I have no doubt you'll be able to
grasp the basic involved principles in less than a week.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
       [not found]                                                                                   ` <20080122133427.GB17804@mit.edu>
@ 2008-01-23  0:08                                                                                     ` Theodore Tso
  2008-01-23  0:38                                                                                       ` Kevin Ballard
  2008-01-23  0:38                                                                                       ` Linus Torvalds
  0 siblings, 2 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-23  0:08 UTC (permalink / raw)
  To: git; +Cc: Kevin Ballard

On Tue, Jan 22, 2008 at 08:34:27AM -0500, Theodore Tso wrote:
> 	* Documenting HFS+'s current pseudo-normalization algorithm.
> 	  It's not enough to say that you need to decompose all
> 	  Unicode characters, since you've claimed that HFS+ doesn't
> 	  decompose Unicode characters after some magic date,
> 	  presumably roughly 9 years ago.

I did some research on this point, since if we really are going to be
compatible with MacOS X's crappy HFS+ system, we need to know what the
decomposition algorithm actually is.  Turns out, there are *two* of
them.  Kevin didn't know what he was talking about.  In fact,
different versions of Mac OS X use different normalization algorithms.

Mac OS X 8.1 through 10.2.x used decompositions based on Unicode 2.1.
Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]

As I correctly predicted, Apple is changing their normalization
algorithm in different versions of Mac OS X.  It is not static, which
meands there will be compatibility problems when moving hard drives
between Mac OS X versions.  I don't know if they try to fix this in
their fsck or not, when upgrading from 10.2 to 10.3, but if not,
certain files could disappear as part of the Mac OS X upgrade.  Fun
fun fun.

And clearly Kevin didn't read the tech note very carefully, since it
clearly admits why they did it.  The Mac OS X developers were being
cheasy with how they implemented their HFS B-tree algorithms, and took
the cheap, easy way out.  So yeah, "crappy" is the only word that can
be used for what Mac OS X perpetuated on the world.  Because of that,
a quick Google search shows it causes problems all over the stack, for
many different programs beyond just git, including limewire and
gnutella[2][3], Slim[4], and no doubt others.

[1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
[3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html
[4] http://forums.slimdevices.com/showthread.php?t=40582

In any case, it seems pretty clear that by now everyone except Kevin
has realized that HFS+ is crappy and causes Internet-wide
interoperability problems.  So I'll justify sending this note by
pointing out the specific table of Mac OS's filesystem corruption
algorithm can be found here:

	  http://developer.apple.com/technotes/tn/tn1150table.html

I'd also recommend that the Mac OS X code try to either figure out
whether it is running on an HFS+ partition, or let the HFS+ workaround
code be something that can be controlled via .git/config.  It
shouldn't be on unconditionally even on a Mac OS X system, since if
the git repository is on a ZFS or NFS filesystem, there's no reason to
pay the overhead of working around the HFS+ bugs.

						- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:08                                                                                     ` Theodore Tso
@ 2008-01-23  0:38                                                                                       ` Kevin Ballard
  2008-01-23  1:47                                                                                         ` Martin Langhoff
                                                                                                           ` (2 more replies)
  2008-01-23  0:38                                                                                       ` Linus Torvalds
  1 sibling, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23  0:38 UTC (permalink / raw)
  To: Theodore Tso; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 5932 bytes --]

On Jan 22, 2008, at 7:08 PM, Theodore Tso wrote:

> On Tue, Jan 22, 2008 at 08:34:27AM -0500, Theodore Tso wrote:
>> 	* Documenting HFS+'s current pseudo-normalization algorithm.
>> 	  It's not enough to say that you need to decompose all
>> 	  Unicode characters, since you've claimed that HFS+ doesn't
>> 	  decompose Unicode characters after some magic date,
>> 	  presumably roughly 9 years ago.
>
> I did some research on this point, since if we really are going to be
> compatible with MacOS X's crappy HFS+ system, we need to know what the
> decomposition algorithm actually is.  Turns out, there are *two* of
> them.  Kevin didn't know what he was talking about.  In fact,
> different versions of Mac OS X use different normalization algorithms.
>
> Mac OS X 8.1 through 10.2.x used decompositions based on Unicode 2.1.
> Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]
>
> As I correctly predicted, Apple is changing their normalization
> algorithm in different versions of Mac OS X.  It is not static, which
> meands there will be compatibility problems when moving hard drives
> between Mac OS X versions.  I don't know if they try to fix this in
> their fsck or not, when upgrading from 10.2 to 10.3, but if not,
> certain files could disappear as part of the Mac OS X upgrade.  Fun
> fun fun.
>
> And clearly Kevin didn't read the tech note very carefully, since it
> clearly admits why they did it.  The Mac OS X developers were being
> cheasy with how they implemented their HFS B-tree algorithms, and took
> the cheap, easy way out.  So yeah, "crappy" is the only word that can
> be used for what Mac OS X perpetuated on the world.  Because of that,
> a quick Google search shows it causes problems all over the stack, for
> many different programs beyond just git, including limewire and
> gnutella[2][3], Slim[4], and no doubt others.
>
> [1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
> [2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
> [3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html
> [4] http://forums.slimdevices.com/showthread.php?t=40582
>
> In any case, it seems pretty clear that by now everyone except Kevin
> has realized that HFS+ is crappy and causes Internet-wide
> interoperability problems.  So I'll justify sending this note by
> pointing out the specific table of Mac OS's filesystem corruption
> algorithm can be found here:
>
> 	  http://developer.apple.com/technotes/tn/tn1150table.html
>
> I'd also recommend that the Mac OS X code try to either figure out
> whether it is running on an HFS+ partition, or let the HFS+ workaround
> code be something that can be controlled via .git/config.  It
> shouldn't be on unconditionally even on a Mac OS X system, since if
> the git repository is on a ZFS or NFS filesystem, there's no reason to
> pay the overhead of working around the HFS+ bugs.

I just finished talking to one of the HFS+ developers, so I suspect I  
know a lot more on this subject now than you do. Here's some of the  
relevant information:

* Any new characters added to Unicode will only have one form  
(decomposed), so HFS+ will always accept new characters as they will  
be NFD. The only exception is case-sensitivity, as the case-folding  
tables in HFS+ are static, so new characters with case variants will  
be treated in a case-sensitive manner. However, as they are already  
decomposed, the NFD algorithm will not change their encoding. This  
means that no, there are zero problems moving HFS+ drives between  
versions of OS X.

* At the time HFS+ was developed, there was no one common standard for  
normalization. The HFS+ developers picked NFD because they thought it  
was "a more flexible, future-looking form", but Microsoft ended up  
picking the opposite just a short time later. Interestingly, NFC is a  
weird hybrid form which only has composed forms for pre-existing  
characters, and decomposed forms for all new characters (as they only  
have one form). So in a sense NFD is more sane then NFC.

* The core issue here, which is why you think HFS+ is so stupid, is  
that you guys see no problem with having 2 files "Märchen" (NFC) and  
"Märchen" (NFD), whereas the HFS+ developers don't consider it  
acceptable to have 2 visually identical names as independent files.  
Unfortunately, the only way to do this matching is to store the  
normalized form in the filesystem, because it would be a performance  
nightmare to try and do this matching any other way. The HFS+  
developers considered it an acceptable trade-off, and as an  
application developer I tend to agree with them.

As I have stated in the past, this isn't a case of HFS+ being stupid  
and causing problems, it's a case of HFS+ being *different* and  
causing problems. But this difference is just as much your fault as it  
is HFS+'s fault.

* For detecting case-sensitive filesystems you can use pathconf(2):  
_PC_CASE_SENSITIVE (if unsupported, you can assume the filesystem is  
case-sensitive). There is also the getattrlist(2) attribute:  
VOL_CAP_FMT_CASE_SENSITIVE.

There appears to be no API for determining if normalization will be  
applied. However, any filesystem that uses UTF-8 explicitly as storage  
(unlike the Linux filesystems, which you claim use UTF-8 but is  
obviously you really use nothing at all) is pretty much guaranteed to  
have to normalize or it will have abysmal performance.

I must say it is shocking that someone as smart as you is still more  
interested in finding ways to prove me wrong then to actually address  
the problem. It's obvious that the only research you did was intended  
to find ways to call me stupid.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:08                                                                                     ` Theodore Tso
  2008-01-23  0:38                                                                                       ` Kevin Ballard
@ 2008-01-23  0:38                                                                                       ` Linus Torvalds
  2008-01-23  1:14                                                                                         ` Martin Langhoff
                                                                                                           ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-23  0:38 UTC (permalink / raw)
  To: Theodore Tso; +Cc: git, Kevin Ballard



On Tue, 22 Jan 2008, Theodore Tso wrote:
> 
> I'd also recommend that the Mac OS X code try to either figure out
> whether it is running on an HFS+ partition, or let the HFS+ workaround
> code be something that can be controlled via .git/config.  It
> shouldn't be on unconditionally even on a Mac OS X system, since if
> the git repository is on a ZFS or NFS filesystem, there's no reason to
> pay the overhead of working around the HFS+ bugs.

One thing I'd like somebody to check: what _does_ happen with OS X and NFS 
(OS X as a client, not server)? In particular:

 - Is it suddenly sane and case-sensitive?

 - Does the NFS client do any unicode conversion?

I tried to google for it, but didn't find the right keywords to get 
anything useful out of that modern-day internet oracle.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Linus Torvalds
@ 2008-01-23  1:14                                                                                         ` Martin Langhoff
  2008-01-23  1:16                                                                                         ` Kevin Ballard
  2008-01-23  1:33                                                                                         ` Theodore Tso
  2 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-23  1:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, git, Kevin Ballard

On Jan 23, 2008 1:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> One thing I'd like somebody to check: what _does_ happen with OS X and NFS
> (OS X as a client, not server)? In particular:
>
>  - Is it suddenly sane and case-sensitive?

Yes. Similarlty with UFS partitions. After much grief with
case-insensitivity on OSX I reinstalled the OS on a UFS partition,
only to find that most 3rd party apps can't cope with case-sensitive
FSs (this was a while ago, I hope it's gotten better).

>  - Does the NFS client do any unicode conversion?

Don't know, unfortunately. I suspect both bits of mangling happen in
the fs code.


martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Linus Torvalds
  2008-01-23  1:14                                                                                         ` Martin Langhoff
@ 2008-01-23  1:16                                                                                         ` Kevin Ballard
  2008-01-23  1:27                                                                                           ` Martin Langhoff
  2008-01-23  1:33                                                                                         ` Theodore Tso
  2 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23  1:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, git

[-- Attachment #1: Type: text/plain, Size: 1393 bytes --]

On Jan 22, 2008, at 7:38 PM, Linus Torvalds wrote:

> On Tue, 22 Jan 2008, Theodore Tso wrote:
>>
>> I'd also recommend that the Mac OS X code try to either figure out
>> whether it is running on an HFS+ partition, or let the HFS+  
>> workaround
>> code be something that can be controlled via .git/config.  It
>> shouldn't be on unconditionally even on a Mac OS X system, since if
>> the git repository is on a ZFS or NFS filesystem, there's no reason  
>> to
>> pay the overhead of working around the HFS+ bugs.
>
> One thing I'd like somebody to check: what _does_ happen with OS X  
> and NFS
> (OS X as a client, not server)? In particular:
>
> - Is it suddenly sane and case-sensitive?
>
> - Does the NFS client do any unicode conversion?
>
> I tried to google for it, but didn't find the right keywords to get
> anything useful out of that modern-day internet oracle.

Straight from the horse's mouth, so to speak:

>> Here's one further question: How does OS X behave as an NFS client?  
>> Does it do any unicode normalization? Is it case-sensitive?
>>
> No conversions are done.. the MacOS X nfs client just sends
> whatever string it was passed to the server.  If I connect
> to a MacOS X server exporting an HFS file system, I can
> "touch FOO" and then "rm foo" and the rm will work.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  1:16                                                                                         ` Kevin Ballard
@ 2008-01-23  1:27                                                                                           ` Martin Langhoff
  0 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-23  1:27 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Linus Torvalds, Theodore Tso, git

On Jan 23, 2008 2:16 PM, Kevin Ballard <kevin@sb.org> wrote:
> > If I connect
> > to a MacOS X server exporting an HFS file system, I can
> > "touch FOO" and then "rm foo" and the rm will work.

So this bit of insanity can affect users on other OSs too, if they use
git on an NFS mountpoint hosted on OSX/HFS+.

IIRC Apple does recommend UFS for servers though. I wonder how XServe
machines ship by default.



m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Linus Torvalds
  2008-01-23  1:14                                                                                         ` Martin Langhoff
  2008-01-23  1:16                                                                                         ` Kevin Ballard
@ 2008-01-23  1:33                                                                                         ` Theodore Tso
  2008-01-23  1:56                                                                                           ` Linus Torvalds
  2008-01-23  6:41                                                                                           ` Mike Hommey
  2 siblings, 2 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-23  1:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Kevin Ballard

On Tue, Jan 22, 2008 at 04:38:37PM -0800, Linus Torvalds wrote:
> One thing I'd like somebody to check: what _does_ happen with OS X and NFS 
> (OS X as a client, not server)? In particular:
> 
>  - Is it suddenly sane and case-sensitive?

Using a Linux server, and a OS X client, over NFS, it is in
case-sensitive.  This is not unexpected, since you can mount UFS
partitions on Mac OS X, or reformat HFS+ filesystems and make them be
case-sensitive.

>  - Does the NFS client do any unicode conversion?

Nope:

# perl -CO -e 'print pack("U",0x00C4)."\n"'  | xargs touch
# ls -l | cat -v
total 0
0 -rw-r--r--   1 nobody  nobody  0 Jan 22 20:30 M-CM-^D

It's pretty clear the Unicode conversion is being done in HFS+, not in
the VFS layer of Mac OS X.

So presumably if and when Mac OS adopts ZFS, they will be able to be
free of this mess, at least if they care about being compatible with
Solaris.

						- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Kevin Ballard
@ 2008-01-23  1:47                                                                                         ` Martin Langhoff
  2008-01-23  2:06                                                                                         ` Theodore Tso
  2008-01-23  8:45                                                                                         ` David Kastrup
  2 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-23  1:47 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Theodore Tso, git

On Jan 23, 2008 1:38 PM, Kevin Ballard <kevin@sb.org> wrote:
> I must say it is shocking

Don't ruin it. You were silent for 12hs and lots of patches and
research on the problem started flowing. If you keep making a nuisance
of yourself, people will turn from helping you to beating you up for
being so annoying.

Perhaps help prepare those tests you said it was a good idea to work
on. If you manage to stay silent a bit, we'll need them soon ;-)


m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  1:33                                                                                         ` Theodore Tso
@ 2008-01-23  1:56                                                                                           ` Linus Torvalds
  2008-01-23  2:02                                                                                             ` Kevin Ballard
  2008-01-23  6:41                                                                                           ` Mike Hommey
  1 sibling, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-23  1:56 UTC (permalink / raw)
  To: Theodore Tso; +Cc: git, Kevin Ballard

On Tue, 22 Jan 2008, Theodore Tso wrote:
> 
> It's pretty clear the Unicode conversion is being done in HFS+, not in
> the VFS layer of Mac OS X.

Ok. That's going to make it both easier and harder for them in the future. 
In particular, it probably means that their VFS layer really has no notion 
of this at all, and it's going to be fairly hard to support any kind of 
generic "backwards compatibility" layer on top of other filesystems.

> So presumably if and when Mac OS adopts ZFS, they will be able to be
> free of this mess, at least if they care about being compatible with
> Solaris.

I wouldn't hold my breadth on ZFS, considering the memory requirements. 
ZFS apparently wants *lots* of memory:

	http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Administration_Considerations
	http://wiki.freebsd.org/ZFSTuningGuide

in fact it seems that the FreeBSD people basically recomment against using 
ZFS on 32-bit kernels because of the memory use issues.

Yes, it could be BSD-specific, but considering Solaris has the same 
recommendation, it sure seems like ZFS isn't ready for prime time on any 
low-end (read: consumer) hardware.

Of course, in a year or two, 2GB will be the norm. Right now it's still 
fairly unusual on Mac hardware outside of the Mac Pro line (which, I 
think, comes with a *minimum* of 2GB), and the people who get it want it 
not for the filesystem caches, but for big photo editing jobs..

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  1:56                                                                                           ` Linus Torvalds
@ 2008-01-23  2:02                                                                                             ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23  2:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, git

[-- Attachment #1: Type: text/plain, Size: 2524 bytes --]

On Jan 22, 2008, at 8:56 PM, Linus Torvalds wrote:

> On Tue, 22 Jan 2008, Theodore Tso wrote:
>>
>> It's pretty clear the Unicode conversion is being done in HFS+, not  
>> in
>> the VFS layer of Mac OS X.
>
> Ok. That's going to make it both easier and harder for them in the  
> future.
> In particular, it probably means that their VFS layer really has no  
> notion
> of this at all, and it's going to be fairly hard to support any kind  
> of
> generic "backwards compatibility" layer on top of other filesystems.

HFS+ was developed on Mac OS 8, which I believe didn't have the notion  
of a VFS, or at least not one that would have been in any way capable  
of doing the case-insensitivity and normalization necessary. However,  
I'm not sure what you mean by a "backwards compatibility" layer on  
other filesystems - if you mean treating another filesystem like HFS+,  
well, if you're using a filesystem that doesn't do normalization then  
the VFS really shouldn't do it for you.

>> So presumably if and when Mac OS adopts ZFS, they will be able to be
>> free of this mess, at least if they care about being compatible with
>> Solaris.
>
> I wouldn't hold my breadth on ZFS, considering the memory  
> requirements.
> ZFS apparently wants *lots* of memory:
>
> 	http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Administration_Considerations
> 	http://wiki.freebsd.org/ZFSTuningGuide
>
> in fact it seems that the FreeBSD people basically recomment against  
> using
> ZFS on 32-bit kernels because of the memory use issues.
>
> Yes, it could be BSD-specific, but considering Solaris has the same
> recommendation, it sure seems like ZFS isn't ready for prime time on  
> any
> low-end (read: consumer) hardware.
>
> Of course, in a year or two, 2GB will be the norm. Right now it's  
> still
> fairly unusual on Mac hardware outside of the Mac Pro line (which, I
> think, comes with a *minimum* of 2GB), and the people who get it  
> want it
> not for the filesystem caches, but for big photo editing jobs..

Actually, interestingly the new MacBook Air comes with 2GB stock (I'm  
assuming it's soldered onto the motherboard, though, so it makes sense  
that Apple's giving customers 2GB as they can't upgrade themselves).

In any case, everybody's making a big fuss about ZFS, but it really  
doesn't make a lot of sense to use for a consumer system, it seems  
more geared for a server.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Kevin Ballard
  2008-01-23  1:47                                                                                         ` Martin Langhoff
@ 2008-01-23  2:06                                                                                         ` Theodore Tso
  2008-01-23  8:45                                                                                         ` David Kastrup
  2 siblings, 0 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-23  2:06 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

On Tue, Jan 22, 2008 at 07:38:04PM -0500, Kevin Ballard wrote:
> * Any new characters added to Unicode will only have one form (decomposed), 
> so HFS+ will always accept new characters as they will be NFD. The only 
> exception is case-sensitivity, as the case-folding tables in HFS+ are 
> static, so new characters with case variants will be treated in a 
> case-sensitive manner. However, as they are already decomposed, the NFD 
> algorithm will not change their encoding. This means that no, there are 
> zero problems moving HFS+ drives between versions of OS X.

Except there *are* problems, because this promise doesn't apply to
Unicode 2.1 (Mac OS 10.2 and before) and Unicode 3.2 (Mac OS 10.3 and
above).  And there were changes between the normalization algorithm
between Unicode 3.2 and the Unicode version 4.1.  So taking a hard
drive between Mac OS X 10.2 and 10.3 *will* cause problems.  The
guarantees of Unicode stability didn't come until well past Unicode
2.1.

Also, I know of no guarantee that there will be no more new
compositions.  According to Unicode Stnadard Annex #15
(http://unicode.org/reports/tr15/), new characters that can be
decomposed are strongly discouraged, but "It would be possible to add
more compositions in a future version of Unicode".  Got a reference to
back up your claim that there will never be any more?

> * At the time HFS+ was developed, there was no one common standard for 
> normalization. The HFS+ developers picked NFD because they thought it was 
> "a more flexible, future-looking form", but Microsoft ended up picking the 
> opposite just a short time later. Interestingly, NFC is a weird hybrid form 
> which only has composed forms for pre-existing characters, and decomposed 
> forms for all new characters (as they only have one form). So in a sense 
> NFD is more sane then NFC.

NFC is better if you care about compatibility with existing legacy
character sets, where you want round-trip conversions to be
idempotent.  On the other hand, given that Mac OS has historically
never cared about being compatible with the rest of the world, it
makes sense that it would choose NFD.

> * The core issue here, which is why you think HFS+ is so stupid, is that 
> you guys see no problem with having 2 files "Märchen" (NFC) and "Märchen" 
> (NFD), whereas the HFS+ developers don't consider it acceptable to have 2 
> visually identical names as independent files.

Yep.  No problems to do that.  You seem to think that supporting
Unicode requires imposing this constraint, but that's simply not true,
except maybe in some kind of religious sense.

> Unfortunately, the only way 
> to do this matching is to store the normalized form in the filesystem, 
> because it would be a performance nightmare to try and do this matching any 
> other way.

Nope.  They were just not clever enough.  If they use a hashed key for
their b-tree and used a hash which had the property that two strings
that were equivalent in the Unicode sense have the same hash value,
it's quite possible to do Unicode-equivalence lookups quickly.  Yeah,
calculating the hash algorithm takes a bit amount of time, but it gets
called no more than the normalization routine, and its performance
overhead is no worse than the normalizing a string.

I know how to do it in a Linux filesystem; it's just an insane thing
to do, and so I choose not to do it.  But it is doable; if you must
persue the course of filesystem insanity, it's possible to do it in a
performant way, without normalization; it's the same way that you can
use b-tree lookups in a case insensitive way.

> I must say it is shocking that someone as smart as you is still more 
> interested in finding ways to prove me wrong then to actually address the 
> problem. It's obvious that the only research you did was intended to find 
> ways to call me stupid.

No, I did the research to try to find the HFS-specific filename
mangling algorithm.  And given that's based on an back-level, old
version of Unicode, you can't just use NFD algorithm from the latest
Unicode spec.  As I did that research, I came across the evidence that
claims you had made (i.e., that HFS had never changed the Unicode
version for its Normalization algorithm), was directly contradicted by
the Apple TechNote.

    	  					- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-18 20:28                                             ` Junio C Hamano
  2008-01-18 20:50                                               ` Johannes Schindelin
@ 2008-01-23  2:46                                               ` Eric W. Biederman
  2008-01-23  2:57                                                 ` Junio C Hamano
  1 sibling, 1 reply; 260+ messages in thread
From: Eric W. Biederman @ 2008-01-23  2:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

Junio C Hamano <gitster@pobox.com> writes:

> I'd rather see our mental bandwidth spent on coming up with a
> workable workaround for such broken filesystems, while not
> hurting use of git on sane platforms.
>
> I fear it might have to end up to be very messy and slow,
> though.

Random thought.  Would it make sense to implement a git paranoid
mode to autodetect name mangling.

I.e.  After opening or creating a file by name we do a readdir in the
same directory to make certain we can find that same name/inode
combination.  Then on name-mangling systems we can autodetect they
exist and limit ourselves to just what they don't mangle with no
prior knowledge.  By refusing to process names that actively
get mangled.   For small directories that you frequently see in
development it shouldn't even be that slow.

Eric

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  2:46                                               ` Eric W. Biederman
@ 2008-01-23  2:57                                                 ` Junio C Hamano
  2008-01-23 14:26                                                   ` Nicolas Pitre
  0 siblings, 1 reply; 260+ messages in thread
From: Junio C Hamano @ 2008-01-23  2:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Peter Karlsson, Mark Junker, Pedro Melo,
	git@vger.kernel.org

ebiederm@xmission.com (Eric W. Biederman) writes:

> Junio C Hamano <gitster@pobox.com> writes:
>
>> I'd rather see our mental bandwidth spent on coming up with a
>> workable workaround for such broken filesystems, while not
>> hurting use of git on sane platforms.
>>
>> I fear it might have to end up to be very messy and slow,
>> though.
>
> Random thought.  Would it make sense to implement a git paranoid
> mode to autodetect name mangling.
>
> I.e.  After opening or creating a file by name we do a readdir in the
> same directory to make certain we can find that same name/inode
> combination.  Then on name-mangling systems we can autodetect they
> exist and limit ourselves to just what they don't mangle with no
> prior knowledge.  By refusing to process names that actively
> get mangled.   For small directories that you frequently see in
> development it shouldn't even be that slow.

Inside init-db where we already check how the filesystem
behaves, we could have an autodetection. A rough equivalent of
what I had in mind is:

	mkdir -p "Märchen/Märchen"
	if test "$(cd Märchen && echo M*)" = "Märchen"
        then
        	: not mangling
	else
        	git config core.namemangle true
	fi

(of course we do that in C not in shell).

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  1:33                                                                                         ` Theodore Tso
  2008-01-23  1:56                                                                                           ` Linus Torvalds
@ 2008-01-23  6:41                                                                                           ` Mike Hommey
  2008-01-23  8:15                                                                                             ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Mike Hommey @ 2008-01-23  6:41 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Linus Torvalds, git, Kevin Ballard

On Tue, Jan 22, 2008 at 08:33:25PM -0500, Theodore Tso wrote:
> On Tue, Jan 22, 2008 at 04:38:37PM -0800, Linus Torvalds wrote:
> > One thing I'd like somebody to check: what _does_ happen with OS X and NFS 
> > (OS X as a client, not server)? In particular:
> > 
> >  - Is it suddenly sane and case-sensitive?
> 
> Using a Linux server, and a OS X client, over NFS, it is in
> case-sensitive.  This is not unexpected, since you can mount UFS
> partitions on Mac OS X, or reformat HFS+ filesystems and make them be
> case-sensitive.
> 
> >  - Does the NFS client do any unicode conversion?
> 
> Nope:
> 
> # perl -CO -e 'print pack("U",0x00C4)."\n"'  | xargs touch
> # ls -l | cat -v
> total 0
> 0 -rw-r--r--   1 nobody  nobody  0 Jan 22 20:30 M-CM-^D
> 
> It's pretty clear the Unicode conversion is being done in HFS+, not in
> the VFS layer of Mac OS X.

There must be something at the VFS layer, or some other layer:
- IIRC, Joliet iso9660 volumes end up being mounted with files names in
  NFS when the real file names are NFC on the disk.
- Likewise for Samba shares.
- When I had my problems with iso9660 rockridge volumes using NFC (you
  can create that just fine with mkisofs), the volume is mounted without
  normalisation, i.e. if you get to a shell and want to access files,
  you must use NFC, but at least the Finder does transliteration at some
  stage, because going into the mount point and opening some files fail
  because it's trying to open the file with the name transliterated to
  NFD. I just hope the same doesn't happen with other filesystems.

Also, OSX using NFD widely, a file created from non Unix applications
may end up being named in NFD on any file system. File contents, too,
may end up being transliterated whenever a file is modified with non
Unix applications, introducing unwanted changes.
Typing file names in the Terminal might also make them encoded in NFD,
too.

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  6:41                                                                                           ` Mike Hommey
@ 2008-01-23  8:15                                                                                             ` Kevin Ballard
  2008-01-23  8:43                                                                                               ` Dmitry Potapov
  2008-01-23  9:40                                                                                               ` Mike Hommey
  0 siblings, 2 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23  8:15 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Theodore Tso, Linus Torvalds, git

[-- Attachment #1: Type: text/plain, Size: 3087 bytes --]

On Jan 23, 2008, at 1:41 AM, Mike Hommey wrote:

> On Tue, Jan 22, 2008 at 08:33:25PM -0500, Theodore Tso wrote:
>> On Tue, Jan 22, 2008 at 04:38:37PM -0800, Linus Torvalds wrote:
>>> One thing I'd like somebody to check: what _does_ happen with OS X  
>>> and NFS
>>> (OS X as a client, not server)? In particular:
>>>
>>> - Is it suddenly sane and case-sensitive?
>>
>> Using a Linux server, and a OS X client, over NFS, it is in
>> case-sensitive.  This is not unexpected, since you can mount UFS
>> partitions on Mac OS X, or reformat HFS+ filesystems and make them be
>> case-sensitive.
>>
>>> - Does the NFS client do any unicode conversion?
>>
>> Nope:
>>
>> # perl -CO -e 'print pack("U",0x00C4)."\n"'  | xargs touch
>> # ls -l | cat -v
>> total 0
>> 0 -rw-r--r--   1 nobody  nobody  0 Jan 22 20:30 M-CM-^D
>>
>> It's pretty clear the Unicode conversion is being done in HFS+, not  
>> in
>> the VFS layer of Mac OS X.
>
> There must be something at the VFS layer, or some other layer:
> - IIRC, Joliet iso9660 volumes end up being mounted with files names  
> in
>  NFS when the real file names are NFC on the disk.

I assume you mean NFD, not NFS, but here's what one of the HFS+  
engineers had to say:

"In Mac OS X,  SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file  
systems all store in one form -- NFC.  We store in NFC since that what  
is expected for these files systems."

> - Likewise for Samba shares.

See above.

> - When I had my problems with iso9660 rockridge volumes using NFC (you
>  can create that just fine with mkisofs), the volume is mounted  
> without
>  normalisation, i.e. if you get to a shell and want to access files,
>  you must use NFC, but at least the Finder does transliteration at  
> some
>  stage, because going into the mount point and opening some files fail
>  because it's trying to open the file with the name transliterated to
>  NFD. I just hope the same doesn't happen with other filesystems.

Can you produce a reproducible set of steps for this? Because the  
Finder shouldn't be doing any of this work on its own, all the  
normalization stuff happens directly in HFS+.

> Also, OSX using NFD widely, a file created from non Unix applications
> may end up being named in NFD on any file system. File contents, too,
> may end up being transliterated whenever a file is modified with non
> Unix applications, introducing unwanted changes.
> Typing file names in the Terminal might also make them encoded in NFD,
> too.

Entirely possible, though renormalizing file contents seems a bit less  
likely. I will point out that the text input system in OS X seems to  
default to producing NFC (at least, typing `echo 'Märchen' | xxd` in  
the Terminal shows that the input string there is NFC). So user input  
will most likely produce NFC, the only way you're probably going to  
end up with NFD is if you move a file from HFS+ to another filesystem.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  8:15                                                                                             ` Kevin Ballard
@ 2008-01-23  8:43                                                                                               ` Dmitry Potapov
  2008-01-23  9:02                                                                                                 ` Jonathan del Strother
  2008-01-23  9:40                                                                                               ` Mike Hommey
  1 sibling, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-23  8:43 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Mike Hommey, Theodore Tso, Linus Torvalds, git

On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard wrote:
> 
> Entirely possible, though renormalizing file contents seems a bit less  
> likely. I will point out that the text input system in OS X seems to  
> default to producing NFC (at least, typing `echo 'Märchen' | xxd` in  
> the Terminal shows that the input string there is NFC).

I wonder what happens if you do this:

touch 'Märchen'
echo M*rchen | xxd -g1

Will that produce NFC or NFD?

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  0:38                                                                                       ` Kevin Ballard
  2008-01-23  1:47                                                                                         ` Martin Langhoff
  2008-01-23  2:06                                                                                         ` Theodore Tso
@ 2008-01-23  8:45                                                                                         ` David Kastrup
  2 siblings, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-23  8:45 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Theodore Tso, git

Kevin Ballard <kevin@sb.org> writes:

> I just finished talking to one of the HFS+ developers, so I suspect I
> know a lot more on this subject now than you do.

Uh, Ted is a filesystem developer.  I can't count the hours I spent
talking with my father, a theoretical physicist, but that does not make
me qualified to consider myself a better authority on physics than a
sub-average actual grad student of the matter.

If you don't manage to check your arrogance eventually, you'll be
causing more damage to your cause than if you just shut up.  You make
abundantly clear that you don't understand the _implications_ of the
details you may or not may happen to find out.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  8:43                                                                                               ` Dmitry Potapov
@ 2008-01-23  9:02                                                                                                 ` Jonathan del Strother
  2008-01-23  9:12                                                                                                   ` Dmitry Potapov
  0 siblings, 1 reply; 260+ messages in thread
From: Jonathan del Strother @ 2008-01-23  9:02 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Kevin Ballard, Mike Hommey, Theodore Tso, Linus Torvalds, git

On Jan 23, 2008 8:43 AM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard wrote:
> >
> > Entirely possible, though renormalizing file contents seems a bit less
> > likely. I will point out that the text input system in OS X seems to
> > default to producing NFC (at least, typing `echo 'Märchen' | xxd` in
> > the Terminal shows that the input string there is NFC).
>
> I wonder what happens if you do this:
>
> touch 'Märchen'
> echo M*rchen | xxd -g1
>
> Will that produce NFC or NFD?
>

0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  9:02                                                                                                 ` Jonathan del Strother
@ 2008-01-23  9:12                                                                                                   ` Dmitry Potapov
  2008-01-23  9:19                                                                                                     ` Mike Hommey
  0 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-23  9:12 UTC (permalink / raw)
  To: Jonathan del Strother
  Cc: Kevin Ballard, Mike Hommey, Theodore Tso, Linus Torvalds, git

On Wed, Jan 23, 2008 at 09:02:43AM +0000, Jonathan del Strother wrote:
> On Jan 23, 2008 8:43 AM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard wrote:
> > >
> > > Entirely possible, though renormalizing file contents seems a bit less
> > > likely. I will point out that the text input system in OS X seems to
> > > default to producing NFC (at least, typing `echo 'Märchen' | xxd` in
> > > the Terminal shows that the input string there is NFC).
> >
> > I wonder what happens if you do this:
> >
> > touch 'Märchen'
> > echo M*rchen | xxd -g1
> >
> > Will that produce NFC or NFD?
> >
> 
> 0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

This is NFC! Did you do that on HFS+?

If so, it means that shell on Mac also converts filenames to NFC when
it reads them from the disk.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  9:12                                                                                                   ` Dmitry Potapov
@ 2008-01-23  9:19                                                                                                     ` Mike Hommey
  2008-01-23  9:32                                                                                                       ` Dmitry Potapov
  0 siblings, 1 reply; 260+ messages in thread
From: Mike Hommey @ 2008-01-23  9:19 UTC (permalink / raw)
  To: Dmitry Potapov
  Cc: Jonathan del Strother, Kevin Ballard, Theodore Tso,
	Linus Torvalds, git

On Wed, Jan 23, 2008 at 12:12:40PM +0300, Dmitry Potapov <dpotapov@gmail.com> wrote:
> On Wed, Jan 23, 2008 at 09:02:43AM +0000, Jonathan del Strother wrote:
> > On Jan 23, 2008 8:43 AM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > > On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard wrote:
> > > >
> > > > Entirely possible, though renormalizing file contents seems a bit less
> > > > likely. I will point out that the text input system in OS X seems to
> > > > default to producing NFC (at least, typing `echo 'Märchen' | xxd` in
> > > > the Terminal shows that the input string there is NFC).
> > >
> > > I wonder what happens if you do this:
> > >
> > > touch 'Märchen'
> > > echo M*rchen | xxd -g1
> > >
> > > Will that produce NFC or NFD?
> > >
> > 
> > 0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.
> 
> This is NFC! Did you do that on HFS+?

NFD, you mean ?

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  9:19                                                                                                     ` Mike Hommey
@ 2008-01-23  9:32                                                                                                       ` Dmitry Potapov
  0 siblings, 0 replies; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-23  9:32 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Jonathan del Strother, Kevin Ballard, Theodore Tso,
	Linus Torvalds, git

On Wed, Jan 23, 2008 at 10:19:59AM +0100, Mike Hommey wrote:
> On Wed, Jan 23, 2008 at 12:12:40PM +0300, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > On Wed, Jan 23, 2008 at 09:02:43AM +0000, Jonathan del Strother wrote:
> > > On Jan 23, 2008 8:43 AM, Dmitry Potapov <dpotapov@gmail.com> wrote:
> > > > On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard wrote:
> > > > >
> > > > > Entirely possible, though renormalizing file contents seems a bit less
> > > > > likely. I will point out that the text input system in OS X seems to
> > > > > default to producing NFC (at least, typing `echo 'Märchen' | xxd` in
> > > > > the Terminal shows that the input string there is NFC).
> > > >
> > > > I wonder what happens if you do this:
> > > >
> > > > touch 'Märchen'
> > > > echo M*rchen | xxd -g1
> > > >
> > > > Will that produce NFC or NFD?
> > > >
> > > 
> > > 0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.
> > 
> > This is NFC! Did you do that on HFS+?
> 
> NFD, you mean ?

Oops, you are right.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  8:15                                                                                             ` Kevin Ballard
  2008-01-23  8:43                                                                                               ` Dmitry Potapov
@ 2008-01-23  9:40                                                                                               ` Mike Hommey
  2008-01-23 13:38                                                                                                 ` Theodore Tso
  2008-01-23 16:58                                                                                                 ` Kevin Ballard
  1 sibling, 2 replies; 260+ messages in thread
From: Mike Hommey @ 2008-01-23  9:40 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Theodore Tso, Linus Torvalds, git

On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard <kevin@sb.org> wrote:
> "In Mac OS X,  SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file  
> systems all store in one form -- NFC.  We store in NFC since that what  
> is expected for these files systems."

That's the point. It's stored in NFC, but what applications see is NFD.

> >- Likewise for Samba shares.
> 
> See above.
> 
> >- When I had my problems with iso9660 rockridge volumes using NFC (you
> > can create that just fine with mkisofs), the volume is mounted  
> >without
> > normalisation, i.e. if you get to a shell and want to access files,
> > you must use NFC, but at least the Finder does transliteration at  
> >some
> > stage, because going into the mount point and opening some files fail
> > because it's trying to open the file with the name transliterated to
> > NFD. I just hope the same doesn't happen with other filesystems.
> 
> Can you produce a reproducible set of steps for this? Because the  
> Finder shouldn't be doing any of this work on its own, all the  
> normalization stuff happens directly in HFS+.

Simple : on a Linux host, create files with NFC names, and create an iso
image with mkisofs, with rockridge but no joliet. Burn this to a disc, and
insert the disc in your OSX host, and try to open files from the finder.
Interestingly, IIRC, Finder is able to copy the files, though.

As a bonus, try the same with an iso volume name in NFC, it's even better :
the created mount point is NFD, but it tries to mount on the name in NFC and
fails. And then you just can't eject the CD anymore.

Mike

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  9:40                                                                                               ` Mike Hommey
@ 2008-01-23 13:38                                                                                                 ` Theodore Tso
  2008-01-23 16:16                                                                                                   ` Linus Torvalds
  2008-01-23 16:58                                                                                                 ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Theodore Tso @ 2008-01-23 13:38 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Kevin Ballard, Linus Torvalds, git

Here's a reliable test case to test filename normalization on Mac OS.

------ cut here -------
cat > test.pl << EOF
#!/usr/bin/perl -CO
print "M".pack("U",0x00E4)."rchen\n";
print "Ma".pack("U",0x0308)."rchen\n";
EOF
chmod +x test.pl
./test.pl | xargs touch
echo M* | xxd -g1
------ cut here -------

On an NFS mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 20 4d c3 a4 72 63 68  Ma..rchen M..rch
0000010: 65 6e 0a                                         en.

and on an HFS+ mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
doing no normalization, as it is creating two files.  On HFS+, MacOS
is mapping both filenames to the same decomposed name.

More (or not) surprisingly, given Kevin Ballard's "reliable source":

  "In Mac OS X,  SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file
  systems all store in one form -- NFC.  We store in NFC since that what is
  expected for these files systems."

Using a Sony Reader (which uses an internal FAT filesystem) hooked up
to a MacOS 10.4.11 system:

% /fs/u1/tmp/test.pl  | xargs touch
% echo M* | xxd -g1
0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

.. which is the decomposed form.  So it looks like on FAT/MSDOS
filesystems MacOS 10.4.11 normalizes files to NFD, which will *not* do
the right thing as far as Windows compatibility is concerned on USB
sticks, et. al.  Mac OS users would be well advised not to use
non-ASCII names in their filesystems if they care about interoperating
with other systems.  :-P

							- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  2:57                                                 ` Junio C Hamano
@ 2008-01-23 14:26                                                   ` Nicolas Pitre
  2008-01-23 21:19                                                     ` Junio C Hamano
  0 siblings, 1 reply; 260+ messages in thread
From: Nicolas Pitre @ 2008-01-23 14:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Eric W. Biederman, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

On Tue, 22 Jan 2008, Junio C Hamano wrote:

> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > Random thought.  Would it make sense to implement a git paranoid
> > mode to autodetect name mangling.
> >
> > I.e.  After opening or creating a file by name we do a readdir in the
> > same directory to make certain we can find that same name/inode
> > combination.  Then on name-mangling systems we can autodetect they
> > exist and limit ourselves to just what they don't mangle with no
> > prior knowledge.  By refusing to process names that actively
> > get mangled.   For small directories that you frequently see in
> > development it shouldn't even be that slow.
> 
> Inside init-db where we already check how the filesystem
> behaves, we could have an autodetection.

I wonder if that is good enough.  Git repositories can be copied over to 
different filesystems.


Nicolas

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 13:38                                                                                                 ` Theodore Tso
@ 2008-01-23 16:16                                                                                                   ` Linus Torvalds
  2008-01-23 17:12                                                                                                     ` Theodore Tso
  2008-01-23 17:19                                                                                                     ` Kevin Ballard
  0 siblings, 2 replies; 260+ messages in thread
From: Linus Torvalds @ 2008-01-23 16:16 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Mike Hommey, Kevin Ballard, git

On Wed, 23 Jan 2008, Theodore Tso wrote:
> 
> So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
> doing no normalization, as it is creating two files.  On HFS+, MacOS
> is mapping both filenames to the same decomposed name.

Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle 
filenames on non-HFS+ filesystems.

The problem is that since most native applications *expect* that name 
mangling, they'll probably do name mangling of their own (internally) just 
to compare the names!

So I would not be surprised if the globbing libraries, for example, will 
do NFD-mangling in order to glob "correctly", so even programs ported from 
real Unix might end up getting pathnames subtly changed into NFD as part 
of some hot library-on-library action with UTF hackery inside.

Things like the finder etc, which must be very aware of the fact that 
filenames get corrupted, would presumably internally always convert 
everything they get into NFD in order to compare names from different 
sources. And as part of that, programs may well corrupt the name before 
they then use it to create a pathname.

The fact that your perl program works under NFS, but creates NFD on a VFAT 
volume, does imply that they probably used at least some of the same 
routines they use in HFS+ for VFAT. Not entirely surprising: doing case 
insensitive stuff with Unicode is nasty code, so why not share it (even if 
it's then incorrect for FAT)..

Piece of crap it is, though. Apple has painted themselves into a nasty 
corner there.

			Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23  9:40                                                                                               ` Mike Hommey
  2008-01-23 13:38                                                                                                 ` Theodore Tso
@ 2008-01-23 16:58                                                                                                 ` Kevin Ballard
  2008-01-23 17:39                                                                                                   ` Dmitry Potapov
  1 sibling, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23 16:58 UTC (permalink / raw)
  To: Mike Hommey; +Cc: Theodore Tso, Linus Torvalds, git

[-- Attachment #1: Type: text/plain, Size: 2399 bytes --]

On Jan 23, 2008, at 4:40 AM, Mike Hommey wrote:

> On Wed, Jan 23, 2008 at 03:15:02AM -0500, Kevin Ballard  
> <kevin@sb.org> wrote:
>> "In Mac OS X,  SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file
>> systems all store in one form -- NFC.  We store in NFC since that  
>> what
>> is expected for these files systems."
>
> That's the point. It's stored in NFC, but what applications see is  
> NFD.

I was actually asking for you to show this instead of just asserting  
it, but I realized I have access to an SMB share myself so I just  
tested.

And you're right. That's very curious. I guess they did that because  
the entire Carbon stack was written assuming NFD (back at the same  
time HFS+ was created), and they wanted to provide a consistent  
interface to applications. Since the filesystem already uses NFC,  
renormalizing to NFD shouldn't lose anything (want the original  
representation back? just normalize back to NFC).

>>> - Likewise for Samba shares.
>>
>> See above.
>>
>>> - When I had my problems with iso9660 rockridge volumes using NFC  
>>> (you
>>> can create that just fine with mkisofs), the volume is mounted
>>> without
>>> normalisation, i.e. if you get to a shell and want to access files,
>>> you must use NFC, but at least the Finder does transliteration at
>>> some
>>> stage, because going into the mount point and opening some files  
>>> fail
>>> because it's trying to open the file with the name transliterated to
>>> NFD. I just hope the same doesn't happen with other filesystems.
>>
>> Can you produce a reproducible set of steps for this? Because the
>> Finder shouldn't be doing any of this work on its own, all the
>> normalization stuff happens directly in HFS+.
>
> Simple : on a Linux host, create files with NFC names, and create an  
> iso
> image with mkisofs, with rockridge but no joliet. Burn this to a  
> disc, and
> insert the disc in your OSX host, and try to open files from the  
> finder.
> Interestingly, IIRC, Finder is able to copy the files, though.
>
> As a bonus, try the same with an iso volume name in NFC, it's even  
> better :
> the created mount point is NFD, but it tries to mount on the name in  
> NFC and
> fails. And then you just can't eject the CD anymore.

I was actually hoping for something I could test myself.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 16:16                                                                                                   ` Linus Torvalds
@ 2008-01-23 17:12                                                                                                     ` Theodore Tso
  2008-01-23 17:19                                                                                                     ` Kevin Ballard
  1 sibling, 0 replies; 260+ messages in thread
From: Theodore Tso @ 2008-01-23 17:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Mike Hommey, Kevin Ballard, git

On Wed, Jan 23, 2008 at 08:16:33AM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 23 Jan 2008, Theodore Tso wrote:
> > 
> > So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
> > doing no normalization, as it is creating two files.  On HFS+, MacOS
> > is mapping both filenames to the same decomposed name.
> 
> Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle 
> filenames on non-HFS+ filesystems.

Well "touch" actually since that was what was actually creating the
files; I only used perl because it was easist way to gaurantee exactly
how the filenames would be generated.

> The problem is that since most native applications *expect* that name 
> mangling, they'll probably do name mangling of their own (internally) just 
> to compare the names!
> 
> So I would not be surprised if the globbing libraries, for example, will 
> do NFD-mangling in order to glob "correctly", so even programs ported from 
> real Unix might end up getting pathnames subtly changed into NFD as part 
> of some hot library-on-library action with UTF hackery inside.

It's worse than that.  You can specify at format time whether or not
HFS+ does case-sensitivity or not, and of course, there is UFS, which
I expect does no Unicode normalization at all, much like NFS.  I
suspect what you've pointed out is why certain MacOS programs break
horribly when run on non-HFS+ filesystems, though.  And if that is the
case, then those same programs might not be reliable if the user's
home directory is stored on NFS --- like they would be in an
enteprise/corproate environment, if Apple ever wants to have any hope
of penetrating that market.

Because of this, git code won't be able to just check for HFS+; it
will probably have to do a run-time test to see whether or not the
filesystem is doing case-folding or not, since that can be turned on
or off on a per-filesystem basis.  Also unknown, and which should be
tested, is whether turning off case-folding also turns off Unicode
normalization.  It may be that they did this so that HFS+ could be UFS
compatible, since Darwin *must* be built on a UFS filesystem,
reflecting its Mach/BSD heritage.  (I ran across this while doing my
web research; apparently HFS+ has been causing Apple headaches
internally.  Heh.  :-)

>Things like the finder etc, which must be very aware of the fact that
>filenames get corrupted, would presumably internally always convert
>everything they get into NFD in order to compare names from different
>sources. And as part of that, programs may well corrupt the name before
>they then use it to create a pathname.

Well, hopefully not everyone inside Apple's OS groups are total
morons, and actually use a utf8_str_equiv() routine instead of
strcmp() to do their Unicode comparisons.  But then again, maybe
not...

> The fact that your perl program works under NFS, but creates NFD on a VFAT 
> volume, does imply that they probably used at least some of the same 
> routines they use in HFS+ for VFAT. Not entirely surprising: doing case 
> insensitive stuff with Unicode is nasty code, so why not share it (even if 
> it's then incorrect for FAT)..
> 
> Piece of crap it is, though. Apple has painted themselves into a nasty 
> corner there.

No kidding!!

							- Ted

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 16:16                                                                                                   ` Linus Torvalds
  2008-01-23 17:12                                                                                                     ` Theodore Tso
@ 2008-01-23 17:19                                                                                                     ` Kevin Ballard
  2008-01-23 17:32                                                                                                       ` Linus Torvalds
                                                                                                                         ` (2 more replies)
  1 sibling, 3 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23 17:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Theodore Tso, Mike Hommey, git

[-- Attachment #1: Type: text/plain, Size: 4635 bytes --]

On Jan 23, 2008, at 11:16 AM, Linus Torvalds wrote:

> On Wed, 23 Jan 2008, Theodore Tso wrote:
>>
>> So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS  
>> is
>> doing no normalization, as it is creating two files.  On HFS+, MacOS
>> is mapping both filenames to the same decomposed name.
>
> Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle
> filenames on non-HFS+ filesystems.
>
> The problem is that since most native applications *expect* that name
> mangling, they'll probably do name mangling of their own  
> (internally) just
> to compare the names!

Well yes, any context in which a string is treated as Unicode instead  
of an opaque sequence of bytes will probably lead to normalization at  
some point (e.g. when searching text, I'm going to want Märchen and  
Märchen to be treated as the same string). The Mac OS X APIs use NFD,  
and everybody else uses NFC, but either way it's still normalization.

> So I would not be surprised if the globbing libraries, for example,  
> will
> do NFD-mangling in order to glob "correctly", so even programs  
> ported from
> real Unix might end up getting pathnames subtly changed into NFD as  
> part
> of some hot library-on-library action with UTF hackery inside.

Why would the globbing libraries have to do anything special to  
understand NFD? In fact, I prefer that they don't - it's very handy to  
be able to type Ma* and have that match Märchen, as the globbing  
library sees Ma??rchen and is happy to match the ??rchen against *.  
Were the filename in NFC, I couldn't do that. Similarly, Ma<tab>  
autocompletes the name Märchen for me. But the convenience is beside  
the point - what I'm trying to show here is that if the globbing  
library were NFD-aware, it probably would decide Ma* shouldn't match  
Märchen, right?

I assume globbing libraries et al don't do UTF-8 hackery in Linux,  
right? And yet using NFC-encoded filenames is fairly common? So why  
should it be any different on OS X, especially since HFS+ isn't the  
only option here (and thus doing NFD conversion in the library would  
mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the  
HFS+ level is because they simply couldn't trust user-level libraries  
to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were  
not the responsibility of the file system -- but I realize that we  
cannot just ignore the problem and let the other layers sort it all  
out."

> Things like the finder etc, which must be very aware of the fact that
> filenames get corrupted, would presumably internally always convert
> everything they get into NFD in order to compare names from different
> sources. And as part of that, programs may well corrupt the name  
> before
> they then use it to create a pathname.

I don't get why you're still calling it corruption when, on an HFS+  
system, NFD-encoding is correct. It would be corruption for HFS+ to  
write anything else but NFD.

> The fact that your perl program works under NFS, but creates NFD on  
> a VFAT
> volume, does imply that they probably used at least some of the same
> routines they use in HFS+ for VFAT. Not entirely surprising: doing  
> case
> insensitive stuff with Unicode is nasty code, so why not share it  
> (even if
> it's then incorrect for FAT)..
>
> Piece of crap it is, though. Apple has painted themselves into a nasty
> corner there.

There's no reason to assume that OS X is actually storing the NFD on  
the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in,  its not just HFS  
that's involved her.  In Mac OS X,  SMB, MSDOS, UDF, ISO 9660  
(Joliet), NTFS and ZFS file systems all store in one form -- NFC.  We  
store in NFC since that what is expected for these files systems.  If  
we were to allow KFD to pass through, it would cause problems when  
these names were accessed outside of Mac OS X.  So this is not just an  
HFS issue but an interchange issue for Mac OS X.  We have the legacy  
NFD use/expectation in our applications and we chose not to ignore the  
problem but make a conscience effort to have the appropriate form used  
(NFD in Mac OS X APIs, NFC elsewhere).  Its not perfect but neither is  
the agnostic approach where both forms can be used and you can have  
duplicate filenames in your file system."

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 17:19                                                                                                     ` Kevin Ballard
@ 2008-01-23 17:32                                                                                                       ` Linus Torvalds
  2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
  2008-01-23 20:18                                                                                                       ` git on MacOSX and files with decomposed utf-8 file names Jay Soffian
  2008-01-23 23:37                                                                                                       ` Martin Langhoff
  2 siblings, 1 reply; 260+ messages in thread
From: Linus Torvalds @ 2008-01-23 17:32 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Theodore Tso, Mike Hommey, git



On Wed, 23 Jan 2008, Kevin Ballard wrote:
> 
> Well yes, any context in which a string is treated as Unicode instead of an
> opaque sequence of bytes will probably lead to normalization at some point
> (e.g. when searching text, I'm going to want Märchen and Märchen to be treated
> as the same string).

As pointed out (multiple times), this is only true if the programmer is a 
moron.

You do not need to - and *should* not - convert to a common normalization 
in order to compare to Uncode strings. You should just compare them with a 
Unicode-aware comparison routine. It will be faster, and it will avoid 
corrupting the input.

Sadly, stupid people are much too common.

		Linus

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 16:58                                                                                                 ` Kevin Ballard
@ 2008-01-23 17:39                                                                                                   ` Dmitry Potapov
  2008-01-23 17:47                                                                                                     ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Dmitry Potapov @ 2008-01-23 17:39 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Mike Hommey, Theodore Tso, Linus Torvalds, git

On Jan 23, 2008 7:58 PM, Kevin Ballard <kevin@sb.org> wrote:
> On Jan 23, 2008, at 4:40 AM, Mike Hommey wrote:
> >
> > That's the point. It's stored in NFC, but what applications see is
> > NFD.
>
> I was actually asking for you to show this instead of just asserting
> it, but I realized I have access to an SMB share myself so I just
> tested.
>
> And you're right. That's very curious. I guess they did that because
> the entire Carbon stack was written assuming NFD (back at the same
> time HFS+ was created), and they wanted to provide a consistent
> interface to applications.

Wait, did you tell us some time ago that normalization does not
matter and you just need to treat strings "as text"? Now, it looks
like the Carbon stack does not treat strings "as text". How come?

Maybe, you should stop lying and admit that changing Unicode
strings does matter even if they remain equivalent.

> Since the filesystem already uses NFC,
> renormalizing to NFD shouldn't lose anything (want the original
> representation back? just normalize back to NFC).

On Windows, you can create two *different* files -- one with NFC
and the other with NFD name. I wonder, how it is going to work
with your renormalization back and force.

Dmitry

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 17:39                                                                                                   ` Dmitry Potapov
@ 2008-01-23 17:47                                                                                                     ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-23 17:47 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Mike Hommey, Theodore Tso, Linus Torvalds, git

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On Jan 23, 2008, at 12:39 PM, Dmitry Potapov wrote:

> On Jan 23, 2008 7:58 PM, Kevin Ballard <kevin@sb.org> wrote:
>> On Jan 23, 2008, at 4:40 AM, Mike Hommey wrote:
>>>
>>> That's the point. It's stored in NFC, but what applications see is
>>> NFD.
>>
>> I was actually asking for you to show this instead of just asserting
>> it, but I realized I have access to an SMB share myself so I just
>> tested.
>>
>> And you're right. That's very curious. I guess they did that because
>> the entire Carbon stack was written assuming NFD (back at the same
>> time HFS+ was created), and they wanted to provide a consistent
>> interface to applications.
>
> Wait, did you tell us some time ago that normalization does not
> matter and you just need to treat strings "as text"? Now, it looks
> like the Carbon stack does not treat strings "as text". How come?

I'm amazed at how badly you manage to misinterpret everything I say.

>> Since the filesystem already uses NFC,
>> renormalizing to NFD shouldn't lose anything (want the original
>> representation back? just normalize back to NFC).
>
> On Windows, you can create two *different* files -- one with NFC
> and the other with NFD name. I wonder, how it is going to work
> with your renormalization back and force.

I'm not sure what you're trying to say here. As near as I can tell,  
SMB already does encoding conversions itself when talking to different  
clients, so you can hardly say OS X is doing something bad by  
converting between local NFD and NFC on SMB.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 17:19                                                                                                     ` Kevin Ballard
  2008-01-23 17:32                                                                                                       ` Linus Torvalds
@ 2008-01-23 20:18                                                                                                       ` Jay Soffian
       [not found]                                                                                                         ` <1DC841ED-634F-412C-9560-F37E4172A4CD@sb.org>
  2008-01-23 23:37                                                                                                       ` Martin Langhoff
  2 siblings, 1 reply; 260+ messages in thread
From: Jay Soffian @ 2008-01-23 20:18 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Linus Torvalds, Theodore Tso, Mike Hommey, git

On 1/23/08, Kevin Ballard <kevin@sb.org> wrote:
>
> I don't get why you're still calling it corruption when, on an HFS+
> system, NFD-encoding is correct. It would be corruption for HFS+ to
> write anything else but NFD.

How about this: it's lossy. It's lossy in a similar sense that TIFF ->
JPEG -> TIFF doesn't give you back exactly the same bytes, even though
(modulo the compression level) the two TIFFs might be visually
indistinguishable.

You seem to have an issue with calling this "corruption", but to most
of us, if you have a system where you don't get back *exactly the same
data* that you put in, then the data has been corrupted.

Now, please stop trolling this point, agree to disagree, and either
contribute some code or be quiet and allow others to make progress.

j.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 14:26                                                   ` Nicolas Pitre
@ 2008-01-23 21:19                                                     ` Junio C Hamano
  0 siblings, 0 replies; 260+ messages in thread
From: Junio C Hamano @ 2008-01-23 21:19 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Eric W. Biederman, Linus Torvalds, Peter Karlsson, Mark Junker,
	Pedro Melo, git@vger.kernel.org

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 22 Jan 2008, Junio C Hamano wrote:
>
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> 
>> > Random thought.  Would it make sense to implement a git paranoid
>> > mode to autodetect name mangling.
>> >
>> > I.e.  After opening or creating a file by name we do a readdir in the
>> > same directory to make certain we can find that same name/inode
>> > combination.  Then on name-mangling systems we can autodetect they
>> > exist and limit ourselves to just what they don't mangle with no
>> > prior knowledge.  By refusing to process names that actively
>> > get mangled.   For small directories that you frequently see in
>> > development it shouldn't even be that slow.
>> 
>> Inside init-db where we already check how the filesystem
>> behaves, we could have an autodetection.
>
> I wonder if that is good enough.  Git repositories can be copied over to 
> different filesystems.

Do you mean "cp -a"?  If I am not mistaken we already have that
issue, due to core.filemode, when user does that across
filesystems with different behaviours.

There is not much we can do against "cp -a" other than telling
users that some configurations need to be adjusted.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-23 17:19                                                                                                     ` Kevin Ballard
  2008-01-23 17:32                                                                                                       ` Linus Torvalds
  2008-01-23 20:18                                                                                                       ` git on MacOSX and files with decomposed utf-8 file names Jay Soffian
@ 2008-01-23 23:37                                                                                                       ` Martin Langhoff
  2 siblings, 0 replies; 260+ messages in thread
From: Martin Langhoff @ 2008-01-23 23:37 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Linus Torvalds, Theodore Tso, Mike Hommey, git

On Jan 24, 2008 6:19 AM, Kevin Ballard <kevin@sb.org> wrote:
> I don't get why you're still calling it corruption

Because in a modern Internet-aware world, whoever designs a FS needs
to acknowledge that they will need to store files from other systems
that have other assumptions. That is, if they want to interoperate.

As you noted not long ago, it is a serious problem if an HFS+
partition is shared over NFS. If you look at all the apps that have
problems with this aspect of HFS+ , they are all apps that transfer
files over the network over diverse protocols. That's why it's a
problem with git, because the files may be coming from a different
machine, running any arbitrary OS that git supports.

In such scenario, can you understand why everyone is saying that HFS+
and the VFS should not mangle names, even if it makes sense to some
use cases under OSX? And do you understand why the same applies to
git, being a network-sharing-oriented app?

So -- if OSX was doing things to make it easier for users to find
matching files at the Finder level, that'd be _fine_. But the FS has
to deal with a lot more variety than that. So this is a bad design
decision -- perhaps less obvious under OS8/9, but completely
disastrous with a network OS such as OSX. Call it "different" if you
want, but that's a euphemism for "wrong".

cheers,

m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
       [not found]                                                                                                               ` <76718490801231517h6d57e5bfkc19d394d38ad19db@mail.gmail.com>
@ 2008-01-24  2:05                                                                                                                 ` Kevin Ballard
  2008-01-24  3:11                                                                                                                   ` Junio C Hamano
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-24  2:05 UTC (permalink / raw)
  To: Jay Soffian; +Cc: Linus Torvalds, Theodore Tso, Mike Hommey, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 5716 bytes --]

I hope you don't mind that I'm redirecting this back onto the list.

On Jan 23, 2008, at 6:17 PM, Jay Soffian wrote:

> On 1/23/08, Kevin Ballard <kevin@sb.org> wrote:
>
>> I agree - the argument is fairly worthless. So why does everybody  
>> else
>> keep spending time accusing HFS+ of corrupting filenames? Most of
>> Linus's email was specifically about this point, but apparently  
>> that's
>> alright with you while only a *single* line out of my email in direct
>> response is called trolling?
>
> Everyone else considers what HFS+ does as corruption. You, alone in
> this thread, do not. You are not willing to concede the point, nor let
> it go. I'm accusing you of trolling because you are the single person
> defending HFS+'s behavior.

I don't understand how you can possibly think that disagreeing ==  
trolling. Similarly, just because I'm the only person *on this list*  
who holds my viewpoint doesn't in any way mean I should abandon it. In  
fact, it makes it much more important that I continue to stand up for  
what I believe, The whole notion of democracy is based on the fact  
that every person is important, and that every person has the right to  
their own opinion. I realize this is a mailing list, not a democratic  
body, but the same principles should still apply. If your criteria for  
judging any viewpoint is purely how many people hold that viewpoint,  
then you end up ignoring things just because they are different or new.

I may be the single person defending this behavior on this list, but  
if you were to leave your comfortable linux community and talk to  
people elsewhere, you might find yourself in the minority opinion.

> (Also, while it's certainly possible that you are right and everyone
> else is wrong, most of the other folks have significant experience as
> kernel, filesystem, or git developers, which leads credence to their
> point -- reputation matters.)

Why do you persist in thinking of this as right vs. wrong? I've tried  
to emphasize, many times, that HFS+ behaves this way not because it's  
"right" and ext4 is "wrong", but because HFS+ has a different set of  
values. The developers of HFS+ believed that, for a consumer OS like  
OS X, it made much more sense to treat visually indistinguishable  
filenames as the same file. I, and I'm sure the vast majority of OS X  
users, agree. Unfortunately this decision had some drawbacks, but they  
felt the trade-off was worth it. I'm well aware that you all don't  
think the trade-off was worth it, but like I said, this is a matter of  
behaving differently due to a different set of values, not behaving  
"right" or "wrong". I've been making an attempt to agree to disagree,  
but it seems that you would rather just squash dissent instead of  
accepting it.

>> Again, you're happy to let everybody else write long paragraphs
>> accusing HFS+ of bad behavior (and making horrible assumptions which
>> are generally completely untrue), and you don't think that's noise?
>
> They are responding to you. If you let the point drop, so will they.

I did let the point drop. Then you guys resurrected it. You can't pin  
this one on me.

>> I do ignore most of it, I'm only getting mad because a few people  
>> keep
>> telling me that I'm trolling, or being inflammatory, simply by  
>> posting
>> reasoned, factual replies, but everybody who keeps spewing insults
>> are, apparently, not a problem at all.
>
> I understand that you think your replies are reasoned and factual, but
> everyone else thinks you're wrong. They are getting frustrated
> defending a point with which you continue to disagree, hence the
> insults.

Don't you think I'm frustrated at the behavior of everyone else here?  
But you don't see me flinging insults.

>> At first, I did. Now it's just tiresome, since he keeps calling me  
>> and
>> HFS+ dumb for the exact same reasons he did at the start of the  
>> thread
>> no matter how I respond. Apparently he's simply more interested in
>> keeping his own opinion than in the actual reasons behind HFS+'s
>> decisions. It's rather frustrating.
>
> Have you considered that maybe he's right? In any case, you're not
> going to convince Linus of anything. From what I can tell, he forms
> his opinions based on facts he collects himself and his own
> experience. I gather that anything you say he will consider, *at
> best*, as hearsay. Besides that, he's actually working to solve the
> problem, while still taking the time to respond to your points.

Collecting facts yourself is fine, but insulting anybody with a  
dissenting *opinion* simply because it's different is just plain wrong.

> Really, please, go take a walk outside. Get some fresh air. Maybe stop
> reading the git list for a week or two. In the grand scheme of things
> it doesn't matter what git developers think of HFS+ as long as they're
> willing to make git work with it, which apparently, and in spite of
> you at this point, they are.

For the majority of this thread, nobody was making any indication that  
they cared at all about fixing this problem - that was my primary  
motivation to continue. If I had dropped this the first time someone  
told me to, do you think anybody would be working on the problem now?

As for dropping this conversation now, I'd love to. If you really want  
to drop it, I urge you to do just that - don't respond to this  
message. Read it, digest it, and then just let it sit. If this is the  
last message on the subject, that would be *wonderful*. But if you  
respond to this message then you have absolutely no ground to accuse  
me of refusing to drop it. So please, don't.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  2:05                                                                                                                 ` Kevin Ballard
@ 2008-01-24  3:11                                                                                                                   ` Junio C Hamano
  2008-01-24  4:37                                                                                                                     ` Martin Langhoff
  0 siblings, 1 reply; 260+ messages in thread
From: Junio C Hamano @ 2008-01-24  3:11 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Jay Soffian, Linus Torvalds, Theodore Tso, Mike Hommey,
	Git Mailing List

Kevin Ballard <kevin@sb.org> writes:

> As for dropping this conversation now, I'd love to. If you really want
> to drop it, I urge you to do just that - don't respond to this
> message. Read it, digest it, and then just let it sit. If this is the
> last message on the subject, that would be *wonderful*. But if you
> respond to this message then you have absolutely no ground to accuse
> me of refusing to drop it. So please, don't.

I would not have said that if I were you.  That makes you look
very bad.  The impression I get after reading the above is that
the only thing you care about is to have the last word in the
thread.

People with opinions different from you could tone their message
down and stick to a more neutral sounding statement, "This patch
works around the issue X on HFS+", but not everybody is always
nice-and-calm.  But _you_ do not have to counter fire with fire,
especially if your goal isn't to flame but is to resolve
technical issues with cool head.  As long as you do not get
upset and start the flamewar every time whenever somebody says
"This patch works around the issue only that broken crap HFS+
has due to its stupid filename corruption choice it made", when
he could just have said it in a more neutral way, we can keep
the conversation constructive and civilized.

Let me suggest an alternative, as I think this thread raged on
long enough.  When you read somebody says "HFS+ corrupts", "HFS+
is broken", "this works around the stupidity of HFS+", just take
a deep breath, pretend that you did not hear these words that
make you feel insulted.  Instead pretend that you heard "HFS+
normalizes", "HFS+ is different", and "fixes problem on HFS+".
Do not respond with "No it is not a corruption", "No, HFS+ is
not broken" and "No, that is not a work around, but is a fix"
with another long thread.

I can imagine a civilized conversation to go this way:

	Linus: This patch would hopefully work around the stupid
	and broken normalization choice HFS+ people made years ago.

	You: Ok, I tested that patch, and it does fix the issue
	for me on HFS+ for most cases, but I still have issues
	if I use character X, Y and Z.

	Linus: Yeah, that is another direct consequence of the
	stupidity of HFS+.  At this point I think the previous
	patch bends git backwards enough and I do not know if it
	is worth addressing by bending further...

	You: How about introducing this new structure so that
	these cases can be handled in a way more friendly to
	HFS+, like this patch?

	Linus: Yeah, I can buy that, it looks ugly but it would
	not hurt people on other systems.

Hmm?

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  3:11                                                                                                                   ` Junio C Hamano
@ 2008-01-24  4:37                                                                                                                     ` Martin Langhoff
  2008-01-24  5:30                                                                                                                       ` Kevin Ballard
  0 siblings, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-24  4:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Kevin Ballard, Jay Soffian, Linus Torvalds, Theodore Tso,
	Mike Hommey, Git Mailing List

On Jan 24, 2008 4:11 PM, Junio C Hamano <gitster@pobox.com> wrote:
> make you feel insulted.  Instead pretend that you heard "HFS+
> normalizes", "HFS+ is different", and "fixes problem on HFS+".
> Do not respond with "No it is not a corruption", "No, HFS+ is
> not broken" and "No, that is not a work around, but is a fix"
> with another long thread.

Indeed. And it'd be good if Kevin could consider that this forum is
for technical discussion - not democracy but meritocracy,
best-solution-cracy and perhaps "fix-patch-ocracy". And that people
that have written good code in the past, posted amazing patches, and
wondrous test cases can sometimes get a bit more opinionated. But
newcomers needs to earn a bit of respect before lecturing people.

Kevin, other people have already started posting nice nuggets of test
cases. Where are *your* test cases? That would be a nice way to "have
the last word" on this ;-)

cheers,

martin

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  4:37                                                                                                                     ` Martin Langhoff
@ 2008-01-24  5:30                                                                                                                       ` Kevin Ballard
  2008-01-24  6:39                                                                                                                         ` Steffen Prohaska
  0 siblings, 1 reply; 260+ messages in thread
From: Kevin Ballard @ 2008-01-24  5:30 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Junio C Hamano, Jay Soffian, Linus Torvalds, Theodore Tso,
	Mike Hommey, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 601 bytes --]

On Jan 23, 2008, at 11:37 PM, Martin Langhoff wrote:

> Kevin, other people have already started posting nice nuggets of test
> cases. Where are *your* test cases? That would be a nice way to "have
> the last word" on this ;-)


I'm planning on devoting time this weekend to learning enough about  
git to be able to start hacking. I'm just too busy during the week to  
be able to devote the dedicated time necessary to this stuff.  
Hopefully I'll actually be able to start producing stuff this weekend.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  5:30                                                                                                                       ` Kevin Ballard
@ 2008-01-24  6:39                                                                                                                         ` Steffen Prohaska
  2008-01-24 18:17                                                                                                                           ` Mitch Tishmack
  2008-01-24 18:52                                                                                                                           ` Mitch Tishmack
  0 siblings, 2 replies; 260+ messages in thread
From: Steffen Prohaska @ 2008-01-24  6:39 UTC (permalink / raw)
  To: Kevin Ballard
  Cc: Martin Langhoff, Junio C Hamano, Jay Soffian, Linus Torvalds,
	Theodore Tso, Mike Hommey, Git Mailing List

On Jan 24, 2008, at 6:30 AM, Kevin Ballard wrote:

> On Jan 23, 2008, at 11:37 PM, Martin Langhoff wrote:
>
>> Kevin, other people have already started posting nice nuggets of test
>> cases. Where are *your* test cases? That would be a nice way to "have
>> the last word" on this ;-)
>
>
> I'm planning on devoting time this weekend to learning enough about  
> git to be able to start hacking. I'm just too busy during the week  
> to be able to devote the dedicated time necessary to this stuff.  
> Hopefully I'll actually be able to start producing stuff this weekend.

You do not need to learn much about git to post a test case.
Only a few lines of shell code that demonstrate how git fails to
handle a specific situation is needed.  To do this, knowledge
about the internals of git does not necessary help.  It should be
sufficient to know how to use git.

You may start with a simple shell script and send it to the list.
Though, a real patch would be the preferred way.  For this, you
should have a quick look into the t/ subdirectory.  Just open any
of the tNUMBER*.sh files.  It should be quite obvious how your
sequence of shell commands could be cast into a git test script.

	Steffen

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  6:39                                                                                                                         ` Steffen Prohaska
@ 2008-01-24 18:17                                                                                                                           ` Mitch Tishmack
  2008-01-24 18:52                                                                                                                           ` Mitch Tishmack
  1 sibling, 0 replies; 260+ messages in thread
From: Mitch Tishmack @ 2008-01-24 18:17 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: Kevin Ballard, Martin Langhoff, Junio C Hamano, Jay Soffian,
	Linus Torvalds, Theodore Tso, Mike Hommey, Git Mailing List

Here is a start maybe, I was just testing all of the HFS variants for  
fun. I will write up a test case later tonight when I am out  done  
with work.

#!/bin/sh
#
# Test git behavior on OSX's multitudes of HFS types
# So far UFS, HFS, HFS+, HFSX, HFS+J, UFS is the only sane FS  
available...
#

#cloneurl="git://git.kernel.org/pub/scm/git/git.git"
cloneurl="/Volumes/gitufs/git"
ramdiskdir="/private/tmp/gitramdisk"
results="/private/tmp/fsresults"

echo "Creating 100M UFS ramdisk for clone operation."
rawdev=`hdid -nomount ram://102400`
newfs $rawdev > /dev/null 2>&1
mkdir $ramdiskdir > /dev/null 2>&1
mount -t ufs $rawdev $ramdiskdir > /dev/null 2>&1
cd $ramdiskdir && git clone $cloneurl > /dev/null 2>&1
cd $ramdiskdir/git
if [ -f $results ] ; then
   echo "Removing old results."
   rm $results
fi

echo "Creating HFS image"
hdiutil create -size 50m -fs HFS -attach -volname "hfs" /tmp/hfs.dmg  
 > /dev/null 2>&1
echo "Creating HFS+ image"
hdiutil create -size 50m -fs HFS+ -attach -volname "hfsplus" /tmp/ 
hfsplus.d > /dev/null 2>&1
echo "Creating HFS+J image"
hdiutil create -size 50m -fs HFS+J -attach -volname "hfsplusJ" /tmp/ 
hfsplusjournal.dmg > /dev/null 2>&1
echo "Creating HFSX image"
hdiutil create -size 50m -fs HFSX -attach -volname "hfsx" /tmp/ 
hfsx.dmg > /dev/null 2>&1
echo "Creating UFS image"
hdiutil create -size 50m -fs UFS -attach -volname "hfsu" /tmp/hfsu.dmg  
 > /dev/null 2>&1

for x in `ls -d /Volumes/hfs*`;do
   echo "Testing $x clone."
   echo "Results for $x:" >> $results
   (echo "-- git clone results --" && cd ${x} && /usr/bin/time git  
clone $ramdiskdir/git) >> $results 2>&1
   (echo "-- git status results --" && cd ${x}/git && /usr/bin/time  
git status) >> $results 2>&1
   cd ${x} && perl -CO -e 'print pack("U",0x00E4)."\n"' | xargs touch  
# umlauted a
   cd ${x} && perl -CO -e 'print pack("U",0x0061).pack("U", 
0x0308)."\n"' | xargs touch # umlauted a by combining diareses
   ls -d ${x}/* | xxd >> $results
   cd && hdiutil eject ${x} > /dev/null 2>&1
done

# cleanup
cd $HOME
umount -f $ramdiskdir > /dev/null 2>&1
hdiutil detach $rawdev > /dev/null 2>&1
rm -Rf $ramdiskdir /tmp/hfs*.dmg
more $results

My results on leopard:
$ cat /tmp/fsresults
Results for /Volumes/hfs:
-- git clone results --
Initialized empty Git repository in /Volumes/hfs/git/.git/
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-06100ef5fbd98d07358505696e2e0c5600a9b279.pack: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-06100ef5fbd98d07358505696e2e0c5600a9b279.idx: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-401a5ae571eb23ec896d7e441deae4e313d0de9c.pack: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/pack- 
ab8844b63fcb4fc5896e9d75b0d10c566d5ce5bb.pack: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/pack- 
ab8844b63fcb4fc5896e9d75b0d10c566d5ce5bb.idx: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-9ffbf58084280a496aef6849fe3effe742a99d77.pack: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-9ffbf58084280a496aef6849fe3effe742a99d77.idx: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/pack/ 
pack-401a5ae571eb23ec896d7e441deae4e313d0de9c.idx: Invalid argument
cpio: Unable to create /Volumes/hfs/git/.git/objects/ 
03/4ee24912da0a700ba27109825710fd84d64591: Invalid argument
         0.20 real         0.02 user         0.05 sys
-- git status results --
git_fs.sh: line 39: cd: /Volumes/hfs/git: No such file or directory
0000000: 2f56 6f6c 756d 6573 2f68 6673 2f61 cc88  /Volumes/hfs/a..
0000010: 0a                                       .
Results for /Volumes/hfsplus:
-- git clone results --
Initialized empty Git repository in /Volumes/hfsplus/git/.git/
         2.56 real         0.47 user         0.94 sys
-- git status results --
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to  
track)
         0.46 real         0.27 user         0.08 sys
0000000: 2f56 6f6c 756d 6573 2f68 6673 706c 7573  /Volumes/hfsplus
0000010: 2f61 cc88 0a2f 566f 6c75 6d65 732f 6866  /a.../Volumes/hf
0000020: 7370 6c75 732f 6769 740a                 splus/git.
Results for /Volumes/hfsplusJ:
-- git clone results --
Initialized empty Git repository in /Volumes/hfsplusJ/git/.git/
         2.29 real         0.45 user         0.91 sys
-- git status results --
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to  
track)
         0.57 real         0.31 user         0.10 sys
0000000: 2f56 6f6c 756d 6573 2f68 6673 706c 7573  /Volumes/hfsplus
0000010: 4a2f 61cc 880a 2f56 6f6c 756d 6573 2f68  J/a.../Volumes/h
0000020: 6673 706c 7573 4a2f 6769 740a            fsplusJ/git.
Results for /Volumes/hfsu:
-- git clone results --
Initialized empty Git repository in /Volumes/hfsu/git/.git/
         5.08 real         0.48 user         0.94 sys
-- git status results --
# On branch master
nothing to commit (working directory clean)
         0.26 real         0.20 user         0.05 sys
0000000: 2f56 6f6c 756d 6573 2f68 6673 752f 61cc  /Volumes/hfsu/a.
0000010: 880a 2f56 6f6c 756d 6573 2f68 6673 752f  ../Volumes/hfsu/
0000020: 6769 740a 2f56 6f6c 756d 6573 2f68 6673  git./Volumes/hfs
0000030: 752f c3a4 0a                             u/...
Results for /Volumes/hfsx:
-- git clone results --
Initialized empty Git repository in /Volumes/hfsx/git/.git/
         2.49 real         0.46 user         0.88 sys
-- git status results --
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to  
track)
         0.25 real         0.20 user         0.04 sys
0000000: 2f56 6f6c 756d 6573 2f68 6673 782f 61cc  /Volumes/hfsx/a.
0000010: 880a 2f56 6f6c 756d 6573 2f68 6673 782f  ../Volumes/hfsx/
0000020: 6769 740a                                git.


On Jan 24, 2008, at 12:39 AM, Steffen Prohaska wrote:

>
> On Jan 24, 2008, at 6:30 AM, Kevin Ballard wrote:
>
>> On Jan 23, 2008, at 11:37 PM, Martin Langhoff wrote:
>>
>>> Kevin, other people have already started posting nice nuggets of  
>>> test
>>> cases. Where are *your* test cases? That would be a nice way to  
>>> "have
>>> the last word" on this ;-)
>>
>>
>> I'm planning on devoting time this weekend to learning enough about  
>> git to be able to start hacking. I'm just too busy during the week  
>> to be able to devote the dedicated time necessary to this stuff.  
>> Hopefully I'll actually be able to start producing stuff this  
>> weekend.
>
> You do not need to learn much about git to post a test case.
> Only a few lines of shell code that demonstrate how git fails to
> handle a specific situation is needed.  To do this, knowledge
> about the internals of git does not necessary help.  It should be
> sufficient to know how to use git.
>
> You may start with a simple shell script and send it to the list.
> Though, a real patch would be the preferred way.  For this, you
> should have a quick look into the t/ subdirectory.  Just open any
> of the tNUMBER*.sh files.  It should be quite obvious how your
> sequence of shell commands could be cast into a git test script.
>
> 	Steffen
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24  6:39                                                                                                                         ` Steffen Prohaska
  2008-01-24 18:17                                                                                                                           ` Mitch Tishmack
@ 2008-01-24 18:52                                                                                                                           ` Mitch Tishmack
  2008-01-24 19:58                                                                                                                             ` Kevin Ballard
  1 sibling, 1 reply; 260+ messages in thread
From: Mitch Tishmack @ 2008-01-24 18:52 UTC (permalink / raw)
  To: Steffen Prohaska
  Cc: Kevin Ballard, Martin Langhoff, Junio C Hamano, Jay Soffian,
	Linus Torvalds, Theodore Tso, Mike Hommey, Git Mailing List

Apologies Steffen, I grabbed your CamelCase test and did a search/ 
replace, wasn't sure what to call it though... But I am on lunch and  
wanted to be useful. Rip it apart all you want.

Fails on hfs* on OSX, works on ufs. I will bother with zfs when it can  
be used again.

On UFS:
$ /bin/sh ./t0060-normalization.sh
*   ok 1: setup
*   ok 2: rename (silent normalization)
*   ok 3: merge (silent normalization)
* passed all 3 test(s)


On HFS:
$ /bin/sh t0060-normalization.sh
*   ok 1: setup
* FAIL 2: rename (silent normalization)
	
	
	 git mv ä ä &&
	 git commit -m "rename"
	
	
* FAIL 3: merge (silent normalization)
	
	
	 git reset --hard initial &&
	 git merge topic
	
	
* failed 2 among 3 test(s)

The test case, it uses perl, assuming only 5.6.1+ will work with this:
diff --git a/t/t0060-normalization.sh b/t/t0060-normalization.sh
new file mode 100755
index 0000000..e012c02
--- /dev/null
+++ b/t/t0060-normalization.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+
+test_description='Test for silent normalization issues'
+
+. ./test-lib.sh
+
+auml=`perl -CO -e 'print pack("U",0x00E4)'`
+aumlcdiar=`perl -CO -e 'print pack("U",0x0061).pack("U",0x0308)'`
+test_expect_success setup "
+  touch $aumlcdiar &&
+  git add $aumlcdiar &&
+  git commit -m \"initial\"
+  git tag initial &&
+  git checkout -b topic &&
+  git mv $aumlcdiar tmp &&
+  git mv tmp $auml &&
+  git commit -m \"rename\" &&
+  git checkout -f master
+
+"
+
+test_expect_success 'rename (silent normalization)' "
+
+ git mv $aumlcdiar $auml &&
+ git commit -m \"rename\"
+
+"
+
+test_expect_success 'merge (silent normalization)' '
+
+ git reset --hard initial &&
+ git merge topic
+
+'
+
+test_done
-- 
1.5.3




On Jan 24, 2008, at 12:39 AM, Steffen Prohaska wrote:

>
> On Jan 24, 2008, at 6:30 AM, Kevin Ballard wrote:
>
>> On Jan 23, 2008, at 11:37 PM, Martin Langhoff wrote:
>>
>>> Kevin, other people have already started posting nice nuggets of  
>>> test
>>> cases. Where are *your* test cases? That would be a nice way to  
>>> "have
>>> the last word" on this ;-)
>>
>>
>> I'm planning on devoting time this weekend to learning enough about  
>> git to be able to start hacking. I'm just too busy during the week  
>> to be able to devote the dedicated time necessary to this stuff.  
>> Hopefully I'll actually be able to start producing stuff this  
>> weekend.
>
> You do not need to learn much about git to post a test case.
> Only a few lines of shell code that demonstrate how git fails to
> handle a specific situation is needed.  To do this, knowledge
> about the internals of git does not necessary help.  It should be
> sufficient to know how to use git.
>
> You may start with a simple shell script and send it to the list.
> Though, a real patch would be the preferred way.  For this, you
> should have a quick look into the t/ subdirectory.  Just open any
> of the tNUMBER*.sh files.  It should be quite obvious how your
> sequence of shell commands could be cast into a git test script.
>
> 	Steffen
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 260+ messages in thread

* Re: git on MacOSX and files with decomposed utf-8 file names
  2008-01-24 18:52                                                                                                                           ` Mitch Tishmack
@ 2008-01-24 19:58                                                                                                                             ` Kevin Ballard
  0 siblings, 0 replies; 260+ messages in thread
From: Kevin Ballard @ 2008-01-24 19:58 UTC (permalink / raw)
  To: Mitch Tishmack
  Cc: Steffen Prohaska, Martin Langhoff, Junio C Hamano, Jay Soffian,
	Linus Torvalds, Theodore Tso, Mike Hommey, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 568 bytes --]

On Jan 24, 2008, at 1:52 PM, Mitch Tishmack wrote:

> Apologies Steffen, I grabbed your CamelCase test and did a search/ 
> replace, wasn't sure what to call it though... But I am on lunch and  
> wanted to be useful. Rip it apart all you want.
>
> [snip]

Well, I was planning on writing my own test case today, but you seem  
to have beaten me to the punch. I just tested your script and it does  
indeed fail as expected on HFS+. Thank you for producing this test case.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 260+ messages in thread

* On pathnames
  2008-01-23 17:32                                                                                                       ` Linus Torvalds
@ 2008-01-24 21:02                                                                                                         ` Junio C Hamano
  2008-01-24 22:31                                                                                                           ` Nicolas Pitre
                                                                                                                             ` (3 more replies)
  0 siblings, 4 replies; 260+ messages in thread
From: Junio C Hamano @ 2008-01-24 21:02 UTC (permalink / raw)
  To: git
  Cc: Johannes Schindelin, Linus Torvalds, Kevin Ballard, Theodore Tso,
	Mike Hommey

One of Linus's recent patch introduces an index hashtable so
that we can later hash "equivalent" names into the same bucket
to allow us non-byte-by-byte comparison.

Before going further, I needed to formalize what we are trying
to achieve.  I learned a few things from the long flamewar
thread, but it is very inefficient to go back to the thread to
pick only the useful pieces.  The whole flamewar simply did not
fit a small Panda brain.

That was the reason for this write-up.

Design constraints.  In the following, I'll use two names $A and
$B as an example.  They are a pair of names that are considered
equivalent in some contexts, such as:

        A=xt_connmark.c  B=xt_CONNMARK.c

 (1) Some filesystems prevent you from having these two
     (confusing) paths in a directory at the same time.  Some do
     not implement this confusion prevention, and allows both
     names to exist at the same time.

     Let's call the former "case insensitive", and the latter
     "case sensitive".

 (2) readdir(3) on some "case insensitive" filesystems returns
     $A, after a successful creat(2) of $B.  Others remember
     which one of the two "equivalent" names were used in
     creat(2).

     Let's call the former "case folding", and the latter "case
     preserving".

     We assume open(2) or lstat(2) of $A or $B will succeed
     after allowing creat(2) of $B if a case folding filesystem
     returns $A from readdir(3).

 (3) Among the "case folding" ones, some filesystems fold the
     pathname to a form that is less interoperable with other
     systems, and/or the form that is likely to be different
     from what the end-user usually enters.

     Such filesystems are "inconveniently case folding".

The last one is not quite apparent with the "xt_connmark.c"
example, but if you replace $A and $B in the above description
with:

        A=Ma"rchen       B=Märchen

it would hopefully become more clear.

For example, vfat is generally "case preserving".  In that long
flamewar thread, I think we learned that HFS+ is in general
"inconveniently case folding" with respect to Unicode, by always
folding to $A but the keyboard/IM input is more likely to come
as $B, which happens to be the more interoperable form with
other systems.

Issues with case insensitive filesystems
----------------------------------------

At the data structure level, a pathname to git is a sequence of
bytes terminated with NUL.  This will _not_ change.

By the way, at the data structure level, a tree entry in git can
represent a blob that is a symbolic link.  A tree entry in git
can also represent a blob that is a regular file, and in that
case, it can represent if it is executable or not.  These will
also not change.

Now, let's think about how we allow use of git on a filesystem
that is incapable of symbolic links, and/or a filesystem that
does not have trustable executable bit.

We do not say "Symlinks are evil and not supported everywhere,
so let's introduce a project configuration to disallow addition
of symlinks".  We do not say that to the executable bit, either.

Instead, we have fallback methods to allow manipulating symlinks
and executable bit on such a filesystem that is incapable of
handling them natively.

We should be able to do the same for this "case sensitivity"
issue.  A tree that has xt_connmark.c and xt_CONNMARK.c at the
same time cannot be checked out on a case insensitive filesystem.

The filesystem is simply incapable of it (please just calmly
rephrase it in your head as "does not allow such confusing
craziness" instead of starting another flamewar, if you feel the
expression "incapable of" insults your favorite filesystem).

That may mean the project should avoid such equivalent names in
its trees (and having a project wide configuration could be a
technical means to help enforcing that policy), but it does not
mean the core level of git should prevent them to be created on
such systems.  It just means that there should be a way, that
could (and sometimes has to) be different from the "natural"
way, to manipulate such tree entries even on a case insensitive
filesystem.

For example, if I find that RelNotes symlink incorrectly points
at Documentation/RelNotes-1.5.44.txt and want to fix it and push
it out immediately, but if I am on the road and the only
environment I can borrow is a git installation on a filesystem
that is symlink-challenged, I can still do the fix. On such a
filesystem, a symlink is checked out as a regular file but is
still marked as a symlink in the index.  The only thing I need
to do is to edit the file (making sure not to add an extra LF at
the end) and add it to the index.  That's certainly different
from the "natural" way to do that on a filesystem with symlinks,
which is "ln -fs Documentation/RelNotse-1.5.4.txt RelNotes", but
the point is that we make it possible.

The same thing should apply to two files that cannot be checked
out at the same time on case insensitive filesystems.  Perhaps
we could have something like:

	$ git show :xt_CONNMARK.c >xt_connmark-1.c
        $ edit xt_connmark-1.c
	$ git add --as xt_CONNMARK.c xt_connmark-1.c

Issues with case folding filesystems
------------------------------------

In addition to the above, case folding filesystems additionally
have an issue even when there is no "confusing" names in the
tree.  The project may want to have "Märchen" (but not
"Ma"rchen"), but a checkout (which is creat(2) of "Märchen" --
because that is the byte sequence recorded in tree objects and
the index) will result in "Ma"rchen" and no "Märchen" (hence
readdir(3) returns "Ma"rchen").

Linus's patch to use a hashtable that links "equivalent" names
together is a step in the right direction to address this.  The
tree (and the index) has name $B, we check out and the
filesystem folds it to $A.  When we get the name $A back from
the filesystem (via readdir(3)), we hash the name using a hash
function that would drop names $A and $B into the same bucket,
and compare that name $A with each hash entry using a comparison
that considers $A and $B are equivalent.  If we find one, then
we keep the name $B we have already.

If it is a new file, we won't find any name that is equivalent
to $A in the index, and we use the name $A obtained from
readdir(3).

BUT with a twist.

If the filesystem is known to be inconveniently case folding, we
are better off registering $B instead of $A (assuming we can
convert from $A to $B).

One bad issue during development is that we cannot sanely
emulate case folding behaviour on non case-folding filesystems
without wrapping open(2), lstat(2), and friends, because of the
assumption we made above in (2) where we defined the term "case
folding".  This means that the codepath to deal with case
folding filesystems inevitably are harder to debug.

Tasks
-----

 - Identify which case folding filesystems need to be supported,
   and make sure somebody understands its folding logic;

 - For each supported case folding logic, these are needed:

   - a hash function that throws "equivalent" names in the same
     bucket, to be used in Linus's patch;

   - a compare function to determine equivalent names;

   - a convert function that takes a possibly inconvenient form
     of equivalent name (i.e. $A above) as input and returns
     more convenient form (i.e. $B above)

 - Identify places that we use the names obtained from places
   other than the index and tree.  From these places, we would
   need to call the convert function to (de)mangle the name
   before they hit the index.

   Because we may be getting driven by something like:

	$ find | xargs git-foo

   handling readdir(3) we do ourselves any specially does not
   make much sense.  Any path from the user is suspect.

 - Identify places that we look for a name in the index, and
   perform equivalent comparison instead of memcmp(3) we
   traditionally did.  Linus's patch gives scaffolding for this.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
@ 2008-01-24 22:31                                                                                                           ` Nicolas Pitre
  2008-01-25  3:55                                                                                                             ` Martin Langhoff
  2008-01-25  4:12                                                                                                             ` Junio C Hamano
  2008-01-24 23:56                                                                                                           ` Sean
                                                                                                                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 260+ messages in thread
From: Nicolas Pitre @ 2008-01-24 22:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Johannes Schindelin, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

On Thu, 24 Jan 2008, Junio C Hamano wrote:

> If it is a new file, we won't find any name that is equivalent
> to $A in the index, and we use the name $A obtained from
> readdir(3).
> 
> BUT with a twist.
> 
> If the filesystem is known to be inconveniently case folding, we
> are better off registering $B instead of $A (assuming we can
> convert from $A to $B).

Why?

If you have no other representation for the file name than $A already, 
then I don't see why Git would have to play similar evil games and 
corru^H^H^Hnvert $A into $B.  Just store $A in the index and tree 
objects and be done with it.


Nicolas

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
  2008-01-24 22:31                                                                                                           ` Nicolas Pitre
@ 2008-01-24 23:56                                                                                                           ` Sean
  2008-01-25  0:36                                                                                                           ` Johannes Schindelin
  2008-01-25  4:00                                                                                                           ` Daniel Barkalow
  3 siblings, 0 replies; 260+ messages in thread
From: Sean @ 2008-01-24 23:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Johannes Schindelin, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

On Thu, 24 Jan 2008 13:02:54 -0800
Junio C Hamano <gitster@pobox.com> wrote:

> One bad issue during development is that we cannot sanely
> emulate case folding behaviour on non case-folding filesystems
> without wrapping open(2), lstat(2), and friends, because of the
> assumption we made above in (2) where we defined the term "case
> folding".  This means that the codepath to deal with case
> folding filesystems inevitably are harder to debug.

All true.  Though Linux support for creating and using HFS+ volumes
seems like it may be helpful.  Trying the test case patch[*] posted
by Mitch Tishmack showed the problem here.  The only slightly
strange thing was that there didn't seem to be an issue with the
gitweb/test/Märchen file after cloning to the HFS volume.

Sean.

[*]
$ dd bs=1M count=250 < /dev/zero > hfs_vol
  262144000 bytes (262 MB) copied, 6.12703 s, 42.8 MB/s

$ /sbin/mkfs.hfsplus -v Test -n c=4096,e=1024 hfs_vol
  Initialized hfs_vol as a 250 MB HFS Plus volume

$ mkdir hfs
$ sudo mount -t hfsplus -o loop hfs_vol hfs
$ sudo chmod a+rwx hfs
$ cd hfs
$ git clone ~/local/sources/git
  Initialized empty Git repository in ~/hfs/git/.git/
  49486 blocks

$ cd git
$ make
   ...

$ cd t
$ git apply ~/Mitch_Tishmack.patch
$ ./t0060-normalization.sh
  * FAIL 1: setup

	  touch ä &&
	  git add ä &&
	  git commit -m "initial"
	  git tag initial &&
	  git checkout -b topic &&
	  git mv ä tmp &&
	  git mv tmp ä &&
	  git commit -m "rename" &&
	  git checkout -f master
	
  * FAIL 2: rename (silent normalization)
	
	 git mv ä ä &&
	 git commit -m "rename"
	
  * FAIL 3: merge (silent normalization)

	 git reset --hard initial &&
	 git merge topic
	
  * failed 3 among 3 test(s)

Sean.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
  2008-01-24 22:31                                                                                                           ` Nicolas Pitre
  2008-01-24 23:56                                                                                                           ` Sean
@ 2008-01-25  0:36                                                                                                           ` Johannes Schindelin
  2008-01-25  4:00                                                                                                           ` Daniel Barkalow
  3 siblings, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-25  0:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Linus Torvalds, Kevin Ballard, Theodore Tso, Mike Hommey

Hi,

On Thu, 24 Jan 2008, Junio C Hamano wrote:

> [A nice, concise, well written and obviously thought-through summary of 
>  the case sensitivity and UTF-8 file name issues.]

Thank you Junio.  It must have taken much more time than just sitting 
down and hacking into the keyboard.  By this thinking before writing, you 
invested some time that you save all the readers, including me.  I 
appreciate that very much.

> [Goes on to describe what we do with symlinks when the filesystem is not 
>  capable of representing symlinks; compares that situation to the 
>  filenames situation.]

There is a fundamental difference between the symlinks situation and the 
filename situation that you should keep in mind:  even if the filesystem 
cannot create symlinks, the nature of filenames as unique keys is not 
changed.  You cannot have a symlink and a file of the same name.  In a 
way, it takes away a degree of freedom of the _values_ that the _keys_ 
point to.

The same is not true for the case-challenged filesystems; they change the 
nature from unique keys to semi-unique keys.  So while other filesystems 
can discern all different keys, these challenged filesystems cannot; they 
take away a degree of freedom of the _keys_.

It is much easier to cope with the lack of degree of freedom in values; 
you have to store the metadata somewhere else -- in this case the index -- 
but it is still easily accessible by the key.

But that is not possible if two different _keys_ are not accepted as 
different by the filesystem.  You can still store the different metadata 
in the index, but the _content_ cannot be in the filesystem under the 
desired keys; not at the same time, anyway.

> Perhaps we could have something like:
> 
> 	$ git show :xt_CONNMARK.c >xt_connmark-1.c
>         $ edit xt_connmark-1.c
> 	$ git add --as xt_CONNMARK.c xt_connmark-1.c

Something similar is already possible:

	$ git checkout xt_CONNMARK.c
	$ edit xt_CONNMARK.c
	$ git add xt_CONNMARK.c

but you have to keep in mind that

	- "git add -u" or "git commit -a" is a no-no-no, and
	- the system will not build, no matter what you change in git

on those filesystems.

Having said that, I think that a config variable/commit hooks for those 
repositories which _happen_ to live on sane filesystems, but have to be 
checked out on challenged ones, makes absolute sense.  (The commit hook is 
possible already, but less efficient than the config variable.)

> If it is a new file, we won't find any name that is equivalent to $A in 
> the index, and we use the name $A obtained from readdir(3).
> 
> BUT with a twist.
> 
> If the filesystem is known to be inconveniently case folding, we are 
> better off registering $B instead of $A (assuming we can convert from $A 
> to $B).

I tend to agree with Nico.  We should not "learn" from the challenged 
filesystems.

> Tasks
> -----
> 
>  - Identify which case folding filesystems need to be supported,
>    and make sure somebody understands its folding logic;
> 
>  - For each supported case folding logic, these are needed:
> 
>    - a hash function that throws "equivalent" names in the same
>      bucket, to be used in Linus's patch;

AFAIR Linus wanted to have one has function to rule them all.  That would 
be way cool, since it means fewer possibilities for bugs to go undetected.

>    - a compare function to determine equivalent names;

AFAICT we need three functions: strcasecmp(), utf8_strcmp() and 
utf8_strcasecmp().  Although I might be wrong, and the second is not 
needed.

Probably the answer for this has been buried in many, many lines that I 
decided not to read.  Maybe I'll ask Randal on IRC, he's usually very 
quick to give me reasonable and concise answers.  And then we trash-talk a 
little, just for fun.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 22:31                                                                                                           ` Nicolas Pitre
@ 2008-01-25  3:55                                                                                                             ` Martin Langhoff
  2008-01-25  4:18                                                                                                               ` Junio C Hamano
  2008-01-25  4:12                                                                                                             ` Junio C Hamano
  1 sibling, 1 reply; 260+ messages in thread
From: Martin Langhoff @ 2008-01-25  3:55 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Junio C Hamano, git, Johannes Schindelin, Linus Torvalds,
	Kevin Ballard, Theodore Tso, Mike Hommey

On Jan 25, 2008 11:31 AM, Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 24 Jan 2008, Junio C Hamano wrote:
>
> > If it is a new file, we won't find any name that is equivalent
> > to $A in the index, and we use the name $A obtained from
> > readdir(3).
> >
> > BUT with a twist.
> >
> > If the filesystem is known to be inconveniently case folding, we
> > are better off registering $B instead of $A (assuming we can
> > convert from $A to $B).
>
> Why?
>
> If you have no other representation for the file name than $A already,
> then I don't see why Git would have to play similar evil games and
> corru^H^H^Hnvert $A into $B.  Just store $A in the index and tree
> objects and be done with it.

Because if you happen to be on a case-challenged filesystem, and you
need to add a new file - say xt_CONNBARK.c, you can, even if your FS
only has xt_connbark.c to offer to git. Granted, it's harder, but it
can be done, and it means that powerusers and wrappers around git have
a hope of dealing with it.

I think this is an excellent plan.

There is one thing that I don't see in Junio's plan -- which is
excellent -- and is

 - a warning during checkout if the index contains "equivalent" paths
that will clobber eachother during checkout.
 - an optional warning/error during add, to be raised if I am adding a
path that is equivalent to an already-existing path in the index

The second one is to support project that are known to be developed on
these non-filename-preserving platforms. So if I am on a linux host,
and I add Readme where README already exists, a warning can save me
and the project a bit of grief - and possibly catch an unintended
mistake! Because while I agree that users should be able to store any
file in git, in practice most instances of Ma"rchen/Märchen case will
be due to user error (or editor/gui error).

cheers,


m

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
                                                                                                                             ` (2 preceding siblings ...)
  2008-01-25  0:36                                                                                                           ` Johannes Schindelin
@ 2008-01-25  4:00                                                                                                           ` Daniel Barkalow
  2008-01-25  4:21                                                                                                             ` Junio C Hamano
  2008-01-25  5:59                                                                                                             ` Jeff King
  3 siblings, 2 replies; 260+ messages in thread
From: Daniel Barkalow @ 2008-01-25  4:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Johannes Schindelin, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2012 bytes --]

On Thu, 24 Jan 2008, Junio C Hamano wrote:

> The same thing should apply to two files that cannot be checked
> out at the same time on case insensitive filesystems.  Perhaps
> we could have something like:
> 
> 	$ git show :xt_CONNMARK.c >xt_connmark-1.c
> 	$ edit xt_connmark-1.c
> 	$ git add --as xt_CONNMARK.c xt_connmark-1.c

I think it would be nicer to have:

$ git checkout branch
Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
$ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
$ edit xt_CONNMARK_caps.c
$ git add xt_CONNMARK_caps.c

Where the index, when support for filesystems with filename restrictions 
is enabled, keeps track both of the name of the file in the project and 
the name of the file in the filesystem, with this mapping determined 
entirely by the user asking for problem files to be present under 
different names in the working tree.

Of course, you can already do:

$ git update-index --cacheinfo 100644 $(git hash-object -w xt_connmark-1.c) xt_CONNMARK.c

> If it is a new file, we won't find any name that is equivalent
> to $A in the index, and we use the name $A obtained from
> readdir(3).
> 
> BUT with a twist.
> 
> If the filesystem is known to be inconveniently case folding, we
> are better off registering $B instead of $A (assuming we can
> convert from $A to $B).

Is it not the case that, when a user has a file in the filesystem with the 
name Ma"rchen, the user will still type:

$ git add Märchen

and so we see filenames which are convenient, and we don't overly care 
what readdir(3) returns for new filenames? I suppose there is the case of:

$ touch Märchen
$ git add .

Which has to figure out what the files in foo are. But the common case for 
a new filename is that it gets provided by the user in argv, and the right 
file contents come from the one that open(2) returns, and there's no 
obvious way to get the filename that readdir(3) would return for a 
filename in argv anyway.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-24 22:31                                                                                                           ` Nicolas Pitre
  2008-01-25  3:55                                                                                                             ` Martin Langhoff
@ 2008-01-25  4:12                                                                                                             ` Junio C Hamano
  2008-01-25  8:08                                                                                                               ` Pedro Melo
  2008-01-25 12:25                                                                                                               ` Johannes Schindelin
  1 sibling, 2 replies; 260+ messages in thread
From: Junio C Hamano @ 2008-01-25  4:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: git, Johannes Schindelin, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

Nicolas Pitre <nico@cam.org> writes:

> On Thu, 24 Jan 2008, Junio C Hamano wrote:
>
>> If it is a new file, we won't find any name that is equivalent
>> to $A in the index, and we use the name $A obtained from
>> readdir(3).
>> 
>> BUT with a twist.
>> 
>> If the filesystem is known to be inconveniently case folding, we
>> are better off registering $B instead of $A (assuming we can
>> convert from $A to $B).
>
> Why?
>
> If you have no other representation for the file name than $A already, 
> then I don't see why Git would have to play similar evil games and 
> corru^H^H^Hnvert $A into $B.  Just store $A in the index and tree 
> objects and be done with it.

Because this "conversion" is limited to the case where the
filesystem is known to be inconveniently case folding, I
personally do not care about this part of the outline that
deeply.  It would not bite _me_ or my friends either way.

But I would imagine that a person who has to work on HFS+ would
appreciate it if these two sequences behaved the same way:

    $ edit Märchen ;# assume this is a new file
    $ git add Märchen ;# we were told that IM gives $B (aka NFC)

vs

    $ edit Märchen ;# assume this is a new file
    $ git add M*en ;# now readdir(3) gives $A (aka NFD)

If we always convert $A (less interoperable form) to $B (more
interoperable form) on inconveniently case folding filesystems,
the new index entry will always be in form $B.  Without the
conversion, the former will give form $B while the latter will
give form $A.  It is, as you said, "similar evil game to
corrupt", but it is not even a corruption at that point, because
the inconveniently case folding filesystem already corrupted the
pathname before we get our hands on it, and it won't make a
difference for HFS+ only people anyway.

However, if the resulting tree that adds a new file is prepared
on an inconveniently case folding filesystem, the conversion
process, by definition, would make the resulting tree more
interoperable with other systems than without.

So I do not see any downside of doing the conversion on such a
filesystem but there is this "interoperability" upside.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  3:55                                                                                                             ` Martin Langhoff
@ 2008-01-25  4:18                                                                                                               ` Junio C Hamano
  0 siblings, 0 replies; 260+ messages in thread
From: Junio C Hamano @ 2008-01-25  4:18 UTC (permalink / raw)
  To: Martin Langhoff
  Cc: Nicolas Pitre, Junio C Hamano, git, Johannes Schindelin,
	Linus Torvalds, Kevin Ballard, Theodore Tso, Mike Hommey

"Martin Langhoff" <martin.langhoff@gmail.com> writes:

> There is one thing that I don't see in Junio's plan...
>
>  - a warning during checkout if the index contains "equivalent" paths
> that will clobber eachother during checkout.
>  - an optional warning/error during add, to be raised if I am adding a
> path that is equivalent to an already-existing path in the index

Thanks.  I think these and many other issues need to be worked
out.

In my message, I did not even try to be exhaustive.  I outlined
the parts that would most deeply affect the parts I care more
deeply about, which is the plumbing.  As I said many times, I do
not do Porcelains ;-).

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  4:00                                                                                                           ` Daniel Barkalow
@ 2008-01-25  4:21                                                                                                             ` Junio C Hamano
  2008-01-25 11:36                                                                                                               ` Johannes Schindelin
  2008-01-25  5:59                                                                                                             ` Jeff King
  1 sibling, 1 reply; 260+ messages in thread
From: Junio C Hamano @ 2008-01-25  4:21 UTC (permalink / raw)
  To: Daniel Barkalow
  Cc: git, Johannes Schindelin, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

Daniel Barkalow <barkalow@iabervon.org> writes:

> $ git checkout branch
> Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
> $ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
> $ edit xt_CONNMARK_caps.c
> $ git add xt_CONNMARK_caps.c

Heh, I like that very much.

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  4:00                                                                                                           ` Daniel Barkalow
  2008-01-25  4:21                                                                                                             ` Junio C Hamano
@ 2008-01-25  5:59                                                                                                             ` Jeff King
  1 sibling, 0 replies; 260+ messages in thread
From: Jeff King @ 2008-01-25  5:59 UTC (permalink / raw)
  To: Daniel Barkalow
  Cc: Junio C Hamano, git, Johannes Schindelin, Linus Torvalds,
	Kevin Ballard, Theodore Tso, Mike Hommey

On Thu, Jan 24, 2008 at 11:00:44PM -0500, Daniel Barkalow wrote:

> I think it would be nicer to have:
> 
> $ git checkout branch
> Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
> $ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
> $ edit xt_CONNMARK_caps.c
> $ git add xt_CONNMARK_caps.c
> 
> Where the index, when support for filesystems with filename restrictions 
> is enabled, keeps track both of the name of the file in the project and 
> the name of the file in the filesystem, with this mapping determined 
> entirely by the user asking for problem files to be present under 
> different names in the working tree.

Hrm. That makes me think: what if rather than doing utf8-ish
comparisons, the index stores a bidirectional mapping for any "munged"
names, and you can manipulate that mapping?

As in, the index entry for Märchen has an extra entry saying "I am
actually on the filesystem as Ma"rchen" (let's call this an alias) and
there is a pseudo-entry in the index for Ma"rchen that says "I'm not
really here. See Märchen" (let's call this a backref).

Then index-modifying commands like "git-add" or "git-checkout" can set
up the mapping, either manually (using --as or similar) or using a
particular munging scheme (git config core.filemunge hfs). Any time we
give an index path to the filesystem, we use its alias name. Any time we
look up an index entry and it ends up being a backref, we dereference
until we get a real entry. Index iterators would need to skip backrefs.

I think all systems would follow the same codepath, there is no penalty
for filenames which don't use the mapping, and it would be testable on
non-challenged filesystems. But perhaps I am missing some obvious
deficiency or impossibility.

-Peff

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  4:12                                                                                                             ` Junio C Hamano
@ 2008-01-25  8:08                                                                                                               ` Pedro Melo
  2008-01-25 12:25                                                                                                               ` Johannes Schindelin
  1 sibling, 0 replies; 260+ messages in thread
From: Pedro Melo @ 2008-01-25  8:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, git, Johannes Schindelin, Linus Torvalds,
	Kevin Ballard, Theodore Tso, Mike Hommey

Hi,

On Jan 25, 2008, at 4:12 AM, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
>> On Thu, 24 Jan 2008, Junio C Hamano wrote:
>>
>>> If it is a new file, we won't find any name that is equivalent
>>> to $A in the index, and we use the name $A obtained from
>>> readdir(3).
>>>
>>> BUT with a twist.
>>>
>>> If the filesystem is known to be inconveniently case folding, we
>>> are better off registering $B instead of $A (assuming we can
>>> convert from $A to $B).
>>
>> Why?
>>
>> If you have no other representation for the file name than $A  
>> already,
>> then I don't see why Git would have to play similar evil games and
>> corru^H^H^Hnvert $A into $B.  Just store $A in the index and tree
>> objects and be done with it.
>
> Because this "conversion" is limited to the case where the
> filesystem is known to be inconveniently case folding, I
> personally do not care about this part of the outline that
> deeply.  It would not bite _me_ or my friends either way.
>
> But I would imagine that a person who has to work on HFS+ would
> appreciate it if these two sequences behaved the same way:
>
>     $ edit Märchen ;# assume this is a new file
>     $ git add Märchen ;# we were told that IM gives $B (aka NFC)
>
> vs
>
>     $ edit Märchen ;# assume this is a new file
>     $ git add M*en ;# now readdir(3) gives $A (aka NFD)
>
> If we always convert $A (less interoperable form) to $B (more
> interoperable form) on inconveniently case folding filesystems,
> the new index entry will always be in form $B.  Without the
> conversion, the former will give form $B while the latter will
> give form $A.  It is, as you said, "similar evil game to
> corrupt", but it is not even a corruption at that point, because
> the inconveniently case folding filesystem already corrupted the
> pathname before we get our hands on it, and it won't make a
> difference for HFS+ only people anyway.
>
> However, if the resulting tree that adds a new file is prepared
> on an inconveniently case folding filesystem, the conversion
> process, by definition, would make the resulting tree more
> interoperable with other systems than without.

As a HFS+ user, I would welcome this very very much.

I want my tree to be seen by the rest of the world without problems,  
and I don't think I should impose my filesystem view of proper naming  
on others.

Junio, amazing write-up, many thanks.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  4:21                                                                                                             ` Junio C Hamano
@ 2008-01-25 11:36                                                                                                               ` Johannes Schindelin
  2008-01-25 16:25                                                                                                                 ` Daniel Barkalow
  0 siblings, 1 reply; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-25 11:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Daniel Barkalow, git, Linus Torvalds, Kevin Ballard, Theodore Tso,
	Mike Hommey

Hi,

On Thu, 24 Jan 2008, Junio C Hamano wrote:

> Daniel Barkalow <barkalow@iabervon.org> writes:
> 
> > $ git checkout branch
> > Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
> > $ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
> > $ edit xt_CONNMARK_caps.c
> > $ git add xt_CONNMARK_caps.c
> 
> Heh, I like that very much.

It would make it easier to test on Linux, too, yes.

But then, it would break the build process all the same.

And the implementation would _need_ the index extension Linus seems to 
resent so.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25  4:12                                                                                                             ` Junio C Hamano
  2008-01-25  8:08                                                                                                               ` Pedro Melo
@ 2008-01-25 12:25                                                                                                               ` Johannes Schindelin
  2008-01-25 12:50                                                                                                                 ` David Kastrup
  2008-01-25 12:53                                                                                                                 ` Wincent Colaiuta
  1 sibling, 2 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-25 12:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Nicolas Pitre, git, Linus Torvalds, Kevin Ballard, Theodore Tso,
	Mike Hommey

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1228 bytes --]

Hi,

On Thu, 24 Jan 2008, Junio C Hamano wrote:

>     $ edit Märchen ;# assume this is a new file
>     $ git add Märchen ;# we were told that IM gives $B (aka NFC)

Please see the discussion on IRC I started with

http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=16#l36

(asking for somebody to test the issue if filenames are still mangled when 
the volume was created _case-sensitive_).  The interesting part is this:

http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=31#l56

(me asking to git-add the file "Märchen" explicitely),

http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=39#l66

(dsymonds saying that no untracked files are shown), and

http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=47#l76

(dsymonds showing that the git index contains the mangled filename, _not_ 
what we asked for).  The strange thing is that

http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=57#l89

the command line seems not to be mangling the name.

Summary:

it seems that for some strange reason, "git add Märchen" puts the mangled 
filename into the index, even if "echo Märchen" shows the unmangled 
filename.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25 12:25                                                                                                               ` Johannes Schindelin
@ 2008-01-25 12:50                                                                                                                 ` David Kastrup
  2008-01-25 12:53                                                                                                                 ` Wincent Colaiuta
  1 sibling, 0 replies; 260+ messages in thread
From: David Kastrup @ 2008-01-25 12:50 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Nicolas Pitre, git, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> it seems that for some strange reason, "git add Märchen" puts the
> mangled filename into the index, even if "echo Märchen" shows the
> unmangled filename.

echo is likely a shell builtin.  So "git add Märchen" goes through exec
while echo doesn't.

What does /bin/echo Märchen yield?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25 12:25                                                                                                               ` Johannes Schindelin
  2008-01-25 12:50                                                                                                                 ` David Kastrup
@ 2008-01-25 12:53                                                                                                                 ` Wincent Colaiuta
  1 sibling, 0 replies; 260+ messages in thread
From: Wincent Colaiuta @ 2008-01-25 12:53 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, Nicolas Pitre, git, Linus Torvalds, Kevin Ballard,
	Theodore Tso, Mike Hommey

El 25/1/2008, a las 13:25, Johannes Schindelin escribió:

> The strange thing is that
>
> http://colabti.de/irclogger/irclogger_log/git?date=2008-01-25,Fri&sel=57#l89
>
> the command line seems not to be mangling the name.
>
> Summary:
>
> it seems that for some strange reason, "git add Märchen" puts the  
> mangled
> filename into the index, even if "echo Märchen" shows the unmangled
> filename.
>
> Ciao,
> Dscho

Not sure if I grokked the IRC interchange fully but check this out:

$ touch Märchen
$ echo Märchen | xxd -g1
0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.
$ echo Märchen | xxd -g1
0000000: 4d c3 a4 72 63 68 65 6e 0a                       M..rchen.

The first one shows me creating the file, then typing "echo M" and  
hitting tab so that the shell autocompletes the filename for me based  
on what it sees in the current directory. Note how it's decomposed.

The second one shows me manually typing the string "Märchen" with no  
tab autocompletion (literally typing ¨ then a), and you'll notice that  
this time it is precomposed.

So that might explain why "echo Märchen" is showing an unmangled name;  
if he just typed it out like I did then that would be the expected  
result.

Cheers,
Wincent

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25 11:36                                                                                                               ` Johannes Schindelin
@ 2008-01-25 16:25                                                                                                                 ` Daniel Barkalow
  2008-01-25 17:34                                                                                                                   ` Johannes Schindelin
  0 siblings, 1 reply; 260+ messages in thread
From: Daniel Barkalow @ 2008-01-25 16:25 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Junio C Hamano, git, Linus Torvalds, Kevin Ballard, Theodore Tso,
	Mike Hommey

On Fri, 25 Jan 2008, Johannes Schindelin wrote:

> Hi,
> 
> On Thu, 24 Jan 2008, Junio C Hamano wrote:
> 
> > Daniel Barkalow <barkalow@iabervon.org> writes:
> > 
> > > $ git checkout branch
> > > Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
> > > $ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
> > > $ edit xt_CONNMARK_caps.c
> > > $ git add xt_CONNMARK_caps.c
> > 
> > Heh, I like that very much.
> 
> It would make it easier to test on Linux, too, yes.
> 
> But then, it would break the build process all the same.

Sure, but it would permit a user of a filesystem that can't handle the 
project to make modifications that generate a commit the filesystem can 
handle, which is currently pretty difficult.

$ git checkout xt_CONNMARK.c --as xt_CONNMARK_tmp.c
$ mv xt_CONNMARK_tmp.c xt_connmark_flag.c
$ edit Makefile
$ git add xt_connmark_flag.c
$ git commit -a

(The key thing here being that git will determine that you removed 
xt_CONNMARK.c despite open(xt_CONNMARK.c) returning something unrelated)

Remember that this level of support is to allow users who can't have the 
project checked out in their filesystems to manipulate the project's data, 
not to actually make the project work as presented in the filesystem by 
git.

> And the implementation would _need_ the index extension Linus seems to 
> resent so.

Linus was objecting to having redundant information stored, because it 
could get skewed. If the information being stored is not redundant (i.e., 
the normal case is that entries have a flag saying they exist in the 
filesystem under their own names, and the new cases are that the entry 
isn't present in the filesystem at all or that the entry is present in the 
filesystem under some other name), that isn't an issue.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 260+ messages in thread

* Re: On pathnames
  2008-01-25 16:25                                                                                                                 ` Daniel Barkalow
@ 2008-01-25 17:34                                                                                                                   ` Johannes Schindelin
  0 siblings, 0 replies; 260+ messages in thread
From: Johannes Schindelin @ 2008-01-25 17:34 UTC (permalink / raw)
  To: Daniel Barkalow
  Cc: Junio C Hamano, git, Linus Torvalds, Kevin Ballard, Theodore Tso,
	Mike Hommey

Hi,

On Fri, 25 Jan 2008, Daniel Barkalow wrote:

> On Fri, 25 Jan 2008, Johannes Schindelin wrote:
> 
> > On Thu, 24 Jan 2008, Junio C Hamano wrote:
> > 
> > > Daniel Barkalow <barkalow@iabervon.org> writes:
> > > 
> > > > $ git checkout branch
> > > > Warning: xt_CONNMARK.c conflicts with xt_connmark.c; not checking it out
> > > > $ git checkout xt_CONNMARK.c --as xt_CONNMARK_caps.c
> > > > $ edit xt_CONNMARK_caps.c
> > > > $ git add xt_CONNMARK_caps.c
> > > 
> > > Heh, I like that very much.
> > 
> > It would make it easier to test on Linux, too, yes.
> > 
> > But then, it would break the build process all the same.
> 
> Sure, but it would permit a user of a filesystem that can't handle the 
> project to make modifications that generate a commit the filesystem can 
> handle, which is currently pretty difficult.
> 
> $ git checkout xt_CONNMARK.c --as xt_CONNMARK_tmp.c
> $ mv xt_CONNMARK_tmp.c xt_connmark_flag.c
> $ edit Makefile
> $ git add xt_connmark_flag.c
> $ git commit -a

AFAICT it is possible right now:

$ git checkout xt_CONNMARK.c
$ git mv xt_CONNMARK.c xt_connmark_flag.c
$ git checkout xt_connmark.c
$ edit Makefile
$ git add Makefile
$ git commit

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 260+ messages in thread

end of thread, other threads:[~2008-01-25 17:35 UTC | newest]

Thread overview: 260+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-16 15:17 git on MacOSX and files with decomposed utf-8 file names Mark Junker
2008-01-16 15:34 ` Johannes Schindelin
2008-01-16 15:43   ` Kevin Ballard
2008-01-16 16:32     ` Johannes Schindelin
2008-01-16 16:46       ` Jakub Narebski
2008-01-16 20:39         ` Kevin Ballard
2008-01-16 21:51           ` Jakub Narebski
2008-01-16 22:06             ` Kevin Ballard
2008-01-16 22:23               ` Johannes Schindelin
2008-01-16 23:16                 ` Kevin Ballard
2008-01-16 22:32               ` Linus Torvalds
2008-01-16 22:52                 ` Linus Torvalds
2008-01-16 23:11                 ` Kevin Ballard
2008-01-16 23:38                   ` Linus Torvalds
2008-01-16 23:57                     ` Pedro Melo
2008-01-17  0:16                       ` Linus Torvalds
2008-01-17  0:27                         ` Pedro Melo
2008-01-17  0:32                           ` David Kastrup
2008-01-17  0:40                             ` Pedro Melo
2008-01-17  0:54                               ` Wincent Colaiuta
2008-01-17  1:08                                 ` Johannes Schindelin
2008-01-17  1:41                                   ` Linus Torvalds
2008-01-17  4:07                                     ` Kevin Ballard
2008-01-17  0:35                           ` Johannes Schindelin
2008-01-17  0:45                             ` Pedro Melo
2008-01-18  8:29                         ` Peter Karlsson
2008-01-18 11:16                           ` Jakub Narebski
2008-01-16 23:58                     ` David Kastrup
2008-01-17  0:19                       ` Linus Torvalds
2008-01-17  0:09                     ` Kevin Ballard
2008-01-17  0:25                       ` Linus Torvalds
2008-01-17  0:33                         ` Johannes Schindelin
2008-01-17  0:43                           ` Pedro Melo
2008-01-17  0:57                             ` Johannes Schindelin
2008-01-17  1:06                           ` Linus Torvalds
2008-01-17  1:16                       ` Linus Torvalds
2008-01-17  3:52                         ` Kevin Ballard
2008-01-17  4:08                           ` Linus Torvalds
2008-01-17  4:30                             ` Kevin Ballard
2008-01-17  4:51                               ` Martin Langhoff
2008-01-17  5:23                                 ` Kevin Ballard
2008-01-17  6:13                                   ` Geert Bosch
2008-01-17  7:11                                     ` Mitch Tishmack
2008-01-17 10:22                                       ` Wincent Colaiuta
2008-01-17 13:44                                         ` Kevin Ballard
2008-01-17 15:57                                           ` Johannes Schindelin
2008-01-17 16:53                                             ` Kevin Ballard
2008-01-18  0:44                                               ` Robin Rosenberg
2008-01-17 14:02                                     ` Andrew Heybey
2008-01-17 15:04                                       ` Kevin Ballard
2008-01-19 19:29                                         ` Kyle Moffett
2008-01-19 19:57                                           ` Kevin Ballard
2008-01-17 10:08                             ` Wincent Colaiuta
2008-01-17 16:43                               ` Linus Torvalds
2008-01-17 18:09                                 ` Mark Junker
2008-01-17 18:12                                   ` Pedro Melo
2008-01-17 18:18                                     ` Johannes Schindelin
2008-01-17 18:36                                       ` Mark Junker
2008-01-17 18:38                                       ` Pedro Melo
2008-01-17 18:44                                     ` Linus Torvalds
2008-01-17 19:02                                       ` Pedro Melo
2008-01-17 18:42                                   ` Linus Torvalds
2008-01-17 18:50                                     ` Mark Junker
2008-01-17 18:52                                     ` Pedro Melo
     [not found]                                       ` <alpine.LFD.1.00.0801 171100330.14959@woody.linux-foundation.org>
2008-01-17 19:01                                       ` Theodore Tso
2008-01-17 19:11                                       ` Linus Torvalds
2008-01-18  0:18                                         ` Kevin Ballard
2008-01-18  0:35                                           ` Linus Torvalds
2008-01-18  1:05                                         ` Robin Rosenberg
2008-01-18  1:24                                           ` Linus Torvalds
2008-01-18  4:08                                             ` Brian Dessent
2008-01-18  8:49                                             ` Dmitry Potapov
2008-01-18  9:42                                             ` Robin Rosenberg
2008-01-18 10:30                                               ` Dmitry Potapov
2008-01-18 15:37                                                 ` Peter Karlsson
2008-01-18 17:24                                                   ` Jakub Narebski
2008-01-18 10:19                                         ` Peter Karlsson
2008-01-18 10:50                                           ` Dmitry Potapov
2008-01-18 15:30                                             ` Peter Karlsson
2008-01-18 17:11                                           ` Linus Torvalds
2008-01-18 20:24                                             ` Kevin Ballard
2008-01-19  8:48                                               ` Dmitry Potapov
2008-01-19 14:55                                                 ` Kevin Ballard
2008-01-19 21:17                                                   ` Dmitry Potapov
2008-01-19 18:58                                                 ` Linus Torvalds
2008-01-19 20:39                                                   ` Mark Junker
2008-01-19 22:58                                                   ` Johannes Schindelin
2008-01-20  6:14                                                     ` Dmitry Potapov
2008-01-20  6:53                                                       ` Linus Torvalds
2008-01-20 13:15                                                       ` Johannes Schindelin
2008-01-20  0:11                                                   ` Wincent Colaiuta
2008-01-20  1:04                                                     ` Linus Torvalds
2008-01-20  5:27                                                       ` Mike Hommey
2008-01-20  5:45                                                         ` Linus Torvalds
2008-01-20  7:00                                                           ` Mike Hommey
2008-01-20  7:26                                                             ` Linus Torvalds
2008-01-20  8:00                                                             ` Dmitry Potapov
2008-01-20  8:12                                                               ` Dmitry Potapov
2008-01-20  9:34                                                       ` Wincent Colaiuta
2008-01-18 20:28                                             ` Junio C Hamano
2008-01-18 20:50                                               ` Johannes Schindelin
2008-01-23  2:46                                               ` Eric W. Biederman
2008-01-23  2:57                                                 ` Junio C Hamano
2008-01-23 14:26                                                   ` Nicolas Pitre
2008-01-23 21:19                                                     ` Junio C Hamano
2008-01-21 14:14                                             ` Peter Karlsson
2008-01-21 16:43                                               ` Kevin Ballard
2008-01-21 16:48                                                 ` David Kastrup
2008-01-21 16:59                                                   ` Kevin Ballard
2008-01-21 20:43                                                     ` Dmitry Potapov
2008-01-21 20:53                                                       ` Kevin Ballard
2008-01-21 21:05                                                         ` David Kastrup
2008-01-21 23:01                                                         ` Dmitry Potapov
2008-01-21 16:53                                                 ` Jeff King
2008-01-21 17:08                                                 ` Nicolas Pitre
2008-01-21 17:25                                                   ` Kevin Ballard
2008-01-21 20:35                                                     ` David Kastrup
2008-01-21 20:32                                                   ` David Kastrup
2008-01-21 18:12                                                 ` Linus Torvalds
2008-01-21 19:05                                                   ` Kevin Ballard
2008-01-21 19:41                                                     ` Linus Torvalds
2008-01-21 19:58                                                       ` Kevin Ballard
2008-01-21 20:33                                                         ` Linus Torvalds
2008-01-21 20:53                                                           ` Kevin Ballard
     [not found]                                                             ` <alpine.LFD.1.0! 0.0801211323120.2957@woody.linux-foundation.org>
2008-01-21 20:58                                                             ` David Kastrup
2008-01-21 21:17                                                             ` Martin Langhoff
2008-01-21 21:28                                                               ` Kevin Ballard
2008-01-21 21:43                                                                 ` Martin Langhoff
2008-01-21 21:33                                                             ` Linus Torvalds
2008-01-21 21:49                                                               ` Kevin Ballard
2008-01-21 22:34                                                                 ` Linus Torvalds
2008-01-21 22:46                                                                   ` Kevin Ballard
2008-01-21 22:56                                                                     ` Martin Langhoff
     [not found]                                                                       ` <53C76BEA-2232-4940-8776-9DF1880089A4@sb.org>
2008-01-21 23:05                                                                         ` Kevin Ballard
2008-01-21 23:16                                                                         ` Martin Langhoff
2008-01-22  0:30                                                                           ` Kevin Ballard
2008-01-21 23:00                                                                     ` Theodore Tso
2008-01-21 23:09                                                                       ` Kevin Ballard
2008-01-21 23:44                                                                     ` Linus Torvalds
2008-01-22  0:47                                                                       ` Kevin Ballard
2008-01-22  1:01                                                                         ` Linus Torvalds
2008-01-22  1:13                                                                           ` Linus Torvalds
2008-01-22  2:33                                                                             ` Kevin Ballard
2008-01-22  2:50                                                                               ` Linus Torvalds
2008-01-22  3:04                                                                                 ` Kevin Ballard
2008-01-22  3:17                                                                                   ` Linus Torvalds
2008-01-22  3:21                                                                                   ` Martin Langhoff
2008-01-22  4:22                                                                                     ` Kevin Ballard
     [not found]                                                                                   ` <20080122133427.GB17804@mit.edu>
2008-01-23  0:08                                                                                     ` Theodore Tso
2008-01-23  0:38                                                                                       ` Kevin Ballard
2008-01-23  1:47                                                                                         ` Martin Langhoff
2008-01-23  2:06                                                                                         ` Theodore Tso
2008-01-23  8:45                                                                                         ` David Kastrup
2008-01-23  0:38                                                                                       ` Linus Torvalds
2008-01-23  1:14                                                                                         ` Martin Langhoff
2008-01-23  1:16                                                                                         ` Kevin Ballard
2008-01-23  1:27                                                                                           ` Martin Langhoff
2008-01-23  1:33                                                                                         ` Theodore Tso
2008-01-23  1:56                                                                                           ` Linus Torvalds
2008-01-23  2:02                                                                                             ` Kevin Ballard
2008-01-23  6:41                                                                                           ` Mike Hommey
2008-01-23  8:15                                                                                             ` Kevin Ballard
2008-01-23  8:43                                                                                               ` Dmitry Potapov
2008-01-23  9:02                                                                                                 ` Jonathan del Strother
2008-01-23  9:12                                                                                                   ` Dmitry Potapov
2008-01-23  9:19                                                                                                     ` Mike Hommey
2008-01-23  9:32                                                                                                       ` Dmitry Potapov
2008-01-23  9:40                                                                                               ` Mike Hommey
2008-01-23 13:38                                                                                                 ` Theodore Tso
2008-01-23 16:16                                                                                                   ` Linus Torvalds
2008-01-23 17:12                                                                                                     ` Theodore Tso
2008-01-23 17:19                                                                                                     ` Kevin Ballard
2008-01-23 17:32                                                                                                       ` Linus Torvalds
2008-01-24 21:02                                                                                                         ` On pathnames Junio C Hamano
2008-01-24 22:31                                                                                                           ` Nicolas Pitre
2008-01-25  3:55                                                                                                             ` Martin Langhoff
2008-01-25  4:18                                                                                                               ` Junio C Hamano
2008-01-25  4:12                                                                                                             ` Junio C Hamano
2008-01-25  8:08                                                                                                               ` Pedro Melo
2008-01-25 12:25                                                                                                               ` Johannes Schindelin
2008-01-25 12:50                                                                                                                 ` David Kastrup
2008-01-25 12:53                                                                                                                 ` Wincent Colaiuta
2008-01-24 23:56                                                                                                           ` Sean
2008-01-25  0:36                                                                                                           ` Johannes Schindelin
2008-01-25  4:00                                                                                                           ` Daniel Barkalow
2008-01-25  4:21                                                                                                             ` Junio C Hamano
2008-01-25 11:36                                                                                                               ` Johannes Schindelin
2008-01-25 16:25                                                                                                                 ` Daniel Barkalow
2008-01-25 17:34                                                                                                                   ` Johannes Schindelin
2008-01-25  5:59                                                                                                             ` Jeff King
2008-01-23 20:18                                                                                                       ` git on MacOSX and files with decomposed utf-8 file names Jay Soffian
     [not found]                                                                                                         ` <1DC841ED-634F-412C-9560-F37E4172A4CD@sb.org>
     [not found]                                                                                                           ` <76718490801231421l7b6552f8sec13f570360198b@mail.gmail.com>
     [not found]                                                                                                             ` <4F906435-A186-4E98-8865-F185D75F14D4@sb.org>
     [not found]                                                                                                               ` <76718490801231517h6d57e5bfkc19d394d38ad19db@mail.gmail.com>
2008-01-24  2:05                                                                                                                 ` Kevin Ballard
2008-01-24  3:11                                                                                                                   ` Junio C Hamano
2008-01-24  4:37                                                                                                                     ` Martin Langhoff
2008-01-24  5:30                                                                                                                       ` Kevin Ballard
2008-01-24  6:39                                                                                                                         ` Steffen Prohaska
2008-01-24 18:17                                                                                                                           ` Mitch Tishmack
2008-01-24 18:52                                                                                                                           ` Mitch Tishmack
2008-01-24 19:58                                                                                                                             ` Kevin Ballard
2008-01-23 23:37                                                                                                       ` Martin Langhoff
2008-01-23 16:58                                                                                                 ` Kevin Ballard
2008-01-23 17:39                                                                                                   ` Dmitry Potapov
2008-01-23 17:47                                                                                                     ` Kevin Ballard
2008-01-21 19:57                                                     ` Theodore Tso
2008-01-21 20:01                                                       ` Kevin Ballard
2008-01-21 20:15                                                         ` Theodore Tso
2008-01-21 20:31                                                           ` Kevin Ballard
2008-01-21 20:46                                                             ` Theodore Tso
2008-01-21 20:59                                                               ` Kevin Ballard
     [not found]                                                               ` <6E303071-82A4-4D69-AA0C-EC41168B9AFE@sb.org>
2008-01-21 21:18                                                                 ` Theodore Tso
2008-01-21 21:43                                                                   ` Kevin Ballard
2008-01-21 21:49                                                                     ` Martin Langhoff
2008-01-21 21:57                                                                       ` Kevin Ballard
2008-01-22  0:36                                                                         ` Johannes Schindelin
2008-01-22  0:42                                                                           ` Kevin Ballard
2008-01-22  0:48                                                                             ` David Kastrup
2008-01-22  1:06                                                                             ` Martin Langhoff
2008-01-22  1:34                                                                             ` Johannes Schindelin
2008-01-22  1:53                                                                               ` Martin Langhoff
2008-01-22  2:03                                                                                 ` Johannes Schindelin
2008-01-21 22:38                                                                     ` David Kastrup
2008-01-22  2:34                                                                       ` Kevin Ballard
2008-01-22  7:51                                                                         ` David Kastrup
2008-01-21 20:56                                                     ` Dmitry Potapov
2008-01-21 21:07                                                       ` Kevin Ballard
2008-01-21 22:41                                                         ` Dmitry Potapov
2008-01-21 22:53                                                           ` Kevin Ballard
2008-01-21 23:21                                                             ` Dmitry Potapov
2008-01-21 19:44                                                   ` Mike Hommey
2008-01-21 20:36                                                   ` Dmitry Potapov
2008-01-21 21:06                                                   ` Martin Langhoff
2008-01-21 21:09                                                     ` David Kastrup
2008-01-21 21:42                                                     ` Linus Torvalds
2008-01-21 22:45                                                       ` Martin Langhoff
2008-01-21 20:30                                                 ` Dmitry Potapov
2008-01-21 18:16                                               ` Linus Torvalds
2008-01-17 21:27                                   ` Dmitry Potapov
2008-01-17 22:01                                 ` JM Ibanez
2008-01-17 22:09                                   ` Johannes Schindelin
2008-01-18  1:27                                     ` Robin Rosenberg
2008-01-17 23:05                                   ` Linus Torvalds
2008-01-17 23:10                                   ` Dmitry Potapov
2008-01-16 23:52           ` Dmitry Potapov
2008-01-16 22:37       ` Eyvind Bernhardsen
2008-01-16 23:03     ` Wincent Colaiuta
2008-01-17  7:29     ` Miles Bader
2008-01-17  4:43 ` Jay Soffian
2008-01-17  4:59   ` Jay Soffian
2008-01-17  5:15     ` Junio C Hamano
2008-01-17 10:28       ` Wincent Colaiuta
2008-01-17 11:10         ` Johannes Schindelin
2008-01-17 11:23           ` Pedro Melo
2008-01-17 11:51             ` Wincent Colaiuta
2008-01-17 12:53               ` Johannes Schindelin
2008-01-17 13:40                 ` Wincent Colaiuta
2008-01-17 17:58               ` Junio C Hamano
2008-01-17 18:22                 ` Johan Herland
2008-01-17 13:05             ` Johannes Schindelin
2008-01-17 11:46           ` Wincent Colaiuta
2008-01-17  5:11   ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).