Re-casing directories on case-insensitive systems

* Re-casing directories on case-insensitive systems
@ 2008-01-11 20:19 Kevin Ballard
  2008-01-11 21:09 ` Kevin Ballard
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 20:19 UTC (permalink / raw)
  To: git

Somehow I managed to change the case of a directory without git  
realizing it. I thought I issued `git mv CS4536 cs4536` but since that  
won't work in my efforts to reproduce the problem, I must have simply  
issued the `mv` outside of git and then re-added it.

Anyway, here's the state of my directory:

kevin@KBALLARD:~/Documents/School/C07> git ls-tree HEAD
040000 tree b47c8103e2e01fcf145bdc237c0e56ffc61f1c47	CS4536
040000 tree dbf7fc51ef3effebdf9b4e9172e4c86cae52b163	cs4536
040000 tree 15834a7b6534a285bf6930be4e5404b37e1dc718	ece3601
040000 tree 62d229b8c4a389b550df20a3752d666c48c767a4	ma2071

Note that I have both versions of the directory present.  
Unfortunately, only one of them can be present on the filesystem. If I  
run `mv cs4536 CS4536; git reset --hard` I end up with a different  
working tree.

Git should be able to detect this sort of conflict on a case-insensitive
system. I didn't even realize what I'd done until I pushed
back to the master repo and ran `git reset --hard` there, then  
wondered why the new file I added to cs4536/ was missing and why my  
directory was still named CS4536.
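
A minimal sketch of the kind of check being asked for, assuming the entry names of a tree are available as plain strings. The function name `find_case_collisions` is hypothetical, and this is not how git itself works; it only shows that names differing solely by case can be grouped by a case-folded key.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of names that differ only by case."""
    groups = defaultdict(list)
    for name in names:
        # casefold() gives a case-insensitive comparison key
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Entry names taken from the `git ls-tree HEAD` output above
tree_entries = ["CS4536", "cs4536", "ece3601", "ma2071"]
collisions = find_case_collisions(tree_entries)
print(collisions)
```

On the tree above, the only group reported is the CS4536/cs4536 pair.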

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 20:19 Re-casing directories on case-insensitive systems Kevin Ballard
@ 2008-01-11 21:09 ` Kevin Ballard
  2008-01-11 21:19   ` Kevin Ballard
                     ` (2 more replies)
  2008-01-11 21:18 ` Linus Torvalds
  2008-01-11 21:29 ` Johannes Schindelin
  2 siblings, 3 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 21:09 UTC (permalink / raw)
  To: git

Wow, it's even worse. I made a tmp branch and used git-filter-branch
to remove the commit that introduced CS4536, leaving only the cs4536
directory. But now if I try and run `git co master` it refuses, as it
thinks it's going to overwrite the untracked file
CS4536/introduction.txt. I believe it's actually seeing the tracked
file cs4536/introduction.txt.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:09 ` Kevin Ballard
@ 2008-01-11 21:19   ` Kevin Ballard
  2008-01-11 21:25   ` Linus Torvalds
  2008-01-11 21:59   ` Robin Rosenberg
  2 siblings, 0 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 21:19 UTC (permalink / raw)
  To: git

Err, make that git-rebase (for removing the commit).

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:09 ` Kevin Ballard
  2008-01-11 21:19   ` Kevin Ballard
@ 2008-01-11 21:25   ` Linus Torvalds
  2008-01-11 21:59   ` Robin Rosenberg
  2 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2008-01-11 21:25 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

On Fri, 11 Jan 2008, Kevin Ballard wrote:
>
> Wow, it's even worse. I made a tmp branch and used git-filter-branch to remove
> the commit that introduced CS4536, leaving only the cs4536 directory. But now
> if I try and run `git co master` it refuses, as it thinks it's going to
> overwrite the untracked file CS4536/introduction.txt. I believe it's actually
> seeing the tracked file cs4536/introduction.txt.

If you don't have any dirty state, I'd suggest removing your working tree 
before doing a "git checkout". That's needed anyway to make sure that your 
working tree has the same case as your index and git trees, because 
otherwise, since the crazy filesystem thinks that CS4536 and cs4536 are the 
same, you might end up having all the wrong names.

Case differences are hard anyway, but you probably made them even harder 
by then using a rename that actually meant that the old name still 
*existed* in the filesystem (since the new name would always map to the 
old name thanks to your crazy filesystem).

I'm sure you can get into even more trouble with case-independent 
filesystems, but I think you did a pretty good job of hitting on one of 
the craziest cases ;)

		Linus

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:09 ` Kevin Ballard
  2008-01-11 21:19   ` Kevin Ballard
  2008-01-11 21:25   ` Linus Torvalds
@ 2008-01-11 21:59   ` Robin Rosenberg
  2 siblings, 0 replies; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-11 21:59 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

On Friday, 11 January 2008, Kevin Ballard wrote:
> Wow, it's even worse. I made a tmp branch and used git-filter-branch
> to remove the commit that introduced CS4536, leaving only the cs4536
> directory. But now if I try and run `git co master` it refuses, as it
> thinks it's going to overwrite the untracked file
> CS4536/introduction.txt. I believe it's actually seeing the tracked
> file cs4536/introduction.txt.

I think you should try an index filter. That should help you avoid file system problems.

-- robin

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 20:19 Re-casing directories on case-insensitive systems Kevin Ballard
  2008-01-11 21:09 ` Kevin Ballard
@ 2008-01-11 21:18 ` Linus Torvalds
  2008-01-11 21:29 ` Johannes Schindelin
  2 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2008-01-11 21:18 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

On Fri, 11 Jan 2008, Kevin Ballard wrote:
>
> Anyway, here's the state of my directory:
> 
> kevin@KBALLARD:~/Documents/School/C07> git ls-tree HEAD
> 040000 tree b47c8103e2e01fcf145bdc237c0e56ffc61f1c47	CS4536
> 040000 tree dbf7fc51ef3effebdf9b4e9172e4c86cae52b163	cs4536
> 040000 tree 15834a7b6534a285bf6930be4e5404b37e1dc718	ece3601
> 040000 tree 62d229b8c4a389b550df20a3752d666c48c767a4	ma2071

Hmm. You can do something like

	git ls-files CS4536 | xargs git update-index --force-remove

which will remove git's knowledge of that directory even though "lstat()" 
will still claim that all the files still exist.
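
As a toy model of what that pipeline does to the index (not git's implementation), the sketch below treats the index as a list of exact path strings and drops every entry under one spelling of the directory. The helper name `force_remove_tree` and the file names other than `introduction.txt` are assumptions made for illustration.

```python
def force_remove_tree(index_paths, directory):
    """Drop every index path under `directory`, comparing names exactly."""
    prefix = directory.rstrip("/") + "/"
    return [path for path in index_paths if not path.startswith(prefix)]


index_paths = [
    "CS4536/introduction.txt",
    "cs4536/introduction.txt",
    "cs4536/notes.txt",        # hypothetical file, for illustration only
    "ece3601/syllabus.txt",    # hypothetical file, for illustration only
]

# Only the upper-case spelling is removed; the index compares names exactly,
# even though a case-insensitive filesystem would treat both as one directory.
remaining = force_remove_tree(index_paths, "CS4536")
print(remaining)
```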

Case-insensitive filesystems are a pain.

I wish we had some way to handle it sanely, but I don't think a sane 
solution to case-insensitivity exists. If you limit it to strictly just 
7bit ascii names (like in your example), then some of the problems do go 
away, but it would still be probably fairly major surgery to try to teach 
git about the whole insanity of a case-independent working tree.

		Linus

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 20:19 Re-casing directories on case-insensitive systems Kevin Ballard
  2008-01-11 21:09 ` Kevin Ballard
  2008-01-11 21:18 ` Linus Torvalds
@ 2008-01-11 21:29 ` Johannes Schindelin
  2008-01-11 21:44   ` Kevin Ballard
  2 siblings, 1 reply; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-11 21:29 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

Hi,

On Fri, 11 Jan 2008, Kevin Ballard wrote:

> Somehow I managed to change the case of a directory without git 
> realizing it. I thought I issued `git mv CS4536 cs4536` but since that 
> won't work in my efforts to reproduce the problem, I must have simply 
> issued the `mv` outside of git and then re-added it.
> 
> Anyway, here's the state of my directory:
> 
> kevin@KBALLARD:~/Documents/School/C07> git ls-tree HEAD
> 040000 tree b47c8103e2e01fcf145bdc237c0e56ffc61f1c47	CS4536
> 040000 tree dbf7fc51ef3effebdf9b4e9172e4c86cae52b163	cs4536
> 040000 tree 15834a7b6534a285bf6930be4e5404b37e1dc718	ece3601
> 040000 tree 62d229b8c4a389b550df20a3752d666c48c767a4	ma2071
> 
> Note that I have both versions of the directory present. Unfortunately, 
> only one of them can be present on the filesystem. If I run `mv cs4536 
> CS4536; git reset --hard` I end up with a different working tree.
> 
> Git should be able to detect this sort of conflict on a case-insensitive 
> system.

Do not blame git for the shortcomings of your setup!

However, as luck has it, I looked into this issue again, as somebody 
raised it with msysgit (for obvious reasons; file systems on Windows are 
case challenged).  If you are serious about this problem, I can give you 
tons of pointers how to go solve it.  (Although I might be disconnected 
this weekend, because of the lack of competence of the IT department 
here.)

Ciao,
Dscho

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:29 ` Johannes Schindelin
@ 2008-01-11 21:44   ` Kevin Ballard
  2008-01-11 22:05     ` Johannes Schindelin
  2008-01-11 22:08     ` Linus Torvalds
  0 siblings, 2 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 21:44 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On Jan 11, 2008, at 4:29 PM, Johannes Schindelin wrote:

>> Git should be able to detect this sort of conflict on a
>> case-insensitive system.
>
> Do not blame git for the shortcomings of your setup!

Oh, I'm not surprised git doesn't handle this case, nor do I think  
git's required to. I merely think that, given the increasing relevance  
of OS X and the fact that it uses a case-insensitive system by  
default, this sort of problem is going to occur more and more  
frequently and it's quite a learning experience trying to fix it by  
hand. It would be very helpful if git could detect these problems  
itself.

> However, as luck has it, I looked into this issue again, as somebody
> raised it with msysgit (for obvious reasons; file systems on Windows
> are case challenged).  If you are serious about this problem, I can
> give you tons of pointers how to go solve it.  (Although I might be
> disconnected this weekend, because of the lack of competence of the
> IT department here.)

I think I've got a handle on it. I've already expunged the mis-cased  
file using git-rebase to remove the offending commit, now I just need  
to rewrite the second commit's message so it looks like the original  
commit (luckily I didn't do any work in the directory before I
re-cased it). Thanks anyway.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:44   ` Kevin Ballard
@ 2008-01-11 22:05     ` Johannes Schindelin
  2008-01-11 22:08     ` Linus Torvalds
  1 sibling, 0 replies; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-11 22:05 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

Hi,

On Fri, 11 Jan 2008, Kevin Ballard wrote:

> On Jan 11, 2008, at 4:29 PM, Johannes Schindelin wrote:
> 
> > If you are serious about this problem, I can give you tons of pointers 
> > how to go solve it.  (Although I might be disconnected this weekend, 
> > because of the lack of competence of the IT department here.)
> 
> I think I've got a handle on it. I've already expunged the mis-cased 
> file using git-rebase to remove the offending commit, now I just need to 
> rewrite the second commit's message so it looks like the original commit 
> (luckily I didn't do any work in the directory before I re-cased it). 
> Thanks anyway.

I was not talking about fixing it up in your repository.  If you really 
think that git should help you, you gotta teach it to.  Because people who 
do not experience the same as you will be less likely to feel the urge to 
teach git to help in that situation (because they did not experience that 
situation yet).  For example, I am one of these people.  And I guess a lot 
of these people hang out on this list.

Ciao,
Dscho

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 21:44   ` Kevin Ballard
  2008-01-11 22:05     ` Johannes Schindelin
@ 2008-01-11 22:08     ` Linus Torvalds
  2008-01-11 23:10       ` David Kastrup
                         ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Linus Torvalds @ 2008-01-11 22:08 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Johannes Schindelin, git

On Fri, 11 Jan 2008, Kevin Ballard wrote:
> 
> Oh, I'm not surprised git doesn't handle this case, nor do I think git's
> required to. I merely think that, given the increasing relevance of OS X and
> the fact that it uses a case-insensitive system by default, this sort of
> problem is going to occur more and more frequently and it's quite a learning
> experience trying to fix it by hand. It would be very helpful if git could
> detect these problems itself.

I do agree that we could/should do something to help with case-insensitive 
filesystems.

I absolutely *detest* those things, and I think that people who design 
them are total morons - with MS-DOS, you could understand it (people 
didn't know better), but with OS X?

But considering that they exist, we should probably offer at least *some* 
help for people who didn't realize that you could make OS X behave better.

However, it's not like there is even a simple solution. The right place to 
do that check would probably be in "add_index_entry()", but doing a check 
whether the same file already exists (in a different case) is simply 
*extremely* expensive for a very critical piece of code, unless we were to 
change that index data structure a lot (ie add a separate hash for the 
filenames).

Inside the Linux kernel, we have support for insane case-insensitive 
filesystems, and it really does need a lot of effort to do an even 
half-way decent thing while not penalizing the sane case. So it's hard.

(And that's totally ignoring the fact that case-insensitivity then also 
has tons of i18n issues and can get *really* messy - in the above I'm 
talking purely about the issues that would hit us even with 7-bit straight 
ASCII).

So handling case-insensitivity (even when you restrict it to ASCII-only) is 
actually rather messy. The obvious thing to do is to sort everything using 
a case-insensitive sort, but that in turn would break all the rules git 
has for the sorting of trees.

So you can't just change the sort order: you'd literally have to have two 
*different* lookup keys for the index (the "strict sort" and the
"case-insensitive sort"), and keep them both around.

Almost all of the code that actually touches the index is in read-cache.c, 
and it's not like that is a very complex data structure (or a very big 
file), so adding another key to the sorting probably wouldn't be too 
horrid. But it's definitely a lot more than just a few lines of code!
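
The two-key idea can be sketched roughly as follows. This is a toy model, not the structure in read-cache.c: the class `ToyIndex` and its fields are hypothetical, and the `ece3601/notes.txt` path is invented for the example. The strict ordering is kept in a sorted list, while a separate hash keyed on the case-folded name makes the collision check cheap.

```python
import bisect


class ToyIndex:
    """Toy index with a strict sort key plus a case-folded lookup key."""

    def __init__(self):
        self.sorted_names = []  # strict ordering, unchanged by the extra key
        self.folded = {}        # case-folded name -> exact name already stored

    def add(self, name):
        key = name.casefold()
        existing = self.folded.get(key)
        if existing is not None and existing != name:
            raise ValueError(f"{name!r} differs only by case from {existing!r}")
        if existing is None:
            bisect.insort(self.sorted_names, name)
            self.folded[key] = name


index = ToyIndex()
for path in ["cs4536/introduction.txt", "ece3601/notes.txt", "Makefile"]:
    index.add(path)

try:
    index.add("CS4536/introduction.txt")
    collided = False
except ValueError as err:
    collided = True
    print(err)
```

The sorted list keeps its strict order regardless of the second key, so the sorting rules for trees are not disturbed; the price is keeping both structures in sync on every update.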

		Linus

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 22:08     ` Linus Torvalds
@ 2008-01-11 23:10       ` David Kastrup
  2008-01-11 23:12         ` Kevin Ballard
  2008-01-11 23:26       ` Robin Rosenberg
  2008-01-12 14:46       ` Dmitry Potapov
  2 siblings, 1 reply; 29+ messages in thread
From: David Kastrup @ 2008-01-11 23:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I do agree that we could/should do something to help with case-insensitive 
> filesystems.
>
> I absolutely *detest* those things, and I think that people who design 
> them are total morons - with MS-DOS, you could understand it (people 
> didn't know better),

Ah, those young whippersnappers who think they are so smart...  there is
a history to that, you know.  Early character sets (like those on punch
cards) had just capital letters.  Even when lowercase letters were
introduced, those tended to use more space (12 instead of 6 bit) and be
harder to print (and the line printers who churned out 40 lines per
second did not bother with such finesse, anyway).  But capital letters
are not designed for readability of long lines.  So when printing or
even screen terminals came into use, one tended to prefer writing in
lowercase letters.  Which actually had the uppercase code points
usually.  Some early microcomputers (for which CP/M was designed)
actually were hooked up with a "standard" 50 or 110 Baud teletype as I/O
device, and those tended to have only lowercase letters, too, in their
basic incantations.  So CP/M, not knowing which kind of input device
would be used and whether it would prefer (or offer exclusively) upper
or lower case, had case-insensitive commands, and consequently also
case-insensitive file names.  And QDOS (whence MSDOS) was basically
intended to be a CP/M ripoff.

> but with OS X?

OS X has an Apple inheritance, Apple has the same inheritance as other
microcomputers which includes a case-insensitive BASIC interpreter
(BASIC again coming from old teletype times).  It is, again, a decision
to drag along old history.

But actually, there is more to it nowadays: two file names containing ü,
but one with a single letter and one with combining accent, look exactly
the same.  If they don't act exactly the same, one opens up quite a hole
for spoofing attacks.  Well, probably hard to avoid (since things like
uppercase Alpha and uppercase A look the same and need to be different
code points, too).  But one also opens a can of worms for confusion.  So
the problem of canonical file names does not go away just with case
sensitivity.

> But considering that they exist, we should probably offer at least
> *some* help for people who didn't realize that you could make OS X
> behave better.

It is not like Linux does not support some case-insensitive file system
types, too.  So the same problems can be had there as well.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 23:10       ` David Kastrup
@ 2008-01-11 23:12         ` Kevin Ballard
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 23:12 UTC (permalink / raw)
  To: David Kastrup; +Cc: Linus Torvalds, Johannes Schindelin, git

On Jan 11, 2008, at 6:10 PM, David Kastrup wrote:

>> But considering that they exist, we should probably offer at least
>> *some* help for people who didn't realize that you could make OS X
>> behave better.
>
> It is not like Linux does not support some case-insensitive file
> system types, too.  So the same problems can be had there as well.


In addition, while there is an option for HFS+ Case-Sensitive, using  
that can cause bad things to happen as Mac OS X programs are written  
under a case-insensitive assumption and may behave badly when  
presented with a case-sensitive filesystem.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



* Re: Re-casing directories on case-insensitive systems
  2008-01-11 22:08     ` Linus Torvalds
  2008-01-11 23:10       ` David Kastrup
@ 2008-01-11 23:26       ` Robin Rosenberg
  2008-01-12  0:03         ` Kevin Ballard
  2008-01-12  0:37         ` Junio C Hamano
  2008-01-12 14:46       ` Dmitry Potapov
  2 siblings, 2 replies; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-11 23:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git

On Friday, 11 January 2008, Linus Torvalds wrote:
> I do agree that we could/should do something to help with case-insensitive 
> filesystems.
> 
> I absolutely *detest* those things, and I think that people who design 
> them are total morons - with MS-DOS, you could understand it (people 
> didn't know better), but with OS X?

Could it be some comfort that the other SCMs I know of make a mess of
these cases, regardless of the number of digits in the price tag?

[...]

> Almost all of the code that actually touches the index is in read-cache.c, 
> and it's not like that is a very complex data structure (or a very big 
> file), so adding another key to the sorting probably wouldn't be too 
> horrid. But it's definitely a lot more than just a few lines of code!

Could we just have a lookup table index extension for identifying the 
duplicates (when checking is enabled using core configuration option #3324)? 
That table would keep a mapping from a normalized form (maybe include 
canonical encoding while we're at it) to the actual octet sequence(s) used.

Many operations would translate any supplied form through the table before
doing the lookup so if we have Foo.h and give FOO.h to git add, it would 
notice and perform add (update index) on Foo.h instead as that is the form we 
already know (or refuse, yielding an error message; pick your poison). And, 
well you get the picture.
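
A rough sketch of such a table, assuming the normalized form is "NFC, then case-folded"; that choice, the class `NameTable`, and the helper `normalize_key` are all assumptions for illustration, not an existing git extension.

```python
import unicodedata


def normalize_key(name):
    """Hypothetical normalized form: canonical composition, then case folding."""
    return unicodedata.normalize("NFC", name).casefold()


class NameTable:
    """Maps a normalized form to the spelling that was recorded first."""

    def __init__(self):
        self.known = {}

    def resolve(self, name):
        # Return the already-known spelling, recording `name` if it is new
        return self.known.setdefault(normalize_key(name), name)


table = NameTable()
first = table.resolve("Foo.h")   # recorded as-is
second = table.resolve("FOO.h")  # translated to the known form
print(first, second)
```

Refusing with an error instead of translating would be a small change: check for an existing entry whose spelling differs and raise rather than return it.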

-- robin

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 23:26       ` Robin Rosenberg
@ 2008-01-12  0:03         ` Kevin Ballard
  2008-01-12  0:15           ` Robin Rosenberg
  2008-01-12  0:37         ` Junio C Hamano
  1 sibling, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12  0:03 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linus Torvalds, Johannes Schindelin, git

Speaking of normalizing composed sequences, could that be the cause  
for the following?

kevin@KBALLARD:~/Dev/git> ls
kevin@KBALLARD:~/Dev/git> ls -a
./          ../         .git/       .gitignore
kevin@KBALLARD:~/Dev/git> git reset --hard
HEAD is now at 58beb2c... Trim leading / off of paths in git-svn prop_walk
kevin@KBALLARD:~/Dev/git> git st
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to track)

Some further exploration seems to support that theory:

kevin@KBALLARD:~/Dev/git> git ls-files gitweb/test
"gitweb/test/M\303\244rchen"
gitweb/test/file with spaces
gitweb/test/file+plus+sign
kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
0000000: 4d61 cc88 7263 6865 6e0a                 Ma..rchen.

As you can see, git has the file tracked using M\303\244rchen, where  
\303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With  
Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where  
0xCC88 (or U+0308) is Combining Diaeresis.
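
The two byte sequences can be reproduced with Python's standard `unicodedata` module; this is only an illustration of the composed and decomposed forms, independent of git or HFS+.

```python
import unicodedata

name = "M\u00e4rchen"  # "Märchen" written with the precomposed U+00E4

composed = unicodedata.normalize("NFC", name).encode("utf-8")
decomposed = unicodedata.normalize("NFD", name).encode("utf-8")

print(composed)                # b'M\xc3\xa4rchen'   (what git has tracked)
print(decomposed)              # b'Ma\xcc\x88rchen'  (what the filesystem reports)
print(composed == decomposed)  # False
```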

In other words, the git repository itself exhibits a problem under OS  
X. I'm not sure if I didn't notice this untracked file before, or if  
the filesystem (or the index) actually used the other form previously,  
but regardless there's a problem that I believe would be present even  
if I was using Case-Sensitive HFS+.

-Kevin Ballard

On Jan 11, 2008, at 6:26 PM, Robin Rosenberg wrote:

> Could we just have a lookup table index extension for identifying the
> duplicates (when checking is enabled using core configuration option
> #3324)? That table would keep a mapping from a normalized form (maybe
> include canonical encoding while we're at it) to the actual octet
> sequence(s) used.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:03         ` Kevin Ballard
@ 2008-01-12  0:15           ` Robin Rosenberg
  2008-01-12  0:25             ` Kevin Ballard
  0 siblings, 1 reply; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-12  0:15 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Linus Torvalds, Johannes Schindelin, git

On Saturday, 12 January 2008, Kevin Ballard wrote:
> Speaking of normalizing composed sequences, could that be the cause  
> for the following?
[...]
> kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
> 0000000: 4d61 cc88 7263 6865 6e0a                 Ma..rchen.
> 
> As you can see, git has the file tracked using M\303\244rchen, where  
> \303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With  
> Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where  
> 0xCC88 (or U+0308) is Combining Diaeresis.

Yes that is due to normalization. When adding a file by name git uses the user 
supplied name, but when adding files indirectly it gets the names from the 
file system without denormalizing them. Likewise status gets the names from
the file system without denormalizing and thus you get a mismatch.
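
The mismatch can be sketched as a set difference between the names the filesystem reports and the names in the index. The function `untracked` and its `normalize` switch are hypothetical; the switch only shows what comparing in a common normalized form would change, not what git does.

```python
import unicodedata


def untracked(fs_names, index_names, normalize=False):
    """Names reported by the filesystem that are not found in the index."""
    def key(name):
        return unicodedata.normalize("NFC", name) if normalize else name

    tracked = {key(name) for name in index_names}
    return [name for name in fs_names if key(name) not in tracked]


index_names = ["gitweb/test/M\u00e4rchen"]  # composed form stored in the index
fs_names = ["gitweb/test/Ma\u0308rchen"]    # decomposed form from the filesystem

naive = untracked(fs_names, index_names)
normalized = untracked(fs_names, index_names, normalize=True)
print(len(naive), len(normalized))
```

With the exact comparison the file shows up as untracked; with both sides normalized it does not.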

-- robin

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:15           ` Robin Rosenberg
@ 2008-01-12  0:25             ` Kevin Ballard
  2008-01-12  0:27               ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12  0:25 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linus Torvalds, Johannes Schindelin, git

On Jan 11, 2008, at 7:15 PM, Robin Rosenberg wrote:

> On Saturday, 12 January 2008, Kevin Ballard wrote:
>> Speaking of normalizing composed sequences, could that be the cause
>> for the following?
> [...]
>> kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
>> 0000000: 4d61 cc88 7263 6865 6e0a                 Ma..rchen.
>>
>> As you can see, git has the file tracked using M\303\244rchen, where
>> \303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With
>> Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where
>> 0xCC88 (or U+0308) is Combining Diaeresis.
>
> Yes that is due to normalization. When adding a file by name git uses
> the user supplied name, but when adding files indirectly it gets the
> names from the file system without denormalizing them. Likewise status
> gets the names from the file system without denormalizing and thus you
> get a mismatch.

Is there a reason for this? It seems like it would be trivial to end
up with misdiagnosed "untracked" files when using any language other
than English given this behaviour.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:25             ` Kevin Ballard
@ 2008-01-12  0:27               ` Junio C Hamano
  2008-01-12  0:40                 ` Johannes Schindelin
  0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12  0:27 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Robin Rosenberg, Linus Torvalds, Johannes Schindelin, git

Kevin Ballard <kevin@sb.org> writes:

> Is there a reason for this? It seems like it would be trivial to end
> up with misdiagnosed "untracked" files when using any language other
> than English given this behaviour.

No.  The assumption of the code has always been that sane
filesystems would return from readdir() the names you gave from
creat().
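
That assumption can be probed directly. The sketch below creates a file with a composed name in a temporary directory and checks whether listing the directory returns the same bytes; the function name is hypothetical and the result depends entirely on the filesystem holding the temporary directory.

```python
import os
import tempfile


def readdir_returns_created_name(name="M\u00e4rchen"):
    """True if listing a directory returns exactly the bytes used at creation."""
    wanted = name.encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        tmp_bytes = os.fsencode(tmp)
        # Create an empty file under the composed (NFC) spelling
        with open(os.path.join(tmp_bytes, wanted), "wb"):
            pass
        # Byte paths avoid any locale-dependent decoding of the result
        return os.listdir(tmp_bytes) == [wanted]


result = readdir_returns_created_name()
print(result)
```

A filesystem that stores names in decomposed form, as described in this thread for HFS+, would be expected to make this return False.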

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:27               ` Junio C Hamano
@ 2008-01-12  0:40                 ` Johannes Schindelin
  2008-01-12  1:16                   ` Kevin Ballard
  0 siblings, 1 reply; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-12  0:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Kevin Ballard, Robin Rosenberg, Linus Torvalds, git

Hi,

On Fri, 11 Jan 2008, Junio C Hamano wrote:

> Kevin Ballard <kevin@sb.org> writes:
> 
> > Is there a reason for this? It seems like it would be trivial to end 
> > up with misdiagnosed "untracked" files when using any language other 
> > than English given this behavior.
> 
> No.  The assumption of the code has always been that sane filesystems 
> would return from readdir() the names you gave from creat().

We do not really have to rehash that whole discussion for the Nth time, do 
we?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:40                 ` Johannes Schindelin
@ 2008-01-12  1:16                   ` Kevin Ballard
  2008-01-12  1:30                     ` Junio C Hamano
  0 siblings, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12  1:16 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 1600 bytes --]

On Jan 11, 2008, at 7:40 PM, Johannes Schindelin wrote:

> On Fri, 11 Jan 2008, Junio C Hamano wrote:
>
>> Kevin Ballard <kevin@sb.org> writes:
>>
>>> Is there a reason for this? It seems like it would be trivial to end
>>> up with misdiagnosed "untracked" files when using any language other
>>> than English given this behavior.
>>
>> No.  The assumption of the code has always been that sane filesystems
>> would return from readdir() the names you gave from creat().
>
> We do not really have to rehash that whole discussion for the Nth  
> time, do
> we?

Apparently so. By Junio's definition, HFS+ is not a sane filesystem,  
and as git grows more popular with OS X users, this issue is going to  
crop up more frequently.

According to the HFS+ Volume Format technote[1], filenames in HFS+ are
stored in normalized, canonical order. To be more specific, they're
stored in a special Apple variant of Unicode Normalization Form D (the
special variant preserves round-tripping with older encodings for
certain code point ranges[2]).

In other words, if you hand an HFS+ filesystem a filename that
contains Unicode characters, what you get back later may be in a
different form. And that's going to be a problem if git doesn't deal
with this.
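
To make the mismatch concrete, here is a minimal Python sketch. It uses
the standard NFD form from Python's unicodedata module as a stand-in for
what HFS+ stores; that is an assumption for illustration, since Apple's
variant differs from plain NFD for some code point ranges. The variable
names are made up for this example.

```python
import unicodedata

# Precomposed form, as a user would typically type it (U+00E4).
typed = "M\u00e4rchen"

# Standard NFD, used here only as an approximation of what a normalizing
# filesystem such as HFS+ might hand back (a + U+0308 combining diaeresis).
stored = unicodedata.normalize("NFD", typed)

typed_bytes = typed.encode("utf-8")    # what git records for the typed name
stored_bytes = stored.encode("utf-8")  # what readdir() might return

# A byte-wise comparison sees two different names.
names_match_bytewise = typed_bytes == stored_bytes

# Normalizing both sides to one form makes them compare equal again.
names_match_normalized = (
    unicodedata.normalize("NFC", typed) == unicodedata.normalize("NFC", stored)
)

print(typed_bytes, stored_bytes)
print(names_match_bytewise, names_match_normalized)
```

The two byte strings correspond to the M\303\244rchen and
"Ma\xCC\x88rchen" spellings quoted earlier in the thread.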

[1]: http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2]: http://developer.apple.com/qa/qa2001/qa1173.html

Note: CC list stripped because this is a re-sent email, as the list  
bounced the last one that contained a text/html alternate.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  1:16                   ` Kevin Ballard
@ 2008-01-12  1:30                     ` Junio C Hamano
  2008-01-12  1:43                       ` Kevin Ballard
  0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12  1:30 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: git

Kevin Ballard <kevin@sb.org> writes:

> On Jan 11, 2008, at 7:40 PM, Johannes Schindelin wrote:
>
>> On Fri, 11 Jan 2008, Junio C Hamano wrote:
>>
>>> Kevin Ballard <kevin@sb.org> writes:
>>>
>>>> Is there a reason for this? It seems like it would be trivial to end
>>>> up with misdiagnosed "untracked" files when using any language other
>>>> than English given this behavior.
>>>
>>> No.  The assumption of the code has always been that sane filesystems
>>> would return from readdir() the names you gave from creat().
>>
>> We do not really have to rehash that whole discussion for the Nth
>> time, do
>> we?
>
> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
> and as git grows more popular with OS X users, this issue is going to
> crop up more frequently.

It's not "my" definition, but you asked the reason and I gave
the answer.  We can close this issue of "is HFS+ sane" now.
HFS+ is insane, period.  And as Linus said, you cannot forgive
its insanity using the historical baggage argument, like MS-DOS.

HOWEVER.

It is a totally different issue if we want to refuse supporting
insane filesystems.  And the answer is no.  It was not my
intention to say that we do not intend to support them, when I
explained the reason why the things are as they are, which was
the original question by you.

See Robin's proposal to let us translate random names we get
back from readdir() from the filesystem using an additional
look-up table in the index extension section that stores mapping
from canonicalized form to the form that the user registered to
the index.  I think that is a sane approach to tackle this issue
on insane filesystems like HFS+.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  1:30                     ` Junio C Hamano
@ 2008-01-12  1:43                       ` Kevin Ballard
  2008-01-12 12:07                         ` David Kastrup
  2008-01-12 15:03                         ` Dmitry Potapov
  0 siblings, 2 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12  1:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

On Jan 11, 2008, at 8:30 PM, Junio C Hamano wrote:

>> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
>> and as git grows more popular with OS X users, this issue is going to
>> crop up more frequently.
>
> It's not "my" definition, but you asked the reason and I gave
> the answer.  We can close this issue of "is HFS+ sane" now.
> HFS+ is insane, period.  And as Linus said, you cannot forgive
> its insanity using the historical baggage argument, like MS-DOS.

Fair enough, though I believe OS X has a good reason, namely it's an  
OS designed for regular users rather than servers or programmers. Case- 
sensitivity would confuse my mother.

> HOWEVER.
>
> It is a totally different issue if we want to refuse supporting
> insane filesystems.  And the answer is no.  It was not my
> intention to say that we do not intend to support them, when I
> explained the reason why the things are as they are, which was
> the original question by you.

Ok. I wasn't implying anything with that phrase there, I was just  
trying to reiterate that HFS+ is case-insensitive and emphasize that  
this issue will become more relevant as time goes by.

> See Robin's proposal to let us translate random names we get
> back from readdir() from the filesystem using an additional
> look-up table in the index extension section that stores mapping
> from canonicalized form to the form that the user registered to
> the index.  I think that is a sane approach to tackle this issue
> on insane filesystems like HFS+.

If I knew what the index extension section was, perhaps I would think  
that's a good idea ;) I have yet to dive into the gory details of how  
this stuff works.

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2432 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  1:43                       ` Kevin Ballard
@ 2008-01-12 12:07                         ` David Kastrup
  2008-01-12 15:03                         ` Dmitry Potapov
  1 sibling, 0 replies; 29+ messages in thread
From: David Kastrup @ 2008-01-12 12:07 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Junio C Hamano, git

Kevin Ballard <kevin@sb.org> writes:

> On Jan 11, 2008, at 8:30 PM, Junio C Hamano wrote:
>
>>> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
>>> and as git grows more popular with OS X users, this issue is going to
>>> crop up more frequently.
>>
>> It's not "my" definition, but you asked the reason and I gave
>> the answer.  We can close this issue of "is HFS+ sane" now.
>> HFS+ is insane, period.  And as Linus said, you cannot forgive
>> its insanity using the historical baggage argument, like MS-DOS.
>
> Fair enough, though I believe OS X has a good reason, namely it's an
> OS designed for regular users rather than servers or
> programmers. Case-sensitivity would confuse my mother.

If case-sensitivity were the primary cause of confusion in
mother-computer interoperation, you would have a remarkable mother.

"Type things the same way and they work the same" is a simple enough
rule.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  1:43                       ` Kevin Ballard
  2008-01-12 12:07                         ` David Kastrup
@ 2008-01-12 15:03                         ` Dmitry Potapov
  1 sibling, 0 replies; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 15:03 UTC (permalink / raw)
  To: Kevin Ballard; +Cc: Junio C Hamano, git

On Fri, Jan 11, 2008 at 08:43:35PM -0500, Kevin Ballard wrote:
> 
> Fair enough, though I believe OS X has a good reason, namely it's an  
> OS designed for regular users rather than servers or programmers. Case- 
> sensitivity would confuse my mother.

Many *nix servers run web services and Samba servers, yet most users
are not even aware of whether they are dealing with a case-sensitive
file system, let alone confused by it. This is because most
regular users type the name only once, when they create a new file,
and then just click on that name. So case-sensitive file systems can
really confuse only some badly written applications...

Dmitry

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 23:26       ` Robin Rosenberg
  2008-01-12  0:03         ` Kevin Ballard
@ 2008-01-12  0:37         ` Junio C Hamano
  2008-01-12  0:57           ` Robin Rosenberg
  1 sibling, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12  0:37 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linus Torvalds, Kevin Ballard, Johannes Schindelin, git

Robin Rosenberg <robin.rosenberg@dewire.com> writes:

> Could we just have a lookup table index extension for identifying the 
> duplicates (when checking is enabled using core configuration option #3324)? 
> That table would keep a mapping from a normalized form (maybe include 
> canonical encoding while we're at it) to the actual octet sequence(s) used.

I would agree that the index extension, if we ever are going to
do this, would be the right place to store this information, at
the single repository level.

However, this opens up a can of worms.  What should the canonical
key be?  If you want to protect yourself from a Unicode
normalizing filesystem, you would use one canonicalization,
while if you want to protect from a case losing filesystem you
would use another?  Or do we downcase and NFD normalize at the
same time and be done with it?

And where should the configuration be stored?  If a project
wants to be interoperable across Linux and vfat, for example,
that canonicalization needs to be enabled in repositories of all
participants, be they on Linux or vfat, so that people on Linux
can be prevented from creating and registering two files xt_mark.c
and xt_MARK.c in the same directory, so that people who extract
the source on vfat won't have troubles.
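
As a rough illustration of the "downcase and NFD normalize at the same
time" option, the Python sketch below groups paths by one combined key
and reports the groups that would collide on a case losing or
normalizing filesystem. The names canonical_key and find_collisions are
hypothetical, and the choice of NFD plus a locale-independent lowercase
is just one possible canonicalization, not something git implements.

```python
import unicodedata
from collections import defaultdict


def canonical_key(name: str) -> str:
    """Hypothetical canonical key: NFD-normalize, then lowercase."""
    return unicodedata.normalize("NFD", name).lower()


def find_collisions(paths):
    """Return lists of paths that share the same canonical key."""
    groups = defaultdict(list)
    for path in paths:
        groups[canonical_key(path)].append(path)
    # Only keys with more than one registered spelling are a problem.
    return [names for names in groups.values() if len(names) > 1]


paths = [
    "xt_mark.c",
    "xt_MARK.c",         # differs from the first only by case
    "Makefile",
    "M\u00e4rchen",      # precomposed a-diaeresis
    "Ma\u0308rchen",     # a + combining diaeresis
]
collisions = find_collisions(paths)
print(collisions)
```

A check like this would have to run in every participant's repository,
including those on Linux, which is why the question of where the
setting lives matters.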

Which means the information needs to be in-tree.  But that
should not be in .gitattributes (which by definition is for
per-path things).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:37         ` Junio C Hamano
@ 2008-01-12  0:57           ` Robin Rosenberg
  2008-01-12 16:33             ` Johannes Schindelin
  0 siblings, 1 reply; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-12  0:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Kevin Ballard, Johannes Schindelin, git

On Saturday 12 January 2008, Junio C Hamano wrote:
> Robin Rosenberg <robin.rosenberg@dewire.com> writes:
> 
> > Could we just have a lookup table index extension for identifying the 
> > duplicates (when checking is enabled using core configuration option #3324)? 
> > That table would keep a mapping from a normalized form (maybe include 
> > canonical encoding while we're at it) to the actual octet sequence(s) used.
> 
> I would agree that the index extension, if we ever are going to
> do this, would be the right place to store this information, at
> the single repository level.
> 
> However, this opens up a can of worms.  What should the canonical
> key be?  If you want to protect yourself from a Unicode
> normalizing filesystem, you would use one canonicalization,
> while if you want to protect from a case losing filesystem you
> would use another?  Or do we downcase and NFD normalize at the
> same time and be done with it?

The worms are out already. So the question is whether there
is a way of keeping them in the can instead of having them crawl
all around. I think we could do both Unicode (UTF-8 or NFD) and
downcase at the same time.

> And where should the configuration be stored?  If a project
> wants to be interoperable across Linux and vfat, for example,

In the brand new ".gitconfig". It could in principle contain any config option,
but that would not be safe so I guess one should only allow "safe" options
there.

-- robin

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12  0:57           ` Robin Rosenberg
@ 2008-01-12 16:33             ` Johannes Schindelin
  0 siblings, 0 replies; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-12 16:33 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Junio C Hamano, Linus Torvalds, Kevin Ballard, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 486 bytes --]

Hi,

On Sat, 12 Jan 2008, Robin Rosenberg wrote:

> On Saturday 12 January 2008, Junio C Hamano wrote:
>
> > And where should the configuration be stored?  If a project wants to 
> > be interoperable across Linux and vfat, for example,
> 
> In the brand new ".gitconfig". It could in principle contain any config 
> option, but that would not be safe so I guess one should only allow 
> "safe" options there.

Funny: I had the same idea (.gitconfig) for the crlf issues...

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-11 22:08     ` Linus Torvalds
  2008-01-11 23:10       ` David Kastrup
  2008-01-11 23:26       ` Robin Rosenberg
@ 2008-01-12 14:46       ` Dmitry Potapov
  2008-01-12 18:47         ` Linus Torvalds
  2 siblings, 1 reply; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 14:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git

On Fri, Jan 11, 2008 at 02:08:35PM -0800, Linus Torvalds wrote:
> 
> However, it's not like there is even a simple solution. The right place to 
> do that check would probably be in "add_index_entry()", but doing a check 
> whether the same file already exists (in a different case) is simply 
> *extremely* expensive for a very critical piece of code, unless we were to 
> change that index data structure a lot (ie add a separate hash for the 
> filenames).

After a cursory look at the source code, I wonder if converting name1
and name2 to upper case before the memcmp in cache_name_compare() could
help case-insensitive systems. This would change the order of
file names in the index, but I suppose that should not be a problem,
because the index is host specific. This fix seems too simple, though,
so I guess I missed something.

> (And that's totally ignoring the fact that case-insensitivity then also 
> has tons of i18n issues and can get *really* messy 

Proper support of i18n is not simple even without case-insensitivity.
For instance, there are four different encodings widely used for Russian
letters. On Windows alone, you have two simultaneously in the default
settings -- Windows-1251 for Windows applications and CP866 for console
applications... Actually, some console applications can change their
default encoding, and it seems Cygwin programs do that. So, depending on
whether you use gcc from Cygwin or Visual C to compile your console
program, you can get a different encoding. On *nix in Russia, KOI8-R and
UTF-8 are the most popular... So, if you have a repository shared between
different systems, you cannot think about a file name as just a sequence
of bytes anymore. OTOH, I doubt that many people are really interested in
using non-ASCII file names with Git right now.
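
As a small illustration of the point about encodings, the Python sketch
below encodes one Russian file name in the four encodings mentioned
above and shows that each yields a different byte sequence. The file
name is made up for the example; the codec names are the ones Python's
standard library uses.

```python
# "файл.txt" written with escapes so the source file itself stays ASCII.
name = "\u0444\u0430\u0439\u043b.txt"

# Windows-1251, CP866, KOI8-R and UTF-8, as named by Python's codecs.
encodings = ["cp1251", "cp866", "koi8_r", "utf-8"]
encoded = {enc: name.encode(enc) for enc in encodings}

for enc, raw in encoded.items():
    print(f"{enc:8} {raw!r}")

# Each encoding produces a different byte sequence for the same name.
distinct_byte_sequences = len(set(encoded.values()))

# Reading bytes written under one encoding as another yields a different
# (wrong) name rather than an error.
misread = encoded["cp1251"].decode("koi8_r")
print(distinct_byte_sequences, misread != name)
```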

Dmitry

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12 14:46       ` Dmitry Potapov
@ 2008-01-12 18:47         ` Linus Torvalds
  2008-01-12 19:29           ` Dmitry Potapov
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2008-01-12 18:47 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Kevin Ballard, Johannes Schindelin, git

On Sat, 12 Jan 2008, Dmitry Potapov wrote:
> 
> After a cursory look at the source code, I wonder if converting name1
> and name2 to upper case before the memcmp in cache_name_compare() could
> help case-insensitive systems. This would change the order of
> file names in the index, but I suppose that should not be a problem,
> because the index is host specific. This fix seems too simple, though,
> so I guess I missed something.

No, the index isn't host-specific, and we also have a deep knowledge of 
the fact that the index order is the same as the unpacked tree order.

So no, we absolutely cannot just sort the index differently. We literally 
need to have a separate key for an "upper case lookup".

(That separate key can be just a hash table - it doesn't need to be 
something you can iterate over, so it can be pretty simple).
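
A rough sketch of that idea in Python, purely for illustration (git
itself is written in C, and the class and attribute names here are
hypothetical): the entries stay sorted by their raw bytes, and a
separate hash table keyed by the upper-cased name is consulted only to
detect a case clash when an entry is added.

```python
import bisect


class Index:
    """Toy index: byte-sorted entries plus a side table for case clashes."""

    def __init__(self):
        self.entries = []   # names as bytes, kept in plain byte order
        self.by_upper = {}  # upper-cased name -> name as first registered

    def add(self, name: bytes):
        """Add a name; return the clashing name if one exists, else None."""
        key = name.upper()  # bytes.upper() touches only ASCII a-z
        existing = self.by_upper.get(key)
        if existing is not None and existing != name:
            return existing  # same name in a different case: report it
        if existing is None:
            bisect.insort(self.entries, name)  # sort order is unchanged
            self.by_upper[key] = name
        return None


idx = Index()
first = idx.add(b"cs4536/notes.txt")
idx.add(b"ece3601/a")
idx.add(b"Zeta")
clash = idx.add(b"CS4536/notes.txt")
print(first, clash, idx.entries)
```

Because bytes.upper() only maps ASCII letters, the check is
locale-independent, and the side table never needs to be iterated.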

> > (And that's totally ignoring the fact that case-insensitivity then also 
> > has tons of i18n issues and can get *really* messy 
> 
> The proper support of i18n is not simple even without case-insensitivity.
> For instance, there are four different encodings widely used for Russian
> letters.

.. and git is very clear about this: filenames are *not* "characters" in 
the i18n sense, they are series of bytes. There is absolutely no room for 
ambiguity, and there is no locale for those things.

And that isn't going to change. It's the only sane way to do 
locale-independent names: people can *choose* to see the filenames as some 
UTF-8 sequence, or a series of Latin1, or anything, but that's not 
something git itself will care about.

Trying to involve locale in name comparison simply isn't possible. Two 
different repositories on two different filesystems would get two 
different answers. And that is simply unacceptable in a distributed 
system.

What we can do is to make the simple cases (ie the locale-*independent* 
ones) warn about problems with case insensitivity.

			Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Re-casing directories on case-insensitive systems
  2008-01-12 18:47         ` Linus Torvalds
@ 2008-01-12 19:29           ` Dmitry Potapov
  0 siblings, 0 replies; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 19:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git

On Sat, Jan 12, 2008 at 10:47:10AM -0800, Linus Torvalds wrote:
> 
> And that isn't going to change. It's the only sane way to do 
> locale-independent names: people can *choose* to see the filenames as some 
> UTF-8 sequence, or a series of Latin1, or anything, but that's not 
> something git itself will care about.

Unfortunately, agreeing on a single encoding for different systems is
even more difficult than agreeing on a single end-of-line encoding.
OTOH, it is not a real issue as long as everyone uses ASCII names only.

> 
> Trying to involve locale in name comparison simply isn't possible.

Agreed. However, the proper solution would be for all filenames to be
stored in UTF-8, with conversion done when a file is added to the
index. But that requires a lot of work, and as I said before, I doubt
that many people really want to store files with non-ASCII names; after
all, Git is a developer tool. So, as far as I am concerned, it is not
worth the effort.

Dmitry

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-01-12 19:29 UTC | newest]

Thread overview: 29+ messages
2008-01-11 20:19 Re-casing directories on case-insensitive systems Kevin Ballard
2008-01-11 21:09 ` Kevin Ballard
2008-01-11 21:19   ` Kevin Ballard
2008-01-11 21:25   ` Linus Torvalds
2008-01-11 21:59   ` Robin Rosenberg
2008-01-11 21:18 ` Linus Torvalds
2008-01-11 21:29 ` Johannes Schindelin
2008-01-11 21:44   ` Kevin Ballard
2008-01-11 22:05     ` Johannes Schindelin
2008-01-11 22:08     ` Linus Torvalds
2008-01-11 23:10       ` David Kastrup
2008-01-11 23:12         ` Kevin Ballard
2008-01-11 23:26       ` Robin Rosenberg
2008-01-12  0:03         ` Kevin Ballard
2008-01-12  0:15           ` Robin Rosenberg
2008-01-12  0:25             ` Kevin Ballard
2008-01-12  0:27               ` Junio C Hamano
2008-01-12  0:40                 ` Johannes Schindelin
2008-01-12  1:16                   ` Kevin Ballard
2008-01-12  1:30                     ` Junio C Hamano
2008-01-12  1:43                       ` Kevin Ballard
2008-01-12 12:07                         ` David Kastrup
2008-01-12 15:03                         ` Dmitry Potapov
2008-01-12  0:37         ` Junio C Hamano
2008-01-12  0:57           ` Robin Rosenberg
2008-01-12 16:33             ` Johannes Schindelin
2008-01-12 14:46       ` Dmitry Potapov
2008-01-12 18:47         ` Linus Torvalds
2008-01-12 19:29           ` Dmitry Potapov
