* Re: Re-casing directories on case-insensitive systems
2008-01-11 22:08 ` Linus Torvalds
@ 2008-01-11 23:10 ` David Kastrup
2008-01-11 23:12 ` Kevin Ballard
2008-01-11 23:26 ` Robin Rosenberg
2008-01-12 14:46 ` Dmitry Potapov
2 siblings, 1 reply; 29+ messages in thread
From: David Kastrup @ 2008-01-11 23:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> I do agree that we could/should do something to help with case-insensitive
> filesystems.
>
> I absolutely *detest* those things, and I think that people who design
> them are total morons - with MS-DOS, you could understand it (people
> didn't know better),
Ah, those young whippersnappers who think they are so smart... there is
a history to that, you know. Early character sets (like those on punch
cards) had just capital letters. Even when lowercase letters were
introduced, those tended to use more space (12 instead of 6 bits) and be
harder to print (and the line printers that churned out 40 lines per
second did not bother with such finesse, anyway). But capital letters
are not designed for readability of long lines. So when printing or
even screen terminals came into use, one tended to prefer writing in
lowercase letters. Which actually had the uppercase code points
usually. Some early microcomputers (for which CP/M was designed)
actually were hooked up with a "standard" 50 or 110 Baud teletype as I/O
device, and those tended to have only lowercase letters, too, in their
basic incantations. So CP/M, not knowing which kind of input device
would be used and whether it would prefer (or offer exclusively) upper
or lower case, had case-insensitive commands, and consequently also
case-insensitive file names. And QDOS (whence MSDOS) was basically
intended to be a CP/M ripoff.
> but with OS X?
OS X has an Apple inheritance, and Apple has the same inheritance as other
microcomputers, which includes a case-insensitive BASIC interpreter
(BASIC again coming from old teletype times). It is, again, a decision
to drag along old history.
But actually, there is more to it nowadays: two file names containing ü,
but one with a single letter and one with combining accent, look exactly
the same. If they don't act exactly the same, one opens up quite a hole
for spoofing attacks. Well, probably hard to avoid (since things like
uppercase Alpha and uppercase A look the same and need to be different
code points, too). But one also opens a can of worms for confusion. So
the problem of canonical file names does not go away just with case
sensitivity.
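To make the point about look-alike names concrete, here is a minimal Python
sketch (an illustration, not code from git). It assumes only the standard
unicodedata module; the helper name same_after_nfc is made up for this example.

```python
import unicodedata

composed = "\u00fc"      # "ü" as a single code point
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings render identically but are different strings.
print(composed == decomposed)                # False
print(same_after_nfc(composed, decomposed))  # True

# Normalization does not merge look-alikes from different scripts:
# LATIN CAPITAL LETTER A versus GREEK CAPITAL LETTER ALPHA.
print(same_after_nfc("A", "\u0391"))         # False
```

Normalization removes the composed/decomposed ambiguity, but, as noted above,
it does nothing about distinct code points that merely look the same.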
> But considering that they exist, we should probably offer at least
> *some* help for people who didn't realize that you could make OS X
> behave better.
It is not like Linux does not support some case-insensitive file system
types, too. So the same problems can be had there as well.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Re-casing directories on case-insensitive systems
2008-01-11 23:10 ` David Kastrup
@ 2008-01-11 23:12 ` Kevin Ballard
0 siblings, 0 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-11 23:12 UTC (permalink / raw)
To: David Kastrup; +Cc: Linus Torvalds, Johannes Schindelin, git
On Jan 11, 2008, at 6:10 PM, David Kastrup wrote:
>> But considering that they exist, we should probably offer at least
>> *some* help for people who didn't realize that you could make OS X
>> behave better.
>
> It is not like Linux does not support some case-insensitive file
> system
> types, too. So the same problems can be had there as well.
In addition, while there is an option for HFS+ Case-Sensitive, using
that can cause bad things to happen as Mac OS X programs are written
under a case-insensitive assumption and may behave badly when
presented with a case-sensitive filesystem.
-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
* Re: Re-casing directories on case-insensitive systems
2008-01-11 22:08 ` Linus Torvalds
2008-01-11 23:10 ` David Kastrup
@ 2008-01-11 23:26 ` Robin Rosenberg
2008-01-12 0:03 ` Kevin Ballard
2008-01-12 0:37 ` Junio C Hamano
2008-01-12 14:46 ` Dmitry Potapov
2 siblings, 2 replies; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-11 23:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git
On Friday, 11 January 2008, Linus Torvalds wrote:
> I do agree that we could/should do something to help with case-insensitive
> filesystems.
>
> I absolutely *detest* those things, and I think that people who design
> them are total morons - with MS-DOS, you could understand it (people
> didn't know better), but with OS X?
Could it be some comfort that the other SCMs I know of make a mess of
these cases, regardless of the number of digits in the price tag?
[...]
> Almost all of the code that actually touches the index is in read-cache.c,
> and it's not like that is a very complex data structure (or a very big
> file), so adding another key to the sorting probably wouldn't be too
> horrid. But it's definitely a lot more than just a few lines of code!
Could we just have a lookup table index extension for identifying the
duplicates (when checking is enabled using core configuration option #3324)?
That table would keep a mapping from a normalized form (maybe include
canonical encoding while we're at it) to the actual octet sequence(s) used.
Many operations would translate any supplied form through the table before
doing the lookup, so if we have Foo.h and give FOO.h to git add, it would
notice and perform the add (update index) on Foo.h instead, as that is the form we
already know (or refuse, yielding an error message; pick your poison). And,
well, you get the picture.
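A rough Python sketch of such a lookup table follows. It is only an
illustration of the idea, not git's index format: the class NameTable, its
methods, and the choice of NFC plus casefold() as the normalized form are all
assumptions made for this example.

```python
import unicodedata


def canonical_key(name: str) -> str:
    """Hypothetical normalized form: NFC, then Unicode case folding."""
    return unicodedata.normalize("NFC", name).casefold()


class NameTable:
    """Maps a normalized form to the name(s) actually registered."""

    def __init__(self) -> None:
        self._by_key: dict[str, list[str]] = {}

    def register(self, name: str) -> None:
        names = self._by_key.setdefault(canonical_key(name), [])
        if name not in names:
            names.append(name)

    def resolve(self, supplied: str) -> str:
        """Translate a user-supplied form to the form already known, if any."""
        names = self._by_key.get(canonical_key(supplied))
        return names[0] if names else supplied

    def duplicates(self) -> dict[str, list[str]]:
        """Keys under which more than one distinct name is registered."""
        return {k: v for k, v in self._by_key.items() if len(v) > 1}


table = NameTable()
table.register("Foo.h")
print(table.resolve("FOO.h"))  # Foo.h
```

Whether a clash should be resolved silently, as resolve() does here, or
rejected with an error is exactly the "pick your poison" choice.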
-- robin
* Re: Re-casing directories on case-insensitive systems
2008-01-11 23:26 ` Robin Rosenberg
@ 2008-01-12 0:03 ` Kevin Ballard
2008-01-12 0:15 ` Robin Rosenberg
2008-01-12 0:37 ` Junio C Hamano
1 sibling, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12 0:03 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Linus Torvalds, Johannes Schindelin, git
Speaking of normalizing composed sequences, could that be the cause
for the following?
kevin@KBALLARD:~/Dev/git> ls
kevin@KBALLARD:~/Dev/git> ls -a
./ ../ .git/ .gitignore
kevin@KBALLARD:~/Dev/git> git reset --hard
HEAD is now at 58beb2c... Trim leading / off of paths in git-svn
prop_walk
kevin@KBALLARD:~/Dev/git> git st
# On branch master
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to
track)
Some further exploration seems to support my cause:
kevin@KBALLARD:~/Dev/git> git ls-files gitweb/test
"gitweb/test/M\303\244rchen"
gitweb/test/file with spaces
gitweb/test/file+plus+sign
kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
0000000: 4d61 cc88 7263 6865 6e0a Ma..rchen.
As you can see, git has the file tracked using M\303\244rchen, where
\303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With
Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where
0xCC88 (or U+0308) is Combining Diaeresis.
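The two byte sequences can be reproduced with a short Python sketch (an
illustration only, assuming the standard unicodedata module and UTF-8
encoding; the variable names are made up):

```python
import unicodedata

name = "M\u00e4rchen"  # "Märchen" with precomposed U+00E4

# Precomposed form, as stored in the index.
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
# Decomposed form, as reported by the filesystem.
nfd_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

print(nfc_bytes.hex())  # 4dc3a4726368656e
print(nfd_bytes.hex())  # 4d61cc88726368656e
print(nfc_bytes == nfd_bytes)  # False
```

The second hex string matches the xxd output above, minus the trailing
newline byte 0a that ls appends.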
In other words, the git repository itself exhibits a problem under OS
X. I'm not sure if I didn't notice this untracked file before, or if
the filesystem (or the index) actually used the other form previously,
but regardless there's a problem that I believe would be present even
if I was using Case-Sensitive HFS+.
-Kevin Ballard
On Jan 11, 2008, at 6:26 PM, Robin Rosenberg wrote:
> Could we just have a lookup table index extension for identifying the
> duplicates (when checking is enabled using core configuration option
> #3324)?
> That table would keep a mapping from a normalized form (maybe include
> canonical encoding while we're at it) to the actual octet
> sequence(s) used.
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:03 ` Kevin Ballard
@ 2008-01-12 0:15 ` Robin Rosenberg
2008-01-12 0:25 ` Kevin Ballard
0 siblings, 1 reply; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-12 0:15 UTC (permalink / raw)
To: Kevin Ballard; +Cc: Linus Torvalds, Johannes Schindelin, git
On Saturday, 12 January 2008, Kevin Ballard wrote:
> Speaking of normalizing composed sequences, could that be the cause
> for the following?
[...]
> kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
> 0000000: 4d61 cc88 7263 6865 6e0a Ma..rchen.
>
> As you can see, git has the file tracked using M\303\244rchen, where
> \303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With
> Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where
> 0xCC88 (or U+0308) is Combining Diaeresis.
Yes, that is due to normalization. When adding a file by name, git uses the
user-supplied name, but when adding files indirectly it gets the names from the
file system without denormalizing them. Likewise, status gets the names from
the file system without denormalizing, and thus you get a mismatch.
-- robin
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:15 ` Robin Rosenberg
@ 2008-01-12 0:25 ` Kevin Ballard
2008-01-12 0:27 ` Junio C Hamano
0 siblings, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12 0:25 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Linus Torvalds, Johannes Schindelin, git
On Jan 11, 2008, at 7:15 PM, Robin Rosenberg wrote:
> On Saturday, 12 January 2008, Kevin Ballard wrote:
>> Speaking of normalizing composed sequences, could that be the cause
>> for the following?
> [...]
>> kevin@KBALLARD:~/Dev/git/gitweb/test> ls Märchen | xxd
>> 0000000: 4d61 cc88 7263 6865 6e0a Ma..rchen.
>>
>> As you can see, git has the file tracked using M\303\244rchen, where
>> \303\244 (or 0xC3A4, or U+00E4) is Latin Small Letter A With
>> Diaeresis, but the filesystem reports it as "Ma\xCC\x88rchen" where
>> 0xCC88 (or U+0308) is Combining Diaeresis.
>
> Yes, that is due to normalization. When adding a file by name, git uses
> the user-supplied name, but when adding files indirectly it gets the
> names from the file system without denormalizing them. Likewise, status
> gets the names from the file system without denormalizing, and thus you
> get a mismatch.
Is there a reason for this? It seems like it would be trivial to end
up with misdiagnosed "untracked" files when using any language other
than English given this behavior.
-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:25 ` Kevin Ballard
@ 2008-01-12 0:27 ` Junio C Hamano
2008-01-12 0:40 ` Johannes Schindelin
0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12 0:27 UTC (permalink / raw)
To: Kevin Ballard; +Cc: Robin Rosenberg, Linus Torvalds, Johannes Schindelin, git
Kevin Ballard <kevin@sb.org> writes:
> Is there a reason for this? It seems like it would be trivial to end
> up with misdiagnosed "untracked" files when using any language other
> than English given this behavior.
No. The assumption of the code has always been that sane
filesystems would return from readdir() the names you gave from
creat().
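The assumption can be probed with a small Python sketch (an illustration, not
part of git; the function name readdir_round_trips is made up). It creates a
file in a temporary directory and checks whether the directory listing
returns exactly the name that was used to create it:

```python
import os
import tempfile


def readdir_round_trips(name: str) -> bool:
    """Create `name` in a scratch directory and see if listdir() returns it unchanged."""
    with tempfile.TemporaryDirectory() as tmp:
        # open(..., "w") plays the role of creat() here.
        open(os.path.join(tmp, name), "w").close()
        # os.listdir() plays the role of readdir().
        return name in os.listdir(tmp)


print(readdir_round_trips("Foo.h"))
print(readdir_round_trips("M\u00e4rchen"))  # result depends on the filesystem
```

On a filesystem that stores names in decomposed form, the second call would
be expected to report False for a precomposed name, which is the mismatch
discussed in this thread.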
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:27 ` Junio C Hamano
@ 2008-01-12 0:40 ` Johannes Schindelin
2008-01-12 1:16 ` Kevin Ballard
0 siblings, 1 reply; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-12 0:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kevin Ballard, Robin Rosenberg, Linus Torvalds, git
Hi,
On Fri, 11 Jan 2008, Junio C Hamano wrote:
> Kevin Ballard <kevin@sb.org> writes:
>
> > Is there a reason for this? It seems like it would be trivial to end
> > up with misdiagnosed "untracked" files when using any language other
> > than English given this behavior.
>
> No. The assumption of the code has always been that sane filesystems
> would return from readdir() the names you gave from creat().
We do not really have to rehash that whole discussion for the Nth time, do
we?
Ciao,
Dscho
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:40 ` Johannes Schindelin
@ 2008-01-12 1:16 ` Kevin Ballard
2008-01-12 1:30 ` Junio C Hamano
0 siblings, 1 reply; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12 1:16 UTC (permalink / raw)
To: git
On Jan 11, 2008, at 7:40 PM, Johannes Schindelin wrote:
> On Fri, 11 Jan 2008, Junio C Hamano wrote:
>
>> Kevin Ballard <kevin@sb.org> writes:
>>
>>> Is there a reason for this? It seems like it would be trivial to end
>>> up with misdiagnosed "untracked" files when using any language other
>>> than English given this behavior.
>>
>> No. The assumption of the code has always been that sane filesystems
>> would return from readdir() the names you gave from creat().
>
> We do not really have to rehash that whole discussion for the Nth
> time, do we?
Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
and as git grows more popular with OS X users, this issue is going to
crop up more frequently.
According to the HFS+ Volume Format technote[1], filenames in HFS+ are
stored in normalized, canonical order. To be more specific, they're
stored in a special Apple variant of Unicode Normal Form D (the
special variant is for preserving round-trip with older encodings with
certain codepoint ranges[2]).
In other words, if you hand an HFS+ filesystem a filename that
contains unicode characters, what you get back later may be in a
different format. And that's going to be a problem if git doesn't deal
with this.
[1]: http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2]: http://developer.apple.com/qa/qa2001/qa1173.html
Note: CC list stripped because this is a re-sent email, as the list
bounced the last one that contained a text/html alternate.
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
* Re: Re-casing directories on case-insensitive systems
2008-01-12 1:16 ` Kevin Ballard
@ 2008-01-12 1:30 ` Junio C Hamano
2008-01-12 1:43 ` Kevin Ballard
0 siblings, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12 1:30 UTC (permalink / raw)
To: Kevin Ballard; +Cc: git
Kevin Ballard <kevin@sb.org> writes:
> On Jan 11, 2008, at 7:40 PM, Johannes Schindelin wrote:
>
>> On Fri, 11 Jan 2008, Junio C Hamano wrote:
>>
>>> Kevin Ballard <kevin@sb.org> writes:
>>>
>>>> Is there a reason for this? It seems like it would be trivial to end
>>>> up with misdiagnosed "untracked" files when using any language other
>>>> than English given this behavior.
>>>
>>> No. The assumption of the code has always been that sane filesystems
>>> would return from readdir() the names you gave from creat().
>>
>> We do not really have to rehash that whole discussion for the Nth
>> time, do we?
>
> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
> and as git grows more popular with OS X users, this issue is going to
> crop up more frequently.
It's not "my" definition, but you asked the reason and I gave
the answer. We can close this issue of "is HFS+ sane" now.
HFS+ is insane, period. And as Linus said, you cannot forgive
its insanity using the historical baggage argument, like MS-DOS.
HOWEVER.
It is a totally different issue if we want to refuse supporting
insane filesystems. And the answer is no. It was not my
intention to say that we do not intend to support them, when I
explained the reason why the things are as they are, which was
the original question by you.
See Robin's proposal to let us translate random names we get
back from readdir() from the filesystem using an additional
look-up table in the index extension section that stores mapping
from canonicalized form to the form that the user registered to
the index. I think that is a sane approach to tackle this issue
on insane filesystems like HFS+.
* Re: Re-casing directories on case-insensitive systems
2008-01-12 1:30 ` Junio C Hamano
@ 2008-01-12 1:43 ` Kevin Ballard
2008-01-12 12:07 ` David Kastrup
2008-01-12 15:03 ` Dmitry Potapov
0 siblings, 2 replies; 29+ messages in thread
From: Kevin Ballard @ 2008-01-12 1:43 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Jan 11, 2008, at 8:30 PM, Junio C Hamano wrote:
>> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
>> and as git grows more popular with OS X users, this issue is going to
>> crop up more frequently.
>
> It's not "my" definition, but you asked the reason and I gave
> the answer. We can close this issue of "is HFS+ sane" now.
> HFS+ is insane, period. And as Linus said, you cannot forgive
> its insanity using the historical baggage argument, like MS-DOS.
Fair enough, though I believe OS X has a good reason, namely it's an
OS designed for regular users rather than servers or programmers. Case-
sensitivity would confuse my mother.
> HOWEVER.
>
> It is a totally different issue if we want to refuse supporting
> insane filesystems. And the answer is no. It was not my
> intention to say that we do not intend to support them, when I
> explained the reason why the things are as they are, which was
> the original question by you.
Ok. I wasn't implying anything with that phrase there, I was just
trying to reiterate that HFS+ is case-insensitive and emphasize that
this issue will become more relevant as time goes by.
> See Robin's proposal to let us translate random names we get
> back from readdir() from the filesystem using an additional
> look-up table in the index extension section that stores mapping
> from canonicalized form to the form that the user registered to
> the index. I think that is a sane approach to tackle this issue
> on insane filesystems like HFS+.
If I knew what the index extension section was, perhaps I would think
that's a good idea ;) I have yet to dive into the gory details of how
this stuff works.
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
* Re: Re-casing directories on case-insensitive systems
2008-01-12 1:43 ` Kevin Ballard
@ 2008-01-12 12:07 ` David Kastrup
2008-01-12 15:03 ` Dmitry Potapov
1 sibling, 0 replies; 29+ messages in thread
From: David Kastrup @ 2008-01-12 12:07 UTC (permalink / raw)
To: Kevin Ballard; +Cc: Junio C Hamano, git
Kevin Ballard <kevin@sb.org> writes:
> On Jan 11, 2008, at 8:30 PM, Junio C Hamano wrote:
>
>>> Apparently so. By Junio's definition, HFS+ is not a sane filesystem,
>>> and as git grows more popular with OS X users, this issue is going to
>>> crop up more frequently.
>>
>> It's not "my" definition, but you asked the reason and I gave
>> the answer. We can close this issue of "is HFS+ sane" now.
>> HFS+ is insane, period. And as Linus said, you cannot forgive
>> its insanity using the historical baggage argument, like MS-DOS.
>
> Fair enough, though I believe OS X has a good reason, namely it's an
> OS designed for regular users rather than servers or
> programmers. Case-sensitivity would confuse my mother.
If case-sensitivity were the primary cause of confusion in
mother-computer interoperation, you would have a remarkable mother.
"Type things the same way and they work the same" is a simple enough
rule.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: Re-casing directories on case-insensitive systems
2008-01-12 1:43 ` Kevin Ballard
2008-01-12 12:07 ` David Kastrup
@ 2008-01-12 15:03 ` Dmitry Potapov
1 sibling, 0 replies; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 15:03 UTC (permalink / raw)
To: Kevin Ballard; +Cc: Junio C Hamano, git
On Fri, Jan 11, 2008 at 08:43:35PM -0500, Kevin Ballard wrote:
>
> Fair enough, though I believe OS X has a good reason, namely it's an
> OS designed for regular users rather than servers or programmers. Case-
> sensitivity would confuse my mother.
Many *nix servers are running web services and Samba servers, yet most
users are not even aware of whether they are dealing with a case-sensitive file
system or not, let alone confused by that. This is because most
regular users will type the name only once when they create a new file
and then just click on this name. So case-sensitive file systems can
really confuse only some badly written applications...
Dmitry
* Re: Re-casing directories on case-insensitive systems
2008-01-11 23:26 ` Robin Rosenberg
2008-01-12 0:03 ` Kevin Ballard
@ 2008-01-12 0:37 ` Junio C Hamano
2008-01-12 0:57 ` Robin Rosenberg
1 sibling, 1 reply; 29+ messages in thread
From: Junio C Hamano @ 2008-01-12 0:37 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Linus Torvalds, Kevin Ballard, Johannes Schindelin, git
Robin Rosenberg <robin.rosenberg@dewire.com> writes:
> Could we just have a lookup table index extension for identifying the
> duplicates (when checking is enabled using core configuration option #3324)?
> That table would keep a mapping from a normalized form (maybe include
> canonical encoding while we're at it) to the actual octet sequence(s) used.
I would agree that the index extension, if we ever are going to
do this, would be the right place to store this information, at
the single repository level.
However, this opens up a can of worms. What should the canonical key
be? If you want to protect yourself from a unicode
normalizing filesystem, you would use one canonicalization,
while if you want to protect from a case losing filesystem you
would use another? Or do we downcase and NFD
normalize at the same time and be done with it?
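The three candidate keys can be compared in a short Python sketch. This is an
illustration only; the function names key_nfd, key_lower and key_both are
hypothetical, and str.lower() stands in for whatever downcasing rule would
actually be chosen.

```python
import unicodedata


def key_nfd(name: str) -> str:
    """Protects against a Unicode-normalizing filesystem only."""
    return unicodedata.normalize("NFD", name)


def key_lower(name: str) -> str:
    """Protects against a case-losing filesystem only."""
    return name.lower()


def key_both(name: str) -> str:
    """Downcase and NFD-normalize at the same time."""
    lowered = unicodedata.normalize("NFD", name).lower()
    return unicodedata.normalize("NFD", lowered)


a, b = "xt_mark.c", "xt_MARK.c"          # differ only in case
c, d = "M\u00e4rchen", "Ma\u0308rchen"   # differ only in normalization

print(key_nfd(a) == key_nfd(b), key_nfd(c) == key_nfd(d))          # False True
print(key_lower(a) == key_lower(b), key_lower(c) == key_lower(d))  # True False
print(key_both(a) == key_both(b), key_both(c) == key_both(d))      # True True
```

Each single-purpose key misses the other kind of collision; only the combined
key catches both, at the price of merging names that some filesystems keep
distinct.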
And where should the configuration be stored? If a project
wants to be interoperable across Linux and vfat, for example,
that canonicalization needs to be enabled in repositories of all
participants, be they on Linux or vfat, so that people on Linux
can be prevented from creating and registering two files xt_mark.c
and xt_MARK.c in the same directory, so that people who extract
the source on vfat won't have troubles.
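A check of this kind could look roughly like the Python sketch below. It is
an illustration under stated assumptions, not git code: the function name
case_collisions is made up, and plain str.lower() on the whole path is used
as the case-folding rule.

```python
from collections import defaultdict


def case_collisions(paths: list[str]) -> list[list[str]]:
    """Group paths that would collide on a case-insensitive filesystem."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        groups[path.lower()].append(path)
    # Only groups with more than one member are collisions.
    return [sorted(group) for group in groups.values() if len(group) > 1]


tracked = ["net/xt_mark.c", "net/xt_MARK.c", "Makefile"]
print(case_collisions(tracked))  # [['net/xt_MARK.c', 'net/xt_mark.c']]
```

Running such a check on Linux as well as on vfat is what would require the
setting to travel with the project rather than with one repository.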
Which means the information needs to be in-tree. But that
should not be in .gitattributes (which by definition is for
per-path things).
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:37 ` Junio C Hamano
@ 2008-01-12 0:57 ` Robin Rosenberg
2008-01-12 16:33 ` Johannes Schindelin
0 siblings, 1 reply; 29+ messages in thread
From: Robin Rosenberg @ 2008-01-12 0:57 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Kevin Ballard, Johannes Schindelin, git
On Saturday, 12 January 2008, Junio C Hamano wrote:
> Robin Rosenberg <robin.rosenberg@dewire.com> writes:
>
> > Could we just have a lookup table index extension for identifying the
> > duplicates (when checking is enabled using core configuration option #3324)?
> > That table would keep a mapping from a normalized form (maybe include
> > canonical encoding while we're at it) to the actual octet sequence(s) used.
>
> I would agree that the index extension, if we ever are going to
> do this, would be the right place to store this information, at
> the single repository level.
>
> However, this opens up a can of worms. What should the canonical key
> be? If you want to protect yourself from a unicode
> normalizing filesystem, you would use one canonicalization,
> while if you want to protect from a case losing filesystem you
> would use another? Or do we downcase and NFD
> normalize at the same time and be done with it?
The worms are out already. So the question is whether there
is a way of keeping them in the can instead of having them crawl
all around. I think we could do both unicode (UTF-8 or NFD) and
downcase at the same time.
> And where should the configuration be stored? If a project
> wants to be interoperable across Linux and vfat, for example,
In the brand new ".gitconfig". It could in principle contain any config option,
but that would not be safe so I guess one should only allow "safe" options
there.
-- robin
* Re: Re-casing directories on case-insensitive systems
2008-01-12 0:57 ` Robin Rosenberg
@ 2008-01-12 16:33 ` Johannes Schindelin
0 siblings, 0 replies; 29+ messages in thread
From: Johannes Schindelin @ 2008-01-12 16:33 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Junio C Hamano, Linus Torvalds, Kevin Ballard, git
Hi,
On Sat, 12 Jan 2008, Robin Rosenberg wrote:
> On Saturday, 12 January 2008, Junio C Hamano wrote:
>
> > And where should the configuration be stored? If a project wants to
> > be interoperable across Linux and vfat, for example,
>
> In the brand new ".gitconfig". It could in principle contain any config
> option, but that would not be safe so I guess one should only allow
> "safe" options there.
Funny: I had the same idea (.gitconfig) for the crlf issues...
Ciao,
Dscho
* Re: Re-casing directories on case-insensitive systems
2008-01-11 22:08 ` Linus Torvalds
2008-01-11 23:10 ` David Kastrup
2008-01-11 23:26 ` Robin Rosenberg
@ 2008-01-12 14:46 ` Dmitry Potapov
2008-01-12 18:47 ` Linus Torvalds
2 siblings, 1 reply; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 14:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git
On Fri, Jan 11, 2008 at 02:08:35PM -0800, Linus Torvalds wrote:
>
> However, it's not like there is even a simple solution. The right place to
> do that check would probably be in "add_index_entry()", but doing a check
> whether the same file already exists (in a different case) is simply
> *extremely* expensive for a very critical piece of code, unless we were to
> change that index data structure a lot (ie add a separate hash for the
> filenames).
After a cursory look at the source code, I wonder if converting name1
and name2 to upper case before memcmp in cache_name_compare() can
help case-insensitive systems. This change will change the order of
file names in the index, but I suppose that it should not be a problem,
because the index is host specific. Though, this fix is too simple, so
I guess, I missed something.
> (And that's totally ignoring the fact that case-insensitivity then also
> has tons of i18n issues and can get *really* messy
The proper support of i18n is not simple even without case-insensitivity.
For instance, there are four different encodings widely used for Russian
letters. On Windows alone, you have two simultaneously in the default
settings -- Windows-1251 for Windows applications and CP866 for Console
applications... Actually, some console applications can change their default
encoding, and it seems Cygwin programs do that. So, based on whether you
use gcc from Cygwin or Visual C to compile your console program, you can
get a different encoding. On *nix in Russia, koi8-r and utf-8 are most
popular... So, if you have a repository shared between different systems,
you cannot think about a file name just as a sequence of bytes anymore.
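The effect is easy to see in a small Python sketch (an illustration only; the
file name is an arbitrary example, and the codec names are those of Python's
standard codecs for the four encodings mentioned):

```python
name = "\u0444\u0430\u0439\u043b.txt"  # "файл.txt" ("file.txt" in Russian)

codecs_in_use = ("cp1251", "cp866", "koi8_r", "utf-8")

# The same name becomes a different byte sequence under each encoding.
encoded = {codec: name.encode(codec) for codec in codecs_in_use}
for codec, raw in encoded.items():
    print(f"{codec:8} {raw.hex()}")

# Bytes written under one encoding read back as a different name under another.
misread = name.encode("cp1251").decode("cp866")
print(misread == name)  # False
```

A repository that stores only the bytes therefore shows the same file under
visibly different names on differently configured systems.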
OTOH, I doubt that many people are really interested in using non-ASCII
file names with Git right now.
Dmitry
* Re: Re-casing directories on case-insensitive systems
2008-01-12 14:46 ` Dmitry Potapov
@ 2008-01-12 18:47 ` Linus Torvalds
2008-01-12 19:29 ` Dmitry Potapov
0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2008-01-12 18:47 UTC (permalink / raw)
To: Dmitry Potapov; +Cc: Kevin Ballard, Johannes Schindelin, git
On Sat, 12 Jan 2008, Dmitry Potapov wrote:
>
> After a cursory look at the source code, I wonder if converting name1
> and name2 to upper case before memcmp in cache_name_compare() can
> help case-insensitive systems. This change will change the order of
> file names in the index, but I suppose that it should not be a problem,
> because the index is host specific. Though, this fix is too simple, so
> I guess, I missed something.
No, the index isn't host-specific, and we also have a deep knowledge of
the fact that the index order is the same as the unpacked tree order.
So no, we absolutely cannot just sort the index differently. We literally
need to have a separate key for an "upper case lookup".
(That separate key can be just a hash table - it doesn't need to be
something you can iterate over, so it can be pretty simple).
> > (And that's totally ignoring the fact that case-insensitivity then also
> > has tons of i18n issues and can get *really* messy
>
> The proper support of i18n is not simple even without case-insensitivity.
> For instance, there are four different encodings widely used for Russian
> letters.
.. and git is very clear about this: filenames are *not* "characters" in
the i18n sense, they are series of bytes. There is absolutely no room for
ambiguity, and there is no locale for those things.
And that isn't going to change. It's the only sane way to do
locale-independent names: people can *choose* to see the filenames as some
UTF-8 sequence, or a series of Latin1, or anything, but that's not
something git itself will care about.
Trying to involve locale in name comparison simply isn't possible. Two
different repositories on two different filesystems would get two
different answers. And that is simply unacceptable in a distributed
system.
What we can do is to make the simple cases (ie the locale-*independent*
ones) warn about problems with case insensitivity.
Linus
* Re: Re-casing directories on case-insensitive systems
2008-01-12 18:47 ` Linus Torvalds
@ 2008-01-12 19:29 ` Dmitry Potapov
0 siblings, 0 replies; 29+ messages in thread
From: Dmitry Potapov @ 2008-01-12 19:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Kevin Ballard, Johannes Schindelin, git
On Sat, Jan 12, 2008 at 10:47:10AM -0800, Linus Torvalds wrote:
>
> And that isn't going to change. It's the only sane way to do
> locale-independent names: people can *choose* to see the filenames as some
> UTF-8 sequence, or a series of Latin1, or anything, but that's not
> something git itself will care about.
Unfortunately, agreeing on a single encoding for different systems is
even more difficult than agreeing on a single end-of-line encoding.
OTOH, it is not a real issue as long as everyone uses ASCII names only.
>
> Trying to involve locale in name comparison simply isn't possible.
Agreed. However, the proper solution would be for all filenames to be
stored in UTF-8, so conversion is done when a file is added to the
index. But that requires a lot of work, and as I said before, I doubt
that many people really want to store files with non-ASCII names; after
all, Git is a developer tool. So, as far as I am concerned, it is not
worth the effort.
Dmitry