* [RFC] File system difference handling in git
@ 2008-01-22 9:21 Reece Dunn
2008-01-22 10:24 ` Junio C Hamano
2008-01-22 16:56 ` Linus Torvalds
0 siblings, 2 replies; 8+ messages in thread
From: Reece Dunn @ 2008-01-22 9:21 UTC (permalink / raw)
To: git
Hi,
Observing the various comments w.r.t. the different (potentially
braindead) filesystems that are available, there are two general
categories for behavioural differences:
1. File name representation
For Linux file systems (correct me if I am wrong here), they all store
the file name as-is. The question here is what happens on
Windows-based file systems (e.g. NTFS) that are being read on Linux?
For Mac filesystems, you have the Unicode character decomposition
issues to deal with.
For Windows, you have UTF-16 filename support.
There are two basic usages for file/directory names: passing the name
to the Operating System; getting the name from the Operating System.
Therefore, you have:
os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );
These can then be used to handle Operating System differences (e.g.
use WideCharToMultiByte/MultiByteToWideChar conversion on Windows to
map between UTF-8 and UCS-2/UTF-16).
If Mac has an API to handle its strange behaviour, that can be used
here as well.
2. Case (in)sensitivity
Here, you have the following cases:
1. git and the filesystem say that the files are different.
Update the git directory tree and move the file on the filesystem.
2. git and the filesystem say that the files are the same.
Generate an error as is currently done in git.
3. git says that the files are different, but the filesystem says
that the files are the same.
Allow the move, updating the git directory tree only.
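The three cases above can be written as a small decision function. This is only a sketch: the names mv_action and classify_move are hypothetical and are not part of git, and the combination "git says same, filesystem says different" is not covered by the list, so treating it as an error here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of this exists in git. */
enum mv_action {
    MV_RENAME_BOTH,  /* case 1: update git's tree and rename on disk */
    MV_ERROR_SAME,   /* case 2: source and destination are the same  */
    MV_RENAME_INDEX  /* case 3: update git's tree only               */
};

/*
 * git_same: git considers the two names identical.
 * fs_same:  the filesystem resolves both names to one file.
 */
enum mv_action classify_move(bool git_same, bool fs_same)
{
    if (git_same)
        return MV_ERROR_SAME;   /* also covers the unlisted combination */
    return fs_same ? MV_RENAME_INDEX : MV_RENAME_BOTH;
}
```

With this, a case-only rename on a case-insensitive filesystem would be classify_move(false, true), which selects the "update the git directory tree only" action.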
- Reece
* Re: [RFC] File system difference handling in git
2008-01-22 9:21 [RFC] File system difference handling in git Reece Dunn
@ 2008-01-22 10:24 ` Junio C Hamano
2008-01-22 10:52 ` Reece Dunn
2008-01-22 16:56 ` Linus Torvalds
1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2008-01-22 10:24 UTC (permalink / raw)
To: Reece Dunn; +Cc: git
"Reece Dunn" <msclrhd@googlemail.com> writes:
> 1. File name representation
>
> For Linux file systems ...
> Therefore, you have:
>
> os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
> git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );
It is not that simple, I am afraid. Legacy encodings can be
used in pathnames. With bog-standard traditional UNIX pathname
semantics, all pathnames are sequences of non-NUL, non-slash
bytes, separated with slashes, so if you do not allow choices
(which is a very sensible ideal world scenario), you can declare
that the "git" encoding is UTF-8 and always check things out
as-is.
But if you want a project ("git" in your above parlance) to be
checked out in two repositories, one with legacy and the other
with UTF-8, you cannot just say os_to_git/git_to_os. You would
need a bit more information from the repository owners what
encodings are suitable. So your os_to_git()/git_to_os() will
not be an identity function even on Linux to support such.
I used to have a data directory on my Linux box with EUC-JP
pathname and exported as an SMB share to my wife's Windows box,
telling samba to transliterate to whatever encoding the other
end liked. I did not want to have the pathname on the Linux end
in UTF-8 because I did not have enough energy to update my
Emacs configuration to grok Japanese in UTF-8 (even though I
finally bit the bullet and switched to UTF-8 on the Linux side
recently).
I know, this is painful. Real life hurts. Even on Linux,
not everybody can live in UTF-8-only world.
> 2. Case (in)sensitivity
>
> Here, you have the following cases:
> ...
> 3. git says that the files are different, but the filesystem says
> that the files are the same.
>
> Allow the move, updating the git directory tree only.
Sorry, I cannot really tell what you are talking about. You
seem to imply, with "Allow the move", that you are describing a
scenario that involves a move of one existing file to another,
but it is not clear. E.g. did you mean, by 3, "When the user
says 'move a b', and if git says a and b are different but if
the filesystem says a and b are the same, then..."?
* Re: [RFC] File system difference handling in git
2008-01-22 10:24 ` Junio C Hamano
@ 2008-01-22 10:52 ` Reece Dunn
2008-01-22 17:44 ` Steffen Prohaska
2008-01-22 20:54 ` Jonathan del Strother
0 siblings, 2 replies; 8+ messages in thread
From: Reece Dunn @ 2008-01-22 10:52 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On 22/01/2008, Junio C Hamano <gitster@pobox.com> wrote:
> "Reece Dunn" <msclrhd@googlemail.com> writes:
>
> > 1. File name representation
> >
> > For Linux file systems ...
> > Therefore, you have:
> >
> > os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
> > git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );
>
> It is not that simple, I am afraid. Legacy encodings can be
> used in pathnames. With bog-standard traditional UNIX pathname
> semantics, all pathnames are sequences of non-NUL, non-slash
> bytes, separated with slashes, so if you do not allow choices
> (which is a very sensible ideal world scenario), you can declare
> that the "git" encoding is UTF-8 and always check things out
> as-is.
So the upshot of this is that you need to use a platform (Operating
System, filesystem, locale, etc.) that matches the one the git repository
was created on; otherwise there are going to be issues with
interpreting paths correctly.
The locale issue aside, can the above proposal help users working on
Mac, Linux and Windows interoperate with each other?
I understand that there is not going to be a universal magic fix; what
I'm interested in is minimising the differences between Operating
Systems. This may be a futile effort, as it is likely you will need
some knowledge of the properties of the filesystem being used (as
filesystems with different properties can be used on the same
Operating System).
> > 2. Case (in)sensitivity
> >
> > Here, you have the following cases:
> > ...
> > 3. git says that the files are different, but the filesystem says
> > that the files are the same.
> >
> > Allow the move, updating the git directory tree only.
>
> Sorry, I cannot really tell what you are talking about. You
> seem to imply, with "Allow the move", that you are describing a
> scenario that involves a move of one existing file to another,
> but it is not clear. E.g. did you mean, by 3, "When the user
> says 'move a b', and if git says a and b are different but if
> the filesystem says a and b are the same, then..."?
This is what I am saying. For example, if you say:
git mv myfile.H myfile.h
on a case sensitive filesystem (e.g. ext3), this will work, however on
a case insensitive filesystem (e.g. ntfs) git would complain that the
files are the same.
The workaround is to say:
git mv myfile.H myfile.h.tmp
git mv myfile.h.tmp myfile.h
but this is not ideal, especially if you are automating some move operations.
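For reference, one way a tool could ask whether the filesystem treats two names as the same file is to compare device and inode numbers. This is a sketch under POSIX assumptions only: same_file is a hypothetical helper, it is not claimed to be what git does, and it would not be reliable on platforms where st_ino is not meaningful.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if both names resolve to the same
 * file, 0 if they resolve to different files, and -1 if either name
 * cannot be stat()ed.
 */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

On a case-insensitive volume, same_file("myfile.H", "myfile.h") would be expected to report one file, while on ext3 the second name would normally not exist at all.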
This also applies to the VCS importers (e.g. git-p4) that can delete a
file that is a case-only move on case insensitive filesystems.
The question then becomes what happens on Mac (with the Unicode
decomposing behaviour) if they differ in the way they are stored (e.g.
in Linus' 'ä' vs 'a¨' example)?
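To make the 'ä' example concrete, here are the two UTF-8 byte sequences involved: the precomposed form (U+00E4) and the decomposed form ('a' followed by U+0308, combining diaeresis). The variable and function names are made up for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Precomposed: U+00E4 encoded as UTF-8. */
const char name_nfc[] = "\xc3\xa4";
/* Decomposed: 'a' followed by U+0308 encoded as UTF-8. */
const char name_nfd[] = "a\xcc\x88";

/* Byte-wise comparison, i.e. what a plain strcmp()-based lookup sees. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

The two names render identically but are 2 and 3 bytes long respectively, so a byte-wise comparison treats them as different names.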
- Reece
* Re: [RFC] File system difference handling in git
2008-01-22 9:21 [RFC] File system difference handling in git Reece Dunn
2008-01-22 10:24 ` Junio C Hamano
@ 2008-01-22 16:56 ` Linus Torvalds
2008-01-22 20:21 ` David Kastrup
1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2008-01-22 16:56 UTC (permalink / raw)
To: Reece Dunn; +Cc: git
On Tue, 22 Jan 2008, Reece Dunn wrote:
>
> 1. File name representation
>
> For Linux file systems (correct me if I am wrong here), they all store
> the file name as-is. The question here is what happens on
> Windows-based file systems (e.g. NTFS) that are being read on Linux?
Generally, Linux tries to follow the conventions of the filesystem, so
it's generally case-preserving and case-sensitive (but not normalizing in
any way - the case sensitivity is literally a upcase lookup table, so you
do "upcase(c1) == upcase(c2)" for each UCS-2 character, no combining or
decomposition).
But the fs volume can specify if it's a case-sensitive volume or not. And
the volume will also actually contain the "upcase[]" array that defines
the case-sensitivity, so exactly *which* characters are equivalent isn't
actually defined by any external entity, it's defined by the particular
filesystem instance itself!
There's a default upcase table which is probably the one almost everybody
uses.
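A rough sketch of the per-character table comparison described above, for illustration only. The function names are hypothetical, and the toy table folds only ASCII a-z; a real volume carries its own table, whose contents are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UPCASE_SIZE 65536 /* one entry per 16-bit UCS-2 code unit */

/* Toy table (an assumption): identity everywhere except ASCII a-z. */
void init_toy_upcase(uint16_t *upcase)
{
    for (uint32_t c = 0; c < UPCASE_SIZE; c++)
        upcase[c] = (uint16_t)c;
    for (uint16_t c = 'a'; c <= 'z'; c++)
        upcase[c] = (uint16_t)(c - 'a' + 'A');
}

/*
 * Compare two UCS-2 names unit by unit through the table.
 * No combining or decomposition is attempted, so names of
 * different lengths can never match.
 */
int names_equal(const uint16_t *upcase,
                const uint16_t *a, size_t alen,
                const uint16_t *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (upcase[a[i]] != upcase[b[i]])
            return 0;
    return 1;
}
```

Under this scheme "my.H" and "my.h" compare equal, but a precomposed 'ä' (one unit) and 'a' plus a combining diaeresis (two units) do not, because the lengths differ.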
Caveat: I've never used NTFS myself, so I don't have any personal
knowledge. I can see the sources, and what it thinks it is doing, but
whether it works that way or not I'll leave to others.
Also, note: at least as far as Linux is concerned, NTFS is pure UCS-2. Not
UTF-16.
> For Mac filesystems, you have the Unicode character decomposition
> issues to deal with.
>
> For Windows, you have UTF-16 filename support.
.. and for pretty much all unixes, you also have potentially Latin1 or any
other local convention (eg EUC-JP or EUC-KR).
Sometimes you'd have to guess from the name itself what it is (ie there
might be a mixture). In those cases, it's probably best to *not* even try
to convert to unicode.
> There are two basic usages for file/directory names: passing the name
> to the Operating System; getting the name from the Operating System.
> Therefore, you have:
>
> os_to_git_path( const NATIVECHAR * ospath, strbuf * gitpath );
> git_to_os_path( const char * gitpath, const NATIVECHAR * ospath, int oslen );
It's not going to be that simple. And if you want type safety, it's the
"ospath" that needs to be "char *", since that's what you get from the OS
and it's really the "index form" that you want to protect from giving
unconverted by mistake to "lstat()" and friends.
And it's going to be really quite painful even with the compiler pointing
out each point where you use an "index name" with an operation that wants
an "OS name".
It would be interesting to see how painful it is (make the actual
"conversion" be a no-op at first, just casting the pointer), but I suspect
the answer is "very".
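A minimal sketch of what such an experiment might look like, assuming a struct wrapper is used for the index form rather than a bare pointer cast (that choice, and every name below, is hypothetical). The OS form stays a plain "char *", so handing an index name straight to lstat() fails to compile.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/* Index form: wrapped so it cannot be passed where "char *" is expected. */
struct index_path {
    const char *bytes;
};

/* No-op conversions for now; a real version would transcode here. */
const char *index_to_os(struct index_path ip)
{
    return ip.bytes;
}

struct index_path os_to_index(const char *os)
{
    struct index_path ip = { os };
    return ip;
}

/* Every OS call has to go through an explicit conversion. */
int index_path_exists(struct index_path ip)
{
    struct stat st;

    assert(ip.bytes != NULL);
    return lstat(index_to_os(ip), &st) == 0;
}
```

Writing lstat(ip, &st) with an index_path is a type error, so the compiler would point out each place where an index name meets an OS operation; the pain is in how many such places exist.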
Linus
* Re: [RFC] File system difference handling in git
2008-01-22 10:52 ` Reece Dunn
@ 2008-01-22 17:44 ` Steffen Prohaska
2008-01-22 20:54 ` Jonathan del Strother
1 sibling, 0 replies; 8+ messages in thread
From: Steffen Prohaska @ 2008-01-22 17:44 UTC (permalink / raw)
To: Reece Dunn; +Cc: Junio C Hamano, git
On Jan 22, 2008, at 11:52 AM, Reece Dunn wrote:
>>> 2. Case (in)sensitivity
>>>
>>> Here, you have the following cases:
>>> ...
>>> 3. git says that the files are different, but the filesystem says
>>> that the files are the same.
>>>
>>> Allow the move, updating the git directory tree only.
>>
>> Sorry, I cannot really tell what you are talking about. You
>> seem to imply, with "Allow the move", that you are describing a
>> scenario that involves a move of one existing file to another,
>> but it is not clear. E.g. did you mean, by 3, "When the user
>> says 'move a b', and if git says a and b are different but if
>> the filesystem says a and b are the same, then..."?
>
> This is what I am saying. For example, if you say:
>
> git mv myfile.H myfile.h
>
> on a case sensitive filesystem (e.g. ext3), this will work, however on
> a case insensitive filesystem (e.g. ntfs) git would complain that the
> files are the same.
>
> The workaround is to say:
>
> git mv myfile.H myfile.h.tmp
> git mv myfile.h.tmp myfile.h
>
> but this is not ideal, especially if you are automating some move
> operations.
>
> This also applies to the VCS importers (e.g. git-p4) that can delete a
> file that is a case-only move on case insensitive filesystems.
>
> The question then becomes what happens on Mac (with the Unicode
> decomposing behaviour) if they differ in the way they are stored (e.g.
> in Linus' 'ä' vs 'a¨' example)?
You can work around the problem as you described, but later git
will hit you again and fail unexpectedly when you try to merge
your change.
So it is better to avoid renames that only change case until git at
least passes the two tests below.
Steffen
---- snip ---
Git behaves strangely (from a user's point of view) on filesystems
that preserve case but do not distinguish it. The two major examples
are Windows and Mac OS X. Simple operations such as "git mv" or "git
merge" can fail unexpectedly.
This commit adds two simple tests. Both tests currently fail on
Windows and Mac, although they pass on Linux.
Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
t/t0050-filesystems.sh | 36 ++++++++++++++++++++++++++++++++++++
1 files changed, 36 insertions(+), 0 deletions(-)
create mode 100755 t/t0050-filesystems.sh
diff --git a/t/t0050-filesystems.sh b/t/t0050-filesystems.sh
new file mode 100755
index 0000000..953b02b
--- /dev/null
+++ b/t/t0050-filesystems.sh
@@ -0,0 +1,36 @@
+#!/bin/sh
+
+test_description='Various filesystems issues'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+
+ touch camelcase &&
+ git add camelcase &&
+ git commit -m "initial" &&
+ git tag initial &&
+ git checkout -b topic &&
+ git mv camelcase tmp &&
+ git mv tmp CamelCase &&
+ git commit -m "rename" &&
+ git checkout -f master
+
+'
+
+test_expect_success 'rename (case change)' '
+
+ git mv camelcase CamelCase &&
+ git commit -m "rename"
+
+'
+
+test_expect_success 'merge (case change)' '
+
+ git reset --hard initial &&
+ git merge topic
+
+'
+
+test_done
--
1.5.4.rc4
* Re: [RFC] File system difference handling in git
2008-01-22 16:56 ` Linus Torvalds
@ 2008-01-22 20:21 ` David Kastrup
2008-01-22 21:32 ` Linus Torvalds
0 siblings, 1 reply; 8+ messages in thread
From: David Kastrup @ 2008-01-22 20:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Reece Dunn, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 22 Jan 2008, Reece Dunn wrote:
>>
>> 1. File name representation
>>
>> For Linux file systems (correct me if I am wrong here), they all store
>> the file name as-is. The question here is what happens on
>> Windows-based file systems (e.g. NTFS) that are being read on Linux?
>
> Generally, Linux tries to follow the conventions of the filesystem, so
> it's generally case-preserving and case-sensitive (but not normalizing in
> any way - the case sensitivity is literally a upcase lookup table, so you
> do "upcase(c1) == upcase(c2)" for each UCS-2 character, no combining or
> decomposition).
s/sensitiv/insensitiv/g
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
* Re: [RFC] File system difference handling in git
2008-01-22 10:52 ` Reece Dunn
2008-01-22 17:44 ` Steffen Prohaska
@ 2008-01-22 20:54 ` Jonathan del Strother
1 sibling, 0 replies; 8+ messages in thread
From: Jonathan del Strother @ 2008-01-22 20:54 UTC (permalink / raw)
To: Reece Dunn; +Cc: Junio C Hamano, git
On Jan 22, 2008 10:52 AM, Reece Dunn <msclrhd@googlemail.com> wrote:
> This is what I am saying. For example, if you say:
>
> git mv myfile.H myfile.h
>
> on a case sensitive filesystem (e.g. ext3), this will work, however on
> a case insensitive filesystem (e.g. ntfs) git would complain that the
> files are the same.
>
> The workaround is to say:
>
> git mv myfile.H myfile.h.tmp
> git mv myfile.h.tmp myfile.h
>
> but this is not ideal, especially if you are automating some move operations.
>
If I remember correctly, this fails when it comes to applying the
commit containing that move, at least on HFS+. You could create 2
commits (one with the first mv, one with the second), and apply them
one at a time, but it's a pretty unpleasant workaround.
* Re: [RFC] File system difference handling in git
2008-01-22 20:21 ` David Kastrup
@ 2008-01-22 21:32 ` Linus Torvalds
0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2008-01-22 21:32 UTC (permalink / raw)
To: David Kastrup; +Cc: Reece Dunn, git
On Tue, 22 Jan 2008, David Kastrup wrote:
>
> s/sensitiv/insensitiv/g
Duh. Yes.
Linus