From: Junio C Hamano <gitster@pobox.com>
To: git@vger.kernel.org
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
Linus Torvalds <torvalds@linux-foundation.org>,
Kevin Ballard <kevin@sb.org>, Theodore Tso <tytso@MIT.EDU>,
Mike Hommey <mh@glandium.org>
Subject: On pathnames
Date: Thu, 24 Jan 2008 13:02:54 -0800
Message-ID: <7vprvr7x8h.fsf@gitster.siamese.dyndns.org>
In-Reply-To: alpine.LFD.1.00.0801230930390.1741@woody.linux-foundation.org
One of Linus's recent patches introduces an index hashtable so
that we can later hash "equivalent" names into the same bucket,
which allows comparisons that are not byte-by-byte.
Before going further, I needed to formalize what we are trying
to achieve. I learned a few things from the long flamewar
thread, but it is very inefficient to go back to the thread to
pick only the useful pieces. The whole flamewar simply did not
fit a small Panda brain.
That was the reason for this write-up.
Design constraints. In the following, I'll use two names $A and
$B as an example. They are a pair of names that are considered
equivalent in some contexts, such as:
A=xt_connmark.c B=xt_CONNMARK.c
(1) Some filesystems prevent you from having these two
(confusing) paths in a directory at the same time. Some do
not implement this confusion prevention, and allow both
names to exist at the same time.
Let's call the former "case insensitive", and the latter
"case sensitive".
(2) readdir(3) on some "case insensitive" filesystems returns
$A, after a successful creat(2) of $B. Others remember
which one of the two "equivalent" names was used in
creat(2).
Let's call the former "case folding", and the latter "case
preserving".
We assume that if a case folding filesystem returns $A from
readdir(3) after a successful creat(2) of $B, then open(2) or
lstat(2) of either $A or $B will succeed.
(3) Among the "case folding" ones, some filesystems fold the
pathname to a form that is less interoperable with other
systems, and/or the form that is likely to be different
from what the end-user usually enters.
Such filesystems are "inconveniently case folding".
The last one is not quite apparent with the "xt_connmark.c"
example, but if you replace $A and $B in the above description
with:
A=Ma"rchen B=Märchen
it would hopefully become more clear.
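
To make the second pair concrete, here is a minimal Python sketch that prints the two byte sequences. It assumes that the Ma"rchen spelling above stands for the decomposed form ("a" followed by a combining diaeresis), and it uses Unicode NFD/NFC normalization as a stand-in; the rule a particular filesystem applies may differ in detail.

```python
import unicodedata

# $B: precomposed form, the one keyboard/IM input usually produces.
name_b = unicodedata.normalize("NFC", "M\u00e4rchen")
# $A: decomposed form (assumed to be what Ma"rchen denotes above).
name_a = unicodedata.normalize("NFD", name_b)

bytes_a = name_a.encode("utf-8")
bytes_b = name_b.encode("utf-8")

print(bytes_a)  # b'Ma\xcc\x88rchen'
print(bytes_b)  # b'M\xc3\xa4rchen'

# The two names look the same to a person but differ byte-by-byte.
print(bytes_a == bytes_b)  # False
```

A memcmp(3)-style comparison therefore treats the two as unrelated names, even though a person reads them as the same word.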
For example, vfat is generally "case preserving". In that long
flamewar thread, I think we learned that HFS+ is in general
"inconveniently case folding" with respect to Unicode: it always
folds to $A, while keyboard/IM input is more likely to come
as $B, which happens to be the more interoperable form with
other systems.
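
The first two definitions can be checked mechanically. The following Python sketch probes a directory by creating $B and then looking for $A; the function name classify_directory and its parameters are hypothetical and are not part of git. It cannot tell whether a folding filesystem is "inconveniently" folding, because that depends on which form users and other systems prefer.

```python
import os
import tempfile

def classify_directory(path, name_a="xt_connmark.c", name_b="xt_CONNMARK.c"):
    """Classify the filesystem holding `path` with the terms defined above."""
    with tempfile.TemporaryDirectory(dir=path) as probe:
        # creat(2) of $B
        with open(os.path.join(probe, name_b), "w"):
            pass
        # lstat(2) of $A: does it find the file we created as $B?
        if not os.path.lexists(os.path.join(probe, name_a)):
            return "case sensitive"
        # readdir(3): did the filesystem remember the name we used?
        if name_b in os.listdir(probe):
            return "case preserving"
        return "case folding"

print(classify_directory(tempfile.gettempdir()))
```

The answer describes only the filesystem that holds the probed directory; a different mount point on the same machine may behave differently.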
Issues with case insensitive filesystems
----------------------------------------
At the data structure level, a pathname to git is a sequence of
bytes terminated with NUL. This will _not_ change.
By the way, at the data structure level, a tree entry in git can
represent a blob that is a symbolic link. A tree entry in git
can also represent a blob that is a regular file, and in that
case, it can record whether the file is executable or not. These will
also not change.
Now, let's think about how we allow use of git on a filesystem
that is incapable of symbolic links, and/or a filesystem that
does not have a trustworthy executable bit.
We do not say "Symlinks are evil and not supported everywhere,
so let's introduce a project configuration to disallow addition
of symlinks". We do not say that about the executable bit, either.
Instead, we have fallback methods to allow manipulating symlinks
and the executable bit on a filesystem that is incapable of
handling them natively.
We should be able to do the same for this "case sensitivity"
issue. A tree that has xt_connmark.c and xt_CONNMARK.c at the
same time cannot be checked out on a case insensitive filesystem.
The filesystem is simply incapable of it (please just calmly
rephrase it in your head as "does not allow such confusing
craziness" instead of starting another flamewar, if you feel the
expression "incapable of" insults your favorite filesystem).
That may mean the project should avoid such equivalent names in
its trees (and having a project wide configuration could be a
technical means to help enforce that policy), but it does not
mean the core level of git should prevent them from being created on
such systems. It just means that there should be a way, that
could (and sometimes has to) be different from the "natural"
way, to manipulate such tree entries even on a case insensitive
filesystem.
For example, suppose I find that the RelNotes symlink incorrectly points
at Documentation/RelNotes-1.5.44.txt and want to fix it and push
it out immediately, but I am on the road and the only
environment I can borrow is a git installation on a filesystem
that is symlink-challenged. I can still do the fix. On such a
filesystem, a symlink is checked out as a regular file but is
still marked as a symlink in the index. The only thing I need
to do is to edit the file (making sure not to add an extra LF at
the end) and add it to the index. That's certainly different
from the "natural" way to do that on a filesystem with symlinks,
which is "ln -fs Documentation/RelNotes-1.5.4.txt RelNotes", but
the point is that we make it possible.
The same thing should apply to two files that cannot be checked
out at the same time on case insensitive filesystems. Perhaps
we could have something like:
$ git show :xt_CONNMARK.c >xt_connmark-1.c
$ edit xt_connmark-1.c
$ git add --as xt_CONNMARK.c xt_connmark-1.c
Issues with case folding filesystems
------------------------------------
In addition to the above, case folding filesystems
have an issue even when there are no "confusing" names in the
tree. The project may want to have "Märchen" (but not
"Ma"rchen"), but a checkout (which is creat(2) of "Märchen" --
because that is the byte sequence recorded in tree objects and
the index) will result in "Ma"rchen" and no "Märchen" (hence
readdir(3) returns "Ma"rchen").
Linus's patch to use a hashtable that links "equivalent" names
together is a step in the right direction to address this. The
tree (and the index) has name $B, we check out and the
filesystem folds it to $A. When we get the name $A back from
the filesystem (via readdir(3)), we hash the name using a hash
function that would drop names $A and $B into the same bucket,
and compare that name $A with each hash entry using a comparison
that treats $A and $B as equivalent. If we find one, then
we keep the name $B we have already.
If it is a new file, we won't find any name that is equivalent
to $A in the index, and we use the name $A obtained from
readdir(3).
BUT with a twist.
If the filesystem is known to be inconveniently case folding, we
are better off registering $B instead of $A (assuming we can
convert from $A to $B).
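
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Linus's patch: NameIndex, fold_key and name_to_register are hypothetical names, and simple case folding stands in for whatever rule the filesystem really applies.

```python
def fold_key(name):
    # Assumption: plain case folding stands in for the filesystem's rule.
    return name.casefold()

class NameIndex:
    """Names already in the index, bucketed so equivalent names collide."""

    def __init__(self, names=()):
        self._buckets = {}
        for name in names:
            self.add(name)

    def add(self, name):
        bucket = hash(fold_key(name))
        self._buckets.setdefault(bucket, []).append(name)

    def find_equivalent(self, name):
        key = fold_key(name)
        # Same bucket does not imply equivalence, so compare each entry.
        for candidate in self._buckets.get(hash(key), []):
            if fold_key(candidate) == key:
                return candidate
        return None

def name_to_register(index, name_from_readdir, convert=None):
    known = index.find_equivalent(name_from_readdir)
    if known is not None:
        return known                       # keep the $B we already have
    if convert is not None:
        return convert(name_from_readdir)  # the twist: prefer $B over $A
    return name_from_readdir               # new file: use $A as given

index = NameIndex(["xt_CONNMARK.c"])
print(name_to_register(index, "xt_connmark.c"))  # xt_CONNMARK.c
print(name_to_register(index, "README"))         # README
```

The convert argument is optional because it only makes sense on a filesystem known to be inconveniently case folding, and only when a conversion from $A to $B exists.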
One bad issue during development is that we cannot sanely
emulate case folding behaviour on non case-folding filesystems
without wrapping open(2), lstat(2), and friends, because of the
assumption we made above in (2) where we defined the term "case
folding". This means that the codepaths that deal with case
folding filesystems are inevitably harder to debug.
Tasks
-----
- Identify which case folding filesystems need to be supported,
and make sure somebody understands the folding logic of each;
- For each supported case folding logic, these are needed:
- a hash function that throws "equivalent" names in the same
bucket, to be used in Linus's patch;
- a compare function to determine equivalent names;
- a convert function that takes a possibly inconvenient form
of equivalent name (i.e. $A above) as input and returns
more convenient form (i.e. $B above)
- Identify places that we use the names obtained from places
other than the index and tree. From these places, we would
need to call the convert function to (de)mangle the name
before they hit the index.
Because we may be getting driven by something like:
$ find | xargs git-foo
giving special treatment only to the readdir(3) calls we make
ourselves does not make much sense. Any path from the user is suspect.
- Identify places that we look for a name in the index, and
perform equivalent comparison instead of the memcmp(3) we
traditionally used. Linus's patch gives scaffolding for this.
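
As a rough illustration of the per-filesystem pieces listed above, here is a Python sketch of the three functions for one folding logic, namely a filesystem that decomposes Unicode names. All function names are hypothetical, and NFD/NFC normalization is an assumption standing in for the real folding rule, which somebody would have to confirm for each supported filesystem.

```python
import unicodedata

def _canonical(name):
    # Assumed folding logic: reduce every name to its decomposed form.
    return unicodedata.normalize("NFD", name)

def hash_name(name):
    """Hash function that throws equivalent names into the same bucket."""
    return hash(_canonical(name))

def names_equivalent(left, right):
    """Compare function: True when the two names are equivalent."""
    return _canonical(left) == _canonical(right)

def convert_name(name):
    """Convert a possibly inconvenient form ($A) to the convenient one ($B)."""
    return unicodedata.normalize("NFC", name)

def demangle_user_paths(paths):
    """Apply the convert function to names that did not come from the index
    or a tree, for example ones fed in by find | xargs."""
    return [convert_name(path) for path in paths]

name_a = "Ma\u0308rchen"  # decomposed, as the filesystem would return it
name_b = "M\u00e4rchen"   # precomposed, as recorded in the tree

print(names_equivalent(name_a, name_b))         # True
print(hash_name(name_a) == hash_name(name_b))   # True
print(demangle_user_paths([name_a]) == [name_b])  # True
```

The hash and compare functions must agree on what "equivalent" means; deriving both from one canonical form is one simple way to keep them consistent.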