From: Linus Torvalds <torvalds@linux-foundation.org>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: What's cooking in git.git (Jul 2009, #01; Mon, 06)
Date: Tue, 7 Jul 2009 12:17:46 -0700 (PDT)
Message-ID: <alpine.LFD.2.01.0907071142330.3210@localhost.localdomain>
In-Reply-To: <7vk52l4q7k.fsf@alter.siamese.dyndns.org>
On Mon, 6 Jul 2009, Junio C Hamano wrote:
>
> * lt/read-directory (Fri May 15 12:01:29 2009 -0700) 3 commits
> - Add initial support for pathname conversion to UTF-8
> - read_directory(): infrastructure for pathname character set
> conversion
> - Add 'fill_directory()' helper function for directory traversal
>
> Before adding the real "conversion", this needs a few real fixups, I
> think. For example there is one hardcoded array that is used without
> bounds check.
Hmm. I'm not sure what array you're talking about (the newpath/newbase
ones? We do protect against PATH_MAX, it's just that we protect against it
in the "previous iteration").
The bigger issue, though, is that I spent half a day looking more at this
series last Thursday, and I've got some improvements, but getting "all the
way" turns out to be really quite painful.
Why?
We have a _lot_ of code that does "lstat()" on pathnames, and it all
basically uses the internal git representation of the pathname. In
particular, we do this a lot for index lookups, but it's true in other
cases too (example: things like tree merging, where we check whether a
file exists in the working tree).
To test this all out, I actually fleshed out the patches to the point
where I could do
	[core]
		PathEncoding = Latin1
and actually have the working tree use Latin1 encoding, and convert
internally in git to UTF-8, and have a working "git add ."
However, "git add ." was just about the only thing that I made do the
right thing. Even doing a simple "git diff" afterwards would then show the
file as deleted, because the UTF-8 version of the file (that the index
contained) didn't exist in the filesystem. I fixed that with a hack, but
it basically turns out to be pretty damn ugly, and there's a _lot_ of
those places.
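To make that failure concrete, here is a minimal Python sketch (an illustration only, not code from the patch series). A plain set of byte strings stands in for the working tree, and the helper names `exists_in_tree`, `to_git` and `to_fs` are made up for this example. It shows why a lookup with the index's UTF-8 name misses a file that is stored under its Latin1 name, and why the general case needs conversion in both directions.

```python
# Hypothetical stand-in for the working tree: the set of names as the
# filesystem stores them. Here the filesystem encoding is Latin1.
name = "\u00e5\u00e4\u00f6"          # the three characters a-ring, a-umlaut, o-umlaut
on_disk = name.encode("latin-1")     # 3 bytes: e5 e4 f6
in_index = name.encode("utf-8")      # 6 bytes: c3 a5 c3 a4 c3 b6
working_tree = {on_disk}


def exists_in_tree(path: bytes) -> bool:
    """Stand-in for lstat(): succeeds only on a byte-for-byte match."""
    return path in working_tree


def to_git(fs_name: bytes) -> bytes:
    """Filesystem (Latin1) name -> internal (UTF-8) name, as at readdir() time."""
    return fs_name.decode("latin-1").encode("utf-8")


def to_fs(git_name: bytes) -> bytes:
    """Internal (UTF-8) name -> filesystem (Latin1) name, needed before lstat()."""
    return git_name.decode("utf-8").encode("latin-1")


# Looking the file up by its index name fails, so it looks deleted.
missing = not exists_in_tree(in_index)
# Converting back to the filesystem encoding first makes the lookup work.
found = exists_in_tree(to_fs(in_index))
```

Every place that passes an internal name to lstat() would need the equivalent of `to_fs()`, which is where the "lot of those places" comes from.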
So, the question is, "What now?"
There are a few alternatives:
(a) don't do any of this crap at all. What git does right now works fairly
well for most people. Instead, perhaps worry about just the crazy
case-insensitive filesystems, which are a totally separate issue.
End result: git will always have problems with the crazy NFD format
that OS X uses. Mixing git archives across OS X and other saner
operating systems (and in this context, Windows really does count as
"saner" - it really is OS X that is braindamaged!) will be painful if
you have odd characters in your working tree.
This is the simplest approach, of course. The case-insensitivity is
still not trivial, but we could work on it, and it really is a
different problem (and has none of the "if you look the file up with a
converted name, you cannot see it" issues that the Latin1<->UTF8
example had).
(b) Forget about the general case (like Latin1) that needs two-way
conversion. Just worry about OS X being crazy, and do the NFD->NFC
translation, which only needs to be done one way (because OS X will
still accept and recognize NFC characters, so the "converted" path is
still seen as valid by 'lstat()' and friends).
This is very much just a special case of handling filesystems that are
UTF-8, but are confused about what "equivalent" and "identical" mean,
and where the filesystem designer was a moron on some seriously crazy
drugs, and thought that equivalence means identity, and thought that
NFD is a sane form to expose.
This is a much simpler case than the general approach. I don't have OS
X to test with, though, and so far it doesn't appear that any OS X
people care enough to actually implement it. So I can fix up my
series to a certain point, but will never be able to really do the
final testing and tuning. At least with the full "treat filesystem as
Latin1 encoding", I could _test_ it.
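As a rough illustration of the one-way conversion in (b), the following Python sketch normalizes names to NFC as they come out of a directory listing. It assumes standard Unicode NFD as input (the decomposition OS X actually applies may differ in details), and `readdir_nfc` is a hypothetical helper name, not a git function.

```python
import unicodedata


def readdir_nfc(entries):
    """Return directory entry names normalized to NFC.

    Only this direction is needed: per the description above, the
    filesystem still accepts the NFC spelling on later lookups.
    """
    return [unicodedata.normalize("NFC", entry) for entry in entries]


# The same visible name in both forms.
nfd_name = "a\u030aa\u0308o\u0308"   # base letters plus combining marks
nfc_name = "\u00e5\u00e4\u00f6"      # precomposed characters

names = readdir_nfc([nfd_name, "README"])
```

The two spellings are equivalent but not identical: the NFD form is 9 bytes in UTF-8 and the NFC form is 6, so a byte-wise comparison treats them as different names unless one side is normalized first.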
(c) Try to bite the bullet. I can do this, but it really is going to be a
_very_ invasive patch-series, and it will probably involve some nasty
changes to the index format (for performance, we'll likely have to
change the index to have _both_ the "git filename", and the
"filesystem filename" in it).
This was what I wanted to do, and it's what you'd need to do if you do
things like Latin1 filesystem trees or ones where pathnames are done
with Shift-JIS encoding, or if we want to actually use the (crazy)
native Windows UCS filesystem accessors or whatever.
But I have to admit that after looking at the pain, I'm not at all
convinced it's worth it. Do we ever want to say "git supports
filesystems with Shift-JIS encoding"? Do people really care deeply
enough about non-UTF-8 filesystems that they'd be willing to live with a
_lot_ of pretty nasty complexity, and some real performance overhead?
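For a sense of what "both names in the index" in (c) could look like, here is a small Python sketch. It is purely illustrative: `IndexEntry`, `make_entry` and `build_index` are invented names, and this says nothing about the real on-disk index format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    git_name: bytes  # UTF-8 name used internally for sorting and comparison
    fs_name: bytes   # name exactly as the filesystem spells it


def make_entry(fs_name: bytes, fs_encoding: str) -> IndexEntry:
    """Convert once, when the entry is created, and remember both spellings."""
    git_name = fs_name.decode(fs_encoding).encode("utf-8")
    return IndexEntry(git_name=git_name, fs_name=fs_name)


def build_index(fs_names, fs_encoding):
    """Map internal names to entries; lookups never need to re-convert."""
    entries = (make_entry(n, fs_encoding) for n in fs_names)
    return {e.git_name: e for e in entries}


index = build_index([b"\xe5\xe4\xf6"], "latin-1")
entry = index["\u00e5\u00e4\u00f6".encode("utf-8")]
# entry.fs_name is what would be handed to lstat(); no conversion at lookup time.
```

Caching the filesystem spelling avoids a conversion on every lstat(), at the cost of a larger index entry and an index format change.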
I have to say, even with plain UTF-8, git isn't really a pleasure to use.
While I did my Latin1 test, I used filenames like "åäö" (the three extra
Finnish/Swedish characters), and if you do this
	mkdir test-repo
	cd test-repo
	git init
	echo testfile > åäö
	git add .
	git ls-files
the end result is not really usable. We quote it as a string of octal
byte escapes ("\303\245\303\244\303\266"), rather than showing "åäö". Our pathname quoting is trying to be
safe, which is good, but it does mean that right now, odd characters
aren't very friendly even _if_ you are using a sane filesystem, and all
plain NFC utf-8.
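The quoting behaviour can be approximated with a few lines of Python. This is a simplified sketch, not git's actual quoting routine (which, among other things, has short escapes for some control characters); `quote_path` is a name made up for the example.

```python
def quote_path(path: bytes) -> str:
    """Simplified C-style path quoting.

    Printable ASCII is kept, double quote and backslash are backslash-escaped,
    and every other byte becomes a three-digit octal escape. The result is
    wrapped in double quotes only if something had to be escaped.
    """
    out = []
    needs_quotes = False
    for b in path:
        if b in (0x22, 0x5C):            # '"' or '\'
            out.append("\\" + chr(b))
            needs_quotes = True
        elif 0x20 <= b < 0x7F:           # printable ASCII
            out.append(chr(b))
        else:                            # control bytes and everything >= 0x7f
            out.append("\\%03o" % b)
            needs_quotes = True
    body = "".join(out)
    return '"' + body + '"' if needs_quotes else body


quoted = quote_path("\u00e5\u00e4\u00f6".encode("utf-8"))
print(quoted)   # prints "\303\245\303\244\303\266" (including the double quotes)
```

Because the escaping works on raw bytes, it is safe for any encoding, but a valid UTF-8 name ends up just as unreadable as a garbage one.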
So right now, my personal opinion is:
- let's just face the fact that the only sane filename representation is
NFC UTF-8. Show filenames as UTF-8 when possible, rather than quoting
them.
- Do case (b) above: add support for converting NFD -> NFC at readdir()
time, so that OS X people can use UTF-8 sanely.
- add a "binary encoding" mode for filesystems that actually use Latin1,
just so that if people use Latin1 or Shift-JIS filesystem encodings, we
promise that we'll never munge those kinds of names.
- Maybe we'd make the "binary encoding" (which is effectively existing
git behavior) be the default on non-OSX platforms.
but that's just my gut feel from trying to weigh the costs of trying to do
something more involved against the costs of OS X support and just letting
crazy encodings exist in their own little worlds. So a development group
that uses Shift-JIS (or Latin1) would be able to work internally with git
that way, but would not be able to sanely work with the world at large
that uses UTF-8.
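The split between an NFC mode and a "binary encoding" mode could be sketched like this in Python. The mode names, the function `name_from_readdir`, and the choice to pass invalid UTF-8 through untouched in NFC mode are all assumptions made for the illustration, not something the series implements.

```python
import unicodedata


def name_from_readdir(raw: bytes, mode: str) -> bytes:
    """Turn a raw directory entry name into the name git would store.

    mode == "binary": existing behaviour, the bytes are never touched.
    mode == "nfc":    valid UTF-8 is normalized to NFC; anything that is
                      not valid UTF-8 is passed through (an assumption).
    """
    if mode == "binary":
        return raw
    if mode == "nfc":
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw
        return unicodedata.normalize("NFC", text).encode("utf-8")
    raise ValueError("unknown path encoding mode: %r" % mode)


latin1_name = b"\xe5\xe4\xf6"              # Latin1 bytes, not valid UTF-8
nfd_name = "a\u030a".encode("utf-8")       # decomposed a-ring

kept = name_from_readdir(latin1_name, "binary")
normalized = name_from_readdir(nfd_name, "nfc")
```

In binary mode a Latin1 or Shift-JIS tree round-trips exactly, which is the "never munge" promise; the price is that such names mean nothing to anyone expecting UTF-8.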
Linus