From: Robin Rosenberg <robin.rosenberg@dewire.com>
To: "Shawn O. Pearce" <spearce@spearce.org>
Cc: Johannes Sixt <j.sixt@viscovery.net>,
Thomas Singer <thomas.singer@syntevo.com>,
git@vger.kernel.org
Subject: Re: non-US-ASCII file names (e.g. Hiragana) on Windows
Date: Tue, 1 Dec 2009 23:11:13 +0100
Message-ID: <200912012311.14321.robin.rosenberg@dewire.com>
In-Reply-To: <20091201162627.GE21299@spearce.org>
On Tuesday 01 December 2009 17:26:27 you wrote:
> Johannes Sixt <j.sixt@viscovery.net> wrote:
> > Thomas Singer schrieb:
> > > To be more precise: Who is interpreting the bytes in the file names as
> > > characters? Windows, Git or Java?
> >
> > In the case of git: Windows does it, using the console's codepage to
> > convert between bytes and Unicode.
> >
> > I don't know about Java, but I guess that no conversion is necessary
> > because Java is Unicode-aware.
>
> Actually, conversion is necessary, and it's something that is proving
> to be really painful within JGit.
>
> The Java IO APIs use UTF-16 for file names. However we are reading
> a stream of unknown bytes from the index file and tree objects.
> Thus JGit must convert a stream of bytes into UTF-16 just to get
> to the OS.
>
> The JVM then turns around and converts from UTF-16 to some other
> encoding for the filesystem.
>
> On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
> this translation is lossless.
>
> On POSIX, I suspect the JVM uses $LANG or some other related
> environment variable to guess the user's preferred encoding, and
> then converts from UTF-16 to bytes in that encoding. And I have
> no idea how they handle normalization of composed code points.
>
> All of these layers make for a *very* confusing situation for us
> within JGit:
>
>    git tree
>  +---------+
>  |  bytes  | -+
>  +---------+   \
>                 \             +--------+            +---------+
>                  +-- JGit --> | UTF-16 | -- JVM --> | OS call |
>  .git/index     /             +--------+            +---------+
>  +---------+   /
>  |  bytes  | -+
>  +---------+
>
> It's impossible for us to do what C git does, which is just use the
> bytes used by the OS call within the git data structure. Which of
> course also isn't always portable, e.g. the Mac OS X HFS+ mess.
We can decode the index any way we like, but not file names coming from
the file system. On Windows, any sane name (it does allow invalid UTF-16 too,
but...) will be readable by JGit. On a UTF-8 POSIX system that may not be so if
the filename is actually Latin-1 encoded. In that case the Java runtime will
return a decoded filename containing an "invalid" code point, and any attempt to
access the file from Java will fail. I can see some horribly expensive ways to
work around that, but...
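
To make the failure mode above concrete, here is a minimal sketch. It is an
illustration only: it imitates the lossy decoding with plain String
conversions rather than calling the JVM's native file-name code, and the
class name, method names and sample file name are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of JGit or the JDK.
final class LossyNameDemo {
    // Decode raw name bytes as UTF-8. The String constructor replaces each
    // malformed sequence with U+FFFD instead of reporting an error.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // True if re-encoding the decoded name gives back the original bytes,
    // i.e. the decoded name could still be used to address the same file.
    static boolean roundTrips(byte[] raw) {
        byte[] again = decodeLenient(raw).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, again);
    }

    public static void main(String[] args) {
        // Assumed sample name containing a-ring and a-diaeresis.
        String name = "bl\u00e5b\u00e4r.txt";
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);

        System.out.println(roundTrips(utf8));   // name bytes survive decoding
        System.out.println(roundTrips(latin1)); // name bytes are lost
        System.out.println(decodeLenient(latin1).indexOf('\uFFFD') >= 0);
    }
}
```

Once the replacement character is in the name, the original bytes cannot be
recovered from the String, which is why the file can no longer be opened by
that name.
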
As for the more sane cases, I have a compare routine that works on mixed
encodings and may help solve some of the problems. Ideally it would not
only compare filenames with unknown encodings but also handle case
folding and composing characters in one go. I guess one could make it
fall back to an encoding other than Latin-1, though with less certainty, and
it will certainly not work with an arbitrary set of encodings. You'll have
to choose, so it's only a legacy workaround, as opposed to a solution.
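
A rough sketch of the idea, not the actual routine: try strict UTF-8 first,
fall back to one chosen legacy encoding if the bytes are not valid UTF-8,
and normalize to NFC before comparing so composed and decomposed forms match.
The class and method names are hypothetical, and case folding is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper; not JGit API.
final class MixedEncodingNames {
    // Strict UTF-8 decode; on malformed input, assume the fallback encoding.
    static String decode(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    // Compare two names of unknown encoding after NFC normalization.
    static boolean sameName(byte[] a, byte[] b, Charset fallback) {
        String x = Normalizer.normalize(decode(a, fallback), Normalizer.Form.NFC);
        String y = Normalizer.normalize(decode(b, fallback), Normalizer.Form.NFC);
        return x.equals(y);
    }

    public static void main(String[] args) {
        Charset legacy = StandardCharsets.ISO_8859_1; // the one encoding you chose
        byte[] utf8 = "bl\u00e5b\u00e4r.txt".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "bl\u00e5b\u00e4r.txt".getBytes(StandardCharsets.ISO_8859_1);
        // Decomposed form: 'a' + combining ring, 'a' + combining diaeresis.
        byte[] decomposed = "bla\u030Aba\u0308r.txt".getBytes(StandardCharsets.UTF_8);

        System.out.println(sameName(utf8, latin1, legacy));
        System.out.println(sameName(utf8, decomposed, legacy));
    }
}
```

The guess is only a guess: a Latin-1 name whose bytes happen to form valid
UTF-8 (for example 0xC3 0xA5) is taken as UTF-8, and only one fallback
encoding can be in play at a time.
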
-- robin