From: "Shawn O. Pearce" <spearce@spearce.org>
To: Johannes Sixt <j.sixt@viscovery.net>
Cc: Thomas Singer <thomas.singer@syntevo.com>, git@vger.kernel.org
Subject: Re: non-US-ASCII file names (e.g. Hiragana) on Windows
Date: Tue, 1 Dec 2009 08:26:27 -0800
Message-ID: <20091201162627.GE21299@spearce.org>
In-Reply-To: <4B14EB2E.9020906@viscovery.net>

Johannes Sixt <j.sixt@viscovery.net> wrote:
> Thomas Singer schrieb:
> > To be more precise: Who is interpreting the bytes in the file names as
> > characters? Windows, Git or Java?
> 
> In the case of git: Windows does it, using the console's codepage to
> convert between bytes and Unicode.
> 
> I don't know about Java, but I guess that no conversion is necessary
> because Java is Unicode-aware.

Actually, conversion is necessary, and it's something that is proving
to be really painful within JGit.

The Java IO APIs use UTF-16 for file names.  However we are reading
a stream of unknown bytes from the index file and tree objects.
Thus JGit must convert a stream of bytes into UTF-16 just to get
to the OS.
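
As a minimal sketch of that byte-to-UTF-16 step (not JGit's actual
code): the helper name decodePath and the ISO-8859-1 fallback are
assumptions made for illustration only.  It tries a strict UTF-8
decode first, and falls back to a charset that maps every byte to
exactly one char, so the original bytes stay recoverable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class PathDecode {
    // Hypothetical helper, not JGit's real code: turn raw path bytes from a
    // tree or index entry into a Java (UTF-16) String.
    static String decodePath(byte[] raw) {
        try {
            // Strict UTF-8: fail on malformed input instead of silently
            // substituting U+FFFD, which would lose the original bytes.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback: ISO-8859-1 maps every byte to one char, so
            // the bytes can be recovered by encoding with the same charset.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] hiragana = {(byte) 0xE3, (byte) 0x81, (byte) 0x82}; // UTF-8 for U+3042
        byte[] latin1 = {(byte) 0xE4};                             // not valid UTF-8
        System.out.println(decodePath(hiragana).equals("\u3042"));
        System.out.println(decodePath(latin1).equals("\u00E4"));
    }
}
```

The guess is still only a guess: a byte sequence that happens to be
valid UTF-8 but was written in another encoding decodes to the wrong
characters without any error.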

The JVM then turns around and converts from UTF-16 to some other
encoding for the filesystem.

On Win32 I suspect the JVM uses the native UTF-16 file APIs, so
this translation is lossless.

On POSIX, I suspect the JVM uses $LANG or some other related
environment variable to guess the user's preferred encoding, and
then converts from UTF-16 to bytes in that encoding.  And I have
no idea how they handle normalization of composed code points.
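
To illustrate why a UTF-16 to locale-encoding conversion can lose
information, here is a small sketch.  ISO-8859-1 stands in as an
assumed locale encoding; which charset a given JVM really uses for
file names is implementation-specific and not shown here.  The
class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncode {
    // True if every char of the name survives encoding in the given charset.
    static boolean representable(String name, Charset cs) {
        return cs.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String name = "\u3042.txt"; // Hiragana letter A plus an ASCII suffix
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(representable(name, StandardCharsets.UTF_8));
        System.out.println(representable(name, StandardCharsets.ISO_8859_1));
        // String.getBytes(Charset) replaces unmappable chars with the
        // charset's replacement byte, so the Hiragana letter is lost here.
        byte[] lossy = name.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1));
    }
}
```

Checking canEncode before handing a name to the file APIs is one way
to notice the problem early instead of creating a file whose name
contains replacement characters.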

All of these layers make for a *very* confusing situation for us
within JGit:

  git tree
  +---------+
  | bytes   | -+
  +---------+   \
                 \             +--------+            +---------+
                  +-- JGit --> | UTF-16 | -- JVM --> | OS call |
  .git/index     /             +--------+            +---------+
  +---------+   /
  | bytes   | -+
  +---------+

It's impossible for us to do what C git does, which is just use the
bytes used by the OS call within the git data structure.  Which of
course also isn't always portable, e.g. the Mac OS X HFS+ mess.
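
The portability problem with raw bytes can be sketched with
java.text.Normalizer.  HFS+ stores names in a decomposed form close
to NFD, while most other systems hand back whatever bytes were
given, typically precomposed.  The class and helper names below are
hypothetical; the point is only that one visible name has two
different byte sequences.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

class NormalizeDemo {
    // UTF-8 bytes of the name after normalizing it to the given form.
    static byte[] utf8(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u00E9"; // e with acute accent, precomposed
        byte[] composed = utf8(name, Normalizer.Form.NFC);   // C3 A9
        byte[] decomposed = utf8(name, Normalizer.Form.NFD); // 65 CC 81
        // Same name to a user, different bytes to a byte-wise comparison.
        System.out.println(Arrays.equals(composed, decomposed));
    }
}
```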

:-)

-- 
Shawn.
