From: Paul Betts <paul@paulbetts.org>
To: git@vger.kernel.org
Subject: Path character encodings
Date: Mon, 5 Mar 2012 13:26:57 -0800
Message-ID: <20120305212657.GA17903@jupiter.local>

Hi guys,

As part of trying to fix the problems in MSysGit around tree encodings, I
would like to start a discussion on mitigating the backwards compatibility
problems associated with tree path encodings being unspecified.

## History

For those folks unfamiliar with the issue, here is a quick refresher. Git
has traditionally not specified the string encoding of paths inside the tree
object: whatever strings the OS returned from the readdir syscall were written
verbatim into tree objects. For most operating systems, this was UTF-8
(though even on certain POSIX OSs there are some caveats around Unicode
normalization, such as on OS X).

However, on Windows until *very* recently (and on non-Unicode Linux locales),
the strings returned by the OS are in a locale-specific OEM Code Page (e.g.
Shift-JIS, Windows-1252, etc.) and *not* Unicode. These repositories are
currently incorrectly interpreted on other OSs (or even the same OS with a
different locale configured).

Note that *blob* (i.e. content) encoding is a separate issue and is
out-of-scope at the moment.

This will become a bigger problem in the near future, because MSysGit is
seeking to fix this mistake on Windows by explicitly writing all tree objects
in UTF-8. While this is great for new repositories, this will create a
compatibility problem: people who upgrade their Git installation on their
local machine will now have issues with their existing repos.

## Proposed Mitigation

For an initial mitigation plan, I'd like to propose adding a check to either
git clone or git checkout that prints a warning to the user if it detects
paths that are not valid UTF-8.
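
To make the proposed check concrete, here is a rough sketch in Python (not
Git's actual code). It assumes the raw tree-entry names are already available
as byte strings; the function names and the sample entries are made up for
illustration.

```python
import sys


def is_valid_utf8(path: bytes) -> bool:
    """Return True if the raw tree-entry name decodes as strict UTF-8."""
    try:
        path.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def find_suspect_paths(paths):
    """Return the entries that are not valid UTF-8, in input order."""
    return [p for p in paths if not is_valid_utf8(p)]


if __name__ == "__main__":
    # Hypothetical sample entries; a real check would walk the tree objects.
    entries = [
        b"README",
        "r\u00e9sum\u00e9.txt".encode("utf-8"),
        "r\u00e9sum\u00e9.txt".encode("cp1252"),  # legacy code-page bytes
    ]
    for bad in find_suspect_paths(entries):
        print(f"warning: path {bad!r} is not valid UTF-8", file=sys.stderr)
```

Note that a check like this can only flag paths that are definitely not
UTF-8. Pure ASCII names pass under any of these encodings, and some legacy
byte sequences may happen to be valid UTF-8 as well.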

However, without an actionable solution, the warning is not much help other
than to suggest that they downgrade to an older version of Git. Possible
solutions that we've discussed are:

  * Add a git-config setting to explicitly set the code-page, defaulted to
    UTF-8. With this, the error message could instruct them to set this
    config locally. This has the additional benefit of enabling Linux users
    to use these existing Windows repositories.

  * Create a conversion utility to rewrite all trees to use UTF-8. This is
    problematic for obvious reasons, even disregarding the fact that the
    result will be incompatible with the original repo - mainly that it may be
    non-trivial to detect which encoding the strings were originally written
    in. libicu (http://site.icu-project.org/) has code for this kind of
    encoding detection.
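
As a sketch of what both options boil down to, the Python below re-encodes a
raw path as UTF-8 given a configured legacy encoding. The `legacy_encoding`
parameter stands in for the proposed config setting (no such setting exists
yet, and the name is hypothetical), and the sample file name is invented.

```python
def path_to_utf8(raw: bytes, legacy_encoding: str = "utf-8") -> bytes:
    """Re-encode a raw tree path as UTF-8.

    Valid UTF-8 is returned unchanged. Anything else is decoded with the
    configured legacy encoding (a hypothetical per-repository setting) and
    raises UnicodeDecodeError if that encoding cannot decode it either.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    # Bytes as a client in a Japanese locale might have written them.
    raw = "\u30c6\u30b9\u30c8.txt".encode("shift_jis")

    # With the right setting, the original name is recovered.
    print(path_to_utf8(raw, "shift_jis").decode("utf-8"))

    # With the wrong setting, decoding still succeeds but yields mojibake.
    print(path_to_utf8(raw, "cp1252").decode("utf-8"))
```

The second call is the reason automatic detection is hard: the same bytes
decode without error under more than one legacy code page, so a converter
cannot tell from the bytes alone which reading was intended.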

-- 
Paul Betts <paul@paulbetts.org>
