git.vger.kernel.org archive mirror
* Path character encodings
@ 2012-03-05 21:26 Paul Betts
  2012-03-05 21:40 ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Betts @ 2012-03-05 21:26 UTC
  To: git

Hi guys,

As part of trying to fix the problems in MSysGit around tree encodings, I
would like to start a discussion on mitigating the backwards compatibility
problems associated with tree path encodings being unspecified.

## History

For those folks unfamiliar with the issue, I'll provide a quick refresher: Git
has traditionally not specified the string encoding of paths inside the tree
object. Whatever strings the OS returned from the readdir syscall were used
verbatim to write out tree objects. For most operating systems, this was UTF-8
(though even on certain POSIX OSes there are caveats around Unicode
normalization, such as on OS X).

However, on Windows until *very* recently (and on non-Unicode Linux locales),
the strings returned by the OS are from a locale-specific OEM Code Page (e.g.
Shift-JIS, Windows-1252, etc) and *not* Unicode. Repositories created this way
are currently misinterpreted on other OSes (or even on the same OS with a
different locale configured).
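
To make the failure mode concrete, here is a minimal Python sketch. The file
name and the two code pages are assumptions chosen for illustration; they are
not taken from any real repository.

```python
# Illustrative only: the file name and both code pages are assumptions.
name = "日本語.txt"

# Bytes that a Shift-JIS system would have written into the tree object.
raw = name.encode("shift_jis")

# A reader that assumes Windows-1252 sees a different, garbled name.
as_cp1252 = raw.decode("cp1252", errors="replace")

# A reader that assumes UTF-8 cannot decode the bytes at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(raw))
print(repr(as_cp1252))
print("valid UTF-8:", valid_utf8)
```

The tree object only stores `raw`; nothing in it records which of the three
interpretations the author intended.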

Note that *blob* (i.e. content) encoding is a separate issue and is
out-of-scope at the moment.

This will become a bigger problem in the near future, because MSysGit is
seeking to fix this mistake on Windows by explicitly writing all tree objects
in UTF-8. While this is great for new repositories, it will create a
compatibility problem: people who upgrade the Git installation on their
local machine will now have issues with their existing repos.

## Proposed Mitigation

For an initial mitigation plan, I'd like to propose adding a warning to either
git clone or git checkout, that if invalid UTF-8 strings are detected, a
warning is printed to the user.
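
As a rough sketch of what such a check could look like (written in Python for
brevity, not as a patch against Git), the function names below are
hypothetical and the warning text is only a placeholder:

```python
import sys


def find_non_utf8_paths(paths):
    """Return the raw path byte strings that are not valid UTF-8."""
    bad = []
    for raw in paths:
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


def warn_non_utf8(paths, stream=sys.stderr):
    """Print one warning per offending path; return how many were found."""
    bad = find_non_utf8_paths(paths)
    for raw in bad:
        # Placeholder wording, not an actual Git message.
        print(f"warning: path {raw!r} is not valid UTF-8", file=stream)
    return len(bad)


if __name__ == "__main__":
    sample = [b"README", "café.txt".encode("utf-8"), "café.txt".encode("cp1252")]
    warn_non_utf8(sample)
```

The raw path bytes could be obtained from, for example, the NUL-separated
output of `git ls-tree -r -z --name-only HEAD`.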

However, without an actionable solution, it's not much of a help other than to
suggest that they downgrade to a lower version of Git. Possible solutions that
we've discussed are:

  * Add a git-config setting to explicitly set the code-page, defaulted to
    UTF-8. With this, the error message could instruct them to set this
    config locally. This has the additional benefit of enabling Linux users
    to use these existing Windows repositories.

  * Create a conversion utility to rewrite all trees to use UTF-8. Even
    setting aside that the rewritten history will be incompatible with the
    original repo, this is problematic, mainly because it may be non-trivial
    to detect which encoding the strings were originally written in.
    libicu (http://site.icu-project.org/) has code for encoding detection.
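
Both options come down to the same per-path step: decode the stored bytes
with a known legacy encoding and re-encode them as UTF-8. A minimal Python
sketch follows, assuming the encoding is supplied by the user (the
`legacy_encoding` parameter is hypothetical and stands in for whatever config
setting would be chosen):

```python
def recode_path(raw: bytes, legacy_encoding: str) -> bytes:
    """Return the path as UTF-8 bytes.

    Paths that already decode as UTF-8 are returned unchanged; anything
    else is assumed to be in legacy_encoding (a user-supplied assumption).
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    old = "café.txt".encode("cp1252")
    print(recode_path(old, "cp1252"))
```

The "already valid UTF-8" shortcut is a heuristic: some legacy-encoded byte
strings also happen to be valid UTF-8, which is one reason detection is hard.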

-- 
Paul Betts <paul@paulbetts.org>


* Re: Path character encodings
  2012-03-05 21:26 Path character encodings Paul Betts
@ 2012-03-05 21:40 ` Junio C Hamano
  2012-03-05 22:02   ` Paul Betts
  0 siblings, 1 reply; 4+ messages in thread
From: Junio C Hamano @ 2012-03-05 21:40 UTC
  To: Paul Betts; +Cc: git

Paul Betts <paul@paulbetts.org> writes:

> ## Proposed Mitigation
>
> For an initial mitigation plan, I'd like to propose adding a warning to either
> git clone or git checkout, that if invalid UTF-8 strings are detected, a
> warning is printed to the user.
>
> However, without an actionable solution, it's not much of a help other than to
> suggest that they downgrade to a lower version of Git.

Hmph, I do not see a reason to make a huge molehill in this. The
pathnames are of unspecified encoding, and if a project declares
that they always use UTF-8, that would be great. Older history may
need to be rewritten but that is a given.

Wouldn't a flag day event per project that runs filter-branch and
has participants restart their repositories be sufficient?  Why
does git itself have to do anything about it, and how would it help
users without hurting other git users who are not involved in such a
project?


* Re: Path character encodings
  2012-03-05 21:40 ` Junio C Hamano
@ 2012-03-05 22:02   ` Paul Betts
  2012-03-05 22:18     ` Junio C Hamano
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Betts @ 2012-03-05 22:02 UTC
  To: Junio C Hamano; +Cc: git

Hi Junio,

On Mon, Mar 05, 2012 at 01:40:32PM -0800, Junio C Hamano wrote:
> Hmph, I do not see a reason to make a huge molehill in this. The
> pathnames are of unspecified encoding, and if a project declares
> that they always use UTF-8, that would be great. 

I would like to propose, that Git codifies-as-required the majority case
today, that trees should *only* be encoded in UTF-8 going forward. 

> Why does git itself have to do anything about it, and how would it help
> users without hurting other git users who are not involved in such a
> project?

My concern is that users will not know what's happening or how to fix it;
they'll just see messed up paths and assume Git is broken. Even a warning
message would give users something explicit to search for when looking for a
solution. Not many people will possess the Git-Fu required to manage
filter-branch.

-- 
Paul Betts <paul@paulbetts.org>


* Re: Path character encodings
  2012-03-05 22:02   ` Paul Betts
@ 2012-03-05 22:18     ` Junio C Hamano
  0 siblings, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2012-03-05 22:18 UTC
  To: Paul Betts; +Cc: git

Paul Betts <paul@paulbetts.org> writes:

> On Mon, Mar 05, 2012 at 01:40:32PM -0800, Junio C Hamano wrote:
>> Hmph, I do not see a reason to make a huge molehill in this. The
>> pathnames are of unspecified encoding, and if a project declares
>> that they always use UTF-8, that would be great. 
>
> I would like to propose, that Git codifies-as-required the majority case
> today, that trees should *only* be encoded in UTF-8 going forward. 

I am afraid that that would be a hard sell.

As the pathnames are uninterpreted strings, an older project that
has been using 8859-1 (or EUC for various locales, or anything that
is a superset of ASCII) has no good incentive or reason to follow such
a unilateral decision made outside their project, only to get their
participants inconvenienced.  Even giving a "warning" will trigger
"Yeah we know our paths are in latin-1 and everybody in our project
has happily been using that, thankyouverymuch!" and annoy them.

It is not an argument that their history can be re-coded to UTF-8
without losing information, if they still have to go through a
conversion process that their project does not benefit from.

Can we make this per-project opt-*in* somehow?

