git.vger.kernel.org archive mirror
* non-ascii filenames issue
@ 2009-04-05  9:36 Gregory Petrosyan
  2009-04-05  9:54 ` Teemu Likonen
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Petrosyan @ 2009-04-05  9:36 UTC (permalink / raw)
  To: git

gregory@home:~$ git --version
git version 1.6.2.2.404.ge96f3
gregory@home:~$ mkdir git-test
gregory@home:~$ cd git-test
gregory@home:~/git-test$ touch файл
gregory@home:~/git-test$ ls -a
.  ..  файл
gregory@home:~/git-test$ git init
Initialized empty Git repository in /home/gregory/git-test/.git/
gregory@home:~/git-test$ git add .
gregory@home:~/git-test$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#  new file:   "\321\204\320\260\320\271\320\273"
#
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 "файл" should be here instead

This is on Ubuntu Jaunty beta, with latest git built from source.
Please CC me, I am not subscribed.

	Gregory


* Re: non-ascii filenames issue
  2009-04-05  9:36 non-ascii filenames issue Gregory Petrosyan
@ 2009-04-05  9:54 ` Teemu Likonen
  2009-04-05 10:01   ` Gregory Petrosyan
  0 siblings, 1 reply; 11+ messages in thread
From: Teemu Likonen @ 2009-04-05  9:54 UTC (permalink / raw)
  To: Gregory Petrosyan; +Cc: git

On 2009-04-05 13:36 (+0400), Gregory Petrosyan wrote:

> # Changes to be committed:
> #   (use "git rm --cached <file>..." to unstage)
> #
> #  new file:   "\321\204\320\260\320\271\320\273"
> #
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                  "файл" should be here instead

It can be fixed with this command:

    git config --global core.quotepath false
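
To see what the quoted form in the status output encodes, here is a minimal
Python sketch. The helper name unquote_git_path is hypothetical, and it
handles only three-digit octal escapes plus a few simple backslash escapes,
which is an assumption about the input and not a full reimplementation of
git's quoting rules.

```python
import re

# Hypothetical helper: decode a C-style quoted path as printed by
# "git status" when core.quotepath is left at its default.
_OCTAL = re.compile(r"[0-7]{3}")
_SIMPLE = {"n": b"\n", "t": b"\t", '"': b'"', "\\": b"\\"}


def unquote_git_path(quoted: str) -> bytes:
    """Return the raw bytes of a path; unquoted paths pass through as UTF-8."""
    if not (len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"'):
        return quoted.encode("utf-8")
    body = quoted[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out += body[i].encode("utf-8")
            i += 1
        elif _OCTAL.fullmatch(body[i + 1:i + 4]):
            # \ooo is one byte written as three octal digits
            out.append(int(body[i + 1:i + 4], 8))
            i += 4
        else:
            # Simple escapes only; anything else raises KeyError or IndexError
            out += _SIMPLE[body[i + 1]]
            i += 2
    return bytes(out)


raw = unquote_git_path(r'"\321\204\320\260\320\271\320\273"')
print(raw.decode("utf-8"))
```

The escaped string is simply the eight UTF-8 bytes of the four-letter name;
decoding them as UTF-8 is itself an assumption, since git records only the
bytes.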


* Re: non-ascii filenames issue
  2009-04-05  9:54 ` Teemu Likonen
@ 2009-04-05 10:01   ` Gregory Petrosyan
  2009-04-05 10:51     ` John Tapsell
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Petrosyan @ 2009-04-05 10:01 UTC (permalink / raw)
  To: Teemu Likonen; +Cc: git

On Sun, Apr 05, 2009 at 12:54:28PM +0300, Teemu Likonen wrote:
> On 2009-04-05 13:36 (+0400), Gregory Petrosyan wrote:
> 
> > # Changes to be committed:
> > #   (use "git rm --cached <file>..." to unstage)
> > #
> > #  new file:   "\321\204\320\260\320\271\320\273"
> > #
> >                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                  "файл" should be here instead
> 
> It can be fixed with this command:
> 
>     git config --global core.quotepath false

Thanks! That works. Does it make sense to set it to "false" by default?

	Gregory


* Re: non-ascii filenames issue
  2009-04-05 10:01   ` Gregory Petrosyan
@ 2009-04-05 10:51     ` John Tapsell
  2009-04-05 16:23       ` Jay Soffian
  2009-04-06  7:28       ` Peter Krefting
  0 siblings, 2 replies; 11+ messages in thread
From: John Tapsell @ 2009-04-05 10:51 UTC (permalink / raw)
  To: Teemu Likonen, git

2009/4/5 Gregory Petrosyan <gregory.petrosyan@gmail.com>:
> On Sun, Apr 05, 2009 at 12:54:28PM +0300, Teemu Likonen wrote:
>> On 2009-04-05 13:36 (+0400), Gregory Petrosyan wrote:
>>
>> > # Changes to be committed:
>> > #   (use "git rm --cached <file>..." to unstage)
>> > #
>> > #  new file:   "\321\204\320\260\320\271\320\273"
>> > #
>> >                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >                  "файл" should be here instead
>>
>> It can be fixed with this command:
>>
>>     git config --global core.quotepath false
>
> Thanks! That works. Does it make sense to set it to "false" by default?

Unfortunately not, because for some absolutely crazy reason, there is
no way at all to tell what encoding the string is in.  It never
occurred to anyone that it might actually be useful to be able to read
the filename in an unambiguous way.  The result is this sort of mess.
Just wait until you try to check out that file on a new filesystem with
a different encoding.  Or try to check out that file on Windows.  It's
like git decided to step backwards 30 years.

John


* Re: non-ascii filenames issue
  2009-04-05 10:51     ` John Tapsell
@ 2009-04-05 16:23       ` Jay Soffian
  2009-04-05 19:29         ` Junio C Hamano
  2009-04-06  7:28       ` Peter Krefting
  1 sibling, 1 reply; 11+ messages in thread
From: Jay Soffian @ 2009-04-05 16:23 UTC (permalink / raw)
  To: John Tapsell; +Cc: Teemu Likonen, git

On Sun, Apr 5, 2009 at 6:51 AM, John Tapsell <johnflux@gmail.com> wrote:
> Unfortunately not, because for some absolutely crazy reason

Bzzt. http://article.gmane.org/gmane.comp.version-control.git/50830

And, as always, patches welcomed.

j.


* Re: non-ascii filenames issue
  2009-04-05 16:23       ` Jay Soffian
@ 2009-04-05 19:29         ` Junio C Hamano
  2009-04-05 20:22           ` Jay Soffian
  0 siblings, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2009-04-05 19:29 UTC (permalink / raw)
  To: Jay Soffian; +Cc: John Tapsell, Teemu Likonen, git

Jay Soffian <jaysoffian@gmail.com> writes:

> On Sun, Apr 5, 2009 at 6:51 AM, John Tapsell <johnflux@gmail.com> wrote:
>> Unfortunately not, because for some absolutely crazy reason
>
> Bzzt. http://article.gmane.org/gmane.comp.version-control.git/50830

I do not think the message gives enough information on the issue, as "a
pathname is a slash separated sequence of path components terminated with
a NUL, and a path component is an uninterpreted sequence of bytes
excluding NUL and slash" is simply a UNIX tradition the original git
design took as _given_, so the "some absolutely crazy reason" comment does
not even deserve refuting.
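
The definition quoted above can be sketched in a few lines of Python. This
is an illustration only, not git's code; split_pathname is a hypothetical
name, and the NUL terminator is assumed to have been stripped already.

```python
from typing import List


def split_pathname(path: bytes) -> List[bytes]:
    """Split a pathname into components without interpreting the bytes."""
    if b"\0" in path:
        # NUL terminates a pathname, so it can never occur inside one
        raise ValueError("NUL byte inside pathname")
    # Every byte other than NUL and slash is allowed in a component;
    # empty components from doubled slashes are dropped here.
    return [c for c in path.split(b"/") if c]


utf8_name = "файл".encode("utf-8")
components = split_pathname(b"docs/" + utf8_name)
print(components)
```

Nothing in this model records which encoding produced the bytes, so two
names are the same path only if their bytes are identical.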

There is _no_ reason, crazy or otherwise.  If you start from "a pathname
is an uninterpreted sequence of bytes" tradition, it is a design parameter
and "how things are", and you simply do not argue with them.  And the
message you quoted doesn't, either.

	Side note: I am not saying that we should not ever change that
	particular design parameter.  I am just explaining why 50830 is
	not a good counterargument to quote against the "some absolutely
	crazy reason" accusation.

> And, as always, patches welcomed.

Before patches, you need a sound design and justification.

At least you need to consider the following (the early ones are easier):

 - Do we unify them to some canonical encoding internally and do the
   matching in the canonical space?   What's the internal representation
   (presumably UTF-8)?

 - How should a user tell the pathname conversion rules between the
   internal representation and the filesystem representation to git?  A
   config variable per repository?

 - How should this interact with patch+apply dataflow (including "rebase"
   without -i/-m)?  Should pathnames in diffs be in canonical form?

 - How should this interact with case-challenged and/or Unicode-corrupting
   filesystems such as NTFS and HFS+ whose creat(), readdir(), and
   stat() contradict each other?

 - What should happen when the pathname in the canonical representation
   recorded in the history cannot be externalized on a particular
   filesystem?  Does it degrade gracefully and provide some escape hatch,
   and if so how?
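
The first and fourth questions can be made concrete with a small Python
sketch. It assumes NFC as the canonical form purely for illustration (the
list above does not choose one) and uses plain NFD as an approximation of
what a decomposing filesystem hands back; same_name is a hypothetical
helper.

```python
import unicodedata

name = "файл"

# Composed form, as typically typed, versus decomposed form, roughly what
# a decomposing filesystem may return from readdir().
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

print(len(nfc), len(nfd))                          # code point counts
print(nfc.encode("utf-8") == nfd.encode("utf-8"))  # bytewise comparison


def same_name(a: str, b: str) -> bool:
    """Compare two names in the (assumed) canonical space, NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(nfc, nfd))
```

The two spellings differ as byte sequences yet name the same file to a
human, which is why matching would have to happen in a canonical space.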


* Re: non-ascii filenames issue
  2009-04-05 19:29         ` Junio C Hamano
@ 2009-04-05 20:22           ` Jay Soffian
  0 siblings, 0 replies; 11+ messages in thread
From: Jay Soffian @ 2009-04-05 20:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: John Tapsell, Teemu Likonen, git

> I do not think the message gives enough information on the issue

Of course you are correct. I was perturbed by John's message, but your
thoughtful reply is much more beneficial than my silly link. Thank you
for providing the level-headed response as always.

j.


* Re: non-ascii filenames issue
  2009-04-05 10:51     ` John Tapsell
  2009-04-05 16:23       ` Jay Soffian
@ 2009-04-06  7:28       ` Peter Krefting
  2009-04-06  9:12         ` Johannes Schindelin
  2009-04-07  8:26         ` demerphq
  1 sibling, 2 replies; 11+ messages in thread
From: Peter Krefting @ 2009-04-06  7:28 UTC (permalink / raw)
  To: John Tapsell; +Cc: Teemu Likonen, git

John Tapsell:

> Unfortunately not, because for some absolutely crazy reason, there is no 
> way at all to tell what encoding the string is in.  It never occurred to 
> anyone that it might actually be useful to be able to read the filename in 
> an unambiguous way.

It comes from the Unix tradition, unfortunately, that file names are just a 
stream of bytes, instead of a stream of characters mapped to a byte 
sequence. The "stream of bytes" thing worked back when everyone used ASCII, 
but as soon as other character encodings were used (i.e., back in the 1970s or 
so), that assumption broke.

> The result is this sort of mess. Just wait until you try to checkout that 
> file on a new filesystem with a different encoding.  Or try to checkout 
> that file in Windows.  It's like git decided to step backwards 30 years.

Since most people on Linux nowadays probably are running in a UTF-8-based 
locale, I tried introducing some (very incomplete) patches for the Windows 
port to make this assumption, to allow Windows users to make use of 
non-ASCII file names (Windows uses Unicode strings for file names). Mac OS 
uses (semi-decomposed) UTF-8 strings, so it should also be able to make use 
of this.

Unfortunately, there seems to be quite some resistance towards deciding on 
a platform- and language-independent way of storing file names in Git, but 
rather just going the "Unix" way and making it someone else's problem. I find 
this a bit sad.


-- 
\\// Peter - http://www.softwolves.pp.se/


* Re: non-ascii filenames issue
  2009-04-06  7:28       ` Peter Krefting
@ 2009-04-06  9:12         ` Johannes Schindelin
  2009-04-06 22:33           ` Dmitry Potapov
  2009-04-07  8:26         ` demerphq
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2009-04-06  9:12 UTC (permalink / raw)
  To: Peter Krefting; +Cc: Teemu Likonen, git

Hi,

On Mon, 6 Apr 2009, Peter Krefting wrote:

> It comes from the Unix tradition, unfortunately, that file names are 
> just a stream of bytes, instead of a stream of characters mapped to a 
> byte sequence.

How is that different from .txt not having a defined locale?

Really, please, do not add to the non-information.

> Since most people on Linux nowadays probably are running in a 
> UTF-8-based locale, I tried introducing some (very incomplete) patches 
> for the Windows port to make this assumption, to allow Windows users to 
> make use of non-ASCII file names (Windows uses Unicode strings for file 
> names). Mac OS uses (semi-decomposed) UTF-8 strings, so it should also 
> be able to make use of this.

Most Russian programmers I know do not run in a UTF-8 locale.

> Unfortunately, there seems to be quite some resistance towards deciding 
> on a platform- and language-independent way of storing file names in 
> Git, but rather just going the "Unix" way and making it someone else's 
> problem. I find this a bit sad.

I find it a bit unfair that you say that, after many people participated 
in that very informative thread, and after I tried to work with you 
personally on getting the stuff into 4msysgit.git.

Actually, more than just a bit.

Ciao,
Dscho


* Re: non-ascii filenames issue
  2009-04-06  9:12         ` Johannes Schindelin
@ 2009-04-06 22:33           ` Dmitry Potapov
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Potapov @ 2009-04-06 22:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Peter Krefting, Teemu Likonen, git

On Mon, Apr 06, 2009 at 11:12:35AM +0200, Johannes Schindelin wrote:
>
> Most Russian programmers I know do not run in a UTF-8 locale.

Actually, on Linux, people are gradually switching to UTF-8 from koi8-r,
but on Windows MSCRT does not support UTF-8, so you do not have much choice
there but to use Windows-1251. BTW, the upcoming Cygwin 1.7 is going to
have UTF-8 as the default locale. So, IMHO, UTF-8 is the only reasonable
choice for internal file name representation...
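
A short Python sketch of why the locale matters, using the file name from
the start of this thread. The codec names koi8_r and cp1251 are Python's
names for KOI8-R and Windows-1251.

```python
name = "файл"

# The same four letters under the three encodings mentioned above
encodings = ("utf-8", "koi8_r", "cp1251")
encoded = {enc: name.encode(enc) for enc in encodings}

for enc, raw in encoded.items():
    print(f"{enc:8} {len(raw)} bytes: {raw.hex(' ')}")

# Bytes written under a UTF-8 locale, then read under a KOI8-R locale:
# decoding succeeds, but the result is a different, unreadable string.
misread = encoded["utf-8"].decode("koi8_r")
print(misread)
```

Each locale produces a different byte sequence for the same name, and a
mismatched reader gets garbage silently instead of an error.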

Dmitry


* Re: non-ascii filenames issue
  2009-04-06  7:28       ` Peter Krefting
  2009-04-06  9:12         ` Johannes Schindelin
@ 2009-04-07  8:26         ` demerphq
  1 sibling, 0 replies; 11+ messages in thread
From: demerphq @ 2009-04-07  8:26 UTC (permalink / raw)
  To: Peter Krefting; +Cc: John Tapsell, Git

2009/4/6 Peter Krefting <peter@softwolves.pp.se>:
> John Tapsell:
>
>> Unfortunately not, because for some absolutely crazy reason, there is no
>> way at all to tell what encoding the string is in.  It never occurred to
>> anyone that it might actually be useful to be able to read the filename in
>> an unambiguous way.
>
> It comes from the Unix tradition, unfortunately, that file names are just a
> stream of bytes, instead of a stream of characters mapped to a byte
> sequence. The "stream of bytes" thing worked back when everyone used ASCII,
> but as soon as other character encodings were used (i.e., back in the 1970s or
> so), that assumption broke.

Those interested in this subject may find the following document on
the creation of UTF-8 interesting.

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

