* git log and utf-u in filenames
@ 2008-09-25 21:50 Joey Hess
2008-09-25 22:43 ` Joey Hess
2008-09-25 23:11 ` Jakub Narebski
0 siblings, 2 replies; 8+ messages in thread
From: Joey Hess @ 2008-09-25 21:50 UTC (permalink / raw)
To: Git Mailing List
Git, particularly git-log, seems not to display utf-8 characters in filenames,
instead showing an escaped representation. On the other hand, commit messages,
as git-log(1) notes, are assumed to be utf-8, and the same utf-8 character
used in a commit message is not escaped, and displays ok.
Can anyone point me at the documentation for this utf-8 filename escaping,
assuming it's not a bug? And did earlier versions of git (circa 2006) perhaps
not do that escaping? I have code in ikiwiki that apparently used to work, but
is certainly not working with current git, due to this escaping.
Here's an example of the inconsistent handling of the same utf-8 character
("ö") in commit messages and filenames.
joey@kodama:~/tmp>mkdir utf8; cd utf8; git-init
Initialized empty Git repository in /home/joey/tmp/utf8/.git/
joey@kodama:~/tmp/utf8>echo hi > ö
joey@kodama:~/tmp/utf8>git add ö; git commit -m 'adding file: ö'
Created initial commit ee7d809: adding file: ö
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 100644 "\303\266"
joey@kodama:~/tmp/utf8>git log --stat
commit ee7d809d1811b1e1ad485ce3e7274316257029ae
Author: Joey Hess <joey@kodama.kitenet.net>
Date: Thu Sep 25 17:34:10 2008 -0400
adding file: ö
"\303\266" | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
--
see shy jo
* Re: git log and utf-u in filenames
From: Joey Hess @ 2008-09-25 22:43 UTC (permalink / raw)
To: Git Mailing List
Joey Hess wrote:
> And did earlier versions of git (circa 2006) perhaps
> not do that escaping? I have code in ikiwiki that apparently used to work, but
> is certainly not working with current git, due to this escaping.
No, I guess it's always done that, perhaps something broke on my side
in the meantime.
But it doesn't seem right somehow that gitweb, ikiwiki, and seemingly
any other program that needs to look at git log / commits and figure out
what filename is being changed needs to include their own nasty code[1] to
convert the escaped characters back to normal characters.
And it seems that anyone who uses a lot of utf-8 in filenames would shortly
get tired of git commit, git log, etc displaying obfuscated versions of their
filenames.
I'm sure it makes sense to use this format internally in git to represent
filenames, to avoid needing to worry about encoding issues. But it's a shame
that that internal detail is exposed so that everything around git has to
worry about it.
Would making git-log and git-commit display de-escaped filenames be likely
to break something?
--
see shy jo
[1] Such as this from gitweb:
# git may return quoted and escaped filenames
sub unquote {
	my $str = shift;

	sub unq {
		my $seq = shift;
		my %es = ( # character escape codes, aka escape sequences
			't' => "\t",   # tab (HT, TAB)
			'n' => "\n",   # newline (NL)
			'r' => "\r",   # return (CR)
			'f' => "\f",   # form feed (FF)
			'b' => "\b",   # backspace (BS)
			'a' => "\a",   # alarm (bell) (BEL)
			'e' => "\e",   # escape (ESC)
			'v' => "\013", # vertical tab (VT)
		);

		if ($seq =~ m/^[0-7]{1,3}$/) {
			# octal char sequence
			return chr(oct($seq));
		} elsif (exists $es{$seq}) {
			# C escape sequence, aka character escape code
			return $es{$seq};
		}
		# quoted ordinary character
		return $seq;
	}

	if ($str =~ m/^"(.*)"$/) {
		# needs unquoting
		$str = $1;
		$str =~ s/\\([^0-7]|[0-7]{1,3})/unq($1)/eg;
	}
	return $str;
}
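For readers who do not use Perl, here is a rough Python counterpart of the gitweb helper quoted above. It is only a sketch: the name `unquote_path` is made up for this illustration, and it assumes the input is a single path field exactly as git printed it, as bytes. Unlike the Perl version, it collects the decoded bytes first so the caller can decode them as utf-8 in one step.

```python
import re

# Single-letter C escapes that git may emit inside a quoted path.
_C_ESCAPES = {
    b"a": b"\a", b"b": b"\b", b"t": b"\t", b"n": b"\n",
    b"v": b"\v", b"f": b"\f", b"r": b"\r",
}

# A backslash followed by either 1-3 octal digits or any single byte.
_ESCAPE_RE = re.compile(rb"\\(?:([0-7]{1,3})|(.))", re.DOTALL)


def unquote_path(field: bytes) -> bytes:
    """Undo C-style quoting of one path field (hypothetical helper)."""
    if len(field) < 2 or not (field.startswith(b'"') and field.endswith(b'"')):
        return field  # not quoted, nothing to do

    def replace(match):
        octal, char = match.group(1), match.group(2)
        if octal is not None:
            # Octal byte sequence such as \303
            return bytes([int(octal, 8)])
        # Known C escape, or an escaped ordinary character such as \" or \\
        return _C_ESCAPES.get(char, char)

    return _ESCAPE_RE.sub(replace, field[1:-1])


# The quoted name from the transcript decodes back to the original filename.
print(unquote_path(b'"\\303\\266"').decode("utf-8"))
```

Working on bytes rather than text avoids guessing the filename encoding until the very end, which matters because git itself treats paths as byte strings.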
* Re: git log and utf-u in filenames
From: Jakub Narebski @ 2008-09-25 23:11 UTC (permalink / raw)
To: Joey Hess; +Cc: Git Mailing List
Joey Hess <joey@kitenet.net> writes:
> Git, particularly git-log seems to not display utf-8 characters in filenames,
> instead showing an escaped representation. On the other hand, commit messages,
> as git-log(1) notes, are assumed to be utf-8, and the same utf-8 character
> used in a commit message is not escaped, and displays ok.
>
> Can anyone point me at the documentation for this utf-8 filename
> escaping, assuming it's not a bug? And did earlier versions of git
> (circa 2006) perhaps not do that escaping? I have code in ikiwiki
> that apparently used to work, but is certainly not working with
> current git, due to this escaping.
Err... it always worked like this, mainly, I think, to keep patches 7-bit
safe for sending via email. Now, in the time of 8-bit transfer and a single
utf-8 encoding instead of a multitude of different filesystem encodings,
you can set core.quotepath to false, although this eliminates only the
octal escaping of bytes above 127 (i.e. non-ASCII characters);
TAB, CR etc. would still be quoted (and they have to be).
gitconfig(7):
core.quotepath::
The commands that output paths (e.g. 'ls-files',
'diff'), when not given the `-z` option, will quote
"unusual" characters in the pathname by enclosing the
pathname in a double-quote pair and with backslashes the
same way strings in C source code are quoted. If this
variable is set to false, the bytes higher than 0x80 are
not quoted but output as verbatim. Note that double
quote, backslash and control characters are always
quoted without `-z` regardless of the setting of this
variable.
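To make the rule in the quoted documentation concrete, here is a small Python sketch of the quoting it describes. This is an illustration under stated assumptions, not git's actual implementation: `quote_path` and its `quotepath` parameter are hypothetical names, and the handling of control characters follows the usual C escapes with octal as a fallback.

```python
# Characters that are always quoted, regardless of core.quotepath.
C_ESCAPES = {
    0x07: b"\\a", 0x08: b"\\b", 0x09: b"\\t", 0x0A: b"\\n",
    0x0B: b"\\v", 0x0C: b"\\f", 0x0D: b"\\r",
    0x22: b'\\"', 0x5C: b"\\\\",
}


def quote_path(path: bytes, quotepath: bool = True) -> bytes:
    """Approximate how a path is shown without -z (hypothetical helper)."""
    out = bytearray()
    needs_quotes = False
    for byte in path:
        if byte in C_ESCAPES:
            # Double quote, backslash and common control characters
            out += C_ESCAPES[byte]
            needs_quotes = True
        elif byte < 0x20 or byte == 0x7F or (byte >= 0x80 and quotepath):
            # Other control bytes always; high bytes only when quotepath is on
            out += b"\\%03o" % byte
            needs_quotes = True
        else:
            out.append(byte)
    if needs_quotes:
        return b'"' + bytes(out) + b'"'
    return bytes(out)


name = "ö".encode("utf-8")
print(quote_path(name))                   # quoted form, as in the transcript
print(quote_path(name, quotepath=False))  # raw utf-8 bytes, left alone
```

With the setting at its default the sketch reproduces the `"\303\266"` seen in the original report; with it off, only names containing control characters, double quotes or backslashes still come out quoted.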
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: git log and utf-u in filenames
From: Jakub Narebski @ 2008-09-25 23:15 UTC (permalink / raw)
To: Joey Hess; +Cc: Git Mailing List
Joey Hess <joey@kitenet.net> writes:
> Joey Hess wrote:
> > And did earlier versions of git (circa 2006) perhaps not do that
> > escaping? I have code in ikiwiki that apparently used to work, but
> > is certainly not working with current git, due to this escaping.
>
> No, I guess it's always done that, perhaps something broke on my side
> in the meantime.
>
> But it doesn't seem right somehow that gitweb, ikiwiki, and seemingly
> any other program that needs to look at git log / commits and figure out
> what filename is being changed needs to include their own nasty code[1] to
> convert the escaped characters back to normal characters.
Well, in gitweb we could use the '-z' option for git-diff-tree and git-ls-tree,
but it has its disadvantages, like having to actually parse the output record by
record instead of simply splitting it on end of line ("\n") characters.
> Would making git-log and git-commit display de-escaped filenames be likely
> to break something?
core.quotepath limits filename escaping, but you still _have_ to quote
"\n", "\t", and of course '"' and '\', if you want each filename to fit
on a single line in text format.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: git log and utf-u in filenames
From: Alex Riesen @ 2008-09-26 6:33 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Joey Hess, Git Mailing List
Jakub Narebski, Fri, Sep 26, 2008 01:15:58 +0200:
>
> Well, in gitweb we could use '-z' option for git-diff-tree and git-ls-tree,
> but it has its disadvantages, like having to do actual parsing record after
> record instead of simply splitting output on end of line ("\n") characters.
>
How about simply splitting the output on NUL ("\0") characters instead?
The "\n" (NL) you refer to is just as much an end-of-record (EOR) marker as NUL.
* Re: git log and utf-u in filenames
From: Jakub Narebski @ 2008-09-26 7:31 UTC (permalink / raw)
To: Alex Riesen; +Cc: Joey Hess, Git Mailing List
On Fri, 26 Sep 2008, Alex Riesen wrote:
> Jakub Narebski, Fri, Sep 26, 2008 01:15:58 +0200:
> >
> > Well, in gitweb we could use '-z' option for git-diff-tree and git-ls-tree,
> > but it has its disadvantages, like having to do actual parsing record after
> > record instead of simply splitting output on end of line ("\n") characters.
> >
>
> How about simply splitting output on end of line ("\0" NUL) characters?
> The "\n" NL you refer to is just as EOR as NUL.
Doesn't work for "git diff-tree -z [...]" output. When a rename or copy
is detected, NUL is also used as a separator between fields (between the
unquoted source and destination filenames), not only between records:
git diff-tree
.... <src qfilename> TAB <dst qfilename> LF
git diff-tree -z
.... <src filename> NUL <dst filename> NUL
--
Jakub Narebski
Poland
* Re: git log and utf-u in filenames
From: Alex Riesen @ 2008-09-26 13:49 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Joey Hess, Git Mailing List
2008/9/26 Jakub Narebski <jnareb@gmail.com>:
> On Fri, 26 Sep 2008, Alex Riesen wrote:
>> Jakub Narebski, Fri, Sep 26, 2008 01:15:58 +0200:
>> >
>> > Well, in gitweb we could use '-z' option for git-diff-tree and git-ls-tree,
>> > but it has its disadvantages, like having to do actual parsing record after
>> > record instead of simply splitting output on end of line ("\n") characters.
>> >
>>
>> How about simply splitting output on end of line ("\0" NUL) characters?
>> The "\n" NL you refer to is just as EOR as NUL.
>
> Doesn't work for "git diff-tree -z [...]" output. When there is rename
> or copy detected, NUL is used as separator between fields (between
> source and destination unquoted filename), not only between records:
>
> git diff-tree
> .... <src qfilename> TAB <dst qfilename> LF
>
> git diff-tree -z
> .... <src filename> NUL <dst filename> NUL
>
You still have the status marker (Rnnn) in the record that precedes
<src filename>, so you can treat the next field accordingly. Still a split,
just a bit more careful handling of the resulting list/array.
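The split-then-inspect approach suggested here could look roughly like the Python sketch below. It is a minimal sketch under stated assumptions: `RawChange` and `parse_raw_z` are names invented for this illustration, and the input is assumed to contain only raw-format records from `git diff-tree -z` (each starting with `:`); anything else, such as a leading commit id, is simply skipped.

```python
from typing import Iterator, NamedTuple, Optional


class RawChange(NamedTuple):
    status: str           # e.g. "M", "A", "D", "R100", "C75"
    src: bytes            # unquoted source path
    dst: Optional[bytes]  # destination path, only for renames and copies


def parse_raw_z(data: bytes) -> Iterator[RawChange]:
    """Walk NUL-separated fields of raw -z output (hypothetical helper)."""
    fields = iter(data.split(b"\0"))
    for meta in fields:
        if not meta.startswith(b":"):
            continue  # trailing empty field, or a token that is not a record
        # The status letter (plus optional score) is the last item of the
        # space-separated metadata part.
        status = meta.split(b" ")[-1].decode("ascii")
        src = next(fields)
        # Renames and copies carry one extra NUL-terminated field.
        dst = next(fields) if status[0] in "RC" else None
        yield RawChange(status, src, dst)


sample = (
    b":100644 100644 aaaaaaa bbbbbbb M\0" + "ö".encode("utf-8") + b"\0"
    b":100644 100644 ccccccc ddddddd R100\0old name\0new\tname\0"
)
for change in parse_raw_z(sample):
    print(change)
```

Because the paths are consumed right after their metadata field, a filename that happens to start with `:` or contain a tab or newline is never mistaken for the start of a new record.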
* Re: git log and utf-u in filenames
From: Jakub Narebski @ 2008-09-27 8:37 UTC (permalink / raw)
To: Alex Riesen; +Cc: Joey Hess, Git Mailing List
On Fri, 26 Sep 2008, Alex Riesen wrote:
> 2008/9/26 Jakub Narebski <jnareb@gmail.com>:
>>> How about simply splitting output on end of line ("\0" NUL) characters?
>>> The "\n" NL you refer to is just as EOR as NUL.
>>
>> Doesn't work for "git diff-tree -z [...]" output. When there is rename
>> or copy detected, NUL is used as separator between fields (between
>> source and destination unquoted filename), not only between records:
>>
>> git diff-tree
>> .... <src qfilename> TAB <dst qfilename> LF
>>
>> git diff-tree -z
>> .... <src filename> NUL <dst filename> NUL
>>
>
> You still have the marker (Rnnn) from pre-<src filename> record and
> can treat the next record correspondingly. Still a split, just a bit more
> careful handling of the resulting list/array.
Currently gitweb does something like this:
open $fd, "-|", git_cmd(), "diff-tree", '-r', ...
@difftree = <$fd>;
close $fd;
foreach my $line (@difftree) {
...
}
If gitweb used git-diff-tree with the '-z' option, the above code
would get more complicated, offsetting the simplification of not needing
unquote() (which is already written).
--
Jakub Narebski
Poland