[PATCH] gitweb: handle non UTF-8 text


* [PATCH] gitweb: handle non UTF-8 text
@ 2007-05-28 20:47 Martin Koegler
  2007-05-28 23:21 ` Petr Baudis
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Koegler @ 2007-05-28 20:47 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Martin Koegler

gitweb assumes that everything is in UTF-8. If a text contains invalid
UTF-8 character sequences, the text must be in a different encoding.

This patch interprets such a text as latin1.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
---
For correct UTF-8, the patch does not change anything.

If commit/blob/... is not in UTF-8, it displays the text
correctly with a very high probability.

As git itself is not aware of any encoding, I know of no better
way to handle non UTF-8 text in gitweb.

 gitweb/gitweb.perl |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index c3921cb..dfd564d 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -621,12 +621,19 @@ sub esc_url {
 	return $str;
 }
 
+sub my_decode_utf8 {
+	my $str = shift;
+	my $res;
+	eval { $res = decode_utf8 ($str, 1); };
+	return $res || decode('latin1', $str);
+}
+
 # replace invalid utf8 character with SUBSTITUTION sequence
 sub esc_html ($;%) {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = my_decode_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -640,7 +647,7 @@ sub esc_path {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = my_decode_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -925,7 +932,7 @@ sub format_subject_html {
 
 	if (length($short) < length($long)) {
 		return $cgi->a({-href => $href, -class => "list subject",
-		                -title => decode_utf8($long)},
+		                -title => my_decode_utf8($long)},
 		       esc_html($short) . $extra);
 	} else {
 		return $cgi->a({-href => $href, -class => "list subject"},
@@ -1239,7 +1246,7 @@ sub git_get_projects_list {
 			if (check_export_ok("$projectroot/$path")) {
 				my $pr = {
 					path => $path,
-					owner => decode_utf8($owner),
+					owner => my_decode_utf8($owner),
 				};
 				push @list, $pr;
 				(my $forks_path = $path) =~ s/\.git$//;
@@ -1269,7 +1276,7 @@ sub git_get_project_owner {
 			$pr = unescape($pr);
 			$ow = unescape($ow);
 			if ($pr eq $project) {
-				$owner = decode_utf8($ow);
+				$owner = my_decode_utf8($ow);
 				last;
 			}
 		}
@@ -1759,7 +1766,7 @@ sub get_file_owner {
 	}
 	my $owner = $gcos;
 	$owner =~ s/[,;].*$//;
-	return decode_utf8($owner);
+	return my_decode_utf8($owner);
 }
 
 ## ......................................................................
@@ -1842,7 +1849,7 @@ sub git_header_html {
 
 	my $title = "$site_name";
 	if (defined $project) {
-		$title .= " - " . decode_utf8($project);
+		$title .= " - " . my_decode_utf8($project);
 		if (defined $action) {
 			$title .= "/$action";
 			if (defined $file_name) {
@@ -2116,7 +2123,7 @@ sub git_print_page_path {
 
 	print "<div class=\"page_path\">";
 	print $cgi->a({-href => href(action=>"tree", hash_base=>$hb),
-	              -title => 'tree root'}, decode_utf8("[$project]"));
+	              -title => 'tree root'}, my_decode_utf8("[$project]"));
 	print " / ";
 	if (defined $name) {
 		my @dirname = split '/', $name;
@@ -2936,7 +2943,7 @@ sub git_project_list_body {
 		($pr->{'age'}, $pr->{'age_string'}) = @aa;
 		if (!defined $pr->{'descr'}) {
 			my $descr = git_get_project_description($pr->{'path'}) || "";
-			$pr->{'descr_long'} = decode_utf8($descr);
+			$pr->{'descr_long'} = my_decode_utf8($descr);
 			$pr->{'descr'} = chop_str($descr, 25, 5);
 		}
 		if (!defined $pr->{'owner'}) {
@@ -3981,7 +3988,7 @@ sub git_snapshot {
 	my $git = git_cmd_str();
 	my $name = $project;
 	$name =~ s/\047/\047\\\047\047/g;
-	my $filename = decode_utf8(basename($project));
+	my $filename = my_decode_utf8(basename($project));
 	my $cmd;
 	if ($suffix eq 'zip') {
 		$filename .= "-$hash.$suffix";
-- 
1.5.2.846.g9a144


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-05-28 20:47 [PATCH] gitweb: handle non UTF-8 text Martin Koegler
@ 2007-05-28 23:21 ` Petr Baudis
  2007-05-29  9:21   ` Jakub Narebski
  0 siblings, 1 reply; 9+ messages in thread
From: Petr Baudis @ 2007-05-28 23:21 UTC (permalink / raw)
  To: Martin Koegler; +Cc: Jakub Narebski, git

On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
> gitweb assumes, that everything is in UTF-8. If a text contains invalid
> UTF-8 character sequences, the text must be in a different encoding.
> 
> This patch interprets such a text as latin1.
> 
> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> ---
> For correct UTF-8, the patch does not change anything.
> 
> If commit/blob/... is not in UTF-8, it displays the text
> with a very high probability correct. 
> 
> As git itself is not aware of any encoding, I know no better
> possibility to handle non UTF-8 text in gitweb.

I don't think this is a reasonable approach; I actually dispute the high
probability - in western Europe it's natural to assume latin1, but do the
majority of users of non-ascii characters come from there? Or rather
from central Europe (like me, Petr Baudiš? ;-))? Somewhere else?

If we do something like this, we should do it properly and look at
configured i18n.commitEncoding for the project. (But as config lookup
may be expensive, probably do it only when we need it.)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
		-- Samuel Beckett


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-05-28 23:21 ` Petr Baudis
@ 2007-05-29  9:21   ` Jakub Narebski
  2007-05-29 21:55     ` Martin Koegler
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2007-05-29  9:21 UTC (permalink / raw)
  To: Petr Baudis, Martin Koegler; +Cc: git, Martin Langhoff, Martyn Smith

[Cc: authors of git-cvsserver]

On Tue, 29 May 2007, Petr Baudis wrote:
> On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:

>> gitweb assumes, that everything is in UTF-8. If a text contains invalid
>> UTF-8 character sequences, the text must be in a different encoding.

But it doesn't tell us _what_ the encoding is. For commit messages,
with a reasonably new git, we have the 'encoding' header if git knew
that the commit message was not in utf-8.

By the way, I wonder why we don't have such a header for tag objects
(i18n.tagEncoding ;-)...

>> This patch interprets such a text as latin1.

Meaning that it tries to recode text from latin1 (iso-8859-1) to utf-8
(not changing gitweb output encoding, which is utf-8).

It would be much better, and much easier, at least for commit messages,
to add --encoding=utf-8 to the git-rev-list / git-log invocation.
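
A minimal sketch of what such an invocation could look like from Perl,
assuming git is on PATH and was built with iconv support. The helper
names git_log_cmd and read_commit_subjects are hypothetical and are not
gitweb subroutines; only the --encoding option itself comes from the
suggestion above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of gitweb): build the argument list for
# a git-log call that asks git itself to recode commit messages.
sub git_log_cmd {
	my ($git_dir, $rev, $max_count) = @_;
	return ('git', "--git-dir=$git_dir", 'log',
	        '--encoding=utf-8',          # let git recode the messages
	        "--max-count=$max_count",
	        '--pretty=format:%H %s',
	        $rev, '--');
}

# Run the command and return its output lines as raw octets; they are
# expected to be UTF-8 only if git could actually recode them.
sub read_commit_subjects {
	my @cmd = git_log_cmd(@_);
	open my $fd, '-|', @cmd
		or die "cannot run git log: $!";
	my @lines = <$fd>;
	close $fd;
	chomp @lines;
	return @lines;
}
```

This only covers commit messages; tags, blobs and pathnames would still
need some other mechanism.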

>> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
>> ---
>> For correct UTF-8, the patch does not change anything.
>> 
>> If commit/blob/... is not in UTF-8, it displays the text
>> with a very high probability correct. 

It is commit (with its 'encoding' header, and `--encoding' option
we can use instead of doing it in gitweb, provided that git was
compiled with iconv support), tag (similar to commit, but IIRC
without 'encoding' header, and `--encoding' option), blob (with
no place to store encoding) and pathname in tree (which can be
different from blob encoding).

And I very much doubt this "very high probability to be
correct".

>> As git itself is not aware of any encoding, I know no better
>> possibility to handle non UTF-8 text in gitweb.
> 
> I don't think this is a reasonable approach; I actually dispute the high
> probability - in western Europe it's obvious to assume latin1, but does
> majority of users using non-ascii characters come from there? Or rather
> from central Europe (like me, Petr Baudiš? ;-))? Somewhere else?

I also don't think that hardcoding latin1 (iso-8859-1) as default
alternate encoding is a good idea. I don't think using iso-8859-1
(outside us-ascii) is _nowadays_ that common. On the other hand I think
that not all users of koi8r, eucjp or iso-2022-jp converted (and can
convert) to utf-8; latin1 users can.

And using latin1 (or another encoding) _only_ when there is an invalid
utf-8 sequence is not a good idea either; I think that there are some
latin1 sequences outside us-ascii which are valid utf-8 sequences. That
kind of magic is wrong, wrong, wrong...
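
One such sequence can be checked with the Encode module. This is only an
illustrative sketch; is_valid_utf8 is a hypothetical helper name, and
the two-character sample text is chosen for the demonstration rather
than taken from any real commit.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return true if the octets form a well-formed UTF-8 sequence.
# LEAVE_SRC keeps decode() from modifying its argument in place.
sub is_valid_utf8 {
	my $octets = shift;
	my $ok = eval {
		decode('UTF-8', $octets,
		       Encode::FB_CROAK() | Encode::LEAVE_SRC());
		1;
	};
	return $ok ? 1 : 0;
}

# The two latin1 characters U+00C3 U+00A9 are stored as bytes 0xC3 0xA9 ...
my $latin1_text = encode('iso-8859-1', "\x{C3}\x{A9}");
# ... which is also exactly the UTF-8 encoding of U+00E9 (e with acute).
my $as_utf8 = decode('UTF-8', $latin1_text,
                     Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "valid UTF-8: %s, decoded to U+%04X\n",
	(is_valid_utf8($latin1_text) ? 'yes' : 'no'), ord($as_utf8);
```

A validity test alone therefore cannot tell these two readings apart.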

> If we do something like this, we should do it properly and look at
> configured i18n.commitEncoding for the project. (But as config lookup
> may be expensive, probably do it only when we need it.)

I think it would be best to make it into %feature, overridable
or not (which would look at i18n.commitEncoding instead of at
gitweb.commitEncoding, but still a feature).

About config lookup: we can either "borrow" the config reading code in Perl
from git-cvsserver, perhaps by putting it into Git.pm, or we can
at last implement core git support for dumping the whole config in an
unambiguous, machine-parseable format: "git config --dump", e.g.
  key <LF> value <NUL>
or
  key <NUL>
(the second for "boolean" variables without set value).
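
A sketch of how gitweb could consume such a dump, assuming the output
has already been read into a string. parse_config_dump is a hypothetical
name, and "git config --dump" is only the option proposed above, so the
sample input is hand-written.

```perl
use strict;
use warnings;

# Parse "key <LF> value <NUL>" / "key <NUL>" records into a hash of
# array refs (a key may occur more than once). A key without a value
# is stored as undef, which a caller could treat as boolean true.
sub parse_config_dump {
	my $dump = shift;
	my %config;
	foreach my $record (split /\0/, $dump) {
		next if $record eq '';
		# only the first LF separates key from value;
		# the value itself may contain further newlines
		my ($key, $value) = split /\n/, $record, 2;
		push @{ $config{$key} }, $value;
	}
	return %config;
}

my %config = parse_config_dump(
	"i18n.commitencoding\niso-8859-2\0core.bare\0");
print "commit encoding: $config{'i18n.commitencoding'}[0]\n";
```

Because NUL cannot appear in a key or a value, no quoting rules are
needed on the reading side.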

Having an alternate (read-only) config parser has its advantages and
disadvantages. The advantage is that we avoid fork+exec (performance),
and having two implementations is always good for getting the format
standardized. The disadvantage is that it is yet more code to maintain,
and that config parsing (even read-only config parsing) is a bit tricky
with the current git config file format.

-- 
Jakub Narebski
Poland


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-05-29  9:21   ` Jakub Narebski
@ 2007-05-29 21:55     ` Martin Koegler
  2007-05-30 20:18       ` Robin Rosenberg
  2007-06-01 21:05       ` Jakub Narebski
  0 siblings, 2 replies; 9+ messages in thread
From: Martin Koegler @ 2007-05-29 21:55 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Petr Baudis, git, Martin Langhoff, Martyn Smith

On Tue, May 29, 2007 at 11:21:11AM +0200, Jakub Narebski wrote:
> On Tue, 29 May 2007, Petr Baudis wrote:
> > On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
> 
> >> gitweb assumes, that everything is in UTF-8. If a text contains invalid
> >> UTF-8 character sequences, the text must be in a different encoding.
> 
> But it doesn't tell us _what_ is the encoding. For commit messages,
> with reasonable new git, we have 'encoding' header if git known that
> commit message was not in utf-8.
> 
> By the way, I winder why we don't have such header for tag objects
> (i18n.tagEncoding ;-)...

Why do I need to set i18n.commitEncoding on a normal Linux system?  We
have a locale, which contains this information. Using it would make it
more likely that commits can be read correctly later if somebody
forgets to set "i18n.commitEncoding" in a repository.

> >> This patch interprets such a text as latin1.
> 
> Meaning that it tries to recode text from latin1 (iso-8859-1) to utf-8
> (not changing gitweb output encoding, which is utf-8).
> 
> It would be much better, and much easier at least for commit message
> to add --encoding=utf-8 to git-rev-list / git-log invocation.

It does not help for old commits, where the encoding was not specified
correctly. If my research is correct, the encoding handling was
introduced at the end of 2006 and released this February.

> >> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
> >> ---
> >> For correct UTF-8, the patch does not change anything.
> >> 
> >> If commit/blob/... is not in UTF-8, it displays the text
> >> with a very high probability correct. 
> 
> It is commit (with its 'encoding' header, and `--encoding' option
> we can use instead of doing it in gitweb, provided that git was
> compiled with iconv support), tag (similar to commit, but IIRC
> without 'encoding' header, and `--encoding' option), blob (with
> no place to store encoding) and pathname in tree (which can be
> different from blob encoding).
> 
> And I doubt very much about this "very high probability to be
> correct".

For normal text, this should be true:

We can divide ISO-8859-1 into some groups:
a) 0x00-0x7f: shared with UTF-8
b) 0x80-0xBF: continuation characters in UTF-8 (0x80-0x9F are control characters/unused)
c) 0xC0-0xDF: start of a two byte UTF-8 character
d) 0xE0-0xEF: start of a three byte UTF-8 character
e) 0xF0-0xFF: start of other longer UTF-8 sequences

To misinterpret an ISO-8859-1 text as UTF-8, each character of class
c/d/e must be followed by the correct number of characters of class b.

Characters of class b are "special characters"; characters of class
c/d/e are mostly special letters. As "special characters" are normally not
part of a word (at least in German), any occurrence of c/d/e at the beginning
or in the middle of a word will therefore result in an invalid UTF-8
sequence. Only an occurrence of c/d/e at the end of a word, followed by
the correct number of class b characters, results in a correct UTF-8
sequence.

In German, the commonly used characters of c/d/e are: ÄÖÜäöüß
The uppercase ÄÖÜ appear only at the beginning of a word => invalid combination.

Other combinations:
* äöü followed by two "special characters" (I don't know where such a combination could occur).
* ß followed by one "special character" (I regard this as the most likely misinterpretation).

I cannot speak for other languages. If in doubt, please look at a
character table (eg. http://en.wikipedia.org/wiki/ISO-8859-1#ISO-8859-1)
and think about the possibility of UTF-8 compatible combinations in your language.

As gitweb processes one line of text at a time, a UTF-8 compatible
combination has no effect if any other non UTF-8 compatible
character sequence occurs in the same line.
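
The byte classes and the rule above can be sketched in Perl as follows.
byte_class and could_pass_as_utf8 are hypothetical names, and the check
is deliberately only structural: class e is simplified to "lead byte of
a four byte sequence", and overlong forms are not rejected, so it is
looser than a real UTF-8 validator.

```perl
use strict;
use warnings;

# Class of a single byte, following the a)-e) grouping above.
sub byte_class {
	my $byte = shift;             # integer 0..255
	return 'a' if $byte <= 0x7F;  # shared with US-ASCII / UTF-8
	return 'b' if $byte <= 0xBF;  # UTF-8 continuation byte
	return 'c' if $byte <= 0xDF;  # lead byte of a two byte sequence
	return 'd' if $byte <= 0xEF;  # lead byte of a three byte sequence
	return 'e';                   # 0xF0-0xFF: longer sequences
}

# Structural check: every lead byte of class c/d/e must be followed
# by exactly 1/2/3 bytes of class b, and class b never stands alone.
sub could_pass_as_utf8 {
	my $octets = shift;
	my $classes = join '', map { byte_class($_) } unpack 'C*', $octets;
	return $classes =~ /^(?:a|cb|dbb|ebbb)*$/ ? 1 : 0;
}

# latin1 bytes of "Groesse" with o-umlaut (0xF6) and sharp s (0xDF)
print could_pass_as_utf8("Gr\xF6\xDFe") ? "ambiguous\n" : "not UTF-8\n";
# latin1 bytes of "Mass" with sharp s (0xDF) followed by 0xBB
print could_pass_as_utf8("Ma\xDF\xBB")  ? "ambiguous\n" : "not UTF-8\n";
```

The first sample is rejected because 0xF6 is not followed by class b
bytes; the second is the "sharp s plus one special character" case and
passes the structural check.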

> >> As git itself is not aware of any encoding, I know no better
> >> possibility to handle non UTF-8 text in gitweb.
> > 
> > I don't think this is a reasonable approach; I actually dispute the high
> > probability - in western Europe it's obvious to assume latin1, but does
> > majority of users using non-ascii characters come from there? Or rather
> > from central Europe (like me, Petr Baudiš? ;-))? Somewhere else?
> 
> I also don't think that hardcoding latin1 (iso-8859-1) as default
> alternate encoding is a good idea. I don't think using iso-8859-1
> (outside us-ascii) is _nowadays_ that common. On the other hand I think
> that not all users of koi8r, eucjp or iso-2022-jp converted (and can
> convert) to utf-8; latin1 users can.

UTF-8 is not a universal, drop-in replacement for ISO-8859-1. It has some drawbacks:
- Some operations are slower, eg.
$ hexdump -C s
00000000  78 0a 78 0a 78 0a 78 0a  78 0a 78 0a 78 0a 78 0a  |x.x.x.x.x.x.x.x.|
*
01000000
$ grep --version
grep (GNU grep) 2.5.1
$ LANG=en_US.ISO-8859-15 time grep "[a]" s
Command exited with non-zero status 1
0.38user 0.05system 0:00.46elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+219minor)pagefaults 0swaps
$ LANG=en_US.UTF-8 time grep "[a]" s
Command exited with non-zero status 1
10.86user 0.31system 0:14.29elapsed 78%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+17151minor)pagefaults 0swaps
- Anything using string length/character position is more complicated.

For some problems, UTF-16 might be a simpler solution.

But I agree that there should be a possibility to choose the
fallback encoding.

> And using latin1 (other encoding) _only_ when there is an invalid utf-8
> sequence is not a good idea either; I think that that there are some
> latin1 sequences outside us-ascii which are valid utf-8 sequences. That
> kind of magic is wrong, wrong, wrong...

Please tell me a better alternative. The non UTF-8 will be in the history
(in blobs/trees/commits/..) forever, where it can not be changed.

I need a solution for this. I can use this patch on my system, but I
would like to see support for other encodings in upstream gitweb.

> > If we do something like this, we should do it properly and look at
> > configured i18n.commitEncoding for the project. (But as config lookup
> > may be expensive, probably do it only when we need it.)
> 
> I think it would be best to make it into %feature, overridable
> or not (which would look at i18n.commitEncoding instead of at
> gitweb.commitEncoding, but still a feature).

I would use i18n.commitEncoding only as a last fallback. A project
may use several different encodings, and the guessing logic may need
additional parameters, so I would create a separate set of config
parameters for this.

> About config lookup: we can either "borrow" config reading code in Perl
> from git-cvsserver, perhaps via putting it into Git.pm. Or we can
> implement at last core git support for dumping whole config in
> unambiguous machine parseable output: "git config --dump", e.g.
>   key <LF> value <NUL>
> or
>   key <NUL>
> (the second for "boolean" variables without set value).

If we use a new file (in the gitweb config format), the whole thing
will be faster and less complicated.

mfg Martin Kögler


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-05-29 21:55     ` Martin Koegler
@ 2007-05-30 20:18       ` Robin Rosenberg
  2007-06-01 21:05       ` Jakub Narebski
  1 sibling, 0 replies; 9+ messages in thread
From: Robin Rosenberg @ 2007-05-30 20:18 UTC (permalink / raw)
  To: Martin Koegler
  Cc: Jakub Narebski, Petr Baudis, git, Martin Langhoff, Martyn Smith

tisdag 29 maj 2007 skrev Martin Koegler:
> On Tue, May 29, 2007 at 11:21:11AM +0200, Jakub Narebski wrote:
> > On Tue, 29 May 2007, Petr Baudis wrote:
> > > On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
> > 
> > >> gitweb assumes, that everything is in UTF-8. If a text contains invalid
> > >> UTF-8 character sequences, the text must be in a different encoding.
> > 
> > But it doesn't tell us _what_ is the encoding. For commit messages,
> > with reasonable new git, we have 'encoding' header if git known that
> > commit message was not in utf-8.
> > 
> > By the way, I winder why we don't have such header for tag objects
> > (i18n.tagEncoding ;-)...
> 
> Why do I need to set i18n.commitEncoding on a normal Linux systems?  We
I've asked the same question.. :(
> have a locale, which contains this information. With this, its more
> likely, that the commits can be read correctly later, if somebody
> forget to set "i18n.commitEncoding" in a repository.
No 'if'. Users are virtually guaranteed to forget this setting.

> 
> UTF-8 is not the universal, dropin solution for ISO-8859-1. It has some drawbacks:
> - Some operations are slower, eg.
> - Anything using string length/character position is more complicated.
We'll have to live with that. A nice property of valid UTF-8 is that many operations can
be performed without decoding (like looking for a substring).
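
A small sketch of that property, using the Polish sample phrase as
UTF-8 octets; the variable names are chosen for the example only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Both strings are UTF-8 octets, as they would arrive from git.
my $haystack = encode('UTF-8',
	"za\x{17C}\x{F3}\x{142}\x{107} g\x{119}si\x{105} ja\x{17A}\x{144}");
my $needle   = encode('UTF-8', "g\x{119}si\x{105}");

# index() compares raw bytes; no decoding step is needed because a
# valid UTF-8 sequence cannot start in the middle of another character.
my $byte_offset = index($haystack, $needle);
print "found at byte offset $byte_offset\n";
```

The result is a byte offset, not a character position; converting it to
a character position is the part that does need decoding.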

> 
> For some problems, UTF-16 might be a simpler solution.
UTF-16 is also variable width (one or two code units). Most apps get away with pretending it is
fixed width, simply because that works for most people, but I'm not sure people in Asia
are really happy with that assumption.

> I would use i18n.commitEncoding only as last fallback. In a project
> more different encodings could be used and the guessing logic may need
> additional parameter, so I would create a own set of config parameters
> for this.

There aren't many simple ways of guessing. The UTF-8 vs other test is simple 
and very reliable for western encodings (and merely good for others, if I'm not misinformed).
The i18n.commitEncoding is just a hint. Another hint is the host's encoding.

1. if lookslike(UTF-8) => assume UTF-8 else...
2. commit's encoding is valid for the text => use it else...
3. i18n.commitEncoding ...
4. gitweb.commitencoding  ....
5. server's location charset ...
6. assume iso-8859-1
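
The chain above could be sketched roughly like this. It is an
illustration under assumptions, not gitweb code: decode_with_fallbacks
is a hypothetical name, and the caller is assumed to have already
collected the candidate encoding names (commit header, config values,
locale) in priority order.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Try each candidate encoding in order and return the first strict
# decode that succeeds, together with the name that worked.
sub decode_with_fallbacks {
	my ($octets, @candidates) = @_;
	foreach my $name (grep { defined } @candidates) {
		next unless find_encoding($name);   # skip unknown names
		my $text = eval {
			decode($name, $octets,
			       Encode::FB_CROAK() | Encode::LEAVE_SRC());
		};
		return ($text, $name) if defined $text;
	}
	# last resort (step 6): every byte sequence decodes as iso-8859-1
	return (decode('iso-8859-1', $octets), 'iso-8859-1');
}

my ($text, $used) = decode_with_fallbacks(
	"Gr\xF6\xDFe",      # latin1 octets, not valid UTF-8
	'UTF-8',            # step 1
	undef,              # step 2: no 'encoding' header on this commit
	'iso-8859-1');      # steps 3-5 collapsed into one configured value
print "decoded using $used\n";
```

Note that most single-byte encodings accept nearly any input, so the
first such candidate effectively ends the chain; the order of the hints
matters more than their number.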

Yet another would be to have an extra option to switch encoding on-demand in the gui.

BTW, there's another thread on notes. Maybe they could be used to "fix" badly encoded messages
if and when they get a final implementation.

-- robn


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-05-29 21:55     ` Martin Koegler
  2007-05-30 20:18       ` Robin Rosenberg
@ 2007-06-01 21:05       ` Jakub Narebski
  2007-06-02 22:15         ` Junio C Hamano
  1 sibling, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2007-06-01 21:05 UTC (permalink / raw)
  To: Martin Koegler
  Cc: Petr Baudis, git, Martin Langhoff, Martyn Smith, Robin Rosenberg

On Tue, 29 May 2007, Martin Koegler wrote:
> On Tue, May 29, 2007 at 11:21:11AM +0200, Jakub Narebski wrote:
>> On Tue, 29 May 2007, Petr Baudis wrote:
>>> On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
>> 
>>>> gitweb assumes, that everything is in UTF-8. If a text contains invalid
>>>> UTF-8 character sequences, the text must be in a different encoding.
>> 
>> But it doesn't tell us _what_ is the encoding. For commit messages,
>> with reasonable new git, we have 'encoding' header if git known that
>> commit message was not in utf-8.
>> 
>> By the way, I winder why we don't have such header for tag objects
>> (i18n.tagEncoding ;-)...
> 
> Why do I need to set i18n.commitEncoding on a normal Linux systems?  We
> have a locale, which contains this information. With this, its more
> likely, that the commits can be read correctly later, if somebody
> forget to set "i18n.commitEncoding" in a repository.

Because a repository is (or at least can be) _shared_. People working on
the same repository can have different locales set. The web server running
gitweb can have a different locale.

>>>> This patch interprets such a text as latin1.
>> 
>> Meaning that it tries to recode text from latin1 (iso-8859-1) to utf-8
>> (not changing gitweb output encoding, which is utf-8).

And this (i.e. what does "interprets" mean) is what should be in the
commit message too.

>> It would be much better, and much easier at least for commit message
>> to add --encoding=utf-8 to git-rev-list / git-log invocation.
> 
> It does not help for old commits, where the encoding was not specified
> correctly. If my research is correct, the encoding handling was
> introduced at the end of 2006 and released this february.

True. But it _can_ help.

>>>> If commit/blob/... is not in UTF-8, it displays the text
>>>> with a very high probability correct. 
>>
>> And I doubt very much about this "very high probability to be
>> correct".
> 
> For normal text, this should be true:
> 
> We can divide ISO-8859-1 into some groups:
> a) 0x00-0x7f: shared with UTF-8
> b) 0x80-0xBF: continuation characters in UTF-8 (0x80-0x9F are control characters/unused)
> c) 0xC0-0xDF: start of a two byte UTF-8 character
> d) 0xE0-0xEF: start of a tree byte UTF-8 character
> e) 0xF0-0xFF: start of other longer UTF-8 sequences
> 
> To misinterpret a ISO-8859-1 text as UTF-8, each character of class
> c/d/e must be followed by the correct number of character of class b.
[cut]
> As gitweb is processing a line of text at once, one UTF-8 compatible
> combinations has no effect, if any other non UTF-8 combatible
> character sequence occurs.

Thanks for the explanation. In short: if characters not shared with UTF-8
(outside US-ASCII), the "special characters", usually occur solo, there is
a low probability that a line in a non-UTF-8 encoding will be valid UTF-8.
That is perhaps true for German and latin1 aka iso-8859-1; not
necessarily so, for example, for Polish and iso-8859-2, see
  zażółć gęsią jaźń
which is a perfectly good fragment containing all Polish special
characters, and as you can see those characters occur one after another.
Well, it could still be an invalid UTF-8 sequence; what about koi8r and
eucjp (or other non-UTF-8 encodings for Asian languages)?

> But I agree, that there should be the possibilty to choose a the
> fallback encoding.

I think for the beginning it would be enough to have

  # assume this charset if line contains non-UTF-8 characters
  our $fallback_encoding = "latin1";

or something like that (perhaps different wording in the comment,
perhaps different name of the variable) in the gitweb.perl for your
idea to be accepted.

That, and using to_utf8 (as before e3ad95a8) and not my_decode_utf8
as the subroutine name. If only it were possible to avoid the, I think,
quite costly "eval {....}" invocation...

[cut]

There are six sources of possibly non-UTF-8 input: commits, tags,
trees (file names), blobs, gitweb files and results of system calls.

Only the first one, commits, comes with the encoding specified... if the commit
was made with a new enough git, and if the committer correctly specified the
encoding. Commits are read using git-rev-list, which accepts the --encoding
parameter, so we can easily convert them to utf-8... if git was compiled
with iconv support. It is possible that due to repository, gitweb user
or global configuration (i18n.logOutputEncoding, i18n.commitEncoding)
this is done automatically. On the other hand I think a commit message
is where an accidental wrongly encoded sequence is most likely to occur.

Second one, tags, really _should_ have encoding header like commits.
On the other hand usually the message is version + PGP signature, so
there is no place for any encoding. Tags are read using git-cat-file,
which does not do any encoding/decoding.

Third, filenames in tree objects, "suffer" from a git design decision:
for performance and simplicity git stores filenames in trees 'as is',
and relies on the fact that filenames are the same in tree objects,
in the index (dircache), in the filesystem during saving, and as read
from the filesystem. Moreover I think that the encoding of names on a filesystem
might depend on the filesystem in question and be different from the locale
specified encoding (locale is user local, filesystem is global).
On the other hand one usually does not use special characters
in filenames because of the problems they cause.

Fourth, blobs (file contents). They can use different encoding than
commit messages; moreover different files can use different encoding.
Encoding has to be specified externally; there is no place for encoding
header in the blob object structure.

Fifth, gitweb files include files read and transformed such as 
GIT_DIR/description file, or projects index file $projects_list,
and files containing fragments of HTML like README.html or header/footer
files.

Sixth, we sometimes have to decode to utf8 results of system calls
like getpwuid to get owner of a file (of a project), or decode to utf8
path (fragment) to the repository.

There are two places to specify gitweb output charset. First is charset
used in HTML output, which is also default charset (binmode) of STDOUT
stream. Gitweb uses utf-8 here, and utf-8 is recommended for XML and for
XHTML by W3C, although we could theoretically add an option to use
different charset by default, and decode (or not) to this charset, instead
of recoding everything (see above) to utf-8.

Second place is default charset for text/plain blob_plain output:
  # default blob_plain mimetype and default charset for text/plain blob
  our $default_blob_plain_mimetype = 'text/plain';
  our $default_text_plain_charset  = undef;
and for other *_plain output written as text/plain; charset=utf-8, and
which is actually dumped :raw to STDOUT.

So what should be the solution? Add global, per gitweb installation
configuration variables $input_encoding and $fallback_input_encoding?
What do you think? Do you have other ideas?

-- 
Jakub Narebski
Poland


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-06-01 21:05       ` Jakub Narebski
@ 2007-06-02 22:15         ` Junio C Hamano
  2007-06-03 15:42           ` Jakub Narebski
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2007-06-02 22:15 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Martin Koegler, Petr Baudis, git, Martin Langhoff, Martyn Smith,
	Robin Rosenberg

Jakub Narebski <jnareb@gmail.com> writes:

> On Tue, 29 May 2007, Martin Koegler wrote:
> ...
>> But I agree, that there should be the possibilty to choose a the
>> fallback encoding.
>
> I think for the beginning it would be enough to have
>
>   # assume this charset if line contains non-UTF-8 characters
>   our $fallback_encoding = "latin1";
>
> or something like that (perhaps different wording in the comment,
> perhaps different name of the variable) in the gitweb.perl for your
> idea to be accepted.
>
> That, and using to_utf8 (as before e3ad95a8) and not my_decode_utf8
> as subroutine name. If only it would be possible to avoid I think
> quote costly "eval {....}" invocation...

Except that I had an impression that block form of "eval" (as
opposed to "parse and evaluate string" kind) was not costly at
all.

Please make it so.

I'll read the other parts of your message again -- I might have
further comments.


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-06-02 22:15         ` Junio C Hamano
@ 2007-06-03 15:42           ` Jakub Narebski
  2007-06-03 18:41             ` Alexandre Julliard
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Narebski @ 2007-06-03 15:42 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Martin Koegler, Petr Baudis, git, Martin Langhoff, Martyn Smith,
	Robin Rosenberg, Alexandre Julliard

Alexandre, I hope that the patch attached would solve your problem.

Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
>> On Tue, 29 May 2007, Martin Koegler wrote:
>> ...
>>> But I agree, that there should be the possibilty to choose the
>>> fallback encoding.
>>
>> I think for the beginning it would be enough to have
>>
>>   # assume this charset if line contains non-UTF-8 characters
>>   our $fallback_encoding = "latin1";

Added this, with more elaborate comment, just before %feature hash.

>> or something like that (perhaps different wording in the comment,
>> perhaps different name of the variable) in the gitweb.perl for your
>> idea to be accepted.
>>
>> That, and using to_utf8 (as before e3ad95a8) and not my_decode_utf8
>> as subroutine name. If only it would be possible to avoid I think
>> quote costly "eval {....}" invocation...
> 
> Except that I had an impression that block form of "eval" (as
> opposed to "parse and evaluate string" kind) was not costly at
> all.

I have checked the time it took to run the gitweb test (t9500),
and the time to run (user+sys) increased about 1% after this patch,
which is even within range of error I think.
 
> Please make it so.

I have changed the name from my_decode_utf8 to the name used for thin
wrapper before commit e3ad95a8 "gitweb: use decode_utf8 directly", namely
to_utf8, and put it in the place where old to_utf8 subroutine was.

Instead of the somewhat hackish "return $res || decode('latin1', $str);" it uses
"if (defined $res) { ... } else { ... }"; this avoids calling decode()
unnecessarily for '' and '0' strings, which are also false, but do not mean
that decode_utf8 failed.
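
The difference between the two forms can be shown in isolation. This is
a demonstration sketch, not the patch code: decode_or and decode_defined
are hypothetical names, and the 'FALLBACK:' prefix exists only to make
the chosen branch visible in the output.

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'latin1';

# Strict UTF-8 decode; returns undef when the octets are not valid.
sub try_utf8 {
	my $str = shift;
	return eval {
		decode('UTF-8', $str,
		       Encode::FB_CROAK() | Encode::LEAVE_SRC());
	};
}

# Variant with "||": a successful decode of '' or '0' is false, so the
# fallback branch runs even though nothing went wrong.
sub decode_or {
	my $str = shift;
	my $res = try_utf8($str);
	return $res || 'FALLBACK:' . decode($fallback_encoding, $str);
}

# Variant with defined(): only a failed decode (undef) falls back.
sub decode_defined {
	my $str = shift;
	my $res = try_utf8($str);
	return defined $res
		? $res
		: 'FALLBACK:' . decode($fallback_encoding, $str);
}

print decode_or('0'), "\n";        # takes the fallback branch
print decode_defined('0'), "\n";   # returns the UTF-8 result
```

For '0' the fallback happens to produce the same text, so the "||" form
is wasteful rather than wrong; the defined() form simply states the
intended condition.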

It uses explicit constant names, Encode::FB_CROAK instead of 1, and
Encode::FB_DEFAULT instead of default undef/0.

It adds a very, very basic test: it checks _only_ whether there are any
errors or warnings which would go to the web server log; it does not check
if the output is correct. It uses helper files from other i18n tests.

Added comments.


Still the main change is by Martin Koegler and he should be author of
this commit, I think. I have added S-o-b: by me.

-- >8 --
From: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Subject: [PATCH] gitweb: Handle non UTF-8 text better

gitweb assumes that everything is in UTF-8. If a text contains invalid
UTF-8 character sequences, the text must be in a different encoding.

This commit introduces $fallback_encoding which would be used as input
encoding if gitweb encounters text which is not valid UTF-8.

Add basic test for this in t/t9500-gitweb-standalone-no-errors.sh

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
 gitweb/gitweb.perl                     |   41 ++++++++++++++++++++++++-------
 t/t9500-gitweb-standalone-no-errors.sh |   28 +++++++++++++++++++++
 2 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index c3921cb..e92596c 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -94,6 +94,13 @@ our $default_text_plain_charset  = undef;
 # (relative to the current git repository)
 our $mimetypes_file = undef;
 
+# assume this charset if line contains non-UTF-8 characters;
+# it should be valid encoding (see Encode::Supported(3pm) for list),
+# for which encoding all byte sequences are valid, for example
+# 'iso-8859-1' aka 'latin1' (it is decoded without checking, so it
+# could be even 'utf-8' for the old behavior)
+our $fallback_encoding = 'latin1';
+
 # You define site-wide feature defaults here; override them with
 # $GITWEB_CONFIG as necessary.
 our %feature = (
@@ -602,6 +609,20 @@ sub validate_refname {
 	return $input;
 }
 
+# decode sequences of octets in utf8 into Perl's internal form,
+# which is utf-8 with utf8 flag set if needed.  gitweb writes out
+# in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
+sub to_utf8 {
+	my $str = shift;
+	my $res;
+	eval { $res = decode_utf8($str, Encode::FB_CROAK); };
+	if (defined $res) {
+		return $res;
+	} else {
+		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
+	}
+}
+
 # quote unsafe chars, but keep the slash, even when it's not
 # correct, but quoted slashes look too horrible in bookmarks
 sub esc_param {
@@ -626,7 +647,7 @@ sub esc_html ($;%) {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = to_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -640,7 +661,7 @@ sub esc_path {
 	my $str = shift;
 	my %opts = @_;
 
-	$str = decode_utf8($str);
+	$str = to_utf8($str);
 	$str = $cgi->escapeHTML($str);
 	if ($opts{'-nbsp'}) {
 		$str =~ s/ /&nbsp;/g;
@@ -925,7 +946,7 @@ sub format_subject_html {
 
 	if (length($short) < length($long)) {
 		return $cgi->a({-href => $href, -class => "list subject",
-		                -title => decode_utf8($long)},
+		                -title => to_utf8($long)},
 		       esc_html($short) . $extra);
 	} else {
 		return $cgi->a({-href => $href, -class => "list subject"},
@@ -1239,7 +1260,7 @@ sub git_get_projects_list {
 			if (check_export_ok("$projectroot/$path")) {
 				my $pr = {
 					path => $path,
-					owner => decode_utf8($owner),
+					owner => to_utf8($owner),
 				};
 				push @list, $pr;
 				(my $forks_path = $path) =~ s/\.git$//;
@@ -1269,7 +1290,7 @@ sub git_get_project_owner {
 			$pr = unescape($pr);
 			$ow = unescape($ow);
 			if ($pr eq $project) {
-				$owner = decode_utf8($ow);
+				$owner = to_utf8($ow);
 				last;
 			}
 		}
@@ -1759,7 +1780,7 @@ sub get_file_owner {
 	}
 	my $owner = $gcos;
 	$owner =~ s/[,;].*$//;
-	return decode_utf8($owner);
+	return to_utf8($owner);
 }
 
 ## ......................................................................
@@ -1842,7 +1863,7 @@ sub git_header_html {
 
 	my $title = "$site_name";
 	if (defined $project) {
-		$title .= " - " . decode_utf8($project);
+		$title .= " - " . to_utf8($project);
 		if (defined $action) {
 			$title .= "/$action";
 			if (defined $file_name) {
@@ -2116,7 +2137,7 @@ sub git_print_page_path {
 
 	print "<div class=\"page_path\">";
 	print $cgi->a({-href => href(action=>"tree", hash_base=>$hb),
-	              -title => 'tree root'}, decode_utf8("[$project]"));
+	              -title => 'tree root'}, to_utf8("[$project]"));
 	print " / ";
 	if (defined $name) {
 		my @dirname = split '/', $name;
@@ -2936,7 +2957,7 @@ sub git_project_list_body {
 		($pr->{'age'}, $pr->{'age_string'}) = @aa;
 		if (!defined $pr->{'descr'}) {
 			my $descr = git_get_project_description($pr->{'path'}) || "";
-			$pr->{'descr_long'} = decode_utf8($descr);
+			$pr->{'descr_long'} = to_utf8($descr);
 			$pr->{'descr'} = chop_str($descr, 25, 5);
 		}
 		if (!defined $pr->{'owner'}) {
@@ -3981,7 +4002,7 @@ sub git_snapshot {
 	my $git = git_cmd_str();
 	my $name = $project;
 	$name =~ s/\047/\047\\\047\047/g;
-	my $filename = decode_utf8(basename($project));
+	my $filename = to_utf8(basename($project));
 	my $cmd;
 	if ($suffix eq 'zip') {
 		$filename .= "-$hash.$suffix";
diff --git a/t/t9500-gitweb-standalone-no-errors.sh b/t/t9500-gitweb-standalone-no-errors.sh
index b92ab63..44ae503 100755
--- a/t/t9500-gitweb-standalone-no-errors.sh
+++ b/t/t9500-gitweb-standalone-no-errors.sh
@@ -487,4 +487,32 @@ test_expect_success \
 	'gitweb_run "p=.git;a=atom"'
 test_debug 'cat gitweb.log'
 
+# ----------------------------------------------------------------------
+# encoding/decoding
+
+test_expect_success \
+	'encode(commit): utf8' \
+	'. ../t3901-utf8.txt &&
+	 echo "UTF-8" >> file &&
+	 git add file &&
+	 git commit -F ../t3900/1-UTF-8.txt &&
+	 gitweb_run "p=.git;a=commit"'
+test_debug 'cat gitweb.log'
+
+test_expect_success \
+	'encode(commit): iso-8859-1' \
+	'. ../t3901-8859-1.txt &&
+	 echo "ISO-8859-1" >> file &&
+	 git add file &&
+	 git config i18n.commitencoding ISO-8859-1 &&
+	 git commit -F ../t3900/ISO-8859-1.txt &&
+	 git config --unset i18n.commitencoding &&
+	 gitweb_run "p=.git;a=commit"'
+test_debug 'cat gitweb.log'
+
+test_expect_success \
+	'encode(log): utf-8 and iso-8859-1' \
+	'gitweb_run "p=.git;a=log"'
+test_debug 'cat gitweb.log'
+
 test_done
-- 
1.5.2


* Re: [PATCH] gitweb: handle non UTF-8 text
  2007-06-03 15:42           ` Jakub Narebski
@ 2007-06-03 18:41             ` Alexandre Julliard
  0 siblings, 0 replies; 9+ messages in thread
From: Alexandre Julliard @ 2007-06-03 18:41 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Junio C Hamano, Martin Koegler, Petr Baudis, git, Martin Langhoff,
	Martyn Smith, Robin Rosenberg

Jakub Narebski <jnareb@gmail.com> writes:

> Alexandre, I hope that the patch attached would solve your problem.

Yes, it works fine for me, thanks!

-- 
Alexandre Julliard
julliard@winehq.org


Thread overview: 9+ messages
2007-05-28 20:47 [PATCH] gitweb: handle non UTF-8 text Martin Koegler
2007-05-28 23:21 ` Petr Baudis
2007-05-29  9:21   ` Jakub Narebski
2007-05-29 21:55     ` Martin Koegler
2007-05-30 20:18       ` Robin Rosenberg
2007-06-01 21:05       ` Jakub Narebski
2007-06-02 22:15         ` Junio C Hamano
2007-06-03 15:42           ` Jakub Narebski
2007-06-03 18:41             ` Alexandre Julliard
