git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFD] Handling of non-UTF8 data in gitweb
@ 2011-12-04 16:09 Jakub Narebski
  2011-12-06  1:07 ` Jeff King
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jakub Narebski @ 2011-12-04 16:09 UTC (permalink / raw)
  To: git; +Cc: Jürgen Kreileder, John Hawley

Hello!

Currently gitweb converts data it receives from git commands to Perl 
internal utf8 representation via to_utf8() subroutine

 # decode sequences of octets in utf8 into Perl's internal form,
 # which is utf-8 with utf8 flag set if needed.  gitweb writes out
 # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
 	if (utf8::valid($str)) {
 		utf8::decode($str);
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
 	}
 }

Each part of data must be handled separately.  It is quite error prone
process, as can be seen from quite a number of patches that fix handling
of UTF-8 data (latest from Jürgen).


Much, much simpler would be to force opening of all files (including 
output pipes from git commands) in ':utf8' mode:

  use open qw(:std :utf8);

[Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)' 
 there...]

But doing this would change gitweb behavior.  Currently when 
encountering something (usually line of output) that is not valid 
UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
'latin1'.  Note however that this value is per gitweb installation,
not per repository.

Using "use open qw(:std :utf8);" would be like changing the value of 
$fallback_encoding to 'utf8' -- errors would be ignored, and characters 
which are invalid in UTF-8 encoding would get replaced[1] with 
substitution character '�' U+FFFD.

Though at least for HTML output we could use Encode::FB_HTMLCREF 
handling (which would produce &#NNN;) or Encode::FB_XMLCREF (which
would produce &#xHHHH;), though this must be done after HTML escaping...
and is probaby not worth it (FYI this can be done by setting 
$PerlIO::encoding::fallback to either of those values[2])

[1] http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
    http://p3rl.org/Encode
[2] http://perldoc.perl.org/PerlIO/encoding.html
    http://p3rl.org/PerlIO::encoding

I don't know if people are relying on the old behavior.  I guess
it could be emulated by defining our own 'utf-8-with-fallback'
encoding, or by defining our own PerlIO layer with PerlIO::via.
But it no longer be simple solution (though still automatic).


Alternate approach would be to audit gitweb code, and call to_utf8
before storing extracted output of git command in variable (excluding
save types like SHA-1, filemode, timestamp and timezone).  The fact
that to_utf8 is idempotent and can be called multiple times would
help here, I think.


The correct solution would be of course to respect `gui.encoding`
per-repository config variable, and `encoding` gitattribute...
though the latter is hampered by the fact that there is currently
no way to read attribute with "git check-attr" from a given tree:
think of a diff of change of encoding of a file!

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
@ 2011-12-06  1:07 ` Jeff King
  2011-12-07  0:37 ` Junio C Hamano
  2012-01-06 16:35 ` Jakub Narebski
  2 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2011-12-06  1:07 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Jürgen Kreileder, John Hawley

On Sun, Dec 04, 2011 at 05:09:30PM +0100, Jakub Narebski wrote:

> The correct solution would be of course to respect `gui.encoding`
> per-repository config variable, and `encoding` gitattribute...
> though the latter is hampered by the fact that there is currently
> no way to read attribute with "git check-attr" from a given tree:
> think of a diff of change of encoding of a file!

We deal with the same problem at GitHub.

There really isn't a good way to specify per-file encodings. Something
like gui.encoding is too coarse. As you mentioned, we don't do per-tree
gitattribute lookups, so the encoding attribute has problems when the
encoding of a file changes. But even if we implemented them, you still
have the problem of getting a raw sha1 (e.g., git diff 9624865 e0a3260).
There's no way to look up attributes for that.

It would be nice if you could put an "encoding" header into the blob
object. You could use the .gitattributes in place at "git add" time to
set it. And then at lookup time, you either have the encoding, or you
assume it's in utf8 (if it isn't binary, of course).

But there's no room in the blob format for headers; the content starts
right after the size header.

You can get around this by searching the history for a tree that
contains the blob, and then checking the gitattributes. It's expensive,
but you could build a cache over time. However, it's not guaranteed to
provide a single answer; you could have multiple trees that mention the
blobs, each with different attributes.

And even if you implement all that, we have the problem that older blobs
won't have gotten an encoding header, even if they would have under the
new rules. So rather than assuming utf8, you have to make a guess
anyway.

At GitHub, we talked about a lot of these options and ended up just
using an encoding-detection library to make a best guess. It seems to
work well in practice, but it's only been deployed for a couple of
months.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
  2011-12-06  1:07 ` Jeff King
@ 2011-12-07  0:37 ` Junio C Hamano
  2011-12-10 16:18   ` Jakub Narebski
  2011-12-18 22:00   ` Jakub Narebski
  2012-01-06 16:35 ` Jakub Narebski
  2 siblings, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2011-12-07  0:37 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Jürgen Kreileder, John Hawley

Jakub Narebski <jnareb@gmail.com> writes:

> But doing this would change gitweb behavior.  Currently when 
> encountering something (usually line of output) that is not valid 
> UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> 'latin1'.  Note however that this value is per gitweb installation,
> not per repository.

I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
better, 2007-06-03) for a good reason, and I think the above argues that
whatever issue the commit tried to address is a non-issue. Is it really
true?

> ... I guess
> it could be emulated by defining our own 'utf-8-with-fallback'
> encoding, or by defining our own PerlIO layer with PerlIO::via.
> But it no longer be simple solution (though still automatic).

Between the current "everybody who read from the input must remember to
call to_utf8" and "everybody gets to read utf8 automatically for internal
processing", even though the latter may involve one-time pain to set up
the framework to do so, the pros-and-cons feels obvious to me.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-07  0:37 ` Junio C Hamano
@ 2011-12-10 16:18   ` Jakub Narebski
  2011-12-12  5:26     ` Junio C Hamano
  2011-12-18 22:00   ` Jakub Narebski
  1 sibling, 1 reply; 7+ messages in thread
From: Jakub Narebski @ 2011-12-10 16:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jürgen Kreileder, John Hawley

On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > But doing this would change gitweb behavior.  Currently when 
> > encountering something (usually line of output) that is not valid 
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'.  Note however that this value is per gitweb installation,
> > not per repository.
> 
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?

I think that UTF-8 is much more prevalent character encoding in operating
systems, programming languages, APIs, and software applications than it
was 4 years ago.

Also the solution implemented in said commit was a good start, but it
remains incomplete: $fallback_encoding is per-installation which is too
big granularity (there is `gui.encoding` per-repository config... but it
is about main not fallback encoding; best would be to use gitattribute
but currently there is no way to check attribute value at given revision).

The proposed

  use open qw(:std :utf8);

and removal of to_utf8 and $fallback_encoding would be regression compared
to post-00f429a... but the tradeoff of more robust UTF-8 handling might be
worth it.


Note that to_utf8 handles git command output part by part, not as a whole;
for UTF-8 vs latin1 (i.e. iso-8859-1) it does not matter though because
latin1 is very unlikely to be recognized as valid utf-8[1], and ASCII
characters pass-through for UTF-8.

[1]: http://en.wikipedia.org/wiki/UTF-8#Advantages

> > ... I guess
> > it could be emulated by defining our own 'utf-8-with-fallback'
> > encoding, or by defining our own PerlIO layer with PerlIO::via.
> > But it no longer be simple solution (though still automatic).
> 
> Between the current "everybody who read from the input must remember to
> call to_utf8" and "everybody gets to read utf8 automatically for internal
> processing", even though the latter may involve one-time pain to set up
> the framework to do so, the pros-and-cons feels obvious to me.

There is also a matter of performance (':utf8' and ':encoding(UTF-8)'
are AFAIK implemented in C, both the Encode part and PerlIO part).
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-10 16:18   ` Jakub Narebski
@ 2011-12-12  5:26     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2011-12-12  5:26 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git, Jürgen Kreileder, John Hawley

Jakub Narebski <jnareb@gmail.com> writes:

> On Wed, 7 Dec 2011, Junio C Hamano wrote:
>> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
>> better, 2007-06-03) for a good reason, and I think the above argues that
>> whatever issue the commit tried to address is a non-issue. Is it really
>> true?
>
> I think that UTF-8 is much more prevalent character encoding in operating
> systems, programming languages, APIs, and software applications than it
> was 4 years ago.

Yeah, that was the kind of "reasoning behind it" explanation I was hoping
to see spelled out for people to agree or disagree.

But then the updated gitweb won't have trouble showing history of some
projects that has 4 yours or longer history (hopefully Git itself is not
among them).

> The proposed
>
>   use open qw(:std :utf8);
>
> and removal of to_utf8 and $fallback_encoding would be regression compared
> to post-00f429a... but the tradeoff of more robust UTF-8 handling might be
> worth it.
>
>> > ... I guess
>> > it could be emulated by defining our own 'utf-8-with-fallback'
>> > encoding, or by defining our own PerlIO layer with PerlIO::via.
>> > But it no longer be simple solution (though still automatic).
>> 
>> Between the current "everybody who read from the input must remember to
>> call to_utf8" and "everybody gets to read utf8 automatically for internal
>> processing", even though the latter may involve one-time pain to set up
>> the framework to do so, the pros-and-cons feels obvious to me.
>
> There is also a matter of performance (':utf8' and ':encoding(UTF-8)'
> are AFAIK implemented in C, both the Encode part and PerlIO part).

Would a reasonable first step be to replace the calls to bare "open" with
a wrapper that simulates the "open" interface (e.g. "sub git_open"), but
still keep the same behaviour as post-00f429a that could be much slower
than the native one?  Then a separate patch can build a "regression but
uses native and much faster" alternative on top, no?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-07  0:37 ` Junio C Hamano
  2011-12-10 16:18   ` Jakub Narebski
@ 2011-12-18 22:00   ` Jakub Narebski
  1 sibling, 0 replies; 7+ messages in thread
From: Jakub Narebski @ 2011-12-18 22:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jürgen Kreileder, John Hawley, Ismail Dönmez

On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > But doing this would change gitweb behavior.  Currently when 
> > encountering something (usually line of output) that is not valid 
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'.  Note however that this value is per gitweb installation,
> > not per repository.
> 
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?

Actually... the change in 

  00f429a (gitweb: Handle non UTF-8 text better, 2007-06-03)

worked correctly, but since 

  e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding., 2007-12-04)

it was changed to a NON-WORKING version - and *nobody* protested.

   sub to_utf8 {
   	my $str = shift;
  -	my $res;
  -	eval { $res = decode_utf8($str, Encode::FB_CROAK); };
  -	if (defined $res) {
  -		return $res;
  +	if (utf8::valid($str)) {
  +		utf8::decode($str);
  +		return $str;
   	} else {
   		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
   	}

Well, it works for utf8 and latin1 _only_ (with $fallback_encoding being
set to 'latin1' by default), and for latin1 by historical accident... that
might be why nobody noticed.  $fallback_encoding != 'latin1' or 'utf8'
didn't work.

utf8::valid(STRING) is an internal function that tests whether STRING is
in a consistent state regarding UTF-8.  It returns true is well-formed
UTF-8 and has the UTF-8 flag on _or_ if string is held as bytes (both
these states are 'consistent').  For gitweb the second option was true.

Note that utf8:decode(STRING) returns false if STRING is invalid as UTF-8.

What makes it all work is the fact that utf8:decode(STRING) turns on UTF-8
flag only if source string is valid UTF-8 and contains multi-byte UTF-8
characters... and that if string doesn't have UTF-8 flag set it is treated
as in native Perl encoding, i.e. 'latin1' / 'iso-8859-1' (unless it is EBCDIC).
It is ':utf8' layer that actually convert 'latin1' to 'utf8'.

-- >8 --
Subject: [PATCH] gitweb: Fix fallback mode of to_utf8 subroutine

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):

        old  new
  old    -- -65%
  new  189%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding ('latin1' is Perl
native encoding), which is why it was not noticed for such long time.

utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').  For gitweb the second
option was true, as output from git commands is opened without ':utf8'
layer.

What made it work at all for STRING in 'latin1' encoding is the fact
that utf8:decode(STRING) turns on UTF-8 flag only if source string is
valid UTF-8 and contains multi-byte UTF-8 characters... and that if
string doesn't have UTF-8 flag set it is treated as in native Perl
encoding, i.e.  'latin1' / 'iso-8859-1' (unless native encoding it is
EBCDIC ;-)).  It was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8'.


Let's make use of the fact that utf8:decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
I'm sorry for overly wordy commit message...

 gitweb/gitweb.perl |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..75b0970 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,7 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+	if (utf8::valid($str) && utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFD] Handling of non-UTF8 data in gitweb
  2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
  2011-12-06  1:07 ` Jeff King
  2011-12-07  0:37 ` Junio C Hamano
@ 2012-01-06 16:35 ` Jakub Narebski
  2 siblings, 0 replies; 7+ messages in thread
From: Jakub Narebski @ 2012-01-06 16:35 UTC (permalink / raw)
  To: git; +Cc: Jürgen Kreileder, John Hawley, Jeff King, Junio C Hamano

On Sun, 4 Dec 2011, Jakub Narebski wrote:
> 
> Currently gitweb converts data it receives from git commands to Perl 
> internal utf8 representation via to_utf8() subroutine
[...]
> Each part of data must be handled separately.  It is quite error prone
> process, as can be seen from quite a number of patches that fix handling
> of UTF-8 data (latest from Jürgen).
> 
> 
> Much, much simpler would be to force opening of all files (including 
> output pipes from git commands) in ':utf8' mode:
> 
>   use open qw(:std :utf8);
> 
> [Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)' 
>  there...]
> 
> But doing this would change gitweb behavior.  [...]
[...]
> I don't know if people are relying on the old behavior.  I guess
> it could be emulated by defining our own 'utf-8-with-fallback'
> encoding, or by defining our own PerlIO layer with PerlIO::via.
> But it no longer be simple solution (though still automatic).

I have now created simple Encode::UTF8WithFallback module, so that

  use Encode::UTF8WithFallback;
  use open IN => ':encoding(utf8-with-fallback)';

should be able to replace all calls to to_utf8() without any change
in behavior; at least simple tests shows that.


There however are two problems with this solution:

1. Encode::UTF8WithFallback should really be a separate Perl module
   in a separate file (e.g. 'gitweb/lib/Encode/UTF8WithFallback.pm');
   I was not able to make it work without a separate file.

   This means that it very much requires the change that allows splitting
   gitweb into many files and/or load extra helper modules, and/or require
   extra non-core modules but provide and install them with gitweb if they
   are not available.  These changes are ready, and can be find in 

     'gitweb/split'
   
   branch in my git.git repositories:

     http://repo.or.cz/w/git/jnareb-git.git
     https://github.com/jnareb/git


2. It turned out that the "open" pragma 1.04 from Perl v5.8.6 does not
   work correctly.  We need at least "open" 1.06 (version 1.05 consists
   supposedly only of documentation-only change).

   Because "open" is a core Perl module (core pragma), this means that
   gitweb will require in practice Perl v5.8.9 at least, increasing
   version requirement from current v5.8.0
 
-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-01-06 16:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
2011-12-06  1:07 ` Jeff King
2011-12-07  0:37 ` Junio C Hamano
2011-12-10 16:18   ` Jakub Narebski
2011-12-12  5:26     ` Junio C Hamano
2011-12-18 22:00   ` Jakub Narebski
2012-01-06 16:35 ` Jakub Narebski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).