git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jürgen Kreileder" <jk@blackdown.de>,
	"John Hawley" <warthog9@kernel.org>,
	"Ismail Dönmez" <ismail@pardus.org.tr>
Subject: Re: [RFD] Handling of non-UTF8 data in gitweb
Date: Sun, 18 Dec 2011 23:00:58 +0100	[thread overview]
Message-ID: <201112182300.59409.jnareb@gmail.com> (raw)
In-Reply-To: <7vehwhcj3q.fsf@alter.siamese.dyndns.org>

On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > But doing this would change gitweb behavior.  Currently when 
> > encountering something (usually line of output) that is not valid 
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'.  Note however that this value is per gitweb installation,
> > not per repository.
> 
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?

Actually... the change in 

  00f429a (gitweb: Handle non UTF-8 text better, 2007-06-03)

worked correctly, but since 

  e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding., 2007-12-04)

it was changed to a NON-WORKING version - and *nobody* protested.

   sub to_utf8 {
   	my $str = shift;
  -	my $res;
  -	eval { $res = decode_utf8($str, Encode::FB_CROAK); };
  -	if (defined $res) {
  -		return $res;
  +	if (utf8::valid($str)) {
  +		utf8::decode($str);
  +		return $str;
   	} else {
   		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
   	}

Well, it works for utf8 and latin1 _only_ (with $fallback_encoding being
set to 'latin1' by default), and for latin1 by historical accident... that
might be why nobody noticed.  $fallback_encoding != 'latin1' or 'utf8'
didn't work.

utf8::valid(STRING) is an internal function that tests whether STRING is
in a consistent state regarding UTF-8.  It returns true is well-formed
UTF-8 and has the UTF-8 flag on _or_ if string is held as bytes (both
these states are 'consistent').  For gitweb the second option was true.

Note that utf8:decode(STRING) returns false if STRING is invalid as UTF-8.

What makes it all work is the fact that utf8:decode(STRING) turns on UTF-8
flag only if source string is valid UTF-8 and contains multi-byte UTF-8
characters... and that if string doesn't have UTF-8 flag set it is treated
as in native Perl encoding, i.e. 'latin1' / 'iso-8859-1' (unless it is EBCDIC).
It is ':utf8' layer that actually convert 'latin1' to 'utf8'.

-- >8 --
Subject: [PATCH] gitweb: Fix fallback mode of to_utf8 subroutine

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):

        old  new
  old    -- -65%
  new  189%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding ('latin1' is Perl
native encoding), which is why it was not noticed for such long time.

utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').  For gitweb the second
option was true, as output from git commands is opened without ':utf8'
layer.

What made it work at all for STRING in 'latin1' encoding is the fact
that utf8:decode(STRING) turns on UTF-8 flag only if source string is
valid UTF-8 and contains multi-byte UTF-8 characters... and that if
string doesn't have UTF-8 flag set it is treated as in native Perl
encoding, i.e.  'latin1' / 'iso-8859-1' (unless native encoding it is
EBCDIC ;-)).  It was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8'.


Let's make use of the fact that utf8:decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
I'm sorry for overly wordy commit message...

 gitweb/gitweb.perl |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..75b0970 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,7 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+	if (utf8::valid($str) && utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

  parent reply	other threads:[~2011-12-18 22:01 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
2011-12-06  1:07 ` Jeff King
2011-12-07  0:37 ` Junio C Hamano
2011-12-10 16:18   ` Jakub Narebski
2011-12-12  5:26     ` Junio C Hamano
2011-12-18 22:00   ` Jakub Narebski [this message]
2012-01-06 16:35 ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201112182300.59409.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=ismail@pardus.org.tr \
    --cc=jk@blackdown.de \
    --cc=warthog9@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).