git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: Juergen Kreileder <jk@blackdown.de>, Junio Hamano <gitster@pobox.com>
Subject: [PATCH 4/3 v2 (bugfix)] gitweb: Fix fallback mode of to_utf8 subroutine
Date: Mon, 19 Dec 2011 17:21:17 +0100	[thread overview]
Message-ID: <201112191721.17478.jnareb@gmail.com> (raw)
In-Reply-To: <201112191311.58787.jnareb@gmail.com>

e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)

Simple benchmark confirms that (old = 00f429a, new = this version):
note that it is synthetic benchmark of standalone subroutines, not
of gitweb itself (where probably no visible difference in performace
will show)

        Rate  old  new
  old 1582/s   -- -64%
  new 4453/s 181%   --

Unfortunately it made fallback mode of to_utf8 do not work...  except
for default value 'latin1' of $fallback_encoding (because 'latin1' is
Perl native encoding), which is probably why it was not noticed for so
long.


utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8.  It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent').

For gitweb in most cases the second option was true, as output from
git commands is opened without ':utf8' layer.  So utf8::valid is not
useful for to_utf8.

What made it look as if to_utf8() fallback mode worked correctly
(though only for $fallback_encoding at its default value 'latin1')
was the fact that utf8::decode(STRING) turns on UTF-8 flag only if
source string^W octets form a valid UTF-8 and it contains multi-byte
UTF-8 characters... this means that if string was not valid UTF-8
it didn't get UTF-8 flag.

When string doesn't have UTF-8 flag set, it is treated as if it was in
native Perl encoding, which is 'latin1' (unless native encoding is
EBCDIC ;-)).  So it was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8', and not to_utf8()
subroutine.  Fallback mode was never triggered.


Let's make use of the fact that utf8::decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.

Note however that if STRING has UTF-8 flag set already, then
utf8::decode also returns false, which could cause problems if given
string was already converted with to_utf8().  Such double conversion
can happen in gitweb.  Therefore we have to check if STRING has UTF-8
flag set with utf8::is_utf8(); if this subroutine returns true then we
have already decoded (converted) string, and don't have to do it
second time.

Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
Jakub Narebski wrote:
> P.S. I started to get strange errors
> 
>  XML Parsing Error: xml processing instruction not at start of external entity
>  Location: http://localhost/cgi-bin/gitweb/gitweb.cgi
>  Line Number 37, Column 1:
>  <?xml version="1.0" encoding="utf-8"?>
>  ^
> 
> while "show source" shows that '<?xml version="1.0" encoding="utf-8"?>'
> is the first line.  WTF?!?
> 
> P.P.S. Now I am getting errors when running gitweb, but only in some
> cases (via mod_cgi not as standalone script, only when using lynx),
> namely it looks like it falls back to 'latin1' when doing content
> which is valid UTF-8.
> 
> Will investigate.

Now it is fixed; the error was caused by to_utf8 not dealing with double
encoding for strings outside 7bit ASCII.

 gitweb/gitweb.perl |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..fc41b07 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,8 @@ sub validate_refname {
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
-	if (utf8::valid($str)) {
-		utf8::decode($str);
+
+	if (utf8::is_utf8($str) || utf8::decode($str)) {
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
-- 
1.7.6

      reply	other threads:[~2011-12-19 16:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-17  9:22 [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Jakub Narebski
2011-12-17  9:22 ` [PATCH 1/3] gitweb: Call to_utf8() on input string in chop_and_escape_str() Jakub Narebski
2011-12-17  9:22 ` [PATCH 2/3] gitweb: esc_html() site name for title in OPML Jakub Narebski
2011-12-17  9:22 ` [PATCH 3/3] gitweb: Output valid utf8 in git_blame_common('data') Jakub Narebski
2011-12-17 19:27 ` [PATCH 0/3 (resend)] gitweb: Various to_utf8 / esc_html fixes Junio C Hamano
2011-12-19  0:54 ` [PATCH 4/3] gitweb: Fix fallback mode of to_utf8 subroutine Jakub Narebski
2011-12-19 12:11   ` Jakub Narebski
2011-12-19 16:21     ` Jakub Narebski [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201112191721.17478.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jk@blackdown.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).