From: Jakub Narebski <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jürgen Kreileder" <jk@blackdown.de>,
"John Hawley" <warthog9@kernel.org>,
"Ismail Dönmez" <ismail@pardus.org.tr>
Subject: Re: [RFD] Handling of non-UTF8 data in gitweb
Date: Sun, 18 Dec 2011 23:00:58 +0100 [thread overview]
Message-ID: <201112182300.59409.jnareb@gmail.com> (raw)
In-Reply-To: <7vehwhcj3q.fsf@alter.siamese.dyndns.org>
On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>
> > But doing this would change gitweb behavior. Currently when
> > encountering something (usually line of output) that is not valid
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'. Note however that this value is per gitweb installation,
> > not per repository.
>
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?
Actually... the change in
00f429a (gitweb: Handle non UTF-8 text better, 2007-06-03)
worked correctly, but since
e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding., 2007-12-04)
it was changed to a NON-WORKING version - and *nobody* protested.
sub to_utf8 {
my $str = shift;
- my $res;
- eval { $res = decode_utf8($str, Encode::FB_CROAK); };
- if (defined $res) {
- return $res;
+ if (utf8::valid($str)) {
+ utf8::decode($str);
+ return $str;
} else {
return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
}
Well, it works for utf8 and latin1 _only_ (with $fallback_encoding being
set to 'latin1' by default), and for latin1 by historical accident... that
might be why nobody noticed. $fallback_encoding != 'latin1' or 'utf8'
didn't work.
utf8::valid(STRING) is an internal function that tests whether STRING is
in a consistent state regarding UTF-8. It returns true is well-formed
UTF-8 and has the UTF-8 flag on _or_ if string is held as bytes (both
these states are 'consistent'). For gitweb the second option was true.
Note that utf8:decode(STRING) returns false if STRING is invalid as UTF-8.
What makes it all work is the fact that utf8:decode(STRING) turns on UTF-8
flag only if source string is valid UTF-8 and contains multi-byte UTF-8
characters... and that if string doesn't have UTF-8 flag set it is treated
as in native Perl encoding, i.e. 'latin1' / 'iso-8859-1' (unless it is EBCDIC).
It is ':utf8' layer that actually convert 'latin1' to 'utf8'.
-- >8 --
Subject: [PATCH] gitweb: Fix fallback mode of to_utf8 subroutine
e5d3de5 (gitweb: use Perl built-in utf8 function for UTF-8 decoding.,
2007-12-04) was meant to make gitweb faster by using Perl's internals
(see subsection "Messing with Perl's Internals" in Encode(3pm) manpage)
Simple benchmark confirms that (old = 00f429a, new = this version):
old new
old -- -65%
new 189% --
Unfortunately it made fallback mode of to_utf8 do not work... except
for default value 'latin1' of $fallback_encoding ('latin1' is Perl
native encoding), which is why it was not noticed for such long time.
utf8::valid(STRING) is an internal function that tests whether STRING
is in a _consistent state_ regarding UTF-8. It returns true is
well-formed UTF-8 and has the UTF-8 flag on _*or*_ if string is held
as bytes (both these states are 'consistent'). For gitweb the second
option was true, as output from git commands is opened without ':utf8'
layer.
What made it work at all for STRING in 'latin1' encoding is the fact
that utf8:decode(STRING) turns on UTF-8 flag only if source string is
valid UTF-8 and contains multi-byte UTF-8 characters... and that if
string doesn't have UTF-8 flag set it is treated as in native Perl
encoding, i.e. 'latin1' / 'iso-8859-1' (unless native encoding it is
EBCDIC ;-)). It was ':utf8' layer that actually converted 'latin1'
(no UTF-8 flag == native == 'latin1) to 'utf8'.
Let's make use of the fact that utf8:decode(STRING) returns false if
STRING is invalid as UTF-8 to check whether to enable fallback mode.
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
I'm sorry for overly wordy commit message...
gitweb/gitweb.perl | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index d24763b..75b0970 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1443,8 +1443,7 @@ sub validate_refname {
sub to_utf8 {
my $str = shift;
return undef unless defined $str;
- if (utf8::valid($str)) {
- utf8::decode($str);
+ if (utf8::valid($str) && utf8::decode($str)) {
return $str;
} else {
return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
--
1.7.6
next prev parent reply other threads:[~2011-12-18 22:01 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
2011-12-06 1:07 ` Jeff King
2011-12-07 0:37 ` Junio C Hamano
2011-12-10 16:18 ` Jakub Narebski
2011-12-12 5:26 ` Junio C Hamano
2011-12-18 22:00 ` Jakub Narebski [this message]
2012-01-06 16:35 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=201112182300.59409.jnareb@gmail.com \
--to=jnareb@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=ismail@pardus.org.tr \
--cc=jk@blackdown.de \
--cc=warthog9@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.