From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: "Jürgen Kreileder" <jk@blackdown.de>,
"John Hawley" <warthog9@kernel.org>
Subject: [RFD] Handling of non-UTF8 data in gitweb
Date: Sun, 4 Dec 2011 17:09:30 +0100
Message-ID: <201112041709.32212.jnareb@gmail.com>

Hello!

Currently gitweb converts the data it receives from git commands to Perl's
internal utf8 representation via the to_utf8() subroutine:

    # decode sequences of octets in utf8 into Perl's internal form,
    # which is utf-8 with utf8 flag set if needed.  gitweb writes out
    # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
    sub to_utf8 {
        my $str = shift;
        return undef unless defined $str;
        if (utf8::valid($str)) {
            utf8::decode($str);
            return $str;
        } else {
            return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
        }
    }

Each piece of data must be handled separately.  This is quite an
error-prone process, as can be seen from the number of patches that fix
handling of UTF-8 data (the latest from Jürgen).

It would be much, much simpler to force opening of all files (including
output pipes from git commands) in ':utf8' mode:

    use open qw(:std :utf8);

[Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)'
there...]
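
A minimal sketch of what the pragma changes, for illustration only.  The
helper read_first_line_decoded() is hypothetical and a temporary file
stands in for the output of a git command; the point is that the plain
three-argument open() inside the scope of "use open" hands back
characters rather than octets, with no call to to_utf8():

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # default layers for open() in this lexical scope
use File::Temp qw(tempfile);

# Hypothetical helper: writes raw octets to a temporary file (standing in
# for the output of a git command) and reads the first line back through
# an open() that specifies no layers of its own.
sub read_first_line_decoded {
    my ($octets) = @_;

    # File::Temp opens the handle in its own scope, so the pragma above
    # does not apply to it; ':raw' makes sure we write plain bytes.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    binmode $tmp, ':raw';
    print {$tmp} $octets;
    close $tmp or die "close: $!";

    # No layer given here: the ':utf8' default from "use open" is used.
    open my $in, '<', $path or die "open: $!";
    my $line = <$in>;
    close $in;
    chomp $line;
    return $line;
}

# "Jürgen" as UTF-8 octets (7 bytes) comes back as 6 characters.
my $subject = read_first_line_decoded("J\xc3\xbcrgen\n");
printf "%d characters\n", length $subject;
```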

But doing this would change gitweb's behavior.  Currently, when gitweb
encounters something (usually a line of output) that is not valid
UTF-8, it decodes it using $fallback_encoding, by default 'latin1'.
Note, however, that this value is per gitweb installation, not per
repository.

Using "use open qw(:std :utf8);" would be like changing the value of
$fallback_encoding to 'utf8' -- errors would be ignored, and byte
sequences which are invalid in UTF-8 would get replaced[1] with the
substitution character '�' U+FFFD.

For HTML output, at least, we could use Encode::FB_HTMLCREF
handling (which would produce &#NNN;) or Encode::FB_XMLCREF (which
would produce &#xHHHH;), but this would have to be done after HTML
escaping and is probably not worth it.  (FYI, this can be done by
setting $PerlIO::encoding::fallback to either of those values[2].)
[1] http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://p3rl.org/Encode
[2] http://perldoc.perl.org/PerlIO/encoding.html
http://p3rl.org/PerlIO::encoding
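
To make the behavior change described above concrete, here is a small
sketch using Encode directly (not gitweb code; the sample bytes are made
up).  It decodes the same latin1 octets once with the latin1 fallback,
as gitweb does today for invalid UTF-8, and once as UTF-8 only:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Jürgen" encoded as latin1: the lone byte 0xFC is not valid UTF-8.
my $octets = "J\xfcrgen";

# Current behavior for invalid UTF-8: decode using the fallback encoding.
my $with_fallback = decode('latin1', $octets, Encode::FB_DEFAULT);

# Decoding the same octets as UTF-8 with the default handling of
# malformed data: the offending byte becomes a substitution character.
my $utf8_only = decode('UTF-8', $octets, Encode::FB_DEFAULT);

binmode STDOUT, ':utf8';
print "latin1 fallback: $with_fallback\n";
print "UTF-8 only:      $utf8_only\n";
```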

I don't know if people are relying on the old behavior.  I guess
it could be emulated by defining our own 'utf-8-with-fallback'
encoding, or by defining our own PerlIO layer with PerlIO::via.
But then it would no longer be a simple solution (though still
automatic).
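
As a rough sketch of the rule such an encoding or layer would have to
apply: try strict UTF-8 first, and only on failure decode with the
fallback encoding.  This shows the decoding rule only, not the
Encode::Encoding or PerlIO::via plumbing, and the name
decode_utf8_with_fallback() is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'latin1';    # same default as gitweb

# Hypothetical helper: strict UTF-8 decode, else fallback encoding.
sub decode_utf8_with_fallback {
    my ($octets) = @_;
    return undef unless defined $octets;

    # With FB_CROAK decode() dies on malformed input and may modify its
    # argument in place, so work on a copy inside eval.
    my $copy = $octets;
    my $str  = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $str if defined $str;

    return decode($fallback_encoding, $octets, Encode::FB_DEFAULT);
}

# Both spellings of "Jürgen" end up as the same character string.
my $from_utf8   = decode_utf8_with_fallback("J\xc3\xbcrgen");
my $from_latin1 = decode_utf8_with_fallback("J\xfcrgen");
```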

An alternative approach would be to audit the gitweb code, and call
to_utf8 before storing extracted output of a git command in a variable
(excluding safe types like SHA-1, filemode, timestamp and timezone).
The fact that to_utf8 is idempotent and can be called multiple times
would help here, I think.
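
The audit approach might look roughly like this.  Everything here is an
assumption made for illustration: parse_ls_tree_record() is a
hypothetical parser for one record of "git ls-tree -z" output, and the
to_utf8() below is a stand-in with the same intent as gitweb's, not a
copy of it.  Only the file name is decoded; mode, type and SHA-1 are
plain ASCII and are left alone:

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'latin1';

# Stand-in for gitweb's to_utf8(): strings that are already decoded are
# returned unchanged, valid UTF-8 octets are decoded, anything else goes
# through the fallback encoding.
sub to_utf8 {
    my $str = shift;
    return undef unless defined $str;
    return $str if utf8::is_utf8($str) || utf8::decode($str);
    return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
}

# Hypothetical parser: "<mode> <type> <sha1>\t<name>"
sub parse_ls_tree_record {
    my ($record) = @_;
    my ($mode, $type, $sha1, $name) =
        $record =~ /^([0-7]{6}) (\S+) ([0-9a-f]{40})\t(.+)$/s
        or return;
    return {
        mode => $mode,             # safe type, no decoding needed
        type => $type,             # safe type
        sha1 => $sha1,             # safe type
        name => to_utf8($name),    # decoded once, where it is stored
    };
}

my $entry = parse_ls_tree_record(
    "100644 blob " . ("a" x 40) . "\tJ\xc3\xbcrgen.txt");
```

Because the value is decoded where it is stored, a later to_utf8() call
on $entry->{name} elsewhere in the code leaves it unchanged.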

The correct solution would of course be to respect the `gui.encoding`
per-repository config variable and the `encoding` gitattribute...
though the latter is hampered by the fact that there is currently
no way to read an attribute with "git check-attr" from a given tree:
think of a diff that changes the encoding of a file!

--
Jakub Narebski
Poland