From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: "Jürgen Kreileder" <jk@blackdown.de>,
"John Hawley" <warthog9@kernel.org>
Subject: [RFD] Handling of non-UTF8 data in gitweb
Date: Sun, 4 Dec 2011 17:09:30 +0100
Message-ID: <201112041709.32212.jnareb@gmail.com>
Hello!
Currently gitweb converts data it receives from git commands to Perl's
internal utf8 representation via the to_utf8() subroutine:
	# decode sequences of octets in utf8 into Perl's internal form,
	# which is utf-8 with utf8 flag set if needed.  gitweb writes out
	# in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
	sub to_utf8 {
		my $str = shift;
		return undef unless defined $str;
		if (utf8::valid($str)) {
			utf8::decode($str);
			return $str;
		} else {
			return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
		}
	}
Each piece of data must be handled separately.  This is quite an
error-prone process, as can be seen from the number of patches that fix
handling of UTF-8 data (the latest from Jürgen).
It would be much, much simpler to force opening all files (including
output pipes from git commands) in ':utf8' mode:

	use open qw(:std :utf8);
[Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)'
there...]
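
To make the effect of the pragma concrete, here is a minimal sketch.
It assumes a Unix-like system where list-form pipe opens work, and it
uses a child perl process ($^X) as a stand-in for a git command so that
the example does not depend on any repository; nothing here is taken
from gitweb itself.

```perl
use strict;
use warnings;
use open qw(:std :utf8);

# Stand-in for a git command: a child perl that emits the two octets
# C3 A9 (UTF-8 for U+00E9) followed by a newline.
open my $fd, '-|', $^X, '-e', 'print "\xC3\xA9\n"'
    or die "cannot start child process: $!";
my $line = <$fd>;
close $fd;
chomp $line;

# With the pragma in effect the octets should already be decoded when
# they are read, so no explicit to_utf8() call is made here.
printf "%d character(s), U+%04X\n", length($line), ord($line);
```

The point is that decoding happens in the I/O layer, so callers would
no longer have to remember to decode each value they extract.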
But doing this would change gitweb's behavior.  Currently, when
encountering something (usually a line of output) that is not valid
UTF-8, we decode it (to UTF-8) using $fallback_encoding, which is
'latin1' by default.  Note, however, that this value is set per gitweb
installation, not per repository.
Using "use open qw(:std :utf8);" would be like changing the value of
$fallback_encoding to 'utf8' -- errors would be ignored, and characters
which are invalid in UTF-8 encoding would get replaced[1] with
substitution character '�' U+FFFD.
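
The difference between the two behaviors can be seen on a single
malformed byte.  This is only an illustration with a made-up input
string, using Encode::decode() directly rather than an I/O layer:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "caf\xE9" is "café" in latin1; a lone 0xE9 octet is not valid UTF-8.
my $octets = "caf\xE9";

# Lenient UTF-8 decoding: the malformed octet is replaced by U+FFFD.
my $as_utf8 = decode('UTF-8', $octets, Encode::FB_DEFAULT());

# Current gitweb fallback (default setting): treat the octets as latin1.
my $as_latin1 = decode('latin1', $octets, Encode::FB_DEFAULT());

printf "utf8:   U+%04X\n", ord(substr($as_utf8, 3, 1));
printf "latin1: U+%04X\n", ord(substr($as_latin1, 3, 1));
```

In the first case the original character is lost; in the second it
survives, provided the data really was latin1.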
At least for HTML output we could use Encode::FB_HTMLCREF handling
(which would produce &#NNN;) or Encode::FB_XMLCREF (which would
produce &#xHHHH;), but this would have to be done after HTML
escaping... and is probably not worth it.  (FYI, this can be done by
setting $PerlIO::encoding::fallback to either of those values[2].)
[1] http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
http://p3rl.org/Encode
[2] http://perldoc.perl.org/PerlIO/encoding.html
http://p3rl.org/PerlIO::encoding
I don't know if people are relying on the old behavior.  I guess
it could be emulated by defining our own 'utf-8-with-fallback'
encoding, or by defining our own PerlIO layer with PerlIO::via.
But it would no longer be a simple solution (though still automatic).
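
Whatever mechanism is chosen, the rule it would have to apply to each
chunk of input is roughly the following.  This is a sketch written as a
plain function, not as an Encode encoding or a PerlIO layer; the name
decode_with_fallback is hypothetical, and $fallback_encoding is
declared locally just to keep the example self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'latin1';   # same default as gitweb's setting

# Hypothetical helper: try strict UTF-8 first, and only if that fails
# decode the same octets using the fallback encoding.
sub decode_with_fallback {
    my $octets = shift;
    return undef unless defined $octets;
    my $str = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact
        decode('UTF-8', $octets,
               Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $str if defined $str;
    return decode($fallback_encoding, $octets, Encode::FB_DEFAULT());
}

my $from_utf8   = decode_with_fallback("caf\xC3\xA9");  # valid UTF-8
my $from_latin1 = decode_with_fallback("caf\xE9");      # latin1 octets
printf "%s\n", $from_utf8 eq $from_latin1 ? "same string" : "different";
```

Note that the decision is made per chunk, so a layer-based
implementation would still have to decide what a chunk is (a line, a
buffer, ...).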
An alternative approach would be to audit the gitweb code and call
to_utf8 before storing extracted output of a git command in a variable
(excluding safe types like SHA-1, filemode, timestamp and timezone).
The fact that to_utf8 is idempotent and can be called multiple times
would help here, I think.
The correct solution would of course be to respect the `gui.encoding`
per-repository config variable and the `encoding` gitattribute...
though the latter is hampered by the fact that there is currently
no way to read an attribute with "git check-attr" from a given tree:
think of a diff that changes the encoding of a file!
--
Jakub Narebski
Poland