From: Jakub Narebski <jnareb@gmail.com>
To: "Praveen A" <pravi.a@gmail.com>
Cc: git@vger.kernel.org, "Santhosh Thottingal" <santhosh00@gmail.com>
Subject: Re: gitweb and unicode special characters
Date: Fri, 12 Dec 2008 11:37:57 -0800 (PST)
Message-ID: <m37i65gp6b.fsf@localhost.localdomain>
In-Reply-To: <3f2beab60812121033r5d41894t77acc271b7c6955c@mail.gmail.com>
"Praveen A" <pravi.a@gmail.com> writes:
> Git currently does not handle the Unicode special characters ZWJ and ZWNJ.
> Both are heavily used in Malayalam and are common in other languages
> needing complex text layout, like Sinhala and Arabic.
>
> An example of this is shown in the commit message here
> http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
>
> \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> need to handle them like any other Unicode character, especially since this
> is a commit message and the expectation is normal plain text display.
>
> I hope someone will fix this.

Well, I am a bit stumped. git_commit calls format_log_line_html, which
in turn calls esc_html. esc_html looks like this:

  sub esc_html ($;%) {
          my $str = shift;
          my %opts = @_;

  **      $str = to_utf8($str);
          $str = $cgi->escapeHTML($str);
          if ($opts{'-nbsp'}) {
                  $str =~ s/ /&nbsp;/g;
          }
  **      $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
          return $str;
  }

The two important lines are marked with '**'. Note that the to_utf8
subroutine is a very simple wrapper:

  # decode sequences of octets in utf8 into Perl's internal form,
  # which is utf-8 with utf8 flag set if needed.  gitweb writes out
  # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
  sub to_utf8 {
          my $str = shift;
          if (utf8::valid($str)) {
                  utf8::decode($str);
                  return $str;
          } else {
                  return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
          }
  }

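A minimal standalone sketch of what to_utf8 does with the octets in question. It copies the subroutine above; the value 'latin1' for $fallback_encoding is an assumption made for this sketch only, and the input is the three-octet UTF-8 encoding of ZWNJ between two ASCII letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Assumption for this sketch only: the value of the fallback encoding.
our $fallback_encoding = 'latin1';

# Copied from the message above.
sub to_utf8 {
	my $str = shift;
	if (utf8::valid($str)) {
		utf8::decode($str);
		return $str;
	} else {
		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
	}
}

# "a", then ZWNJ (U+200C) as three UTF-8 octets, then "b"
my $octets = "a\xE2\x80\x8Cb";
my $chars  = to_utf8($octets);

# Print lengths and code points only, so no wide characters reach STDOUT.
printf "octets: %d, characters: %d\n", length($octets), length($chars);
printf "U+%04X\n", ord($_) for split //, $chars;
```

After decoding, the ZWNJ is a single character with code point U+200C, and that is what the later [:cntrl:] substitution in esc_html gets to see.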
So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
belonging to the '[:cntrl:]' class. I don't know whether that is correct
from the point of view of Unicode character classes, and therefore whether
this is a bug in Perl or just in gitweb.
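One way to check this on a given Perl is the small sketch below. It reports which of the candidate classes each character matches, and also prints the code point in octal, since the \20014 and \20015 in the report are consistent with a backslash followed by the octal code point of U+200C and U+200D. In the Unicode character database both characters have general category Cf (Format), not Cc (Control); what '[:cntrl:]' matches may depend on the Perl version, so no expected output is given for it. The helper name classes_of is made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %chars = (
	'TAB'  => "\t",
	'ZWNJ' => "\x{200C}",
	'ZWJ'  => "\x{200D}",
);

# Hypothetical helper: list the candidate classes a character matches.
sub classes_of {
	my $c = shift;
	my @hit;
	push @hit, '[:cntrl:]' if $c =~ /[[:cntrl:]]/;
	push @hit, '\p{Cc}'    if $c =~ /\p{Cc}/;
	push @hit, '\p{Cf}'    if $c =~ /\p{Cf}/;
	return @hit;
}

for my $name (sort keys %chars) {
	my $c = $chars{$name};
	printf "%-4s U+%04X octal \\%o: %s\n",
		$name, ord($c), ord($c), join(' ', classes_of($c));
}
```

If '[:cntrl:]' shows up for ZWNJ and ZWJ on the Perl that runs gitweb, the substitution in esc_html is what turns them into escapes.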
We might need a guard similar to ($1 ne "\t"), such as (ord($1) < 127)
or something like it... or perhaps we shouldn't use the POSIX character
class [:cntrl:] at all when dealing with Unicode, but something different,
e.g. \p{Cc} or \p{Control}, or perhaps \p{C} (other). I don't know
Perl (or Unicode) well enough to decide...
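The first two options could look roughly like the sketch below. This is not a tested gitweb patch: quot_octal is a hypothetical stand-in for quot_cec (backslash plus octal code point, no HTML markup), and the two esc_cntrl_* names are invented for the comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's quot_cec: backslash + octal code point.
sub quot_octal {
	my $c = shift;
	return sprintf('\%03o', ord($c));
}

# Variant 1: keep [:cntrl:], but only escape characters below 127.
sub esc_cntrl_guarded {
	my $str = shift;
	$str =~ s|([[:cntrl:]])|(($1 ne "\t" && ord($1) < 127) ? quot_octal($1) : $1)|eg;
	return $str;
}

# Variant 2: match the Unicode general category Cc (Control) explicitly.
sub esc_cntrl_cc {
	my $str = shift;
	$str =~ s|(\p{Cc})|(($1 ne "\t") ? quot_octal($1) : $1)|eg;
	return $str;
}

# "a", ZWNJ, "b", TAB, "c", control character U+0001, "d"
my $in = "a\x{200C}b\tc\x01d";
for my $f (\&esc_cntrl_guarded, \&esc_cntrl_cc) {
	my $out = $f->($in);
	# Print code points rather than raw text to avoid wide-character output.
	print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $out), "\n";
}
```

Both variants leave ZWNJ and the tab alone and still escape U+0001. Note that the (ord($1) < 127) guard also stops escaping DEL (127) and the C1 control characters (128-159), which \p{Cc} would still catch; \p{C} is broader and includes the Cf characters, so it would presumably escape ZWNJ and ZWJ again.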
P.S. Even that might not help much, as Savannah uses git and gitweb
version 1.5.6.5, which is probably the version shipped with some major
distribution. As of now we are at 1.6.0.5...
--
Jakub Narebski
Poland
ShadeHawk on #git