* gitweb and unicode special characters
From: Praveen A @ 2008-12-12 18:33 UTC
To: git; +Cc: Santhosh Thottingal
Hi,
Git currently does not handle unicode special characters ZWJ and ZWNJ,
both are heavily used in Malayalam and common in other languages
needing complex text layout like Sinhala and Arabic.
An example of this is shown in the commit message here
http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
\20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
need to handle them as any other unicode character - especially it is
a commit message and expectation is normal plain text display.
I hope someone will fix this.
- Praveen
--
പ്രവീണ് അരിമ്പ്രത്തൊടിയില്
<GPLv2> I know my rights; I want my phone call!
<DRM> What use is a phone call, if you are unable to speak?
(as seen on /.)
Join The DRM Elimination Crew Now!
http://fci.wikia.com/wiki/Anti-DRM-Campaign
* Re: gitweb and unicode special characters
From: Jakub Narebski @ 2008-12-12 19:37 UTC
To: Praveen A; +Cc: git, Santhosh Thottingal
"Praveen A" <pravi.a@gmail.com> writes:
> Git currently does not handle unicode special characters ZWJ and ZWNJ,
> both are heavily used in Malayalam and common in other languages
> needing complex text layout like Sinhala and Arabic.
>
> An example of this is shown in the commit message here
> http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
>
> \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> need to handle them as any other unicode character - especially it is
> a commit message and expectation is normal plain text display.
>
> I hope someone will fix this.
Well, I am a bit stumped. git_commit calls format_log_line_html, which
in turn calls esc_html. esc_html looks like this:
sub esc_html ($;%) {
    my $str = shift;
    my %opts = @_;
**  $str = to_utf8($str);
    $str = $cgi->escapeHTML($str);
    if ($opts{'-nbsp'}) {
        $str =~ s/ /&nbsp;/g;
    }
**  $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
    return $str;
}
The two important lines are marked with '**'. Note that the to_utf8
subroutine is a very simple wrapper:
# decode sequences of octets in utf8 into Perl's internal form,
# which is utf-8 with utf8 flag set if needed. gitweb writes out
# in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
sub to_utf8 {
    my $str = shift;
    if (utf8::valid($str)) {
        utf8::decode($str);
        return $str;
    } else {
        return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
    }
}
So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
belonging to the '[:cntrl:]' class. I don't know if that is correct from
the point of view of Unicode character classes, and therefore whether it
is a bug in Perl or just in gitweb.
We might need a guard similar to ($1 ne "\t"), like (ord($1) < 127)
or something... or perhaps we shouldn't use the POSIX character class
[:cntrl:] but something different when dealing with Unicode,
e.g. \p{Cc} or \p{Control}, or perhaps \p{C} (other). I don't know
Perl (nor Unicode) well enough to decide...
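A minimal sketch of how those candidate classes treat the two characters,
assuming only core Perl. U+200C (ZWNJ) and U+200D (ZWJ) are in Unicode
general category Cf (Format), so \p{Cc} should not match them while \p{C}
still would. What [[:cntrl:]] matches has differed between Perl versions,
so the script prints the result instead of assuming it. The %chars and
@classes names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Characters given by code point, so no charnames aliases are needed.
# FORM FEED is included as an example of a "real" control character.
my %chars = (ZWNJ => "\x{200C}", ZWJ => "\x{200D}", FF => "\x{0C}");

# Candidate classes for deciding what esc_html should escape.
my @classes = (
    [ '[[:cntrl:]]', qr/[[:cntrl:]]/ ],
    [ '\p{Cc}',      qr/\p{Cc}/ ],
    [ '\p{Cf}',      qr/\p{Cf}/ ],
    [ '\p{C}',       qr/\p{C}/ ],
);

for my $name (sort keys %chars) {
    for my $class (@classes) {
        my ($label, $re) = @$class;
        printf "%-4s %-12s %s\n", $name, $label,
            ($chars{$name} =~ $re ? 'match' : 'no match');
    }
}
```

Running this on the Perl that the gitweb installation actually uses would
show whether switching to \p{Cc} is enough there.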
P.S. Even that might not help much, as Savannah uses git and gitweb
version 1.5.6.5, which is probably the version released with some major
distribution. As of now we are at 1.6.0.5...
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: gitweb and unicode special characters
From: Jakub Narebski @ 2008-12-12 22:09 UTC
To: Praveen A; +Cc: git, Santhosh Thottingal
Jakub Narebski <jnareb@gmail.com> writes:
> "Praveen A" <pravi.a@gmail.com> writes:
>
> > Git currently does not handle unicode special characters ZWJ and ZWNJ,
> > both are heavily used in Malayalam and common in other languages
> > needing complex text layout like Sinhala and Arabic.
> >
> > An example of this is shown in the commit message here
> > http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
> >
> > \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> > need to handle them as any other unicode character - especially it is
> > a commit message and expectation is normal plain text display.
> >
> > I hope someone will fix this.
>
> Well, I am a bit stumped. git_commit calls format_log_line_html, which
> in turn calls esc_html. esc_html looks like this:
>
> sub esc_html ($;%) {
> my $str = shift;
> my %opts = @_;
>
> ** $str = to_utf8($str);
> $str = $cgi->escapeHTML($str);
> if ($opts{'-nbsp'}) {
> $str =~ s/ /&nbsp;/g;
> }
> ** $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
> return $str;
> }
>
> The two important lines are marked with '**'.
[...]
> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
> belonging to '[:cntrl:]' class. I don't know if it is correct from the
> point of view of Unicode character classes, therefore if it is a bug
> in Perl, or just in gitweb.
I checked this, via this simple Perl script:
#!/usr/bin/perl
use charnames ":full";
my $c = ord("\N{ZWNJ}");
printf "oct=%o dec=%d hex=%x\n", $c, $c, $c;
"\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";
And the answer was:
oct=20014 dec=8204 hex=200c
is [:cntrl:]
'ZERO WIDTH NON-JOINER' _is_ a control character... We probably should
use [^[:print:][:space:]] instead of [[:cntrl:]] here.
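A rough sketch of what that substitution could look like in isolation.
show_escape is a hypothetical stand-in for gitweb's quot_cec, and
escape_unprintable is an illustrative name, not a gitweb function.
Whether ZWNJ and ZWJ count as [:print:] depends on the Perl version, so
that should be checked on the target system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's quot_cec: render one character
# as a visible hex escape.
sub show_escape {
    my $chr = shift;
    return sprintf('\x{%x}', ord($chr));
}

# Escape every character that is neither printable nor whitespace.
sub escape_unprintable {
    my $str = shift;
    $str =~ s|([^[:print:][:space:]])|show_escape($1)|eg;
    return $str;
}

# Backspace and ESCAPE get escaped; the tab is left alone.
print escape_unprintable("back\x{08}space and esc\x{1B}ape\tdone"), "\n";
```

Note that [:space:] also covers form feed, vertical tab and carriage
return, so with this class alone those would no longer be escaped.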
[...]
> P.S. Even that might not help much, as Savannah uses git and gitweb
> version 1.5.6.5, which is probably the version released with some major
> distribution. As of now we are at 1.6.0.5...
Which can be seen from the fact that gitweb uses octal escapes,
instead of hex escapes...
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: gitweb and unicode special characters
From: Praveen A @ 2008-12-13 0:55 UTC
To: Jakub Narebski; +Cc: git, Santhosh Thottingal
2008/12/12 Jakub Narebski <jnareb@gmail.com>:
> Jakub Narebski <jnareb@gmail.com> writes:
>> "Praveen A" <pravi.a@gmail.com> writes:
>>
>> > Git currently does not handle unicode special characters ZWJ and ZWNJ,
>> > both are heavily used in Malayalam and common in other languages
>> > needing complex text layout like Sinhala and Arabic.
>> >
>> > An example of this is shown in the commit message here
>> > http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
>> >
>> > \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
>> > need to handle them as any other unicode character - especially it is
>> > a commit message and expectation is normal plain text display.
>> >
>> > I hope someone will fix this.
>>
>> Well, I am a bit stumped. git_commit calls format_log_line_html, which
>> in turn calls esc_html. esc_html looks like this:
>>
>> sub esc_html ($;%) {
>> my $str = shift;
>> my %opts = @_;
>>
>> ** $str = to_utf8($str);
>> $str = $cgi->escapeHTML($str);
>> if ($opts{'-nbsp'}) {
>> $str =~ s/ /&nbsp;/g;
>> }
>> ** $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>> return $str;
>> }
>>
>> The two important lines are marked with '**'.
> [...]
>
>> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
>> belonging to '[:cntrl:]' class. I don't know if it is correct from the
>> point of view of Unicode character classes, therefore if it is a bug
>> in Perl, or just in gitweb.
>
> I checked this, via this simple Perl script:
>
> #!/usr/bin/perl
>
> use charnames ":full";
>
> my $c = ord("\N{ZWNJ}");
> printf "oct=%o dec=%d hex=%x\n", $c, $c, $c;
>
> "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";
>
> And the answer was:
>
> oct=20014 dec=8204 hex=200c
> is [:cntrl:]
>
> 'ZERO WIDTH NON-JOINER' _is_ a control character... We probably should
> use [^[:print:][:space:]] instead of [[:cntrl:]] here.
That looks good. But I'm wondering why do we need to filter at all?
Is it a security concern? It is just description.
>
> [...]
>> P.S. Even that might not help much, as Savannah uses git and gitweb
>> version 1.5.6.5, which is probably the version released with some major
>> distribution. As of now we are at 1.6.0.5...
>
> Which can be seen from the fact that gitweb uses octal escapes,
> instead of hex escapes...
But we can expect it to work someday when Savannah updates their git
version, or we can bug them to upgrade if the fix is in an official git
release.
- Praveen
j4v4m4n
>
> --
> Jakub Narebski
> Poland
> ShadeHawk on #git
>
--
പ്രവീണ് അരിമ്പ്രത്തൊടിയില്
<GPLv2> I know my rights; I want my phone call!
<DRM> What use is a phone call, if you are unable to speak?
(as seen on /.)
Join The DRM Elimination Crew Now!
http://fci.wikia.com/wiki/Anti-DRM-Campaign
* Re: gitweb and unicode special characters
From: Jakub Narebski @ 2008-12-13 1:31 UTC
To: Praveen A; +Cc: git, Santhosh Thottingal
On Sat, 13 Dec 2008 01:55, Praveen A wrote:
> 2008/12/12 Jakub Narebski <jnareb@gmail.com>:
>> Jakub Narebski <jnareb@gmail.com> writes:
>>> "Praveen A" <pravi.a@gmail.com> writes:
>>>
>>>> Git currently does not handle unicode special characters ZWJ and ZWNJ,
>>>> both are heavily used in Malayalam and common in other languages
>>>> needing complex text layout like Sinhala and Arabic.
>>>>
>>>> An example of this is shown in the commit message here
>>>> http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
>>>>
>>>> \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
>>>> need to handle them as any other unicode character - especially it is
>>>> a commit message and expectation is normal plain text display.
>>>
>>> [...] git_commit calls format_log_line_html, which
>>> in turn calls esc_html. esc_html looks like this:
>>>
>>> sub esc_html ($;%) {
[...]
>>> ** $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>>> return $str;
>>> }
>>>
>>> The two important lines are marked with '**'.
>> [...]
>>
>>> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
>>> belonging to '[:cntrl:]' class. I don't know if it is correct from the
>>> point of view of Unicode character classes, therefore if it is a bug
>>> in Perl, or just in gitweb.
>>
>> I checked this, via this simple Perl script:
[...]
>> "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";
>>
>> And the answer was:
>>
>> oct=20014 dec=8204 hex=200c
>> is [:cntrl:]
>>
>> 'ZERO WIDTH NON-JOINER' _is_ a control character... We probably should
>> use [^[:print:][:space:]] instead of [[:cntrl:]] here.
>
> That looks good. But I'm wondering why do we need to filter at all?
> Is it a security concern? It is just description.
First, from the new description [^[:print:][:space:]], or even
[^[:print:]] (whichever we choose), you can see that the characters
we show using C escapes (\r, \v, \b,...) plus octal (in older gitweb) or
hex (in newer gitweb) escapes would otherwise be invisible, or would do
strange things, like \b aka the backspace character.
Sidenote: There is probably one exception we want to add, namely not
escape '\r' at the end of line, to be able to deal better with DOS
line endings (\r\n).
Second, and I think this is why we started to escape control
characters such as \014 or ^L, i.e. the FORM FEED (FF) character (e.g. in
the COPYING file), or \033 or ^[, i.e. the ESCAPE (\e) character (e.g. in
commit 20a3847d): they are not allowed in XML, and therefore not allowed
in XHTML. If they appear on the page and the MIME type is
'application/xhtml+xml', which forces strict XML/XHTML mode, validating
browsers will not display the page because it is not valid XHTML.
Mozilla 1.17.2 did this and would not show the page; I don't know how
more modern browsers behave.
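To illustrate the XML point, here is a small sketch that lists the code
points in a string falling outside the XML 1.0 "Char" production (TAB,
LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF). The helper name
xml_illegal_codepoints is made up for this example and is not part of
gitweb.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Anything NOT allowed by the XML 1.0 "Char" production.
my $xml_illegal =
    qr/[^\x{09}\x{0A}\x{0D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: return the XML-illegal code points found in $str,
# formatted as U+XXXX.
sub xml_illegal_codepoints {
    my $str = shift;
    return map { sprintf('U+%04X', ord($_)) } $str =~ /($xml_illegal)/g;
}

# FORM FEED and ESCAPE are reported; ZWNJ (U+200C) is not.
print join(', ', xml_illegal_codepoints("a\x{0C}b\x{1B}c\x{200C}d")), "\n";
```

ZWNJ and ZWJ fall inside U+0020-U+D7FF, so as far as XML validity is
concerned they would not need escaping.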
--
Jakub Narebski
Poland
* Re: gitweb and unicode special characters
From: Edward Z. Yang @ 2008-12-13 3:06 UTC
To: git
Jakub Narebski wrote:
> Sidenote: There is probably one exception we want to add, namely not
> escape '\r' at the end of line, to be able to deal better with DOS
> line endings (\r\n).
I'm sorry, but I have to disagree. I find being able to see \r
line-endings in the pretty-printed format is exceedingly useful for
figuring out if a file has been checked in with the wrong line-endings.
The number of files that must have \r line endings is vanishingly small
(BAT files are perhaps the one example I can think of right now).
Cheers,
Edward
* Re: gitweb and unicode special characters
From: Jakub Narebski @ 2008-12-13 22:08 UTC
To: Edward Z. Yang; +Cc: git
"Edward Z. Yang" <edwardzyang@thewritingpot.com> writes:
> Jakub Narebski wrote:
> > Sidenote: There is probably one exception we want to add, namely not
> > escape '\r' at the end of line, to be able to deal better with DOS
> > line endings (\r\n).
>
> I'm sorry, but I have to disagree. I find being able to see \r
> line-endings in the pretty-printed format is exceedingly useful for
> figuring out if a file has been checked in with the wrong line-endings.
> The number of files that must have \r line endings is vanishingly small
> (BAT files are perhaps the one example I can think of right now).
Well, it is a bit annoying if you have checked in a file with the wrong
line endings and have only just noticed... I was thinking about adding a
'(DOS)' or similar indicator at the bottom of the 'blob' and 'blame'
views, but I guess I can live with '\r'...
In short: I agree, that was not a good idea.
--
Jakub Narebski
Poland
ShadeHawk on #git