From: Jakub Narebski <jnareb@gmail.com>
To: "Praveen A" <pravi.a@gmail.com>
Cc: git@vger.kernel.org, "Santhosh Thottingal" <santhosh00@gmail.com>
Subject: Re: gitweb and unicode special characters
Date: Fri, 12 Dec 2008 14:09:05 -0800 (PST)
Message-ID: <m3y6ylf3mq.fsf@localhost.localdomain>
In-Reply-To: <m37i65gp6b.fsf@localhost.localdomain>

Jakub Narebski <jnareb@gmail.com> writes:
> "Praveen A" <pravi.a@gmail.com> writes:
> 
> > Git currently does not handle unicode special characters ZWJ and ZWNJ,
> > both are heavily used in Malayalam and common in other languages
> > needing complex text layout like Sinhala and Arabic.
> > 
> > An example of this is shown in the commit message here
> > http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
> > 
> > \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> > need to handle them as any other unicode character - especially it is
> > a commit message and expectation is normal plain text display.
> > 
> > I hope someone will fix this.
> 
> Well, I am a bit stumped.  git_commit calls format_log_line_html, which
> in turn calls esc_html.  esc_html looks like this:
> 
>   sub esc_html ($;%) {
>   	my $str = shift;
>   	my %opts = @_;
>   
>   **	$str = to_utf8($str);
>   	$str = $cgi->escapeHTML($str);
>   	if ($opts{'-nbsp'}) {
>   		$str =~ s/ /&nbsp;/g;
>   	}
>   **	$str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>   	return $str;
>   }
> 
> The two important lines are marked with '**'.
[...]

> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
> belonging to '[:cntrl:]' class. I don't know if it is correct from the
> point of view of Unicode character classes, and therefore whether it is
> a bug in Perl, or just in gitweb.

I checked this, via this simple Perl script:

  #!/usr/bin/perl

  use charnames ":full";

  my $c = ord("\N{ZWNJ}");
  printf "oct=%o dec=%d hex=%x\n", $c, $c, $c;

  "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";

And the answer was:

  oct=20014 dec=8204 hex=200c
  is [:cntrl:]

'ZERO WIDTH NON-JOINER' _is_ a control character... We probably should
use [^[:print:][:space:]] instead of [[:cntrl:]] here.
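
A minimal sketch of what that change could look like, outside of gitweb.
The helper quote_char() is a hypothetical stand-in for quot_cec(), whose
body is not shown in this message, so its output format is an assumption.
Whether ZWNJ and ZWJ escape the new class depends on how the running Perl
classifies them, so the script only reports that and assumes nothing:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's quot_cec(); the real helper is not
# shown in this message, so this output format is an assumption.
sub quote_char {
	my $char = shift;
	return sprintf('\\x{%x}', ord($char));
}

# Sketch of the proposed substitution: quote every character that is
# neither printable nor whitespace, instead of everything in [:cntrl:].
sub quote_unprintable {
	my $str = shift;
	$str =~ s|([^[:print:][:space:]])|quote_char($1)|eg;
	return $str;
}

# Report how this Perl classifies ZWNJ (U+200C) and ZWJ (U+200D).
# The answer may differ between Perl versions.
for my $cp (0x200C, 0x200D) {
	my $char = chr($cp);
	printf "U+%04X cntrl=%s proposed=%s\n", $cp,
		($char =~ /[[:cntrl:]]/            ? 'yes' : 'no'),
		($char =~ /[^[:print:][:space:]]/  ? 'yes' : 'no');
}

# TAB is whitespace and stays as-is; \x01 is quoted.
print quote_unprintable("a\tb\x01c"), "\n";
```

If "proposed" comes out as "no" for both code points, the new class would
leave them alone while still quoting real control characters such as \x01.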

[...]
> P.S. Even that might not help much, as Savannah uses git and gitweb
> version 1.5.6.5, which is probably version released with some major
> distribution.  As of now we are at 1.6.0.5...

Which can be seen from the fact that gitweb uses octal escapes,
instead of hex escapes...
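
For comparison, the two spellings of the same code points can be sketched
as below. The function name escapes() is hypothetical, and the exact hex
format that newer gitweb emits is not shown in this thread, so the second
sprintf is only an assumption about its general shape:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the octal and hex backslash-escapes for a code point.
# The octal form matches what the Savannah page shows (\20014, \20015);
# the hex form is an assumed illustration, not gitweb's actual code.
sub escapes {
	my $cp = shift;
	return (sprintf('\\%o', $cp), sprintf('\\%x', $cp));
}

my %chars = (ZWNJ => 0x200C, ZWJ => 0x200D);
for my $name (sort keys %chars) {
	my ($octal, $hex) = escapes($chars{$name});
	print "$name octal=$octal hex=$hex\n";
}
```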

-- 
Jakub Narebski
Poland
ShadeHawk on #git

