From: Jakub Narebski <jnareb@gmail.com>
To: "Praveen A" <pravi.a@gmail.com>
Cc: git@vger.kernel.org, "Santhosh Thottingal" <santhosh00@gmail.com>
Subject: Re: gitweb and unicode special characters
Date: Sat, 13 Dec 2008 02:31:05 +0100 [thread overview]
Message-ID: <200812130231.06929.jnareb@gmail.com> (raw)
In-Reply-To: <3f2beab60812121655m6cd868bfhaaf386e6f5457533@mail.gmail.com>
On Sat, 13 Dec 2008 01:55, Praveen A wrote:
> 2008/12/12 Jakub Narebski <jnareb@gmail.com>:
>> Jakub Narebski <jnareb@gmail.com> writes:
>>> "Praveen A" <pravi.a@gmail.com> writes:
>>>
>>>> Git currently does not handle unicode special characters ZWJ and ZWNJ,
>>>> both are heavily used in Malayalam and common in other languages
>>>> needing complex text layout like Sinhala and Arabic.
>>>>
>>>> An example of this is shown in the commit message here
>>>> http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
>>>>
>>>> \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
>>>> need to handle them as any other unicode character - especially it is
>>>> a commit message and expectation is normal pain text display.
>>>
>>> [...] git_commit calls format_log_line_html, which
>>> in turn calls esc_html. esc_html looks like this:
>>>
>>> sub esc_html ($;%) {
[...]
>>> ** $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>>> return $str;
>>> }
>>>
>>> The two important lines are marked with '**'.
>> [...]
>>
>>> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
>>> belonging to '[:cntrl:]' class. I don't know if it is correct from the
>>> point of view of Unicode character classes, therefore if it is a bug
>>> in Perl, or just in gitweb.
>>
>> I checked this, via this simple Perl script:
[...]
>> "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";
>>
>> And the answer was:
>>
>> oct=20014 dex=8204 hex=200c
>> is [:cntrl:]
>>
>> 'ZERO WIDTH NON-JOINER' _is_ control character... We probably should
>> use [^[:print:][:space:]] instead of [[:cntrl:]] here.
>
> That looks good. But I'm wondering why do we need to filter at all?
> Is it a security concern? It is just description.
First, from the new description [^[:print:][:space:]], or even
[^[:print:]] (whichever we choose) you can see that those characters
we are showing using C (\r, \v, \b,...) + octal (in older gitweb) or
hex (in never gitweb) escapes would be invisible otherwise, or do
the strange things like \b aka backspace character.
Sidenote: There is probably one exception we want to add, namely not
escape '\r' at the end of line, to be able to deal better with DOS
line endings (\r\n).
Second, and that is I think reason we started to escape control
characters like \014 or ^L i.e. FORM FEED (FF) character (e.g. in
COPYING file), or \033 or ^[ i.e. ESCAPE (\e) character (e.g. commit
20a3847d) is that they are not allowed in XML, which means that they
are not allowed in XHTML, which means that if they are on the page,
and MIME-type is 'application/xml+html' forcing strict XML/XHTML mode
validating browsers would not display the page because it is not valid
XHTML. Mozilla 1.17.2 did this, and it would not show page; I don't
know how it works with more modern browsers.
--
Jakub Narebski
Poland
next prev parent reply other threads:[~2008-12-13 1:32 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-12-12 18:33 gitweb and unicode special characters Praveen A
2008-12-12 19:37 ` Jakub Narebski
2008-12-12 22:09 ` Jakub Narebski
2008-12-13 0:55 ` Praveen A
2008-12-13 1:31 ` Jakub Narebski [this message]
2008-12-13 3:06 ` Edward Z. Yang
2008-12-13 22:08 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200812130231.06929.jnareb@gmail.com \
--to=jnareb@gmail.com \
--cc=git@vger.kernel.org \
--cc=pravi.a@gmail.com \
--cc=santhosh00@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).