* [PATCH] gitweb: highlight: strip non-printable characters via col(1)
@ 2011-08-22 22:58 Christopher M. Fuhrman
2011-08-22 23:21 ` Junio C Hamano
2011-08-26 19:54 ` Jakub Narebski
0 siblings, 2 replies; 10+ messages in thread
From: Christopher M. Fuhrman @ 2011-08-22 22:58 UTC (permalink / raw)
To: gitster, git; +Cc: jnareb, cwilson, sylvain, Christopher M. Fuhrman
From: "Christopher M. Fuhrman" <cfuhrman@panix.com>
The current code, as is, passes control characters, such as form-feed
(^L) to highlight which then passes it through to the browser. This
will cause the browser to display one of the following warnings:
Safari v5.1 (6534.50) & Google Chrome v13.0.782.112:
This page contains the following errors:
error on line 657 at column 38: PCDATA invalid Char value 12
Below is a rendering of the page up to the first error.
Mozilla Firefox 3.6.19 & Mozilla Firefox 5.0:
XML Parsing Error: not well-formed
Location:
http://path/to/git/repo/blah/blah
Both errors were generated by gitweb.perl v1.7.3.4 w/ highlight 2.7
using arch/ia64/kernel/unwind.c from the Linux kernel.
Strip non-printable control-characters by piping the output produced
by git-cat-file(1) to col(1) as follows:
git cat-file blob deadbeef314159 | col -bx | highlight <args>
Note usage of the '-x' option which tells col(1) to output multiple
spaces instead of tabs.
Tested under OpenSuSE 11.4 & NetBSD 5.1 using perl 5.12.3 and perl
5.12.2 respectively using Safari, Firefox, and Google Chrome.
Signed-off-by: Christopher M. Fuhrman <cfuhrman@panix.com>
---
Howdy,
I haven't gotten any responses to my patch for a while, so I am now
submitting this for general inclusion into git. Please note that this
is based off the "maint" branch per Documentation/SubmittingPatches
For an example of this bug in action, see:
* http://git.fuhrbear.com/~cfuhrman/?p=linux/.git;a=blob;f=arch/alpha/kernel/core_titan.c;h=219bf271c0ba2e5f2d668af707df57fbbd00ccfd;hb=HEAD
* http://git.fuhrbear.com/~cfuhrman/?p=linux/.git;a=blob;f=arch/ia64/kernel/unwind.c;h=fed6afa2e8a9014e65229e51e64fa4b1c13cc284;hb=HEAD
WRT the col(1) command, I've verified that the binary is installed in
/usr/bin on OpenSuSE, NetBSD, OpenBSD, Solaris 10, and AIX. This
patch assumes that /usr/bin is in $PATH.
Cheers!
gitweb/gitweb.perl | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 50a835a..4c68165 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -3656,6 +3656,7 @@ sub run_highlighter {
close $fd;
open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ".
+ "col -bx | ".
quote_command($highlight_bin).
" --replace-tabs=8 --fragment --syntax $syntax |"
or die_error(500, "Couldn't open file or run syntax highlighter");
--
1.7.5.4
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: highlight: strip non-printable characters via col(1)
2011-08-22 22:58 [PATCH] gitweb: highlight: strip non-printable characters via col(1) Christopher M. Fuhrman
@ 2011-08-22 23:21 ` Junio C Hamano
2011-08-26 19:54 ` Jakub Narebski
1 sibling, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2011-08-22 23:21 UTC (permalink / raw)
To: Christopher M. Fuhrman; +Cc: git, jnareb, cwilson, sylvain
"Christopher M. Fuhrman" <cfuhrman@panix.com> writes:
> Strip non-printable control-characters by piping the output produced
> by git-cat-file(1) to col(1) as follows:
>
> git cat-file blob deadbeef314159 | col -bx | highlight <args>
>
> Note usage of the '-x' option which tells col(1) to output multiple
> spaces instead of tabs.
Are all implementations of col known to correctly handle bytes with their
highest bit on, without mistaking them with unknown control sequences?
Has the code updated by your patch been tested with non-ASCII payload, at
least with UTF-8 outside US-ASCII?
In what locale does the code updated by your patch run under, and would
the use of "col" affected by the choice of the locale in a negative way?
For example, here is what I get on my box:
$ LANG=C LC_ALL=C col -bx <t/t3902-quoted.sh ; echo $?
col: Invalid or incomplete multibyte or wide character
1
that makes me ask you these questions.
> I haven't gotten any responses to my patch for a while, so I am now
> submitting this for general inclusion into git.
Unfortunately, no news is not good news around here, and that is why I
am asking you the above questions.
Thanks.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: highlight: strip non-printable characters via col(1)
2011-08-22 22:58 [PATCH] gitweb: highlight: strip non-printable characters via col(1) Christopher M. Fuhrman
2011-08-22 23:21 ` Junio C Hamano
@ 2011-08-26 19:54 ` Jakub Narebski
2011-08-26 21:44 ` Junio C Hamano
1 sibling, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2011-08-26 19:54 UTC (permalink / raw)
To: Christopher M. Fuhrman; +Cc: gitster, git, cwilson, sylvain
On Tue, 23 Aug 2011, Christopher M. Fuhrman wrote:
> The current code, as is, passes control characters, such as form-feed
> (^L) to highlight which then passes it through to the browser. This
> will cause the browser to display one of the following warnings:
>
> Safari v5.1 (6534.50) & Google Chrome v13.0.782.112:
>
> This page contains the following errors:
>
> error on line 657 at column 38: PCDATA invalid Char value 12
> Below is a rendering of the page up to the first error.
>
> Mozilla Firefox 3.6.19 & Mozilla Firefox 5.0:
>
> XML Parsing Error: not well-formed
> Location:
> http://path/to/git/repo/blah/blah
>
> Both errors were generated by gitweb.perl v1.7.3.4 w/ highlight 2.7
> using arch/ia64/kernel/unwind.c from the Linux kernel.
>
> Strip non-printable control-characters by piping the output produced
> by git-cat-file(1) to col(1) as follows:
>
> git cat-file blob deadbeef314159 | col -bx | highlight <args>
>
> Note usage of the '-x' option which tells col(1) to output multiple
> spaces instead of tabs.
Why use external program (which ming be not installed, or might not
strip control-characters), instead of making gitweb sanitize highlighter
output itself. Something like the patch below (which additionally
shows where there are control characters):
-- >8 --
diff --git i/gitweb/gitweb.perl w/gitweb/gitweb.perl
index 7cf12af..192db2c 100755
--- i/gitweb/gitweb.perl
+++ w/gitweb/gitweb.perl
@@ -1517,6 +1517,17 @@ sub esc_path {
return $str;
}
+# Sanitize for use in XHTML + application/xml+xhtml
+sub sanitize {
+ my $str = shift;
+
+ return undef unless defined $str;
+
+ $str = to_utf8($str);
+ $str =~ s|([[:cntrl:]])|quot_cec($1)|eg;
+ return $str;
+}
+
# Make control characters "printable", using character escape codes (CEC)
sub quot_cec {
my $cntrl = shift;
@@ -6546,7 +6557,8 @@ sub git_blob {
$nr++;
$line = untabify($line);
printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
- $nr, esc_attr(href(-replay => 1)), $nr, $nr, $syntax ? to_utf8($line) : esc_html($line, -nbsp=>1);
+ $nr, esc_attr(href(-replay => 1)), $nr, $nr,
+ $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
}
}
close $fd
-- 8< --
--
Jakub Narebski
Poland
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: highlight: strip non-printable characters via col(1)
2011-08-26 19:54 ` Jakub Narebski
@ 2011-08-26 21:44 ` Junio C Hamano
2011-08-26 22:06 ` Jakub Narebski
0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2011-08-26 21:44 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Christopher M. Fuhrman, git, cwilson, sylvain
Jakub Narebski <jnareb@gmail.com> writes:
> Why use external program (which ming be not installed, or might not
> strip control-characters), instead of making gitweb sanitize highlighter
> output itself. Something like the patch below (which additionally
> shows where there are control characters):
I agree that that would be a more sensible approach. What does your sample
code below do to a HT by the way?
> -- >8 --
> diff --git i/gitweb/gitweb.perl w/gitweb/gitweb.perl
> index 7cf12af..192db2c 100755
> --- i/gitweb/gitweb.perl
> +++ w/gitweb/gitweb.perl
> @@ -1517,6 +1517,17 @@ sub esc_path {
> return $str;
> }
>
> +# Sanitize for use in XHTML + application/xml+xhtml
> +sub sanitize {
> + my $str = shift;
> +
> + return undef unless defined $str;
> +
> + $str = to_utf8($str);
> + $str =~ s|([[:cntrl:]])|quot_cec($1)|eg;
> + return $str;
> +}
> +
> # Make control characters "printable", using character escape codes (CEC)
> sub quot_cec {
> my $cntrl = shift;
> @@ -6546,7 +6557,8 @@ sub git_blob {
> $nr++;
> $line = untabify($line);
> printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
> - $nr, esc_attr(href(-replay => 1)), $nr, $nr, $syntax ? to_utf8($line) : esc_html($line, -nbsp=>1);
> + $nr, esc_attr(href(-replay => 1)), $nr, $nr,
> + $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
> }
> }
> close $fd
>
> -- 8< --
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: highlight: strip non-printable characters via col(1)
2011-08-26 21:44 ` Junio C Hamano
@ 2011-08-26 22:06 ` Jakub Narebski
2011-09-16 12:41 ` [PATCH] gitweb: Strip non-printable characters from syntax highlighter output Jakub Narebski
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2011-08-26 22:06 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christopher M. Fuhrman, git, Christopher Wilson, Sylvain Rabot
On Fri, 26 Aug 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>
> > Why use external program (which ming be not installed, or might not
> > strip control-characters), instead of making gitweb sanitize highlighter
> > output itself. Something like the patch below (which additionally
> > shows where there are control characters):
>
> I agree that that would be a more sensible approach. What does your sample
> code below do to a HT by the way?
Actually the line earlier
$line = untabify($line);
replaces HT ("\t") with spaces.
> > -- >8 --
> > diff --git i/gitweb/gitweb.perl w/gitweb/gitweb.perl
> > index 7cf12af..192db2c 100755
> > --- i/gitweb/gitweb.perl
> > +++ w/gitweb/gitweb.perl
> > @@ -1517,6 +1517,17 @@ sub esc_path {
> > return $str;
> > }
> >
> > +# Sanitize for use in XHTML + application/xml+xhtml
> > +sub sanitize {
> > + my $str = shift;
> > +
> > + return undef unless defined $str;
> > +
> > + $str = to_utf8($str);
> > + $str =~ s|([[:cntrl:]])|quot_cec($1)|eg;
> > + return $str;
> > +}
Anyway, it could well be
+ $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
+ return $str;
like in esc_html rather than like in esc_path.
> > @@ -6546,7 +6557,8 @@ sub git_blob {
> > $nr++;
> > $line = untabify($line);
^^^^^^^^^^^^^^^^^^^^^^^^
> > printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
> > - $nr, esc_attr(href(-replay => 1)), $nr, $nr, $syntax ? to_utf8($line) : esc_html($line, -nbsp=>1);
> > + $nr, esc_attr(href(-replay => 1)), $nr, $nr,
> > + $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
> > }
> > }
> > close $fd
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
2011-08-26 22:06 ` Jakub Narebski
@ 2011-09-16 12:41 ` Jakub Narebski
2011-09-16 16:32 ` Junio C Hamano
2011-09-16 18:11 ` Christopher M. Fuhrman
0 siblings, 2 replies; 10+ messages in thread
From: Jakub Narebski @ 2011-09-16 12:41 UTC (permalink / raw)
To: Junio C Hamano, Christopher M. Fuhrman
Cc: git, Christopher Wilson, Sylvain Rabot
The current code, as is, passes control characters, such as form-feed
(^L) to highlight which then passes it through to the browser. User
agents (web browsers) that support 'application/xhtml+xml' usually
require that web pages declared as XHTML and with this mimetype are
well-formed XML. Unescaped control characters cannot appear within a
contents of a valid XML document.
This will cause the browser to display one of the following warnings:
* Safari v5.1 (6534.50) & Google Chrome v13.0.782.112:
This page contains the following errors:
error on line 657 at column 38: PCDATA invalid Char value 12
Below is a rendering of the page up to the first error.
* Mozilla Firefox 3.6.19 & Mozilla Firefox 5.0:
XML Parsing Error: not well-formed
Location:
http://path/to/git/repo/blah/blah
Both errors were generated by gitweb.perl v1.7.3.4 w/ highlight 2.7
using arch/ia64/kernel/unwind.c from the Linux kernel.
When syntax highlighter is not used, control characters are replaced
by esc_html(), but with syntax highlighter they were passed through to
browser (to_utf8() doesn't remove control characters).
Introduce sanitize() subroutine which strips forbidden characters, but
does not perform HTML escaping, and use it in git_blob() to sanitize
syntax highlighter output for XHTML.
Note that excluding "\t" (U+0009), "\n" (U+000A) and "\r" (U+000D) is
not strictly necessary, atleast for currently the only callsite: "\t"
tabs are replaced by spaces by untabify(), "\n" is stripped from each
line before processing it, and replacing "\r" could be considered
improvement.
Originally-by: Christopher M. Fuhrman <cfuhrman@panix.com>
Signed-off-by: Jakub Narebski <jnareb@gmail.com>
---
The commit message is from Christopher, but I have replaced his solution
of stripping non-printable characters via col(1) program by having gitweb
strip characters not allowed in XML.
Christopher, could you check that it fixes your issue?
gitweb/gitweb.perl | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletions(-)
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 70a576a..c28b847 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1517,6 +1517,17 @@ sub esc_path {
return $str;
}
+# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
+sub sanitize {
+ my $str = shift;
+
+ return undef unless defined $str;
+
+ $str = to_utf8($str);
+ $str =~ s|([[:cntrl:]])|($1 =~ /[\t\n\r]/ ? $1 : quot_cec($1))|eg;
+ return $str;
+}
+
# Make control characters "printable", using character escape codes (CEC)
sub quot_cec {
my $cntrl = shift;
@@ -6484,7 +6495,8 @@ sub git_blob {
$nr++;
$line = untabify($line);
printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
- $nr, esc_attr(href(-replay => 1)), $nr, $nr, $syntax ? to_utf8($line) : esc_html($line, -nbsp=>1);
+ $nr, esc_attr(href(-replay => 1)), $nr, $nr,
+ $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
}
}
close $fd
--
1.7.6
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
2011-09-16 12:41 ` [PATCH] gitweb: Strip non-printable characters from syntax highlighter output Jakub Narebski
@ 2011-09-16 16:32 ` Junio C Hamano
2011-09-16 18:58 ` Jakub Narebski
2011-09-16 18:11 ` Christopher M. Fuhrman
1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2011-09-16 16:32 UTC (permalink / raw)
To: Jakub Narebski
Cc: Christopher M. Fuhrman, git, Christopher Wilson, Sylvain Rabot
Jakub Narebski <jnareb@gmail.com> writes:
> The commit message is from Christopher, but I have replaced his solution
> of stripping non-printable characters via col(1) program by having gitweb
> strip characters not allowed in XML.
>
> Christopher, could you check that it fixes your issue?
Thanks.
Micronit:
> +# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
> +sub sanitize {
> + my $str = shift;
> +
> + return undef unless defined $str;
Given that the _whole_ point of this subroutine is to make $str safe for
printing, wouldn't you want to either (1) die, declaring that feeding an
undef to this subroutine is a programming error, or (2) return an empty
string?
Given that the input to this function is from the result of feeding $line
to untabify, which relies on $line being defined, and that $line comes
from "while (my $line = <$fd>)" (and then chomp $line), it may be Ok for
this subroutine to make the same assumption as untabify makes.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
2011-09-16 12:41 ` [PATCH] gitweb: Strip non-printable characters from syntax highlighter output Jakub Narebski
2011-09-16 16:32 ` Junio C Hamano
@ 2011-09-16 18:11 ` Christopher M. Fuhrman
1 sibling, 0 replies; 10+ messages in thread
From: Christopher M. Fuhrman @ 2011-09-16 18:11 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Junio C Hamano, git, Christopher Wilson, Sylvain Rabot
Howdy,
On Fri, 16 Sep 2011 at 5:41am, Jakub Narebski wrote:
> The commit message is from Christopher, but I have replaced his solution
> of stripping non-printable characters via col(1) program by having gitweb
> strip characters not allowed in XML.
>
> Christopher, could you check that it fixes your issue?
After applying the patch, I tested it successfully against the following
files:
* linux.git : arch/ia64/kernel/unwind.c
* git.git : t/t3902-quoted.sh
Furthermore, I'm pleased to report that non en_US.UTF8 characters (e.g.,
Chinese hanzi) as found in t3902-quoted.sh are displayed properly when
highlight is enabled.
Tested Web Browsers:
* Safari (5.1 (6534.50)
* Firefox 6.0.2 under Mac OS X Snow Leopard
* Google Chrome 13.0.782.220 under OpenSuSE 11.4
>
> gitweb/gitweb.perl | 14 +++++++++++++-
> 1 files changed, 13 insertions(+), 1 deletions(-)
>
Cheers!
--
Chris Fuhrman
cfuhrman@panix.com
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
2011-09-16 16:32 ` Junio C Hamano
@ 2011-09-16 18:58 ` Jakub Narebski
2011-09-16 20:24 ` Junio C Hamano
0 siblings, 1 reply; 10+ messages in thread
From: Jakub Narebski @ 2011-09-16 18:58 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christopher M. Fuhrman, git, Christopher Wilson, Sylvain Rabot
On Fri, 16 Sep 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> Micronit:
>
> > +# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
> > +sub sanitize {
> > + my $str = shift;
> > +
> > + return undef unless defined $str;
>
> Given that the _whole_ point of this subroutine is to make $str safe for
> printing, wouldn't you want to either (1) die, declaring that feeding an
> undef to this subroutine is a programming error, or (2) return an empty
> string?
Well, that
return undef unless defined $str;
line is copy'n'paste (as is most of sanitize() body) from esc_html().
This line was added in 1df4876 (gitweb: Protect escaping functions against
calling on undef, 2010-02-07) with the following explanation
This is a bit of future-proofing esc_html and friends: when called
with undefined value they would now would return undef... which would
probably mean that error would still occur, but closer to the source
of problem.
This means that we can safely use
esc_html(shift) || "Internal Server Error"
in die_error() instead of
esc_html(shift || "Internal Server Error")
So actually now I see that while this line is good to have in esc_html(),
it is not really necessary in sanitize().
But anyway we don't want to replace undef with an empty string; undef is
(usually) an error, and we want to catch it, not to hide it.
> Given that the input to this function is from the result of feeding $line
> to untabify, which relies on $line being defined, and that $line comes
> from "while (my $line = <$fd>)" (and then chomp $line), it may be Ok for
> this subroutine to make the same assumption as untabify makes.
Right.
Passing undef to sanitize() is usually an error, and we don't want to hide
it. We want for gitweb test to detect it.
--
Jakub Narebski
Poland
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
2011-09-16 18:58 ` Jakub Narebski
@ 2011-09-16 20:24 ` Junio C Hamano
0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2011-09-16 20:24 UTC (permalink / raw)
To: Jakub Narebski
Cc: Christopher M. Fuhrman, git, Christopher Wilson, Sylvain Rabot
Jakub Narebski <jnareb@gmail.com> writes:
> So actually now I see that while this line is good to have in esc_html(),
> it is not really necessary in sanitize().
>
> But anyway we don't want to replace undef with an empty string; undef is
> (usually) an error, and we want to catch it, not to hide it.
Heh, get off your high horse---whoever wrote such a caller that calls the
subroutine and uses its result without checking it against undef is not
qualified to make such a statement. I do not think letting "perl -w"
notice and complain about an attempt to concatenate undef with string
counts as "catching" it.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-09-16 20:24 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-22 22:58 [PATCH] gitweb: highlight: strip non-printable characters via col(1) Christopher M. Fuhrman
2011-08-22 23:21 ` Junio C Hamano
2011-08-26 19:54 ` Jakub Narebski
2011-08-26 21:44 ` Junio C Hamano
2011-08-26 22:06 ` Jakub Narebski
2011-09-16 12:41 ` [PATCH] gitweb: Strip non-printable characters from syntax highlighter output Jakub Narebski
2011-09-16 16:32 ` Junio C Hamano
2011-09-16 18:58 ` Jakub Narebski
2011-09-16 20:24 ` Junio C Hamano
2011-09-16 18:11 ` Christopher M. Fuhrman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).