From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Narebski
Subject: [PATCH] gitweb: Strip non-printable characters from syntax highlighter output
Date: Fri, 16 Sep 2011 14:41:57 +0200
Message-ID: <201109161441.58946.jnareb@gmail.com>
References: <1314053923-13122-1-git-send-email-cfuhrman@panix.com>
	<7v8vqfdf0l.fsf@alter.siamese.dyndns.org>
	<201108270006.19289.jnareb@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-2"
Content-Transfer-Encoding: 7bit
Cc: git@vger.kernel.org, Christopher Wilson , Sylvain Rabot
To: Junio C Hamano , "Christopher M. Fuhrman"
X-From: git-owner@vger.kernel.org Fri Sep 16 14:42:16 2011
Return-path:
Envelope-to: gcvg-git-2@lo.gmane.org
Received: from vger.kernel.org ([209.132.180.67])
	by lo.gmane.org with esmtp (Exim 4.69) (envelope-from )
	id 1R4XkF-0006eA-Ok for gcvg-git-2@lo.gmane.org;
	Fri, 16 Sep 2011 14:42:16 +0200
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753902Ab1IPMmK (ORCPT ); Fri, 16 Sep 2011 08:42:10 -0400
Received: from mail-fx0-f46.google.com ([209.85.161.46]:63252
	"EHLO mail-fx0-f46.google.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753800Ab1IPMmJ (ORCPT );
	Fri, 16 Sep 2011 08:42:09 -0400
Received: by fxe4 with SMTP id 4so1624616fxe.19 for ;
	Fri, 16 Sep 2011 05:42:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gmail.com; s=gamma;
	h=from:to:subject:date:user-agent:cc:references:in-reply-to
	 :mime-version:content-type:content-transfer-encoding
	 :content-disposition:message-id;
	bh=FJrxObiSnuoB8xiX6mthSGBU2sh4qONKDiNrpCs1Zv8=;
	b=twdciW35iKkVeJWkGxDWIuZwcjAu+k8hyVHagn7CEDUH5cZYaUM64qqly5heq3f6/o
	 Wbo0uY9uc+GMpEI5eluQSIqtaymcLlfVl6c2wFwmQrguRzV2DjSudKb+WFleGy7MPs2/
	 P+dLq2ztEKjXztnIWbJ4Y20DEoQDnYh63A+wc=
Received: by 10.223.34.70 with SMTP id k6mr1345871fad.31.1316176927742;
	Fri, 16 Sep 2011 05:42:07 -0700 (PDT)
Received: from [192.168.1.13] (abvu156.neoplus.adsl.tpnet.pl.
	[83.8.218.156]) by mx.google.com with ESMTPS
	id l8sm2957151fai.16.2011.09.16.05.42.05
	(version=TLSv1/SSLv3 cipher=OTHER);
	Fri, 16 Sep 2011 05:42:06 -0700 (PDT)
User-Agent: KMail/1.9.3
In-Reply-To: <201108270006.19289.jnareb@gmail.com>
Content-Disposition: inline
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org
Archived-At:

The current code, as is, passes control characters, such as form-feed
(^L), to highlight, which then passes them through to the browser.
User agents (web browsers) that support 'application/xhtml+xml' usually
require that web pages declared as XHTML and served with this mimetype
are well-formed XML.  Unescaped control characters cannot appear within
the content of a well-formed XML document.

This will cause the browser to display one of the following warnings:

* Safari v5.1 (6534.50) & Google Chrome v13.0.782.112:

   This page contains the following errors:

   error on line 657 at column 38: PCDATA invalid Char value 12
   Below is a rendering of the page up to the first error.

* Mozilla Firefox 3.6.19 & Mozilla Firefox 5.0:

   XML Parsing Error: not well-formed
   Location: http://path/to/git/repo/blah/blah

Both errors were generated by gitweb.perl v1.7.3.4 with highlight 2.7,
using arch/ia64/kernel/unwind.c from the Linux kernel.

When the syntax highlighter is not used, control characters are replaced
by esc_html(), but with the syntax highlighter they were passed through
to the browser (to_utf8() doesn't remove control characters).

Introduce the sanitize() subroutine, which replaces forbidden control
characters with their printable representation (via quot_cec()) but does
not perform HTML escaping, and use it in git_blob() to sanitize syntax
highlighter output for XHTML.

Note that excluding "\t" (U+0009), "\n" (U+000A) and "\r" (U+000D) is
not strictly necessary, at least for what is currently the only call
site: "\t" tabs are replaced by spaces by untabify(), "\n" is stripped
from each line before processing it, and replacing "\r" could be
considered an improvement.

Originally-by: Christopher M. Fuhrman
Signed-off-by: Jakub Narebski
---
The commit message is from Christopher, but I have replaced his
solution of stripping non-printable characters via the col(1) program
with having gitweb itself handle characters not allowed in XML.

Christopher, could you check that it fixes your issue?

 gitweb/gitweb.perl |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 70a576a..c28b847 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -1517,6 +1517,17 @@ sub esc_path {
 	return $str;
 }
 
+# Sanitize for use in XHTML + application/xml+xhtm (valid XML 1.0)
+sub sanitize {
+	my $str = shift;
+
+	return undef unless defined $str;
+
+	$str = to_utf8($str);
+	$str =~ s|([[:cntrl:]])|($1 =~ /[\t\n\r]/ ? $1 : quot_cec($1))|eg;
+	return $str;
+}
+
 # Make control characters "printable", using character escape codes (CEC)
 sub quot_cec {
 	my $cntrl = shift;
@@ -6484,7 +6495,8 @@ sub git_blob {
 			$nr++;
 			$line = untabify($line);
 			printf qq!<div class="pre"><a id="l%i" href="%s#l%i" class="linenr">%4i</a> %s</div>\n!,
-			       $nr, esc_attr(href(-replay => 1)), $nr, $nr, $syntax ? to_utf8($line) : esc_html($line, -nbsp=>1);
+			       $nr, esc_attr(href(-replay => 1)), $nr, $nr,
+			       $syntax ? sanitize($line) : esc_html($line, -nbsp=>1);
 		}
 	}
 	close $fd
-- 
1.7.6