git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jakub Narębski" <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Jürgen Kreileder" <jk@blackdown.de>, git@vger.kernel.org
Subject: Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch
Date: Tue, 09 Apr 2013 13:31:57 +0200	[thread overview]
Message-ID: <5163FC2D.9050408@gmail.com> (raw)
In-Reply-To: <7vd2u4vg4o.fsf@alter.siamese.dyndns.org>

On 08.04.2013, Junio C Hamano wrote:
> jk@blackdown.de (Jürgen Kreileder) writes:
> 
>> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. 
>>
>> Signed-off-by: Jürgen Kreileder <jk@blackdown.de>

I see that this patch does (or tries to do) two _independent_ things,
and should be split into at least two separate commits:

>> ---
> 
> Thanks, will queue but not hold until I hear something from Jakub.
> 
>>  gitweb/gitweb.perl |    8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 1309196..9cfe5b5 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -3823,7 +3823,7 @@ sub blob_contenttype {
>>  	my ($fd, $file_name, $type) = @_;
>>  
>>  	$type ||= blob_mimetype($fd, $file_name);
>> -	if ($type eq 'text/plain' && defined $default_text_plain_charset) {
>> +	if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>  		$type .= "; charset=$default_text_plain_charset";
>>  	}

First, it extends adding "; charset=$default_text_plain_charset" to
other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
mimetypes (without changing name of variable... though this is more
complicated as it is configuration variable and we would want to
preserve backward compatibility, but at least a comment would be,
I think, needed).
 
Originally it applied only to 'text/plain' files, which can be
displayed inline by web browser, and which need charset in
'Content-Type:' HTTP header to be displayed correctly, as they
do not include such information inside the file.

'text/html' and 'application/xhtml+xml' can include such information
inside them (meta http-equiv for 'text/html' and <?xml ...> for
'application/xhtml+xml').  I don't know what browser does when there
is conflicting information about charset, i.e. which one wins, but
it is certainly something to consider.

It might be a good change; I don't know what web browser do when
serving 'text/css', 'text/javascript', 'text/xml' to client to
view without media type known.


BTW I have noticed that we do $prevent_xss dance outside
blob_contenttype(), in it's only caller i.e. git_blob_plain()...
which for example means that 'text/html' converted to 'text/plain'
don't get '; charset=...' added.  I guess that it *might* be
what prompted this part of change... but if it is so, it needs
to be fixed at source, e.g. by moving $prevent_xss to
blob_contenttype() subroutine.


About implementation:

>> +	if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>  		$type .= "; charset=$default_text_plain_charset";

would be better with line split

>> +	if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) &&
>> +	    defined $default_text_plain_charset) {
>>  		$type .= "; charset=$default_text_plain_charset";

Also: do we need to be strict with '\w[-\w]*', or would '.*' suffice?

>> @@ -7637,7 +7637,9 @@ sub git_blobdiff {
>>  			last if $line =~ m!^\+\+\+!;
>>  		}
>>  		local $/ = undef;
>> +		binmode STDOUT, ':raw';
>>  		print <$fd>;
>> +		binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  		close $fd;
>>  	}
>>  }
>> @@ -7884,12 +7886,16 @@ sub git_commitdiff {
>>  
>>  	} elsif ($format eq 'plain') {
>>  		local $/ = undef;
>> +		binmode STDOUT, ':raw';
>>  		print <$fd>;
>> +		binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>>  		local $/ = undef;
>> +		binmode STDOUT, ':raw';
>>  		print <$fd>;
>> +		binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>>  	}

Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
that should be abandoned in favor of 'patch' view; but that is
a separate issue) and 'patch' views so they use binary-safe output.

Note that in all cases (I think) we use

   $cgi->header(
     -type => 'text/plain',
     -charset => 'utf-8',
     ...
   );

promising web browser that output is as whole in 'utf-8' encoding.

It is not explained in the commit message what is the reason for this
change.  Is it better handing of a situation where files being diff-ed
being in other encoding (like for example in commit that changes
encoding of files from legacy encoding such like e.g. iso-8859-2
or cp1250 to utf-8)?  But what about  -charset => 'utf-8'  then?


About implementation: I think after this change common code crosses
threshold for refactoring it into subroutine, for example:

  sub dump_fh_raw {
  	my $fh = shift;

  	local $/ = undef;
  	binmode STDOUT, ':raw';
  	print <$fh>;
  	binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi

  	return $fh;
  }

-- 
Jakub Narębski

  reply	other threads:[~2013-04-09 11:32 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-08 20:08 [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch Jürgen Kreileder
2013-04-08 21:58 ` Junio C Hamano
2013-04-09 11:31   ` Jakub Narębski [this message]
2013-04-09 21:59     ` Jürgen Kreileder
2013-04-10 10:23       ` Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5163FC2D.9050408@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jk@blackdown.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).