From: Jakub Narebski <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jürgen Kreileder" <jk@blackdown.de>,
	"John Hawley" <warthog9@kernel.org>
Subject: Re: [RFD] Handling of non-UTF8 data in gitweb
Date: Sat, 10 Dec 2011 17:18:33 +0100
Message-ID: <201112101718.34848.jnareb@gmail.com>
In-Reply-To: <7vehwhcj3q.fsf@alter.siamese.dyndns.org>

On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > But doing this would change gitweb behavior.  Currently when 
> > encountering something (usually line of output) that is not valid 
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'.  Note however that this value is per gitweb installation,
> > not per repository.
> 
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?

I think that UTF-8 is a much more prevalent character encoding in
operating systems, programming languages, APIs, and software applications
than it was 4 years ago.

Also, the solution implemented in said commit was a good start, but it
remains incomplete: $fallback_encoding is per-installation, which is too
coarse a granularity.  (There is the per-repository `gui.encoding` config
variable, but it is about the main encoding, not the fallback one; best
would be to use gitattributes, but currently there is no way to check an
attribute value at a given revision.)

The proposed

  use open qw(:std :utf8);

and removal of to_utf8 and $fallback_encoding would be a regression
compared to post-00f429a... but the tradeoff of more robust UTF-8 handling
might be worth it.


Note that to_utf8 handles git command output piece by piece (usually line
by line), not as a whole.  For UTF-8 vs latin1 (i.e. iso-8859-1) this does
not matter, though, because latin1 text is very unlikely to be recognized
as valid UTF-8[1], and ASCII characters pass through unchanged in UTF-8.

[1]: http://en.wikipedia.org/wiki/UTF-8#Advantages
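
To illustrate the "try UTF-8, else fall back" behavior described above,
here is a minimal sketch.  It is not the actual to_utf8 from gitweb; the
name decode_with_fallback is made up for this example, and the strict
check via Encode::FB_CROAK is my assumption about one reasonable way to
do the validation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Per-installation fallback, 'latin1' by default (as in gitweb).
my $fallback_encoding = 'latin1';

# Hypothetical helper, named for this example only.
# Takes one chunk of raw bytes (e.g. one line of git output) and
# returns a Perl character string.
sub decode_with_fallback {
	my ($bytes) = @_;
	return undef unless defined $bytes;

	# decode() with a CHECK argument may modify its input in place,
	# so work on a copy and keep the original for the fallback.
	my $copy = $bytes;

	# FB_CROAK makes decode() die on malformed UTF-8; eval turns
	# that into an undefined result instead.
	my $str = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
	return $str if defined $str;

	# Not valid UTF-8: interpret the bytes in the fallback encoding.
	return decode($fallback_encoding, $bytes);
}
```

With this sketch, the UTF-8 bytes "caf\xC3\xA9" and the latin1 bytes
"caf\xE9" both decode to the same four-character string.  Because each
chunk is checked on its own, a latin1 chunk that happens to form valid
UTF-8 would be taken as UTF-8, which is the unlikely case mentioned above.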

> > ... I guess
> > it could be emulated by defining our own 'utf-8-with-fallback'
> > encoding, or by defining our own PerlIO layer with PerlIO::via.
> > But it would no longer be a simple solution (though still automatic).
> 
> Between the current "everybody who read from the input must remember to
> call to_utf8" and "everybody gets to read utf8 automatically for internal
> processing", even though the latter may involve one-time pain to set up
> the framework to do so, the pros-and-cons feels obvious to me.

There is also the matter of performance (':utf8' and ':encoding(UTF-8)'
are AFAIK implemented in C, both the Encode part and the PerlIO part).
-- 
Jakub Narebski
Poland
