All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jürgen Kreileder" <jk@blackdown.de>,
	"John Hawley" <warthog9@kernel.org>
Subject: Re: [RFD] Handling of non-UTF8 data in gitweb
Date: Sat, 10 Dec 2011 17:18:33 +0100	[thread overview]
Message-ID: <201112101718.34848.jnareb@gmail.com> (raw)
In-Reply-To: <7vehwhcj3q.fsf@alter.siamese.dyndns.org>

On Wed, 7 Dec 2011, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
> 
> > But doing this would change gitweb behavior.  Currently when 
> > encountering something (usually line of output) that is not valid 
> > UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
> > 'latin1'.  Note however that this value is per gitweb installation,
> > not per repository.
> 
> I think we added and you acked 00f429a (gitweb: Handle non UTF-8 text
> better, 2007-06-03) for a good reason, and I think the above argues that
> whatever issue the commit tried to address is a non-issue. Is it really
> true?

I think that UTF-8 is much more prevalent character encoding in operating
systems, programming languages, APIs, and software applications than it
was 4 years ago.

Also the solution implemented in said commit was a good start, but it
remains incomplete: $fallback_encoding is per-installation which is too
big granularity (there is `gui.encoding` per-repository config... but it
is about main not fallback encoding; best would be to use gitattribute
but currently there is no way to check attribute value at given revision).

The proposed

  use open qw(:std :utf8);

and removal of to_utf8 and $fallback_encoding would be regression compared
to post-00f429a... but the tradeoff of more robust UTF-8 handling might be
worth it.


Note that to_utf8 handles git command output part by part, not as a whole;
for UTF-8 vs latin1 (i.e. iso-8859-1) it does not matter though because
latin1 is very unlikely to be recognized as valid utf-8[1], and ASCII
characters pass-through for UTF-8.

[1]: http://en.wikipedia.org/wiki/UTF-8#Advantages

> > ... I guess
> > it could be emulated by defining our own 'utf-8-with-fallback'
> > encoding, or by defining our own PerlIO layer with PerlIO::via.
> > But it no longer be simple solution (though still automatic).
> 
> Between the current "everybody who read from the input must remember to
> call to_utf8" and "everybody gets to read utf8 automatically for internal
> processing", even though the latter may involve one-time pain to set up
> the framework to do so, the pros-and-cons feels obvious to me.

There is also a matter of performance (':utf8' and ':encoding(UTF-8)'
are AFAIK implemented in C, both the Encode part and PerlIO part).
-- 
Jakub Narebski
Poland

  reply	other threads:[~2011-12-10 16:18 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-04 16:09 [RFD] Handling of non-UTF8 data in gitweb Jakub Narebski
2011-12-06  1:07 ` Jeff King
2011-12-07  0:37 ` Junio C Hamano
2011-12-10 16:18   ` Jakub Narebski [this message]
2011-12-12  5:26     ` Junio C Hamano
2011-12-18 22:00   ` Jakub Narebski
2012-01-06 16:35 ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201112101718.34848.jnareb@gmail.com \
    --to=jnareb@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jk@blackdown.de \
    --cc=warthog9@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.