Re: [PATCH] gitweb: handle non UTF-8 text

From: Jakub Narebski <jnareb@gmail.com>
To: Petr Baudis <pasky@suse.cz>, Martin Koegler <mkoegler@auto.tuwien.ac.at>
Cc: git@vger.kernel.org, Martin Langhoff <martin@catalyst.net.nz>,
	Martyn Smith <martyn@catalyst.net.nz>
Subject: Re: [PATCH] gitweb: handle non UTF-8 text
Date: Tue, 29 May 2007 11:21:11 +0200
Message-ID: <200705291121.12119.jnareb@gmail.com>
In-Reply-To: <20070528232139.GU4489@pasky.or.cz>

[Cc: authors of git-cvsserver]

On Tue, 29 May 2007, Petr Baudis wrote:
> On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:

>> gitweb assumes, that everything is in UTF-8. If a text contains invalid
>> UTF-8 character sequences, the text must be in a different encoding.

But it doesn't tell us _what_ the encoding is. For commit messages,
with a reasonably new git, we have the 'encoding' header if git knew
that the commit message was not in utf-8.
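
To illustrate how that header could be used, here is a minimal sketch in
Python (not gitweb's Perl, and not code from the patch). The function
name and the sample commit are hypothetical; it assumes the raw commit
object is already available as bytes and that a missing 'encoding'
header means utf-8.

```python
def decode_commit_message(raw: bytes) -> str:
    """Decode the message of a raw commit object, honoring its
    'encoding' header when one is present."""
    # Headers and message are separated by the first blank line.
    header, _, body = raw.partition(b"\n\n")
    encoding = "utf-8"  # assumed default when no header is present
    for line in header.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii")
            break
    return body.decode(encoding, errors="replace")


# Hypothetical commit whose message is stored as ISO-8859-2
# (0xB9 is "s with caron" in that charset).
raw_commit = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1180430471 +0200\n"
    b"committer A U Thor <author@example.com> 1180430471 +0200\n"
    b"encoding ISO-8859-2\n"
    b"\n"
    b"Petr Baudi\xb9\n"
)
message = decode_commit_message(raw_commit)
```

With the header present no guessing is needed; the same bytes decoded
as latin1 would silently give a different character.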

By the way, I wonder why we don't have such a header for tag objects
(i18n.tagEncoding ;-)...

>> This patch interprets such a text as latin1.

Meaning that it tries to recode text from latin1 (iso-8859-1) to utf-8
(not changing gitweb output encoding, which is utf-8).

It would be much better, and much easier, at least for commit messages,
to add --encoding=utf-8 to the git-rev-list / git-log invocation.

>> Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
>> ---
>> For correct UTF-8, the patch does not change anything.
>> 
>> If commit/blob/... is not in UTF-8, it displays the text
>> with a very high probability correct. 

The "commit/blob/..." above covers four different cases: commit (with
its 'encoding' header, and the `--encoding' option which we can use
instead of recoding in gitweb, provided that git was compiled with
iconv support), tag (similar to commit, but IIRC without either the
'encoding' header or the `--encoding' option), blob (with no place to
store encoding), and pathname in tree (whose encoding can differ from
the blob encoding).

And I very much doubt this "very high probability" of being correct.

>> As git itself is not aware of any encoding, I know no better
>> possibility to handle non UTF-8 text in gitweb.
> 
> I don't think this is a reasonable approach; I actually dispute the high
> probability - in western Europe it's obvious to assume latin1, but does
> majority of users using non-ascii characters come from there? Or rather
> from central Europe (like me, Petr Baudiš? ;-))? Somewhere else?

I also don't think that hardcoding latin1 (iso-8859-1) as the default
alternate encoding is a good idea. I don't think using iso-8859-1
(outside us-ascii) is _nowadays_ that common. On the other hand, I think
that not all users of koi8r, eucjp or iso-2022-jp have converted (or can
convert) to utf-8; latin1 users can.

And using latin1 (or another encoding) _only_ when there is an invalid
utf-8 sequence is not a good idea either; I think that there are some
latin1 sequences outside us-ascii which are valid utf-8 sequences. That
kind of magic is wrong, wrong, wrong...
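
A small Python sketch of the heuristic under discussion shows both
failure modes. The function name is hypothetical and this is only an
approximation of what the patch does (try utf-8, fall back to latin1 on
invalid input), not the patch itself.

```python
def to_text_with_latin1_fallback(data: bytes) -> str:
    """Decode as utf-8; if the bytes are not valid utf-8, assume latin1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this never fails,
        # whatever the real encoding was.
        return data.decode("latin-1")


# Case 1: a lone 0xE9 is invalid utf-8, so the fallback kicks in and
# gives U+00E9 (e with acute), which is right if the text was latin1.
lone_latin1 = to_text_with_latin1_fallback(b"\xe9")

# Case 2: the latin1 text U+00C3 U+00A9 is stored as bytes C3 A9, which
# is also valid utf-8 for U+00E9, so the fallback is never reached.
ambiguous = to_text_with_latin1_fallback(b"\xc3\xa9")

# Case 3: koi8-r text is usually invalid utf-8 too, and is then
# mis-decoded as latin1 without any error.
koi8r_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
mangled = to_text_with_latin1_fallback(koi8r_bytes)
```

Case 2 is the ambiguity mentioned above; case 3 is why a hardcoded
fallback only helps users whose legacy encoding happens to be latin1.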

> If we do something like this, we should do it properly and look at
> configured i18n.commitEncoding for the project. (But as config lookup
> may be expensive, probably do it only when we need it.)

I think it would be best to make it into %feature, overridable
or not (which would look at i18n.commitEncoding instead of at
gitweb.commitEncoding, but still a feature).

About config lookup: we can either "borrow" the config reading code in
Perl from git-cvsserver, perhaps by putting it into Git.pm, or we can at
last implement core git support for dumping the whole config in an
unambiguous, machine-parseable format: "git config --dump", e.g.
  key <LF> value <NUL>
or
  key <NUL>
(the second for "boolean" variables without set value).
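
To show why such output would be easy to consume, here is a sketch of a
reader for exactly the record layout above, in Python for brevity. The
"--dump" option is only the proposal from this message; the function
name and the sample data are hypothetical, and the sketch parses bytes
rather than invoking git.

```python
from typing import Dict, List, Optional


def parse_config_dump(dump: bytes) -> Dict[str, List[Optional[str]]]:
    """Parse 'key <LF> value <NUL>' and 'key <NUL>' records.

    Returns a mapping from key to the list of its values, in order;
    a key given without a value is recorded as None.
    """
    entries: Dict[str, List[Optional[str]]] = {}
    for record in dump.split(b"\0"):
        if not record:
            continue  # empty piece after the final NUL
        # Only the first LF separates key from value, so values may
        # themselves contain newlines.
        key, sep, value = record.partition(b"\n")
        text = value.decode("utf-8", errors="replace") if sep else None
        entries.setdefault(key.decode("utf-8"), []).append(text)
    return entries


sample = (
    b"core.bare\0"
    b"i18n.commitencoding\nISO-8859-2\0"
    b"remote.origin.fetch\n+refs/heads/*:refs/remotes/origin/*\0"
)
config = parse_config_dump(sample)
```

Values are kept in a list because a key may appear more than once in a
config file.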

Having an alternate (read-only) config parser has its advantages and
disadvantages. The advantage is that we avoid fork+exec (performance),
and having two implementations is always good for getting the format
standardized. The disadvantage is that it is yet another piece of code
to maintain, and that config parsing (even read-only config parsing) is
a bit tricky with the current git config file format.

-- 
Jakub Narebski
Poland

Thread overview: 9+ messages
2007-05-28 20:47 [PATCH] gitweb: handle non UTF-8 text Martin Koegler
2007-05-28 23:21 ` Petr Baudis
2007-05-29  9:21   ` Jakub Narebski [this message]
2007-05-29 21:55     ` Martin Koegler
2007-05-30 20:18       ` Robin Rosenberg
2007-06-01 21:05       ` Jakub Narebski
2007-06-02 22:15         ` Junio C Hamano
2007-06-03 15:42           ` Jakub Narebski
2007-06-03 18:41             ` Alexandre Julliard
