git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Robin Rosenberg <robin.rosenberg.lists@dewire.com>
To: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Cc: Jakub Narebski <jnareb@gmail.com>, Petr Baudis <pasky@suse.cz>,
	git@vger.kernel.org, Martin Langhoff <martin@catalyst.net.nz>,
	Martyn Smith <martyn@catalyst.net.nz>
Subject: Re: [PATCH] gitweb: handle non UTF-8 text
Date: Wed, 30 May 2007 22:18:19 +0200	[thread overview]
Message-ID: <200705302218.19408.robin.rosenberg.lists@dewire.com> (raw)
In-Reply-To: <20070529215536.GA13250@auto.tuwien.ac.at>

tisdag 29 maj 2007 skrev Martin Koegler:
> On Tue, May 29, 2007 at 11:21:11AM +0200, Jakub Narebski wrote:
> > On Tue, 29 May 2007, Petr Baudis wrote:
> > > On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
> > 
> > >> gitweb assumes, that everything is in UTF-8. If a text contains invalid
> > >> UTF-8 character sequences, the text must be in a different encoding.
> > 
> > But it doesn't tell us _what_ is the encoding. For commit messages,
> > with reasonable new git, we have 'encoding' header if git known that
> > commit message was not in utf-8.
> > 
> > By the way, I winder why we don't have such header for tag objects
> > (i18n.tagEncoding ;-)...
> 
> Why do I need to set i18n.commitEncoding on a normal Linux systems?  We
I've asked the same question.. :(
> have a locale, which contains this information. With this, its more
> likely, that the commits can be read correctly later, if somebody
> forget to set "i18n.commitEncoding" in a repository.
No 'if'. Users are virtually guaranteed to forget this setting.

> 
> UTF-8 is not the universal, dropin solution for ISO-8859-1. It has some drawbacks:
> - Some operations are slower, eg.
> - Anything using string length/character position is more complicated.
We'll have to live with that. A nice property of valid UTF-8 is that many operations can
be performed without decoding (like looking for a substring).

> 
> For some problems, UTF-16 might be a simpler solution.
UTF-16 is also variable width (one or two code units). Most apps get away by pretending it is 
fixed width, simply because that works for most people, but then I'm not sure people in asia 
aren't really happy with that assumption either. 

> I would use i18n.commitEncoding only as last fallback. In a project
> more different encodings could be used and the guessing logic may need
> additional parameter, so I would create a own set of config parameters
> for this.

There aren't many simple ways of guessing. The UTF-8 vs other test is simple 
and very reliable for western encodings (and merely good for others, if I'm not misinformed).
The i18n.commitEncoding is just a hint. Another hint is the host's encoding.

1. if lookslike(UTF-8) => assume UTF-8 else...
2. commit's encoding is valid for the text => use it else...
3. i18n.commitEncoding ...
4. gitweb.commitencoding  ....
5. server's location charset ...
6. assume iso-8859-1

Yet another would be to have an extra option to switch encoding on-demand in the gui.

BTW, there's another thread on notes. Maybe they be used to "fix" badly encoded messages
if and when they get a final implementation.

-- robn

  reply	other threads:[~2007-05-30 20:18 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-28 20:47 [PATCH] gitweb: handle non UTF-8 text Martin Koegler
2007-05-28 23:21 ` Petr Baudis
2007-05-29  9:21   ` Jakub Narebski
2007-05-29 21:55     ` Martin Koegler
2007-05-30 20:18       ` Robin Rosenberg [this message]
2007-06-01 21:05       ` Jakub Narebski
2007-06-02 22:15         ` Junio C Hamano
2007-06-03 15:42           ` Jakub Narebski
2007-06-03 18:41             ` Alexandre Julliard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200705302218.19408.robin.rosenberg.lists@dewire.com \
    --to=robin.rosenberg.lists@dewire.com \
    --cc=git@vger.kernel.org \
    --cc=jnareb@gmail.com \
    --cc=martin@catalyst.net.nz \
    --cc=martyn@catalyst.net.nz \
    --cc=mkoegler@auto.tuwien.ac.at \
    --cc=pasky@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).