linux-kernel.vger.kernel.org archive mirror
From: Russell King <rmk+lkml@arm.linux.org.uk>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Tilman Schmidt <tilman@imap.cc>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: OT: character encodings (was: Linux 2.6.20-rc4)
Date: Sun, 7 Jan 2007 17:06:56 +0000
Message-ID: <20070107170656.GC21133@flint.arm.linux.org.uk>
In-Reply-To: <1168187346.14763.70.camel@shinybook.infradead.org>

On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to.  As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded.  This leads to
> > utter confusion.
> 
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.

$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8

(and you know what charset the file is thought to have with all 1000
lines in it.)

All on a system with LANG set to en_GB (iow ISO-8859-1).
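
A minimal Python sketch of why the guess changes with the amount of text
sampled. This is an assumed three-step heuristic (ASCII, then UTF-8, then
a Latin-1 fallback), not the algorithm file(1) actually implements, and
the sample names are hypothetical.

```python
def guess_charset(data: bytes) -> str:
    """Rough content-based guess; NOT the algorithm file(1) really uses."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte string decodes as Latin-1, so this is only a fallback.
        return "iso-8859-1"


# Hypothetical log fragments: the guess depends on which bytes were sampled.
ascii_only = b"Author: Some One\n"
with_utf8 = ascii_only + "Signed-off-by: Andr\u00e9\n".encode("utf-8")
with_both = with_utf8 + "Signed-off-by: Ren\u00e9\n".encode("iso-8859-1")

for sample in (ascii_only, with_utf8, with_both):
    print(guess_charset(sample))
# prints: us-ascii, utf-8, iso-8859-1 (one per line)
```

The label describes the sample rather than the file: adding one line in
a different encoding changes the answer for the whole stream.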

> > To see what I mean, try the following:
> > 
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> > 
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
> 
> Yes. When you stored it on disk, the character set information was lost.

The same thing actually happens when I look at it via:

  $ git log | head -n 1000 | less

but in this case the output is always interpreted by the terminal to be
in its character set.
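
To illustrate what "interpreted by the terminal" means here, a small
Python sketch: the same two bytes decoded under two different charsets.
The character chosen is an arbitrary example, not taken from the log.

```python
# The UTF-8 encoding of U+00E9 (e with acute accent) is two bytes.
raw = "\u00e9".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # what a UTF-8 terminal would show
as_latin1 = raw.decode("iso-8859-1")   # what an ISO-8859-1 terminal would show

print(raw)                             # b'\xc3\xa9'
print(len(as_utf8), len(as_latin1))    # 1 2
```

The bytes never change; only the reader's assumption about them does.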

> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.

I'm not - I'm running a pure ISO-8859-1 system:

$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"

> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> 	shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> 	shinybook /shiny/git/mtd-2.6 $ file -i o
> 	o: text/plain; charset=utf-8

$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2

Looks like the output is iso-8859-1 even with UTF-8!

> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
> 
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).

Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.)  git log output on a non-UTF-8
system certainly is not in the host's character set.  For example:

$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2

That includes the UTF-8 encoded part of Leonard's name.  It also includes
Rafał Bilski's name which is non-UTF-8 encoded.

So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, and it includes
both untranslated UTF-8 and non-UTF-8 sequences.

There is obviously no character set translation going on with the output.
So we can add 'git' to my list of charset-broken programs.
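
One way to check a byte stream for this kind of mixture is to test each
line separately for UTF-8 validity. A Python sketch follows; the helper
name and the sample author lines are hypothetical, not taken from the
kernel log.

```python
def non_utf8_lines(data: bytes):
    """Return (line_number, raw_line) for each line that is not valid UTF-8."""
    bad = []
    # Splitting on b"\n" is safe: 0x0A never occurs inside a UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Hypothetical sample: the same name stored in two different encodings.
log = b"\n".join([
    b"Author: Plain Ascii",
    "Author: J\u00fcrgen Example".encode("utf-8"),
    "Author: J\u00fcrgen Example".encode("iso-8859-1"),
])

print(non_utf8_lines(log))
# prints: [(3, b'Author: J\xfcrgen Example')]
```

The same function could be run on bytes read from a pipe, for example
via sys.stdin.buffer.read(), to list the offending lines in real output.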

Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.

Looking at the git-commit script, there appears to be no character set
conversion going on in there either.

So, I think you'll find that the contents of git _are_ an ad-hoc collection
of character sets which people happen to have in use on their machines.
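
For comparison, a Python sketch of what conversion at input and output
time would amount to. This is not git's code; the function names are
hypothetical, and it assumes the source charset is known, which is
exactly the information that is missing in practice.

```python
def to_utf8(message: bytes, source_charset: str) -> bytes:
    """Input side: transcode from the committer's charset to UTF-8."""
    return message.decode(source_charset).encode("utf-8")


def from_utf8(stored: bytes, target_charset: str) -> bytes:
    """Output side: transcode to the reader's charset.

    Characters the target cannot represent are replaced with '?'.
    """
    return stored.decode("utf-8").encode(target_charset, errors="replace")


latin1_msg = "Fix typo (thanks Ren\u00e9)\n".encode("iso-8859-1")
stored = to_utf8(latin1_msg, "iso-8859-1")

print(from_utf8(stored, "iso-8859-1") == latin1_msg)        # True
print(from_utf8("Rafa\u0142".encode("utf-8"), "iso-8859-1"))  # b'Rafa?'
```

With both steps in place, stored text would be uniformly UTF-8 and the
two LANG settings would yield different output bytes, which is not what
the diff above shows.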

> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled.  If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
> 
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.

I'm not talking about a mixed charset environment.  I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
