From: Russell King <rmk+lkml@arm.linux.org.uk>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Tilman Schmidt <tilman@imap.cc>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: OT: character encodings (was: Linux 2.6.20-rc4)
Date: Sun, 7 Jan 2007 17:06:56 +0000
Message-ID: <20070107170656.GC21133@flint.arm.linux.org.uk>
In-Reply-To: <1168187346.14763.70.camel@shinybook.infradead.org>
On Mon, Jan 08, 2007 at 12:29:05AM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 15:38 +0000, Russell King wrote:
> > When a text file is stored on disk, there's no way to tell what
> > character set the characters in that file belong to. As a result,
> > ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
> > UTF-8 folk assume all text files are UTF-8 encoded. This leads to
> > utter confusion.
>
> Only if you are making different assumptions about the _same_ set of
> files, on the _same_ system. But that would be silly.
$ git log | head -n 1000 | tail -n 200 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 300 > o
$ file -i o
o: text/plain; charset=us-ascii
$ git log | head -n 1000 | tail -n 400 > o
$ file -i o
o: text/plain; charset=utf-8
(and you know what charset the file is thought to have with all 1000
lines in it.)
All on a system with LANG set to en_GB (iow ISO-8859-1).
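The three results above are what one would expect from any detector that only looks at the bytes it is handed. A minimal sketch of that idea follows. It is not how file(1) is implemented; the function name `classify` and the label `not-utf-8` are made up for illustration, and it assumes an iconv(1) that exits non-zero on invalid input, as glibc's does.

```shell
#!/bin/sh
# classify: guess a charset label for the bytes on stdin (illustrative only).
classify() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # Delete every 7-bit byte; anything left over is a byte >= 0x80.
    if [ -z "$(LC_ALL=C tr -d '\000-\177' < "$tmp")" ]; then
        echo us-ascii
    # A UTF-8 to UTF-8 conversion only succeeds if the input is valid UTF-8.
    elif iconv -f UTF-8 -t UTF-8 < "$tmp" > /dev/null 2>&1; then
        echo utf-8
    else
        # Could be ISO-8859-1 or any other 8-bit set; the bytes alone cannot say.
        echo not-utf-8
    fi
    rm -f "$tmp"
}
```

A sample with only ASCII reports us-ascii, one whose high bytes all form valid UTF-8 sequences reports utf-8, and a single stray 8-bit byte anywhere in the sample flips the answer, which would explain why the verdict changes with the size of the `tail` window.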
> > To see what I mean, try the following:
> >
> > $ git log | head -n 1000 > o
> > $ file -i o
> > o: text/x-c; charset=iso-8859-1
> >
> > According to that, the charset of the 'git log' output (which on that
> > test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
> > was right to include it as ISO-8859-1.
>
> Yes. When you stored it on disk, the character set information was lost.
The same thing actually happens when I look at it via:
$ git log | head -n 1000 | less
but in this case the output is always interpreted by the terminal to be
in its character set.
> If you were running a mixed-charset system then attempting to recreate
> the lost information with heuristics and assumptions is obviously going
> to be problematic.
I'm not - I'm running a pure ISO-8859-1 system:
$ echo $LANG
en_GB
$ locale -k LC_CTYPE | grep charmap
charmap="ISO-8859-1"
> Actually, because UTF-8 allows me to run a system which is purely based
> on a single character set, I get better results when I try the same
> trick:
> shinybook /shiny/git/mtd-2.6 $ git log | head -n 1000 > o
> shinybook /shiny/git/mtd-2.6 $ file -i o
> o: text/plain; charset=utf-8
$ LANG=en_GB.UTF-8 locale -k LC_CTYPE | grep charmap
charmap="UTF-8"
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB.UTF-8 file -i o
o: text/x-c; charset=iso-8859-1
$ git version
git version 1.4.4.2
Looks like the output is iso-8859-1 even with UTF-8!
> > In reality, the output from git log contains an ad-hoc collection of
> > character sets making its interpretation under any one character set
> > incorrect.
>
> No, the contents of the git log ought to be UTF-8, unless people have
> been misusing it. Git stores its text in UTF-8 (by default), and is
> capable of converting to and from legacy character sets on input
> (git-commit) and output (git-log).
Git may store its text internally in UTF-8 (I don't know but I have no
evidence to suggest it does - in fact I have some evidence in this test
that it doesn't care about charsets.) git log output on a non-UTF-8
system certainly is not in the host's character set. For example:
$ LANG=en_GB.UTF-8 git log | head -n 1000 > o
$ LANG=en_GB git log | head -n 1000 > o2
$ diff -u o o2
That includes the UTF-8 encoded part of Leonard's name. It also includes
Rafał Bilski's name which is non-UTF-8 encoded.
So, in both cases, exactly the same output bytestream was created
independent of the character set _actually_ being used, which both
includes untranslated UTF-8 and non-UTF-8 sequences.
There is obviously no character set translation going on with the output.
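For comparison, here is a sketch of what output translation would have to look like if it were happening: the same text converted for an ISO-8859-1 locale is a different byte stream, so the `diff` above could not have come back empty. The sample name and the helper functions (`sample_utf8`, `to_latin1`, `count_bytes`) are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Hypothetical sample: "Schröder" encoded as UTF-8 (the ö is the two bytes C3 B6).
sample_utf8() { printf 'Schr\303\266der\n'; }

# What translation on output for an ISO-8859-1 locale would amount to.
to_latin1() { iconv -f UTF-8 -t ISO-8859-1; }

# Byte count of stdin, with any padding from wc stripped by the arithmetic expansion.
count_bytes() { echo $(( $(wc -c) )); }

echo "utf-8:      $(sample_utf8 | count_bytes) bytes"
echo "iso-8859-1: $(sample_utf8 | to_latin1 | count_bytes) bytes"
```

The converted stream is one byte shorter, because the two-byte UTF-8 sequence for ö becomes the single byte F6.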
So we can add 'git' to my list of charset-broken programs.
Also, since we have recent data in the git repository which is non-UTF-8
as well, it is clear that there is no character set translation going on
at input time either.
Looking at the git-commit script, there appears to be no character set
conversion going on in there either.
So, I think you'll find that the contents of git _is_ an ad-hoc collection
of character sets which people happen to have in use on their machines.
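One way to find the offending entries is to check each line of the log separately, since a mixed stream has some lines that are valid UTF-8 and some that are not. The sketch below is slow and illustrative only; `non_utf8_lines` is a made-up name, and it again assumes an iconv(1) that fails on invalid input.

```shell
#!/bin/sh
# non_utf8_lines: print the number of every stdin line that is not valid UTF-8.
non_utf8_lines() {
    n=0
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if ! printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
            echo "$n"
        fi
    done
}
```

It could be run as `git log | head -n 1000 | non_utf8_lines`. Pure ASCII lines pass, since ASCII is a subset of UTF-8, so only lines carrying some other 8-bit encoding are reported.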
> > So, in short, UTF-8 is all fine and dandy if your _entire_ universe
> > is UTF-8 enabled. If you're operating in a mixed charset environment
> > it's one bloody big pain in the butt.
>
> A mixed charset environment was _already_ a pain in the butt, because
> almost nobody got labelling right. It's wrong to blame that on UTF-8.
I'm not talking about a mixed charset environment. I'm talking about
non-UTF-8 single charset environments being broken by programs which
universally think the universe is UTF-8 only.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: