From: Russell King <rmk+lkml@arm.linux.org.uk>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Tilman Schmidt <tilman@imap.cc>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: OT: character encodings (was: Linux 2.6.20-rc4)
Date: Sun, 7 Jan 2007 15:38:33 +0000
Message-ID: <20070107153833.GA21133@flint.arm.linux.org.uk>
In-Reply-To: <1168182838.14763.24.camel@shinybook.infradead.org>

On Sun, Jan 07, 2007 at 11:13:57PM +0800, David Woodhouse wrote:
> On Sun, 2007-01-07 at 14:06 +0100, Tilman Schmidt wrote:
> > Russell King schrieb:
> > > Welcome to the mess which the UTF-8 charset creates.
>
> Utter bollocks.

Wrong. The problem is partly caused by not everything understanding
multi-byte character encodings, and text files containing absolutely
_no_ information about their character encodings.

When a text file is stored on disk, there's no way to tell what
character set the characters in that file belong to. As a result,
ISO-8859-1 folk assume that all text files are ISO-8859-1 encoded.
UTF-8 folk assume all text files are UTF-8 encoded. This leads to
utter confusion.

To see what I mean, try the following:

$ git log | head -n 1000 > o
$ file -i o
o: text/x-c; charset=iso-8859-1

According to that, the charset of the 'git log' output (which on that
test included Leonard's entry) is iso-8859-1, and by that Linus' mailer
was right to include it as ISO-8859-1.

In reality, the output from git log contains an ad-hoc collection of
character sets making its interpretation under any one character set
incorrect.
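
To make the point concrete, here is a minimal Python sketch that
classifies each line of a byte stream as plain ASCII, valid UTF-8, or
neither. The helper names classify_line and summarize and the sample
author lines are made up for illustration; a real 'git log' dump would
be read from a file in binary mode instead. It is only a heuristic:
bytes in a legacy 8-bit charset can occasionally form valid UTF-8 by
accident, and "other" says nothing about which legacy charset was used.

```python
def classify_line(raw: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other' for one line of raw bytes."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so presumably some legacy 8-bit charset.
        return "other"


def summarize(data: bytes) -> dict:
    """Count how many lines fall into each class."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in data.splitlines():
        counts[classify_line(line)] += 1
    return counts


# Hypothetical sample: the same name committed under two encodings.
sample = (
    b"Author: Plain Ascii <a@example.org>\n"
    + "Author: G\u00e9za <g@example.org>\n".encode("utf-8")
    + "Author: G\u00e9za <g@example.org>\n".encode("iso-8859-1")
)

print(summarize(sample))  # {'ascii': 1, 'utf-8': 1, 'other': 1}
```

Once both the "utf-8" and "other" counts are non-zero, no single
charset label for the whole file can be right.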
> > The problem of different character encodings coexisting on the same
> > platform, and the resulting occasional messing-up, far predates Unicode.
> > I distinctly remember one case of being bitten by this myself in 1977
> > when Unicode wasn't even on the horizon yet, and I don't think that was
> > the first time.
>
> Indeed. If you take arbitrary content and send it out to the world
> labelled as ISO8859-1, of _course_ you're likely to be corrupting it.
>
> Far from being the cause of the problem, UTF-8 actually offers the
> chance of a _solution_. Because once the Luddites catch up, it'll
> largely eliminate the need for using the multitude of legacy character
> sets and converting between them -- and the problem of mislabelling will
> fairly much go away.

In other words, the UTF-8 luddites require the entire Internet to
upgrade to UTF-8 for UTF-8 to work properly.

I _regularly_ struggle with idiotic programs that assume that the world
is UTF-8 and nothing else. UTF-8 does _not_ solve these inter-operability
problems - it only makes the entire situation worse by introducing yet
another different charset. (Yes, it's also true that there are programs
which assume the world is only another, different, character set.)

Rather than having these problems fixed properly (by looking at the LANG
environment variable) many of these programs now assume that the world
is UTF-8. It isn't.
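
A rough sketch of that lookup in Python, assuming the POSIX precedence
LC_ALL, then LC_CTYPE, then LANG, and the usual
language[_territory][.codeset][@modifier] layout of the value. The
function name charset_from_env and the US-ASCII fallback are
illustrative assumptions. A locale without an explicit codeset really
gets its charset from the system's locale database, which the C library
exposes through setlocale() and nl_langinfo(CODESET); this sketch does
not consult it.

```python
import os


def charset_from_env(env=None, default="US-ASCII"):
    """Guess the user's charset from the locale environment variables."""
    env = os.environ if env is None else env

    # The first variable that is set and non-empty wins.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            break
    else:
        return default  # nothing set at all

    # Drop an "@modifier" suffix such as "@euro".
    value = value.split("@", 1)[0]

    # Anything after the first "." is the codeset.
    if "." in value:
        return value.split(".", 1)[1]

    # "C", "POSIX" or a bare "en_GB": no codeset given (simplification).
    return default


print(charset_from_env({"LANG": "en_GB.UTF-8"}))            # UTF-8
print(charset_from_env({"LANG": "de_DE.ISO-8859-1@euro"}))  # ISO-8859-1
print(charset_from_env({"LANG": "C"}))                      # US-ASCII
```

A program that decodes and encodes its terminal I/O with the result,
instead of a hard-coded charset, behaves the same way for ISO-8859-1
and UTF-8 users.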
elinks is one such program. It now assumes that displays are UTF-8
_only_. That's no better than programs which assume ISO-8859-1 only or
US-ASCII only.

So, in short, UTF-8 is all fine and dandy if your _entire_ universe
is UTF-8 enabled. If you're operating in a mixed charset environment
it's one bloody big pain in the butt.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: