From: Marc Lehmann <pcg@schmorp.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: viro@parcelfarce.linux.theplanet.co.uk,
Linux kernel <linux-kernel@vger.kernel.org>
Subject: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
Date: Mon, 16 Feb 2004 19:36:16 +0100
Message-ID: <20040216183616.GA16491@schmorp.de>
In-Reply-To: <Pine.LNX.4.58.0402141827200.14025@home.osdl.org>
[I may be a bit late in response, but AFAICS these points have not yet
been mentioned]
On Sat, Feb 14, 2004 at 06:41:20PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
[discussion on why UTF-8 is the only sane encoding, which I absolutely
agree with, removed]
> In short: the kernel talks bytestreams, and that implies that if you want
> to talk to the kernel, you HAVE TO USE UTF-8.
This is not the problem at all. It's perfectly easy to write
applications that talk UTF-8 and just UTF-8 with the kernel.
The problem is that the kernel does not use UTF-8, i.e. applications in
the current linux model have to deal with the fact that the kernel
happily breaks the assumed protocol of using UTF-8 by delivering illegal
byte sequences to applications.
There is no way for applications to handle UTF-8 and illegal UTF-8 in
a sane way, so most apps will either eat the illegal bytes, skip the
filename, or crash (the latter case is clearly a bug in the app, the
former cases aren't).
Fixing the VFS to actually enforce what Linus claims ("filenames are
utf-8") is a very good idea, imho.
As I understand it, the reason linux currently doesn't is that this utf-8
rule was obviously non-enforceable in practice in recent years, since
UTF-8 simply wasn't widespread (even today, applications such as bash or
grep are clearly not UTF-8 ready, as they start to crawl in UTF-8 locales
without special patches, and even with special patches).
So the only sane way to implement this enforcement is using an
additional mount flag, e.g. "force-utf8".
An encoding=xyz mount flag OTOH would be total overkill, as the plan
must be to switch to UTF-8 in the long run, while allowing deviating
behaviour in the short run.
Conversely, filesystems such as NTFS, VFAT etc. need to convert from the
fs encoding to UTF-8 and vice versa automatically, at least when this
flag is specified.
It should become the default in some future linux version.
> People understand the problem. And UTF-8 is the solution.
The kernel needs to fully implement it. Just as a kernel accepting:
open ("directory", O_WRONLY); write (dirfd, ...)...
open ("/some/file", ...)
mkdir ("../some/file", ...)
is considered rather broken behaviour from unix kernels (although these
might have been allowed in some dialects or versions of unix) today, this:
mkdir ("</ encoded using illegal multibyte sequence>", ...)
will be considered broken behaviour in the future. The RFC defining UTF-8
clearly considers this a bug in UTF-8 implementations, so the kernel
in fact does NOT implement UTF-8 right now. Although some people claim
that the kernel accepting UTF-8 (and more) is correct behaviour, it isn't
according to the RFC.
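[Editorial illustration, not part of the original message. A minimal
sketch of why an illegally encoded "/" matters, assuming a careless
two-byte decoder that skips the overlong check. The helper name
naive_decode2 is hypothetical.]

```c
/* Hypothetical helper: decodes a 2-byte UTF-8 sequence WITHOUT
 * rejecting overlong forms, as a careless decoder might. */
static unsigned naive_decode2(unsigned char b0, unsigned char b1)
{
    /* Take the 5 payload bits of the lead byte and the 6 payload
     * bits of the continuation byte. */
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}
```

With this decoder the illegal byte pair 0xC0 0xAF comes out as U+002F,
i.e. "/", even though no byte in the name equals 0x2F. A byte-wise path
walk and a decoding application would then disagree about where the
path separators are.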
> It's getting there. I think even Microsoft has seen the light, and is
> phasing out their crapola (UCS-2LE? Whatever).
Microsoft and Java officially use UTF-16 nowadays. The funny thing is
that "next character" iterators in both skip to the next 16-bit word,
as in UCS-2, so the claim of both parties of UTF-16 support is basically
a marketing lie.
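[Editorial illustration, not part of the original message. A minimal
sketch of the difference between stepping by 16-bit unit and stepping
by code point in UTF-16. The function name is hypothetical and no
particular vendor API is modelled.]

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in a UTF-16 buffer of n 16-bit units.
 * A high surrogate (D800..DBFF) followed by a low surrogate
 * (DC00..DFFF) is one code point; everything else, including an
 * unpaired surrogate, is counted as one unit. */
static size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;   /* surrogate pair: two units, one character */
        else
            i += 1;   /* BMP character or unpaired surrogate */
        count++;
    }
    return count;
}
```

For the three units 0x0041 0xD834 0xDD1E ("A" followed by U+1D11E) this
returns 2, whereas a loop that advances one 16-bit unit at a time sees
3 "characters".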
> No. Things like "iocharset" are not the solution. They are literally the
> _problem_. The solution is to use something that not only acts as ASCII,
[full agreement]
> And that one true format is UTF-8. End of story. If you try to talk to the
> kernel in UCS-2 or anything else, you _will_ fail.
Just that the kernel does not support UTF-8. It delivers and accepts
non-UTF-8 strings such as \xc0\x80. The kernel clearly should not deliver
broken characters when the official stance is that the linux VFS API is
UTF-8 only (see 3.2, Chapter 3, C12, conformance, on why it currently
isn't UTF-8).
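[Editorial illustration, not part of the original message. A minimal
sketch of the kind of check a "force-utf8" style enforcement would
need, assuming the well-formedness rules of RFC 3629: no overlong
forms, no surrogates, nothing above U+10FFFF. The function name
utf8_valid is hypothetical and this is userspace C, not kernel code.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the n bytes at s are well-formed UTF-8. */
static bool utf8_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t len, k;

        if (b < 0x80) {               /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;             /* stray continuation or 0xF8..0xFF */
        }

        if (n - i < len)
            return false;             /* sequence cut off by end of input */

        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;         /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return false;             /* overlong form, e.g. \xc0\x80 */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;             /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;             /* beyond the Unicode range */

        i += len;
    }
    return true;
}
```

Under these rules \xc0\x80 is rejected as an overlong encoding of
U+0000, while \xc3\xa4 (U+00E4) and plain ASCII names pass.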
--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / pcg@goof.com |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|