public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Jamie Lokier <jamie@shareable.org>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Marc Lehmann <pcg@schmorp.de>,
	viro@parcelfarce.linux.theplanet.co.uk,
	Linux kernel <linux-kernel@vger.kernel.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
Date: Tue, 17 Feb 2004 13:15:59 +0000	[thread overview]
Message-ID: <20040217131559.GA21842@mail.shareable.org> (raw)
In-Reply-To: <Pine.LNX.4.58.0402161447450.30742@home.osdl.org>

Linus Torvalds wrote:
> So I claim (and yes, people are free to disagree with me) that a
> well-written UTF-8 program won't even have any real extra code to handle
> the "broken UTF-8" code. It's just another set of bytes that needs
> escaping, and they need escaping for _exactly_ the same reason some 
> regular utf-8 characters need escaping: because they can't be printed.

Even XML suffers from these sorts of problems: some Unicode characters
aren't allowed in XML, even as numeric references, so in theory XML
applications have to reject or escape some strings.

> So it's all the same thing - it's just the reasons for "unprintability"  
> that are slightly different.

My difficulty with directories containing non-UTF-8 filenames shows up
with web pages in Perl, and not the printability part.  Please excuse
the Perl-oriented examples; Perl has good support for UTF-8 while also
working with arbitrary byte strings, so it's a fine language to
illustrate current problems.

What do you put in a URL composed from filenames in a directory
listing page?  The obvious thing is to %-escape each byte of the
names, in fact that's what everybody does.

In a language like Perl, where strings are labelled according to their
encoding, that means when you unscape the URL you get a string
labelled as "byte string".  You shouldn't tell Perl it's a "UTF-8
string" because some of them won't be (they are strings from
directories).

That's fine if you don't do anything except use those strings
unchanged, but as soon as you want to do something else like prepend a
character with code >= 256 or apply a regex where the pattern has
Unicode characters, Perl transcodes "byte string" to "UTF-8 string"
assuming it was latin1.  That, of course, mangles the string when it's
come from a source which is "nominally UTF-8 but might not be".

Your recommendation to simply pass around bytes all the time doesn't
work well, because to maintain basic properties of strings such as
length(a) + length(b) = length(a+b), that implies you either (1)
always do indexing, lengths, splitting etc. on strings as bytes not
characters, or (2) every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way.  (2) is
particularly nasty because then your program's logic can't depend on
the nice properties of UTF-8 strings.

That's why this line of Perl fails:

    for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }

(The source file, by the way, is assumed to be UTF-8-encoded text).

Perl reads each file name, and declares it to be of type "byte
string".  Then "ņi-" is prepended, which contains a character code >=
256, so the result must be UTF-8 encoded according to Perl.  The
original file name is transcoded from what was assumed to be
iso-8859-1 to UTF-8, "ņi-" is prepended, and that becomes the target
file name for rename().

This mangles the names; both UTF-8 and non-UTF-8 filenames are mangled
equally badly.

Your suggestion means that Perl should do bytewise concatenation of
the the "ņi-" (in UTF-8) and the filename (no encoding assumed).

It's a good one; it's exactly the right thing to do, and it works.

To do that in Perl, when you take a random byte string (such as from
readdir()) you must tell Perl it's a UTF-8 string, so shouldn't be
transcoded when it's combined with another UTF-8 string.  You can do
it, breaking documented rules of course (which say only do this when
you know it's valid UTF-8), with Encode::_utf8_on().

Guess what?  That actually works.  It does the filename operations
properly given any arbitrary filenames.

But remember I said "every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way" earlier,
and how this is bad because it's nice to depend on UTF-8 properties?

You've just told Perl to treat an arbitrary byte sequence as UTF-8,
when sometimes it isn't.  Among other things, simple operators like
length() and substr() don't work as expected on those weird strings.

When I say don't work as expected, I mean if you had a file named
"müeller" in latin1, Perl will think it's length() is 2.  If you have
a file named "müller", Perl will not only report a length() of 1,
it'll spew a horrible error message when it calculates it.

These aren't Perl problems.  These are problems that any program will
have if it follows your suggestion of "keep everything in bytes" but
wants to combine filenames with other text or do pattern matching on
filenames.

It's not a problem if you can pass around a flag with each byte
sequence, carefully keeping readdir() results separate from text until
the point where your prepared to have a policy saying what to do with
non-UTF-8 readdir() results.

But it is a problem when you want to stuff readdir() results in a
general purpose "string" which is also used for text.

That's technically the wrong thing to do, in all programs.  In
practice, that's what programs do anyway because it's a lot easier
than having different string types for different data sources.

Most times it works out ok, but for the corners:

> It's just _hard_ to think of all the special cases, and most
> programs have bugs because somebody forgot something.

Exactly.

-- Jamie

  reply	other threads:[~2004-02-17 13:16 UTC|newest]

Thread overview: 118+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
2004-02-14 14:27 ` JFS default behavior Nicolas Mailhot
2004-02-14 15:40   ` viro
2004-02-14 17:47     ` Nicolas Mailhot
2004-02-14 17:59       ` Nicolas Mailhot
2004-02-14 23:06     ` Robin Rosenberg
2004-02-14 23:29       ` viro
2004-02-15  0:07         ` Robin Rosenberg
2004-02-15  2:41           ` Linus Torvalds
2004-02-15  3:33             ` Matthias Urlichs
2004-02-15  4:04               ` viro
2004-02-15  9:48                 ` Robin Rosenberg
2004-02-15 18:26                 ` yodaiken
2004-02-18  2:48               ` Unicode normalization (userspace issue, but what the heck) H. Peter Anvin
2004-02-20  9:48                 ` Matthias Urlichs
2004-02-16 15:05             ` stty utf8 Jamie Lokier
2004-02-16 16:10               ` Gerd Knorr
2004-02-16 22:03               ` Jamie Lokier
2004-02-16 22:17                 ` Linus Torvalds
2004-02-16 22:04               ` Jamie Lokier
2004-02-16 18:36             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 18:49               ` Linus Torvalds
2004-02-16 19:26                 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
2004-02-16 19:48                   ` John Bradford
2004-02-16 19:48                     ` Linus Torvalds
2004-02-16 20:20                       ` Marc Lehmann
2004-02-16 20:26                         ` Linus Torvalds
2004-02-18  2:49                         ` Rob Landley
2004-02-16 20:21                       ` bert hubert
2004-02-16 20:33                         ` Marc Lehmann
2004-02-18  2:58                         ` H. Peter Anvin
2004-02-18  3:13                           ` Linus Torvalds
2004-02-18  3:22                             ` H. Peter Anvin
2004-02-18  3:30                               ` Linus Torvalds
2004-02-18  5:30                                 ` H. Peter Anvin
2004-02-18 10:29                                   ` Robin Rosenberg
2004-02-18 11:49                                     ` Tomas Szepe
2004-02-18 11:59                                       ` Robin Rosenberg
2004-02-18 12:05                                         ` Tomas Szepe
2004-02-18 12:34                                           ` Robin Rosenberg
2004-02-18 15:35                                   ` Linus Torvalds
2004-02-18 19:47                                     ` Tomas Szepe
2004-02-18 20:01                                       ` H. Peter Anvin
2004-02-18 21:22                                         ` Robin Rosenberg
2004-02-18 21:42                                           ` H. Peter Anvin
2004-02-18 11:24                               ` Jamie Lokier
2004-02-18 11:33                             ` Jamie Lokier
2004-02-18 16:47                               ` H. Peter Anvin
2004-02-18 19:59                               ` Linus Torvalds
2004-02-18 20:08                                 ` H. Peter Anvin
2004-02-18  7:25                           ` bert hubert
2004-02-16 20:16                     ` Marc Lehmann
2004-02-16 20:20                       ` Jeff Garzik
2004-02-16 21:10                       ` viro
2004-02-17  7:18                       ` jw schultz
2004-02-17  7:42                       ` Nick Piggin
2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 20:23                   ` Linus Torvalds
2004-02-16 20:58                     ` Marc Lehmann
2004-02-17 14:12                       ` Dave Kleikamp
2004-02-16 22:26                     ` Jamie Lokier
2004-02-16 22:40                       ` Linus Torvalds
2004-02-16 22:52                         ` Linus Torvalds
2004-02-17 13:15                           ` Jamie Lokier [this message]
2004-02-17  7:14                         ` Lehmann 
2004-02-17 11:20                           ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
     [not found]                             ` <20040217161111.GE8231@schmorp.de>
2004-02-17 16:32                               ` Linus Torvalds
2004-02-17 16:46                                 ` Jamie Lokier
2004-02-17 19:00                                   ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 20:57                                     ` Jamie Lokier
2004-02-17 21:06                                       ` Alex Belits
2004-02-17 21:47                                         ` Jamie Lokier
2004-02-22 15:32                                           ` Eric W. Biederman
2004-02-22 16:28                                             ` Jamie Lokier
2004-02-22 21:53                                               ` Eric W. Biederman
2004-02-18  7:23                                         ` Marc Lehmann
2004-02-17 21:23                                       ` Matthew Kirkwood
2004-02-18 13:11                                   ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Matthew Garrett
2004-02-17 16:52                                 ` Marc Lehmann
2004-02-17 16:54                                 ` UTF-8 practically vs. theoretically in the VFS API Stefan Smietanowski
2004-02-18  1:27                                   ` Hans Reiser
2004-02-18  2:08                                     ` Robin Rosenberg
2004-02-18 11:06                                       ` Jamie Lokier
2004-02-17 20:37                                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Robin Rosenberg
2004-02-17 16:36                             ` Jamie Lokier
2004-02-17 17:52                               ` viro
2004-02-17 19:29                                 ` Jamie Lokier
2004-02-17 19:45                                   ` Linus Torvalds
2004-02-17 20:30                                     ` Jamie Lokier
2004-02-17 20:49                                       ` Linus Torvalds
2004-02-17 21:17                                         ` Jamie Lokier
2004-02-17 19:51                                   ` Jamie Lokier
2004-02-17 19:53                                   ` viro
2004-02-17 20:35                                     ` John Bradford
2004-02-17 20:40                                       ` Jamie Lokier
2004-02-17 20:50                                         ` John Bradford
2004-02-17 21:04                                           ` Linus Torvalds
2004-02-17 21:16                                             ` John Bradford
2004-02-17 21:21                                               ` Linus Torvalds
2004-02-18  0:52                                                 ` John Bradford
2004-02-17 22:50                                               ` Robin Rosenberg
2004-02-18  6:48                                             ` Marc Lehmann
2004-02-17 20:47                                       ` viro
2004-02-17 20:53                                         ` John Bradford
2004-02-17 20:59                                       ` Linus Torvalds
2004-02-17 21:06                                         ` John Bradford
2004-02-17 21:42                                         ` Alex Belits
2004-02-18  6:56                                           ` Marc Lehmann
2004-02-18 20:37                                             ` Alex Belits
2004-02-18  3:11                                         ` H. Peter Anvin
2004-02-17 20:38                                     ` Jamie Lokier
2004-02-18  3:07                               ` H. Peter Anvin
2004-02-21 13:54                             ` Pavel Machek
2004-02-22 20:09                               ` H. Peter Anvin
2004-02-17  1:24                   ` Alex Belits
2004-02-17 21:09                     ` Jamie Lokier
2004-02-17 21:48                       ` Linus Torvalds
2004-02-17 22:19                       ` Alex Belits

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20040217131559.GA21842@mail.shareable.org \
    --to=jamie@shareable.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pcg@schmorp.de \
    --cc=torvalds@osdl.org \
    --cc=viro@parcelfarce.linux.theplanet.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox