From: Jamie Lokier <jamie@shareable.org>
To: Linus Torvalds <torvalds@osdl.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Date: Wed, 18 Feb 2004 11:33:38 +0000
Message-ID: <20040218113338.GH28599@mail.shareable.org>
In-Reply-To: <Pine.LNX.4.58.0402171910550.2686@home.osdl.org>
Linus Torvalds wrote:
> Somebody correctly pointed out that you do not need any out-of-band
> encoding mechanism - the very fact that it's an invalid sequence is in
> itself a perfectly fine flag. No out-of-band signalling required.
Technically this is almost(*) correct. However, a _lot_ of code exists
which assumes logical properties of UTF-8 (see, for example, the
"stty utf8" patch).
Perl, for example, allows you to pass around invalid sequences in
exactly the way you describe. It works, right up until you do
something like length() or substr() or a regex match. Then Perl
screws up the answer, because it sees something like 0xfd and just
assumes it can skip the next 5 bytes, without checking them.
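
A minimal sketch of that failure mode, in Python rather than Perl and
not taken from Perl's source. The function names are hypothetical, and
the lead-byte table assumes the original 1-6 byte UTF-8 scheme, in
which 0xfd announces a 6-byte sequence. The checked variant only
verifies continuation bytes; it does not reject overlong forms.

```python
def seq_len(lead: int) -> int:
    """Sequence length implied by a lead byte (original 1-6 byte UTF-8)."""
    if lead < 0x80:
        return 1          # ASCII
    if lead < 0xC0:
        return 1          # stray continuation byte, counted on its own
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 1              # 0xFE and 0xFF are never lead bytes


def naive_length(data: bytes) -> int:
    """Count characters by trusting the lead byte and skipping ahead."""
    count = 0
    i = 0
    while i < len(data):
        i += seq_len(data[i])   # the skipped bytes are never inspected
        count += 1
    return count


def checked_length(data: bytes) -> int:
    """Count characters, treating each malformed byte as one unit."""
    count = 0
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        tail = data[i + 1:i + n]
        if n > 1 and len(tail) == n - 1 and all(0x80 <= b < 0xC0 for b in tail):
            i += n              # well-formed multi-byte sequence
        else:
            i += 1              # ASCII, or a malformed byte counted alone
        count += 1
    return count
```

For the input b"\xfdhello", naive_length swallows all of "hello" as if
it were continuation bytes and reports 1, while checked_length reports
6. The two agree on well-formed input.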
hpa's suggestion that invalid bytes are treated as 0x800000xx works
very nicely, *iff* a program is absolutely consistent about its
treatment of bytes in that way. When there's a mixture of code which
interprets malformed UTF-8 in different ways, then it's messy and
sometimes a security hazard.
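
One way to picture the 0x800000xx mapping is the Python sketch below.
It is an illustration under assumptions, not hpa's code: the function
names are hypothetical, and "invalid" here means whatever Python's
strict UTF-8 decoder rejects, which is stricter than the rules in use
in 2004.

```python
from typing import List

ESCAPE = 0x80000000  # invalid byte 0xXX becomes code 0x800000XX


def decode_with_escapes(data: bytes) -> List[int]:
    """Decode to integer codes, mapping each invalid byte to ESCAPE | byte."""
    out: List[int] = []
    pos = 0
    while pos < len(data):
        try:
            out.extend(ord(c) for c in data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into data[pos:]
            good = data[pos:pos + err.start].decode("utf-8")
            out.extend(ord(c) for c in good)
            bad = data[pos + err.start:pos + err.end]
            out.extend(ESCAPE | b for b in bad)
            pos += err.end
    return out


def encode_with_escapes(codes: List[int]) -> bytes:
    """Inverse of decode_with_escapes: escaped codes become raw bytes again."""
    out = bytearray()
    for cp in codes:
        if cp >= ESCAPE:
            out.append(cp & 0xFF)
        else:
            out.extend(chr(cp).encode("utf-8"))
    return bytes(out)
```

As long as every piece of code goes through the same pair of
functions, arbitrary byte strings round-trip unchanged, and length or
substring operations on the integer list never skip unchecked bytes.
The trouble described above begins when only some of the code does.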
-- Jamie
(*) - It's fine until you concatenate two malformed strings. Then the
out-of-band signal is lost if the combination is valid UTF-8.
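
A small Python illustration of the footnote. The helper name and the
sample strings are made up for the example; validity is judged by
Python's strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


left = b"caf\xc3"     # ends with a lead byte that has no continuation
right = b"\xa9"       # a lone continuation byte
joined = left + right  # 0xc3 0xa9 is now a well-formed 2-byte sequence
```

Each half is malformed on its own, so each carries the "invalid
sequence" flag. The concatenation is well-formed, and nothing in the
bytes records that two malformed strings were joined.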