Re: JFS default behavior

public inbox for linux-kernel@vger.kernel.org

* Re: JFS default behavior
       [not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
@ 2004-02-14 14:27 ` Nicolas Mailhot
  2004-02-14 15:40   ` viro
  0 siblings, 1 reply; 120+ messages in thread
From: Nicolas Mailhot @ 2004-02-14 14:27 UTC (permalink / raw)
  To: chris.siebenmann; +Cc: linux-kernel

Chris Siebenmann wrote:

> You write:
> | So what ?
> | Do you think an app that expects utf-8 filenames won't crash today when
> | served a byte sequence that's invalid UTF-8 ? (or an app that expects
> | ascii when served utf-8 oddities)
> 
>  Such apps are buggy and need to be fixed. 

Well, this means every single java app right now at least.

> This is not Unix's problem,

The Y2K problem was mostly at the app level.
It would not have been the OS's responsibility to fix it.
*However*, since the unix time conventions were a bit more sane than 
those of other OSes, the damage was less.

> any more than it is Unix's problem if an application frees memory twice,
> writes over unallocated memory, or destroys its stack.

The core OS responsibility is to share resources sanely between apps.
Filenames are a shared resource.
When encodings start to be incompatible, resulting in application 
crashes, it's the OS's job to define and enforce sane conventions so apps 
can coexist.

Past oversights should not mean the problem should not be fixed 
(especially if solutions exist, even if they are not totally painless).

There is no more justification to keep encoding undefined than there is to 
keep time zone undefined. Last I've seen we're all pretty happy system 
time actually means something on unix (unlike other systems where it can 
be anything depending on the location where the initial installation was 
performed).

>  If all you care about is the future, you need no kernel support.
> Declare that all filesystem names are written in UTF-8, and make your
> tools deal with it. (Most will not care. A few will have to be fixed a
> bit.)

Tools won't change unless they're forced to. That's a plain fact.
As you wrote there shouldn't be a lot of fixups to do, since apps that 
can't deal with utf-8 now use ascii-only filenames anyway, but the few 
fixups that are needed won't happen without a little OS prodding.

(and without OS enforcement illegal utf-8 filename injection will remain 
a security risk)

And I write utf8 here, but any unicode form is fine with me as long as 
it's clearly defined and enforced by the FSs.

Cheers,

-- 
Nicolas Mailhot


* Re: JFS default behavior
  2004-02-14 14:27 ` JFS default behavior Nicolas Mailhot
@ 2004-02-14 15:40   ` viro
  2004-02-14 17:47     ` Nicolas Mailhot
  2004-02-14 23:06     ` Robin Rosenberg
  0 siblings, 2 replies; 120+ messages in thread
From: viro @ 2004-02-14 15:40 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: chris.siebenmann, linux-kernel

On Sat, Feb 14, 2004 at 03:27:50PM +0100, Nicolas Mailhot wrote:
> There is no more justification to keep encoding undefined than there is to 
> keep time zone undefined. Last I've seen we're all pretty happy system 
> time actually means something on unix (unlike other systems where it can 
> be anything depending on the location where the initial installation was 
> performed).

"System time" is amount of time elapsed since the epoch.  Period.  What does
it have to do with any timezone?

The only place where timezone enters the picture is conversion of time to
year:month:day:hours:minutes:seconds and that's
	a) process-dependent and
	b) done outside of kernel

The same goes for file names.  Filename is a sequence of bytes, no more and
no less.  Anything beyond that belongs to applications.
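
The timezone point can be illustrated with a minimal Python sketch. It assumes
nothing beyond the standard library; the helper name render() and the fixed
UTC offsets are made up for illustration. One epoch value is rendered three
ways, and only the user-space conversion differs:

```python
from datetime import datetime, timedelta, timezone


def render(epoch_seconds: int, utc_offset_hours: int) -> str:
    """Format one epoch value for a given (hypothetical) per-process offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=tz).isoformat()


stamp = 1076768870  # seconds since the epoch; the stored value never changes

# Same number, three presentations chosen entirely outside the kernel.
for offset in (0, 1, -5):
    print(offset, render(stamp, offset))
```

The stored number plays the role of the filename bytes in the argument above:
it is the same for everyone, and only the rendering is per-process.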


* Re: JFS default behavior
  2004-02-14 15:40   ` viro
@ 2004-02-14 17:47     ` Nicolas Mailhot
  2004-02-14 17:59       ` Nicolas Mailhot
  2004-02-14 23:06     ` Robin Rosenberg
  1 sibling, 1 reply; 120+ messages in thread
From: Nicolas Mailhot @ 2004-02-14 17:47 UTC (permalink / raw)
  To: viro; +Cc: chris.siebenmann, linux-kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Sat, Feb 14, 2004 at 03:27:50PM +0100, Nicolas Mailhot wrote:
> 
>>There is no more justification to keep encoding undefined than there is to 
>>keep time zone undefined. Last I've seen we're all pretty happy system 
>>time actually means something on unix (unlike other systems where it can 
>>be anything depending on the location where the initial installation was 
>>performed).
> 
> 
> "System time" is amount of time elapsed since the epoch.  Period.  What does
> it have to do with any timezone?

And everyone agrees on the epoch and that's why it works.

(just like sensors output is not just any numerical value but has a 
well-defined unit)

With filenames we have a value but what it means exactly is a matter of 
conjecture. That's the problem.
(it wouldn't be if filenames were just magic cookies that never needed 
to be interpreted but there's a lot of actors, be it apps or humans that 
need to agree on what the byte string)

Cheers,

-- 
Nicolas Mailhot




* Re: JFS default behavior
  2004-02-14 17:47     ` Nicolas Mailhot
@ 2004-02-14 17:59       ` Nicolas Mailhot
  0 siblings, 0 replies; 120+ messages in thread
From: Nicolas Mailhot @ 2004-02-14 17:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: viro, chris.siebenmann

Nicolas Mailhot wrote:

> to be interpreted but there's a lot of actors, be it apps or humans that 
> need to agree on what the byte string)

... actually means

(bad proofreading, sorry)

-- 
Nicolas Mailhot




* Re: JFS default behavior
  2004-02-14 15:40   ` viro
  2004-02-14 17:47     ` Nicolas Mailhot
@ 2004-02-14 23:06     ` Robin Rosenberg
  2004-02-14 23:29       ` viro
  1 sibling, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-14 23:06 UTC (permalink / raw)
  To: viro; +Cc: Linux kernel

On Saturday 14 February 2004 16.40, you wrote:
> The same goes for file names.  Filename is a sequence of bytes, no more and
> no less.  Anything beyond that belongs to applications.

Should be a sequence of characters since humans are supposed to use them and
it should be the same characters whenever possible regardless of user's locale.

The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
was true. That is no longer the case and actually hasn't been for quite a while in
some parts of the world. Interchange is important. The application cannot handle
this since it cannot know what characters a byte string represents. Fixing it in the
kernel is the simple solution since it knows the locale. It's also a small change, I
believe. Having an iocharset option for all file systems makes it backward compatible
and creates a migration path to UTF-8 as the system default locale.

-- robin


* Re: JFS default behavior
  2004-02-14 23:06     ` Robin Rosenberg
@ 2004-02-14 23:29       ` viro
  2004-02-15  0:07         ` Robin Rosenberg
  0 siblings, 1 reply; 120+ messages in thread
From: viro @ 2004-02-14 23:29 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> On Saturday 14 February 2004 16.40, you wrote:
> > The same goes for file names.  Filename is a sequence of bytes, no more and
> > no less.  Anything beyond that belongs to applications.
> 
> Should be a sequence of characters since humans are supposed to use them and
> it should be the same characters whenever possible regardless of user's locale.
 
> The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
> was true.

Bullshit.  It has _nothing_ to do with characters, wide or not.  For the system,
filenames are opaque.  The only things that have special meanings are:
	octet 0x2f ('/') splits the pathname into components
	"." as a component has a special meaning
	".." as a component has a special meaning.
That's it.  The rest is never interpreted by the kernel.
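
A minimal Python sketch of the rule as stated above, not of the kernel's
actual lookup code: the helpers split_components() and is_special() are
hypothetical names. A pathname is split on octet 0x2f and every other byte is
treated as opaque:

```python
def split_components(path: bytes) -> list[bytes]:
    """Split a pathname on octet 0x2f ('/'), dropping empty components."""
    return [component for component in path.split(b"/") if component]


def is_special(component: bytes) -> bool:
    """Only "." and ".." carry meaning; everything else is just bytes."""
    return component in (b".", b"..")


# The bytes 0xff, 0xfe and 0xf1 are not valid UTF-8 here; the split does not care.
path = b"/home/\xff\xfe-not-utf8/../el-ni\xf1o.txt"
for component in split_components(path):
    print(component, "special" if is_special(component) else "opaque")
```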

> Having an iocharset option for all file systems makes it backward compatible
> and creates a migration path to UTF-8 as system default locale.

Try to realize that different users CAN HAVE DIFFERENT LOCALES.  On the same
system.  And have files on the same fs.  Moreover, homedirs that used to be
on different filesystems can end up on the same fs.  What iocharset would
you use, then?  Sigh...

Again, there is no such thing as iocharset of filesystem - it varies between
users and users can and do share filesystems.  Think of /home; think of /tmp.

It isn't feasible.  At all.  Just as timezone doesn't belong in kernel, locales
have no place there.


* Re: JFS default behavior
  2004-02-14 23:29       ` viro
@ 2004-02-15  0:07         ` Robin Rosenberg
  2004-02-15  2:41           ` Linus Torvalds
  0 siblings, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-15  0:07 UTC (permalink / raw)
  To: viro; +Cc: Linux kernel

On Sunday 15 February 2004 00.29, you wrote:
> On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> > The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
> > was true.
> 
> Bullshit.  It has _nothing_ to do with characters, wide or not.  For the system,
> filenames are opaque.  The only things that have special meanings are:
> 	octet 0x2f ('/') splits the pathname into components
> 	"." as a component has a special meaning
> 	".." as a component has a special meaning.
> That's it.  The rest is never interpreted by the kernel.
I know how it is (to some degree), and it's wrong. The user sees inside the filename
and sees a string of characters, not a byte sequence.

> Try to realize that different users CAN HAVE DIFFERENT LOCALES.  On the same
> system.  And have files on the same fs.  Moreover, homedirs that used to be
> on different filesystems can end up on the same fs.  What iocharset would
> you use, then?  Sigh...
Ok, I've got the iocharset option wrong, god knows why. The problem 
however remains.

It seems you simply don't want to understand the problem, which is that users 
CAN HAVE DIFFERENT LOCALES on the same system and on different systems. 
Sigh...

I'm less concerned with which solution is chosen than that a solution should be found. So it
seems no file system has a solution today. Still, an iocharset option would relieve
the problem for removable media and multi-boot systems. Most linux machines
are essentially single user and have either the same locale for all users or all
users are using UTF-8 with their locale. It's not the locale, but the charset used
for encoding the locale. The rest cannot be helped.

-- robin


* Re: JFS default behavior
  2004-02-15  0:07         ` Robin Rosenberg
@ 2004-02-15  2:41           ` Linus Torvalds
  2004-02-15  3:33             ` Matthias Urlichs
                               ` (2 more replies)
  0 siblings, 3 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-15  2:41 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: viro, Linux kernel

On Sun, 15 Feb 2004, Robin Rosenberg wrote:
> > 
> > Bullshit.  It has _nothing_ to do with characters, wide or not.  For the system,
> > filenames are opaque.  The only things that have special meanings are:
> > 	octet 0x2f ('/') splits the pathname into components
> > 	"." as a component has a special meaning
> > 	".." as a component has a special meaning.
> > That's it.  The rest is never interpreted by the kernel.
>
> I know how it is (to some degree), and it's wrong. The user sees inside the filename
> and sees a string of characters, not a byte sequence.

Yes, the user sees a string of characters, but the octet 0x2f ('/') and 
the terminating NUL character '\0' are still perfectly normal characters 
and there is no confusion.

The reason: UTF-8. It's the only sane encoding (apart from a pure extended
ASCII setup, which is also sane, but is obviously unacceptable for a large
portion of the world).

If some misguided person has told you about UCS-2 and horrors like UTF-9,
just ignore them. They are crazy and deluded, and - perhaps more
importantly - stupid.

In short: the kernel talks bytestreams, and that implies that if you want 
to talk to the kernel, you HAVE TO USE UTF-8.

At which point there are no locale issues any more. The only locale issue 
you can have is user space mistaking a stream of bytes as extended ASCII, 
which will cause all your pretty UTF-8 characters to be shown as strange 
latin1 (or other) squiggles.
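
The "squiggles" effect is easy to reproduce. The following Python sketch uses
an arbitrary example name and assumes only the standard codecs; it shows the
same bytes read under the wrong policy, and that no information is lost by
the misreading:

```python
name = "el-niño.txt"

raw = name.encode("utf-8")        # the byte stream user space hands to the kernel
misread = raw.decode("latin-1")   # user space wrongly treating it as extended ASCII

print(raw)
print(misread)  # the two-byte UTF-8 sequence shows up as two Latin-1 characters

# The bytes themselves are unchanged, so the misreading is fully reversible.
assert misread.encode("latin-1") == raw
```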

> It seems you simply don't want to understand the problem, which is that users 
> CAN HAVE DIFFERENT LOCALES on the same system and on different systems. 
> Sigh...

People understand the problem. And UTF-8 is the solution.

It's getting there. I think even Microsoft has seen the light, and is
phasing out their crapola (UCS-2LE? Whatever). 

> I'm less concerned with which solution is chosen than that a solution should be found. So it
> seems no file system has a solution today. Still, an iocharset option would relieve
> the problem for removable media and multi-boot systems.

No. Things like "iocharset" are not the solution. They are literally the
_problem_. The solution is to use something that not only acts as ASCII,
but also has a wide enough range to cover the whole required space (UCS-2
fails _both_ of these fundamental tests). At which point "iocharset" makes 
no sense any more, and only exists as a way to translate legacy crap into 
the one true format.

And that one true format is UTF-8. End of story. If you try to talk to the 
kernel in UCS-2 or anything else, you _will_ fail.

			Linus


* Re: JFS default behavior
  2004-02-15  2:41           ` Linus Torvalds
@ 2004-02-15  3:33             ` Matthias Urlichs
  2004-02-15  4:04               ` viro
  2004-02-18  2:48               ` Unicode normalization (userspace issue, but what the heck) H. Peter Anvin
  2004-02-16 15:05             ` stty utf8 Jamie Lokier
  2004-02-16 18:36             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  2 siblings, 2 replies; 120+ messages in thread
From: Matthias Urlichs @ 2004-02-15  3:33 UTC (permalink / raw)
  To: linux-kernel

Hi, Linus Torvalds wrote:

> In short: the kernel talks bytestreams, and that implies that if you want
> to talk to the kernel, you HAVE TO USE UTF-8.
> 
> At which point there are no locale issues any more.

Not locale, but normalization problems and identical-glyph problems.

Which is actually worse, because you don't have filenames which look
like crap -- instead you have filenames which look perfectly sane, but
they still do not work. Example: is an á one character, or is it an a
followed by a composing ´?
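
A short Python sketch of that example, using the standard unicodedata module.
The helper same_name() is a hypothetical user-space comparison, not something
the kernel does. Both spellings are valid UTF-8 and render identically, yet
they are different byte strings:

```python
import unicodedata

composed = "\u00e1"      # á as one code point (LATIN SMALL LETTER A WITH ACUTE)
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT

print(composed.encode("utf-8"))    # two bytes
print(decomposed.encode("utf-8"))  # three bytes: a byte-wise lookup will not match


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to one form (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed, same_name(composed, decomposed))
```

NFC is chosen here only for illustration; the question of which normalization
to standardize on is exactly what is left open above.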

Mac OSX, just as an example, only uses decomposed filenames. I don't know
the current situation, but 10.2 has major problems when you try to access
files with composite characters in their name (across NFS for instance).

I wonder if Linux, i.e. Linus ;-) should decree one single standard
normalization. (I am NOT saying that enforcing this would be the kernel's
job!)

-- 
Matthias Urlichs


* Re: JFS default behavior
  2004-02-15  3:33             ` Matthias Urlichs
@ 2004-02-15  4:04               ` viro
  2004-02-15  9:48                 ` Robin Rosenberg
  2004-02-15 18:26                 ` yodaiken
  2004-02-18  2:48               ` Unicode normalization (userspace issue, but what the heck) H. Peter Anvin
  1 sibling, 2 replies; 120+ messages in thread
From: viro @ 2004-02-15  4:04 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: linux-kernel

On Sun, Feb 15, 2004 at 04:33:48AM +0100, Matthias Urlichs wrote:

> Mac OSX, just as an example, only uses decomposed filenames.

So how long does it take for a filename to decompose? ;-)


* Re: JFS default behavior
  2004-02-15  4:04               ` viro
@ 2004-02-15  9:48                 ` Robin Rosenberg
  2004-02-15 18:26                 ` yodaiken
  1 sibling, 0 replies; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-15  9:48 UTC (permalink / raw)
  To: viro; +Cc: Linux kernel

On Sunday 15 February 2004 05.04, you wrote:
> On Sun, Feb 15, 2004 at 04:33:48AM +0100, Matthias Urlichs wrote:
> 
> > Mac OSX, just as an example, only uses decomposed filenames.
> 
> So how long does it take for a filename to decompose?

As long as it takes to switch locale to UTF-8 :) or vice versa.

-- robin



* Re: JFS default behavior
  2004-02-15  4:04               ` viro
  2004-02-15  9:48                 ` Robin Rosenberg
@ 2004-02-15 18:26                 ` yodaiken
  1 sibling, 0 replies; 120+ messages in thread
From: yodaiken @ 2004-02-15 18:26 UTC (permalink / raw)
  To: viro; +Cc: Matthias Urlichs, linux-kernel

On Sun, Feb 15, 2004 at 04:04:58AM +0000, viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Sun, Feb 15, 2004 at 04:33:48AM +0100, Matthias Urlichs wrote:
> 
> > Mac OSX, just as an example, only uses decomposed filenames.
> 
> So how long does it take for a filename to decompose? ;-)

Depends on whether it is junk or not.





* stty utf8
  2004-02-15  2:41           ` Linus Torvalds
  2004-02-15  3:33             ` Matthias Urlichs
@ 2004-02-16 15:05             ` Jamie Lokier
  2004-02-16 16:10               ` Gerd Knorr
                                 ` (2 more replies)
  2004-02-16 18:36             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  2 siblings, 3 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-16 15:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux kernel

Linus Torvalds wrote:
> People understand the problem. And UTF-8 is the solution.

Linus, I agree 100%.
My own filesystems have UTF-8 file names, of course.

There are still practical problems, two of which stand out:

1. Just because you hope a filesystem is UTF-8, does not preclude
   readdir() from returning non-UTF-8 names.  (These are far too
   easy to create by accident).

   Because of that, programs which interpret the result of
   readdir() as text, yet are expected to handle any name without
   silently rejecting them or aborting, are forced into strange
   compromises which break basic expectations.

   Spot the bug in this perl script:

     perl -e 'for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }'

   (NB: The prefix string is N WITH CEDILLA followed by "i-").

   (Hint: it mangles perfectly fine non-ASCII file names).

   Perl has no perfect behaviour to offer, because what should that
   behaviour be if readdir() might return a non-UTF-8 byte sequence
   as a name?

2. Terminals are not all UTF-8, and some never will be.

   So when someone types something like this on a non-UTF-8
   terminal, they get a non-UTF-8 filename:

       vi el-niño.txt

   It isn't just a problem of display.  Now you have created a
   filename which isn't valid UTF-8, and GUI programs may complain,
   perhaps refusing to let you select the file.

   Furthermore, how exactly do you expect a user to use UTF-8 on
   the filesystem when their terminal is not (or sometimes is not)
   using UTF-8?
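
The first point can be sketched in Python. This is an illustration of one
plausible way such mangling arises, not a diagnosis of the Perl script above;
the helper names are hypothetical. Working at the byte level leaves any name
intact, and the surrogateescape error handler lets a program show a name as
text without losing the original bytes:

```python
PREFIX = "ņi-"  # N WITH CEDILLA followed by "i-", as in the script above


def add_prefix(name: bytes) -> bytes:
    """Prepend the prefix at the byte level; the name's bytes are untouched."""
    return PREFIX.encode("utf-8") + name


def add_prefix_mangling(name: bytes) -> bytes:
    """Buggy variant: misreads the name as Latin-1 text before re-encoding."""
    return (PREFIX + name.decode("latin-1")).encode("utf-8")


def roundtrip(name: bytes) -> bytes:
    """Decode for display, then recover exactly the bytes we started with."""
    text = name.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


utf8_name = "el-niño.txt".encode("utf-8")  # valid UTF-8
latin1_name = b"el-ni\xf1o.txt"            # what a Latin-1 terminal produces

for name in (utf8_name, latin1_name):
    print(add_prefix(name),
          add_prefix_mangling(name) == add_prefix(name),
          roundtrip(name) == name)
```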

==> This problem would be very nicely solved with an additional
    terminal flag.  We have "stty ocrnl", "onlcr", "igncr" etc. to
    translate between terminal line endings and the unix convention of
    LF at the end of each line.  Why not create "stty utf8" so that
    non-UTF-8 terminals and UTF-8 terminals alike can work with a
    Linux convention that all programs enter and display UTF-8?  It
    would simplify a lot of things.
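
    A rough user-space sketch of the translation such a flag would perform,
    in Python. The "stty utf8" flag is only the proposal above; the function
    names and the Latin-1 default below are assumptions for illustration, and
    a real line discipline would also have to cope with multi-byte sequences
    split across reads:

```python
def terminal_to_utf8(data: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Input direction: bytes typed on the terminal become UTF-8 for programs."""
    return data.decode(terminal_encoding).encode("utf-8")


def utf8_to_terminal(data: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Output direction: UTF-8 from programs becomes the terminal's encoding.

    Characters the terminal cannot show are replaced with "?".
    """
    return data.decode("utf-8").encode(terminal_encoding, errors="replace")


typed = b"vi el-ni\xf1o.txt\n"          # as sent by a Latin-1 terminal
seen_by_program = terminal_to_utf8(typed)
print(seen_by_program)                   # now a valid UTF-8 command line
print(utf8_to_terminal(seen_by_program) == typed)
```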

-- Jamie


* Re: stty utf8
  2004-02-16 15:05             ` stty utf8 Jamie Lokier
@ 2004-02-16 16:10               ` Gerd Knorr
  2004-02-16 22:03               ` Jamie Lokier
  2004-02-16 22:04               ` Jamie Lokier
  2 siblings, 0 replies; 120+ messages in thread
From: Gerd Knorr @ 2004-02-16 16:10 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> 2. Terminals are not all UTF-8, and some never will be.

> ==> This problem would be very nicely solved with an additional
>     terminal flag.  We have "stty ocrnl", "onlcr", "igncr" etc. to
>     translate between terminal line endings and the unix convention of
>     LF at the end of each line.  Why not create "stty utf8" so that
>     non-UTF-8 terminals and UTF-8 terminals alike can work with a
>     Linux convention that all programs enter and display UTF-8?  It
>     would simplify a lot of things.

It's probably possible to extend luit doing that too.  luit comes with
recent xfree86 releases and does utf-8 <=> locale conversion.  Right
now it does just the opposite:  let people use non-utf8 locales in a
utf-8 xterm.

  Gerd

-- 
The point is that a bunch of script kiddies are busy reinventing USENET in
color, and are currently making all over again a pile of mistakes that
have not been a subject of research for 20 years.
	-- Kristian Köhntopp on blogs and bloggers


* UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-15  2:41           ` Linus Torvalds
  2004-02-15  3:33             ` Matthias Urlichs
  2004-02-16 15:05             ` stty utf8 Jamie Lokier
@ 2004-02-16 18:36             ` Marc Lehmann
  2004-02-16 18:49               ` Linus Torvalds
  2 siblings, 1 reply; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 18:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Linux kernel

[I may be a bit late in response, but AFAICS these points have not yet
been mentioned]

On Sat, Feb 14, 2004 at 06:41:20PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
[discussion on why UTF-8 is the only sane encoding, which I absolutely
agree with, removed]

> In short: the kernel talks bytestreams, and that implies that if you want 
> to talk to the kernel, you HAVE TO USE UTF-8.

This is not the problem at all. It's perfectly easy to write
applications that talk UTF-8 and just UTF-8 with the kernel.

The problem is that the kernel does not use UTF-8, i.e. applications in
the current linux model have to deal with the fact that the kernel
happily breaks the assumed protocol of using UTF-8 by delivering illegal
byte sequences to applications.

There is no way for applications to handle UTF-8 and illegal-utf8 in
a sane way, so most apps will either eat the illegal bytes, skip the
filename, or crash (the latter case is clearly a bug in the app, the
former cases aren't).

Fixing the VFS to actually enforce what Linus claims ("filenames are
utf-8") is a very good idea, imho.

As I understand it, the reason linux currently doesn't is that this utf-8
rule was obviously unenforceable in practice in recent years, since
UTF-8 simply wasn't widespread (even today, applications such as bash or
grep are clearly not UTF-8 ready, as they start to crawl in UTF-8 locales
without special patches, and even with special patches).

So the only sane way to implement this enforcement is using an
additional mount flag, e.g. "force-utf8".

An encoding=xyz mount flag OTOH would be total overkill, as the plan
must be to switch to UTF-8 in the long run, while allowing deviating
behaviour in the short run.

Conversely, filesystems such as NTFS, VFAT etc. need to convert from the
fs encoding to UTF-8 and vice versa automatically, at least when this
flag is specified.

It should become the default in some future linux version.

> People understand the problem. And UTF-8 is the solution.

The kernel needs to fully implement it. Just as a kernel accepting:

   open ("directory", O_WRONLY); write (dirfd, ...)...
   open ("/some/file", ...)
   mkdir ("../some/file", ...)

is considered rather broken behaviour from unix kernels (although these
might have been allowed in some dialects or versions of unix) today, this:

   mkdir ("</ encoded using illegal multibyte sequence>", ...)

will be considered broken behaviour in the future. The RFC defining UTF-8
clearly considers this a bug in UTF-8 implementations, and the kernel
in fact does NOT implement UTF-8 right now. Some people claim
that the kernel accepting UTF-8 (and more) is correct behaviour, but it isn't
according to the RFC.

> It's getting there. I think even Microsoft has seen the light, and is
> phasing out their crapola (UCS-2LE? Whatever). 

Microsoft and Java officially use UTF-16 nowadays. The funny thing is
that "next character" iterators in both languages skip to the next word
in UCS-2, so the claim of both parties of UTF-16 support is basically a
marketing lie.

> No. Things like "iocharset" are not the solution. They are literally the
> _problem_. The solution is to use something that not only acts as ASCII,
[full agreement]

> And that one true format is UTF-8. End of story. If you try to talk to the 
> kernel in UCS-2 or anything else, you _will_ fail.

Just that the kernel does not support UTF-8. It delivers and accepts
non-UTF-8 strings such as \xc0\x80. The kernel clearly should not deliver
broken characters when the official stance is that the linux VFS API is
UTF-8 only (see 3.2, Chapter 3, C12, conformance, on why it currently
isn't UTF-8).
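
As a user-space illustration of the check being asked for, here is a minimal
Python sketch. It assumes Python's strict "utf-8" codec as the validator; the
function name is hypothetical, and nothing here reflects an existing kernel
or mount option. Overlong forms such as \xc0\x80 are rejected:

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True only if the byte string is well-formed UTF-8."""
    try:
        name.decode("utf-8")  # strict by default: overlong forms are errors
    except UnicodeDecodeError:
        return False
    return True


candidates = {
    "plain ASCII": b"notes.txt",
    "valid UTF-8": "el-niño.txt".encode("utf-8"),
    "stray Latin-1 byte": b"el-ni\xf1o.txt",
    "overlong NUL": b"\xc0\x80",
    "overlong '/'": b"\xc0\xaf",
}

for label, name in candidates.items():
    print(f"{label:20s} {is_valid_utf8(name)}")
```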

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 18:36             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
@ 2004-02-16 18:49               ` Linus Torvalds
  2004-02-16 19:26                 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
  2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  0 siblings, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 18:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
> 
> > In short: the kernel talks bytestreams, and that implies that if you want 
> > to talk to the kernel, you HAVE TO USE UTF-8.
> 
> This is not the problem at all. It's perfectly easy to write
> applications that talk UTF-8 and just UTF-8 with the kernel.
> 
> The problem is that the kernel does not use UTF-8, i.e. applications in
> the current linux model have to deal with the fact that the kernel
> happily breaks the assumed protocol of using UTF-8 by delivering illegal
> byte sequences to applications.

You didn't read what I said.

READ MY POSTING. You even quoted it, but you didn't understand it.

I'm saying that "the kernel talks bytestreams".

I have never claimed that the kernel really talks UTF-8, and indeed, I 
would say that such a kernel would be terminally and horribly broken. 

The kernel is _agnostic_ in what it does. As it should be. It doesn't 
really care AT ALL what you feed it, as long as it is a byte-stream.

Now, that implies that if you want to have extended characters, then YOU 
HAVE TO USE UTF-8.

That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. The 
kernel doesn't care one way or the other. As far as the kernel is 
concerned, you could uuencode all the stuff, and the kernel wouldn't 
think you're crazy. The kernel _only_ cares about byte streams. And that 
is as it should be.

> There is no way for applications to handle UTF-8 and illegal-utf8 in
> a sane way, so most apps will either eat the illegal bytes, skip the
> filename, or crash (the latter case is clearly a bug in the app, the
> former cases aren't).

What you're complaining about are bad user applications. It has _zero_ to 
do with the kernel.

> Fixing the VFS to actually enforce what linus claims ("filenames are
> utf-8") is a very good idea, imho.

No. Read my claim again. You obviously do not understand it AT ALL. 

What you suggest would be a horribly idiotic and bad idea. The kernel 
doesn't set policy. The kernel says "this is what I can do, you set 
policy".

And UTF-8 just happens to be the only sane policy for encoding complex 
characters into a byte stream. But it is not the only policy.

Another sane policy is to say "byte streams are latin1". It's not an
acceptable policy for encoding _complex_ characters, but it is a policy.
And it's a perfectly sane one.

In short: filenames are byte streams. Nothing more. They don't even have a 
"character set". They literally are just a series of bytes.

And when I say that you have to talk to the kernel using UTF-8, I'm only 
claiming that it is the only sane way to encode extended characters in a 
byte stream. Nothing more.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 18:49               ` Linus Torvalds
@ 2004-02-16 19:26                 ` Jeff Garzik
  2004-02-16 19:48                   ` John Bradford
  2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  1 sibling, 1 reply; 120+ messages in thread
From: Jeff Garzik @ 2004-02-16 19:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Linus Torvalds wrote:
> In short: filenames are byte streams. Nothing more. They don't even have a 
> "character set". They literally are just a series of bytes.
> 
> And when I say that you have to talk to the kernel using UTF-8, I'm only 
> claiming that it is the only sane way to encode extended characters in a 
> byte stream. Nothing more.

Nod.  Maybe it helps Marc to point out the key difference between 
characters and bytes, in UTF8.

In UTF8, the number of characters in a string is less-than-or-equal-to 
the number of bytes in the string.

And the kernel just cares about bytes.

This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
designed such that ten-year-old C code using standard C strings would 
function just fine.  No need to rip up large swaths of your code just to 
call multi-byte versions of the standard string functions.  Most code 
that doesn't deal with locale-specific details like uppercase/lowercase 
Just Works(tm).
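
Both properties can be checked with a few lines of Python. This is a sketch
with arbitrary sample names; "characters" is approximated by code points
here, and high_bits_only() is a hypothetical helper. The count never exceeds
the byte count, and no byte of a multi-byte sequence can be mistaken for an
ASCII byte such as '/':

```python
def high_bits_only(text: str) -> bool:
    """True if every byte of every multi-byte sequence is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


samples = ["plain.txt", "el-niño.txt", "dir/ņi-файл", "日本語/ファイル"]

for s in samples:
    raw = s.encode("utf-8")
    assert len(s) <= len(raw)   # code points <= bytes
    assert high_bits_only(s)    # no ASCII look-alikes inside a sequence
    # So byte-oriented splitting on '/' agrees with character-oriented splitting.
    assert raw.split(b"/") == [part.encode("utf-8") for part in s.split("/")]
    print(f"{len(s):2d} code points, {len(raw):2d} bytes")
```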

	Jeff


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:26                 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
@ 2004-02-16 19:48                   ` John Bradford
  2004-02-16 19:48                     ` Linus Torvalds
  2004-02-16 20:16                     ` Marc Lehmann
  0 siblings, 2 replies; 120+ messages in thread
From: John Bradford @ 2004-02-16 19:48 UTC (permalink / raw)
  To: Jeff Garzik, Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Quote from Jeff Garzik <jgarzik@pobox.com>:
> Linus Torvalds wrote:
> > In short: filenames are byte streams. Nothing more. They don't even have a 
> > "character set". They literally are just a series of bytes.
> > 
> > And when I say that you have to talk to the kernel using UTF-8, I'm only 
> > claiming that it is the only sane way to encode extended characters in a 
> > byte stream. Nothing more.
> 
> 
> Nod.  Maybe it helps Marc to point out the key difference between 
> characters and bytes, in UTF8.
> 
> In UTF8, the number of characters in a string is less-than-or-equal-to 
> the number of bytes in the string.
> 
> And the kernel just cares about bytes.
> 
> This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
> designed such that ten-year-old C code using standard C strings would 
> function just fine.  No need to rip up large swaths of your code just to 
> call multi-byte versions of the standard string functions.  Most code 
> that doesn't deal with locale-specific details like uppercase/lowercase 
> Just Works(tm).

The real problem is with mis-configured userspaces, where buggy UTF-8
decoders are trying to make sense of data in legacy encodings
containing essentially random bytes > 127, which are not part of valid
UTF-8 sequences.

None of this is a real problem, if everything is set up correctly and
bug free.  Unfortunately the Just Works thing falls apart in the,
(frequent), instances that it's not :-(.

John.

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                   ` John Bradford
@ 2004-02-16 19:48                     ` Linus Torvalds
  2004-02-16 20:20                       ` Marc Lehmann
  2004-02-16 20:21                       ` bert hubert
  2004-02-16 20:16                     ` Marc Lehmann
  1 sibling, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 19:48 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, John Bradford wrote:
> 
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
> 
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you 
really have to. Always leave the string in utf-8 "raw bytestring" mode as 
long as possible, and convert to character sets only when actually 
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being erroneous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
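
A rough Python sketch of that approach follows. The helper name
display_name and the sample listing are hypothetical; the listing stands in
for what a bytes-mode directory read would return. The raw bytes stay the
identity of the file, and decoding happens only to produce something
printable.

```python
def display_name(raw: bytes) -> str:
    """Decode a raw filename for display only.

    Bytes that are not valid UTF-8 are shown as U+FFFD instead of raising.
    """
    return raw.decode("utf-8", errors="replace")


# Hypothetical directory listing, kept as raw bytes. The last entry holds a
# Latin-1 byte (0xE9) that is not valid UTF-8 on its own.
listing = [
    b"report.txt",
    "gr\u00fc\u00dfe.txt".encode("utf-8"),
    b"caf\xe9.txt",
]

# Map the raw name (what gets passed back to open/unlink/rename) to the
# string shown to the user. The raw bytes are never modified or "fixed up".
shown = {raw: display_name(raw) for raw in listing}

for raw, text in shown.items():
    flag = " (invalid UTF-8)" if "\ufffd" in text else ""
    print(f"{text}{flag}")
```

In Python itself, os.listdir() returns bytes names when given a bytes
path, so the raw name can be handed straight back to the file APIs.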

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 18:49               ` Linus Torvalds
  2004-02-16 19:26                 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
@ 2004-02-16 20:03                 ` Marc Lehmann
  2004-02-16 20:23                   ` Linus Torvalds
  2004-02-17  1:24                   ` Alex Belits
  1 sibling, 2 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Linux kernel

On Mon, Feb 16, 2004 at 10:49:48AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > The problem is that the kernel does not use UTF-8, i.e. applications in
> > the current linux model have to deal with the fact that the kernel
> > happily breaks the assumed protocol of using UTF-8 by delivering illegal
> > byte sequences to applications.
> 
> You didn't read what I said.

I read it.

> READ MY POSTING. You even quoted it, but you didn't understand it.

You were able to explain it clearly enough for me, I think.

> I'm saying that "the kernel talks bytestreams".

And I am saying that this is not good, which is my sole point.

> I have never claimed that the kernel really talks UTF-8, and indeed, I 
> would say that such a kernel would be terminally and horribly broken. 

And I'd say such a kernel would be highly useful, as it would standardize
the encoding of filenames, just as unix standardizes on "mostly ascii"
(i.e. the SuS).

However, just as POSIX is a nice but very limited base, (mostly) ASCII is a nice
and very limited base. UTF-8 would also be a good base.

8-bit bytes as filenames is not a good base, however, since they enforce
a different layer of interpretation between the user and the kernel, and
this interpretation cannot be based on the locale nor the filesystem
itself, as there is no way to find out what encoding the filename is in.

8-bit bytes are convenient, but not useful for i18n environments. In the
past, it was also convenient and nobody cared, since everything was
either 8-bit or double-byte, and nobody exchanged files.

This, however, is going to change, and the current methodology of "just
guess, you might be right" is a hindrance to this.

> The kernel is _agnostic_ in what it does.

No, it's not. If at all, the kernel specifies a specially-interpreted
(ascii sans / and \0) byte-stream, as you say yourself.

However, just as with URLs (which are byte-streams, too), byte-streams are
useless to store text. You need bytestreams + known encoding.

If filenames were not names, but just binary id's I would agree, but
this is not at all how filenames are used nor how their use is applied.

Filenames are composed of text, but the kernel gives no indication
on how to interpret this text, and as a matter of fact, nothing else
gives this indication. glib etc. uses G_BROKEN_FILENAMES to force
locale-encoding. But as others have said, one man's locale is unlike another
man's locale.

> really care AT ALL what you feed it, as long as it is a byte-stream.
> 
> Now, that implies that if you want to have extended characters, then YOU 
> HAVE TO USE UTF-8.

You say so, but there is no logical connection between these two
statements. I can store latin1 easily in a bytestream, as I can store
iso-2022-jp or euc-jp. But they are incompatible with UTF-8.

You are yelling at me for no good reason. "YOU HAVE TO USE UTF-8". Why
should this be? The kernel certainly doesn't enforce this. Even you claim that
I don't have to, as the kernel doesn't care.

However, if you think so violently that it has to be UTF-8 that you even
yell it, then why doesn't the kernel comply with this rule? Why should an
application "HAVE TO" use utf-8 for input when the kernel doesn't even
try to comply and hands out illegal output?

This is just like mmap sometimes returning a page number and sometimes a
byte address... this would also not be useful unless you know the unit
that mmap returns (addresses in multiples of 1).

> That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. 

But you are saying that you have to feed UTF-8 into the kernel, which is
not the case either. I certainly don't have to..., and, what's worse,
you haven't given any indication of why one has to. Just because you say
so? Or is there actually a reason? If there is a reason, why doesn't the
kernel, in return, also follow this reasoning?

> kernel doesn't care one way or the other. As far as the kernel is 

It doesn't. But the point is that it should. If the kernel would do
everything we want it to do there would be no point in enhancing it.

> concerned, you could uuencode all the stuff, and the kernel wouldn't 

(Yes, because uuencode has the peculiar property of neither generating \0
nor /. It doesn't work in general with byte-streams).

> think you're crazy. The kernel _only_ cares about byte streams.

The kernel _interprets_ the byte-stream already. And some byte-streams are
no valid filenames _already_.

> > There is no way for applications to handle UTF-8 and illegal-utf8 in
> > a sane way, so most apps will either eat the illegal bytes, skip the
> > filename, or crash (the latter case is clearly a bug in the app, the
> > former cases aren't).
> 
> What you're complaining about are bad user applications. It has _zero_ to 
> do with the kernel.

Could you elaborate on why these apps are bad? What I am interested in
is how to fix them, since there is simply no way to interpret the names
returned by the kernel as long as the corresponding meta-information
is missing.

Consider an OS that allows different characters for path separators (unix
only allows '/'). Without the knowledge of the path separator it would be
impossible to interpret paths. And without knowledge about encoding it's
impossible (but slightly less dramatic) to correctly interpret filenames.

> > Fixing the VFS to actually enforce what Linus claims ("filenames are
> > utf-8") is a very good idea, imho.
> 
> No. Read my claim again. You obviously do not understand it AT ALL. 

...

> What you suggest would be a horribly idiotic and bad idea.

Why?

> The kernel doesn't set policy. The kernel says "this is what I can do,
> you set policy".

Exactly. The kernel could specify the API to use UTF-8. This is not more
policy than it currently enforces.

Or do you suggest that the ability to change the source and replace all
occurrences of '/' by '\\' means that '/' is not enforced as policy on
path separators?

We basically seem to disagree on what, exactly, policy is. Policy (to
me) is something that differentiates between several incompatible
alternatives. Choosing policy means to rule out other (useful)
alternatives.

One could argue that '/' is a policy because it precludes the '/'
character from being used in filenames, something some filesystems or
operating systems support.

I'd say (probably as much as you) that this policy is not a real policy,
it's just idiotic.

But enforcing other restrictions on filenames should magically be real
policy? This is obviously not idiotic at all, and should be carefully
explored.

> And UTF-8 just happens to be the only sane policy for encoding complex 
> characters into a byte stream. But it is not the only policy.

Just as '/' is not the only possible path separator. If that is your
point, you should explain why enforcing this is ok while supporting utf-8
(not enforcing, just supporting, meaning having the ability to rule out
non-utf-8 sequences when the admin wants this) is not.

> Another sane policy is to say "byte streams are latin1". It's not an
> acceptable policy for encoding _complex_ characters, but it is a policy.
> And it's a perfectly sane one.

I agree that it is sane. But it is not very useful for the future, as
people who want Russian filenames are plainly unable to use the other
filenames in a sensible way. There is no way to know the encoding.

> In short: filenames are byte streams. Nothing more.

Right now, they aren't. Not all sequences of bytes are valid filenames
already, and I think this is perfectly o.k.

> And when I say that you have to talk to the kernel using UTF-8, I'm only 
> claiming that it is the only sane way to encode extended characters in a 
> byte stream. Nothing more.

And I fully agree.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                   ` John Bradford
  2004-02-16 19:48                     ` Linus Torvalds
@ 2004-02-16 20:16                     ` Marc Lehmann
  2004-02-16 20:20                       ` Jeff Garzik
                                         ` (3 more replies)
  1 sibling, 4 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:16 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Linus Torvalds, viro, Linux kernel

On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> Quote from Jeff Garzik <jgarzik@pobox.com>:
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

And this is the whole point.

BTW, to people trying to explain some properties of UTF-8 to me. I don't
think ad-hominem attacks like assuming that I don't understand UTF-8
(without any indication that this is so) are useful.

The point here is that the kernel does, in a very narrow interpretation,
not support the use of UTF-8, because proper support of UTF-8 means that
no illegal byte sequences will be produced.

Of course, I can feed the kernel UTF-8, and if everybody does that, it
will generally work quite fine. However, Windows surely works fine if
every program only feeds allowed values into system calls. And even unix
dialects without memory protection work, as long as everybody plays
fair.

The point is, however, that this is highly undesirable, and it would be
nice to have a kernel that would (optionally) fully support a UTF-8
environment in which applications can feed UTF-8 and _expect_ UTF-8 in
return, which _is_ a security issue.

It's very desirable to have a kernel that actively supports this. It is
clearly not _required_, of course. But then again, process abstraction
is also not required...

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                     ` Linus Torvalds
@ 2004-02-16 20:20                       ` Marc Lehmann
  2004-02-16 20:26                         ` Linus Torvalds
  2004-02-18  2:49                         ` Rob Landley
  2004-02-16 20:21                       ` bert hubert
  1 sibling, 2 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> works on the raw byte sequence and isn't confused). Basically accept the
> fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
well-defined and is always proper UTF-8. It's a tautology.

The very idea of "UTF-8 with garbage in it" doesn't make sense.

> And no, I'm not claiming that it's wonderfully clean and that we should
> all love it.

It's also a totally useless idiom...

> And it largely works today.
> 		Linus

On ascii-only-systems, it works fine. My system is largely ascii-only,
with only very few filenames (Japanese and German ones mostly) in
UTF-8. Sometimes in EUC-JP, but that's a bug in rar.

It also works fine in single-user environments where the user just forces
everything to be in her locale. It does fail miserably on multi-user
systems. It does fail miserably in ISO-C's locale model. It does fail
miserably with gnu shellutils, fileutils and most other apps.

It fails, because it's not at all well supported by the kernel.

Claiming that it largely works today is simply not true for most
non-ascii-users (which increasingly includes the US).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                     ` Marc Lehmann
@ 2004-02-16 20:20                       ` Jeff Garzik
  2004-02-16 21:10                       ` viro
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Jeff Garzik @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Linus Torvalds, viro, Linux kernel

Marc Lehmann wrote:
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

Incorrect.  Byte stream transports need not care about their contents.

The only places that need to care about illegal UTF8 byte sequences are 
things like CONFIG_NLS_UTF8.

	Jeff




* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                     ` Linus Torvalds
  2004-02-16 20:20                       ` Marc Lehmann
@ 2004-02-16 20:21                       ` bert hubert
  2004-02-16 20:33                         ` Marc Lehmann
  2004-02-18  2:58                         ` H. Peter Anvin
  1 sibling, 2 replies; 120+ messages in thread
From: bert hubert @ 2004-02-16 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Bradford, Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds wrote:

> The way to handle that is to aim to never _ever_ decode utf-8 unless you 
> really have to. Always leave the string in utf-8 "raw bytestring" mode as 
> long as possible, and convert to character sets only when actually 
> printing.

Additional good news is that following octets in a utf-8 character sequence
always have the highest order bit set, precluding / or \x0 from appearing,
confusing the kernel.
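
A quick way to check that property, sketched in Python under the assumption
that its built-in encoder follows the standard UTF-8 definition: every byte
of every multi-byte sequence is 0x80 or above, so 0x2F ('/') and 0x00 can
only come from the characters '/' and NUL themselves.

```python
def is_surrogate(code_point: int) -> bool:
    """Surrogates (U+D800..U+DFFF) are not characters and cannot be encoded."""
    return 0xD800 <= code_point <= 0xDFFF


def all_bytes_high(code_point: int) -> bool:
    """True if every byte of the UTF-8 encoding has its top bit set."""
    return min(chr(code_point).encode("utf-8")) >= 0x80


# Walk every encodable code point outside ASCII and confirm that none of its
# UTF-8 bytes falls in the ASCII range (which includes '/' and NUL).
multibyte_is_safe = all(
    all_bytes_high(cp)
    for cp in range(0x80, 0x110000)
    if not is_surrogate(cp)
)

print(multibyte_is_safe)
```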

The remaining zit is that all these represent '..':
2E 2E
C0 AE C0 AE
E0 80 AE E0 80 AE 
F0 80 80 AE F0 80 80 AE 
F8 80 80 80 AE F8 80 80 80 AE 
FC 80 80 80 80 AE FC 80 80 80 80 AE

This in itself is not a problem, the kernel will only recognize 2E 2E as the
real .., but it does show that 'document.doc' might be encoded in a myriad
ways.

So some guidance about using only the simplest possible encoding might be
sensible, if we don't want the kernel to know about utf-8.

> And it largely works today.

Indeed.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
@ 2004-02-16 20:23                   ` Linus Torvalds
  2004-02-16 20:58                     ` Marc Lehmann
  2004-02-16 22:26                     ` Jamie Lokier
  2004-02-17  1:24                   ` Alex Belits
  1 sibling, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 20:23 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
> 
> > I'm saying that "the kernel talks bytestreams".
> 
> And I am saying that this is not good, which is my sole point.

Fair enough. 

However, that's where the unix philosophy comes in. The unix philosophy 
has always been to not try to understand the data that the user passes 
around - and that "everything is a bytestream" is very much encoded in the 
basic principles of how unix should work.

That agnosticism has a lot of advantages. It literally means that the
basic operating system doesn't set arbitrary limitations, which means that
you can do things that you couldn't necessarily otherwise easily do.

It does mean that you can do "strange" things too, and it does mean that 
user space basically has a lot of choice in how to interpret those byte 
streams.

And yes, it can cause confusion. You don't like the confusion, so you 
argue that it shouldn't be allowed. It's a valid argument, but it's an 
argument that assumes that choice is bad.

If you want to _force_ everybody to use UTF-8, then yes, the kernel could 
enforce that readdir() would never pass through a broken UTF-8 string, and 
all the path lookup functions also would never accept a broken string. It's 
not technically impossible to do, although it would add a certain amount 
of pain and overhead.

But the thing is, not everyone uses UTF-8. The big distributions have only 
recently started moving to UTF-8, and it will take _years_ before UTF-8 is 
ubiquitous. And even then it might be the wrong thing to disallow clever 
people from doing clever things. Encoding other information in filenames 
might be proper for a number of applications.

> And I'd say such a kernel would be highly useful, as it would standardize
> the encoding of filenames, just as unix standardizes on "mostly ascii"
> (i.e. the SuS).

It would also be very painful, since it would mean that when you mount an 
old disk, you may be totally unable to read the files, because they have 
filenames that such a kernel would never accept.

> > The kernel is _agnostic_ in what it does.
> 
> No, it's not. If at all, the kernel specifies a specially-interpreted
> (ascii sans / and \0) byte-stream, as you say yourself.
> 
> However, just as with URLs (which are byte-streams, too), byte-streams are
> useless to store text. You need bytestreams + known encoding.

You don't "need" a known encoding. The kernel clearly doesn't need one. 
It's a container, and the encoding comes from the outside. 

And that's what I mean by agnostic - you can make your own encoding. 

Most of the time (but not always) these days UTF-8 is the only sane 
encoding to use. But let people do what they want to do.

Choice is _inherently_ good. Trying to force a world-view is bad. You 
should be able to tell people what they should do to avoid confusion ("use 
UTF-8"), but you should not _force_ them to that if they have good reasons 
not to (and "backwards compatibility" is a better reason than just about 
anything else).

> But you are saying that you have to feed UTF-8 into the kernel, which is
> not the case either.

No. I'm saying that
 (a) "if you want to use complex character sets"
then 
 (b) "you really have to use UTF-8"
to talk to the kernel.

Note the two parts. You're hung up on (b), while I have tried to make it 
clear that (a) is a prerequisite for (b).

Not everybody cares about (a). There are still people who use extended 
ASCII, simply because they DO NOT CARE about complex character sets. And 
if they don't care, and (a) isn't true, then (b) has no meaning any more.

(In all fairness, some people will disagree with (b) even when (a) is true
and like things like UCS-2. Those people are crazy, but I guess I'd just
mention that possibility anyway).

And this is why I say that the kernel only cares about byte streams, and
having it filter to only accept proper UTF-8 sequences would be a horribly
bad idea. Because it _assumes_ (a). That's what "making policy" is all
about. The kernel should not assume that everybody cares about complex
character sets.

This may change, btw. I'm nothing if not pragmatic. In another twenty
years, maybe everybody _literally_ uses complex character sets, and this
whole discussion is totally silly, and the kernel may enforce UTF-8 or
Klingon or whatever. At some point assumptions become _so_ ingrained that
they are no longer policy any more, they are just "fact".

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                       ` Marc Lehmann
@ 2004-02-16 20:26                         ` Linus Torvalds
  2004-02-18  2:49                         ` Rob Landley
  1 sibling, 0 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 20:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel



On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
> 
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
> 
> The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
 (a) there is only one standard
 (b) people read it
 (c) people actually follow it and never have bugs

I've got news for you: none of the above is true. 

Which means that IN PRACTICE you will find strings that you think are 
UTF-8-encoded, but that don't end up being proper UTF-8.

That's the difference between real world and theory. 

And you can either write your programs to be "theoretically correct", or 
you can write them to "work".

It's your choice. I know which program I'd prefer to use.

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                       ` bert hubert
@ 2004-02-16 20:33                         ` Marc Lehmann
  2004-02-18  2:58                         ` H. Peter Anvin
  1 sibling, 0 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:33 UTC (permalink / raw)
  To: bert hubert; +Cc: linux-kernel

On Mon, Feb 16, 2004 at 09:21:42PM +0100, bert hubert <ahu@ds9a.nl> wrote:
> The remaining zit is that all these represent '..':

No, they don't. Read the UTF-8 definition...

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it can only be encoded in exactly one way *in UTF-8*. It can of course
be encoded differently in other encodings, but in UTF-8, there is only a
single representation. There are no ambiguities.
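
As a sketch of what that means in practice, the byte sequences listed
earlier in this thread can be run through a strict decoder. The example
below assumes Python's built-in UTF-8 decoder, which follows the current
definition and rejects overlong forms; the helper name is_valid_utf8 is
made up for illustration.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw decodes under a strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The candidate spellings of ".." quoted earlier in the thread.
candidates = [
    bytes.fromhex("2E 2E"),
    bytes.fromhex("C0 AE C0 AE"),
    bytes.fromhex("E0 80 AE E0 80 AE"),
    bytes.fromhex("F0 80 80 AE F0 80 80 AE"),
    bytes.fromhex("F8 80 80 80 AE F8 80 80 80 AE"),
    bytes.fromhex("FC 80 80 80 80 AE FC 80 80 80 80 AE"),
]

# Only the shortest form is accepted; every overlong form is rejected.
results = [is_valid_utf8(c) for c in candidates]
print(results)
```

A decoder that is not strict about overlong forms would accept the longer
sequences, which is where the trouble described above comes from.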

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

Fortunately, this has all already been taken care of, and is not a problem.

I mean, the _definition_ of UTF-8 works. Whether specific applications
(whether in the kernel or apps) work is a different question. But at
least the specification is rather clear.

Compare this to the URL definition, which only hints that you don't know
the encoding, and therefore, the interpretation as text, of a URL unless
you have an extra channel that communicates it.

While possible, this channel does not exist in practice, creating big
problems for people writing i18n-ized web applications.

The thing is that the kernel certainly _works_ on a very basic level, but
I think the situation can be improved by making it clear how to interpret
filenames, which currently is not the case.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:23                   ` Linus Torvalds
@ 2004-02-16 20:58                     ` Marc Lehmann
  2004-02-17 14:12                       ` Dave Kleikamp
  2004-02-16 22:26                     ` Jamie Lokier
  1 sibling, 1 reply; 120+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Linux kernel

On Mon, Feb 16, 2004 at 12:23:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > And I am saying that this is not good, which is my sole point.
> 
> Fair enough. 

Thank you (honestly), for acknowledging it.

> However, that's where the unix philosophy comes in. The unix philosophy 
> has always been to not try to understand the data that the user passes 
> around - and that "everything is a bytestream" is very much encoded in the 
> basic principles of how unix should work.

This never really applied to paths or filenames, although I admit that
this was the intention indeed, hindered by the need of having _some_
out-of-band data like '/'.

So yes, that's the principle.

> That agnosticism has a lot of advantages. It literally means that the
> basic operating system doesn't set arbitrary limitations, which means that
> you can do things that you couldn't necessarily otherwise easily do.

I think '/' is arbitrary, but of course there is a difference between
ruling out 2 bytes out of 256 as opposed to ruling out hundreds of
thousands of combinations.

However, UTF-8 was invented to transform unicode _text_ to filenames by
the inventors of unix.

Yes, this does not mean that they intended your typical kernel to
enforce this, but still this is what plan9 does, and it's a very useful
"limitation" because it removes the need for difficult-to-get-by-with
out-of-band data stating the interpretation of filenames.

Unlike file contents, filenames are always meant to represent text, not
binary data, which I hope you agree with.

> It does mean that you can do "strange" things too, and it does mean that 
> user space basically has a lot of choice in how to interpret those byte 
> streams.

Windows, for example, allows a lot more strange things like "x.y." to
mean the same as "x.y", or even worse things playing with "illegal
UTF-8".

Most, if not all, of these strange exceptions in windows resulted in
exploitable security flaws.

While this certainly is workable (in the sense that these all were
application bugs), I think it would be highly useful to be able to enforce
this by saying: "you can put whatever you like into filenames, as long as
it is encoded using UTF-8".

Just like the kernel currently enforces "it must be encoded using
octets", which is trivial, of course, but consider that UCS-2, while also
being a bytestream, is NOT a valid encoding for filenames in linux.
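
A small Python sketch of why a 16-bit encoding does not fit a byte-string
interface that reserves NUL and '/'. It uses the utf-16-le codec as a
stand-in for UCS-2 (for text inside the Basic Multilingual Plane the two
produce the same bytes); the sample strings are illustrative assumptions.

```python
text = "a/b"

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")   # same bytes as UCS-2 LE for BMP-only text

# Plain ASCII text: UTF-8 has no NUL bytes, the 16-bit form has one per
# character, so a NUL-terminated interface would cut the name short.
utf8_has_nul = b"\x00" in utf8
ucs2_has_nul = b"\x00" in ucs2

# A non-ASCII character whose 16-bit form happens to contain 0x2F ('/').
ch = "\u2f00"
ucs2_has_slash = 0x2F in ch.encode("utf-16-le")
utf8_has_slash = 0x2F in ch.encode("utf-8")

print(utf8_has_nul, ucs2_has_nul, ucs2_has_slash, utf8_has_slash)
```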

So the kernel is certainly not agnostic versus encodings.

It is, however, agnostic to multibyte encodings as used in e.g. ISO-C.
It does not support wide character encodings.

This is not bad at all, but I claim that the kernel simply isn't as
agnostic as we all might wish, and if there is a specification on
well-formed encodings, it could just as well be "UTF-8".

(Right now it is "most 8-bit encodings, all multibyte encodings but no
multiword encodings with embedded nulls or '/'", which is simply too
unsharp for me to be acceptable. For me, this is clearly a subjective
opinion).

> And yes, it can cause confusion. You don't like the confusion, so you 
> argue that it shouldn't be allowed. It's a valid argument, but it's an 
> argument that assumes that choice is bad.

Actually, I called for an optional way of enforcing a very promising
encoding over a lot of other encodings.

The kernel simply supports only a subset of possible encodings, and I asked
for the ability to limit this choice, as an option, to the one encoding
that even _you_ call the only sane one.

> If you want to _force_ everybody to use UTF-8, then yes, the kernel could 

I don't want to do that. I still happen to find the rare iso-8859-1
filename on my disk, and almost puke every time because it breaks bash
etc. I even modified my terminal emulator (rxvt) to support the
occasional German umlaut because it's so convenient.

But all in all, it would be very much easier if this simply couldn't
happen. Especially because it's easy to get from "illegal utf-8 allowed"
to "security issue in formerly secure code".
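
As an illustration of how "illegal utf-8 allowed" can turn into a security
issue, consider the overlong two-byte form of '/'. This is a minimal sketch
in Python 3; is_well_formed_utf8 is a hypothetical helper that relies on
Python's strict UTF-8 decoder, and it stands in for whatever check an
enforcing kernel or application would perform.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xC0 0xAF is an overlong (and therefore illegal) encoding of '/'.
# It contains no 0x2F byte, so a byte-level check for '/' will not see it,
# but a lenient decoder further down the line may turn it into '/'.
overlong_slash = b"..\xc0\xaf..\xc0\xafetc"

print(b"/" in overlong_slash)               # byte-level check finds no '/'
print(is_well_formed_utf8(overlong_slash))  # strict validation rejects it
```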

> enforce that readdir() would never pass through a broken UTF-8 string, and 
> all the path lookup functions also would never accept a broken string. It's
> not technically impossible to do, although it would add a certain amount 
> of pain and overhead.

I can certainly yield to arguments like "if you want it, send a patch good
enough to be accepted" :)

> But the thing is, not everyone uses UTF-8.

I don't want to enforce it, exactly because, as you say, it'll
take years. I do think that a way to enforce this as local (or
distribution-wise) policy, would be very helpful indeed.

I don't mind if some embedded system doesn't want the overhead of having a
UTF-8 environment that simply isn't useful for it.

> people from doing clever things. Encoding other information in filenames 
> might be proper for a number of applications.

But it will simplify a lot of applications that otherwise have no idea on
filename encoding (it's a pain in the ass in perl, which has generally
good (unicode-)text support, but still filenames, which are text, simply
come out garbled most of the time).

It's rather similar to the discussion about extended attributes (ok, it's
a hornets' nest, or at least was one :). Having a clear API definition would
certainly help.

It's no policy either, as applications wanting to encode e.g. binary
information can certainly do so using UTF-8, just as they currently can do
so by other means, e.g. escaping \0 and /.

> It would also be very painful, since it would mean that when you mount an 
> old disk, you may be totally unable to read the files, because they have 
> filenames that such a kernel would never accept.

That's why I asked for a mount option. I might be too extreme for your
taste in my opinions, but I am still living on earth. I, too, have these
disks.

However, I also have some windows disks, and umlauts or japanese
characters on these often result in chaos, which is not really a bug in
linux either, but highly annoying to people trying to use linux.

> You don't "need" a known encoding. The kernel clearly doesn't need one. 
> It's a container, and the encoding comes from the outside. 

Right now the kernel does need encoding for _some_ byte values. I can't
use UCS-2, it simply isn't agnostic.

Please, I don't want to use UCS2, UTF-16, or other such atrocities :) I
am just trying to make the point that the kernel already enforces some
encodings.

> And that's what I mean by agnostic - you can make your own encoding. 

Given the limitations of the kernel in interpreting byte streams, I can.
However, only under the existing constraints.

I'd like to have a more forcing constraint - UTF-8. It might never be
implemented (especially not against your will), but I think it isn't more
idiotic than what the kernel currently does.

Again, I find the "restrictions" unix has with respect to filenames
very sane and useful, and if at all, it's usually a problem that
filenames can contain unusual characters like \n.

> Choice is _inherently_ good. Trying to force a world-view is bad. You 

Yes! I want the choice of having a kernel supporting UTF-8 (again, meaning
active support, i.e. not accepting malformed utf-8) :)

> UTF-8"), but you should not _force_ them to that if they have good reasons 
> not to (and "backwards compatibility" is a better reason than just about 
> anything else).

Well, being able (having the choice) to force is good. Just like having
nosuid mount options is good, because it allows admins the choice.

> No. I'm saying that
>  (a) "if you want to use complex character sets"
> then 
>  (b) "you really have to use UTF-8"
> to talk to the kernel.
> 
> Note the two parts. You're hung up on (b), while I have tried to make it 
> clear that (a) is a prerequisite for (b).

Ok, this is true.

> (In all fairness, some people will disagree with (b) even when (a) is true
> and like things like UCS-2. Those people are crazy, but I guess I'd just
> mention that possibility anyway).

You have to admit, however, that apart from UCS-2 being obvious insanity
as opposed to UTF-16 or some 32 encoding, that some people have a point
in asking for these.

It is IMHO not useful to support this simply because the posix API cannot be
made to deal with it. But I wouldn't rule out encodings like UTF-32 as
simply being crazy.

> This may change, btw. I'm nothing if not pragmatic. In another twenty
> years, maybe everybody _literally_ uses complex character sets, and this
> whole discussion is totally silly, and the kernel may enforce UTF-8 or
> Klingon or whatever. At some point assumptions become _so_ ingrained that
> they are no longer policy any more, they are just "fact".

True. Thanks a lot for explaining your arguments in this detail. In
fact, I can accept most if not all of your arguments, but I still think
it would be nice to have this extra functionality.

Arguments like "it's a pain to implement" (which I don't think it is, but
you are clearly better in judging that!), weigh even more to me.

So even if I think it's a good idea, it might never be implemented for
purely practical reasons.

(end of discussion, I think, for me at least)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                     ` Marc Lehmann
  2004-02-16 20:20                       ` Jeff Garzik
@ 2004-02-16 21:10                       ` viro
  2004-02-17  7:18                       ` jw schultz
  2004-02-17  7:42                       ` Nick Piggin
  3 siblings, 0 replies; 120+ messages in thread
From: viro @ 2004-02-16 21:10 UTC (permalink / raw)
  To: John Bradford, Jeff Garzik, Linus Torvalds, Linux kernel

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8
> environment in where applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.
> 
> It's very desirable to have a kernel that actively supports this. IT is
> clearly not _required_, of course. But then again, process abstraction
> is also not required...

Mind taking the demagogy elsewhere?  Note that the same handwaving applies
to e.g. file contents.  Care to explain what makes read() and write()
different in that respect?

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: stty utf8
  2004-02-16 15:05             ` stty utf8 Jamie Lokier
  2004-02-16 16:10               ` Gerd Knorr
@ 2004-02-16 22:03               ` Jamie Lokier
  2004-02-16 22:17                 ` Linus Torvalds
  2004-02-16 22:04               ` Jamie Lokier
  2 siblings, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-16 22:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux kernel

Jamie Lokier wrote:
> Why not create "stty utf8" so that non-UTF-8 terminals and UTF-8
> terminals alike can work with a Linux convention that all programs
> enter and display UTF-8?  It would simplify a lot of things.

A little thought and an experiment later, and I discovered:

When you edit a line with the kernel's terminal line editor, when you
press the Delete key, it writes backspace-space-backspace and removes
one byte from the input.  That fails to do the right thing on UTF-8
terminals.

For example, in a UTF-8 xterm or Gnome terminal, or even on the Linux
console after running "unicode_start", run the command "cat" by
itself, then type "ééé", then hit DEL twice - it will show one
accented letter(*).  Press enter, and cat will echo the line
containing _two_ letters.

There is no fancy environment setting which corrects this problem.
The kernel needs to know it's dealing with a UTF-8 terminal for basic
line editing to work.

(*) The text in quotes is three E WITH ACUTE letters, in case that
doesn't show properly in your mailer.
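
The arithmetic behind that experiment can be replayed in a few lines. This
is only a model in Python 3 of the behaviour described above (one byte
removed and one screen column erased per DEL); erase_one_byte is a made-up
name, not a kernel function.

```python
def erase_one_byte(buf: bytes) -> bytes:
    """Model of byte-wise erase: drop the last byte of the input buffer."""
    return buf[:-1]


typed = "\u00e9\u00e9\u00e9"          # three E WITH ACUTE letters
line = typed.encode("utf-8")          # each letter is two bytes: 6 in total
columns = len(typed)                  # the terminal shows three columns

for _ in range(2):                    # DEL pressed twice
    line = erase_one_byte(line)
    columns -= 1                      # backspace-space-backspace

print(columns)                        # columns still shown on screen
print(len(line))                      # bytes still in the input buffer
print(line.decode("utf-8"))           # what cat gets back after Enter
```

Two DELs remove two bytes, which is exactly one letter, so the buffer still
holds two letters while the screen shows one.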

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: stty utf8
  2004-02-16 15:05             ` stty utf8 Jamie Lokier
  2004-02-16 16:10               ` Gerd Knorr
  2004-02-16 22:03               ` Jamie Lokier
@ 2004-02-16 22:04               ` Jamie Lokier
  2 siblings, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-16 22:04 UTC (permalink / raw)
  To: Linux kernel

Jamie Lokier wrote:
>      perl -e 'for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }'
> 
>    (NB: The prefix string is N WITH CEDILLA followed by "i-").
>    (Hint: it mangles perfectly fine non-ASCII file names).
>
>    Perl has no perfect behaviour to offer, because what should that
>    behaviour be if readdir() might return a non-UTF-8 byte sequence
>    as a name?

I've had someone point out that the perl script mangles non-UTF-8
filenames, and there is no correct behaviour for that case.

In fact the _real_ bug is that it mangles perfectly fine UTF-8 filenames.

It's a Perl quirk, but the behaviour is like that for compatibility
with non-UTF-8 filesystems.  I wanted to show how just using UTF-8 for
filenames isn't _yet_ as straightforward as it should be.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: stty utf8
  2004-02-16 22:03               ` Jamie Lokier
@ 2004-02-16 22:17                 ` Linus Torvalds
  0 siblings, 0 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 22:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

On Mon, 16 Feb 2004, Jamie Lokier wrote:
> 
> A little thought and an experiment later, and I discovered:
> 
> When you edit a line with the kernel's terminal line editor, when you
> press the Delete key, it writes backspace-space-backspace and removes
> one byte from the input.  That fails to do the right thing on UTF-8
> terminals.

Yes. I looked at that a year ago, and it should be pretty easy to make the 
backspace code look more like the "delete word" code - except the "word" 
is just a utf character.

(Btw, that's one of the things I like about UTF-8, and shows how _well_ 
designed it is - it's trivial to find the beginning of a UTF-8 character, 
even when just doing a stupid scan backwards).
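
The backwards scan can be sketched as follows. This is a Python 3
illustration only, not the tty code; erase_last_char is a hypothetical name.
It relies on the fact that every UTF-8 continuation byte has the bit pattern
10xxxxxx, while ASCII bytes and lead bytes never do.

```python
def erase_last_char(buf: bytes) -> bytes:
    """Drop the last UTF-8 character (not just the last byte) from buf."""
    if not buf:
        return buf
    i = len(buf) - 1
    # Step back over continuation bytes (10xxxxxx) until we reach the
    # byte that starts the character.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:i]


line = "\u00e9\u00e9\u00e9".encode("utf-8")   # three two-byte letters
line = erase_last_char(erase_last_char(line))
print(line.decode("utf-8"))
```

The loop never has to decode anything or know how long the character is; it
only looks at the top two bits of each byte.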

I didn't care enough to really bother fixing it - the fact is, that people 
who care about UTF-8 tend to have to be in graphics mode anyway, and there 
is something to be said for keeping the text console simple even if it 
means it lacks functionality.

But if somebody cares more than I do (hint, hint ;), I do think it should 
be fixed.

> There is no fancy environment setting which corrects this problem.
> The kernel needs to know it's dealing with a UTF-8 terminal for basic
> line editing to work.

Yes. And I'd happily take patches for it.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:23                   ` Linus Torvalds
  2004-02-16 20:58                     ` Marc Lehmann
@ 2004-02-16 22:26                     ` Jamie Lokier
  2004-02-16 22:40                       ` Linus Torvalds
  1 sibling, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-16 22:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> It would also be very painful, since it would mean that when you mount an 
> old disk, you may be totally unable to read the files, because they have 
> filenames that such a kernel would never accept.

Alas, once userspace has migrated to doing everything in UTF-8, you
won't be able to read those files because userspace will barf on them.

Then you'll be glad to have a mount option which converts iso-8859-1
to UTF-8 :)  (Even if the old disk is actually not iso-8859-1, at least
you'll be able to read its mangled filenames, rather than userspace
tripping over them).
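
What such a conversion would do to a name can be sketched in user space.
This is a Python 3 illustration under the assumption that the option simply
treats every byte as an iso-8859-1 character; latin1_name_to_utf8 is a
made-up name and no existing mount option is implied.

```python
def latin1_name_to_utf8(raw: bytes) -> bytes:
    """Reinterpret raw as iso-8859-1 and re-encode the result as UTF-8."""
    # Every byte value is a valid iso-8859-1 character, so this never fails.
    return raw.decode("latin-1").encode("utf-8")


old_name = b"\xe5\xe4\xf6"                 # "aao" with ring/diaeresis in latin1
print(latin1_name_to_utf8(old_name))

# A name that was never latin1 still converts; it comes out mangled but
# well-formed, and ASCII bytes such as '/' map to themselves.
every_byte = bytes(range(1, 256))
converted = latin1_name_to_utf8(every_byte)
converted.decode("utf-8")                  # does not raise
```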

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:26                     ` Jamie Lokier
@ 2004-02-16 22:40                       ` Linus Torvalds
  2004-02-16 22:52                         ` Linus Torvalds
  2004-02-17  7:14                         ` Lehmann 
  0 siblings, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 22:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, Jamie Lokier wrote:
> 
> Alas, once userspace has migrated to doing everything in UTF-8, you
> won't be able to read those files because userspace will barf on them.

Nope. Read my other email. Done right, user space will _not_ barf on them, 
because it won't try to "normalize" any UTF-8 strings. If the string has 
garbage in it, user space should just pass the garbage through.
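
One way a user-space program can pass the garbage through while still
working with text strings is sketched below in Python 3, using the
"surrogateescape" error handler of the standard codecs. The helper names
to_text and to_bytes are made up for illustration; this is one possible
approach, not what any particular tool does.

```python
def to_text(raw: bytes) -> str:
    """Decode a filename, keeping undecodable bytes as escape code points."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: restore the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


latin1_name = b"\xe5\xe4\xf6"             # not valid UTF-8
text = to_text(latin1_name)               # usable as a str inside the program
renamed = to_bytes("backup-" + text)      # what would be handed to rename()

print(renamed)
```

The program never normalizes or rejects the name; the bytes it hands back to
the kernel are the bytes it received, plus the prefix.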

We've had this _exact_ issue before. Long before people worried about
UTF-8, people worried about the fact that programs like "ls" shouldn't
print out the extended ASCII characters as-is, because that would cause
bad problems on a terminal as they'd be seen as terminal control
characters.

Does that mean that unix tools like "rm" cannot remove those files? Hell 
no! It just means that when you do "rm -i *", the filename that is printed 
may not have special characters in it that you don't see.

Same goes for UTF-8. A "broken" UTF-8 string (ie something that isn't 
really UTF-8 at all, but just extended ASCII) won't _print_ right, but 
that doesn't mean that the tools won't work. You'll still be able to edit 
the file.

Try it with a regular C locale. Do a simple

	echo > åäö

(that's latin1), and do a "rm -i åäö", and see what it says. 

Right: it does the _right_ thing, and it prints out:

	torvalds@home:~> rm -i åäö
	rm: remove regular file `\345\344\366'? 

In other words, you have a program that doesn't understand a couple of the 
characters (because they don't make sense in its "locale"), but it still 
_works_. It just can't print them.

Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
program should do when it sees broken UTF-8. It can still access the file, 
it can still do everything else with it, but it can't print out the 
filename, and it should use some kind of escape sequence to show that 
fact.

The two cases are 100% equivalent. We've gone through this before. There 
is a bit of pain involved, but it's not something new, or something 
fundamentally impossible. It's very straightforward indeed.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:40                       ` Linus Torvalds
@ 2004-02-16 22:52                         ` Linus Torvalds
  2004-02-17 13:15                           ` Jamie Lokier
  2004-02-17  7:14                         ` Lehmann 
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-16 22:52 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, Linus Torvalds wrote:
> 
> Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> program should do when it sees broken UTF-8. It can still access the file, 
> it can still do everything else with it, but it can't print out the 
> filename, and it should use some kind of escape sequence to show that 
> fact.

Side note: a UTF-8 program needs to do escape handling _anyway_, because 
even if the filename is 100% UTF-8 compliant, you still can't print out 
all the characters as such. In particular, characters like '\n' etc are
obviously perfectly fine UTF-8, yet they need to be escaped when printing 
out filenames in a file selector.

So I claim (and yes, people are free to disagree with me) that a
well-written UTF-8 program won't even have any real extra code to handle
the "broken UTF-8" code. It's just another set of bytes that needs
escaping, and they need escaping for _exactly_ the same reason some 
regular utf-8 characters need escaping: because they can't be printed.

So it's all the same thing - it's just the reasons for "unprintability"  
that are slightly different.
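
A sketch of that "same code path" idea, in Python 3: one loop escapes both
control characters and bytes that are not valid UTF-8, using octal escapes in
the style of the rm output quoted earlier in the thread. display_name is a
hypothetical helper, and the choice of what counts as printable (Python's
str.isprintable) is an assumption.

```python
def display_name(raw: bytes) -> str:
    """Return a printable form of a filename, escaping what cannot be shown."""
    out = []
    # Undecodable bytes become code points U+DC80..U+DCFF.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\%03o" % (cp - 0xDC00))   # byte that is not UTF-8
        elif ch == "\\":
            out.append("\\\\")                     # keep the output unambiguous
        elif not ch.isprintable():
            out.extend("\\%03o" % b for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(display_name(b"\xe5\xe4\xf6"))      # latin1 bytes seen by a UTF-8 program
print(display_name(b"new\nline"))         # valid UTF-8, still needs escaping
```

The name on disk is untouched; only the string shown to the user changes.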

Now, I'll agree that getting the escaping right (whether for things like 
'\n' or for byte sequences that are invalid UTF-8) can be painful. I just 
don't think that the pain is in any way specific for "invalid UTF-8". It's 
just _hard_ to think of all the special cases, and most programs have bugs 
because somebody forgot something.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  2004-02-16 20:23                   ` Linus Torvalds
@ 2004-02-17  1:24                   ` Alex Belits
  2004-02-17 21:09                     ` Jamie Lokier
  1 sibling, 1 reply; 120+ messages in thread
From: Alex Belits @ 2004-02-17  1:24 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:

> > I have never claimed that the kernel really talks UTF-8, and indeed, I
> > would say that such a kernel would be terminally and horribly broken.
>
> And I'd say such a kernel would be highly useful, as it would standardize
> the encoding of filenames, just as unix standardizes on "mostly ascii"
> (i.e. the SuS).
>
> However, just as POSIX is a nice but very limited base, (mostly) ASCII
> is a nice and very limited base. UTF-8 would also be a good base.

  UTF-8 is dependent on Unicode, which is cumbersome, not user-expandable,
does not include an ability to reliably implement subsets of it, poorly
supports language identification/language-dependent processing, and is
controlled by a single organization with extremely poor ability to
incorporate changes into the standard when it becomes necessary. This
means, it's quite possible that this standard will be replaced by
something better in the future when multilingual documents will become
widely used -- right now they certainly are not, and this is why poor
design of Unicode is tolerated by users, and this is also why many people
use non-Unicode-based charsets. Enforcing UTF-8 will burn the bridges to
any other language support infrastructure or encoding, right at the time
when such infrastructure is likely to be created.

> 8-bit bytes as filenames is not a good base, however, since they enforce
> a different layer of interpretation between the user and the kernel, and
> this interpretation cannot be based on the locale nor the filesystem
> itself, as there is no way to find out what encoding the filename is in.

  This is a matter of GUI implementation. If someone cared about this, he
would store language metadata with filename, too, however this is clearly
contrary to the Unix filesystem design.

> 8-bit bytes is convenient, but not useful for i18n environments. in the
> past, it was also convenient and nobody cared, since everything was
> either 8-bit or double-byte, and nobody exchanged files.

  I did, and it worked _fine_. Everyone who is willing to use UTF-8 is
free to do this right now, and everything will already work great for
them. Writing software to deliberately enforce UTF-8 is something
completely different from using UTF-8 for yourself.

> This, however, is going to change, and the current methodology of "just
> guess, you might be right" is a hindrance to this.

  This was "going to change" for more than a decade already, or,
alternatively, already happened if you listen to someone like Martin
Duerst. The reality is, everything can pass UTF-8 already, yet people use
other encodings for everything, too, and as long as they don't break,
things work. Breaking byte-value transparency in any place in the system
is counterproductive until the moment when everyone uses UTF-8. But I hope
before that day comes, UTF-8 and Unicode will be replaced with something
more sane.

> > The kernel is _agnostic_ in what it does.
>
> No, it's not. If at all, the kernel specifies a specially-interpreted
> (ascii sans / and \0) byte-stream, as you say yourself.
>
> However, just as with URLs (which are byte-streams, too), byte-streams are
> useless to store text. You need bytestreams + known encoding.

  MIME has a perfectly usable standard for declaring encodings, and huge
amounts of text (that may include filenames) are distributed by
MIME-compliant or MIME-like protocols (mail and HTTP, to name two). This
keeps the metadata in the userspace, and it also happens that everything
that displays multilingual text is entirely in userspace. Why kernel is
supposed to mess with this, is beyond me.

> If filenames were not names, but just binary id's I would agree, but
> this is not at all how filenames are used not how their use is applied.
>
> Filenames are composed of text, but the kernel gives no indication
> on how to interpret this text, and as a matter of fact, nothing else
> gives this indication. glib etc. uses G_BROKEN_FILENAMES to force
> locale-encoding. But as others have said, one mans locale is unlike other
> mens locale.

  And this is perfectly fine. Displaying and editing multilingual text is
a user interface issue, that kernel should not be involved in. Dealing
with locales is not limited to displaying but includes input methods, and
in many cases text processing, hyphenation rules, phonetic matching, etc.
that are specific to locale. Nothing in UTF-8 helps with those things,
they still have to be done based on metadata that is entirely in
userspace.

  I admit that some of my time would be better spent if I did a design of
an expandable pluggable language support modules library that would
understand "strings with metadata" and support some kind of stateful
language/encoding serialization for this purpose. It doesn't look like the
multilingual documents are common enough to justify designing such a thing
yet, however if such system was developed, it would be a good example of
how expandability is at best orthogonal, and at worst poorly compatible
with enforcing Unicode. Obviously, all this will have to be done entirely
in userspace, and all that it will require from kernel is byte-value
transparency. Singling out filenames as something that has to use a
particular encoding, just because current king of the i18n mess is
Unicode, is at least nearsighted, at most inconsistent.

> > really care AT ALL what you feed it, as long as it is a byte-stream.
> >
> > Now, that implies that if you want to have extended characters, then YOU
> > HAVE TO USE UTF-8.
>
> You say so, but there is no logical connection between these two
> statements. I can store latin1 easily in a bytestream, as I can store
> iso-2022-jp or euc-jp. But they are incompatible to UTF-8.

  Latin1 is most likely to be stored in iso8859-1, with a single currently
widely used exception of MS Word and similar programs. I hope, it is
pretty clear that what those programs operate on, is not plain text, and
can not replace plain text any time soon.

> You are yelling at me for no good reason. "YOU HAVE TO USE UTF-8". Why
> should this be? The kernel certainly enforces this. Even you claim that
> I don't have to, as the kernel doesn't care.
>
> However, if you think so violently that it has to be UTF-8 that you even
> yell it, then why doesn't the kernel comply to this rule? Why should an
> application "HAVE TO" use utf-8 for input when the kernel doesn't even
> try to comply and hands out illegal output?

  If the program breaks on the malformed input, it's a bad program in the
first place. And if it makes assumptions about what is "malformed" based
on something that other users may disagree with, it's a problem of either
program's users or authors. OS kernel may be in a position where it can
step in and resolve this dispute in favor of either group, however it will
be highly inappropriate to do so based on some little discussion between
UTF-8 supporters and opponents that we can have here.

  I can point at the example of this "solution" that happened years ago
when UCS-2 was all the rage, and it got hardcoded and enforced by NTFS
and everything that handles it. Who is laughing about that decision now?

> This is just like mmap sometimes returning a page number and sometimes a
> byte address... this would also not be useful unless you know the unit
> that mmap returns (addresses in multiples of 1).

  Huh? From the user's point of view mmap always returns a pointer.
Implementation enforces page boundaries, and all applications should be
aware that the address given to it can not be assumed to be used for
anything meaningful, however every application relies on the fact that it
gets a pointer to a mapped address. If someone jumped in and demanded that
given address must be used, and mmap() should assume MAP_FIXED to be
always set, it would be similar to the demand that kernel should only
allow UTF-8 -- just in that case the brokenness would be more obvious.

> > That's what I'm saying. I am _not_ saying that the kernel uses UTF-8.
>
> But you are saying that you have to feed UTF-8 into the kernel, which is
> not the case either. I certainly don't have to..., and, what's worse,
> you haven't given any indication of why one has to. Just because you say
> so? Or is there actually a reason? If there is a reason, why doesn't the
> kernel, in return, also follow this reasoning?
>
> > kernel doesn't care one way or the other. As far as the kernel is
>
> It doesn't. But the point is that it should. If the kernel would do
> everything we want it to do there would be no point in enhancing it.

  Kernel certainly CAN enforce an encoding. It also can disallow filenames
that contain "Pu-239" substring, and there may be people who would argue
that it should be a good thing.

> > concerened, you could uuencode all the stuff, and the kernel wouldn't
>
> (Yes, because uuencode has the peculiar property of neither generating \0
> nor /. It doesn't work in general with byte-streams).
>
> > think you're crazy. The kernel _only_ cares about byte streams.
>
> The kernel _interprets_ the byte-stream already. And some byte-streams are
> no valid filenames _already_.

  This only affects two byte values, that happen to be universally
understood, and are never required by any charset to represent anything but
what they are in ASCII. UTF-8 is hardly unique in the way of handling
them, however it imposes huge number of restrictions that may conflict
with what people do or will likely do.

> > > There is no way for applications to handle UTF-8 and illegal-utf8 in
> > > a sane way, so most apps will either eat the illegal bytes, skip the
> > > filename, or crash (the latter case is clearly a bug in the app, the
> > > former cases aren't).
> >
> > What you're complaining about are bad user applications. It has _zero_ to
> > do with the kernel.
>
> Could you elaborate on why these apps are bad? What I am interested in
> is to know how to fix them? since there is simply no way to interpret
> the names returned by the kernel as the corresponding meta-information
> is missing.

  "Be liberal in what you accept and strict in what you produce". Most of
applications don't even have user interface, and therefore should not be
concerned with what characters look like in the first place.

> Consider an OS that allows different characters for path-separators (unix
> only allows '/'). Without the knowledge of the path separator it would be
> impossible to interpret paths.

  Show me the person who uses file names and doesn't know what the path
separator looks like.

> And without knowledge about encoding it's
> impossible (but slightly less dramatic) to correctly interpret filenames.

  This is a userspace issue. Some kernel MAY support this metadata,
however Unix filesystem is not designed that way. If someone is eager to
embed the encoding information into a filename, MIME allows that already,
it's just impractical.

> > The kernel doesn't set policy. The kernel says "this is what I can do,
> > you set policy".
>
> Exactly. The kernel could specify the API to use UTF-8. This is not more
> policy than it currently enforces.

  Says who? How would this policy, if enforced, improve anything, other
than supporting UTF-8 proponents that were unsuccessful in their attempts
to eradicate all other charsets and encodings despite a decade-old
religious war? They may want to recruit Linux kernel to support their
efforts, however no matter how anyone feels about this issue, why would
kernel design mess with this? What's next, preventing C compilers from
reading goto from the source files? Correcting "teh" and "yuo", and
refusing to send SMTP when the message contains "Me, too"? Have a flag that
prevents running vi or emacs depending on which of those editors the
sysadmin, or distro maker, hates with a passion? Aren't all those things
way too far in the area of "activism"?

>
> Or do you suggest that the ability to change the source and replace all
> occurrences of '/' by '\\' means that '/' is not enforced as policy on
> path separators?
>
> We basically seem to disagree on what, exactly, policy is. Policy (to
> me) is something that differentiates between several incompatible
> alternatives. Choosing policy means to rule out other (useful)
> alternatives.

  But again, who should be able to choose the rules? UTF-8 is not
universally accepted. It is a fact. One may wish that it was accepted,
someone else may wish that it was less ugly so it could be accepted, and
vast majority of people don't care, use whatever they have, and will be
pissed if it will get broken just because someone is on a crusade to tell
everyone, how they are supposed to write text.

> One could argue that '/' is a policy because it precludes the '/'
> character from being used in filenames, sth. some filenames or operating
> systems support.

  This is a universally accepted policy. All Unix users know that, it is
written in standards (real ones, like POSIX), and no one uses path
separator for anything else. UTF-8 is a standard that some people use, and
some other people want everyone to use, however it's not universally
accepted, and deliberately breaking others' stuff to force them into using
UTF-8 is, again, unwarranted activism beyond the scope of what kernel is
supposed to do.

> I'd say (probably as much as you) that this policy is not a real policy,
> it's just idiotic.

  It is in a standard, that no one who uses Linux (or any other unixlike
system) ever broken because it was there from the very beginning. UTF-8 is
something that was designed by a bunch of people who wanted to make it a
great gift to the world (or to profit from making more convoluted software
-- hard to tell at this point), and it _competes_ with the status quo.
Linux is not a new system that defines the rules, plan9 is. Plan9 is
developed as something that was supposed to "cure" Unix inconsistency and
lack of rules, so it got stuffed with personal idiosyncrasies of its
author. And the result is hardly a confirmation that those idiosyncrasies
resulted in a superior, or even just more practical system.

> But enforcing other restrictions on filenames should magically be real
> policy? This is obviously not idiotic at all, and should be carefully
> explored.
>
> > And UTF-8 just happens to be the only sane policy for encoding complex
> > characters into a byte stream. But it is not the only policy.
>
> Just as '/' is not the only possible path separator. If that is your
> point, you should explain why enforcing this is ok while supporting utf-8
> (not enforcing, just supporting, meaning having the ability to rule out
> non-utf-8 sequences when the admin wants this) is not.

  '/' is not the only possible separator; however, it's pretty clear that
no other separator is superior to it. Only two other characters were ever
used for this purpose, and both have been abandoned by pretty much everyone, so
it's unlikely that many people are going to suffer from the choice of a
slash. Ever.

  UTF-8 is one of an infinite number of possible solutions, and judging
by the rate of acceptance and by the rate at which Unicode representations
and charsets are being invented, it is very likely not the last
encoding that will be used by humans.

> > Another sane policy is to say "byte streams are latin1". It's not an
> > acceptable policy for encoding _complex_ characters, but it is a policy.
> > And it's a perfectly sane one.
>
> I agree that it is sane. But it is not very useful for the future, as
> people who want russian filenames are plainly unable to use the other
> filenames in a sensible way. There is no way to know the encoding.

  Russians sometimes use filenames in koi8-r, sometimes even in
windows-1251, and mostly in ASCII. Never heard any Russian complaining
about this -- and certainly never in my life have I seen a Russian
filename in UTF-8, nor do I look forward to it. And I am Russian.

> > In short: filenames are byte streams. Nothing more.
>
> Right now, they aren't. Not all sequences of bytes are valid filenames
> already, and I think this is perfectly o.k.

  Slippery slope.

> > And when I say that you have to talk to the kernel using UTF-8, I'm only
> > claiming that it is the only sane way to encode extended characters in a
> > byte stream. Nothing more.
>
> And i fully agree.
>
>

  I disagree -- not with the point that UTF-8 is kinda sane but with the
idea that UTF-8 should be enforced to make it the last encoding that
people will ever be able to use. It may be the "sanest" now for the limited
purpose of handling truly multilingual text, but only the practical use
of multilingual documents (which is extremely rare now) can answer the
question of what encoding(s) should be used for this purpose. I don't
think someone ten years from now would enjoy writing a wrapper that
will convert filenames from some "Expandable Multilingual Metacharset
2014" so they look like valid UTF-8 when they are written into some archaic
filesystem that doesn't even support Elvish and Klingon properly.

-- 
Alex

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:40                       ` Linus Torvalds
  2004-02-16 22:52                         ` Linus Torvalds
@ 2004-02-17  7:14                         ` Lehmann 
  2004-02-17 11:20                           ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
  2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
  1 sibling, 2 replies; 120+ messages in thread
From: Lehmann  @ 2004-02-17  7:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> Try it with a regular C locale. Do a simple
> 
> 	echo > åäö

Just for your info, though. You can't even input these characters in a C
locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
C functions will barf on this one). C is 7 bit only.

> Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> program should do when it sees broken UTF-8.

The problem is that the very common C language makes it a pain to use
this in i18n programs. Multibyte functions or iconv will not accept
these, so programs wanting to do what you are expecting to do need to
re-implement most if not all of the character handling of your typical
libc.

Yes, it's possible....

> The two cases are 100% equivalent. We've gone through this before. There 
> is a bit of pain involved, but it's not something new, or something 
> fundamentally impossible. It's very straightforward indeed.

The "bit" is enormous, as you can't use your libc for text processing
anymore.

Yes, it works in non-i18n programs, but right now most programs get
i18n support, which means they will all fail to properly handle
non-locale characters.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                     ` Marc Lehmann
  2004-02-16 20:20                       ` Jeff Garzik
  2004-02-16 21:10                       ` viro
@ 2004-02-17  7:18                       ` jw schultz
  2004-02-17  7:42                       ` Nick Piggin
  3 siblings, 0 replies; 120+ messages in thread
From: jw schultz @ 2004-02-17  7:18 UTC (permalink / raw)
  To: Linux kernel; +Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> > Quote from Jeff Garzik <jgarzik@pobox.com>:
> > None of this is a real problem, if everything is set up correctly and
> > bug free.  Unfortunately the Just Works thing falls apart in the,
> > (frequent), instances that it's not :-(.
>       
> And this is the whole point.
> 
> BTW, to people trying to explain some properties of UTF-8 to me. I don't
> think ad-hominem attacks like assuming that I don't understand UTF-8
> (without any indication that this is so) are useful.
> 
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

That "interpretation" is so narrow as to be unrealistic.
The kernel supports UTF-8 the same way a stage supports
rock musicians.  You confuse support with enforce, rather
like confusing tolerance with endorsement.

And it should be noted that the kernel doesn't produce file
names.  It only passes them along.

> Of course, I can feed the kernel UTF-8, and if everybody does that, it
> will generally work quite fine. However, Windows surely works fine if
> every program only feeds allowed values into system calls. And even unix
> dialects without memory protection work, as long as everybody plays
> fair.
>
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8

You mean enforce again.  That enhancement request has been
rejected repeatedly because such a thing would be highly
undesirable.  What might be a convenient but unnecessary
restriction today is too likely to become an unbearable
restriction tomorrow.  I don't want the kernel to have to
care about what is or isn't valid UTF-8.  I certainly don't
want to have the kernel loaded with outdated character
tables.

> environment in which applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.

I want an environment where applications can feed bytestreams
and expect the same bytestream in return.  I see enough
problems as a result of filesystems that don't do that.

> It's very desirable to have a kernel that actively supports this. It is

You mean enforces again.  Kernel as police, next thing you
will want is a kernel that prevents undesirable character
sequences.

> clearly not _required_, of course. But then again, process abstraction
> is also not required...

I'll tell you what.  Patch libc.  You can add UTF-8 filename
enforcement to libc.  There are only a few system calls that
would need to have their wrappers enlarged.  I'm sure the
libc people will direct you to someplace very warm if you
ask them for this enhancement.
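
As a rough illustration of what enforcement in a userspace wrapper
could look like, here is a minimal Python sketch rather than a libc
patch. The name checked_open and the choice of EILSEQ as the error code
are assumptions made for this example; no such wrapper exists in libc.

```python
import errno
import os


def checked_open(path: bytes, flags: int, mode: int = 0o666) -> int:
    """Open path only if its bytes are well-formed UTF-8 (hypothetical wrapper)."""
    try:
        path.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError as exc:
        # EILSEQ ("illegal byte sequence") is an assumed choice of error code.
        raise OSError(errno.EILSEQ, "file name is not valid UTF-8") from exc
    # The name passed the check, so hand it to the real system call untouched.
    return os.open(path, flags, mode)
```

Only programs that call the wrapper are affected; anything that talks
to the kernel directly can still create arbitrary byte-string names.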

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                     ` Marc Lehmann
                                         ` (2 preceding siblings ...)
  2004-02-17  7:18                       ` jw schultz
@ 2004-02-17  7:42                       ` Nick Piggin
  3 siblings, 0 replies; 120+ messages in thread
From: Nick Piggin @ 2004-02-17  7:42 UTC (permalink / raw)
  To: Marc Lehmann
  Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro, Linux kernel



Marc Lehmann wrote:

>On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
>
>>Quote from Jeff Garzik <jgarzik@pobox.com>:
>>None of this is a real problem, if everything is set up correctly and
>>bug free.  Unfortunately the Just Works thing falls apart in the,
>>(frequent), instances that it's not :-(.
>>
>      
>And this is the whole point.
>
>BTW, to people trying to explain some properties of UTF-8 to me. I don't
>think ad-hominem attacks like assuming that I don't understand UTF-8
>(without any indication that this is so) are useful.
>
>The point here is that the kernel does, in a very narrow interpretation,
>not support the use of UTF-8, because proper support of UTF-8 means that
>no illegal byte sequences will be produced.
>
>

So does the kernel support the English language? Does your
email client?


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17  7:14                         ` Lehmann 
@ 2004-02-17 11:20                           ` Helge Hafting
  2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
  1 sibling, 0 replies; 120+ messages in thread
From: Helge Hafting @ 2004-02-17 11:20 UTC (permalink / raw)
  Cc: Linus Torvalds, Jamie Lokier, Marc Lehmann, viro, Linux kernel

pcg@goof.com (Marc A. Lehmann) wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>Try it with a regular C locale. Do a simple
>>
>>	echo > åäö
> 
> 
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.
> 
> 
>>Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
>>program should do when it sees broken UTF-8.
> 
> 
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. Multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.
> 
> Yes, it's possible....

All you need is a possible_garbage_to_properly_escaped_utf8(char *string)
in libc.  Any program that wants to display filenames it got
straight from readdir (or any binary file contents) will simply feed
the string through that and get back a string with
escapes for anything that isn't utf8.  It is a write-once, use
everywhere thing.
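
A minimal Python sketch of such a helper follows. The function name is
taken from the paragraph above; the \xNN escape syntax and the decision
to escape control characters as well are assumptions of this sketch,
not something the message specifies.

```python
def possible_garbage_to_properly_escaped_utf8(raw: bytes) -> str:
    """Return printable text for raw, escaping anything that is not valid UTF-8."""
    # Double literal backslashes first so the escapes below stay unambiguous.
    doubled = raw.replace(b"\\", b"\\\\")
    # "backslashreplace" turns every undecodable byte into \xNN.
    text = doubled.decode("utf-8", errors="backslashreplace")
    # Escape non-printable characters (newline, ESC, ...) as well.
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )
```

Valid UTF-8 passes through unchanged, so a caller can apply this to
every name it is about to display without checking first.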

Once upon a time, there were serious problems when someone created
filenames like "; rm -fr *".  Today we use tab completion
and get bash to present the filename with proper escapes.  It is then harmless.
Bad utf8 can be handled the same way.

> The "bit" is enormous, as you can't use your libc for text processing
> anymore.

Not the current libc, but libc can be improved upon. The same happened to
silly code that wasn't 8-bit clean.

Helge Hafting

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:52                         ` Linus Torvalds
@ 2004-02-17 13:15                           ` Jamie Lokier
  0 siblings, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 13:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> So I claim (and yes, people are free to disagree with me) that a
> well-written UTF-8 program won't even have any real extra code to handle
> the "broken UTF-8" code. It's just another set of bytes that needs
> escaping, and they need escaping for _exactly_ the same reason some 
> regular utf-8 characters need escaping: because they can't be printed.

Even XML suffers from these sorts of problems: some Unicode characters
aren't allowed in XML, even as numeric references, so in theory XML
applications have to reject or escape some strings.

> So it's all the same thing - it's just the reasons for "unprintability"  
> that are slightly different.

My difficulty with directories containing non-UTF-8 filenames shows up
with web pages in Perl, and not the printability part.  Please excuse
the Perl-oriented examples; Perl has good support for UTF-8 while also
working with arbitrary byte strings, so it's a fine language to
illustrate current problems.

What do you put in a URL composed from filenames in a directory
listing page?  The obvious thing is to %-escape each byte of the
names, in fact that's what everybody does.
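
For illustration, byte-wise %-escaping can be sketched in Python with
the standard urllib.parse module. The two helper names are made up for
this example.

```python
from urllib.parse import quote, unquote_to_bytes


def name_to_url_component(raw_name: bytes) -> str:
    """%-escape every byte of a file name that is not URL-safe."""
    # safe="" makes sure "/" is escaped too; it cannot occur in a name anyway.
    return quote(raw_name, safe="")


def url_component_to_name(component: str) -> bytes:
    """Undo the escaping and return the original bytes, not text."""
    return unquote_to_bytes(component)
```

The round trip is lossless precisely because the result of unescaping
is kept as bytes; no encoding is assumed at any point.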

In a language like Perl, where strings are labelled according to their
encoding, that means when you unescape the URL you get a string
labelled as "byte string".  You shouldn't tell Perl it's a "UTF-8
string" because some of them won't be (they are strings from
directories).

That's fine if you don't do anything except use those strings
unchanged, but as soon as you want to do something else like prepend a
character with code >= 256 or apply a regex where the pattern has
Unicode characters, Perl transcodes "byte string" to "UTF-8 string"
assuming it was latin1.  That, of course, mangles the string when it's
come from a source which is "nominally UTF-8 but might not be".

Your recommendation to simply pass around bytes all the time doesn't
work well, because to maintain basic properties of strings such as
length(a) + length(b) = length(a+b), that implies you either (1)
always do indexing, lengths, splitting etc. on strings as bytes not
characters, or (2) every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way.  (2) is
particularly nasty because then your program's logic can't depend on
the nice properties of UTF-8 strings.

That's why this line of Perl fails:

    for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }

(The source file, by the way, is assumed to be UTF-8-encoded text).

Perl reads each file name, and declares it to be of type "byte
string".  Then "ņi-" is prepended, which contains a character code >=
256, so the result must be UTF-8 encoded according to Perl.  The
original file name is transcoded from what was assumed to be
iso-8859-1 to UTF-8, "ņi-" is prepended, and that becomes the target
file name for rename().

This mangles the names; both UTF-8 and non-UTF-8 filenames are mangled
equally badly.

Your suggestion means that Perl should do bytewise concatenation of
the "ņi-" (in UTF-8) and the filename (no encoding assumed).

It's a good one; it's exactly the right thing to do, and it works.
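
The difference between the two behaviours can be sketched in Python.
This is only a model of the transcoding described above, not of Perl's
actual implementation, and the function names are hypothetical.

```python
PREFIX = "\u0146i-"  # the "ņi-" prefix from the Perl one-liner


def prepend_with_latin1_upgrade(raw_name: bytes) -> bytes:
    """Treat the name as latin1, prepend the prefix as text, encode as UTF-8."""
    return (PREFIX + raw_name.decode("latin-1")).encode("utf-8")


def prepend_bytewise(raw_name: bytes) -> bytes:
    """Concatenate the UTF-8 bytes of the prefix with the untouched name."""
    return PREFIX.encode("utf-8") + raw_name
```

With the latin1 upgrade, the bytes of the original name are rewritten
on the way through; with bytewise concatenation they always survive as
a suffix of the new name, whatever encoding they were in.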

To do that in Perl, when you take a random byte string (such as from
readdir()) you must tell Perl it's a UTF-8 string, so shouldn't be
transcoded when it's combined with another UTF-8 string.  You can do
it, breaking documented rules of course (which say only do this when
you know it's valid UTF-8), with Encode::_utf8_on().

Guess what?  That actually works.  It does the filename operations
properly given any arbitrary filenames.

But remember I said "every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way" earlier,
and how this is bad because it's nice to depend on UTF-8 properties?

You've just told Perl to treat an arbitrary byte sequence as UTF-8,
when sometimes it isn't.  Among other things, simple operators like
length() and substr() don't work as expected on those weird strings.

When I say don't work as expected, I mean if you had a file named
"müeller" in latin1, Perl will think its length() is 2.  If you have
a file named "müller", Perl will not only report a length() of 1,
it'll spew a horrible error message when it calculates it.

These aren't Perl problems.  These are problems that any program will
have if it follows your suggestion of "keep everything in bytes" but
wants to combine filenames with other text or do pattern matching on
filenames.

It's not a problem if you can pass around a flag with each byte
sequence, carefully keeping readdir() results separate from text until
the point where you're prepared to have a policy saying what to do with
non-UTF-8 readdir() results.
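
One way to carry such a flag, sketched here in Python, is the
"surrogateescape" error handler: undecodable bytes are smuggled through
a text string as lone surrogates and restored on the way out. The
helper names are assumptions of this sketch.

```python
def name_to_text(raw: bytes) -> str:
    """Decode a readdir() result; non-UTF-8 bytes become lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_name(text: str) -> bytes:
    """Encode back to bytes, restoring any smuggled bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


def is_clean_text(text: str) -> bool:
    """True if the string carries no smuggled non-UTF-8 bytes."""
    return not any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)
```

The policy decision is then deferred to the point where is_clean_text()
is consulted, for instance just before the name is shown to a user.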

But it is a problem when you want to stuff readdir() results in a
general purpose "string" which is also used for text.

That's technically the wrong thing to do, in all programs.  In
practice, that's what programs do anyway because it's a lot easier
than having different string types for different data sources.

Most times it works out ok, but for the corners:

> It's just _hard_ to think of all the special cases, and most
> programs have bugs because somebody forgot something.

Exactly.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:58                     ` Marc Lehmann
@ 2004-02-17 14:12                       ` Dave Kleikamp
  0 siblings, 0 replies; 120+ messages in thread
From: Dave Kleikamp @ 2004-02-17 14:12 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

On Mon, 2004-02-16 at 14:58, Marc Lehmann wrote:

> True. Thanks a lot for explaining your arguments in this detail. In
> fact, I can accept most if not all of your arguments, but I still think
> it would be nice to have this extra functionality.
> 
> Arguments like "it's a pain to implement" (which I don't think it is, but
> you are clearly better in judging that!), weigh even more to me.
> 
> So even if I think it's a good idea, it might never be implemented for
> purely practical reasons.

Use jfs with the mount option iocharset=utf8 and you'll get exactly what
you are asking for.

-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17  7:14                         ` Lehmann 
  2004-02-17 11:20                           ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
@ 2004-02-17 15:56                           ` Linus Torvalds
       [not found]                             ` <20040217161111.GE8231@schmorp.de>
                                               ` (2 more replies)
  1 sibling, 3 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 15:56 UTC (permalink / raw)
  To: Marc; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel

On Tue, 17 Feb 2004, Marc wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > Try it with a regular C locale. Do a simple
> > 
> > 	echo > åäö
> 
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.

Ehh.. It's pointless to tell me that I can't do it. I just did.

The C locale is _not_ 7-bit only. The C locale is the traditional "byte 
locale" for UNIX. It will happily collate 8-bit-characters in their 
(numerical) order. Anything else would be seriously broken.
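
Ordering by numerical byte value can be illustrated with a short
Python sketch: comparing bytes objects works on unsigned byte values,
much like strcmp() in C. The function name is made up for this example.

```python
def byte_order_sort(names):
    """Sort byte strings by raw byte value, with no locale tables involved."""
    # bytes objects compare element by element as unsigned integers 0..255.
    return sorted(names)
```

Every 8-bit value has a well-defined place in this order, so a name
never has to be "valid" in any encoding to be sorted.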

> > Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> > program should do when it sees broken UTF-8.
> 
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. Multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.

These are all teething problems. The thing is, true multi-locale programs
haven't been around long enough that people take the problems for granted.  
A lot of them work today, but "work" is different from "always does the
right thing". These things take a _long_ time for people to sort out the
full implications of.

(Analogy time: how many people _still_ use "find ... | xargs xxx", even
though that can lead to problems and is thus wrong?  You should really use
"find ... -print0 | xargs -0 xxx" to get it _right_, but most people
ignore that, because the common form works for most cases.)
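
The difference can be modelled in a few lines of Python. This is a
simplification that assumes a consumer splitting on any whitespace; the
real xargs also interprets quotes and backslashes. The function names
are hypothetical.

```python
from typing import List


def parse_whitespace_list(stream: bytes) -> List[bytes]:
    """Split a name list on whitespace, roughly as the common form does."""
    return stream.split()


def parse_nul_list(stream: bytes) -> List[bytes]:
    """Split a NUL-terminated name list, as produced by -print0."""
    # NUL is the one byte that can never occur inside a file name.
    return [name for name in stream.split(b"\0") if name]
```

A name containing a space or a newline survives only the NUL-separated
form, which is why the common form merely "works for most cases".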

The process is complicated by the fact that most of the people who really 
care about UTF-8 and locales are very strict about it: they have been 
hitting their heads against latin1 users for a long time, and they are 
frustrated and _tired_ of it, and so they often hate single-byte usage 
with a passion, and consider it not only wrong but EVIL. Which is 
obviously silly, but hey, I understand why they can feel a bit put off by 
the problem.

So the multi-byte people often stare at the standards, and then _refuse_
to touch anything that isn't standards-compliant. When they see something
incorrect, they'd rather dump core (or just truncate it) than try to
handle it gracefully, because they want the whole world to see how
incorrect it is.

Which flies in the face of "Be strict in what you generate, be liberal in 
what you accept". A lot of the functions are _not_ willing to be liberal 
in what they accept. Which sometimes just makes the problem worse, for no 
good reason.

The fact is, you shouldn't use "iconv()" unless you controlled the input.
It's a bit like "gets()" - unsafe to use unless you generated the damn
thing yourself and you _know_ it fits in the buffer. But we just don't 
have the functions (yet) to do it _right_, and to escape the input some 
way (yeah, yeah, I know you can do it with iconv() and a lot of cruft 
around it - the point is that nobody does it, because it's too painful).

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
       [not found]                             ` <20040217161111.GE8231@schmorp.de>
@ 2004-02-17 16:32                               ` Linus Torvalds
  2004-02-17 16:46                                 ` Jamie Lokier
                                                   ` (3 more replies)
  0 siblings, 4 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 16:32 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Jamie Lokier, viro, Linux kernel

On Tue, 17 Feb 2004, Marc Lehmann wrote:
> 
> Because there is a fundamental difference between file contents and
> filenames. Filenames are supposed to be text.

I think this is actually the fundamental point where we disagree.

You think of filenames as something the user types in, and that is 
"readable text". And I don't.

I think the filenames are just ways for a _program_ to look up stuff, and
the human readability is a secondary thing (it's "polite", but not a
fundamental part of their meaning).

So the same way I think text is good in config files and I dislike binary
blobs (hey, look at /proc), I think readable filenames are good. But that
doesn't mean that they have to be readable. I can well imagine encoding
meta-data in the filename for some database that uses the filesystem as
its backing store and generates files for large blobs. And then there
would be little if any "goodness" to keeping the filenames readable.

That's also a situation where case-insensitivity can _really_ screw you
(just one of the many).

It may be rare, but unlike you, I don't think there is anything "wrong" 
with considering path components to be just "data".

			Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
       [not found]                             ` <20040217161111.GE8231@schmorp.de>
@ 2004-02-17 16:36                             ` Jamie Lokier
  2004-02-17 17:52                               ` viro
  2004-02-18  3:07                               ` H. Peter Anvin
  2004-02-21 13:54                             ` Pavel Machek
  2 siblings, 2 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 16:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc, Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> Which flies in the face of "Be strict in what you generate, be liberal in 
> what you accept". A lot of the functions are _not_ willing to be liberal 
> in what they accept. Which sometimes just makes the problem worse, for no 
> good reason.

Unicode specifies that a program claiming to read UTF-8 _must_ reject
malformed UTF-8.

Ok, we can just ignore Unicode. :)

But the reason they cite is security: when applications allow
malformed UTF-8 through, there's plenty of scope for security holes
due to multiple encodings of "/" and "." and "\0".

This is a real problem: plenty of those Windows worms that attack web
servers get in by using multiple-escaped funny characters and
malformed UTF-8 to get past security checks for ".." and such.
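
A small Python sketch of the mechanism: the bytes 0xC0 0xAE and
0xC0 0xAF are overlong (and therefore malformed) two-byte encodings of
"." and "/". The lenient decoder below is deliberately sloppy and
exists only for this illustration; it is not taken from any real
program.

```python
def lenient_decode(raw: bytes) -> str:
    """Decode 1- and 2-byte UTF-8 sequences without rejecting overlong forms."""
    out = []
    i = 0
    while i < len(raw):
        b0 = raw[i]
        if b0 < 0x80:
            out.append(chr(b0))
            i += 1
        elif 0xC0 <= b0 <= 0xDF and i + 1 < len(raw) and (raw[i + 1] & 0xC0) == 0x80:
            # No check that the code point really needs two bytes: the bug.
            out.append(chr(((b0 & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


def strict_is_valid(raw: bytes) -> bool:
    """True only for well-formed UTF-8, using Python's strict decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A check for b".." on the raw bytes passes, yet the lenient decoder
later turns the same bytes into "../", which is the kind of gap the
strict rule is meant to close.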

In theory these are not problems; all programs should be liberal in
what they accept, and robust in handling data from the outside world.

In practice, programs quickly lose track of which text is from the
outside world and which is from a trusted source or checked source.
These worms are quite successful at exploiting things the programmers
didn't think of.  Being _conservative_ at all places which scan UTF-8
does seem like it might help a little.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:32                               ` Linus Torvalds
@ 2004-02-17 16:46                                 ` Jamie Lokier
  2004-02-17 19:00                                   ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
  2004-02-18 13:11                                   ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Matthew Garrett
  2004-02-17 16:52                                 ` Marc Lehmann
                                                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 16:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).

Politeness is nice.  I'm sure there's a pragmatic reason most
filenames are meaningful text in some human language :)

I'd like a way to type something like "touch zöe.txt" on an ordinary
latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)
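
In userspace this amounts to transcoding at the boundary between the
terminal and the filesystem. A minimal Python sketch, assuming the
terminal really does send latin1 (the function name and its default
argument are made up for this example):

```python
def terminal_to_fs_name(typed: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Convert a name as typed on the terminal into UTF-8 bytes for the filesystem."""
    # Decode using the terminal's encoding, then re-encode as UTF-8.
    return typed.decode(terminal_encoding).encode("utf-8")
```

The conversion is only correct if the assumed terminal encoding is
right, which is exactly the information the kernel does not have.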

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:32                               ` Linus Torvalds
  2004-02-17 16:46                                 ` Jamie Lokier
@ 2004-02-17 16:52                                 ` Marc Lehmann
  2004-02-17 16:54                                 ` UTF-8 practically vs. theoretically in the VFS API Stefan Smietanowski
  2004-02-17 20:37                                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Robin Rosenberg
  3 siblings, 0 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-17 16:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Tue, Feb 17, 2004 at 08:32:15AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > 
> > Because there is a fundamental difference between file contents and
> > filenames. Filenames are supposed to be text.
> 
> I think this is actually the fundamental point where we disagree.

I guess that probably explains it. And I know of no striking arguments to
convince you of changing your fundamental opinion.

*sigh*. Ok, we agree to disagree :)

> It may be rare, but unlike you, I don't think there is anything "wrong" 
> with considering path components to be just "data".

Yeah, there are three things - text, binary, and data (and probably more).

Filenames are then "mostly text", "no binary", and still suitable for
"data".

I have read the example somebody posted of some application encoding
"near-binary" data into filenames (e.g. uglies like "\n" or worse).
However, I think that these cases are extremely rare and not really worth
supporting. Not supporting this is not a problem for applications - after
all, base64 or escaping (that is needed even for "near-binary") works fine
for these apps, too, ignoring the problem of backwards compatibility.
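
For illustration, the base64 approach can be sketched in Python with
the URL-safe alphabet, which avoids "/" in the resulting name. The
helper names are assumptions of this sketch.

```python
import base64


def blob_to_filename(blob: bytes) -> str:
    """Encode arbitrary bytes as an ASCII-only file name component."""
    # The URL-safe alphabet uses "-" and "_" instead of "+" and "/".
    return base64.urlsafe_b64encode(blob).decode("ascii")


def filename_to_blob(name: str) -> bytes:
    """Recover the original bytes from a name made by blob_to_filename()."""
    return base64.urlsafe_b64decode(name.encode("ascii"))
```

The cost is that names grow by about a third, so less data fits
within the filesystem's name length limit.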

I think that everyone having had the experience of dealing with filenames
containing \n etc., despite your shell/GUI helping in quoting, will easily
share this opinion about usefulness.

That's why it should be a mount option, i.e. an enforceable standard.

And since it seems that JFS already supports this (to some degree unknown
to me), I don't think it should be such a pain to implement.

But yes, I am most probably not going to implement it, especially not if
it will simply never be accepted.

So I guess it simply won't be done. I think it's omitting a highly
useful feature, but I will survive it.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:32                               ` Linus Torvalds
  2004-02-17 16:46                                 ` Jamie Lokier
  2004-02-17 16:52                                 ` Marc Lehmann
@ 2004-02-17 16:54                                 ` Stefan Smietanowski
  2004-02-18  1:27                                   ` Hans Reiser
  2004-02-17 20:37                                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Robin Rosenberg
  3 siblings, 1 reply; 120+ messages in thread
From: Stefan Smietanowski @ 2004-02-17 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, Jamie Lokier, viro, Linux kernel

Hi Linus.

>>Because there is a fundamental difference between file contents and
>>filenames. Filenames are supposed to be text.
> 
> I think this is actually the fundamental point where we disagree.
> 
> You think of filenames as something the user types in, and that is 
> "readable text". And I don't.
> 
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).
> 
> So the same way I think text is good in config files and I dislike binary
> blobs (hey, look at /proc), I think readable filenames are good. But that
> doesn't mean that they have to be readable. I can well imagine encoding
> meta-data in the filename for some database that uses the filesystem as
> its backing store and generates files for large blobs. And then there
> would be little if any "goodness" to keeping the filenames readable.

Just look at Mozilla's cache... They may have turned the blob into
ascii but it's still a blob.

// Stefan



* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:36                             ` Jamie Lokier
@ 2004-02-17 17:52                               ` viro
  2004-02-17 19:29                                 ` Jamie Lokier
  2004-02-18  3:07                               ` H. Peter Anvin
  1 sibling, 1 reply; 120+ messages in thread
From: viro @ 2004-02-17 17:52 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel

On Tue, Feb 17, 2004 at 04:36:13PM +0000, Jamie Lokier wrote:
> But the reason they cite is security: when applications allow
> malformed UTF-8 through, there's plenty of scope for security holes
> due to multiple encodings of "/" and "." and "\0".
> 
> This is a real problem: plenty of those Windows worms that attack web
> servers get in by using multiple-escaped funny characters and
> malformed UTF-8 to get past security checks for ".." and such.

Pardon?  For that kernel would have to <drumrolls> interpret the bytestream
as UTF-8.  We do not.  So your malformed UTF-8 for .. won't be treated as
.. by the kernel.

BTW, speaking of Plan 9, they do *NOT* reject malformed UTF-8 in pathnames.
Filtering they do is against ASCII controls - i.e. \1--\37 and \177.

The only differences between our generic checks and Plan 9 generic checks
(aside from whatever checks a particular fs might do) are:
	1) they allow longer pathnames (64K vs our 4K, from my reading of
9/port/chan.c)
	2) they do not allow pathnames containing any octet in range 1--31
	3) they do not allow pathnames containing DEL (octet 127)

The rest is identical:

	* Pathname is split into components by instances of octet 47 (/).
	* Component is special if it's {octet 46} or {octet 46, octet 46}
(. and .. resp.).
	* Name is terminated by octet 0 (NUL).
	* Name components are fed to filesystem drivers without any conversions
- they go as arrays of char, with no concern for encoding.
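
The rules listed above can be sketched in a few lines. This is only an
illustration in Python, not kernel or Plan 9 code; the function names
(split_components, is_special, plan9_rejects) are made up for the example,
and collapsing repeated "/" into nothing is an assumption of the sketch.

```python
def split_components(path):
    """Split a pathname (bytes) the way described above: stop at the
    first NUL (octet 0), then cut at every octet 47 ('/'). Empty
    components produced by repeated slashes are dropped (an assumption
    of this sketch)."""
    path = path.split(b"\0", 1)[0]
    return [c for c in path.split(b"/") if c]


def is_special(component):
    """Only the exact byte sequences {46} and {46, 46} are special."""
    return component in (b".", b"..")


def plan9_rejects(path):
    """The extra Plan 9 filter described above: octets 1..31 and 127
    (DEL) are not allowed. No UTF-8 interpretation is involved."""
    return any(1 <= octet <= 31 or octet == 127 for octet in path)
```

Note that an overlong UTF-8 encoding of ".." is just four ordinary octets
to is_special(), which is the point being made: nothing in the lookup
interprets the bytes as characters.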

So could we please put that strawman to rest?


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:46                                 ` Jamie Lokier
@ 2004-02-17 19:00                                   ` Måns Rullgård
  2004-02-17 20:57                                     ` Jamie Lokier
  2004-02-18 13:11                                   ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Matthew Garrett
  1 sibling, 1 reply; 120+ messages in thread
From: Måns Rullgård @ 2004-02-17 19:00 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Linus Torvalds wrote:
>> I think the filenames are just ways for a _program_ to look up stuff, and
>> the human readability is a secondary thing (it's "polite", but not a
>> fundamental part of their meaning).
>
> Politeness is nice.  I'm sure there's a pragmatic reason most
> filenames are meaningful text in some human language :)
>
> I'd like a way to type something like "touch zöe.txt" on an ordinary
> latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)

Then hack either bash (or whatever shell you use) or touch to do just that.

-- 
Måns Rullgård
mru@kth.se



* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 17:52                               ` viro
@ 2004-02-17 19:29                                 ` Jamie Lokier
  2004-02-17 19:45                                   ` Linus Torvalds
                                                     ` (2 more replies)
  0 siblings, 3 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 19:29 UTC (permalink / raw)
  To: viro; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Tue, Feb 17, 2004 at 04:36:13PM +0000, Jamie Lokier wrote:
> > But the reason they cite is security: when applications allow
> > malformed UTF-8 through, there's plenty of scope for security holes
> > due to multiple encodings of "/" and "." and "\0".
> > 
> > This is a real problem: plenty of those Windows worms that attack web
> > servers get in by using multiple-escaped funny characters and
> > malformed UTF-8 to get past security checks for ".." and such.
> 
> Pardon?  For that kernel would have to <drumrolls> interpret the bytestream
> as UTF-8.  We do not.  So your malformed UTF-8 for .. won't be treated as
> .. by the kernel.

Well, the security checks on ".." which worms get past aren't in the
kernel either.  This time _you_ made the strawman :)

What happens is that one program or library checks an incoming path
for ".." components - that code knows nothing about UTF-8 of course.

Then it passes the string to another program which assumes the path
has been subject to appropriate security checks, munges it in UTF-8,
and eventually does a file operation with it.  The munging generates
".." components from non-minimal UTF-8 forms - if it's not obeying the
Unicode rejection requirement (which wasn't in earlier versions), that is.

A realistic example is where the second program reads files whose
paths are mentioned in a text file which is parsed as UTF-8, after the
first program has done a security check by grepping for ".."
components.

Unicode says the second program shouldn't accept malformed UTF-8,
precisely because in real scenarios (like this one) there's a mix of
programs and libraries, some aware of UTF-8, some not, and the latter
are involved in security decisions.
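
A minimal sketch of the scenario, in Python. The lax_decode() function is
a deliberately sloppy decoder written for this illustration (it stands in
for an old, non-conforming UTF-8 parser and handles only 1- and 2-byte
forms); byte_level_check() stands in for the first program. Neither is
taken from real software.

```python
def byte_level_check(path):
    """First program: rejects literal '..' components, knows no UTF-8."""
    return b".." not in path.split(b"/")


def lax_decode(data):
    """Second program's buggy decoder: accepts non-minimal 2-byte forms
    instead of rejecting them."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b < 0xE0 and i + 1 < len(data):
            # No check that the result needed two bytes: this is the bug.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            out.append("\ufffd")
            i += 1
    return "".join(out)


def strict_rejects(data):
    """A conforming decoder refuses the same input outright."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xC0 0xAE is a non-minimal (overlong) encoding of '.', octet 46.
overlong = b"\xc0\xae\xc0\xae/secret"
```

Here byte_level_check(overlong) passes, lax_decode(overlong) produces
"../secret", and strict_rejects(overlong) is true, which is the rejection
behaviour Unicode asks for.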

Here on linux-kernel we're saying that if the second program accepts
any old byte sequence in a filename, it should preserve the byte
sequence exactly.  But any program whose parser-tokeniser is scanning
UTF-8 is very unlikely to do that - it's just too complicated to say
some bits of a text stream must be remembered as literal bytes, and
others must be scanned as multibyte characters.

We can't blame the second program for allowing those dodgy paths,
because it's the _first_ program which is setting policy.  We can't
blame the first program, because it doesn't care about UTF-8.  The
second program is just obeying orders, and the first program is just
applying POSIX rules.

These types of security holes are quite real, among software which
handles UTF-8 and also deals with paths.  At the current time, that
especially means XML, HTML, URIs, web servers and things behind them.

The holes only arise because software which is interpreting UTF-8 is
mixed with software which isn't.  That's one of the most useful
features of UTF-8, after all - that's why we use it for filenames.

Understand, this isn't a kernel problem; it is simply a good reason
to reject malformed UTF-8 by programs which parse UTF-8.

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:29                                 ` Jamie Lokier
@ 2004-02-17 19:45                                   ` Linus Torvalds
  2004-02-17 20:30                                     ` Jamie Lokier
  2004-02-17 19:51                                   ` Jamie Lokier
  2004-02-17 19:53                                   ` viro
  2 siblings, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 19:45 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Marc, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:
> 
> Well, the security checks on ".." which worms get past aren't in the
> kernel either.  This time _you_ made the strawman :)

Note that this is something that the kernel _can_ fix easily.

In particular, we already have flags like LOOKUP_FOLLOW and
LOOKUP_DIRECTORY that we use internally in the kernel to specify how to do
certain operations. We export _part_ of that to user space with the
O_DIRECTORY flag that says "allow open of directories".

And yes, we have security-related ones too (LOOKUP_NOALT disables the
alternate mount-point lookup).

And it would be _trivial_ to add a LOOKUP_NODOTDOT and allow user space to
use it through a O_NODOTDOT thing. But the people who need it really need
to do it and test it, and they need to be committed enough that they say
"yes, we'd use this, even though it's not portable". Because I don't want
to add features to the kernel that people don't use, and a lot of the
users don't want to use Linux-only things..

Same goes for O_NOFOLLOW or O_NOMOUNT, to tell the kernel that it
shouldn't follow symbolic links or cross mount-points - another thing that
some software might want to use in order to check that you can't "escape"  
your subtree.

So these things would be literally trivial to add, and the only issue is 
whether people would really use them.

> What happens is that one program or library checks an incoming path
> for ".." components - that code knows nothing about UTF-8 of course.
> 
> Then it passes the string to another program which assumes the path
> has been subject to appropriate security checks, munges it in UTF-8,
> and eventually does a file operation with it.  The munging generates
> ".." components from non-minimal UTF-8 forms - if it's not obeying the
> Unicode rejection requirement (which wasn't in earlier versions), that is.

But note how my point was that YOU SHOULD NEVER EVER MUNGE A PATHNAME!

It is fundamentally _wrong_ to convert pathnames. You _cannot_ do it 
correctly. 

The rule should be:
 - convert user-input to UTF-8 early (do _nothing_ to it before the 
   conversion). Allow escape sequences here.
 - never ever convert readdir/getcwd/etc system-specified paths AT ALL. 
   They are already in "extended UTF-8" format (where the "extended" part 
   is the 'broken UTF-8' thing. I can be like MS and call my breakage 
   "extended" too ;)
 - always _always_ work on the "extended UTF-8" format, and never EVER 
   convert that to anything else (except when you need to actually print
   it, but then you encode it properly with escape sequences, the way you 
   have to _anyway_).

If you follow the above simple rules, you can't get it wrong. And in those 
rules, ".." is the BYTE SEQUENCE in the "extended UTF-8". Nothing more.
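
As an illustration of the last rule (escape only at the point of
printing), here is a small Python sketch. display_name() is a
hypothetical helper, and the "\xNN" escape style is just one possible
choice; the stored name is never modified, only its printed form.

```python
def display_name(raw):
    """Return a printable rendering of a filename given as bytes.

    Well-formed UTF-8 is shown as text; malformed bytes and ASCII
    control characters are shown as \\xNN escapes. The result is for
    display only and is not meant to be converted back."""
    text = raw.decode("utf-8", errors="backslashreplace")
    return "".join(
        "\\x%02x" % ord(ch) if ord(ch) < 0x20 or ord(ch) == 0x7F else ch
        for ch in text
    )
```

All lookups, comparisons and system calls would keep using the original
bytes; only the terminal or log line sees the escaped string.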

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:29                                 ` Jamie Lokier
  2004-02-17 19:45                                   ` Linus Torvalds
@ 2004-02-17 19:51                                   ` Jamie Lokier
  2004-02-17 19:53                                   ` viro
  2 siblings, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 19:51 UTC (permalink / raw)
  To: viro; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel

Jamie Lokier wrote:
> Understand, this isn't a kernel problems; it is simply a good reason
> to reject malformed UTF-8 by programs which parse UTF-8.

I should make clear: since the kernel _doesn't_ parse UTF-8, the
kernel _isn't_ an appropriate place to reject it.

Any userspace program which treats the result of readdir() as UTF-8
characters for any purpose should reject malformed names.  The tough
design decisions are: where in the program to do it, and how to ensure
it will always be done.

You have to reject or escape malformed names at _some_ stage when they
are going to appear in a text context.  The trouble is doing it too
soon (where the program calls readdir()) prevents operating on some
files, and doing it later (where the program is going to use it in a
text context) is easy to forget because by the time a string from
readdir() has travelled through many layers of abstraction between
libraries, it's easy to forget its byteish properties.

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:29                                 ` Jamie Lokier
  2004-02-17 19:45                                   ` Linus Torvalds
  2004-02-17 19:51                                   ` Jamie Lokier
@ 2004-02-17 19:53                                   ` viro
  2004-02-17 20:35                                     ` John Bradford
  2004-02-17 20:38                                     ` Jamie Lokier
  2 siblings, 2 replies; 120+ messages in thread
From: viro @ 2004-02-17 19:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel

On Tue, Feb 17, 2004 at 07:29:18PM +0000, Jamie Lokier wrote:
> What happens is that one program or library checks an incoming path
> for ".." components - that code knows nothing about UTF-8 of course.
> 
> Then it passes the string to another program which assumes the path
> has been subject to appropriate security checks, munges it in UTF-8,
> and eventually does a file operation with it.  The munging generates
> ".." components from non-minimal UTF-8 forms - if it's not obeying the
> Unicode rejection requirement (which wasn't in earlier versions), that is.
 
Why the hell would it _ever_ do such normalization?

> A realistic example is where the second program reads files whose
> paths are mentioned in a text file which is parsed as UTF-8, after the
> first program has done a security check by grepping for ".."
> components.
> 
> Unicode says the second program shouldn't accept malformed UTF-8,
> precisely because in real scenarios (like this one) there's a mix of
> programs and libraries, some aware of UTF-8, some not, and the latter
> are involved in security decisions.
> 
> Here on linux-kernel we're saying that if the second program accepts
> any old byte sequence in a filename, it should preserve the byte
> sequence exactly.  But any program whose parser-tokeniser is scanning
> UTF-8 is very unlikely to do that - it's just too complicated to say
> some bits of a text stream must be remembered as literal bytes, and
> others must be scanned as multibyte characters.

So what you are saying is that conversion of invalid multibyte sequences
into non-error wide chars followed by conversion back into UTF-8 can
lead to trouble?  *DUH*

> The holes only arise because software which is interpreting UTF-8 is
> mixed with software which isn't.  That's one of the most useful
> features of UTF-8, after all - that's why we use it for filenames.

The holes only arise because software which is interpreting UTF-8 doesn't
care to do it properly.  Software that doesn't interpret it (including the
kernel) doesn't enter the picture at all.


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:45                                   ` Linus Torvalds
@ 2004-02-17 20:30                                     ` Jamie Lokier
  2004-02-17 20:49                                       ` Linus Torvalds
  0 siblings, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Marc, Marc Lehmann, Linux kernel

Linus Torvalds wrote:
> And it would be _trivial_ to add a LOOKUP_NODOTDOT and allow user space to
> use it through a O_NODOTDOT thing.

Nope.  That wouldn't help for a bundle of libraries that goes:

    1. Eliminate "." and ".." components, leaving only leading ".."s.
    2. Reject path if it has a leading "..".
    3. Shove it in a string with some other text and pass to other library.

Next program does:

    4. Extract path from string.
    5. open ("/var/public/files/$PATH", ...)

O_NODOTDOT won't protect against that.

> Same goes for O_NOFOLLOW or O_NOMOUNT, to tell the kernel that it
> shouldn't follow symbolic links or cross mount-points - another thing that
> some software might want to use in order to check that you can't "escape"  
> your subtree.

( O_NOMOUNT is a good idea.  I like O_NOFOLLOW - already use it to
avoid lstat() calls. )

> But note how my point was that YOU SHOULD NEVER EVER MUNGE A PATHNAME!
> 
> It is fundamentally _wrong_ to convert pathnames. You _cannot_ do it 
> correctly. 

I know.  You know.  I think everyone else got it the first time too ;)

Have you ever written a script which takes a pathname and puts it in a
text file, and passes to another to operate on, and just skipped over
the details of what a poorly placed control character would do?

Real applications do exactly that sort of thing.  Mostly it works,
occasionally security holes are found.  Welcome to the land of dodgy
CGI scripts and program generated Makefiles.

This is the same.

>  - always _always_ work on the "extended UTF-8" format, and never EVER 
>    convert that to anything else (except when you need to actually print
>    it, but then you encode it properly with escape sequences, the way you 
>    have to _anyway_).
> 
> If you follow the above simple rules, you can't get it wrong. And in those 
> rules, ".." is the BYTE SEQUENCE in the "extended UTF-8". Nothing more.

Yup.  It works right up until you pass your string to a library which
doesn't follow that rule, and which munges malformed UTF-8 because
it's _expecting_ well formed UTF-8.  E.g. you pass a path in an XML
document; the XML parser at the other end will either munge your path
(causing a security hole), or reject it (which is good).

The right thing to do on these occasions is check and/or escape
"extended UTF-8" prior to putting it into a text context.

Practically, that means a UTF-8 aware program has to keep track of
which text is "extended UTF-8" (i.e. bytes), and which text is real UTF-8.

Practically, it means every interface where a path may be passed in a
UTF-8 string has to define whether that's an escaped path, which will
be unescaped before being used for a system call, or an unescaped path.
Then you get into what kind of escaping.
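
One possible kind of escaping, sketched in Python: percent-encoding, so
that an arbitrary byte-string path travels through a text interface as
plain ASCII and is unescaped only right before the system call. The
choice of percent-encoding and the names escape_path/unescape_path are
assumptions of this example, not something any particular interface
mandates.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escape_path(raw):
    """Bytes path -> ASCII-only text, keeping '/' readable."""
    return quote_from_bytes(raw, safe="/")


def unescape_path(text):
    """Escaped text -> the exact original bytes."""
    return unquote_to_bytes(text)
```

The escaped form is valid UTF-8 by construction, so a UTF-8 parser in the
middle has nothing to munge, and the round trip returns the original
bytes unchanged.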

In theory all those checks and escapings will be in the right places.
In theory C programs don't have buffer overflows either.

It is exacerbated because UTF-8 is often passed through middle-layer
programs and libraries that don't know anything about it, so when
assembling a whole system it's all too easy to lose track of where to
put the checks and escapings - and where not to.

Yes there _is_ a perfectly fine solution: the one you gave.

In practice it is difficult to ensure a whole system where paths are
mixed with text is consistent about that.  And that's where we get a
good selection of our Windows worms from.

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:53                                   ` viro
@ 2004-02-17 20:35                                     ` John Bradford
  2004-02-17 20:40                                       ` Jamie Lokier
                                                         ` (2 more replies)
  2004-02-17 20:38                                     ` Jamie Lokier
  1 sibling, 3 replies; 120+ messages in thread
From: John Bradford @ 2004-02-17 20:35 UTC (permalink / raw)
  To: viro, Jamie Lokier; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel, john

> > Here on linux-kernel we're saying that if the second program accepts
> > any old byte sequence in a filename, it should preserve the byte
> > sequence exactly.  But any program whose parser-tokeniser is scanning
> > UTF-8 is very unlikely to do that - it's just too complicated to say
> > some bits of a text stream must be remembered as literal bytes, and
> > others must be scanned as multibyte characters.
> 
> So what you are saying is that conversion of invalid multibyte sequences
> into non-error wide chars followed by conversion back into UTF-8 can
> lead to trouble?  *DUH*
> 
> > The holes only arise because software which is interpreting UTF-8 is
> > mixed with software which isn't.  That's one of the most useful
> > features of UTF-8, after all - that's why we use it for filenames.
> 
> The holes only arise because software which is interpreting UTF-8 doesn't
> care to do it properly.  Software that doesn't interpret it (including the
> kernel) doesn't enter the picture at all.

So is your approach to this problem that because the security issue
isn't specifically in the kernel, we shouldn't discuss it here?
Despite the fact that there are perfectly good ways of not only
working around the issue, but preventing it from even existing, that
could be implemented in the kernel?

Filenames _are_ arbitrary strings of bytes, I.E. binary data, and that
is how they should be.  I totally agree with that, (except that
obviously \0 and / have to be treated specially).

However, I don't see why it is any more logical to make the suggestion
that filenames generally be treated as UTF-8, IFF they are text at
all, than it is to suggest that filenames should be arbitrary strings
of 32-bit words.

Why not:

* State that filenames are strings of 32-bit words.  UCS-4 should be
  the preferred format for storing text in them, but storing legacy
  encodings in the low 8 bits is acceptable, (but a Bad Thing for new
  installations).

* Let legacy applications store 8-bit values in those 32-bit words if
  they want to, but strongly recommend only 7-bit ASCII values are
  stored, not values 128-255.

* Create a divide - filesystems that support strings of 32-bit words
  in their on-disk format, and those that don't.  Those that don't can
  simulate 32-bit words for the new functions that require them, by
  padding with \0 high bytes.

* Hide all filenames with any values > 255 in them from legacy
  applications, by not returning data about them in existing legacy
  kernel functions.

* Introduce new routines to deal with 32-bit filenames, which Unicode
  applications can use when they need to store non-ASCII, (I.E. non
  7-bit).

* Note that UTF-8 stored in the low bytes is still acceptable, (but
  deprecated in favour of UCS-4).

* Note that this preserves the philosophies of:
  * No policy in the kernel
  * Filenames are arbitrary bytestreams, (but now optionally 32-bit
    ones, not just 8-bit ones!)

* Note that this is very different to my last suggestion, which was
  fundamentally broken in many ways.

John.


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:32                               ` Linus Torvalds
                                                   ` (2 preceding siblings ...)
  2004-02-17 16:54                                 ` UTF-8 practically vs. theoretically in the VFS API Stefan Smietanowski
@ 2004-02-17 20:37                                 ` Robin Rosenberg
  3 siblings, 0 replies; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-17 20:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, Jamie Lokier, viro, Linux kernel

On Tuesday 17 February 2004 17.32, Linus Torvalds wrote:
> 
> On Tue, 17 Feb 2004, Marc Lehmann wrote:
> > 
> > Because there is a fundamental difference between file contents and
> > filenames. Filenames are supposed to be text.
> 
> I think this is actually the fundamental point where we disagree.
> 
> You think of filenames as something the user types in, and that is 
> "readable text". And I don't.
> 
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).

So why don't we use an int as "filename" and why are users to "type" in
filenames? How foolish...

-- robin


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 19:53                                   ` viro
  2004-02-17 20:35                                     ` John Bradford
@ 2004-02-17 20:38                                     ` Jamie Lokier
  1 sibling, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:38 UTC (permalink / raw)
  To: viro; +Cc: Linus Torvalds, Marc, Marc Lehmann, Linux kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> So what you are saying is that conversion of invalid multibyte sequences
> into non-error wide chars followed by conversion back into UTF-8 can
> lead to trouble?  *DUH*

Yes.  (The point being that it's a common bug, just like buffer
overflows are common.  Rejecting malformed UTF-8 is a defensive
strategy against it).

> > The holes only arise because software which is interpreting UTF-8 is
> > mixed with software which isn't.  That's one of the most useful
> > features of UTF-8, after all - that's why we use it for filenames.
> 
> The holes only arise because software which is interpreting UTF-8 doesn't
> care to do it properly.

That's right.  Software which does it properly rejects malformed UTF-8.
That is the entire point of my post.

> Software that doesn't interpret it (including the kernel) doesn't
> enter the picture at all.

Yes.

Your posting merely repeated what I said, so I assume we're in agreement :)

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:35                                     ` John Bradford
@ 2004-02-17 20:40                                       ` Jamie Lokier
  2004-02-17 20:50                                         ` John Bradford
  2004-02-17 20:47                                       ` viro
  2004-02-17 20:59                                       ` Linus Torvalds
  2 siblings, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:40 UTC (permalink / raw)
  To: John Bradford; +Cc: viro, Linus Torvalds, Marc, Marc Lehmann, Linux kernel

John Bradford wrote:
> However, I don't see why it is any more logical to make the suggestion
> that filenames generally be treated as UTF-8, IFF they are text at
> all, than it is to suggest that filename should be arbitrary strings
> of 32-bit words.

Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
I can't think of a single advantage, just lots of disadvantages.

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:35                                     ` John Bradford
  2004-02-17 20:40                                       ` Jamie Lokier
@ 2004-02-17 20:47                                       ` viro
  2004-02-17 20:53                                         ` John Bradford
  2004-02-17 20:59                                       ` Linus Torvalds
  2 siblings, 1 reply; 120+ messages in thread
From: viro @ 2004-02-17 20:47 UTC (permalink / raw)
  To: John Bradford
  Cc: Jamie Lokier, Linus Torvalds, Marc, Marc Lehmann, Linux kernel

On Tue, Feb 17, 2004 at 08:35:22PM +0000, John Bradford wrote:
> Why not:

[snip a massive pile of idiocy]
 
> * Note that this is very different to my last suggestion, which was
>   fundamentally broken in many ways.

So is that one.  To the level of Hohensee and RBJ.

*plonk*


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:30                                     ` Jamie Lokier
@ 2004-02-17 20:49                                       ` Linus Torvalds
  2004-02-17 21:17                                         ` Jamie Lokier
  0 siblings, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 20:49 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Marc, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:
> 
> Nope.  That wouldn't help for a bundle of libraries that goes:
> 
>     1. Eliminate "." and ".." components, leaving only leading ".."s.

Who does this anyway? It's wrong. It gives the wrong answer if there was a 
symlink somewhere.

I remember this exact bug in gcc (I think), at some point - trying to
"optimize" the path "aa/bb/../cc" into "aa/cc" is WRONG WRONG WRONG. They
are not the same thing at all.

Any library that does the above is broken.
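
A small demonstration of why the two paths differ, sketched in Python on
a throwaway directory tree. The directory and file names are invented
for the example; os.path.normpath() plays the role of the library that
"optimizes" the path lexically.

```python
import os
import tempfile


def lexical_vs_physical():
    """Return (contents via aa/bb/../cc, contents via normalized path)."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "aa"))
        os.makedirs(os.path.join(tmp, "elsewhere", "bb"))
        # aa/bb is a symlink pointing into a different directory.
        os.symlink(os.path.join(tmp, "elsewhere", "bb"),
                   os.path.join(tmp, "aa", "bb"))
        with open(os.path.join(tmp, "aa", "cc"), "w") as f:
            f.write("aa/cc")
        with open(os.path.join(tmp, "elsewhere", "cc"), "w") as f:
            f.write("elsewhere/cc")

        original = os.path.join(tmp, "aa", "bb", "..", "cc")
        collapsed = os.path.normpath(original)  # lexical: .../aa/cc

        with open(original) as f:
            via_kernel = f.read()
        with open(collapsed) as f:
            via_lexical = f.read()
        return via_kernel, via_lexical
```

The kernel follows the symlink before applying "..", so the original
path opens elsewhere/cc while the collapsed one opens aa/cc.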

>     2. Reject path if it has a leading "..".
>     3. Shove it in a string with some other text and pass to other library.
> 
> Next program does:
> 
>     4. Extract path from string.
>     5. open ("/var/public/files/$PATH", ...)
> 
> O_NODOTDOT won't protect against that.

Ok, so explain why? O_NODOTDOT will certainly guarantee that it stays
inside "/var/public/files", since it has no way to escape (modulo
symlinks/mounts, of course).

The point being that with O_NODOTDOT | O_NOMOUNT | O_NOFOLLOW, you can
just do a simple "prepend my beginning pathname" operation, and do the
open that way without having to be careful.

Then, if the thing fails, you now need to be really careful, and perhaps
do a user-space "walk one component at a time" thing to see where it
failed. But what the O_NODOTDOT | O_NOMOUNT | O_NOFOLLOW thing gave you is
that you get a fast-path for the common case (ie you don't _always_ have
to do the "walk one component at a time" crud - only if you hit a case you
might be worried about).
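
For illustration only, here is a user-space approximation of the
"prepend my beginning pathname" idea in Python. O_NODOTDOT and O_NOMOUNT
are proposals in this thread, not existing flags, so the ".." check is
done by hand in a hypothetical join_beneath() helper; symlinks and mount
points are NOT handled by this sketch.

```python
import os


def join_beneath(base, untrusted):
    """Join an untrusted relative path (bytes) under base (bytes).

    Rejects any component that is exactly the byte sequence '..'.
    This emulates only the proposed O_NODOTDOT behaviour; it does not
    protect against symlinks or mount points inside the tree."""
    parts = [c for c in untrusted.split(b"/") if c]
    if b".." in parts:
        raise ValueError("'..' component rejected")
    return os.path.join(base, *parts)
```

A caller would then open the result directly, and fall back to a careful
component-by-component walk only when the check or the open fails.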

(Now, O_NOMOUNT isn't actually useful if you use an absolute path like the
above example - it kind of assumes that you start from pwd which would be
your "safe point", and that you expect all "safe" files to be under that
one filesystem. With an absolute path, you'll clearly often end up having 
to cross mount-points unless your whole thing is on the root filesystem, 
which kind of makes O_NOMOUNT useless in the first place).

Btw, right now O_NOMOUNT isn't a big issue, since only root can mount 
things anyway. But if we start allowing user mounts (likely with 
restrictions like you can only mount if the mount-point is owned by you 
and writable), O_NOMOUNT may actually become a good idea.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:40                                       ` Jamie Lokier
@ 2004-02-17 20:50                                         ` John Bradford
  2004-02-17 21:04                                           ` Linus Torvalds
  0 siblings, 1 reply; 120+ messages in thread
From: John Bradford @ 2004-02-17 20:50 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: viro, Linus Torvalds, Marc, Marc Lehmann, Linux kernel

Quote from Jamie Lokier <jamie@shareable.org>:
> John Bradford wrote:
> > However, I don't see why it is any more logical to make the suggestion
> > that filenames generally be treated as UTF-8, IFF they are text at
> > all, than it is to suggest that filename should be arbitrary strings
> > of 32-bit words.
> 
> Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
> I can't think of a single advantage, just lots of disadvantages.

The advantage is that you can use them to store UCS-4.

Now, for file _contents_ this would be a compatibility disaster, which
is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
unambiguously represent any string of Unicode characters.  Basically -
no more multiple representations of the same thing.  No more funny
corner cases where several different strings of bytes eventually
resolve to the same name being presented to the user.

Of course, there is still the issue where the same glyphs could be
displayed on the screen for two files, for example, one called
"vertical bar", and the other one called "pipe", which is confusing,
but that's a _completely_ different issue.

As far as I can see, storing all filenames as either 7-bit ASCII or
flat UCS-4 simply eliminates the whole lot of security issues you're
thinking of.  If you never use non 7-bit ASCII, legacy applications
continue to work exactly as before.

John.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:47                                       ` viro
@ 2004-02-17 20:53                                         ` John Bradford
  0 siblings, 0 replies; 120+ messages in thread
From: John Bradford @ 2004-02-17 20:53 UTC (permalink / raw)
  To: viro; +Cc: Jamie Lokier, Linus Torvalds, Marc, Marc Lehmann, Linux kernel

Quote from viro@parcelfarce.linux.theplanet.co.uk:
> On Tue, Feb 17, 2004 at 08:35:22PM +0000, John Bradford wrote:
> > Why not:
> 
> [snip a massive pile of idiocy]
>  
> > * Note that this is very different to my last suggestion, which was
> >   fundamentally broken in many ways.
> 
> So is that one.  To the level of Hohensee and RBJ.

Correct me if I'm wrong, but your suggestion seems to be along the
lines of:

"Sit and watch a load of buggy userspace applications get written,
content in the knowledge that it wasn't my fault, because the security
vulnerabilities aren't actually in the kernel".

John.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 19:00                                   ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
@ 2004-02-17 20:57                                     ` Jamie Lokier
  2004-02-17 21:06                                       ` Alex Belits
  2004-02-17 21:23                                       ` Matthew Kirkwood
  0 siblings, 2 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:57 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

Måns Rullgård wrote:
> > I'd like a way to type something like "touch zöe.txt" on an ordinary
> > latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)
> 
> Then hack either bash (or whatever shell you use) or touch to do just that.

Hacking touch is obviously useless - I'd need to hack all the other
2000 shell utilities to get any useful behaviour.

Hacking bash -- actually readline -- is a much better idea.  Then you
can enter names and they'll be created right.  The only flaw in this
is that "ls" won't be useful, so that'll need to be hacked as well. etc.

No, I think hacking the terminal I/O is the best bet here.  Then _all_
programs which currently work with UTF-8 terminals, which is rapidly
becoming most of them, will work the same with both kinds of terminal,
and the illusion of perfection will be complete and beautiful.
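
The core of such a terminal-side translation is small. The sketch below is only an illustration of the byte conversion, not of a real tty proxy; the function names are hypothetical, and it assumes the terminal speaks ISO 8859-1 (Latin-1) while programs and the filesystem use UTF-8.

```python
def latin1_to_utf8(raw):
    """Re-encode bytes typed on a Latin-1 terminal as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def utf8_to_latin1(raw):
    """Re-encode UTF-8 program output for a Latin-1 terminal.

    Characters with no Latin-1 equivalent are shown as '?'.
    """
    return raw.decode("utf-8").encode("latin-1", errors="replace")


# "zöe.txt" as typed on a Latin-1 keyboard: the ö arrives as byte 0xF6.
typed = b"z\xf6e.txt"
print(latin1_to_utf8(typed))  # b'z\xc3\xb6e.txt'
```

The output direction is lossy: any character outside Latin-1 cannot be shown on such a terminal, which is why the sketch substitutes '?'.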

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:35                                     ` John Bradford
  2004-02-17 20:40                                       ` Jamie Lokier
  2004-02-17 20:47                                       ` viro
@ 2004-02-17 20:59                                       ` Linus Torvalds
  2004-02-17 21:06                                         ` John Bradford
                                                           ` (2 more replies)
  2 siblings, 3 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 20:59 UTC (permalink / raw)
  To: John Bradford; +Cc: viro, Jamie Lokier, Marc, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, John Bradford wrote:
> 
> Why not:

I'll start with the first one. That already kills the rest.

> * State that filenames are strings of 32-bit words.  UCS-4 should be
>   the preferred format for storing text in them, but storing legacy
>   encodings in the low 8 bits is acceptable, (but a Bad Thing for new
>   installations).

UCS-4 is as braindamaged as UCS-2 was, and for all the same reasons.

It's bloated, non-expandable, and not backwards compatible.

In contrast, UTF-8 doesn't measurably expand any normal text that didn't 
need it, is backwards compatible in the major ways that matter, and can be 
extended arbitrarily.
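
The size and compatibility points can be checked directly. This is a small illustration using Python's built-in codecs, with "utf-32-le" standing in for a flat UCS-4 representation (an assumption made for the demo; the filename is made up).

```python
name = "readme.txt"

as_utf8 = name.encode("utf-8")
as_ucs4 = name.encode("utf-32-le")  # UCS-4 stand-in: little-endian, no BOM

# For pure ASCII text, UTF-8 is byte-for-byte identical to ASCII.
print(as_utf8 == name.encode("ascii"))         # True
# The 32-bit form is four times the size ...
print(len(as_utf8), len(as_ucs4))              # 10 40
# ... and contains NUL bytes, which C string handling treats as terminators.
print(b"\x00" in as_utf8, b"\x00" in as_ucs4)  # False True
```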

UCS-4 has _zero_ advantages over UTF-8. 

Please. Give it up. Anybody who thinks that _any_ other encoding format 
than UTF-8 is valid is just _wrong_. 

(Now, I'll give that a lot of people don't like Unicode, so I'll allow
that maybe you'd want to use the UTF-8 _encoding_scheme_ for some other
mapping, but I don't see that that is worth the pain any more. Unicode may
be a horrible enumeration, but in the end all font encodings are arbitrary
anyway, so the unicode haters might as well start giving up).

In short: even if you hate Unicode with a passion, and refuse to touch it
and think standards are worthless, you should still use the same
transformation that UTF-8 does to your idiotic character set of the day. 
Because the _transform_ makes sense regardless of character set encoding.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:50                                         ` John Bradford
@ 2004-02-17 21:04                                           ` Linus Torvalds
  2004-02-17 21:16                                             ` John Bradford
  2004-02-18  6:48                                             ` Marc Lehmann
  0 siblings, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 21:04 UTC (permalink / raw)
  To: John Bradford; +Cc: Jamie Lokier, viro, Marc, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, John Bradford wrote:

> > Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
> > I can't think of a single advantage, just lots of disadvantages.
> 
> The advantage is that you can use them to store UCS-4.

Wrong. UTF-8 can store UCS-4 characters just fine.

Admittedly you might need up to six octets for the worst case, but hey, 
since you only need one for the most common case (by _far_), who cares?

And with the same UTF-8 encoding, you could some day encode UCS-8 too if
the idiotic standards bodies some day decide that 4 billion characters 
isn't enough because of all the in-fighting. 
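
The "up to six octets" figure refers to the original UTF-8 scheme, which covered 31-bit values; the later RFC 3629 definition stops at four bytes and U+10FFFF, and that restricted form is what Python's built-in codec implements. The sketch below hand-rolls the older six-byte scheme purely for illustration; `utf8_encode_ucs4` is a hypothetical name.

```python
def utf8_encode_ucs4(cp):
    """Encode a 31-bit value using the original (up to 6-byte) UTF-8 scheme."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as itself
    # (number of bytes, exclusive upper limit, lead-byte prefix)
    table = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in table:
        if cp < limit:
            out = bytearray(nbytes)
            # Fill continuation bytes from the end, six bits at a time.
            for i in range(nbytes - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)
                cp >>= 6
            out[0] = lead | cp
            return bytes(out)


print(len(utf8_encode_ucs4(0x41)))        # 1
print(len(utf8_encode_ucs4(0x7FFFFFFF)))  # 6
```

For values that modern UTF-8 also allows, the result matches `chr(cp).encode("utf-8")`.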

> Now, for file _contents_ this would be a compatibility disaster, which
> is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
> unambiguously represent any string of Unicode characters.

Why do you think UTF-8 can't do this? Did you read some middle-aged text
written by monks in a monastery that said that UTF-8 encodes a 16-bit
character set?

> Basically - no more multiple representations of the same thing.  No more
> funny corner cases where several different strings of bytes eventually
> resolve to the same name being presented to the user.

Welcome to normalized UTF-8. And realize that the "non-normalized" broken 
stuff is what allows us backwards compatibility.

Of course, since you like UCS-4, you don't care about backwards 
compatibility. 
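
One reading of the "multiple representations" concern is overlong encodings, where a lenient decoder maps several byte strings to the same character. The sketch below illustrates that reading only; both function names are hypothetical, and for brevity they handle two-byte sequences alone.

```python
def naive_decode_2byte(seq):
    """Decode a two-byte UTF-8 sequence WITHOUT checking for overlong forms."""
    return chr(((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F))


def strict_decode_2byte(seq):
    """Decode a two-byte sequence, rejecting malformed and overlong input."""
    if (seq[0] & 0xE0) != 0xC0 or (seq[1] & 0xC0) != 0x80:
        raise ValueError("not a two-byte sequence")
    cp = ((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F)
    if cp < 0x80:
        # This value fits in one byte, so the two-byte form is overlong.
        raise ValueError("overlong encoding")
    return chr(cp)


# 0xC0 0xAF is an overlong spelling of "/" (0x2F).
print(naive_decode_2byte(b"\xc0\xaf"))  # /
```

A strict decoder, such as Python's own `bytes.decode("utf-8")`, refuses `b"\xc0\xaf"`, so only the one-byte form of "/" is ever accepted.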

			Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                     ` Jamie Lokier
@ 2004-02-17 21:06                                       ` Alex Belits
  2004-02-17 21:47                                         ` Jamie Lokier
  2004-02-18  7:23                                         ` Marc Lehmann
  2004-02-17 21:23                                       ` Matthew Kirkwood
  1 sibling, 2 replies; 120+ messages in thread
From: Alex Belits @ 2004-02-17 21:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

  UTF-8 terminals (and variable-encoding terminals) already exist,
gnome-terminal is one of them. They are, of course, bloated pigs, but I
would rather have the bloat and idiosyncrasy in the user interface where
it belongs.

-- 
Alex

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:59                                       ` Linus Torvalds
@ 2004-02-17 21:06                                         ` John Bradford
  2004-02-17 21:42                                         ` Alex Belits
  2004-02-18  3:11                                         ` H. Peter Anvin
  2 siblings, 0 replies; 120+ messages in thread
From: John Bradford @ 2004-02-17 21:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Jamie Lokier, Marc, Marc Lehmann, Linux kernel

Quote from Linus Torvalds <torvalds@osdl.org>:
> 
> 
> On Tue, 17 Feb 2004, John Bradford wrote:
> > 
> > Why not:
> 
> I'll start with the first one. That already kills the rest.
> 
> > * State that filenames are strings of 32-bit words.  UCS-4 should be
> >   the preferred format for storing text in them, but storing legacy
> >   encodings in the low 8 bits is acceptable, (but a Bad Thing for new
> >   installations).
> 
> UCS-4 is as braindamaged as UCS-2 was, and for all the same reasons.
> 
> It's bloated, non-expandable, and not backwards compatible.

Which I hardly see as real pain for filenames, especially as I covered
the backward compatibility bit anyway, and wanting to expand beyond
2^31 characters isn't really on my to-do list at the moment, which
just leaves filename bloat, which is laughably trivial in at least
99.9% of cases, and probably just a minor inconvenience the other
0.1%.

But, I don't think I care anymore, anyway, clearly we are going to end
up with UTF-8 filenames everywhere, and security vulnerabilities to go
with them, and as long as I'm aware of that fact, I should be OK.

John.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17  1:24                   ` Alex Belits
@ 2004-02-17 21:09                     ` Jamie Lokier
  2004-02-17 21:48                       ` Linus Torvalds
  2004-02-17 22:19                       ` Alex Belits
  0 siblings, 2 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:09 UTC (permalink / raw)
  To: Alex Belits; +Cc: Marc Lehmann, Linux kernel

Alex Belits wrote:
>   UTF-8 is dependent on Unicode, that is cumbersome, not user-expandable,

Ah, Alex, welcome back. :)

> This means, it's quite possible that this standard will be replaced
> by something better in the future

You mean like Unicode 4 will be replaced by Unicode 5 or something? :)

Seriously, if there was another standard encompassing all languages
and characters, why would they call it something different?

> and this is why poor design of Unicode is tolerated by users, and
> this is also why many people use non-Unicode-based charsets.

You've said this many times before, without explanation.

As far as I know, Unicode is a superset of all pre-existing computer
charsets used anywhere - but do feel free to correct me.

Unicode does have its problems - but what possible advantage does
_any_ known non-Unicode charset have over Unicode, apart from space saving?

You mention that Unicode doesn't well support language identification.
This is true - but the non-Unicode charsets (koi8-r etc.) don't
support that either!  Or do they?

>   And this is perfectly fine. Displaying and editing multilingual text is
> a user interface issue, that kernel should not be involved in.

Actually the kernel does have a line editor which needs to know a little.

>   I can point at the example of this "solution" that happened years ago
> when UCS-2 was all the rage, and it got hardcoded and enforced by NTFS
> and everything that handles it. Who is laughing about that decision now?

We are all laughing ;)

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:04                                           ` Linus Torvalds
@ 2004-02-17 21:16                                             ` John Bradford
  2004-02-17 21:21                                               ` Linus Torvalds
  2004-02-17 22:50                                               ` Robin Rosenberg
  2004-02-18  6:48                                             ` Marc Lehmann
  1 sibling, 2 replies; 120+ messages in thread
From: John Bradford @ 2004-02-17 21:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, viro, Marc, Marc Lehmann, Linux kernel

Quote from Linus Torvalds <torvalds@osdl.org>:
> 
> 
> On Tue, 17 Feb 2004, John Bradford wrote:
> 
> > > Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
> > > I can't think of a single advantage, just lots of disadvantages.
> > 
> > The advantage is that you can use them to store UCS-4.
> 
> Wrong. UTF-8 can store UCS-4 characters just fine.

Does just fine include unambiguously?  Sure, standards-conforming
UTF-8 is unambiguous, but you've already said time and again that that
doesn't happen in the real world.  I just don't agree on the UTF-8 can
store UCS-4 characters just fine thing _at all_.

> Admittedly you might need up to six octets for the worst case, but hey, 
> since you only need one for the most common case (by _far_), who cares?
> 
> And with the same UTF-8 encoding, you could some day encode UCS-8 too if
> the idiotic standards bodies some day decide that 4 billion characters 
> isn't enough because of all the in-fighting. 
> 
> > Now, for file _contents_ this would be a compatibility disaster, which
> > is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
> > unambiguously represent any string of Unicode characters.
> 
> Why do you think UTF-8 can't do this? Did you read some middle-aged text
> written by monks in a monastery that said that UTF-8 encodes a 16-bit
> character set?

At the end of the day, I just don't see how your suggestion of leaving
UTF-8 undecoded unless you're presenting it to the user is ever going
to be practical, which brings us back to my first point, that UTF-8
can't, in the real world, represent UCS-4 characters acceptably,
(I.E. unambiguously).

> > Basically - no more multiple representations of the same thing.  No more
> > funny corner cases where several different strings of bytes eventually
> > resolve to the same name being presented to the user.
> 
> Welcome to normalized UTF-8. And realize that the "non-normalized" broken 
> stuff is what allows us backwards compatibility.
> 
> Of course, since you like UCS-4, you don't care about backwards 
> compatibility. 

I don't particularly like UCS-4, I do care about backwards
compatibility, and addressed it right from the beginning.

..and I totally don't get the bit about "non-normalised" UTF-8 being
what allows backwards compatibility.  Compatibility with what!?
Existing broken implementations?  Real, standards compliant UTF-8 is
fully backwards compatible with 7-bit ASCII, which is really just
about all any standard which wants to get accepted as a universal
standard can hope to be compatible with.

John.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:49                                       ` Linus Torvalds
@ 2004-02-17 21:17                                         ` Jamie Lokier
  0 siblings, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Marc, Marc Lehmann, Linux kernel

Linus Torvalds wrote:
> >     1. Eliminate "." and ".." components, leaving only leading ".."s.
> 
> Who does this anyway? It's wrong. It gives the wrong answer if there was a 
> symlink somewhere.

It's wrong for GCC, but correct for HTML/XLink relative path
resolution.

> > O_NODOTDOT won't protect against that.
> 
> Ok, so explain why? O_NODOTDOT will certainly guarantee that it stays
> inside "/var/public/files", since it has no way to escape (modulo
> symlinks/mounts, of course).

Oh, I meant to say "will" but my mailer must have used the wrong
character encoding. :)

> O_NOMOUNT may actually become a good idea.

Right now, you can avoid crossing filesystems by calling lstat()
_except_ that it doesn't detect a bind mount.  Among other things,
this makes it difficult to know for sure when there is only one path
to a file, because n_link==1 doesn't mean that any more.  O_NOMOUNT
might be useful as a way to detect when you're crossing a bind mount.
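
The lstat()-based check amounts to comparing device numbers. A minimal sketch, with `crosses_filesystem` as a hypothetical helper name; as noted above, a bind mount of the same filesystem reports the same device and therefore goes undetected.

```python
import os


def crosses_filesystem(parent, child):
    """Return True if child is on a different device than parent.

    Uses lstat(), so a symlink at child is not followed. A bind mount of
    the same filesystem has the same st_dev and is NOT detected.
    """
    return os.lstat(parent).st_dev != os.lstat(child).st_dev
```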

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:16                                             ` John Bradford
@ 2004-02-17 21:21                                               ` Linus Torvalds
  2004-02-18  0:52                                                 ` John Bradford
  2004-02-17 22:50                                               ` Robin Rosenberg
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 21:21 UTC (permalink / raw)
  To: John Bradford; +Cc: Jamie Lokier, viro, Marc, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, John Bradford wrote:
> > 
> > Wrong. UTF-8 can store UCS-4 characters just fine.
> 
> Does just fine include unambiguously?

If you don't care about backwards compatibility, then yes. You just have 
to use "strict" UTF-8.

>				  Sure, standards-conforming
> UTF-8 is unambiguous, but you've already said time and again that that
> doesn't happen in the real world.  I just don't agree on the UTF-8 can
> store UCS-4 characters just fine thing _at all_.

You get to choose between "throw the baby out with the bathwater" or "be 
compatible". 

Sane people choose compatibility. But it's your choice. You can always 
normalize things if you want to - but don't complain to me if it breaks 
things. It will still break _fewer_ things than UCS-4 would, so even if 
you always normalize you'd still be _better_ off with UTF-8 than you would 
be with UCS-4.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                     ` Jamie Lokier
  2004-02-17 21:06                                       ` Alex Belits
@ 2004-02-17 21:23                                       ` Matthew Kirkwood
  1 sibling, 0 replies; 120+ messages in thread
From: Matthew Kirkwood @ 2004-02-17 21:23 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

Yep.  A charset-translating tty proxy, a little like screen
or detachtty is what you want.  I wonder if there's an SSH
client or server which can do that.

Matthew.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:59                                       ` Linus Torvalds
  2004-02-17 21:06                                         ` John Bradford
@ 2004-02-17 21:42                                         ` Alex Belits
  2004-02-18  6:56                                           ` Marc Lehmann
  2004-02-18  3:11                                         ` H. Peter Anvin
  2 siblings, 1 reply; 120+ messages in thread
From: Alex Belits @ 2004-02-17 21:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Bradford, viro, Jamie Lokier, Marc, Marc Lehmann,
	Linux kernel

On Tue, 17 Feb 2004, Linus Torvalds wrote:

> In short: even if you hate Unicode with a passion, and refuse to touch it
> and think standards are worthless, you should still use the same
> transformation that UTF-8 does to your idiotic character set of the day.
> Because the _transform_ makes sense regardless of character set encoding.

  Pretty much every charset other than Unicode does not NEED encoding
because it was already designed to work with existing system. The decision
to make the basic representation of charset full of zero bytes was the
reason that created the need for UTF-8. People who use other charsets may
not have planned for multilingual environments like they should've done,
but they aren't stupid enough to require someone to "bless" them with a
variable-length encoding.

-- 
Alex

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                       ` Alex Belits
@ 2004-02-17 21:47                                         ` Jamie Lokier
  2004-02-22 15:32                                           ` Eric W. Biederman
  2004-02-18  7:23                                         ` Marc Lehmann
  1 sibling, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:47 UTC (permalink / raw)
  To: Alex Belits; +Cc: Måns Rullgård, linux-kernel

Alex Belits wrote:
> > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > programs which currently work with UTF-8 terminals, which is rapidly
> > becoming most of them, will work the same with both kinds of terminal,
> > and the illusion of perfection will be complete and beautiful.
> 
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I
> would rather have the bloat and idiosyncrasy in the user interface where
> it belongs.

Yes, I am using it right now.  The fancy characters work well in it.
Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
naturally like to access my files in the same way.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:09                     ` Jamie Lokier
@ 2004-02-17 21:48                       ` Linus Torvalds
  2004-02-17 22:19                       ` Alex Belits
  1 sibling, 0 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-17 21:48 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Marc Lehmann, Linux kernel



On Tue, 17 Feb 2004, Jamie Lokier wrote:
> 
> >   I can point at the example of this "solution" that happened years ago
> > when UCS-2 was all the rage, and it got hardcoded and enforced by NTFS
> > and everything that handles it. Who is laughing about that decision now?
> 
> We are all laughing ;)

Crying. Sadly, when MS makes a whopper of a mistake (and they do it all 
too often), we're left having to work with the resulting breakage.

I suspect most samba developers are already technically insane (*).

		Linus

(*) Of course, since many of them are Australians, you can't tell.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:09                     ` Jamie Lokier
  2004-02-17 21:48                       ` Linus Torvalds
@ 2004-02-17 22:19                       ` Alex Belits
  1 sibling, 0 replies; 120+ messages in thread
From: Alex Belits @ 2004-02-17 22:19 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Marc Lehmann, Linux kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> > This means, it's quite possible that this standard will be replaced
> > by something better in the future
>
> You mean like Unicode 4 will be replaced by Unicode 5 or something? :)
>
> Seriously, if there was another standard encompassing all languages
> and characters, why would they call it something different?

  Because the only way to do it right is to make it automatically
expandable. Unicode by its very design is monolithic with a single address
space that has to be controlled by one standard body, with no generalized
procedure for adding anything other than going through their "standard
process".

> > and this is why poor design of Unicode is tolerated by users, and
> > this is also why many people use non-Unicode-based charsets.
>
> You've said this many times before, without explanation.

  How am I supposed to explain it if most of the people on the list live many
thousands of kilometers from the nearest place where non-iso8859-1 charsets
are used? I have seen what people use in Russia, and it's not Unicode. Am
I supposed to bring all of you there, so you can confirm that, or should I
just point at the percentage of web pages, messages in mailing list
archives and other readily available evidence? Or do you think that this
is all irrelevant because Martin Duerst says so?

> As far as I know, Unicode is a superset of all pre-existing computer
> charsets used anywhere - but do feel free to correct me.

  Unicode is not exactly a superset of all of them, but the issue is
deeper than that. There are two kinds of standards. Ones define rules of
what can be compliant, and allow different levels of implementation that
still can interoperate and be reasonably small in the minimal
implementation. Others are "this is it -- you have to implement all of
that, or you aren't compliant". Charsets are by definition the latter. But
every charset is something reasonably small, can be even _remembered_
by people who use it, and some additional things like matching input
methods, keyboard layouts, even some language-dependent processing is
within what can be written in a small piece of paper. Not so with Unicode.
No single person can even draw a Unicode font, let alone remember what
is there. A terminal that implements Unicode has to support everything.
There is no editing/input procedure that can be applied based on the
knowledge that some text is in Unicode. It's a display-only thing, and
ridiculously bloated yet not even expandable. An example of a design made
not just by committee but by a committee of people who would never use
most of it.

> Unicode does have its problems - but what possible advantage does
> _any_ known non-Unicode charset have over Unicode, apart from space saving?

  Not known, one that will be developed in the future. At this point the
demand for multilingual processing is so minuscule compared to documents
that use a single charset, the decisions that are made for this small area
of application may be completely arbitrary, and no one would notice. Same
as the situation before non-ASCII symbols appeared -- an overwhelming
majority of computer-processed text was in English, so ASCII and its
trivial extensions were tolerated despite their inadequacies. When the real
demand appears, Unicode will quickly show its inadequacies, and the
current "Unicode crusade" is nothing but an attempt to freeze the standard
before this happens. Then "unicoders" will claim that "it sucks, but
it's everywhere, so live with it. It worked for Windows, right?"

> You mention that Unicode doesn't well support language identification.
> This is true - but the non-Unicode charsets (koi8-r etc.) don't
> support that either!  Or do they?

  Other charsets don't have language identification, this is true. However
if language identification IS done, the need for unified charset
immediately disappears -- what can identify a language, can identify a
charset, it's just metadata that can be passed in-band or out of band, in
a general case. If this is done (and it has to be done for anything
meaningful), the use of Unicode is the answer to the question that never
was asked.

  Unicode is great for representation of multilingual fonts, or as an
intermediate format in conversions, but declaring it the actual format for
all data to be processed is far beyond its area of applicability. It's a
creeping functionality (closely related to creeping featurism) where
every problem looks like a nail for a standard body that standardizes on
hammers.

> >   And this is perfectly fine. Displaying and editing multilingual text is
> > a user interface issue, that kernel should not be involved in.
>
> Actually the kernel does have a line editor which needs to know a little.

  Kernel does not edit multilingual text unless a terminal discipline
supports it. Some limited support is possible on the text console, however
even if Unicode allows printing all characters, it's not multilingual
editing yet. Complex character-handling routines, bidirectional support
and input methods that know multiple languages still belong in userspace,
so it's at most a good reason to make a nice userspace-based support of
multilingual processing.

> >   I can point at the example of this "solution" that happened years ago
> > when UCS-2 was all the rage, and it got hardcoded and enforced by NTFS
> > and everything that handles it. Who is laughing about that decision now?
>
> We are all laughing ;)

  And this is why I think, it is very arrogant to claim now that UTF-8 is
not going the way of UCS-2 in the near future. Again, I am not against
letting people who want to use UCS-2, UTF-8 or UCS-4 Unicode with Klingon
in a private area, to use what they want. I am against hardcoding things
just for the purpose of making it difficult to use anything else. XML, for
example, is a great example of a standard where the model with
single charset/document + multiple languages/document artificially created
an unsolvable problem for non-Unicode charset users who wanted to remain
within the standard, and I see this decision as not technically justified
but arbitrary and ideologically based.

-- 
Alex

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:16                                             ` John Bradford
  2004-02-17 21:21                                               ` Linus Torvalds
@ 2004-02-17 22:50                                               ` Robin Rosenberg
  1 sibling, 0 replies; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-17 22:50 UTC (permalink / raw)
  To: John Bradford
  Cc: Linus Torvalds, Jamie Lokier, viro, Marc, Marc Lehmann,
	Linux kernel

On Tuesday 17 February 2004 22.16, John Bradford wrote:
> Quote from Linus Torvalds <torvalds@osdl.org>:
> > On Tue, 17 Feb 2004, John Bradford wrote:
> > > > Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
> > > > I can't think of a single advantage, just lots of disadvantages.
> > > The advantage is that you can use them to store UCS-4.
> > Wrong. UTF-8 can store UCS-4 characters just fine.
<nitpick>Yes and no. There are no UTF-8 or UCS-4 characters. These are encodings
for Unicode characters. </nitpick>

> At the end of the day, I just don't see how your suggestion of leaving
> UTF-8 undecoded unless you're presenting it to the user is ever going
> to be practical, which brings us back to my first point, that UTF-8
> can't, in the real world, represent UCS-4 characters acceptably,
> (I.E. unambiguously).
The standard says a decoder should not accept invalid UTF-8 sequences. Not
decoding them and just passing the garbage on is one way of not "accepting"
them; i.e. "I'm not decoding this trash". With UTF-8 seen as a byte stream
this is trivial. Those that recode to a 16-bit encoding, like Qt, recode them
to invalid UTF-16 so they can be encoded back to the original invalid UTF-8.
That's at least what the kopete people told me (I haven't read the code yet).
With UCS-4 or UCS-2 the decoder must reject the data or make a very good
decision. Normalizing is a very bad one, since, as we know, an invalid UTF-8
sequence simply does not represent a Unicode character.

> > > Basically - no more multiple representations of the same thing.  No more
> > > funny corner cases where several different strings of bytes eventually
> > > resolve to the same name being presented to the user.
> > 
> > Welcome to normalized UTF-8. And realize that the "non-normalized" broken 
> > stuff is what allows us backwards compatibility.
And then (above) there is no normalized UTF-8. There are strings of valid characters
and invalid characters. Any app that tries to make sense of any garbage is a security
risk. This applies not only to UTF-8, but also to functions like atof that decode
anything to a double (at best NaN).

-- robin


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:21                                               ` Linus Torvalds
@ 2004-02-18  0:52                                                 ` John Bradford
  0 siblings, 0 replies; 120+ messages in thread
From: John Bradford @ 2004-02-18  0:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, viro, Marc, Marc Lehmann, Linux kernel

> Sane people choose compatibility. But it's your choice. You can always 
> normalize thing if you want to - but don't complain to me if it breaks 
> things. It will still break _fewer_ things than UCS-4 would, so even if 
> you always normalize you'd still be _better_ off with UTF-8 than you would 
> be with UCS-4.

Well, if all the UTF-8 diddling is eventually done by glibc, or some
other library, it might just be made to work.

The keep-it-in-UTF-8-all-the-time thing will still break down when a
user inputs a filename by copying the display of a badly encoded
filename using GPM, or in X, but that isn't a kernel issue.

I still don't really get what enforcing strictly standards compliant
UTF-8 has to do with backwards compatibility, though.

_But_ at least I'm about 5% more confident that filenames won't
suddenly blow up in my face, so I can sleep soundly tonight :-).

John.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:54                                 ` UTF-8 practically vs. theoretically in the VFS API Stefan Smietanowski
@ 2004-02-18  1:27                                   ` Hans Reiser
  2004-02-18  2:08                                     ` Robin Rosenberg
  0 siblings, 1 reply; 120+ messages in thread
From: Hans Reiser @ 2004-02-18  1:27 UTC (permalink / raw)
  To: Stefan Smietanowski
  Cc: Linus Torvalds, Marc Lehmann, Jamie Lokier, viro, Linux kernel

ReiserFS 6 plans to allow files to be associated with arbitrary files 
and found by those associations.  Some of those files will consist of 
ascii keywords, some will be icon images, etc.....  Human readability 
should not be considered fundamental to a name component, especially 
since programs with no interest in readability may be the only direct 
users of the name.

Hans

>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  1:27                                   ` Hans Reiser
@ 2004-02-18  2:08                                     ` Robin Rosenberg
  2004-02-18 11:06                                       ` Jamie Lokier
  0 siblings, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-18  2:08 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Stefan Smietanowski, Linus Torvalds, Marc Lehmann, Jamie Lokier,
	viro, Linux kernel

On Wednesday 18 February 2004 02.27, Hans Reiser wrote:
> ReiserFS 6 plans to allow files to be associated with arbitrary files 
> and found by those associations.  Some of those files will consist of 
> ascii keywords, some will be icon images, etc.....  Human readability 
> should not be considered fundamental to a name component, especially 
> since programs with no interest in readability may be the only direct 
> users of the name.

If the user never sees a name, it doesn't matter. However, the user actually sees
and reads the filenames in /home, on portable media, on network devices and in lots
of other places. When a user has named a component, those characters are the ones
that are important to the user, because they form an "image" (since you introduced
the term) or "sound" that the user remembers and associates with the content. A
character is the simplest form of image so it should always look the same.

-- robin

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Unicode normalization (userspace issue, but what the heck)
  2004-02-15  3:33             ` Matthias Urlichs
  2004-02-15  4:04               ` viro
@ 2004-02-18  2:48               ` H. Peter Anvin
  2004-02-20  9:48                 ` Matthias Urlichs
  1 sibling, 1 reply; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  2:48 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <pan.2004.02.15.03.33.48.209951@smurf.noris.de>
By author:    Matthias Urlichs <smurf@smurf.noris.de>
In newsgroup: linux.dev.kernel
> 
> Not locale, but normalization problems and identical-glyph problems.
> 
> Which is actually worse, because you don't have filenames which look
> like crap -- instead you have filenames which look perfectly sane, but
> they still do not work. Example: is an á one character, or is it an a
> followed by a composing ´?
> 
> Mac OSX, just as an example, only uses decomposed filenames. I don't know
> the current situation, but 10.2 has major problems when you try to access
> files with composite characters in their name (across NFS for instance).
> 
> I wonder if Linux, i.e. Linus ;-) should decree one single standard
> normalization. (I am NOT saying that enforcing this would be the kernel's
> job!)
> 

I believe that for most applications, normalization form C should be
used.

However, I suspect there are some applications for which this would
not apply.
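A minimal sketch of the composed/decomposed problem quoted above, in Python. It assumes the standard `unicodedata` module; the helper name `nfc_name` is illustrative and does not come from this thread.

```python
import unicodedata

composed = "\u00e1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT


def nfc_name(name):
    """Return the name in normalization form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


# The two spellings render alike but are different byte strings...
print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
# ...and compare equal only once both are brought to the same form.
print(nfc_name(composed) == nfc_name(decomposed))
```

Whether a tool should normalize before comparing or creating names is exactly the policy question raised here; the sketch only shows what form C does to the two spellings.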

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                       ` Marc Lehmann
  2004-02-16 20:26                         ` Linus Torvalds
@ 2004-02-18  2:49                         ` Rob Landley
  1 sibling, 0 replies; 120+ messages in thread
From: Rob Landley @ 2004-02-18  2:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linux kernel

On Monday 16 February 2004 14:20, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it
> > up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.

Would you please learn the difference between "you are wrong" and "I 
disagree"?

Rob



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                       ` bert hubert
  2004-02-16 20:33                         ` Marc Lehmann
@ 2004-02-18  2:58                         ` H. Peter Anvin
  2004-02-18  3:13                           ` Linus Torvalds
  2004-02-18  7:25                           ` bert hubert
  1 sibling, 2 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  2:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040216202142.GA5834@outpost.ds9a.nl>
By author:    bert hubert <ahu@ds9a.nl>
In newsgroup: linux.dev.kernel
> 
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
> 

Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
for "filesystem safe Unicode transformation format."  
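The "filesystem safe" property can be checked directly. The following is a rough Python sketch; the function name `encoded_bytes_are_safe` and the sample characters are assumptions made for illustration.

```python
def encoded_bytes_are_safe(ch):
    """True if every byte of a multi-byte UTF-8 sequence has the high bit set."""
    encoded = ch.encode("utf-8")
    if len(encoded) == 1:
        return True  # plain ASCII: the byte is the character itself
    return all(b >= 0x80 for b in encoded)


# Neither 0x2F ('/') nor 0x00 can show up inside a multi-byte sequence.
for ch in ["é", "€", "日", "\U0001F600"]:
    print(ch.encode("utf-8").hex(), encoded_bytes_are_safe(ch))
```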

> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE 
> F0 80 80 AE F0 80 80 AE 
> F8 80 80 80 AE F8 80 80 80 AE 
> FC 80 80 80 80 AE FC 80 80 80 80 AE

No, they don't.

The first represents "..", the remaining ones are illegal (overlong)
encodings and do not decode to anything.

Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard.  The
only acceptable decoding is to throw an error, or use an out-of-band
encoding mechanism to denote "bad bytecode."
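As a small illustration of "throw an error", Python's built-in UTF-8 decoder is strict by default and refuses the overlong forms listed above. The wrapper name `strict_decode` is an assumption made for this sketch.

```python
def strict_decode(raw):
    """Decode UTF-8, returning None for malformed input instead of guessing."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


print(strict_decode(bytes.fromhex("2e2e")))          # the real ".."
print(strict_decode(bytes.fromhex("c0aec0ae")))      # overlong, rejected
print(strict_decode(bytes.fromhex("e080aee080ae")))  # overlong, rejected
```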

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it doesn't.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

UTF-8 requires the use of the shortest possible encoding.  An
application which doesn't obey that and tries to be "smart" is a
security hazard.

It is a bit unfortunate that the encoding doesn't exclude these by
design as opposed to by error checking; it makes it a little too easy for
clueless programmers to skip :(
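To make the "error checking" concrete, here is a hand-written single-character decoder sketched in Python. It is an illustration under stated assumptions, not code from any real library: the names `decode_one` and `MIN_FOR_LENGTH` are made up, and it accepts at most four-byte sequences (the U+10FFFF limit of RFC 3629) rather than the older five- and six-byte forms.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(buf, i=0):
    """Decode one code point at buf[i]; return (code_point, length)."""
    lead = buf[i]
    if lead < 0x80:
        return lead, 1
    if 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    tail = buf[i + 1:i + length]
    if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated or malformed sequence")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    # The check that is easy to skip: reject non-shortest forms.
    if cp < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong encoding")
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    return cp, length
```

Dropping the `MIN_FOR_LENGTH` comparison is all it takes to turn this into a decoder that maps `C0 AE` to ".", which is the hazard described above.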

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:36                             ` Jamie Lokier
  2004-02-17 17:52                               ` viro
@ 2004-02-18  3:07                               ` H. Peter Anvin
  1 sibling, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:07 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040217163613.GA23499@mail.shareable.org>
By author:    Jamie Lokier <jamie@shareable.org>
In newsgroup: linux.dev.kernel
>
> Linus Torvalds wrote:
> > Which flies in the face of "Be strict in what you generate, be liberal in 
> > what you accept". A lot of the functions are _not_ willing to be liberal 
> > in what they accept. Which sometimes just makes the problem worse, for no 
> > good reason.
> 
> Unicode specifies that a program claiming to read UTF-8 _must_ reject
> malformed UTF-8.
> 
> Ok, we can just ignore Unicode. :)
> 
> But the reason they cite is security: when applications allow
> malformed UTF-8 through, there's plenty of scope for security holes
> due to multiple encodings of "/" and "." and "\0".
> 
> This is a real problem: plenty of those Windows worms that attack web
> servers get in by using multiple-escaped funny characters and
> malformed UTF-8 to get past security checks for ".." and such.
> 

Actually, the kernel is 100% compliant in that respect.

The only byte sequences the kernel interprets are:

00
2E
2E 2E
2F

.. and it correctly rejects (in the sense that it doesn't alias) any
other possible byte stream that could be interpreted as the same
sequences by a naïvely incorrect UTF-8 decoder.
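A user-space model of that behaviour, sketched in Python. This is not kernel code; the helper names `path_components` and `is_dotdot` are assumptions, and the point is only that comparison happens on raw bytes with no decoding step.

```python
def path_components(path):
    """Split a path on the byte 0x2F only; no decoding of any kind."""
    return [part for part in path.split(b"/") if part]


def is_dotdot(component):
    """Only the exact two bytes 2E 2E mean 'parent directory'."""
    return component == b".."


real = b"a/../b"
fake = b"a/" + bytes.fromhex("c0aec0ae") + b"/b"  # overlong look-alike

print([is_dotdot(c) for c in path_components(real)])
print([is_dotdot(c) for c in path_components(fake)])
```

Because nothing is decoded, the overlong form is just an odd four-byte name and never aliases the real "..".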

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 20:59                                       ` Linus Torvalds
  2004-02-17 21:06                                         ` John Bradford
  2004-02-17 21:42                                         ` Alex Belits
@ 2004-02-18  3:11                                         ` H. Peter Anvin
  2 siblings, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:11 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.58.0402171251130.2154@home.osdl.org>
By author:    Linus Torvalds <torvalds@osdl.org>
In newsgroup: linux.dev.kernel
> 
> UCS-4 is as braindamaged as UCS-2 was, and for all the same reasons.
> 
> It's bloated, non-expandable, and not backwards compatible.
> 

UCS-4 is actually a very nice format for *internal processing*.  For
data interchange, it sucks eggs.

UCS-2 is historic.  Its successor, UTF-16, is one of the worst
horrors ever inflicted on mankind by Microsoft.

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                         ` H. Peter Anvin
@ 2004-02-18  3:13                           ` Linus Torvalds
  2004-02-18  3:22                             ` H. Peter Anvin
  2004-02-18 11:33                             ` Jamie Lokier
  2004-02-18  7:25                           ` bert hubert
  1 sibling, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:13 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
> Those of us who have been involved with the issue have fought
> *extremely* hard against DWIM decoders which try to decode the latter
> sequences into ".." -- it's incorrect, and a security hazard.  The
> only acceptable decoding is to throw an error, or use an out-of-band
> encoding mechanism to denote "bad bytecode."

Somebody correctly pointed out that you do not need any out-of-band 
encoding mechanism - the very fact that it's an invalid sequence is in 
itself a perfectly fine flag. No out-of-band signalling required.

The only thing you should make sure of is to not try to normalize it (that 
would hide the error). Just keep carrying the bad sequence along, and 
everybody is happy. Including the filesystem functions that get the "bad" 
name and match it exactly to what it should be matched against.

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                           ` Linus Torvalds
@ 2004-02-18  3:22                             ` H. Peter Anvin
  2004-02-18  3:30                               ` Linus Torvalds
  2004-02-18 11:24                               ` Jamie Lokier
  2004-02-18 11:33                             ` Jamie Lokier
  1 sibling, 2 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
>>Those of us who have been involved with the issue have fought
>>*extremely* hard against DWIM decoders which try to decode the latter
>>sequences into ".." -- it's incorrect, and a security hazard.  The
>>only acceptable decoding is to throw an error, or use an out-of-band
>>encoding mechanism to denote "bad bytecode."
> 
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.
> 
> The only thing you should make sure of is to not try to normalize it (that 
> would hide the error). Just keep carrying the bad sequence along, and 
> everybody is happy. Including the filesystem functions that get the "bad" 
> name and match it exactly to what it should be matched against.
> 

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes.  Consider a UTF-8 decoder which uses
values in the 0x800000xx range to encode "bogus bytes"; that way it
wouldn't alias to anything else, but the bogus sequence "C0 AE" could be
represented as 0x800000C0 0x800000AE and displayed to the user as
\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80
C2 AE).  This would make it possible to figure out in, for example, an
ls listing, what those broken filenames are actually composed of.

There are some advantages to being able to represent all possible byte
sequences and present them to the user, even if they're bogus.
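The 0x800000xx idea can be sketched in Python as follows. The names `BOGUS`, `decode_with_bogus` and `display` are assumptions made for illustration. The sketch leans on Python's `surrogateescape` error handler, which is a similar mechanism (it maps an undecodable byte 0xNN to the lone surrogate U+DCNN) and is remapped here onto the 0x800000xx range described above.

```python
BOGUS = 0x80000000  # flag for "this value is a raw byte, not a character"


def decode_with_bogus(raw):
    """Decode UTF-8 into code points, keeping undecodable bytes as BOGUS | byte."""
    text = raw.decode("utf-8", errors="surrogateescape")
    values = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:       # an escaped bad byte
            values.append(BOGUS | (cp - 0xDC00))
        else:
            values.append(cp)
    return values


def display(values):
    """Show bogus bytes as \\xNN escapes and everything else as itself."""
    parts = []
    for v in values:
        if v & BOGUS:
            parts.append("\\x%02X" % (v & 0xFF))
        else:
            parts.append(chr(v))
    return "".join(parts)


print(display(decode_with_bogus(bytes.fromhex("c0ae"))))
print(display(decode_with_bogus(bytes.fromhex("c380c2ae"))))
```

The two inputs stay distinguishable after decoding, which is the property being argued for: the bad bytes are visible but never alias a legitimate character.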

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                             ` H. Peter Anvin
@ 2004-02-18  3:30                               ` Linus Torvalds
  2004-02-18  5:30                                 ` H. Peter Anvin
  2004-02-18 11:24                               ` Jamie Lokier
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:30 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes. 

I'd suggest just doing that when you convert the utf-8 format to printable 
format _anyway_.  At that point you just make the "printable" 
representation be the binary escape sequence (which you have to have for 
other non-printable utf-8 characters anyway).

And if you do things right (ie you allow user input in that same escaped 
output format), you can allow users to re-create the exact "broken utf-8". 
Which is actually important just so that the user can fix it up (ie 
imagine the user noticing that the filename is broken, and now needs to do 
a "mv broken-name fixed-name" - the user needs some way to re-create the 
brokenness).
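A rough Python sketch of such a round-trippable escaped format. Everything here is an assumption chosen for illustration (the names `escape_name` and `unescape_name`, and the choice of \xNN for bad bytes, \uXXXX or \UXXXXXXXX for non-printable characters, and a doubled backslash for a literal one); no existing shell or tool is being described.

```python
import re


def escape_name(raw):
    """Turn raw filename bytes into a printable, reversible string."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:            # undecodable byte
            out.append("\\x%02X" % (cp - 0xDC00))
        elif ch == "\\":                      # keep the escape char unambiguous
            out.append("\\\\")
        elif not ch.isprintable():            # control characters and the like
            out.append("\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp)
        else:
            out.append(ch)
    return "".join(out)


_ESC = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(\\))"
)


def unescape_name(shown):
    """Rebuild the exact original bytes from escape_name() output."""
    out = bytearray()
    pos = 0
    for m in _ESC.finditer(shown):
        out += shown[pos:m.start()].encode("utf-8")
        if m.group(1):
            out.append(int(m.group(1), 16))   # raw byte
        elif m.group(2) or m.group(3):
            out += chr(int(m.group(2) or m.group(3), 16)).encode("utf-8")
        else:
            out += b"\\"
        pos = m.end()
    out += shown[pos:].encode("utf-8")
    return bytes(out)


broken = b"caf\xc3\xa9-\xc0\xae\n"
print(escape_name(broken))
print(unescape_name(escape_name(broken)) == broken)
```

With a format like this, the "mv broken-name fixed-name" case works because the user can type back exactly what was displayed.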

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:30                               ` Linus Torvalds
@ 2004-02-18  5:30                                 ` H. Peter Anvin
  2004-02-18 10:29                                   ` Robin Rosenberg
  2004-02-18 15:35                                   ` Linus Torvalds
  0 siblings, 2 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18  5:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
>>Well, the reason you'd want an out-of-band mechanism is to be able to
>>display it as some kind of escapes. 
> 
> 
> I'd suggest just doing that when you convert the utf-8 format to printable 
> format _anyway_.  At that point you just make the "printable" 
> representation be the binary escape sequence (which you have to have for 
> other non-printable utf-8 characters anyway).
> 

What does "printable" mean in this context?  Typically you have to 
convert it to UCS-4 first, so you can index into your font tables, then 
you have to create the right composition, apply the bidirectional text 
algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it 
layered.  What I described was the first step of that -- mostly trying 
to show that "throwing an error" doesn't necessarily mean "produce no 
output."  What you shouldn't do, though, is alias it with legitimate input.

> And if you do things right (ie you allow user input in that same escaped 
> output format), you can allow users to re-create the exact "broken utf-8". 
> Which is actually important just so that the user can fix it up (ie 
> imagine the user noticing that the filename is broken, and now needs to do 
> a "mv broken-name fixed-name" - the user needs some way to re-create the 
> brokenness).

Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
think this is a good UI for shells to follow.  The \x representation 
then doesn't stand for characters but for bytes.  It may be desirable to 
disallow encoding of *valid* UTF-8 characters this way, though.

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:04                                           ` Linus Torvalds
  2004-02-17 21:16                                             ` John Bradford
@ 2004-02-18  6:48                                             ` Marc Lehmann
  1 sibling, 0 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-18  6:48 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 17, 2004 at 01:04:14PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> Admittedly you might need up to six octets for the worst case, but hey, 
> since you only need one for the most common case (by _far_), who cares?

Being a fan of UTF-8, I still have to remark that this is a rather
imperialistic view that only happens to work in many western countries.

It starts to fail in Greece, Russia and Asian countries, where text size
goes up by a factor of 1.5 .. 3.

This was _one_ of the major obstacles that utf-8 had to overcome in Asian
countries.

Personally, I think that's not a big problem (memory for text storage is
cheap etc.. :), but I am living in a iso-8859-1 world with only occasional
voyages elsewhere.
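The size factors can be checked with a few sample words. This Python sketch compares UTF-8 against one legacy encoding per script; the sample strings and the choice of legacy codecs are assumptions made for illustration, and `growth_factor` is a made-up helper name.

```python
# (sample text, legacy codec commonly used for that script)
SAMPLES = [
    ("Ελλάδα", "iso8859_7"),   # Greek
    ("Россия", "koi8_r"),      # Russian
    ("日本語", "euc_jp"),       # Japanese
    ("ไทย", "tis_620"),        # Thai
]


def growth_factor(text, legacy_codec):
    """Ratio of UTF-8 size to the size in the given legacy encoding."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_codec))


for text, codec in SAMPLES:
    print(codec, growth_factor(text, codec))
```

Scripts with one-byte legacy encodings pay the most, since their characters take two or three bytes each in UTF-8; real text mixed with ASCII digits, spaces and punctuation grows less than these pure samples.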

> And with the same UTF-8 encoding, you could some day encode UCS-8 too if
> the idiotic standards bodies some day decide that 4 billion characters 
> isn't enough because of all the in-fighting. 

Four billion glyphs will not be reached, of course, but it's not
impossible that some codeset space inflation will happen due to the
introduction of extra planes for strange purposes.

> Of course, since you like UCS-4, you don't care about backwards 
> compatibility. 

While UCS-2 is obviously useless, UCS-4 is useful in rare cases where you
either need fixed character sizes or the inflation to 5 or 6 byte values
becomes a problem (which should be never).

Using UCS-4 for filenames is just evil (of course :)

UTF-8 was invented for the purpose of mapping unicode to filenames, and
it certainly is the most sane encoding so far, since it doesn't share the
"artificial" limitations to 16, 21 or 32 bits that other unicode encodings
have.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 21:42                                         ` Alex Belits
@ 2004-02-18  6:56                                           ` Marc Lehmann
  2004-02-18 20:37                                             ` Alex Belits
  0 siblings, 1 reply; 120+ messages in thread
From: Marc Lehmann @ 2004-02-18  6:56 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 17, 2004 at 02:42:11PM -0700, Alex Belits <abelits@belits.com> wrote:
>   Pretty much every charset other than Unicode does not NEED encoding
> because it was already designed to work with existing system. The decision
> to make the basic representation of charset full of zero bytes was the
> reason that created the need for UTF-8.

As I told you privately, you continuously confuse charset, encoding and
codeset. As well as spreading misinformation e.g. with respect to language
tagging (which unicode supports, maybe not well, but certainly better than
koi8-r, iso-8859-1 etc.) :(

> not have planned for multilingual environments like they should've done,
> but they aren't stupid enough to require someone to "bless" them with a
> variable-length encoding.

The only other "encoding" that I know that supports (very limited)
language tagging and works in a multilingual environment is iso-2022.
(maybe emacs has something else, or is iso-2022 based, I don't know,
correct me please). iso-2022, is horrible to use and didn't catch on in
many places because of this.

So it seems that these are your only choices. If there is a problem with
unicode, it can be fixed (just as problems have been fixed in the past),
and the resulting standard will be called "Unicode" and will map to
UTF-8, just as _every_ codeset maps to UTF-8 that is <21 bit (in a strict
interpretation) or >64 bit (in a lax interpretation).

I don't understand why you are arguing against unicode so vehemently
without having any other option, and without the need for any other
option.

Please note that the examples you make (koi8-r etc.) fail miserably in a
multilingual environment. koi8-r even starts to fail in a place near you,
where people use koi8-u, often tagged as koi8-r, because most software has
no good means to tag their texts.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                       ` Alex Belits
  2004-02-17 21:47                                         ` Jamie Lokier
@ 2004-02-18  7:23                                         ` Marc Lehmann
  1 sibling, 0 replies; 120+ messages in thread
From: Marc Lehmann @ 2004-02-18  7:23 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 17, 2004 at 02:06:21PM -0700, Alex Belits <abelits@phobos.illtel.denver.co.us> wrote:
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I

rxvt-unicode (mixed fonts, bad complex script support) and mlterm (no mixed
fonts, very good complex script support) are not at all bloated, have a
_much_ smaller memory footprint than xterm and are even faster on text
output and scrolling complex scripts than xterm (by a factor of two).

(Of course, gnome-terminal is bloated. loading it requires 45MB of main
memory here and then it's still 5-10 times slower than xterm).

That UTF-8/Unicode in any way means bloated (I know you did not directly
imply this) is a widely circulating but wrong idea nowadays.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                         ` H. Peter Anvin
  2004-02-18  3:13                           ` Linus Torvalds
@ 2004-02-18  7:25                           ` bert hubert
  1 sibling, 0 replies; 120+ messages in thread
From: bert hubert @ 2004-02-18  7:25 UTC (permalink / raw)
  To: linux-kernel

On Wed, Feb 18, 2004 at 02:58:42AM +0000, H. Peter Anvin wrote:

> Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
> for "filesystem safe Unicode transformation format."  

That might explain a few things.

> > F8 80 80 80 AE F8 80 80 80 AE 
> > FC 80 80 80 80 AE FC 80 80 80 80 AE
> 
> No, they don't.

Serves me right for trusting a random site, apologies. 

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                                 ` H. Peter Anvin
@ 2004-02-18 10:29                                   ` Robin Rosenberg
  2004-02-18 11:49                                     ` Tomas Szepe
  2004-02-18 15:35                                   ` Linus Torvalds
  1 sibling, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-18 10:29 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

On Wednesday 18 February 2004 06.30, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> >>Well, the reason you'd want an out-of-band mechanism is to be able to
> >>display it as some kind of escapes. 
> > I'd suggest just doing that when you convert the utf-8 format to printable 
> > format _anyway_.  At that point you just make the "printable" 
> > representation be the binary escape sequence (which you have to have for 
> > other non-printable utf-8 characters anyway).
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described was the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

I think you can use libicu here. Converting to UCS-4 to determine the
character type doesn't mean you will ever have actual strings of UCS-4. It could
be done character by character just for the lookup, so you can have the
out-of-band error flags internally.

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

Agree. \u80808080 I would assume represents a valid character, while \x80\x80\x80\x80
does not. A problem with invalid sequences I just noted is that they break some of 
the nice properties of UTF-8, that people will assume apply, i.e. that you can parse it 
backwards. With UTF-8 (i.e. well-formed utf-8) you can point at a byte and figure "this is 
not the first byte", let's skip backwards to find the start. If invalid sequences can ever occur
you must read from the start of the string.
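The backward-scanning property can be sketched in Python like this. The function name `char_start` is an assumption, and the result is only meaningful when the buffer is well-formed UTF-8, which is precisely the caveat above.

```python
def char_start(buf, i):
    """Index of the first byte of the character that contains buf[i].

    Continuation bytes look like 10xxxxxx, so step back until one is not.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "aé€".encode("utf-8")   # 61 | C3 A9 | E2 82 AC
print([char_start(data, i) for i in range(len(data))])
```

On a buffer that may contain stray continuation bytes the loop still terminates, but the index it returns need not be the start of anything a strict decoder would accept.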

-- robin

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:08                                     ` Robin Rosenberg
@ 2004-02-18 11:06                                       ` Jamie Lokier
  0 siblings, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:06 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Hans Reiser, Stefan Smietanowski, Linus Torvalds, Marc Lehmann,
	viro, Linux kernel

Robin Rosenberg wrote:
> A character is the simplest form of image so it should always look the same.

People who need the computer to _speak_ names need language or
phonetic information attached to a name, for it to be spoken properly.

On this, Alex Belits has a good point.  It's all very well
standardising on UTF-8 so every name can be displayed nicely.  That is
incomplete for a user who needs "ls" to work audibly, though.

In practice, such a user configures their machine to assume a
particular language, or guess it with bias to the one they use most often.

That is, in some ways, the same problem as having a mixture of
filenames in an unknown character encoding, except that UTF-8 doesn't
solve it.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                             ` H. Peter Anvin
  2004-02-18  3:30                               ` Linus Torvalds
@ 2004-02-18 11:24                               ` Jamie Lokier
  1 sibling, 0 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:24 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

H. Peter Anvin wrote:
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes.

As soon as you go to "display", you need a mechanism to escape lots of
characters, not just malformed UTF-8.  Consider: \u0000, \u001B,
\u0007 and such need to be escaped too.

-- Jamie

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                           ` Linus Torvalds
  2004-02-18  3:22                             ` H. Peter Anvin
@ 2004-02-18 11:33                             ` Jamie Lokier
  2004-02-18 16:47                               ` H. Peter Anvin
  2004-02-18 19:59                               ` Linus Torvalds
  1 sibling, 2 replies; 120+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.

Technically this is almost(*) correct, however a _lot_ of code exists
which assumes logical properties of UTF-8.  (See, for example, the
"stty utf8" patch).

Perl, for example, allows you to pass around invalid sequences in
exactly the way you describe.  It works, right up until you do
something like length() or substr() or a regex match.  Then Perl
screws up the answer, because it sees something like 0xfd and just
assumes it can skip the next 5 bytes, without checking them.

hpa's suggestion that invalid bytes are treated as 0x800000xx works
very nicely, *iff* a program is absolutely consistent about its
treatment of bytes in that way.  When there's a mixture of code which
interprets malformed UTF-8 in different ways, then it's messy and
sometimes a security hazard.

-- Jamie

(*) - It's fine until you concatenate two malformed strings.  Then the
      out-of-band signal is lost if the combination is valid UTF-8.
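
A small Python illustration of the footnote, using only the built-in
strict UTF-8 decoder. The helper name is_valid_utf8 is hypothetical. Each
half is malformed on its own, yet the concatenation is a well-formed
encoding of U+0200.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


head = b"\xc8"   # lead byte whose continuation byte is missing
tail = b"\x80"   # continuation byte with no lead byte

print(is_valid_utf8(head))          # False
print(is_valid_utf8(tail))          # False
print(is_valid_utf8(head + tail))   # True: 0xC8 0x80 encodes U+0200
```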

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 10:29                                   ` Robin Rosenberg
@ 2004-02-18 11:49                                     ` Tomas Szepe
  2004-02-18 11:59                                       ` Robin Rosenberg
  0 siblings, 1 reply; 120+ messages in thread
From: Tomas Szepe @ 2004-02-18 11:49 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 11:29 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

[snip]
> not the first byte", let's skip backwards to find the start. If invalid sequences can ever occur
[snip]

Would you _please_ read the lkml FAQ and stop posting e-mails with lines
longer than 80 characters?  Thank you.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:49                                     ` Tomas Szepe
@ 2004-02-18 11:59                                       ` Robin Rosenberg
  2004-02-18 12:05                                         ` Tomas Szepe
  0 siblings, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-18 11:59 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> longer than 80 characters?  Thank you.

As soon as someone asks nicely... I thought any decent mail client simply
wrapped the lines. Hmm, remember some old system with 3270 access that
didn't.

I'll try to remember that.

-- robin

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:59                                       ` Robin Rosenberg
@ 2004-02-18 12:05                                         ` Tomas Szepe
  2004-02-18 12:34                                           ` Robin Rosenberg
  0 siblings, 1 reply; 120+ messages in thread
From: Tomas Szepe @ 2004-02-18 12:05 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 12:59 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

> On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > longer than 80 characters?  Thank you.
> 
> As soon as someone asks nicely...  I thought any decent mail client simply
> wrapped the lines.

1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.

2)  A mail client that will wrap the lines will make your posts look like this:

<cut>
Having to put up with the existence of Windows day in and out is the reason I'm
still on
an eight-bit encoding.  Sorry for not explaining the REAL problem, but only a
partial
problem. I need to support all kinds of clients on Windows with protocols that  
convey no
character set info. With samba that's no problem. Having to put up with a Unix
world running
<cut>

> I'll try to remember that.

Thanks again.

-- 
Tomas Szepe <szepe@pinerecords.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 12:05                                         ` Tomas Szepe
@ 2004-02-18 12:34                                           ` Robin Rosenberg
  0 siblings, 0 replies; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-18 12:34 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 13.05, Tomas Szepe wrote:
> On Feb-18 2004, Wed, 12:59 +0100
> Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:
> 
> > On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > > longer than 80 characters?  Thank you.
> > 
> > As soon as someone asks nicely...  I thought any decent mail client simply
> > wrapped the lines.
> 
> 1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.
>
> 2)  A mail client that will wrap the lines will make your posts look like 
this:
> 
> <cut>
> Having to put up with the existence of Windows day in and out is the reason 
I'm
> still on
> an eight-bit encoding.  Sorry for not explaining the REAL problem, but only 
a
> partial
> problem. I need to support all kinds of clients on Windows with protocols 
that  
> convey no
> character set info. With samba that's no problem. Having to put up with a 
Unix
> world running
> <cut>

That's what happens when the sender wraps the lines at column 80 and your 
client wraps at 72 (or a similar situation), just another reason not to wrap 
when sending and to let the user's client do whatever the user thinks is fine.

In order not to wrap and destroy information I have the autowrap feature off 
when composing mail, because wrapped and cut stack traces, cuts from log files 
etc. are a pain. 

BTW, the 80-character rule is only mentioned with respect to signatures.

-- robin

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:46                                 ` Jamie Lokier
  2004-02-17 19:00                                   ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
@ 2004-02-18 13:11                                   ` Matthew Garrett
  1 sibling, 0 replies; 120+ messages in thread
From: Matthew Garrett @ 2004-02-18 13:11 UTC (permalink / raw)
  To: Linux kernel

Jamie Lokier wrote:

>I'd like a way to type something like "touch zöe.txt" on an ordinary
>latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)

screen will already do this - check the encoding command. There's a
couple of more lightweight proxies that do much the same thing.

-- 
Matthew Garrett | mjg59-chiark.mail.linux-rutgers.kernel@srcf.ucam.org

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                                 ` H. Peter Anvin
  2004-02-18 10:29                                   ` Robin Rosenberg
@ 2004-02-18 15:35                                   ` Linus Torvalds
  2004-02-18 19:47                                     ` Tomas Szepe
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-18 15:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Kernel Mailing List

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

Not all characters _have_ font entries. And even when they have font 
entries, they may need escaping for other reasons (ie you may want to 
marshall UTF-8 as plain ASCII just because you want to use a portable 
format for transfer).

Think about the simple (hex) string x0A x00. That's a well-defined UTF-8
string, yet if you want to print it as a filename on the console, you
should obviously print it as "\n" or some similar escaped sequence 
(actually, that's a bad example, since it's a special case, and it would 
probably be better to use the example string x7F x00, which would be shown 
as \u007F or something).

The same is true for a _lot_ of perfectly fine UTF-8 sequences, no?

That implies that you have to use an escaped sequence _anyway_. So as you 
go along, turning the string into something printable, you might as well 
escape the invalid UTF-8 sequences.

In other words: you walk the utf-8 string one character at a time, 
converting it to whatever format (eg UCS-4) you have for font lookup, but 
you also escape characters that you don't have font entries for or that 
aren't in proper UTF-8 format.

When converting to UCS-2, you have to check for the proper format 
_anyway_, so none of this is in any way "extra work". Instead of just 
aborting on an invalid UTF-8 character, you quote it, exactly the same way 
you'd have to quote a _valid_ one that you can't just show as a string.
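
The walk described above can be sketched in Python as follows. This is an
illustration under assumptions, not anybody's actual implementation: the
generator name walk is made up, and it leans on the built-in strict UTF-8
decoder (via the start attribute of UnicodeDecodeError) to locate each
invalid byte rather than validating by hand.

```python
def walk(data: bytes):
    """Yield ("char", str) for each valid character and ("byte", int)
    for each byte that is not part of well-formed UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            good = data[pos:].decode("utf-8")
            bad_at = len(data)                 # no error in the remainder
        except UnicodeDecodeError as err:
            bad_at = pos + err.start           # first offending byte
            good = data[pos:bad_at].decode("utf-8")
        for ch in good:
            yield ("char", ch)
        if bad_at < len(data):
            yield ("byte", data[bad_at])       # quote it, do not abort
        pos = bad_at + 1


for kind, value in walk(b"a\xffb"):
    print(kind, repr(value))
```

Checking the format and producing output happen in the same pass, so
quoting an invalid byte is not extra work compared with aborting on it.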

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described was the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

Exactly. And since you need an escape sequence anyway, what's the problem?

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

You need to encode even valid UTF-8, since you may not find a font entry 
for the character, or the character just isn't appropriate in that context 
(ie you can't show a newline).

But it makes perfect sense to use a policy of:
 - escape valid UTF-8 characters as '\u7777'
 - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
   '\xC0\x80\x80', whatever)
 - (and, obviously, escape the valid UTF-8 character '\' as '\\').

Don't you agree? It clearly allows all the cases, and you can re-generate 
the _exact_ original stream of bytes from the above (ie it is nicely 
reversible, which in my opinion is a requirement).
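
A rough Python sketch of that policy, to show that it round-trips. The
function names escape_name and unescape_name are hypothetical, and two
choices here are assumptions rather than part of the proposal: "needs
escaping" is approximated with str.isprintable(), and invalid bytes are
located with Python's "surrogateescape" error handler, which maps each
undecodable byte 0xNN to the code point U+DCNN.

```python
def escape_name(raw: bytes) -> str:
    """Escape a byte string so that the result is printable and reversible."""
    out = []
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:        # an invalid byte, kept as a byte
            out.append("\\x%02X" % (cp - 0xDC00))
        elif ch == "\\":                  # the escape character itself
            out.append("\\\\")
        elif ch.isprintable():            # valid and displayable: keep as-is
            out.append(ch)
        elif cp <= 0xFFFF:                # valid but not displayable
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def unescape_name(text: str) -> bytes:
    """Inverse of escape_name; only accepts what escape_name emits."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out += ch.encode("utf-8")
            i += 1
            continue
        kind = text[i + 1]
        if kind == "\\":
            out.append(0x5C)
            i += 2
        elif kind == "x":                 # raw byte, not a character
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        elif kind == "u":                 # character, 4 hex digits
            out += chr(int(text[i + 2:i + 6], 16)).encode("utf-8")
            i += 6
        elif kind == "U":                 # character, 8 hex digits
            out += chr(int(text[i + 2:i + 10], 16)).encode("utf-8")
            i += 10
        else:
            raise ValueError("unknown escape at offset %d" % i)
    return bytes(out)


name = b"a\nb\xc0\x80"
shown = escape_name(name)
print(shown)                              # a\u000Ab\xC0\x80
print(unescape_name(shown) == name)       # True
```

Because \x is emitted only for bytes that are not part of valid UTF-8,
the two forms never collide: the character U+0080 comes out as \u0080,
while a stray byte 0x80 comes out as \x80.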

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                             ` Jamie Lokier
@ 2004-02-18 16:47                               ` H. Peter Anvin
  2004-02-18 19:59                               ` Linus Torvalds
  1 sibling, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18 16:47 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel

Jamie Lokier wrote:
> 
> hpa's suggestion that invalid bytes are treated as 0x800000xx works
> very nicely, *iff* a program is absolutely consistent about its
> treatment of bytes in that way.  When there's a mixture of code which
> interprets malformed UTF-8 in different ways, then it's messy and
> sometimes a security hazard.
> 

Absolutely.  It has to be considered very carefully.

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 15:35                                   ` Linus Torvalds
@ 2004-02-18 19:47                                     ` Tomas Szepe
  2004-02-18 20:01                                       ` H. Peter Anvin
  0 siblings, 1 reply; 120+ messages in thread
From: Tomas Szepe @ 2004-02-18 19:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Kernel Mailing List

On Feb-18 2004, Wed, 07:35 -0800
Linus Torvalds <torvalds@osdl.org> wrote:

> But it makes perfect sense to use a policy of:
>  - escape valid UTF-8 characters as '\u7777'
>  - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>    '\xC0\x80\x80', whatever)
>  - (and, obviously, escape the valid UTF-8 character '\' as '\\').
> 
> Don't you agree? It clearly allows all the cases, and you can re-generate 
> the _exact_ original stream of bytes from the above (ie it is nicely 
> reversible, which in my opinion is a requirement).

I really really hope this is _exactly_ what we're going to see in practice.

-- 
Tomas Szepe <szepe@pinerecords.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                             ` Jamie Lokier
  2004-02-18 16:47                               ` H. Peter Anvin
@ 2004-02-18 19:59                               ` Linus Torvalds
  2004-02-18 20:08                                 ` H. Peter Anvin
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2004-02-18 19:59 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: H. Peter Anvin, linux-kernel

On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band 
> > encoding mechanism - the very fact that it's an invalid sequence is in 
> > itself a perfectly fine flag. No out-of-band signalling required.
> 
> Technically this is almost(*) correct,
> 
> (*) - It's fine until you concatenate two malformed strings.  Then the
>       out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says 
"this stuff is wrong, because it was wrong at some point in the past", and 
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
in a variety of circumstances, all basically having to do with lower 
levels not knowing about higher-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes 
as a stream of bytes, with blocking rules that have _nothing_ to do with 
the content (timing, mtu's, extended TCP headers etc etc). That doesn't 
mean that you shouldn't be able to
 - work on the partial results and show them to the user as UTF-8
 - be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea. If 
you get a valid UTF-8 stream as a result of concatenation, you should 
consider that to be the correct behaviour, or you should CHECK BEFOREHAND 
if you think it is illegal.
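
As an illustration of the streaming case, here is a short Python sketch
using the standard library's incremental UTF-8 decoder. The function name
decode_stream and the sample payload are made up. The payload is cut in
the middle of a two-byte character, so neither block is valid UTF-8 on
its own, yet the concatenation decodes cleanly.

```python
import codecs


def decode_stream(chunks):
    """Decode byte blocks whose boundaries ignore character boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in chunks:
        # Bytes of an unfinished character are buffered until the next block.
        yield decoder.decode(chunk)
    # final=True raises if the stream ended in the middle of a character.
    yield decoder.decode(b"", final=True)


payload = "na\u00efve caf\u00e9".encode("utf-8")
chunks = [payload[:3], payload[3:]]       # split inside the 2-byte i-diaeresis

pieces = list(decode_stream(chunks))
print(pieces)
print("".join(pieces) == payload.decode("utf-8"))
```

The partial result after the first block is already usable, and nothing
needs to remember that the block was "bad" once the rest arrives.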

		Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:47                                     ` Tomas Szepe
@ 2004-02-18 20:01                                       ` H. Peter Anvin
  2004-02-18 21:22                                         ` Robin Rosenberg
  0 siblings, 1 reply; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:01 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Linus Torvalds, Kernel Mailing List

Tomas Szepe wrote:
> On Feb-18 2004, Wed, 07:35 -0800
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>But it makes perfect sense to use a policy of:
>> - escape valid UTF-8 characters as '\u7777'

[And e.g. \U00017777 for characters above \uFFFF]

>> - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>>   '\xC0\x80\x80', whatever)
>> - (and, obviously, escape the valid UTF-8 character '\' as '\\').
>>
>>Don't you agree? It clearly allows all the cases, and you can re-generate 
>>the _exact_ original stream of bytes from the above (ie it is nicely 
>>reversible, which in my opinion is a requirement).
> 
> I really really hope this is _exactly_ what we're going to see in practice.
> 

Same here.  This is clearly The Right Thing[TM].

	-hpa


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:59                               ` Linus Torvalds
@ 2004-02-18 20:08                                 ` H. Peter Anvin
  0 siblings, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel

Linus Torvalds wrote:
> 
> But that's what you _want_. Having a real out-of-band signal that says 
> "this stuff is wrong, because it was wrong at some point in the past", and 
> not allowing concatenation of blocks of utf-8 bytes would be _bad_.
> 

Indeed.  What it does mean, however, is that you have to consider your
concatenation issues if you perform the concatenation in UCS-4 space,
for example, a string that ends in whatever code you have chosen for
<BOGUS-C8> that gets concatenated with <BOGUS-80> needs to get converted
to a valid <U+0200>.  This is of course not an issue if you do the
concatenation in UTF-8 space and don't do round-trip conversion.
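
To make the <BOGUS-C8> + <BOGUS-80> case concrete, here is a Python
sketch. It does not use the 0x800000xx scheme; it substitutes Python's
"surrogateescape" error handler, which plays a similar role by mapping an
invalid byte 0xNN to the placeholder code point U+DCNN. The helper names
to_text, to_bytes and concat are hypothetical.

```python
def to_text(raw: bytes) -> str:
    """Decode, turning each invalid byte into a placeholder code point."""
    return raw.decode("utf-8", "surrogateescape")


def to_bytes(text: str) -> bytes:
    """Encode, turning placeholder code points back into the raw bytes."""
    return text.encode("utf-8", "surrogateescape")


def concat(a: str, b: str) -> str:
    """Concatenate in code-point space, then round-trip through bytes so
    that adjacent placeholders forming valid UTF-8 collapse into the
    real character."""
    return to_text(to_bytes(a + b))


left = to_text(b"\xc8")       # placeholder for the stray byte C8
right = to_text(b"\x80")      # placeholder for the stray byte 80

naive = left + right          # still two placeholders
fixed = concat(left, right)   # the single character U+0200

print(len(naive), len(fixed))
print(fixed == "\u0200")
print(to_bytes(naive) == to_bytes(fixed) == b"\xc8\x80")
```

Both strings encode to the same bytes, but only the round-tripped one
compares equal to the character a byte-level concatenation would yield.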

None of this is hard, it just takes thinking about rather than
automatically doing the obvious thing.

> The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
> in a variety of circumstances, all basically having to do with lower 
> levels not knowing about higher-level concepts.

Indeed.

	-hpa

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-18  6:56                                           ` Marc Lehmann
@ 2004-02-18 20:37                                             ` Alex Belits
  0 siblings, 0 replies; 120+ messages in thread
From: Alex Belits @ 2004-02-18 20:37 UTC (permalink / raw)
  To: linux-kernel

On Wed, 18 Feb 2004, Marc Lehmann wrote:

> As well as spreading misinformation e.g. with respect to language
> tagging (which unicode supports, maybe not well, but certainly better than
> koi8-r, iso-8859-1 etc.) :(

  No charset supports language tagging (nonstandard, unused and unusable
extensions to Unicode that contradict and negate the very design of it
notwithstanding) because it's beyond the scope of a charset in the first
place. However any mechanism that can support language tagging
automatically can be used to support any possible set of charsets, thus
making Unicode unnecessary. The only reason why Unicode is not criticized
as much as it deserves, is that almost no one needs the functionality that
it provides, and ones who do (linguists) don't use it in the first place.
The whole movement of "Unicode crusaders/missionaries" is based on the
false premise that their solution to 1/1000th of the problem of handling
multilingual text is supposed to matter when the whole problem will be
faced by the real-life users.

> > not have planned for multilingual environments like they should've done,
> > but they aren't stupid enough to require someone to "bless" them with a
> > variable-length encoding.
>
> The only other "encoding" that I know that supports (very limited)
> language tagging and works in a multilingual environment is iso-2022.
> (maybe emacs has something else, or is iso-2022 based, I don't know,
> correct me please). iso-2022, is horrible to use and didn't catch on in
> many places because of this.

  iso-2022 is inadequate, precisely because it was written when the
requirements for tagging were too low to warrant anything that would be
adequate for a multilingual environment. So far the requirements of the
overwhelming majority of users are _still_ too low, and this is why an
adequate system was never developed. When they rise -- and at some point in
the near future that is inevitable -- some tagging format
will be created, and from that moment there will be no point in insisting
on only using Unicode. So far Unicode proponents are too wrapped up in their
religious devotion to their creation to understand the limits on its
usefulness.

> So it seems that these are your only choices. If there is a problem with
> unicode, it can be fixed (just as problems have been fixed in the past),

  All "fixes" to Unicode will have to go through the "standard process"
that makes ITU look sane, and will result in yet another incompatible
version. Not to mention that some additions can never (and should not) be
adopted by Unicode because they represent fictional or dead languages that
weren't widely used even when they were alive, yet those languages can be
easily supported by an extensible mechanism as long as people agree on
_names_ for them.

> and the resulting standard will be called "Unicode" and will map to
> UTF-8, just as _every_ codeset maps to UTF-8 that is <21 bit (in a strict
> interpretation) or >64 bit (in a lax interpretation).

  But it will still require software to be updated to be supported.
An extensible standard just works: users will install extensions as they
become necessary -- heck, even MSIE is now smart enough to auto-install
a "language environment"; it can't be too difficult for things based on a
better design.

> I don't understand why you are arguing against unicode so vehemently
> without having any other option, and without the need for any other
> option.

  The option is to create a thing that does what users will definitely
need, and that no one does now. I hardly see an alternative, involving
Unicode or not. Multilingual text processing that is not crippled by some
ridiculous limitations has yet to be implemented, and when it is,
the demand to use Unicode/UTF-8 as the only "blessed"
charset/encoding would only reduce its flexibility. I am not against using
Unicode where it serves some purpose; I am against forcing it into areas
where it contributes nothing of value.

> Please note that the examples you make (koi8-r etc.) fail miserably in a
> multilingual environment. koi8-r even starts to fail in a place near you,
> where people use koi8-u, often tagged as koi8-r, because most software has
> no good means to tag their texts.

  koi8-r and iso8859-1 work _great_ in an environment that I use, it's
just I wouldn't call it truly multilingual. Still when language and
charset tagging is supported (per-message in email and HTTP) I have no
problem using it, along with seeing (and editing) text in other languages
that require other charsets. What I don't see supported is multiple
languages and multiple charsets per document, and this happens because
MIME tagging is per-document instead of per-substring. So if I needed to
deal with truly multilingual documents, my problem would be not what
charset to use (charsets that I use work great already) but how to tag
them -- otherwise I lose editing and processing functionality, that I
value higher than the ability to stare at pretty letters.

  One solution is to change XML to add CHARSET attribute everywhere where
LANG is valid -- something that should've been done from the very
beginning.  Another, likely a superior and more generalized solution, is
likely to be created soon, after all, it's just a serialization of
metadata. Then writing some texts from JRRT work would not require a
Mordor representative in the Unicode Consortium (which would be, I guess, a
bad thing).

-- 
Alex

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 20:01                                       ` H. Peter Anvin
@ 2004-02-18 21:22                                         ` Robin Rosenberg
  2004-02-18 21:42                                           ` H. Peter Anvin
  0 siblings, 1 reply; 120+ messages in thread
From: Robin Rosenberg @ 2004-02-18 21:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> [And e.g. \U00017777 for characters above \uFFFF]

Isn't that octal :-)

-- robin

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 21:22                                         ` Robin Rosenberg
@ 2004-02-18 21:42                                           ` H. Peter Anvin
  0 siblings, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-18 21:42 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

Robin Rosenberg wrote:
> On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> 
>>[And e.g. \U00017777 for characters above \uFFFF]
> 
> Isn't that octal :-)
> 

No.

	-hpa


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Unicode normalization (userspace issue, but what the heck)
  2004-02-18  2:48               ` Unicode normalization (userspace issue, but what the heck) H. Peter Anvin
@ 2004-02-20  9:48                 ` Matthias Urlichs
  0 siblings, 0 replies; 120+ messages in thread
From: Matthias Urlichs @ 2004-02-20  9:48 UTC (permalink / raw)
  To: linux-kernel

Hi, H. Peter Anvin wrote:
> By author:    Matthias Urlichs <smurf@smurf.noris.de>
>> Example: is an Ã¡ one character, or is it an a
>> followed by a composing ÂŽ?

Nope, I didn't write that, at least not that way...

Presumably I shouldn't post in UTF-8 if Latin-1(5) is sufficient, but
then that's part of the reason we're having this discussion, so there. ;-)

-- 
Matthias Urlichs

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
       [not found]                             ` <20040217161111.GE8231@schmorp.de>
  2004-02-17 16:36                             ` Jamie Lokier
@ 2004-02-21 13:54                             ` Pavel Machek
  2004-02-22 20:09                               ` H. Peter Anvin
  2 siblings, 1 reply; 120+ messages in thread
From: Pavel Machek @ 2004-02-21 13:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc, Jamie Lokier, Marc Lehmann, viro, Linux kernel

Hi!


> Which flies in the face of "Be strict in what you generate, be liberal in 
> what you accept". A lot of the functions are _not_ willing to be liberal 
> in what they accept. Which sometimes just makes the problem worse, for no 
> good reason.

"Be liberal in what you accept" used to be a good rule... until security
became important. While it is still nice from an ease-of-use viewpoint,
it's bad when you want it secure.
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:47                                         ` Jamie Lokier
@ 2004-02-22 15:32                                           ` Eric W. Biederman
  2004-02-22 16:28                                             ` Jamie Lokier
  0 siblings, 1 reply; 120+ messages in thread
From: Eric W. Biederman @ 2004-02-22 15:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Alex Belits wrote:
> > > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > > programs which currently work with UTF-8 terminals, which is rapidly
> > > becoming most of them, will work the same with both kinds of terminal,
> > > and the illusion of perfection will be complete and beautiful.
> > 
> >   UTF-8 terminals (and variable-encoding terminals) already exist,
> > gnome-terminal is one of them. They are, of course, bloated pigs, but I
> > would rather have the bloat and idiosyncrasy in the user interface where
> > it belongs.
> 
> Yes, I am using it right now.  The fancy characters work well in it.
> Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
> naturally like to access my files in the same way.

Basically I think this is just a matter of modifying telnetd and
sshd so that for the display they follow the users locale,
at least in cooked mode.

Does anyone have a good grasp what the exact semantics should be and
where the translation should happen?  I know we need to delay the
translation as long as possible so we can get binary streams flowing
through these protocols? 

I guess my question is when do we know the information is going to
a terminal so we should translate it?

Eric

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 15:32                                           ` Eric W. Biederman
@ 2004-02-22 16:28                                             ` Jamie Lokier
  2004-02-22 21:53                                               ` Eric W. Biederman
  0 siblings, 1 reply; 120+ messages in thread
From: Jamie Lokier @ 2004-02-22 16:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Eric W. Biederman wrote:
> I guess my question is when do we know the information is going to
> a terminal so we should translate it?

When a program is writing to a terminal device, then we know it's
going to a terminal _or_ to a program which is pretending to be one
(pseudo-terminal).  Either way, the behaviour should be the same.
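
As a rough sketch of that test (the helper name output_is_terminal is
made up for illustration), a program can ask whether a descriptor is a
terminal device with isatty().  A pseudo-terminal slave passes the same
check, so both cases get treated alike:

```c
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd refers to a terminal device,
 * including the slave side of a pseudo-terminal, and 0 otherwise
 * (regular file, pipe, socket, or invalid descriptor). */
int output_is_terminal(int fd)
{
    return isatty(fd) == 1;
}
```

A filter could call output_is_terminal(STDOUT_FILENO) and translate
only when it returns 1, leaving pipes and files as untouched binary
streams.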

The "screen" program can be used to do translation, although it's a
rather cumbersome way to go about it, and it has other effects which
are annoying (at least one key is always designated for "screen" commands).

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-21 13:54                             ` Pavel Machek
@ 2004-02-22 20:09                               ` H. Peter Anvin
  0 siblings, 0 replies; 120+ messages in thread
From: H. Peter Anvin @ 2004-02-22 20:09 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040221135439.GA310@elf.ucw.cz>
By author:    Pavel Machek <pavel@ucw.cz>
In newsgroup: linux.dev.kernel
>
> Hi!
> 
> 
> > Which flies in the face of "Be strict in what you generate, be liberal in 
> > what you accept". A lot of the functions are _not_ willing to be liberal 
> > in what they accept. Which sometimes just makes the problem worse, for no 
> > good reason.
> 
> Be liberal in what you accept used to be a good rule... until security
> became important. While it is still nice from an ease-of-use viewpoint,
> it's bad when you want it secure.
> 								Pavel

"Be liberal, but cautious, in what you accept" is pretty much the new
rule.

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 16:28                                             ` Jamie Lokier
@ 2004-02-22 21:53                                               ` Eric W. Biederman
  0 siblings, 0 replies; 120+ messages in thread
From: Eric W. Biederman @ 2004-02-22 21:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Eric W. Biederman wrote:
> > I guess my question is when do we know the information is going to
> > a terminal so we should translate it?
> 
> When a program is writing to a terminal device, then we know it's
> going to a terminal _or_ to a program which is pretending to be one
> (pseudo-terminal).  Either way, the behaviour should be the same.
> 
> The "screen" program can be used to do translation, although it's a
> rather cumbersome way to go about it, and it has other effects which
> are annoying (at least one key is always designated for "screen" commands).

Right.  At this point I am not worried about temporary solutions.  I
want to pin down how things should be implemented, so that the user space
programs can be fixed.  Pardon me while I think aloud to frame the problem.

First it is worth noting that the existing practice is that ttys
always use the character set encoding of the user.  Even X cut and
paste frequently abuses the iso8859-1 range, using the native
character set encoding instead of iso8859-1.

Now the work is how to get multiple locales to play nicely with each
other.  utf-8 and unicode are convenient for that as they preserve the
existing assumptions that terminals, filenames, and text files are
all using the same character set encoding, even when multiple locales
are involved.

So within one machine utf-8 solves the multiple locale problem.  The
problem has now moved to interoperability between machines.  Since
multiple machines have different upgrade cycles, and are in different
administrative domains everyone does not move to utf-8 at the same
time.

When we add the assertion that all I/O going through a terminal device
is in the native locale we break 8bit transparency.  This holds true
in some instances when both sides use the same character set encoding,
such as utf8.

There are some mitigating factors to this.  ssh already documents
pseudo tty's as potentially breaking 8 bit transparency.  And
applications that require ttys for stdin/stdout are most likely
interactive.  Interactive programs are either character based, or
broken.

Being an unclean channel for pipes will affect at least XMODEM,
YMODEM, and ZMODEM protocols, and possibly ppp.  These programs
already know how to avoid problem characters and because ascii is a
common subset of most character set encodings the effect should be no
worse than a line that is not 8 bit clean.

ssh at least has explicit options to allocate or not allocate a
pseudo-tty so getting an 8 bit clean data path is not a problem with
ssh.

The rule ``All data passing through a pseudo-tty is in the
character set encoding specified by the locale of the owner of the
tty'' seems both reasonable and no significant change from the current
status quo.

Now how does this get implemented?

On the wire between two machines I recommend passing unicode
characters.  Unicode guarantees no round trip loss for any of its
member character sets, and it reduces everything to one set of
translation tables.

By convention glibc stores unicode values in wchar_t.  mbstowcs will
convert multibyte strings to internal wide characters, based on the
current locale. wcstombs will do the opposite.  So going between
unicode and the character set encoding of the current locale is
straightforward.
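
A minimal sketch of that round trip, assuming glibc (where wchar_t
holds ISO 10646 code points) and assuming the caller has already run
setlocale(LC_CTYPE, "") to select the user's locale.  The wrapper names
locale_to_wide and wide_to_locale are made up for illustration; only
mbstowcs and wcstombs are standard:

```c
#include <stdlib.h>

/* Hypothetical wrapper: convert a multibyte string in the current
 * locale's encoding to wide characters.  Returns the number of wide
 * characters written, not counting the terminator, or (size_t)-1 on an
 * invalid sequence or when dst has no room for the terminator. */
size_t locale_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1 || n >= dst_len)
        return (size_t)-1;
    return n;
}

/* Hypothetical wrapper: the opposite direction.  Returns the number of
 * bytes written, not counting the terminator, or (size_t)-1 if a wide
 * character cannot be represented in the current locale or dst is too
 * small. */
size_t wide_to_locale(const wchar_t *src, char *dst, size_t dst_len)
{
    size_t n = wcstombs(dst, src, dst_len);
    if (n == (size_t)-1 || n >= dst_len)
        return (size_t)-1;
    return n;
}
```

Both calls depend on the LC_CTYPE category of the current locale, so
the same wide string comes out as different bytes under, say, a utf-8
and an EUC locale.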

How do we convert the applications?

There are only four cases I can think of where we connect to a remote
system with terminal semantics.
1) Directly connected serial terminals.
2) telnetd
3) rshd
4) sshd

To my knowledge all of their protocols just pass through characters
and are neutral.  So changing these feels like a protocol extension,
ouch!  Those are the programs that bridge multiple administrative
domains, and they do deal with pseudo ttys so they are where something
needs to happen, to support different character set encodings on
different machines. 

If everyone just switches over to using utf-8 even the above cases are
fine.  So if there is a reasonable expectation that everyone will
change to using utf-8 in the near future even those programs don't
need to change.

Given the delay in changing protocols I propose 2 simple programs.
sh-utf8 and utf8-tty.  The first runs a command converting stdout and
stderr from utf8 to the current locale, and converting stdin into
utf8.  The second creates a pseudo tty and relays to its controlling
tty, assuming the controlling tty uses utf8 and its tty uses the
current locale.
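
The conversion step at the heart of either program could look roughly
like the following.  This is only a sketch using iconv, not an existing
sh-utf8; the function names utf8_to_codeset and current_codeset are
made up for illustration, and a real relay would also have to carry
incomplete multibyte sequences over from one read() to the next:

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: name of the character set encoding of the
 * user's locale, as taken from the environment (e.g. "UTF-8"). */
const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: convert in_len bytes of utf8 into to_codeset.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * encoding is unknown, the input is invalid or incomplete, or out is
 * too small. */
size_t utf8_to_codeset(const char *to_codeset, const char *in,
                       size_t in_len, char *out, size_t out_len)
{
    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_p = (char *)in;    /* iconv() takes a non-const char ** */
    char *out_p = out;
    size_t in_left = in_len;
    size_t out_left = out_len;

    size_t rc = iconv(cd, &in_p, &in_left, &out_p, &out_left);
    if (rc != (size_t)-1)
        /* Emit any closing shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &out_p, &out_left);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : out_len - out_left;
}
```

sh-utf8 would apply utf8_to_codeset(current_codeset(), ...) to the
command's stdout and stderr, and the reverse conversion to stdin.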

Looking around there already is a TTYConv program that seems to fill
this niche, except you must specify the character set encodings
manually.
http://bedroomlan.dyndns.org/~alexios/coding_ttyconv.html

Comments?

Eric


* Re: UTF-8 practically vs. theoretically in the VFS API
@ 2004-02-23 11:35 Norman Diamond
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
  0 siblings, 1 reply; 120+ messages in thread
From: Norman Diamond @ 2004-02-23 11:35 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel

Eric W. Biederman wrote:

> First it is worth noting that the existing practice is that ttys
> always use the character set encoding of the user.

Each tty uses the character set encoding of that tty's user.  There were
times when I needed to have some tty windows open using EUC (ordinary work
on that Linux machine) and some tty windows open using SJIS (editing files
which would be sent to cellular telephones), in the same X session.  They
worked.

> Even X cut and paste frequently abuses the iso8859-1 range,

I'll take your word for it.  I've copied and pasted EUC strings, I've copied
and pasted SJIS strings, I don't know if X copy and paste abused EUC or SJIS
ranges, but it worked.

One thing I never thought of trying to test is to copy and paste between one
tty using EUC and one tty using SJIS.

> Now the work is how to get multiple locales to play nicely with each
> other.  utf-8 and unicode are convenient for that as they preserve the
> existing assumptions that terminals, filenames, and text files are
> all using the same character set encoding, even when multiple locales
> are involved.
>
> So within one machine utf-8 solves the multiple locale problem.

That preserves a nice fiction.  If you depend on assuming that fiction,
you'll get useless results.

> The rule ``All data passing through a pseudo-tty is in the
> character set encoding specified by the locale of the owner of the
> tty'' seems both reasonable and no significant change from the current
> status quo.

Yes, that is a return to usability.

> On the wire between two machines I recommend passing unicode
> characters.

Why should the wire get a different encoding than the user set in the
pseudo-tty?  Consider TeraTerm.  The user tells TeraTerm what character set
is in use on the wire, which is the same as the character set in use on the
remote side (where sshd or whatever server provides the pseudo-tty).
TeraTerm converts between that and the local character set (where the
TeraTerm program and window and user get the character set decided for them
by someone in Sasazuka or Redmond).

> By convention glibc stores unicode values in wchar_t.

That is hard to believe.  glibc existed before Unicode did and wchar_t
existed before Unicode did.  I sure thought that glibc existed in Japan at
the time, but I could be wrong, I didn't say this is impossible but merely
hard to believe.  In commercial Unix systems, wchar_t held either EUC or
SJIS depending on the vendor.

As usual I do not even have time to keep up with this thread, so if you have
questions then please CC me personally, though I don't know if I'll have
time to investigate anything that needs it.


* Re: UTF-8 practically vs. theoretically in the VFS API
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
@ 2004-02-23 19:13   ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2004-02-23 19:13 UTC (permalink / raw)
  To: Norman Diamond; +Cc: linux-kernel

>>>>> "ND" == Norman Diamond <ndiamond@wta.att.ne.jp> writes:

ND> Eric W. Biederman wrote:
>> Even X cut and paste frequently abuses the iso8859-1 range,

ND> I'll take your word for it.  I've copied and pasted EUC
ND> strings, I've copied and pasted SJIS strings, I don't know
ND> if X copy and paste abused EUC or SJIS ranges, but it
ND> worked.

I do not know what Eric means by "abusing the iso8859-1 range",
but passing X selection between traditional X clients IIRC uses
compound text, which is an encoding vaguely similar to ISO-2022,
so clients like kterm can convert it back and forth with EUC or
SJIS as needed.


end of thread, other threads:[~2004-02-23 19:13 UTC | newest]

Thread overview: 120+ messages
     [not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
2004-02-14 14:27 ` JFS default behavior Nicolas Mailhot
2004-02-14 15:40   ` viro
2004-02-14 17:47     ` Nicolas Mailhot
2004-02-14 17:59       ` Nicolas Mailhot
2004-02-14 23:06     ` Robin Rosenberg
2004-02-14 23:29       ` viro
2004-02-15  0:07         ` Robin Rosenberg
2004-02-15  2:41           ` Linus Torvalds
2004-02-15  3:33             ` Matthias Urlichs
2004-02-15  4:04               ` viro
2004-02-15  9:48                 ` Robin Rosenberg
2004-02-15 18:26                 ` yodaiken
2004-02-18  2:48               ` Unicode normalization (userspace issue, but what the heck) H. Peter Anvin
2004-02-20  9:48                 ` Matthias Urlichs
2004-02-16 15:05             ` stty utf8 Jamie Lokier
2004-02-16 16:10               ` Gerd Knorr
2004-02-16 22:03               ` Jamie Lokier
2004-02-16 22:17                 ` Linus Torvalds
2004-02-16 22:04               ` Jamie Lokier
2004-02-16 18:36             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 18:49               ` Linus Torvalds
2004-02-16 19:26                 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
2004-02-16 19:48                   ` John Bradford
2004-02-16 19:48                     ` Linus Torvalds
2004-02-16 20:20                       ` Marc Lehmann
2004-02-16 20:26                         ` Linus Torvalds
2004-02-18  2:49                         ` Rob Landley
2004-02-16 20:21                       ` bert hubert
2004-02-16 20:33                         ` Marc Lehmann
2004-02-18  2:58                         ` H. Peter Anvin
2004-02-18  3:13                           ` Linus Torvalds
2004-02-18  3:22                             ` H. Peter Anvin
2004-02-18  3:30                               ` Linus Torvalds
2004-02-18  5:30                                 ` H. Peter Anvin
2004-02-18 10:29                                   ` Robin Rosenberg
2004-02-18 11:49                                     ` Tomas Szepe
2004-02-18 11:59                                       ` Robin Rosenberg
2004-02-18 12:05                                         ` Tomas Szepe
2004-02-18 12:34                                           ` Robin Rosenberg
2004-02-18 15:35                                   ` Linus Torvalds
2004-02-18 19:47                                     ` Tomas Szepe
2004-02-18 20:01                                       ` H. Peter Anvin
2004-02-18 21:22                                         ` Robin Rosenberg
2004-02-18 21:42                                           ` H. Peter Anvin
2004-02-18 11:24                               ` Jamie Lokier
2004-02-18 11:33                             ` Jamie Lokier
2004-02-18 16:47                               ` H. Peter Anvin
2004-02-18 19:59                               ` Linus Torvalds
2004-02-18 20:08                                 ` H. Peter Anvin
2004-02-18  7:25                           ` bert hubert
2004-02-16 20:16                     ` Marc Lehmann
2004-02-16 20:20                       ` Jeff Garzik
2004-02-16 21:10                       ` viro
2004-02-17  7:18                       ` jw schultz
2004-02-17  7:42                       ` Nick Piggin
2004-02-16 20:03                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 20:23                   ` Linus Torvalds
2004-02-16 20:58                     ` Marc Lehmann
2004-02-17 14:12                       ` Dave Kleikamp
2004-02-16 22:26                     ` Jamie Lokier
2004-02-16 22:40                       ` Linus Torvalds
2004-02-16 22:52                         ` Linus Torvalds
2004-02-17 13:15                           ` Jamie Lokier
2004-02-17  7:14                         ` Marc Lehmann
2004-02-17 11:20                           ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
2004-02-17 15:56                           ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
     [not found]                             ` <20040217161111.GE8231@schmorp.de>
2004-02-17 16:32                               ` Linus Torvalds
2004-02-17 16:46                                 ` Jamie Lokier
2004-02-17 19:00                                   ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 20:57                                     ` Jamie Lokier
2004-02-17 21:06                                       ` Alex Belits
2004-02-17 21:47                                         ` Jamie Lokier
2004-02-22 15:32                                           ` Eric W. Biederman
2004-02-22 16:28                                             ` Jamie Lokier
2004-02-22 21:53                                               ` Eric W. Biederman
2004-02-18  7:23                                         ` Marc Lehmann
2004-02-17 21:23                                       ` Matthew Kirkwood
2004-02-18 13:11                                   ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Matthew Garrett
2004-02-17 16:52                                 ` Marc Lehmann
2004-02-17 16:54                                 ` UTF-8 practically vs. theoretically in the VFS API Stefan Smietanowski
2004-02-18  1:27                                   ` Hans Reiser
2004-02-18  2:08                                     ` Robin Rosenberg
2004-02-18 11:06                                       ` Jamie Lokier
2004-02-17 20:37                                 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Robin Rosenberg
2004-02-17 16:36                             ` Jamie Lokier
2004-02-17 17:52                               ` viro
2004-02-17 19:29                                 ` Jamie Lokier
2004-02-17 19:45                                   ` Linus Torvalds
2004-02-17 20:30                                     ` Jamie Lokier
2004-02-17 20:49                                       ` Linus Torvalds
2004-02-17 21:17                                         ` Jamie Lokier
2004-02-17 19:51                                   ` Jamie Lokier
2004-02-17 19:53                                   ` viro
2004-02-17 20:35                                     ` John Bradford
2004-02-17 20:40                                       ` Jamie Lokier
2004-02-17 20:50                                         ` John Bradford
2004-02-17 21:04                                           ` Linus Torvalds
2004-02-17 21:16                                             ` John Bradford
2004-02-17 21:21                                               ` Linus Torvalds
2004-02-18  0:52                                                 ` John Bradford
2004-02-17 22:50                                               ` Robin Rosenberg
2004-02-18  6:48                                             ` Marc Lehmann
2004-02-17 20:47                                       ` viro
2004-02-17 20:53                                         ` John Bradford
2004-02-17 20:59                                       ` Linus Torvalds
2004-02-17 21:06                                         ` John Bradford
2004-02-17 21:42                                         ` Alex Belits
2004-02-18  6:56                                           ` Marc Lehmann
2004-02-18 20:37                                             ` Alex Belits
2004-02-18  3:11                                         ` H. Peter Anvin
2004-02-17 20:38                                     ` Jamie Lokier
2004-02-18  3:07                               ` H. Peter Anvin
2004-02-21 13:54                             ` Pavel Machek
2004-02-22 20:09                               ` H. Peter Anvin
2004-02-17  1:24                   ` Alex Belits
2004-02-17 21:09                     ` Jamie Lokier
2004-02-17 21:48                       ` Linus Torvalds
2004-02-17 22:19                       ` Alex Belits
2004-02-23 11:35 UTF-8 practically vs. theoretically in the VFS API Norman Diamond
     [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
2004-02-23 19:13   ` Junio C Hamano
