Re: UTF-8 practically vs. theoretically in the VFS API

public inbox for linux-kernel@vger.kernel.org

* Re: UTF-8 practically vs. theoretically in the VFS API
@ 2004-02-23 11:35 Norman Diamond
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
  0 siblings, 1 reply; 50+ messages in thread
From: Norman Diamond @ 2004-02-23 11:35 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel

Eric W. Biederman wrote:

> First it is worth noting that the existing practice is that ttys
> always use the character set encoding of the user.

Each tty uses the character set encoding of that tty's user.  There were
times when I needed to have some tty windows open using EUC (ordinary work
on that Linux machine) and some tty windows open using SJIS (editing files
which would be sent to cellular telephones), in the same X session.  They
worked.

> Even X cut and paste frequently abuses the iso8859-1 range,

I'll take your word for it.  I've copied and pasted EUC strings, I've copied
and pasted SJIS strings, I don't know if X copy and paste abused EUC or SJIS
ranges, but it worked.

One thing I never thought of trying to test is to copy and paste between one
tty using EUC and one tty using SJIS.

> Now the work is how to get multiple locales to play nicely with each
> other.  utf-8 and unicode are convenient for that as they preserve the
> existing assumptions that terminals, filenames, and text files are
> all using the same character set encoding, even when multiple locales
> are involved.
>
> So within one machine utf-8 solves the multiple locale problem.

That preserves a nice fiction.  If you depend on assuming that fiction,
you'll get useless results.

> The rule ``All data passing through a pseudo-tty is in the
> character set encoding specified by the locale of the owner of the
> tty'' seems both reasonable and no significant change from the current
> status quo.

Yes, that is a return to usability.

> On the wire between two machines I recommend passing unicode
> characters.

Why should the wire get a different encoding than the user set in the
pseudo-tty?  Consider TeraTerm.  The user tells TeraTerm what character set
is in use on the wire, which is the same as the character set in use on the
remote side (where sshd or whatever server provides the pseudo-tty).
TeraTerm converts between that and the local character set (where the
TeraTerm program and window and user get the character set decided for them
by someone in Sasazuka or Redmond).
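
A minimal sketch of the conversion step described above, assuming the
wire carries EUC-JP and the local side wants Shift_JIS.  The function
name `transcode` and the sample string are illustrative only; this is
not TeraTerm's code, just the same idea expressed with Python's
standard codecs.

```python
def transcode(data: bytes, wire: str, local: str) -> bytes:
    """Re-encode bytes from the wire charset into the local charset.

    The text passes through Unicode in the middle; the wire itself never
    needs to carry anything but the charset the remote user configured.
    """
    return data.decode(wire).encode(local)


text = "\u65e5\u672c\u8a9e"          # the word "nihongo" (Japanese language)
euc = text.encode("euc_jp")          # what the remote pseudo-tty emits
sjis = transcode(euc, "euc_jp", "shift_jis")

print("wire  (EUC-JP):   ", euc.hex())
print("local (Shift_JIS):", sjis.hex())
```

The two byte strings differ even though they name the same three
characters, which is why the emulator has to be told the wire charset.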

> By convention glibc stores unicode values in wchar_t.

That is hard to believe.  glibc existed before Unicode did and wchar_t
existed before Unicode did.  I sure thought that glibc existed in Japan at
the time, but I could be wrong; I am not saying this is impossible, merely
hard to believe.  In commercial Unix systems, wchar_t held either EUC or
SJIS depending on the vendor.

As usual I do not even have time to keep up with this thread, so if you have
questions then please CC me personally, though I don't know if I'll have
time to investigate anything that needs it.

[parent not found: <fa.ip45pqg.i26oru@ifi.uio.no>]

* Re: UTF-8 practically vs. theoretically in the VFS API
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
@ 2004-02-23 19:13   ` Junio C Hamano
  0 siblings, 0 replies; 50+ messages in thread
From: Junio C Hamano @ 2004-02-23 19:13 UTC (permalink / raw)
  To: Norman Diamond; +Cc: linux-kernel

>>>>> "ND" == Norman Diamond <ndiamond@wta.att.ne.jp> writes:

ND> Eric W. Biederman wrote:
>> Even X cut and paste frequently abuses the iso8859-1 range,

ND> I'll take your word for it.  I've copied and pasted EUC
ND> strings, I've copied and pasted SJIS strings, I don't know
ND> if X copy and paste abused EUC or SJIS ranges, but it
ND> worked.

I do not know what Eric means by "abusing the iso8859-1 range",
but passing X selection between traditional X clients IIRC uses
compound text, which is an encoding vaguely similar to ISO-2022,
so clients like kterm can convert it back and forth with EUC or
SJIS as needed.

[parent not found: <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>]

* Re: JFS default behavior
@ 2004-02-14 23:06 ` Robin Rosenberg
  2004-02-14 23:29   ` viro
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-14 23:06 UTC (permalink / raw)
  To: viro; +Cc: Linux kernel

On Saturday 14 February 2004 16.40, you wrote:
> The same goes for file names.  Filename is a sequence of bytes, no more and
> no less.  Anything beyond that belongs to applications.

Should be a sequence of characters since humans are supposed to use them and
it should be the same characters whenever possible regardless of user's locale.

The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
was true. That is no longer the case and actually hasn't been for quite a while in
some parts of the world. Interchange is important. The application cannot handle
this since it cannot know what characters a byte string represents. Fixing it in the
kernel is the simple solution since it knows the locale. It's also a small change I
believe. Having an iocharset option for all file systems makes it backward compatible
and creates a migration path to UTF-8 as system default locale.

-- robin

* Re: JFS default behavior
  2004-02-14 23:06 ` JFS default behavior Robin Rosenberg
@ 2004-02-14 23:29   ` viro
  2004-02-15  0:07     ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: viro @ 2004-02-14 23:29 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> On Saturday 14 February 2004 16.40, you wrote:
> > The same goes for file names.  Filename is a sequence of bytes, no more and
> > no less.  Anything beyond that belongs to applications.
> 
> Should be a sequence of characters since humans are supposed to use them and
> it should be the same characters whenever possible regardless of user's locale.
 
> The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
> was true.

Bullshit.  It has _nothing_ to do with characters, wide or not.  To the system,
filenames are opaque.  The only things that have special meanings are:
	octet 0x2f ('/') splits the pathname into components
	"." as a component has a special meaning
	".." as a component has a special meaning.
That's it.  The rest is never interpreted by the kernel.
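
A minimal user-space sketch of that rule, assuming nothing beyond what
is listed above.  The helper names `split_components` and `classify`
are made up for illustration and are not kernel functions; the real
lookup code also stops at the terminating NUL, which a Python bytes
object does not need.

```python
def split_components(path: bytes) -> list:
    """Split a pathname on octet 0x2f only; empty components are dropped."""
    return [part for part in path.split(b"/") if part]


def classify(component: bytes) -> str:
    """Return how the component is treated: only "." and ".." are special."""
    if component == b".":
        return "current directory"
    if component == b"..":
        return "parent directory"
    return "opaque name"


# The second component is not valid in any particular charset; it is
# simply two bytes followed by "t" and another high byte.
path = b"/home/\xe9t\xe9/../tmp"
for part in split_components(path):
    print(part, "->", classify(part))
```

Nothing in the sketch ever decodes a byte into a character, which is
the point being made.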

> Having an iocharset option for all file systems makes it backward compatible
> and creates a migration path to UTF-8 as system default locale.

Try to realize that different users CAN HAVE DIFFERENT LOCALES.  On the same
system.  And have files on the same fs.  Moreover, homedirs that used to be
on different filesystems can end up on the same fs.  What iocharset would
you use, then?  Sigh...

Again, there is no such thing as iocharset of filesystem - it varies between
users and users can and do share filesystems.  Think of /home; think of /tmp.
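
To make the shared-filesystem problem concrete, here is a small sketch
showing one stored byte read back under three different user charsets.
The three charsets are arbitrary picks for illustration, and
`interpretations` is a hypothetical helper name.

```python
def interpretations(raw: bytes, charsets) -> dict:
    """Decode the same bytes under each charset a user might have set."""
    return {charset: raw.decode(charset) for charset in charsets}


# One byte on disk, three users with three different locales.
views = interpretations(b"\xe9", ["latin-1", "iso8859_7", "koi8_r"])
for charset, text in views.items():
    print(charset.ljust(10), "U+%04X" % ord(text))
```

Each user sees a different character for the same name, so no single
per-filesystem setting can be right for all of them.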

It isn't feasible.  At all.  Just as timezone doesn't belong in kernel, locales
have no place there.

* Re: JFS default behavior
  2004-02-14 23:29   ` viro
@ 2004-02-15  0:07     ` Robin Rosenberg
  2004-02-15  2:41       ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-15  0:07 UTC (permalink / raw)
  To: viro; +Cc: Linux kernel

On Sunday 15 February 2004 00.29, you wrote:
> On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> > The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
> > was true.
> 
> Bullshit.  It has _nothing_ to do with characters, wide or not.  To the system,
> filenames are opaque.  The only things that have special meanings are:
> 	octet 0x2f ('/') splits the pathname into components
> 	"." as a component has a special meaning
> 	".." as a component has a special meaning.
> That's it.  The rest is never interpreted by the kernel.
I know how it is (to some degree), and it's wrong. The user sees inside the filename
and sees a string of characters, not a byte sequence.

> Try to realize that different users CAN HAVE DIFFERENT LOCALES.  On the same
> system.  And have files on the same fs.  Moreover, homedirs that used to be
> on different filesystems can end up on the same fs.  What iocharset would
> you use, then?  Sigh...
Ok, I've got the iocharset option wrong, god knows why. The problem 
however remains.

It seems you simply don't want to understand the problem, which is that users 
CAN HAVE DIFFERENT LOCALES on the same system and on different systems. 
Sigh...

I am less concerned with which solution than that a solution should be found. So it
seems no file system has a solution today. Still an iocharset option would relieve
the problem for removable media and multi-boot systems. Most linux machines
are essentially single user and have either the same locale for all users or all
users are using UTF-8 with their locale. It's not the locale, but the charset used
for encoding the locale. The rest cannot be helped.

-- robin

* Re: JFS default behavior
  2004-02-15  0:07     ` Robin Rosenberg
@ 2004-02-15  2:41       ` Linus Torvalds
  2004-02-16 18:36         ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-15  2:41 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: viro, Linux kernel

On Sun, 15 Feb 2004, Robin Rosenberg wrote:
> > 
> > Bullshit.  It has _nothing_ to do with characters, wide or not.  To the system,
> > filenames are opaque.  The only things that have special meanings are:
> > 	octet 0x2f ('/') splits the pathname into components
> > 	"." as a component has a special meaning
> > 	".." as a component has a special meaning.
> > That's it.  The rest is never interpreted by the kernel.
>
> I know how it is (to some degree), and it's wrong. The user sees inside the filename
> and sees a string of characters, not a byte sequence.

Yes, the user sees a string of characters, but the octet 0x2f ('/') and 
the terminating NUL character '\0' are still perfectly normal characters 
and there is no confusion.

The reason: UTF-8. It's the only sane encoding (apart from a pure extended
ASCII setup, which is also sane, but is obviously unacceptable for a large
portion of the world).

If some misguided person has told you about UCS-2 and horrors like UTF-9,
just ignore them. They are crazy and deluded, and - perhaps more
importantly - stupid.

In short: the kernel talks bytestreams, and that implies that if you want 
to talk to the kernel, you HAVE TO USE UTF-8.

At which point there are no locale issues any more. The only locale issue 
you can have is user space mistaking a stream of bytes as extended ASCII, 
which will cause all your pretty UTF-8 characters to be shown as strange 
latin1 (or other) squiggles.

> It seems you simply don't want to understand the problem, which is that users 
> CAN HAVE DIFFERENT LOCALES on the same system and on different systems. 
> Sigh...

People understand the problem. And UTF-8 is the solution.

It's getting there. I think even Microsoft has seen the light, and is
phasing out their crapola (UCS-2LE? Whatever). 

> I am less concerned with which solution than that a solution should be found. So it
> seems no file system has a solution today. Still an iocharset option would relieve
> the problem for removable media and multi-boot systems.

No. Things like "iocharset" are not the solution. They are literally the
_problem_. The solution is to use something that not only acts as ASCII,
but also has a wide enough range to cover the whole required space (UCS-2
fails _both_ of these fundamental tests). At which point "iocharset" makes 
no sense any more, and only exists as a way to translate legacy crap into 
the one true format.
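
A short sketch of the first test ("acts as ASCII"), under the
assumption that Python's "utf-16-le" codec stands in for UCS-2LE,
which it matches for characters in the Basic Multilingual Plane.  The
helper name `special_octets` is made up for illustration.

```python
def special_octets(encoded: bytes) -> set:
    """Return which kernel-significant octets (NUL, 0x2f) occur in the data."""
    return {octet for octet in encoded if octet in (0x00, 0x2F)}


name = "\u022f"                      # LATIN SMALL LETTER O WITH DOT ABOVE
utf8 = name.encode("utf-8")          # c8 af: every octet has the high bit set
ucs2 = name.encode("utf-16-le")      # 2f 02: contains a stray '/' octet

print("UTF-8 :", utf8.hex(), sorted(special_octets(utf8)))
print("UCS-2 :", ucs2.hex(), sorted(special_octets(ucs2)))
print("ASCII 'A' as UCS-2:", "A".encode("utf-16-le").hex())   # 41 00: a NUL
```

In UTF-8 the octets 0x00 and 0x2f can only ever mean NUL and '/',
whereas a 16-bit encoding produces them as halves of unrelated
characters.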

And that one true format is UTF-8. End of story. If you try to talk to the 
kernel in UCS-2 or anything else, you _will_ fail.

			Linus

* UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-15  2:41       ` Linus Torvalds
@ 2004-02-16 18:36         ` Marc Lehmann
  2004-02-16 18:49           ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 18:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Linux kernel

[I may be a bit late in response, but AFAICS these points have not yet
been mentioned]

On Sat, Feb 14, 2004 at 06:41:20PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
[discussion on why UTF-8 is the only sane encoding, which I absolutely
agree with, removed]

> In short: the kernel talks bytestreams, and that implies that if you want 
> to talk to the kernel, you HAVE TO USE UTF-8.

This is not the problem at all. It's perfectly easy to write
applications that talk UTF-8 and just UTF-8 with the kernel.

The problem is that the kernel does not use UTF-8, i.e. applications in
the current linux model have to deal with the fact that the kernel
happily breaks the assumed protocol of using UTF-8 by delivering illegal
byte sequences to applications.

There is no way for applications to handle UTF-8 and illegal-utf8 in
a sane way, so most apps will either eat the illegal bytes, skip the
filename, or crash (the latter case is clearly a bug in the app, the
former cases aren't).

Fixing the VFS to actually enforce what linus claims ("filenames are
utf-8") is a very good idea, imho.

As I understand it, the reason linux currently doesn't, is that this utf-8
rule was obviously non-enforceable in practice in recent years, since
UTF-8 simply wasn't widespread (even today, applications such as bash or
grep are clearly not UTF-8 ready, as they start to crawl in UTF-8 locales
without special patches, and even with special patches).

So the only sane way to implement this enforcement is using an
additional mount flag, e.g. "force-utf8".

An encoding=xyz mount flag OTOH would be total overkill, as the plan
must be to switch to UTF-8 in the long run, while allowing deviating
behaviour in the short run.

Conversely, filesystems such as NTFS, VFAT etc. need to convert from the
fs encoding to UTF-8 and vice versa automatically, at least when this
flag is specified.

It should become the default in some future linux version.
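
For illustration only, the check such a flag would imply can be
sketched in user space.  "force-utf8" is merely the name proposed in
this message, not an existing mount option, and `is_valid_utf8` is a
hypothetical helper that relies on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True only if the name is well-formed UTF-8.

    Python's decoder rejects overlong forms, stray continuation bytes
    and truncated sequences, so it serves as the strict check here.
    """
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


candidates = [
    "gr\u00fc\u00df.txt".encode("utf-8"),    # well-formed UTF-8
    "gr\u00fc\u00df.txt".encode("latin-1"),  # legacy 8-bit name
    b"\xc0\x80",                             # overlong encoding of NUL
]
for name in candidates:
    verdict = "accept" if is_valid_utf8(name) else "reject"
    print(verdict, name)
```

Under the proposal, the rejected names would fail at creation time
instead of reaching other applications later.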

> People understand the problem. And UTF-8 is the solution.

The kernel needs to fully implement it. Just as a kernel accepting:

   open ("directory", O_WRONLY); write (dirfd, ...)...
   open ("/some/file", ...)
   mkdir ("../some/file", ...)

is considered rather broken behaviour from unix kernels (although these
might have been allowed in some dialects or versions of unix) today, this:

   mkdir ("</ encoded using illegal multibyte sequence>", ...)

will be considered broken behaviour in the future. The RFC defining UTF-8
clearly considers this a bug in UTF-8 implementations, so the kernel
in fact does NOT implement UTF-8 right now. Although some people claim
that the kernel accepting UTF-8 (and more) is correct behaviour, it isn't
according to the RFC.

> It's getting there. I think even Microsoft has seen the light, and is
> phasing out their crapola (UCS-2LE? Whatever). 

Microsoft and Java officially use UTF-16 nowadays. The funny thing is
that "next character" iterators in both languages skip to the next word
in UCS-2, so the claim of both parties of UTF-16 support is basically a
marketing lie.
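
The difference between UCS-2 and UTF-16 shows up only for characters
outside the Basic Multilingual Plane.  A small sketch, with
`utf16_code_units` as a made-up helper name; it says nothing about
what any particular Java or Windows iterator actually does.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001D11E"                  # MUSICAL SYMBOL G CLEF, outside the BMP

print("characters:       ", len(clef))
print("UTF-16 code units:", utf16_code_units(clef))
print("UTF-16LE bytes:   ", clef.encode("utf-16-le").hex())
```

An iterator that advances by one 16-bit unit would stop in the middle
of this character, between the two halves of its surrogate pair.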

> No. Things like "iocharset" are not the solution. They are literally the
> _problem_. The solution is to use something that not only acts as ASCII,
[full agreement]

> And that one true format is UTF-8. End of story. If you try to talk to the 
> kernel in UCS-2 or anything else, you _will_ fail.

Just that the kernel does not support UTF-8. It delivers and accepts
non-UTF-8 strings such as \xc0\x80. The kernel clearly should not deliver
broken characters when the official stance is that the linux VFS API is
UTF-8 only (see 3.2, Chapter 3, C12, conformance, on why it currently
isn't UTF-8).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 18:36         ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
@ 2004-02-16 18:49           ` Linus Torvalds
  2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
  2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  0 siblings, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 18:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
> 
> > In short: the kernel talks bytestreams, and that implies that if you want 
> > to talk to the kernel, you HAVE TO USE UTF-8.
> 
> This is not the problem at all. It's perfectly easy to write
> applications that talk UTF-8 and just UTF-8 with the kernel.
> 
> The problem is that the kernel does not use UTF-8, i.e. applications in
> the current linux model have to deal with the fact that the kernel
> happily breaks the assumed protocol of using UTF-8 by delivering illegal
> byte sequences to applications.

You didn't read what I said.

READ MY POSTING. You even quoted it, but you didn't understand it.

I'm saying that "the kernel talks bytestreams".

I have never claimed that the kernel really talks UTF-8, and indeed, I 
would say that such a kernel would be terminally and horribly broken. 

The kernel is _agnostic_ in what it does. As it should be. It doesn't 
really care AT ALL what you feed it, as long as it is a byte-stream.

Now, that implies that if you want to have extended characters, then YOU 
HAVE TO USE UTF-8.

That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. The 
kernel doesn't care one way or the other. As far as the kernel is 
concerned, you could uuencode all the stuff, and the kernel wouldn't 
think you're crazy. The kernel _only_ cares about byte streams. And that 
is as it should be.

> There is no way for applications to handle UTF-8 and illegal-utf8 in
> a sane way, so most apps will either eat the illegal bytes, skip the
> filename, or crash (the latter case is clearly a bug in the app, the
> former cases aren't).

What you're complaining about are bad user applications. It has _zero_ to 
do with the kernel.

> Fixing the VFS to actually enforce what linus claims ("filenames are
> utf-8") is a very good idea, imho.

No. Read my claim again. You obviously do not understand it AT ALL. 

What you suggest would be a horribly idiotic and bad idea. The kernel 
doesn't set policy. The kernel says "this is what I can do, you set 
policy".

And UTF-8 just happens to be the only sane policy for encoding complex 
characters into a byte stream. But it is not the only policy.

Another sane policy is to say "byte streams are latin1". It's not an
acceptable policy for encoding _complex_ characters, but it is a policy.
And it's a perfectly sane one.

In short: filenames are byte streams. Nothing more. They don't even have a 
"character set". They literally are just a series of bytes.

And when I say that you have to talk to the kernel using UTF-8, I'm only 
claiming that it is the only sane way to encode extended characters in a 
byte stream. Nothing more.

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 18:49           ` Linus Torvalds
@ 2004-02-16 19:26             ` Jeff Garzik
  2004-02-16 19:48               ` John Bradford
  2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  1 sibling, 1 reply; 50+ messages in thread
From: Jeff Garzik @ 2004-02-16 19:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Linus Torvalds wrote:
> In short: filenames are byte streams. Nothing more. They don't even have a 
> "character set". They literally are just a series of bytes.
> 
> And when I say that you have to talk to the kernel using UTF-8, I'm only 
> claiming that it is the only sane way to encode extended characters in a 
> byte stream. Nothing more.

Nod.  Maybe it helps Marc to point out the key difference between 
characters and bytes, in UTF8.

In UTF8, the number of characters in a string is less-than-or-equal-to 
the number of bytes in the string.

And the kernel just cares about bytes.

This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
designed such that ten-year-old C code using standard C strings would 
function just fine.  No need to rip up large swaths of your code just to 
call multi-byte versions of the standard string functions.  Most code 
that doesn't deal with locale-specific details like uppercase/lowercase 
Just Works(tm).
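
A small sketch of both points, assuming the input is well-formed
UTF-8.  `count_characters` is a made-up helper; it counts code points
by skipping continuation bytes (those of the form 10xxxxxx).

```python
def count_characters(raw: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for octet in raw if (octet & 0xC0) != 0x80)


name = "na\u00efve.txt".encode("utf-8")

print("bytes:     ", len(name))
print("characters:", count_characters(name))

# Byte-oriented operations on ASCII delimiters keep working unchanged,
# because no octet of a multi-byte sequence is ever an ASCII octet.
stem, extension = name.rsplit(b".", 1)
print(stem, extension)
```

The byte count exceeds the character count by one here because the
single non-ASCII letter takes two bytes.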

	Jeff

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
@ 2004-02-16 19:48               ` John Bradford
  2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:16                 ` Marc Lehmann
  0 siblings, 2 replies; 50+ messages in thread
From: John Bradford @ 2004-02-16 19:48 UTC (permalink / raw)
  To: Jeff Garzik, Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Quote from Jeff Garzik <jgarzik@pobox.com>:
> Linus Torvalds wrote:
> > In short: filenames are byte streams. Nothing more. They don't even have a 
> > "character set". They literally are just a series of bytes.
> > 
> > And when I say that you have to talk to the kernel using UTF-8, I'm only 
> > claiming that it is the only sane way to encode extended characters in a 
> > byte stream. Nothing more.
> 
> 
> Nod.  Maybe it helps Marc to point out the key difference between 
> characters and bytes, in UTF8.
> 
> In UTF8, the number of characters in a string is less-than-or-equal-to 
> the number of bytes in the string.
> 
> And the kernel just cares about bytes.
> 
> This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
> designed such that ten-year-old C code using standard C strings would 
> function just fine.  No need to rip up large swaths of your code just to 
> call multi-byte versions of the standard string functions.  Most code 
> that doesn't deal with locale-specific details like uppercase/lowercase 
> Just Works(tm).

The real problem is with mis-configured userspaces, where buggy UTF-8
decoders are trying to make sense of data in legacy encodings
containing essentially random bytes > 127, which are not part of valid
UTF-8 sequences.

None of this is a real problem, if everything is set up correctly and
bug free.  Unfortunately the Just Works thing falls apart in the,
(frequent), instances that it's not :-(.
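
One common way for user space to cope, sketched under the assumption
that the legacy charset is known or can be guessed (here latin-1,
purely as an example).  `decode_with_fallback` is a hypothetical
helper, not taken from any particular program.

```python
def decode_with_fallback(raw: bytes, legacy: str = "latin-1") -> str:
    """Decode as UTF-8 if the bytes are well-formed, else as the legacy charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


modern = "caf\u00e9".encode("utf-8")      # 63 61 66 c3 a9
legacy = "caf\u00e9".encode("latin-1")    # 63 61 66 e9, not valid UTF-8

print(decode_with_fallback(modern) == decode_with_fallback(legacy))
```

This is only a heuristic: a legacy name can happen to be well-formed
UTF-8, and the fallback charset is still a guess.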

John.

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48               ` John Bradford
@ 2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:16                 ` Marc Lehmann
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 19:48 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, John Bradford wrote:
> 
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
> 
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you 
really have to. Always leave the string in utf-8 "raw bytestring" mode as 
long as possible, and convert to character sets only when actually 
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being erroneous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
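
A minimal sketch of that approach: the program keeps the raw bytes for
every operation and decodes only to produce something printable.
`display_name` is a made-up helper; the "backslashreplace" error
handler is a standard Python one that marks undecodable bytes instead
of failing.

```python
def display_name(raw: bytes) -> str:
    """Return a printable form of a filename; invalid bytes are shown escaped."""
    return raw.decode("utf-8", errors="backslashreplace")


raw = b"caf\xe9.txt"                 # latin-1 name on a UTF-8 system
shown = display_name(raw)

print("shown to the user:", shown)
print("used for lookups: ", raw)     # the program never alters these bytes
```

The user sees a strange but clearly marked name, and any later open,
rename or unlink still uses the original byte sequence.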

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                 ` Linus Torvalds
@ 2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:26                     ` Linus Torvalds
  2004-02-18  2:49                     ` Rob Landley
  2004-02-16 20:21                   ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> works on the raw byte sequence and isn't confused). Basically accept the
> fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
well-defined and is always proper UTF-8. It's a tautology.

The very idea of "UTF-8 with garbage in it" doesn't make sense.

> And no, I'm not claiming that it's wonderfully clean and that we should
> all love it.

It's also a totally useless idiom...

> And it largely works today.
> 		Linus

On ascii-only-systems, it works fine. My system is largely ascii-only,
with only very few filenames (japanese and german ones mostly) in
UTF-8. Sometimes in EUC-JP, but that's a bug in rar.

It also works fine in single-user environments where the user just forces
everything to be in her locale. It does fail miserably on multi-user
systems. It does fail miserably in ISO-C's locale model. It does fail
miserably with gnu shellutils, fileutils and most other apps.

It fails, because it's not at all well supported by the kernel.

Claiming that it largely works today is simply not true for most
non-ascii-users (which increasingly includes the US).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                   ` Marc Lehmann
@ 2004-02-16 20:26                     ` Linus Torvalds
  2004-02-18  2:49                     ` Rob Landley
  1 sibling, 0 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 20:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel



On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
> 
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
> 
> The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
 (a) there is only one standard
 (b) people read it
 (c) people actually follow it and never have bugs

I've got news for you: none of the above is true. 

Which means that IN PRACTICE you will find strings that you think are 
UTF-8-encoded, but that don't end up being proper UTF-8.

That's the difference between real world and theory. 

And you can either write your programs to be "theoretically correct", or 
you can write them to "work".

It's your choice. I know which program I'd prefer to use.

		Linus

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:26                     ` Linus Torvalds
@ 2004-02-18  2:49                     ` Rob Landley
  1 sibling, 0 replies; 50+ messages in thread
From: Rob Landley @ 2004-02-18  2:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linux kernel

On Monday 16 February 2004 14:20, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it
> > up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.

Would you please learn the difference between "you are wrong" and "I 
disagree"?

Rob



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:20                   ` Marc Lehmann
@ 2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:33                     ` Marc Lehmann
  2004-02-18  2:58                     ` H. Peter Anvin
  1 sibling, 2 replies; 50+ messages in thread
From: bert hubert @ 2004-02-16 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Bradford, Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds wrote:

> The way to handle that is to aim to never _ever_ decode utf-8 unless you 
> really have to. Always leave the string in utf-8 "raw bytestring" mode as 
> long as possible, and convert to character sets only when actually 
> printing.

Additional good news is that following octets in a utf-8 character sequence
always have the highest order bit set, precluding / or \x0 from appearing,
confusing the kernel.

The remaining zit is that all these represent '..':
2E 2E
C0 AE C0 AE
E0 80 AE E0 80 AE 
F0 80 80 AE F0 80 80 AE 
F8 80 80 80 AE F8 80 80 80 AE 
FC 80 80 80 80 AE FC 80 80 80 80 AE

This in itself is not a problem, the kernel will only recognize 2E 2E as the
real .., but it does show that 'document.doc' might be encoded in a myriad
ways.
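
The byte sequences listed above can be checked with a deliberately
lenient decoder.  The sketch below is an assumption-laden
illustration: `naive_decode` is a made-up function that follows the
old 1-to-6-byte bit layout and skips the shortest-form check, while
`is_strict_utf8` relies on Python's built-in decoder, which rejects
overlong forms.

```python
def naive_decode(data: bytes) -> list:
    """Decode UTF-8-style sequences WITHOUT rejecting overlong forms."""
    # (prefix bits, shift to isolate them, payload mask, sequence length)
    layouts = [(0b0, 7, 0x7F, 1), (0b110, 5, 0x1F, 2), (0b1110, 4, 0x0F, 3),
               (0b11110, 3, 0x07, 4), (0b111110, 2, 0x03, 5),
               (0b1111110, 1, 0x01, 6)]
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        for prefix, shift, mask, length in layouts:
            if lead >> shift == prefix:
                break
        else:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("bad continuation bytes")
        value = lead & mask
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        out.append(value)
        i += length
    return out


def is_strict_utf8(data: bytes) -> bool:
    """True only for well-formed (shortest-form) UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


DOTDOT_FORMS = [bytes.fromhex(h) for h in (
    "2E2E", "C0AEC0AE", "E080AEE080AE", "F08080AEF08080AE",
    "F8808080AEF8808080AE", "FC80808080AEFC80808080AE")]

for form in DOTDOT_FORMS:
    print(form.hex().ljust(26), naive_decode(form), is_strict_utf8(form))
```

The lenient decoder maps every form to two U+002E characters; the
strict decoder accepts only the first, two-byte one.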

So some guidance about using only the simplest possible encoding might be
sensible, if we don't want the kernel to know about utf-8.

> And it largely works today.

Indeed.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                   ` bert hubert
@ 2004-02-16 20:33                     ` Marc Lehmann
  2004-02-18  2:58                     ` H. Peter Anvin
  1 sibling, 0 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:33 UTC (permalink / raw)
  To: bert hubert; +Cc: linux-kernel

On Mon, Feb 16, 2004 at 09:21:42PM +0100, bert hubert <ahu@ds9a.nl> wrote:
> The remaining zit is that all these represent '..':

No, they don't. Read the UTF-8 definition...

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it can only be encoded in exactly one way *in UTF-8*. It can of course
be encoded differently in other encodings, but in UTF-8, there is only a
single representation. There are no ambiguities.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

Fortunately, this has all already been taken care of, and is not a problem.

I mean, the _definition_ of UTF-8 works. Whether specific applications
(whether in the kernel or apps) work is a different question. But at
least the specification is rather clear.

Compare this to the URL definition, which only hints that you don't know
the encoding, and therefore, the interpretation as text, of a URL unless
you have an extra channel that communicates it.

While possible, this channel does not exist in practice, creating big
problems for people writing i18n-ized web applications.

The thing is that the kernel certainly _works_ on a very basic level, but
I think the situation can be improved by making it clear how to interpret
filenames, which currently is not the case.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:33                     ` Marc Lehmann
@ 2004-02-18  2:58                     ` H. Peter Anvin
  2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  7:25                       ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  2:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040216202142.GA5834@outpost.ds9a.nl>
By author:    bert hubert <ahu@ds9a.nl>
In newsgroup: linux.dev.kernel
> 
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
> 

Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
for "filesystem safe Unicode transformation format."  

> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE 
> F0 80 80 AE F0 80 80 AE 
> F8 80 80 80 AE F8 80 80 80 AE 
> FC 80 80 80 80 AE FC 80 80 80 80 AE

No, they don't.

The first represents "..", the remaining five are illegal encodings and
do not decode to anything.

Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard.  The
only acceptable decodings are to throw an error, or to use an out-of-band
encoding mechanism to denote "bad bytecode."

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it doesn't.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

UTF-8 requires the use of the shortest possible encoding.  An
application which doesn't obey that and tries to be "smart" is a
security hazard.

It is a bit unfortunate that the encoding doesn't exclude these by
design, as opposed to by error checking; that makes the check a little
too easy for clueless programmers to skip :(
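
As a hedged illustration of the "throw an error" behaviour, here is a small
Python sketch that runs the six byte strings quoted above through Python's
strict UTF-8 decoder. The helper name strict_decode is made up for the
example; bytes.fromhex and bytes.decode are standard library calls.

```python
CANDIDATES = [
    "2E 2E",
    "C0 AE C0 AE",
    "E0 80 AE E0 80 AE",
    "F0 80 80 AE F0 80 80 AE",
    "F8 80 80 80 AE F8 80 80 80 AE",
    "FC 80 80 80 80 AE FC 80 80 80 80 AE",
]

def strict_decode(hex_bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return bytes.fromhex(hex_bytes).decode("utf-8")
    except UnicodeDecodeError:
        return None

results = [strict_decode(c) for c in CANDIDATES]
for candidate, result in zip(CANDIDATES, results):
    print(candidate, "->", repr(result))
```

Only the first entry decodes; the overlong forms are rejected rather than
being quietly turned into "..".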

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                     ` H. Peter Anvin
@ 2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18  7:25                       ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:13 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
> Those of us who have been involved with the issue have fought
> *extremely* hard against DWIM decoders which try to decode the latter
> sequences into ".." -- it's incorrect, and a security hazard.  The
> only acceptable decodings is to throw an error, or use an out-of-band
> encoding mechanism to denote "bad bytecode."

Somebody correctly pointed out that you do not need any out-of-band 
encoding mechanism - the very fact that it's an invalid sequence is in 
itself a perfectly fine flag. No out-of-band signalling required.

The only thing you should make sure of is to not try to normalize it (that 
would hide the error). Just keep carrying the bad sequence along, and 
everybody is happy. Including the filesystem functions that get the "bad" 
name and match it exactly to what it should be matched against.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                       ` Linus Torvalds
@ 2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18  3:30                           ` Linus Torvalds
  2004-02-18 11:24                           ` Jamie Lokier
  2004-02-18 11:33                         ` Jamie Lokier
  1 sibling, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
>>Those of us who have been involved with the issue have fought
>>*extremely* hard against DWIM decoders which try to decode the latter
>>sequences into ".." -- it's incorrect, and a security hazard.  The
>>only acceptable decodings is to throw an error, or use an out-of-band
>>encoding mechanism to denote "bad bytecode."
> 
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.
> 
> The only thing you should make sure of is to not try to normalize it (that 
> would hide the error). Just keep carrying the bad sequence along, and 
> everybody is happy. Including the filesystem functions that get the "bad" 
> name and match it exactly to what it should be matched against.
> 

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes.  Consider a UTF-8 decoder which uses
values in the 0x800000xx range to encode "bogus bytes"; that way it
wouldn't alias to anything else, but the bogus sequence "C0 AE" could be
represented as 0x800000C0 0x800000AE and displayed to the user as
\xC0\xAE\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80
C2 AE).  This would make it possible to figure out in, for example, an
ls listing, what those broken filenames are actually composed of.

There are some advantages to being able to represent all possible byte
sequences and present them to the user, even if they're bogus.
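
A rough sketch of the same idea, with one substitution that should be stated
plainly: instead of the 0x800000xx range described above, it uses Python's
"surrogateescape" error handler, which maps each undecodable byte 0xNN to
the lone surrogate U+DC00+0xNN. That is a stand-in for illustration, not the
scheme from this mail.

```python
raw = b"\xc0\xae\xc0\xae"  # overlong, hence invalid, UTF-8

# Every undecodable byte becomes one reserved code point.
text = raw.decode("utf-8", errors="surrogateescape")

# Render the reserved code points as \xNN escapes for display.
shown = "".join(
    "\\x%02X" % (ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
    for ch in text
)
print(shown)

# The original bytes can be recovered exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# The legitimate characters U+00C0 U+00AE encode to different bytes.
legit = "\u00c0\u00ae".encode("utf-8")
```

The bogus bytes never alias with the valid characters, and nothing is lost
on the way back to bytes.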

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                         ` H. Peter Anvin
@ 2004-02-18  3:30                           ` Linus Torvalds
  2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 11:24                           ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:30 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes. 

I'd suggest just doing that when you convert the utf-8 format to printable 
format _anyway_.  At that point you just make the "printable" 
representation be the binary escape sequence (which you have to have for 
other non-printable utf-8 characters anyway).

And if you do things right (ie you allow user input in that same escaped 
output format), you can allow users to re-create the exact "broken utf-8". 
Which is actually important just so that the user can fix it up (ie 
imagine the user noticing that the filename is broken, and now needs to do 
a "mv broken-name fixed-name" - the user needs some way to re-create the 
brokenness).

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:30                           ` Linus Torvalds
@ 2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 10:29                               ` Robin Rosenberg
  2004-02-18 15:35                               ` Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  5:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
>>Well, the reason you'd want an out-of-band mechanism is to be able to
>>display it as some kind of escapes. 
> 
> 
> I'd suggest just doing that when you convert the utf-8 format to printable 
> format _anyway_.  At that point you just make the "printable" 
> representation be the binary escape sequence (which you have to have for 
> other non-printable utf-8 characters anyway).
> 

What does "printable" mean in this context?  Typically you have to 
convert it to UCS-4 first, so you can index into your font tables, then 
you have to create the right composition, apply the bidirectional text 
algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it 
layered.  What I described was the first step of that -- mostly trying
to show that "throwing an error" doesn't necessarily mean "produce no 
output."  What you shouldn't do, though, is alias it with legitimate input.

> And if you do things right (ie you allow user input in that same escaped 
> output format), you can allow users to re-create the exact "broken utf-8". 
> Which is actually important just so that the user can fix it up (ie 
> imagine the user noticing that the filename is broken, and now needs to do 
> a "mv broken-name fixed-name" - the user needs some way to re-create the 
> brokenness).

Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
think this is a good UI for shells to follow.  The \x representation 
then doesn't stand for characters but for bytes.  It may be desirable to 
disallow encoding of *valid* UTF-8 characters this way, though.

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                             ` H. Peter Anvin
@ 2004-02-18 10:29                               ` Robin Rosenberg
  2004-02-18 11:49                                 ` Tomas Szepe
  2004-02-18 15:35                               ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 10:29 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

On Wednesday 18 February 2004 06.30, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> >>Well, the reason you'd want an out-of-band mechanism is to be able to
> >>display it as some kind of escapes. 
> > I'd suggest just doing that when you convert the utf-8 format to printable 
> > format _anyway_.  At that point you just make the "printable" 
> > representation be the binary escape sequence (which you have to have for 
> > other non-printable utf-8 characters anyway).
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described what the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

I think you can use libicu here. Converting to UCS-4 to determine the
character type doesn't mean you will ever have actual strings of UCS-4. It could
be done character by character just for the lookup, so you can keep the out-of-band
error flags internal.

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

Agree. \u8080 I would assume represents a valid character, while \x80\x80\x80\x80
does not. A problem with invalid sequences I just noted is that they break some of 
the nice properties of UTF-8, that people will assume apply, i.e. that you can parse it 
backwards. With UTF-8 (i.e. well-formed utf-8) you can point at a byte and figure "this is
not the first byte", let's skip backwards to find the start. If invalid sequences can ever occur
you must read from the start of the string.
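
A small sketch of the backwards scan, in Python. The function name
char_start is made up; the only fact it relies on is that continuation
bytes have the bit pattern 10xxxxxx.

```python
def char_start(buf, i):
    """Index of the first byte of the UTF-8 character containing buf[i].

    Only meaningful when buf is well-formed UTF-8.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1  # continuation byte, keep walking backwards
    return i

data = "a\u00e9\u20ac".encode("utf-8")  # 61 | C3 A9 | E2 82 AC
print(char_start(data, 5), char_start(data, 2), char_start(data, 0))

# With malformed input the scan gives no useful answer: it walks all
# the way back through a run of stray continuation bytes.
print(char_start(b"\x80\x80\x80\x80", 3))
```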

-- robin


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 10:29                               ` Robin Rosenberg
@ 2004-02-18 11:49                                 ` Tomas Szepe
  2004-02-18 11:59                                   ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 11:49 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 11:29 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

[snip]
> not the first byte", lets skip backwards to find the start. If invalid sequences can ever occur
[snip]

Would you _please_ read the lkml FAQ and stop posting e-mails with lines
longer than 80 characters?  Thank you.


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:49                                 ` Tomas Szepe
@ 2004-02-18 11:59                                   ` Robin Rosenberg
  2004-02-18 12:05                                     ` Tomas Szepe
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 11:59 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> longer than 80 characters?  Thank you.

As soon as someone asks nicely... I thought any decent mail client simply
wrapped the lines. Hmm, remember some old system with 3270 access that
didn't.

I'll try to remember that.

-- robin


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:59                                   ` Robin Rosenberg
@ 2004-02-18 12:05                                     ` Tomas Szepe
  2004-02-18 12:34                                       ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 12:05 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 12:59 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

> On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > longer than 80 characters?  Thank you.
> 
> As soon as someone asks nicely...  I thought any decent mail client simply
> wrapped the lines.

1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.

2)  A mail client that will wrap the lines will make your posts look like this:

<cut>
Having to put up with the existence of Windows day in and out is the reason I'm
still on
an eight-bit encoding.  Sorry for not explaining the REAL problem, but only a
partial
problem. I need to support all kinds of clients on Windows with protocols that  
convey no
character set info. With samba that's no problem. Having to put up with a Unix
world running
<cut>

> I'll try to remember that.

Thanks again.

-- 
Tomas Szepe <szepe@pinerecords.com>


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 12:05                                     ` Tomas Szepe
@ 2004-02-18 12:34                                       ` Robin Rosenberg
  0 siblings, 0 replies; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 12:34 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 13.05, Tomas Szepe wrote:
> On Feb-18 2004, Wed, 12:59 +0100
> Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:
> 
> > On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > > longer than 80 characters?  Thank you.
> > 
> > As soon as someone asks nicely...  I thought any decent mail client simply
> > wrapped the lines.
> 
> 1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.
>
> 2)  A mail client that will wrap the lines will make your posts look like 
this:
> 
> <cut>
> Having to put up with the existence of Windows day in and out is the reason 
I'm
> still on
> an eight-bit encoding.  Sorry for not explaining the REAL problem, but only 
a
> partial
> problem. I need to support all kinds of clients on Windows with protocols 
that  
> convey no
> character set info. With samba that's no problem. Having to put up with a 
Unix
> world running
> <cut>

That's what happens when the sender wraps the lines at column 80 and your
client wraps at 72 (or a similar situation), just another reason not to wrap
when sending and let the user's client do whatever the user thinks is fine.

In order not to wrap and destroy information I have the autowrap feature off
when composing mail, because wrapped and cut stack traces, cuts from log files
etc are a pain.

BTW The 80 character rule is only mentioned with respect to signatures.

-- robin


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 10:29                               ` Robin Rosenberg
@ 2004-02-18 15:35                               ` Linus Torvalds
  2004-02-18 19:47                                 ` Tomas Szepe
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 15:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Kernel Mailing List

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

Not all characters _have_ font entries. And even when they have font 
entries, they may need escaping for other reasons (ie you may want to 
marshall UTF-8 as plain ASCII just because you want to use a portable 
format for transfer).

Think about the simple (hex) string x0A x00. That's a well-defined UTF-8
string, yet if you want to print it as a filename on the console, you
should obviously print it as "\n" or some similar escaped sequence
(actually, that's a bad example, since it's a special case, and it would
probably be better to use the example string x7F x00, which would be shown
as \u007F or something).

The same is true for a _lot_ of perfectly fine UTF-8 sequences, no?

That implies that you have to use an escaped sequence _anyway_. So as you 
go along, turning the string into something printable, you might as well 
escape the invalid UTF-8 sequences.

In other words: you walk the utf-8 string one character at a time, 
converting it to whatever format (eg UCS-4) you have for font lookup, but 
you also escape characters that you don't have font entries for or that 
aren't in proper UTF-8 format.

When converting to UCS-4, you have to check for the proper format
_anyway_, so none of this is in any way "extra work". Instead of just 
aborting on an invalid UTF-8 character, you quote it, exactly the same way 
you'd have to quote a _valid_ one that you can't just show as a string.

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described what the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

Exactly. And since you need an escape sequence anyway, what's the problem?

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

You need to encode even valid UTF-8, since you may not find a font entry 
for the character, or the character just isn't appropriate in that context 
(ie you can't show a newline).

But it makes perfect sense to use a policy of:
 - escape valid UTF-8 characters as '\u7777'
 - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
   '\xC0\x80\x80', whatever)
 - (and, obviously, escape the valid UTF-8 character '\' as '\\').

Don't you agree? It clearly allows all the cases, and you can re-generate 
the _exact_ original stream of bytes from the above (ie it is nicely 
reversible, which in my opinion is a requirement).
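
A minimal sketch of that policy, in Python. Assumptions for the example
only: the names escape_name and unescape_name are made up, printable ASCII
is passed through unescaped and everything else valid is written as \uXXXX
or \UXXXXXXXX (a real tool would consult its fonts or locale instead), and
invalid bytes are detected with Python's "surrogateescape" error handler.

```python
import re

def escape_name(raw):
    """Escape a byte string into printable ASCII, exactly reversibly."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into U+DC00+0xNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02X" % (cp - 0xDC00))  # invalid byte
        elif ch == "\\":
            out.append("\\\\")
        elif 0x20 <= cp < 0x7F:
            out.append(ch)                          # printable ASCII
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)

_TOKEN = re.compile(
    r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
    r"|\\(\\)|([^\\])",
    re.S,
)

def unescape_name(text):
    """Inverse of escape_name: rebuild the original bytes."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError("bad escape at offset %d" % pos)
        byte_hex, u4, u8, backslash, plain = m.groups()
        if byte_hex is not None:
            out.append(int(byte_hex, 16))           # raw byte
        elif u4 is not None or u8 is not None:
            out += chr(int(u4 or u8, 16)).encode("utf-8")
        elif backslash is not None:
            out += b"\\"
        else:
            out += plain.encode("utf-8")
        pos = m.end()
    return bytes(out)

name = b"ok\n\\\xc0\x80\x80\xe2\x82\xac"
escaped = escape_name(name)
print(escaped)
```

Feeding the escaped form back through unescape_name gives the original
bytes, which is what a "mv broken-name fixed-name" would need.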

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 15:35                               ` Linus Torvalds
@ 2004-02-18 19:47                                 ` Tomas Szepe
  2004-02-18 20:01                                   ` H. Peter Anvin
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 19:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Kernel Mailing List

On Feb-18 2004, Wed, 07:35 -0800
Linus Torvalds <torvalds@osdl.org> wrote:

> But it makes perfect sense to use a policy of:
>  - escape valid UTF-8 characters as '\u7777'
>  - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>    '\xC0\x80\x80', whatever)
>  - (and, obviously, escape the valid UTF-8 character '\' as '\\').
> 
> Don't you agree? It clearly allows all the cases, and you can re-generate 
> the _exact_ original stream of bytes from the above (ie it is nicely 
> reversible, which in my opinion is a requirement).

I really really hope this is _exactly_ what we're going to see in practice.

-- 
Tomas Szepe <szepe@pinerecords.com>


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:47                                 ` Tomas Szepe
@ 2004-02-18 20:01                                   ` H. Peter Anvin
  2004-02-18 21:22                                     ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:01 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Linus Torvalds, Kernel Mailing List

Tomas Szepe wrote:
> On Feb-18 2004, Wed, 07:35 -0800
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>But it makes perfect sense to use a policy of:
>> - escape valid UTF-8 characters as '\u7777'

[And e.g. \U00017777 for characters above \uFFFF]

>> - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>>   '\xC0\x80\x80', whatever)
>> - (and, obviously, escape the valid UTF-8 character '\' as '\\').
>>
>>Don't you agree? It clearly allows all the cases, and you can re-generate 
>>the _exact_ original stream of bytes from the above (ie it is nicely 
>>reversible, which in my opinion is a requirement).
> 
> I really really hope this is _exactly_ what we're going to see in practice.
> 

Same here.  This is clearly The Right Thing[TM].

	-hpa



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 20:01                                   ` H. Peter Anvin
@ 2004-02-18 21:22                                     ` Robin Rosenberg
  2004-02-18 21:42                                       ` H. Peter Anvin
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 21:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> [And e.g. \U00017777 for characters above \uFFFF]

Isn't that octal :-)

-- robin


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 21:22                                     ` Robin Rosenberg
@ 2004-02-18 21:42                                       ` H. Peter Anvin
  0 siblings, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 21:42 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

Robin Rosenberg wrote:
> On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> 
>>[And e.g. \U00017777 for characters above \uFFFF]
> 
> Isn't that octal :-)
> 

No.

	-hpa



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18  3:30                           ` Linus Torvalds
@ 2004-02-18 11:24                           ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:24 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

H. Peter Anvin wrote:
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes.

As soon as you go to "display", you need a mechanism to escape lots of
characters, not just malformed UTF-8.  Consider: \u0000, \u001B,
\u0007 and such need to be escaped too.

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  3:22                         ` H. Peter Anvin
@ 2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18 16:47                           ` H. Peter Anvin
  2004-02-18 19:59                           ` Linus Torvalds
  1 sibling, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.

Technically this is almost(*) correct, however a _lot_ of code exists
which assumes logical properties of UTF-8.  (See, for example, the
"stty utf8" patch).

Perl, for example, allows you to pass around invalid sequences in
exactly the way you describe.  It works, right up until you do
something like length() or substr() or a regex match.  Then Perl
screws up the answer, because it sees something like 0xfd and just
assumes it can skip the next 5 bytes, without checking them.

hpa's suggestion that invalid bytes are treated as 0x800000xx works
very nicely, *iff* a program is absolutely consistent about its
treatment of bytes in that way.  When there's a mixture of code which
interprets malformed UTF-8 in different ways, then it's messy and
sometimes a security hazard.

-- Jamie

(*) - It's fine until you concatenate two malformed strings.  Then the
      out-of-band signal is lost if the combination is valid UTF-8.


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                         ` Jamie Lokier
@ 2004-02-18 16:47                           ` H. Peter Anvin
  2004-02-18 19:59                           ` Linus Torvalds
  1 sibling, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 16:47 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel

Jamie Lokier wrote:
> 
> hpa's suggestion that invalid bytes are treated as 0x800000xx works
> very nicely, *iff* a program is absolutely consistent about its
> treatment of bytes in that way.  When there's a mixture of code which
> interprets malformed UTF-8 in different ways, then it's messy and
> sometimes a security hazard.
> 

Absolutely.  It has to be considered very carefully.

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18 16:47                           ` H. Peter Anvin
@ 2004-02-18 19:59                           ` Linus Torvalds
  2004-02-18 20:08                             ` H. Peter Anvin
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 19:59 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: H. Peter Anvin, linux-kernel

On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band 
> > encoding mechanism - the very fact that it's an invalid sequence is in 
> > itself a perfectly fine flag. No out-of-band signalling required.
> 
> Technically this is almost(*) correct,
> 
> (*) - It's fine until you concatenate two malformed strings.  Then the
>       out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says 
"this stuff is wrong, because it was wrong at some point in the past", and 
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
in a variety of circumstances, all basically having to do with lower 
levels not knowing about higher-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes 
as a stream of bytes, with blocking rules that have _nothing_ to do with 
the content (timing, mtu's, extended TCP headers etc etc). That doesn't 
mean that you shouldn't be able to
 - work on the partial results and show them to the user as UTF-8
 - be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea. If 
you get a valid UTF-8 stream as a result of concatenation, you should 
consider that to be the correct behaviour, or you should CHECK BEFOREHAND 
if you think it is illegal.
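
A minimal sketch of the point above, in Python (an arbitrary choice of language; nothing in this thread prescribes it).  It cuts a valid UTF-8 byte stream at a boundary that ignores character structure, much as a transport might.  The helper name is_valid_utf8 and the sample text are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "åäö".encode("utf-8")      # 6 bytes: c3 a5 c3 a4 c3 b6
head, tail = text[:3], text[3:]   # cut in the middle of the second character

print(is_valid_utf8(head))         # False: ends with a dangling lead byte
print(is_valid_utf8(tail))         # False: starts with a continuation byte
print(is_valid_utf8(head + tail))  # True: the byte-level join is valid again
```

Neither chunk is valid on its own, yet the concatenation is, so a sticky "this was bad once" flag on either chunk would give the wrong answer for the joined result.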

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:59                           ` Linus Torvalds
@ 2004-02-18 20:08                             ` H. Peter Anvin
  0 siblings, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel

Linus Torvalds wrote:
> 
> But that's what you _want_. Having a real out-of-band signal that says 
> "this stuff is wrong, because it was wrong at some point in the past", and 
> not allowing concatenation of blocks of utf-8 bytes would be _bad_.
> 

Indeed.  What it does mean, however, is that you have to consider your
concatenation issues if you perform the concatenation in UCS-4 space.
For example, a string that ends in whatever code you have chosen for
<BOGUS-C8> that gets concatenated with <BOGUS-80> needs to get converted
to a valid <U+0200>.  This is of course not an issue if you do the
concatenation in UTF-8 space and don't do round-trip conversion.

None of this is hard, it just takes thinking about rather than
automatically doing the obvious thing.
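
The C8/80 case can be tried out with Python's "surrogateescape" error handler, which maps each undecodable byte 0xNN to the lone surrogate U+DCNN.  That is only an analogy for the <BOGUS-xx> codes: it is not the 0x800000xx scheme, and the function name to_codepoints is made up for this sketch.

```python
def to_codepoints(raw: bytes) -> str:
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF, standing in
    # for the <BOGUS-xx> codes discussed above (an analogy, not hpa's scheme).
    return raw.decode("utf-8", errors="surrogateescape")

a = to_codepoints(b"\xc8")   # one escaped byte
b = to_codepoints(b"\x80")   # another escaped byte
joined = a + b               # concatenation in code-point space

# The naive join keeps two "bogus" code points ...
print([hex(ord(c)) for c in joined])       # ['0xdcc8', '0xdc80']

# ... but the bytes they stand for form the valid sequence C8 80 = U+0200,
# so a round trip through bytes is needed to normalize the result.
raw = joined.encode("utf-8", errors="surrogateescape")
normalized = to_codepoints(raw)
print([hex(ord(c)) for c in normalized])   # ['0x200']
```

Concatenating the raw bytes directly, without ever leaving UTF-8 space, avoids the extra normalization step.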

> The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
> in a variety of circumstances, all basically having to do with lower 
> levels not knowing about higher-level concepts.

Indeed.

	-hpa


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                     ` H. Peter Anvin
  2004-02-18  3:13                       ` Linus Torvalds
@ 2004-02-18  7:25                       ` bert hubert
  1 sibling, 0 replies; 50+ messages in thread
From: bert hubert @ 2004-02-18  7:25 UTC (permalink / raw)
  To: linux-kernel

On Wed, Feb 18, 2004 at 02:58:42AM +0000, H. Peter Anvin wrote:

> Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
> for "filesystem safe Unicode transformation format."  

That might explain a few things.

> > F8 80 80 80 AE F8 80 80 80 AE 
> > FC 80 80 80 80 AE FC 80 80 80 80 AE
> 
> No, they don't.

Serves me right for trusting a random site, apologies. 
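
For anyone who wants to check sequences like the two quoted above, here is a small sketch using Python's strict UTF-8 decoder.  It only shows what that one decoder does with these bytes; the helper name strict_decode is made up for illustration.

```python
samples = [
    bytes.fromhex("F8 80 80 80 AE"),
    bytes.fromhex("FC 80 80 80 80 AE"),
]

def strict_decode(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

for raw in samples:
    # Both sequences are rejected: F8 and FC are not valid UTF-8 lead bytes.
    print(raw.hex(), "->", strict_decode(raw))
```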

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48               ` John Bradford
  2004-02-16 19:48                 ` Linus Torvalds
@ 2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
                                     ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:16 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Linus Torvalds, viro, Linux kernel

On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> Quote from Jeff Garzik <jgarzik@pobox.com>:
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

And this is the whole point.

BTW, to people trying to explain some properties of UTF-8 to me. I don't
think ad-hominem attacks like assuming that I don't understand UTF-8
(without any indication that this is so) are useful.

The point here is that the kernel does, in a very narrow interpretation,
not support the use of UTF-8, because proper support of UTF-8 means that
no illegal byte sequences will be produced.

Of course, I can feed the kernel UTF-8, and if everybody does that, it
will generally work quite fine. However, Windows surely works fine if
every program only feeds allowed values into system calls. And even unix
dialects without memory protection work, as long as everybody plays
fair.

The point is, however, that this is highly undesirable, and it would be
nice to have a kernel that would (optionally) fully support a UTF-8
environment in which applications can feed UTF-8 and _expect_ UTF-8 in
return, which _is_ a security issue.

It's very desirable to have a kernel that actively supports this. It is
clearly not _required_, of course. But then again, process abstraction
is also not required...

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
@ 2004-02-16 20:20                   ` Jeff Garzik
  2004-02-16 21:10                   ` viro
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Jeff Garzik @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Linus Torvalds, viro, Linux kernel

Marc Lehmann wrote:
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

Incorrect.  Byte stream transports need not care about their contents.

The only places that need to care about illegal UTF8 byte sequences are 
things like CONFIG_NLS_UTF8.

	Jeff





* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
@ 2004-02-16 21:10                   ` viro
  2004-02-17  7:18                   ` jw schultz
  2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: viro @ 2004-02-16 21:10 UTC (permalink / raw)
  To: John Bradford, Jeff Garzik, Linus Torvalds, Linux kernel

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8
> environment in which applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.
> 
> It's very desirable to have a kernel that actively supports this. It is
> clearly not _required_, of course. But then again, process abstraction
> is also not required...

Mind taking the demagogy elsewhere?  Note that the same handwaving applies
to e.g. file contents.  Care to explain what makes read() and write()
different in that respect?


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
  2004-02-16 21:10                   ` viro
@ 2004-02-17  7:18                   ` jw schultz
  2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: jw schultz @ 2004-02-17  7:18 UTC (permalink / raw)
  To: Linux kernel; +Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> > Quote from Jeff Garzik <jgarzik@pobox.com>:
> > None of this is a real problem, if everything is set up correctly and
> > bug free.  Unfortunately the Just Works thing falls apart in the,
> > (frequent), instances that it's not :-(.
>       
> And this is the whole point.
> 
> BTW, to people trying to explain some properties of UTF-8 to me. I don't
> think ad-hominem attacks like assuming that I don't understand UTF-8
> (without any indication that this is so) are useful.
> 
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

That "interpretation" is so narrow as to be unrealistic.
The kernel supports UTF-8 the same way a stage supports
rock musicians.  You confuse support with enforce, rather
like confusing tolerance with endorsement.

And it should be noted that the kernel doesn't produce file
names.  It only passes them along.

> Of course, I can feed the kernel UTF-8, and if everybody does that, it
> will generally work quite fine. However, Windows surely works fine if
> every program only feeds allowed values into system calls. And even unix
> dialects without memory protection work, as long as everybody plays
> fair.
>
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8

You mean enforce again.  That enhancement request has been
rejected repeatedly because such a thing would be highly
undesirable.  What might be a convenient but unnecessary
restriction today is too likely to become an unbearable
restriction tomorrow.  I don't want the kernel to have to
care about what is or isn't valid UTF-8.  I certainly don't
want to have the kernel loaded with outdated character
tables.

> environment in which applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.

I want an environment where applications can feed bytestreams
and expect the same bytestream in return.  I see enough
problems as a result of filesystems that don't do that.

> It's very desirable to have a kernel that actively supports this. It is

You mean enforces again.  Kernel as police, next thing you
will want is a kernel that prevents undesirable character
sequences.

> clearly not _required_, of course. But then again, process abstraction
> is also not required...

I'll tell you what.  Patch libc.  You can add UTF-8 filename
enforcement to libc.  There are only a few system calls that
would need to have their wrappers enlarged.  I'm sure the
libc people will direct you to someplace very warm if you
ask them for this enhancement.
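
To make the "enlarged wrapper" idea concrete, here is a rough userspace sketch in Python rather than in libc itself.  The names check_utf8_name and open_utf8_only are hypothetical; no such wrappers exist in any libc that I know of.

```python
import os

def check_utf8_name(path: bytes) -> bytes:
    """Raise ValueError unless path is strict UTF-8; otherwise return it unchanged."""
    try:
        path.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {path!r}") from exc
    return path

def open_utf8_only(path: bytes, flags: int, mode: int = 0o666) -> int:
    # Hypothetical wrapper: validate in userspace, then hand the untouched
    # bytes to the real system call.  The kernel still sees a byte stream.
    return os.open(check_utf8_name(path), flags, mode)
```

The check lives entirely above the system call boundary, so programs that do not want it simply keep calling the unwrapped interface.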

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
                                     ` (2 preceding siblings ...)
  2004-02-17  7:18                   ` jw schultz
@ 2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2004-02-17  7:42 UTC (permalink / raw)
  To: Marc Lehmann
  Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro, Linux kernel



Marc Lehmann wrote:

>On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
>
>>Quote from Jeff Garzik <jgarzik@pobox.com>:
>>None of this is a real problem, if everything is set up correctly and
>>bug free.  Unfortunately the Just Works thing falls apart in the,
>>(frequent), instances that it's not :-(.
>>
>      
>And this is the whole point.
>
>BTW, to people trying to explain some properties of UTF-8 to me. I don't
>think ad-hominem attacks like assuming that I don't understand UTF-8
>(without any indication that this is so) are useful.
>
>The point here is that the kernel does, in a very narrow interpretation,
>not support the use of UTF-8, because proper support of UTF-8 means that
>no illegal byte sequences will be produced.
>
>

So does the kernel support the English language? Does your
email client?



* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 18:49           ` Linus Torvalds
  2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
@ 2004-02-16 20:03             ` Marc Lehmann
  2004-02-16 20:23               ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, Linux kernel

On Mon, Feb 16, 2004 at 10:49:48AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > The problem is that the kernel does not use UTF-8, i.e. applications in
> > the current linux model have to deal with the fact that the kernel
> > happily breaks the assumed protocol of using UTF-8 by delivering illegal
> > byte sequences to applications.
> 
> You didn't read what I said.

I read it.

> READ MY POSTING. You even quoted it, but you didn't understand it.

You were able to explain it clearly enough for me, I think.

> I'm saying that "the kernel talks bytestreams".

And I am saying that this is not good, which is my sole point.

> I have never claimed that the kernel really talks UTF-8, and indeed, I 
> would say that such a kernel would be terminally and horribly broken. 

And I'd say such a kernel would be highly useful, as it would standardize
the encoding of filenames, just as unix standardizes on "mostly ascii"
(i.e. the SuS).

However, just as POSIX is a nice but very limited base, (mostly) ASCII is a nice
and very limited base. UTF-8 would also be a good base.

8-bit bytes as filenames is not a good base, however, since they enforce
a different layer of interpretation between the user and the kernel, and
this interpretation cannot be based on the locale nor the filesystem
itself, as there is no way to find out what encoding the filename is in.

8-bit bytes is convenient, but not useful for i18n environments. In the
past, it was also convenient and nobody cared, since everything was
either 8-bit or double-byte, and nobody exchanged files.

This, however, is going to change, and the current methodology of "just
guess, you might be right" is a hindrance to this.

> The kernel is _agnostic_ in what it does.

No, it's not. If at all, the kernel specifies a specially-interpreted
(ascii sans / and \0) byte-stream, as you say yourself.

However, just as with URLs (which are byte-streams, too), byte-streams are
useless to store text. You need bytestreams + known encoding.

If filenames were not names, but just binary id's I would agree, but
this is not at all how filenames are used nor how their use is applied.

Filenames are composed of text, but the kernel gives no indication
on how to interpret this text, and as a matter of fact, nothing else
gives this indication. glib etc. uses G_BROKEN_FILENAMES to force
locale-encoding. But as others have said, one man's locale is unlike
another man's locale.

> really care AT ALL what you feed it, as long as it is a byte-stream.
> 
> Now, that implies that if you want to have extended characters, then YOU 
> HAVE TO USE UTF-8.

You say so, but there is no logical connection between these two
statements. I can store latin1 easily in a bytestream, as I can store
iso-2022-jp or euc-jp. But they are incompatible with UTF-8.

You are yelling at me for no good reason. "YOU HAVE TO USE UTF-8". Why
should this be? The kernel certainly doesn't enforce this. Even you claim
that I don't have to, as the kernel doesn't care.

However, if you think so violently that it has to be UTF-8 that you even
yell it, then why doesn't the kernel comply with this rule? Why should an
application "HAVE TO" use utf-8 for input when the kernel doesn't even
try to comply and hands out illegal output?

This is just like mmap sometimes returning a page number and sometimes a
byte address... this would also not be useful unless you know the unit
that mmap returns (addresses in multiples of 1).

> That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. 

But you are saying that you have to feed UTF-8 into the kernel, which is
not the case either. I certainly don't have to..., and, what's worse,
you haven't given any indication of why one has to. Just because you say
so? Or is there actually a reason? If there is a reason, why doesn't the
kernel, in return, also follow this reasoning?

> kernel doesn't care one way or the other. As far as the kernel is 

It doesn't. But the point is that it should. If the kernel would do
everything we want it to do there would be no point in enhancing it.

> concerned, you could uuencode all the stuff, and the kernel wouldn't 

(Yes, because uuencode has the peculiar property of generating neither \0
nor /. It doesn't work in general with byte-streams).

> think you're crazy. The kernel _only_ cares about byte streams.

The kernel _interprets_ the byte-stream already. And some byte-streams are
no valid filenames _already_.

> > There is no way for applications to handle UTF-8 and illegal-utf8 in
> > a sane way, so most apps will either eat the illegal bytes, skip the
> > filename, or crash (the latter case is clearly a bug in the app, the
> > former cases aren't).
> 
> What you're complaining about are bad user applications. It has _zero_ to 
> do with the kernel.

Could you elaborate on why these apps are bad? What I am interested in
is how to fix them, since there is simply no way to interpret
the names returned by the kernel as the corresponding meta-information
is missing.

Consider an OS that allows different characters for path separators (unix
only allows '/'). Without the knowledge of the path separator it would be
impossible to interpret paths. And without knowledge about encoding it's
impossible (but slightly less dramatic) to correctly interpret filenames.

> > Fixing the VFS to actually enforce what linus claims ("filenames are
> > utf-8") is a very good idea, imho.
> 
> No. Read my claim again. You obviously do not understand it AT ALL. 

...

> What you suggest would be a horribly idiotic and bad idea.

Why?

> The kernel doesn't set policy. The kernel says "this is what I can do,
> you set policy".

Exactly. The kernel could specify the API to use UTF-8. This is not more
policy than it currently enforces.

Or do you suggest that the ability to change the source and replace all
occurrences of '/' by '\\' means that '/' is not enforced as policy on
path separators?

We basically seem to disagree on what, exactly, policy is. Policy (to
me) is something that differentiates between several incompatible
alternatives. Choosing policy means to rule out other (useful)
alternatives.

One could argue that '/' is a policy because it precludes the '/'
character from being used in filenames, something some filesystems or
operating systems support.

I'd say (probably as much as you) that this policy is not a real policy,
it's just idiotic.

But enforcing other restrictions on filenames should magically be real
policy? This is obviously not idiotic at all, and should be carefully
explored.

> And UTF-8 just happens to be the only sane policy for encoding complex 
> characters into a byte stream. But it is not the only policy.

Just as '/' is not the only possible path separator. If that is your
point, you should explain why enforcing this is ok while supporting utf-8
(not enforcing, just supporting, meaning having the ability to rule out
non-utf-8 sequences when the admin wants this) is not.

> Another sane policy is to say "byte streams are latin1". It's not an
> acceptable policy for encoding _complex_ characters, but it is a policy.
> And it's a perfectly sane one.

I agree that it is sane. But it is not very useful for the future, as
people who want russian filenames are plainly unable to use the other
filenames in a sensible way. There is no way to know the encoding.

> In short: filenames are byte streams. Nothing more.

Right now, they aren't. Not all sequences of bytes are valid filenames
already, and I think this is perfectly o.k.

> And when I say that you have to talk to the kernel using UTF-8, I'm only 
> claiming that it is the only sane way to encode extended characters in a 
> byte stream. Nothing more.

And I fully agree.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
@ 2004-02-16 20:23               ` Linus Torvalds
  2004-02-16 22:26                 ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 20:23 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
> 
> > I'm saying that "the kernel talks bytestreams".
> 
> And I am saying that this is not good, which is my sole point.

Fair enough. 

However, that's where the unix philosophy comes in. The unix philosophy 
has always been to not try to understand the data that the user passes 
around - and that "everything is a bytestream" is very much encoded in the 
basic principles of how unix should work.

That agnosticism has a lot of advantages. It literally means that the
basic operating system doesn't set arbitrary limitations, which means that
you can do things that you couldn't necessarily otherwise easily do.

It does mean that you can do "strange" things too, and it does mean that 
user space basically has a lot of choice in how to interpret those byte 
streams.

And yes, it can cause confusion. You don't like the confusion, so you 
argue that it shouldn't be allowed. It's a valid argument, but it's an 
argument that assumes that choice is bad.

If you want to _force_ everybody to use UTF-8, then yes, the kernel could 
enforce that readdir() would never pass through a broken UTF-8 string, and 
all the path lookup functions also would never accept a broken string. It's 
not technically impossible to do, although it would add a certain amount 
of pain and overhead.

But the thing is, not everyone uses UTF-8. The big distributions have only 
recently started moving to UTF-8, and it will take _years_ before UTF-8 is 
ubiquitous. And even then it might be the wrong thing to disallow clever 
people from doing clever things. Encoding other information in filenames 
might be proper for a number of applications.

> And I'd say such a kernel would be highly useful, as it would standardize
> the encoding of filenames, just as unix standardizes on "mostly ascii"
> (i.e. the SuS).

It would also be very painful, since it would mean that when you mount an 
old disk, you may be totally unable to read the files, because they have 
filenames that such a kernel would never accept.

> > The kernel is _agnostic_ in what it does.
> 
> No, it's not. If at all, the kernel specifies a specially-interpreted
> (ascii sans / and \0) byte-stream, as you say yourself.
> 
> However, just as with URLs (which are byte-streams, too), byte-streams are
> useless to store text. You need bytestreams + known encoding.

You don't "need" a known encoding. The kernel clearly doesn't need one. 
It's a container, and the encoding comes from the outside. 

And that's what I mean by agnostic - you can make your own encoding. 

Most of the time (but not always) these days UTF-8 is the only sane 
encoding to use. But let people do what they want to do.

Choice is _inherently_ good. Trying to force a world-view is bad. You 
should be able to tell people what they should do to avoid confusion ("use 
UTF-8"), but you should not _force_ them to that if they have good reasons 
not to (and "backwards compatibility" is a better reason than just about 
anything else).

> But you are saying that you have to feed UTF-8 into the kernel, which is
> not the case either.

No. I'm saying that
 (a) "if you want to use complex character sets"
then 
 (b) "you really have to use UTF-8"
to talk to the kernel.

Note the two parts. You're hung up on (b), while I have tried to make it 
clear that (a) is a prerequisite for (b).

Not everybody cares about (a). There are still people who use extended 
ASCII, simply because they DO NOT CARE about complex character sets. And 
if they don't care, and (a) isn't true, then (b) has no meaning any more.

(In all fairness, some people will disagree with (b) even when (a) is true
and like things like UCS-2. Those people are crazy, but I guess I'd just
mention that possibility anyway).

And this is why I say that the kernel only cares about byte streams, and
having it filter to only accept proper UTF-8 sequences would be a horribly
bad idea. Because it _assumes_ (a). That's what "making policy" is all
about. The kernel should not assume that everybody cares about complex
character sets.

This may change, btw. I'm nothing if not pragmatic. In another twenty
years, maybe everybody _literally_ uses complex character sets, and this
whole discussion is totally silly, and the kernel may enforce UTF-8 or
Klingon or whatever. At some point assumptions become _so_ ingrained that
they are no longer policy any more, they are just "fact".

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 20:23               ` Linus Torvalds
@ 2004-02-16 22:26                 ` Jamie Lokier
  2004-02-16 22:40                   ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-16 22:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> It would also be very painful, since it would mean that when you mount an 
> old disk, you may be totally unable to read the files, because they have 
> filenames that such a kernel would never accept.

Alas, once userspace has migrated to doing everything in UTF-8, you
won't be able to read those files because userspace will barf on them.

Then you'll be glad to have a mount option which converts iso-8859-1
to UTF-8 :)  (Even if the old disk is actually not iso-8859-1, at least
you'll be able to read its mangled filenames, rather than userspace
tripping over them).
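
A sketch of why such a conversion can never fail, written in Python for illustration.  The function name latin1_to_utf8 is made up, and this says nothing about how a real mount option would be implemented.

```python
def latin1_to_utf8(name: bytes) -> bytes:
    # Every byte 0x00..0xFF is a Latin-1 code point, so this never fails,
    # even when the name was really written in some other 8-bit encoding.
    return name.decode("latin-1").encode("utf-8")

old_name = b"\xe5\xe4\xf6"         # "åäö" as stored by a Latin-1 system
print(latin1_to_utf8(old_name))    # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# A KOI8-R name converts too; the result is valid UTF-8 but mangled text.
koi8_name = "файл".encode("koi8-r")
mangled = latin1_to_utf8(koi8_name).decode("utf-8")
print(mangled != "файл")           # True: readable bytes, wrong letters
```

The conversion is reversible (decode as UTF-8, encode as Latin-1), so the original bytes are not lost even when the guess about the encoding was wrong.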

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:26                 ` Jamie Lokier
@ 2004-02-16 22:40                   ` Linus Torvalds
  2004-02-17  7:14                     ` Lehmann 
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 22:40 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, Jamie Lokier wrote:
> 
> Alas, once userspace has migrated to doing everything in UTF-8, you
> won't be able to read those files because userspace will barf on them.

Nope. Read my other email. Done right, user space will _not_ barf on them, 
because it won't try to "normalize" any UTF-8 strings. If the string has 
garbage in it, user space should just pass the garbage through.

We've had this _exact_ issue before. Long before people worried about
UTF-8, people worried about the fact that programs like "ls" shouldn't
print out the extended ASCII characters as-is, because that would cause
bad problems on a terminal as they'd be seen as terminal control
characters.

Does that mean that unix tools like "rm" cannot remove those files? Hell 
no! It just means that when you do "rm -i *", the filename that is printed 
may not have special characters in it that you don't see.

Same goes for UTF-8. A "broken" UTF-8 string (ie something that isn't 
really UTF-8 at all, but just extended ASCII) won't _print_ right, but 
that doesn't mean that the tools won't work. You'll still be able to edit 
the file.

Try it with a regular C locale. Do a simple

	echo > åäö

(that's latin1), and do a "rm -i åäö", and see what it says. 

Right: it does the _right_ thing, and it prints out:

	torvalds@home:~> rm -i åäö
	rm: remove regular file `\345\344\366'? 

In other words, you have a program that doesn't understand a couple of the 
characters (because they don't make sense in its "locale"), but it still 
_works_. It just can't print them.

Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
program should do when it sees broken UTF-8. It can still access the file, 
it can still do everything else with it, but it can't print out the 
filename, and it should use some kind of escape sequence to show that 
fact.
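
One possible shape for that escaping, sketched in Python.  The function name escape_filename is made up, octal escapes are chosen only to match the rm output quoted above, and control characters are deliberately left alone to keep the sketch short.

```python
def escape_filename(name: bytes) -> str:
    """Render a filename for display, escaping bytes that are not valid UTF-8."""
    out = []
    # "surrogateescape" turns each undecodable byte 0xNN into U+DCNN, which
    # lets us find those bytes again and print them as octal escapes.
    for ch in name.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\%03o" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

print(escape_filename(b"\xe5\xe4\xf6"))                    # \345\344\366
print(escape_filename("åäö".encode("utf-8")) == "åäö")     # True
```

Only the displayed form changes; the program keeps the original bytes for open(), unlink() and friends, so the file stays fully accessible.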

The two cases are 100% equivalent. We've gone through this before. There 
is a bit of pain involved, but it's not something new, or something 
fundamentally impossible. It's very straightforward indeed.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-16 22:40                   ` Linus Torvalds
@ 2004-02-17  7:14                     ` Lehmann 
  2004-02-17 11:20                       ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
  2004-02-17 15:56                       ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: Lehmann  @ 2004-02-17  7:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> Try it with a regular C locale. Do a simple
> 
> 	echo > åäö

Just for your info, though. You can't even input these characters in a C
locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
C functions will barf on this one). C is 7 bit only.

> Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> program should do when it sees broken UTF-8.

The problem is that the very common C language makes it a pain to use
this in i18n programs. Multibyte functions or iconv will not accept
these, so programs wanting to do what you are expecting them to do need to
re-implement most if not all of the character handling of your typical
libc.

Yes, it's possible....

> The two cases are 100% equivalent. We've gone through this before. There 
> is a bit of pain involved, but it's not something new, or something 
> fundamentally impossible. It's very straightforward indeed.

The "bit" is enormous, as you can't use your libc for text processing
anymore.

Yes, it works in non-i18n programs, but right now most programs get
i18n support, which means they will all fail to properly handle
non-locale characters.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17  7:14                     ` Lehmann 
@ 2004-02-17 11:20                       ` Helge Hafting
  2004-02-17 15:56                       ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
  1 sibling, 0 replies; 50+ messages in thread
From: Helge Hafting @ 2004-02-17 11:20 UTC (permalink / raw)
  Cc: Linus Torvalds, Jamie Lokier, Marc Lehmann, viro, Linux kernel

Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>Try it with a regular C locale. Do a simple
>>
>>	echo > åäö
> 
> 
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.
> 
> 
>>Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
>>program should do when it sees broken UTF-8.
> 
> 
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.
> 
> Yes, it's possible....

All you need is a possible_garbage_to_properly_escaped_utf8(char *string)
in libc.  Any program that wants to display filenames it got
straight from readdir (or any binary file contents) will simply feed
the string through that and get back a string with
escapes for anything that isn't utf8.  It is a write-once, use
everywhere thing.
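
A minimal Python sketch of what such an escaping helper might do. This is
an illustration only: the name escape_invalid_utf8 is hypothetical, it is
not an existing libc or Python function, and Python stands in for C here.
It relies on the standard "backslashreplace" error handler, which renders
each undecodable byte as a \xNN escape and leaves valid UTF-8 untouched.

```python
def escape_invalid_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8; bytes that are not valid UTF-8 become \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


# A name mixing valid UTF-8 ("café") with stray latin1 bytes for "åäö".
mixed_name = b"caf\xc3\xa9-\xe5\xe4\xf6.txt"
print(escape_invalid_utf8(mixed_name))
```

Note that this sketch does not escape literal backslashes or control
characters, so a real helper would need to handle those as well before its
output could be called unambiguous.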

Once upon a time, there were serious problems when someone created
filenames like "; rm -fr *".  Today we use tab completion
and get bash to present the filename with proper escapes.  It is then harmless.
Bad utf8 can be handled the same way.

> The "bit" is enormous, as you can't use your libc for text processing
> anymore.

Not the current libc, but libc can be improved upon. The same happened to
silly code that wasn't 8-bit clean.

Helge Hafting

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17  7:14                     ` Lehmann 
  2004-02-17 11:20                       ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
@ 2004-02-17 15:56                       ` Linus Torvalds
       [not found]                         ` <20040217161111.GE8231@schmorp.de>
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-17 15:56 UTC (permalink / raw)
  To: Marc; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel

On Tue, 17 Feb 2004, Marc wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > Try it with a regular C locale. Do a simple
> > 
> > 	echo > åäö
> 
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.

Ehh.. It's pointless to tell me that I can't do it. I just did.

The C locale is _not_ 7-bit only. The C locale is the traditional "byte 
locale" for UNIX. It will happily collate 8-bit-characters in their 
(numerical) order. Anything else would be seriously broken.
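
As a rough illustration of byte-order collation, the Python sketch below
sorts latin1-encoded names by raw byte value. It only mimics the ordering
described above; it does not call the C library's collation functions, and
the sample names are made up for the example.

```python
# Names encoded as latin1, so each character is exactly one byte.
names = [s.encode("latin-1") for s in ["z", "å", "a", "ö", "ä"]]

# Python compares bytes objects by numeric byte value, which is the
# ordering a pure "byte locale" would use.
byte_order = sorted(names)
print(byte_order)
```

The 8-bit characters simply sort after all of ASCII, in numeric order,
with no language-specific rules applied.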

> > Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> > program should do when it sees broken UTF-8.
> 
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.

These are all teething problems. The thing is, true multi-locale programs
haven't been around long enough that people take the problems for granted.  
A lot of them work today, but "work" is different from "always does the
right thing". These things take a _long_ time for people to sort out the
full implications of.

(Analogy time: how many people _still_ use "find ... | xargs xxx", even
though that can lead to problems and is thus wrong?  You should really use
"find ... -print0 | xargs -0 xxx" to get it _right_, but most people
ignore that, because the common form works for most cases.)
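
To make the analogy concrete, here is a small Python sketch of why the two
forms differ. It is a simplification: real xargs also interprets quotes
and backslashes in its default mode, which this ignores, and the helper
names and sample filenames are hypothetical.

```python
def split_whitespace(stream: bytes):
    """Roughly what plain xargs does: split on blanks and newlines."""
    return stream.split()


def split_nul(stream: bytes):
    """What xargs -0 does: split on NUL bytes only."""
    return [name for name in stream.split(b"\0") if name]


names = [b"report.txt", b"my notes.txt", b"odd\nname"]
newline_stream = b"\n".join(names) + b"\n"   # like: find ...
nul_stream = b"\0".join(names) + b"\0"       # like: find ... -print0

print(split_whitespace(newline_stream))  # names with spaces/newlines are torn apart
print(split_nul(nul_stream))             # names come back intact
```

NUL is the one byte that cannot appear in a path component, which is why
it is the only safe separator.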

The process is complicated by the fact that most of the people who really 
care about UTF-8 and locales are very strict about it: they have been 
hitting their heads against latin1 users for a long time, and they are 
frustrated and _tired_ of it, and so they often hate single-byte usage 
with a passion, and consider it not only wrong but EVIL. Which is 
obviously silly, but hey, I understand why they can feel a bit put off by 
the problem.

So the multi-byte people often stare at the standards, and then _refuse_
to touch anything that isn't standards-compliant. When they see something
incorrect, they'd rather dump core (or just truncate it) than try to
handle it gracefully, because they want the whole world to see how
incorrect it is.

Which flies in the face of "Be strict in what you generate, be liberal in 
what you accept". A lot of the functions are _not_ willing to be liberal 
in what they accept. Which sometimes just makes the problem worse, for no 
good reason.

The fact is, you shouldn't use "iconv()" unless you controlled the input.
It's a bit like "gets()" - unsafe to use unless you generated the damn
thing yourself and you _know_ it fits in the buffer. But we just don't 
have the functions (yet) to do it _right_, and to escape the input some 
way (yeah, yeah, I know you can do it with iconv() and a lot of cruft 
around it - the point is that nobody does it, because it's too painful).
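
The difference between a strict and a liberal decoder can be sketched in
Python. This uses Python's own codec machinery, not iconv(), and the
function names are hypothetical. The "surrogateescape" error handler maps
each undecodable byte to a reserved code point so the original bytes can
be recovered exactly later.

```python
def try_strict(raw: bytes):
    """Strict decoding: refuse the whole input if any byte is invalid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def to_text(raw: bytes) -> str:
    """Liberal decoding: keep invalid bytes as escaped placeholders."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: restores the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"ok-\xe5\xe4\xf6"          # latin1 bytes, not valid UTF-8
print(try_strict(raw))            # strict path gives up
print(to_bytes(to_text(raw)) == raw)
```

The liberal path accepts anything and still generates exactly what it was
given, which is the property a filename-handling program needs.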

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

[parent not found: <20040217161111.GE8231@schmorp.de>]

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
       [not found]                         ` <20040217161111.GE8231@schmorp.de>
@ 2004-02-17 16:32                           ` Linus Torvalds
  2004-02-17 16:46                             ` Jamie Lokier
  2004-02-17 16:54                             ` Stefan Smietanowski
  0 siblings, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-17 16:32 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Jamie Lokier, viro, Linux kernel

On Tue, 17 Feb 2004, Marc Lehmann wrote:
> 
> Because there is a fundamental difference between file contents and
> filenames. Filenames are supposed to be text.

I think this is actually the fundamental point where we disagree.

You think of filenames as something the user types in, and that is 
"readable text". And I don't.

I think the filenames are just ways for a _program_ to look up stuff, and
the human readability is a secondary thing (it's "polite", but not a
fundamental part of their meaning).

So the same way I think text is good in config files and I dislike binary
blobs (hey, look at /proc), I think readable filenames are good. But that
doesn't mean that they have to be readable. I can well imagine encoding
meta-data in the filename for some database that uses the filesystem as
its backing store and generates files for large blobs. And then there
would be little if any "goodness" to keeping the filenames readable.

That's also a situation where case-insensitivity can _really_ screw you
(just one of the many).

It may be rare, but unlike you, I don't think there is anything "wrong" 
with considering path components to be just "data".

			Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
  2004-02-17 16:32                           ` Linus Torvalds
@ 2004-02-17 16:46                             ` Jamie Lokier
  2004-02-17 19:00                               ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
  2004-02-17 16:54                             ` Stefan Smietanowski
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 16:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).

Politeness is nice.  I'm sure there's a pragmatic reason most
filenames are meaningful text in some human language :)

I'd like a way to type something like "touch zöe.txt" on an ordinary
latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)
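
The translation being asked for amounts to the following Python sketch.
It is an assumption-laden illustration: the function names are
hypothetical, and where the conversion should live (shell, terminal layer,
or elsewhere) is exactly what is under discussion.

```python
def terminal_to_fs(typed: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Convert bytes typed on the terminal into a UTF-8 filename."""
    return typed.decode(terminal_encoding).encode("utf-8")


def fs_to_terminal(name: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Convert a UTF-8 filename for display; unrepresentable characters become '?'."""
    return name.decode("utf-8").encode(terminal_encoding, errors="replace")


typed = b"z\xf6e.txt"                 # "zöe.txt" as a latin1 terminal sends it
stored = terminal_to_fs(typed)
print(stored)                         # b'z\xc3\xb6e.txt'
print(fs_to_terminal(stored) == typed)
```

The display direction is lossy for any character the terminal's encoding
cannot represent, so names containing such characters cannot be typed back
from what is shown.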

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:46                             ` Jamie Lokier
@ 2004-02-17 19:00                               ` Måns Rullgård
  2004-02-17 20:57                                 ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Måns Rullgård @ 2004-02-17 19:00 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Linus Torvalds wrote:
>> I think the filenames are just ways for a _program_ to look up stuff, and
>> the human readability is a secondary thing (it's "polite", but not a
>> fundamental part of their meaning).
>
> Politeness is nice.  I'm sure there's a pragmatic reason most
> filenames are meaningful text in some human language :)
>
> I'd like a way to type something like "touch zöe.txt" on an ordinary
> latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)

Then hack either bash (or whatever shell you use) or touch to do just that.

-- 
Måns Rullgård
mru@kth.se


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 19:00                               ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
@ 2004-02-17 20:57                                 ` Jamie Lokier
  2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:23                                   ` Matthew Kirkwood
  0 siblings, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:57 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

Måns Rullgård wrote:
> > I'd like a way to type something like "touch zöe.txt" on an ordinary
> > latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)
> 
> Then hack either bash (or whatever shell you use) or touch to do just that.

Hacking touch is obviously useless - I'd need to hack all the other
2000 shell utilities to get any useful behaviour.

Hacking bash -- actually readline -- is a much better idea.  Then you
can enter names and they'll be created right.  The only flaw in this
is that "ls" won't be useful, so that'll need to be hacked as well. etc.

No, I think hacking the terminal I/O is the best bet here.  Then _all_
programs which currently work with UTF-8 terminals, which is rapidly
becoming most of them, will work the same with both kinds of terminal,
and the illusion of perfection will be complete and beautiful.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                 ` Jamie Lokier
@ 2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:47                                     ` Jamie Lokier
  2004-02-18  7:23                                     ` Marc Lehmann
  2004-02-17 21:23                                   ` Matthew Kirkwood
  1 sibling, 2 replies; 50+ messages in thread
From: Alex Belits @ 2004-02-17 21:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

  UTF-8 terminals (and variable-encoding terminals) already exist,
gnome-terminal is one of them. They are, of course, bloated pigs, but I
would rather have the bloat and idiosyncrasy in the user interface where
it belongs.

-- 
Alex

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                   ` Alex Belits
@ 2004-02-17 21:47                                     ` Jamie Lokier
  2004-02-22 15:32                                       ` Eric W. Biederman
  2004-02-18  7:23                                     ` Marc Lehmann
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:47 UTC (permalink / raw)
  To: Alex Belits; +Cc: Måns Rullgård, linux-kernel

Alex Belits wrote:
> > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > programs which currently work with UTF-8 terminals, which is rapidly
> > becoming most of them, will work the same with both kinds of terminal,
> > and the illusion of perfection will be complete and beautiful.
> 
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I
> would rather have the bloat and idiosyncrasy in the user interface where
> it belongs.

Yes, I am using it right now.  The fancy characters work well in it.
Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
naturally like to access my files in the same way.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:47                                     ` Jamie Lokier
@ 2004-02-22 15:32                                       ` Eric W. Biederman
  2004-02-22 16:28                                         ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 15:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Alex Belits wrote:
> > > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > > programs which currently work with UTF-8 terminals, which is rapidly
> > > becoming most of them, will work the same with both kinds of terminal,
> > > and the illusion of perfection will be complete and beautiful.
> > 
> >   UTF-8 terminals (and variable-encoding terminals) already exist,
> > gnome-terminal is one of them. They are, of course, bloated pigs, but I
> > would rather have the bloat and idiosyncrasy in the user interface where
> > it belongs.
> 
> Yes, I am using it right now.  The fancy characters work well in it.
> Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
> naturally like to access my files in the same way.

Basically I think this is just a matter of modifying telnetd and
sshd so that for the display they follow the user's locale,
at least in cooked mode.

Does anyone have a good grasp of what the exact semantics should be and
where the translation should happen?  I know we need to delay the
translation as long as possible so we can get binary streams flowing
through these protocols.

I guess my question is when do we know the information is going to
a terminal so we should translate it?

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 15:32                                       ` Eric W. Biederman
@ 2004-02-22 16:28                                         ` Jamie Lokier
  2004-02-22 21:53                                           ` Eric W. Biederman
  0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-22 16:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Eric W. Biederman wrote:
> I guess my question is when do we know the information is going to
> a terminal so we should translate it?

When a program is writing to a terminal device, then we know it's
going to a terminal _or_ to a program which is pretending to be one
(pseudo-terminal).  Either way, the behaviour should be the same.

The "screen" program can be used to do translation, although it's a
rather cumbersome way to go about it, and it has other effects which
are annoying (at least one key is always designated for "screen" commands).

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 16:28                                         ` Jamie Lokier
@ 2004-02-22 21:53                                           ` Eric W. Biederman
  0 siblings, 0 replies; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 21:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Eric W. Biederman wrote:
> > I guess my question is when do we know the information is going to
> > a terminal so we should translate it?
> 
> When a program is writing to a terminal device, then we know it's
> going to a terminal _or_ to a program which is pretending to be one
> (pseudo-terminal).  Either way, the behaviour should be the same.
> 
> The "screen" program can be used to do translation, although it's a
> rather cumbersome way to go about it, and it has other effects which
> are annoying (at least one key is always designated for "screen" commands).

Right.  At this point I am not worried about temporary solutions.  I
want to pin down how things should be implemented.  So the user space
programs can be fixed.  Pardon me while I think aloud to frame the problem.

First it is worth noting that the existing practice is that ttys
always use the character set encoding of the user.  Even X cut and
paste frequently abuses the iso8859-1 range, and uses the native
character set encoding instead of iso8859-1.

Now the work is how to get multiple locales to play nicely with each
other.  utf-8 and unicode are convenient for that as they preserve the
existing assumptions that terminals, filenames, and text files are
all using the same character set encoding, even when multiple locales
are involved.

So within one machine utf-8 solves the multiple locale problem.  The
problem has now moved to interoperability between machines.  Since
multiple machines have different upgrade cycles, and are in different
administrative domains everyone does not move to utf-8 at the same
time.

When we add the assertion that all I/O going through a terminal device
is in the native locale we break 8bit transparency.  This holds true
even in some instances when both sides use the same character set
encoding, such as utf8.

There are some mitigating factors to this.  ssh already documents
pseudo-ttys as potentially breaking 8 bit transparency.  And
applications that require ttys for stdin/stdout are most likely
interactive.  Interactive programs are either character based, or
broken.

Being an unclean channel for pipes will affect at least XMODEM,
YMODEM, and ZMODEM protocols, and possibly ppp.  These programs
already know how to avoid problem characters and because ascii is a
common subset of most character set encodings the effect should be no
worse than a line that is not 8 bit clean.
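
A short Python sketch of the effect, under the assumption that the channel
decodes and re-encodes everything as text and substitutes a replacement
character for anything invalid (the function name is hypothetical). ASCII
survives untouched; arbitrary binary data does not, even though both ends
use utf8.

```python
def through_text_channel(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Model a channel that insists its traffic is valid text in `encoding`."""
    return raw.decode(encoding, errors="replace").encode(encoding)


ascii_payload = bytes(range(128))
binary_payload = bytes(range(256))

print(through_text_channel(ascii_payload) == ascii_payload)    # ASCII is preserved
print(through_text_channel(binary_payload) == binary_payload)  # binary data is altered
```

A transfer protocol that restricts itself to ASCII would therefore pass
through such a channel unharmed.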

ssh at least has explicit options to allocate or not allocate a
pseudo-tty so getting an 8 bit clean data path is not a problem with
ssh.

The rule ``All data passing through a pseudo-tty is in the
character set encoding specified by the locale of the owner of the
tty'' seems both reasonable and no significant change from the current
status quo.

Now how does this get implemented?

On the wire between two machines I recommend passing unicode
characters.  Unicode guarantees no round trip loss for any of its
member character sets, and it reduces everything to one set of
translation tables.

By convention glibc stores unicode values in wchar_t.  mbsrtowcs will
convert multibyte strings to internal wide characters, based on the
current locale. wcstombs will do the opposite.  So going between
unicode and the character set encoding of the current locale is
straightforward.
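
The idea of using Unicode as the single pivot between encodings can be
sketched in Python, whose str type plays the role of the wide-character
form here. This is an analogy rather than a use of the C library
functions; transcode is a hypothetical helper, and "euc_jp" and
"shift_jis" are codec names from Python's standard library. The sample
covers only three characters, so it illustrates the round trip rather
than proving the general guarantee.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two encodings by way of Unicode code points."""
    return data.decode(src).encode(dst)


text = "日本語"
euc = text.encode("euc_jp")

# EUC-JP -> Unicode -> Shift_JIS, then back again.
sjis = transcode(euc, "euc_jp", "shift_jis")
back = transcode(sjis, "shift_jis", "euc_jp")

print(sjis.decode("shift_jis") == text)
print(back == euc)
```

With Unicode in the middle, each encoding needs only one table to and from
Unicode instead of one table per pair of encodings.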

How do we convert the applications?

There are only four cases I can think of where we connect to a remote
system with terminal semantics.
1) Directly connected serial terminals.
2) telnetd
3) rshd
4) sshd

To my knowledge all of their protocols just pass through characters
and are neutral.  So changing these feels like a protocol extension,
ouch!  Those are the programs that bridge multiple administrative
domains, and they do deal with pseudo ttys so they are where something
needs to happen, to support different character set encodings on
different machines. 

If everyone just switches over to using utf-8 even the above cases are
fine.  So if there is a reasonable expectation that everyone will
change to using utf-8 in the near future even those programs don't
need to change.

Given the delay in changing protocols I propose 2 simple programs.
sh-utf8 and utf8-tty.  The first runs a command converting stdout and
stderr from utf8 to the current locale, and converting stdin into
utf8.  The second creates a pseudo tty and relays to its controlling
tty, assuming the controlling tty uses utf8 and its tty uses the
current locale.
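
The core of either program is a stream converter that copes with
multibyte characters split across reads. A minimal Python sketch follows,
assuming the encodings are given explicitly; StreamTranscoder is a
hypothetical name, and the pseudo tty plumbing around it is omitted.

```python
import codecs


class StreamTranscoder:
    """Convert a byte stream from one encoding to another, chunk by chunk."""

    def __init__(self, src: str, dst: str):
        # Incremental codecs buffer incomplete multibyte sequences between calls.
        self._decoder = codecs.getincrementaldecoder(src)(errors="replace")
        self._encoder = codecs.getincrementalencoder(dst)(errors="replace")

    def feed(self, chunk: bytes, final: bool = False) -> bytes:
        text = self._decoder.decode(chunk, final)
        return self._encoder.encode(text, final)


# "zöe\n" in utf8, with the two-byte "ö" split across two reads.
relay = StreamTranscoder("utf-8", "latin-1")
out = relay.feed(b"z\xc3") + relay.feed(b"\xb6e\n", final=True)
print(out)
```

A converter that decoded each chunk independently would corrupt any
character that happened to straddle a read boundary, which is why the
incremental form is used.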

Looking around there already is a TTYConv program that seems to fill
this niche, except you must specify the character set encodings
manually.
http://bedroomlan.dyndns.org/~alexios/coding_ttyconv.html

Comments?

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:47                                     ` Jamie Lokier
@ 2004-02-18  7:23                                     ` Marc Lehmann
  1 sibling, 0 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-18  7:23 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 17, 2004 at 02:06:21PM -0700, Alex Belits <abelits@phobos.illtel.denver.co.us> wrote:
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I

rxvt-unicode (mixed fonts, bad complex script), and mlterm (no mixed
fonts, very good complex script support), are not at all bloated, have a
_much_ smaller memory footprint than xterm and are even faster on text
output and scrolling complex scripts than xterm (by a factor of two).

(Of course, gnome-terminal is bloated. loading it requires 45MB of main
memory here and then it's still 5-10 times slower than xterm).

That UTF-8/Unicode in any way means bloated (I know you did not directly
imply this) is a widely circulating but wrong idea nowadays.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                 ` Jamie Lokier
  2004-02-17 21:06                                   ` Alex Belits
@ 2004-02-17 21:23                                   ` Matthew Kirkwood
  1 sibling, 0 replies; 50+ messages in thread
From: Matthew Kirkwood @ 2004-02-17 21:23 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

Yep.  A charset-translating tty proxy, a little like screen
or detachtty is what you want.  I wonder if there's an SSH
client or server which can do that.

Matthew.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:32                           ` Linus Torvalds
  2004-02-17 16:46                             ` Jamie Lokier
@ 2004-02-17 16:54                             ` Stefan Smietanowski
  2004-02-18  1:27                               ` Hans Reiser
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Smietanowski @ 2004-02-17 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, Jamie Lokier, viro, Linux kernel

Hi Linus.

>>Because there is a fundamental difference between file contents and
>>filenames. Filenames are supposed to be text.
> 
> I think this is actually the fundamental point where we disagree.
> 
> You think of filenames as something the user types in, and that is 
> "readable text". And I don't.
> 
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).
> 
> So the same way I think text is good in config files and I dislike binary
> blobs (hey, look at /proc), I think readable filenames are good. But that
> doesn't mean that they have to be readable. I can well imagine encoding
> meta-data in the filename for some database that uses the filesystem as
> its backing store and generates files for large blobs. And then there
> would be little if any "goodness" to keeping the filenames readable.

Just look at Mozilla's cache... They may have turned the blob into
ascii but it's still a blob.

// Stefan


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:54                             ` Stefan Smietanowski
@ 2004-02-18  1:27                               ` Hans Reiser
  2004-02-18  2:08                                 ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2004-02-18  1:27 UTC (permalink / raw)
  To: Stefan Smietanowski
  Cc: Linus Torvalds, Marc Lehmann, Jamie Lokier, viro, Linux kernel

ReiserFS 6 plans to allow files to be associated with arbitrary files 
and found by those associations.  Some of those files will consist of 
ascii keywords, some will be icon images, etc.....  Human readability 
should not be considered fundamental to a name component, especially 
since programs with no interest in readability may be the only direct 
users of the name.

Hans


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  1:27                               ` Hans Reiser
@ 2004-02-18  2:08                                 ` Robin Rosenberg
  2004-02-18 11:06                                   ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18  2:08 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Stefan Smietanowski, Linus Torvalds, Marc Lehmann, Jamie Lokier,
	viro, Linux kernel

On Wednesday 18 February 2004 02.27, Hans Reiser wrote:
> ReiserFS 6 plans to allow files to be associated with arbitrary files 
> and found by those associations.  Some of those files will consist of 
> ascii keywords, some will be icon images, etc.....  Human readability 
> should not be considered fundamental to a name component, especially 
> since programs with no interest in readability may be the only direct 
> users of the name.

If the user never sees a name, it doesn't matter. But the user does see and read
the filenames in /home, on portable media, on network devices and in lots of other
places. When a user has named a component, its characters are what matter to the
user, because they form an "image" (since you introduced the term) or "sound" that
the user remembers and associates with the content. A character is the simplest
form of image so it should always look the same.

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:08                                 ` Robin Rosenberg
@ 2004-02-18 11:06                                   ` Jamie Lokier
  0 siblings, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:06 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Hans Reiser, Stefan Smietanowski, Linus Torvalds, Marc Lehmann,
	viro, Linux kernel

Robin Rosenberg wrote:
> A character is the simplest form of image so it should always look the same.

People who need the computer to _speak_ names need language or
phonetic information attached to a name, for it to be spoken properly.

On this, Alex Belits has a good point.  It's all very well
standardising on UTF-8 so every name can be displayed nicely.  That is
incomplete for a user who needs "ls" to work audibly, though.

In practice, such a user configures their machine to assume a
particular language, or guess it with bias to the one they use most often.

That is, in some ways, the same problem as having a mixture of
filenames in an unknown character encoding, except that UTF-8 doesn't
solve it.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2004-02-23 19:13 UTC | newest]

Thread overview: 50+ messages
2004-02-23 11:35 UTF-8 practically vs. theoretically in the VFS API Norman Diamond
     [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
2004-02-23 19:13   ` Junio C Hamano
     [not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
2004-02-14 23:06 ` JFS default behavior Robin Rosenberg
2004-02-14 23:29   ` viro
2004-02-15  0:07     ` Robin Rosenberg
2004-02-15  2:41       ` Linus Torvalds
2004-02-16 18:36         ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 18:49           ` Linus Torvalds
2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
2004-02-16 19:48               ` John Bradford
2004-02-16 19:48                 ` Linus Torvalds
2004-02-16 20:20                   ` Marc Lehmann
2004-02-16 20:26                     ` Linus Torvalds
2004-02-18  2:49                     ` Rob Landley
2004-02-16 20:21                   ` bert hubert
2004-02-16 20:33                     ` Marc Lehmann
2004-02-18  2:58                     ` H. Peter Anvin
2004-02-18  3:13                       ` Linus Torvalds
2004-02-18  3:22                         ` H. Peter Anvin
2004-02-18  3:30                           ` Linus Torvalds
2004-02-18  5:30                             ` H. Peter Anvin
2004-02-18 10:29                               ` Robin Rosenberg
2004-02-18 11:49                                 ` Tomas Szepe
2004-02-18 11:59                                   ` Robin Rosenberg
2004-02-18 12:05                                     ` Tomas Szepe
2004-02-18 12:34                                       ` Robin Rosenberg
2004-02-18 15:35                               ` Linus Torvalds
2004-02-18 19:47                                 ` Tomas Szepe
2004-02-18 20:01                                   ` H. Peter Anvin
2004-02-18 21:22                                     ` Robin Rosenberg
2004-02-18 21:42                                       ` H. Peter Anvin
2004-02-18 11:24                           ` Jamie Lokier
2004-02-18 11:33                         ` Jamie Lokier
2004-02-18 16:47                           ` H. Peter Anvin
2004-02-18 19:59                           ` Linus Torvalds
2004-02-18 20:08                             ` H. Peter Anvin
2004-02-18  7:25                       ` bert hubert
2004-02-16 20:16                 ` Marc Lehmann
2004-02-16 20:20                   ` Jeff Garzik
2004-02-16 21:10                   ` viro
2004-02-17  7:18                   ` jw schultz
2004-02-17  7:42                   ` Nick Piggin
2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 20:23               ` Linus Torvalds
2004-02-16 22:26                 ` Jamie Lokier
2004-02-16 22:40                   ` Linus Torvalds
2004-02-17  7:14                     ` Marc Lehmann
2004-02-17 11:20                       ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
2004-02-17 15:56                       ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
     [not found]                         ` <20040217161111.GE8231@schmorp.de>
2004-02-17 16:32                           ` Linus Torvalds
2004-02-17 16:46                             ` Jamie Lokier
2004-02-17 19:00                               ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 20:57                                 ` Jamie Lokier
2004-02-17 21:06                                   ` Alex Belits
2004-02-17 21:47                                     ` Jamie Lokier
2004-02-22 15:32                                       ` Eric W. Biederman
2004-02-22 16:28                                         ` Jamie Lokier
2004-02-22 21:53                                           ` Eric W. Biederman
2004-02-18  7:23                                     ` Marc Lehmann
2004-02-17 21:23                                   ` Matthew Kirkwood
2004-02-17 16:54                             ` Stefan Smietanowski
2004-02-18  1:27                               ` Hans Reiser
2004-02-18  2:08                                 ` Robin Rosenberg
2004-02-18 11:06                                   ` Jamie Lokier
