Re: UTF-8 practically vs. theoretically in the VFS API

public inbox for linux-kernel@vger.kernel.org

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 18:49           ` Linus Torvalds
@ 2004-02-16 19:26             ` Jeff Garzik
  2004-02-16 19:48               ` John Bradford
  2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
  1 sibling, 1 reply; 50+ messages in thread
From: Jeff Garzik @ 2004-02-16 19:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Linus Torvalds wrote:
> In short: filenames are byte streams. Nothing more. They don't even have a 
> "character set". They literally are just a series of bytes.
> 
> And when I say that you have to talk to the kernel using UTF-8, I'm only 
> claiming that it is the only sane way to encode extended characters in a 
> byte stream. Nothing more.

Nod.  Maybe it helps Marc to point out the key difference between 
characters and bytes, in UTF8.

In UTF8, the number of characters in a string is less-than-or-equal-to 
the number of bytes in the string.

And the kernel just cares about bytes.

This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
designed such that ten-year-old C code using standard C strings would 
function just fine.  No need to rip up large swaths of your code just to 
call multi-byte versions of the standard string functions.  Most code 
that doesn't deal with locale-specific details like uppercase/lowercase 
Just Works(tm).
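
As an illustration of the character/byte distinction, here is a minimal Python sketch. The sample names are assumptions chosen for the example, and "character" here means a Unicode code point. It shows that the code-point count never exceeds the UTF-8 byte count, and that byte-oriented code only needs the latter.

```python
# Sample names are arbitrary illustrations, written with escapes so the
# source stays ASCII: "zoe.txt" with an o-umlaut, and a three-kanji name.
SAMPLES = ["plain.txt", "z\u00f6e.txt", "\u65e5\u672c\u8a9e.txt"]


def char_and_byte_counts(name: str) -> tuple:
    """Return (code points, UTF-8 bytes) for a name."""
    return len(name), len(name.encode("utf-8"))


for name in SAMPLES:
    chars, nbytes = char_and_byte_counts(name)
    # Byte-oriented code only ever sees `nbytes`; it never needs `chars`.
    print(f"{name!a}: {chars} characters, {nbytes} bytes")
```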

	Jeff

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
@ 2004-02-16 19:48               ` John Bradford
  2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:16                 ` Marc Lehmann
  0 siblings, 2 replies; 50+ messages in thread
From: John Bradford @ 2004-02-16 19:48 UTC (permalink / raw)
  To: Jeff Garzik, Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Quote from Jeff Garzik <jgarzik@pobox.com>:
> Linus Torvalds wrote:
> > In short: filenames are byte streams. Nothing more. They don't even have a 
> > "character set". They literally are just a series of bytes.
> > 
> > And when I say that you have to talk to the kernel using UTF-8, I'm only 
> > claiming that it is the only sane way to encode extended characters in a 
> > byte stream. Nothing more.
> 
> 
> Nod.  Maybe it helps Marc to point out the key difference between 
> characters and bytes, in UTF8.
> 
> In UTF8, the number of characters in a string is less-than-or-equal-to 
> the number of bytes in the string.
> 
> And the kernel just cares about bytes.
> 
> This is the whole benefit to UTF8, right here in this thread.  UTF8 was 
> designed such that ten-year-old C code using standard C strings would 
> function just fine.  No need to rip up large swaths of your code just to 
> call multi-byte versions of the standard string functions.  Most code 
> that doesn't deal with locale-specific details like uppercase/lowercase 
> Just Works(tm).

The real problem is with mis-configured userspaces, where buggy UTF-8
decoders are trying to make sense of data in legacy encodings
containing essentially random bytes > 127, which are not part of valid
UTF-8 sequences.

None of this is a real problem, if everything is set up correctly and
bug free.  Unfortunately the Just Works thing falls apart in the,
(frequent), instances that it's not :-(.
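
A minimal Python sketch of the failure mode described here, assuming a name written by a Latin-1 program and later read by a strict UTF-8 decoder. The helper name `is_valid_utf8` and the sample name are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if `raw` decodes under Python's strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same human-readable name, stored by two differently configured programs.
legacy = "z\u00f6e.txt".encode("latin-1")  # b'z\xf6e.txt', one byte > 127
modern = "z\u00f6e.txt".encode("utf-8")    # b'z\xc3\xb6e.txt'

print(legacy, is_valid_utf8(legacy))
print(modern, is_valid_utf8(modern))
```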

John.


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48               ` John Bradford
@ 2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:16                 ` Marc Lehmann
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 19:48 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, John Bradford wrote:
> 
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
> 
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you 
really have to. Always leave the string in utf-8 "raw bytestring" mode as 
long as possible, and convert to character sets only when actually 
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being erroneous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
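
One way to read this advice, as a minimal Python sketch: keep the raw bytes as the working value and decode only to build a label for the user. The function name `display_name` and the sample names are assumptions for illustration. The `backslashreplace` error handler is used so that bytes that are not valid UTF-8 show up as visible escapes instead of raising an error.

```python
def display_name(raw: bytes) -> str:
    """Decode for display only; undecodable bytes become \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


# The last name is deliberately not valid UTF-8.
names = [b"ok.txt", b"z\xc3\xb6e.txt", b"z\xf6e.txt"]

for raw in names:
    # `raw` stays the key for lookup/delete; only the label is decoded.
    label = display_name(raw)
    print(ascii(label), "is the label for", raw)
```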

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48               ` John Bradford
  2004-02-16 19:48                 ` Linus Torvalds
@ 2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
                                     ` (3 more replies)
  1 sibling, 4 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:16 UTC (permalink / raw)
  To: John Bradford; +Cc: Jeff Garzik, Linus Torvalds, viro, Linux kernel

On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> Quote from Jeff Garzik <jgarzik@pobox.com>:
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

And this is the whole point.

BTW, to people trying to explain some properties of UTF-8 to me. I don't
think ad-hominem attacks like assuming that I don't understand UTF-8
(without any indication that this is so) are useful.

The point here is that the kernel does, in a very narrow interpretation,
not support the use of UTF-8, because proper support of UTF-8 means that
no illegal byte sequences will be produced.

Of course, I can feed the kernel UTF-8, and if everybody does that, it
will generally work quite fine. However, Windows surely works fine if
every program only feeds allowed values into system calls. And even unix
dialects without memory protection work, as long as everybody plays
fair.

The point is, however, that this is highly undesirable, and it would be
nice to have a kernel that would (optionally) fully support a UTF-8
environment in which applications can feed UTF-8 and _expect_ UTF-8 in
return, which _is_ a security issue.

It's very desirable to have a kernel that actively supports this. It is
clearly not _required_, of course. But then again, process abstraction
is also not required...

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                 ` Linus Torvalds
@ 2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:26                     ` Linus Torvalds
  2004-02-18  2:49                     ` Rob Landley
  2004-02-16 20:21                   ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> works on the raw byte sequence and isn't confused). Basically accept the
> fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
well-defined and is always proper UTF-8. It's a tautology.

The very idea of "UTF-8 with garbage in it" doesn't make sense.

> And no, I'm not claiming that it's wonderfully clean and that we should
> all love it.

It's also a totally useless idiom...

> And it largely works today.
> 		Linus

On ascii-only-systems, it works fine. My system is largely ascii-only,
with only very few filenames (japanese and german ones mostly) in
UTF-8. Sometimes in EUC-JP, but that's a bug in rar.

It also works fine in single-user environments where the user just forces
everything to be in her locale. It does fail miserably on multi-user
systems. It does fail miserably in ISO-C's locale model. It does fail
miserably with gnu shellutils, fileutils and most other apps.

It fails, because it's not at all well supported by the kernel.

Claiming that it largely works today is simply not true for most
non-ascii-users (which increasingly includes the US).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
@ 2004-02-16 20:20                   ` Jeff Garzik
  2004-02-16 21:10                   ` viro
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Jeff Garzik @ 2004-02-16 20:20 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Linus Torvalds, viro, Linux kernel

Marc Lehmann wrote:
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

Incorrect.  Byte stream transports need not care about their contents.

The only places that need to care about illegal UTF8 byte sequences are 
things like CONFIG_NLS_UTF8.

	Jeff





* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 19:48                 ` Linus Torvalds
  2004-02-16 20:20                   ` Marc Lehmann
@ 2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:33                     ` Marc Lehmann
  2004-02-18  2:58                     ` H. Peter Anvin
  1 sibling, 2 replies; 50+ messages in thread
From: bert hubert @ 2004-02-16 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: John Bradford, Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds wrote:

> The way to handle that is to aim to never _ever_ decode utf-8 unless you 
> really have to. Always leave the string in utf-8 "raw bytestring" mode as 
> long as possible, and convert to character sets only when actually 
> printing.

Additional good news is that following octets in a utf-8 character sequence
always have the highest order bit set, precluding / or \x0 from appearing,
confusing the kernel.
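
This property can be checked exhaustively. The following is a minimal Python sketch (the function name is hypothetical) that encodes every non-ASCII code point and looks for any byte below 0x80, which is where '/' (0x2F) and NUL (0x00) would have to live.

```python
def ascii_bytes_inside_multibyte():
    """Return code points whose multi-byte UTF-8 form contains a byte < 0x80."""
    offenders = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if min(chr(cp).encode("utf-8")) < 0x80:
            offenders.append(cp)
    return offenders


OFFENDERS = ascii_bytes_inside_multibyte()
print("code points with an ASCII byte in their encoding:", len(OFFENDERS))
```

An empty result means a byte-wise search for '/' or NUL cannot match in the middle of a multi-byte sequence.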

The remaining zit is that all these represent '..':
2E 2E
C0 AE C0 AE
E0 80 AE E0 80 AE 
F0 80 80 AE F0 80 80 AE 
F8 80 80 80 AE F8 80 80 80 AE 
FC 80 80 80 80 AE FC 80 80 80 80 AE

This in itself is not a problem, the kernel will only recognize 2E 2E as the
real .., but it does show that 'document.doc' might be encoded in a myriad
ways.

So some guidance about using only the simplest possible encoding might be
sensible, if we don't want the kernel to know about utf-8.
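
For illustration, a strict decoder can tell the shortest form apart from the longer ones. This is a minimal Python sketch that assumes Python's built-in UTF-8 codec, which accepts only shortest-form sequences as RFC 3629 requires. The helper name `strictly_valid_utf8` is hypothetical.

```python
# The byte sequences listed above: the plain form, then the longer forms.
PLAIN = bytes.fromhex("2E 2E")
OVERLONG = [
    bytes.fromhex("C0 AE C0 AE"),
    bytes.fromhex("E0 80 AE E0 80 AE"),
    bytes.fromhex("F0 80 80 AE F0 80 80 AE"),
    bytes.fromhex("F8 80 80 80 AE F8 80 80 80 AE"),
    bytes.fromhex("FC 80 80 80 80 AE FC 80 80 80 80 AE"),
]


def strictly_valid_utf8(raw: bytes) -> bool:
    """True if `raw` is accepted by Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(PLAIN.hex(), strictly_valid_utf8(PLAIN))
for raw in OVERLONG:
    print(raw.hex(), strictly_valid_utf8(raw))
```

Only the two-byte form decodes; a decoder that accepted the others would be more lenient than the current UTF-8 definition allows.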

> And it largely works today.

Indeed.

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                   ` Marc Lehmann
@ 2004-02-16 20:26                     ` Linus Torvalds
  2004-02-18  2:49                     ` Rob Landley
  1 sibling, 0 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-16 20:26 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel



On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
> 
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
> 
> > The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
 (a) there is only one standard
 (b) people read it
 (c) people actually follow it and never have bugs

I've got news for you: none of the above is true. 

Which means that IN PRACTICE you will find strings that you think are 
UTF-8-encoded, but that don't end up being proper UTF-8.

That's the difference between real world and theory. 

And you can either write your programs to be "theoretically correct", or 
you can write them to "work".

It's your choice. I know which program I'd prefer to use.

		Linus


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                   ` bert hubert
@ 2004-02-16 20:33                     ` Marc Lehmann
  2004-02-18  2:58                     ` H. Peter Anvin
  1 sibling, 0 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:33 UTC (permalink / raw)
  To: bert hubert; +Cc: linux-kernel

On Mon, Feb 16, 2004 at 09:21:42PM +0100, bert hubert <ahu@ds9a.nl> wrote:
> The remaining zit is that all these represent '..':

No, they don't. Read the UTF-8 definition...

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it can only be encoded in exactly one way *in UTF-8*. It can of course
be encoded differently in other encodings, but in UTF-8, there is only a
single representation. There are no ambiguities.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

Fortunately, this has all already been taken care of, and is not a problem.

I mean, the _definition_ of UTF-8 works. Whether specific applications
(whether in the kernel or apps) work is a different question. But at
least the specification is rather clear.

Compare this to the URL definition, which only hints that you don't know
the encoding, and therefore, the interpretation as text, of a URL unless
you have an extra channel that communicates it.

While possible, this channel does not exist in practice, creating big
problems for people writing i18n-ized web applications.

The thing is that the kernel certainly _works_ on a very basic level, but
I think the situation can be improved by making it clear how to interpret
filenames, which currently is not the case.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
@ 2004-02-16 21:10                   ` viro
  2004-02-17  7:18                   ` jw schultz
  2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: viro @ 2004-02-16 21:10 UTC (permalink / raw)
  To: John Bradford, Jeff Garzik, Linus Torvalds, Linux kernel

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8
> environment in which applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.
> 
> It's very desirable to have a kernel that actively supports this. It is
> clearly not _required_, of course. But then again, process abstraction
> is also not required...

Mind taking the demagogy elsewhere?  Note that the same handwaving applies
to e.g. file contents.  Care to explain what makes read() and write()
different in that respect?


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
  2004-02-16 20:20                   ` Jeff Garzik
  2004-02-16 21:10                   ` viro
@ 2004-02-17  7:18                   ` jw schultz
  2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: jw schultz @ 2004-02-17  7:18 UTC (permalink / raw)
  To: Linux kernel; +Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro

On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> > Quote from Jeff Garzik <jgarzik@pobox.com>:
> > None of this is a real problem, if everything is set up correctly and
> > bug free.  Unfortunately the Just Works thing falls apart in the,
> > (frequent), instances that it's not :-(.
>       
> And this is the whole point.
> 
> BTW, to people trying to explain some properties of UTF-8 to me. I don't
> think ad-hominem attacks like assuming that I don't understand UTF-8
> (without any indication that this is so) are useful.
> 
> The point here is that the kernel does, in a very narrow interpretation,
> not support the use of UTF-8, because proper support of UTF-8 means that
> no illegal byte sequences will be produced.

That "interpretation" is so narrow as to be unrealistic.
The kernel supports UTF-8 the same way a stage supports
rock musicians.  You confuse support with enforce, rather
like confusing tolerance with endorsement.

And it should be noted that the kernel doesn't produce file
names.  It only passes them along.

> Of course, I can feed the kernel UTF-8, and if everybody does that, it
> will generally work quite fine. However, Windows surely works fine if
> every program only feeds allowed values into system calls. And even unix
> dialects without memory protection work, as long as everybody plays
> fair.
>
> The point is, however, that this is highly undesirable, and it would be
> nice to have a kernel that would (optionally) fully support a UTF-8

You mean enforce again.  That enhancement request has been
rejected repeatedly because such a thing would be highly
undesirable.  What might be a convenient but unnecessary
restriction today is too likely to become an unbearable
restriction tomorrow.  I don't want the kernel to have to
care about what is or isn't valid UTF-8.  I certainly don't
want to have the kernel loaded with outdated character
tables.

> environment in which applications can feed UTF-8 and _expect_ UTF-8 in
> return, which _is_ a security issue.

I want an environment where applications can feed bytestreams
and expect the same bytestream in return.  I see enough
problems as a result of filesystems that don't do that.
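
A userspace program can offer a text view of names and still hand back exactly the bytes it was given. This is a minimal Python sketch using the `surrogateescape` error handler, which maps undecodable bytes to reserved code points on the way in and restores them on the way out. The helper names `to_text` and `to_bytes` are hypothetical.

```python
def to_text(raw: bytes) -> str:
    """Lossless text view: bytes that are not valid UTF-8 become lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: restores the original byte sequence."""
    return text.encode("utf-8", errors="surrogateescape")


# Every possible byte value, valid UTF-8 or not, survives the round trip.
every_byte = bytes(range(256))
print(to_bytes(to_text(every_byte)) == every_byte)
```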

> It's very desirable to have a kernel that actively supports this. It is

You mean enforces again.  Kernel as police, next thing you
will want is a kernel that prevents undesirable character
sequences.

> clearly not _required_, of course. But then again, process abstraction
> is also not required...

I'll tell you what.  Patch libc.  You can add UTF-8 filename
enforcement to libc.  There are only a few system calls that
would need to have their wrappers enlarged.  I'm sure the
libc people will direct you to someplace very warm if you
ask them for this enhancement.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:16                 ` Marc Lehmann
                                     ` (2 preceding siblings ...)
  2004-02-17  7:18                   ` jw schultz
@ 2004-02-17  7:42                   ` Nick Piggin
  3 siblings, 0 replies; 50+ messages in thread
From: Nick Piggin @ 2004-02-17  7:42 UTC (permalink / raw)
  To: Marc Lehmann
  Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro, Linux kernel



Marc Lehmann wrote:

>On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
>
>>Quote from Jeff Garzik <jgarzik@pobox.com>:
>>None of this is a real problem, if everything is set up correctly and
>>bug free.  Unfortunately the Just Works thing falls apart in the,
>>(frequent), instances that it's not :-(.
>>
>      
>And this is the whole point.
>
>BTW, to people trying to explain some properties of UTF-8 to me. I don't
>think ad-hominem attacks like assuming that I don't understand UTF-8
>(without any indication that this is so) are useful.
>
>The point here is that the kernel does, in a very narrow interpretation,
>not support the use of UTF-8, because proper support of UTF-8 means that
>no illegal byte sequences will be produced.
>
>

So does the kernel support the English language? Does your
email client?



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17  7:14                     ` Lehmann 
@ 2004-02-17 11:20                       ` Helge Hafting
  2004-02-17 15:56                       ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
  1 sibling, 0 replies; 50+ messages in thread
From: Helge Hafting @ 2004-02-17 11:20 UTC (permalink / raw)
  Cc: Linus Torvalds, Jamie Lokier, Marc Lehmann, viro, Linux kernel

pcg@goof.com (Marc A. Lehmann) wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>Try it with a regular C locale. Do a simple
>>
>>	echo > åäö
> 
> 
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.
> 
> 
>>Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
>>program should do when it sees broken UTF-8.
> 
> 
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. Multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.
> 
> Yes, it's possible....

All you need is a possible_garbage_to_properly_escaped_utf8(char *string)
in libc.  Any program that wants to display filenames it got
straight from readdir (or any binary file contents) will simply feed
the string through that and get back a string with
escapes for anything that isn't utf8.  It is a write-once, use
everywhere thing.
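
A minimal sketch of what such a helper could look like, written in Python, not C. The function name is taken from the message above; it is not an existing libc or Python API, and the escape syntax (`\xNN`) is an assumption. A literal backslash in the input is left as-is, so the output is meant for display, not for converting back.

```python
def possible_garbage_to_properly_escaped_utf8(raw: bytes) -> bytes:
    """Return valid UTF-8: good sequences pass through, bad bytes become \\xNN."""
    text = raw.decode("utf-8", errors="backslashreplace")
    return text.encode("utf-8")


for raw in (b"caf\xc3\xa9", b"caf\xe9"):
    # The first name is valid UTF-8; the second is a Latin-1 leftover.
    print(raw, "->", possible_garbage_to_properly_escaped_utf8(raw))
```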

Once upon a time, there were serious problems when someone created
filenames like "; rm -fr *"  Today we use tab completion
and get bash to present the filename with proper escapes.  It is then harmless.
Bad utf8 can be handled the same way.
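
The shell-quoting half of this comparison can be sketched in Python with the standard `shlex` module, used here as a stand-in for what bash does when it completes a filename. The sample command line is an assumption for illustration.

```python
import shlex

hostile = "; rm -fr *"

# Quote the name so that a POSIX shell treats it as a single argument.
quoted = shlex.quote(hostile)
print(quoted)

# Splitting the command line the way a POSIX shell would gives the name
# back as one word; the ';' and '*' are no longer interpreted.
words = shlex.split("ls -l " + quoted)
print(words)
```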

> The "bit" is enormous, as you can't use your libc for text processing
> anymore.

Not the current libc, but libc can be improved upon. The same happened to
silly code that wasn't 8-bit clean.

Helge Hafting


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:32                           ` Linus Torvalds
  2004-02-17 16:46                             ` Jamie Lokier
@ 2004-02-17 16:54                             ` Stefan Smietanowski
  2004-02-18  1:27                               ` Hans Reiser
  1 sibling, 1 reply; 50+ messages in thread
From: Stefan Smietanowski @ 2004-02-17 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marc Lehmann, Jamie Lokier, viro, Linux kernel

Hi Linus.

>>Because there is a fundamental difference between file contents and
>>filenames. Filenames are supposed to be text.
> 
> I think this is actually the fundamental point where we disagree.
> 
> You think of filenames as something the user types in, and that is 
> "readable text". And I don't.
> 
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).
> 
> So the same way I think text is good in config files and I dislike binary
> blobs (hey, look at /proc), I think readable filenames are good. But that
> doesn't mean that they have to be readable. I can well imagine encoding
> meta-data in the filename for some database that uses the filesystem as
> its backing store and generates files for large blobs. And then there
> would be little if any "goodness" to keeping the filenames readable.

Just look at Mozilla's cache... They may have turned the blob into
ascii but it's still a blob.

// Stefan



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:46                             ` Jamie Lokier
@ 2004-02-17 19:00                               ` Måns Rullgård
  2004-02-17 20:57                                 ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Måns Rullgård @ 2004-02-17 19:00 UTC (permalink / raw)
  To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Linus Torvalds wrote:
>> I think the filenames are just ways for a _program_ to look up stuff, and
>> the human readability is a secondary thing (it's "polite", but not a
>> fundamental part of their meaning).
>
> Politeness is nice.  I'm sure there's a pragmatic reason most
> filenames are meaningful text in some human language :)
>
> I'd like a way to type something like "touch zöe.txt" on an ordinary
> latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)

Then hack either bash (or whatever shell you use) or touch to do just that.

-- 
Måns Rullgård
mru@kth.se



* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 19:00                               ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
@ 2004-02-17 20:57                                 ` Jamie Lokier
  2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:23                                   ` Matthew Kirkwood
  0 siblings, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:57 UTC (permalink / raw)
  To: Måns Rullgård; +Cc: linux-kernel

Måns Rullgård wrote:
> > I'd like a way to type something like "touch zöe.txt" on an ordinary
> > latin1 terminal and get a UTF-8 filename in my filesystem.  Thanks :)
> 
> Then hack either bash (or whatever shell you use) or touch to do just that.

Hacking touch is obviously useless - I'd need to hack all the other
2000 shell utilities to get any useful behaviour.

Hacking bash -- actually readline -- is a much better idea.  Then you
can enter names and they'll be created right.  The only flaw in this
is that "ls" won't be useful, so that'll need to be hacked as well. etc.

No, I think hacking the terminal I/O is the best bet here.  Then _all_
programs which currently work with UTF-8 terminals, which is rapidly
becoming most of them, will work the same with both kinds of terminal,
and the illusion of perfection will be complete and beautiful.
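
The translation step such a terminal layer would perform can be sketched in Python. This is only the byte conversion, not a tty proxy. The function names, the Latin-1 default, and the choice to show unrepresentable characters as '?' are assumptions for illustration.

```python
def terminal_to_filesystem(typed: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Convert bytes typed on a legacy terminal into a UTF-8 filename."""
    return typed.decode(terminal_encoding).encode("utf-8")


def filesystem_to_terminal(name: bytes, terminal_encoding: str = "latin-1") -> bytes:
    """Convert a UTF-8 filename for display; anything unrepresentable becomes '?'."""
    text = name.decode("utf-8", errors="replace")
    return text.encode(terminal_encoding, errors="replace")


typed = b"z\xf6e.txt"                   # "zoe.txt" with o-umlaut, as Latin-1
stored = terminal_to_filesystem(typed)  # UTF-8 bytes handed to the kernel
print(typed, "->", stored, "->", filesystem_to_terminal(stored))
```

The display direction is lossy for names outside the terminal's character set, which is one reason to keep the stored bytes as the authoritative value.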

-- Jamie


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                 ` Jamie Lokier
@ 2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:47                                     ` Jamie Lokier
  2004-02-18  7:23                                     ` Marc Lehmann
  2004-02-17 21:23                                   ` Matthew Kirkwood
  1 sibling, 2 replies; 50+ messages in thread
From: Alex Belits @ 2004-02-17 21:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

  UTF-8 terminals (and variable-encoding terminals) already exist,
gnome-terminal is one of them. They are, of course, bloated pigs, but I
would rather have the bloat and idiosyncrasy in the user interface where
it belongs.

-- 
Alex


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 20:57                                 ` Jamie Lokier
  2004-02-17 21:06                                   ` Alex Belits
@ 2004-02-17 21:23                                   ` Matthew Kirkwood
  1 sibling, 0 replies; 50+ messages in thread
From: Matthew Kirkwood @ 2004-02-17 21:23 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:

> No, I think hacking the terminal I/O is the best bet here.  Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

Yep.  A charset-translating tty proxy, a little like screen
or detachtty is what you want.  I wonder if there's an SSH
client or server which can do that.

Matthew.


* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                   ` Alex Belits
@ 2004-02-17 21:47                                     ` Jamie Lokier
  2004-02-22 15:32                                       ` Eric W. Biederman
  2004-02-18  7:23                                     ` Marc Lehmann
  1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:47 UTC (permalink / raw)
  To: Alex Belits; +Cc: Måns Rullgård, linux-kernel

Alex Belits wrote:
> > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > programs which currently work with UTF-8 terminals, which is rapidly
> > becoming most of them, will work the same with both kinds of terminal,
> > and the illusion of perfection will be complete and beautiful.
> 
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I
> would rather have the bloat and idiosyncrasy in the user interface where
> it belongs.

Yes, I am using it right now.  The fancy characters work well in it.
Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
naturally like to access my files in the same way.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 16:54                             ` Stefan Smietanowski
@ 2004-02-18  1:27                               ` Hans Reiser
  2004-02-18  2:08                                 ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2004-02-18  1:27 UTC (permalink / raw)
  To: Stefan Smietanowski
  Cc: Linus Torvalds, Marc Lehmann, Jamie Lokier, viro, Linux kernel

ReiserFS 6 plans to allow files to be associated with arbitrary files 
and found by those associations.  Some of those files will consist of 
ascii keywords, some will be icon images, etc.....  Human readability 
should not be considered fundamental to a name component, especially 
since programs with no interest in readability may be the only direct 
users of the name.

Hans


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  1:27                               ` Hans Reiser
@ 2004-02-18  2:08                                 ` Robin Rosenberg
  2004-02-18 11:06                                   ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18  2:08 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Stefan Smietanowski, Linus Torvalds, Marc Lehmann, Jamie Lokier,
	viro, Linux kernel

On Wednesday 18 February 2004 02.27, Hans Reiser wrote:
> ReiserFS 6 plans to allow files to be associated with arbitrary files 
> and found by those associations.  Some of those files will consist of 
> ascii keywords, some will be icon images, etc.....  Human readability 
> should not be considered fundamental to a name component, especially 
> since programs with no interest in readability may be the only direct 
> users of the name.

If the user never sees a name, it doesn't matter. However the user actually sees
and reads the filenames in /home, portable media, networks devices and lots of
places. However, when a user has named a component those characters are those
that are important to the user because those form an "image" (since you introduced
the term) or "sound" that the user remembers and associates with the content. A 
character is the simplest form of image so it should always look the same.

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:20                   ` Marc Lehmann
  2004-02-16 20:26                     ` Linus Torvalds
@ 2004-02-18  2:49                     ` Rob Landley
  1 sibling, 0 replies; 50+ messages in thread
From: Rob Landley @ 2004-02-18  2:49 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: Linux kernel

On Monday 16 February 2004 14:20, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it
> > up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.

Would you please learn the difference between "you are wrong" and "I 
disagree"?

Rob



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-16 20:21                   ` bert hubert
  2004-02-16 20:33                     ` Marc Lehmann
@ 2004-02-18  2:58                     ` H. Peter Anvin
  2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  7:25                       ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  2:58 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20040216202142.GA5834@outpost.ds9a.nl>
By author:    bert hubert <ahu@ds9a.nl>
In newsgroup: linux.dev.kernel
> 
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
> 

Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
for "filesystem safe Unicode transformation format."  
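
The "filesystem safe" property can be checked directly. A small Python
sketch (the helper name multibyte_is_ascii_free and the sample
characters are only illustrative): every byte of a multi-byte UTF-8
character is 0x80 or above, so '/' (0x2F) and NUL (0x00) can only ever
mean themselves.

```python
def multibyte_is_ascii_free(ch: str) -> bool:
    """True if no byte of a non-ASCII character's UTF-8 form is below 0x80."""
    encoded = ch.encode("utf-8")
    return len(encoded) == 1 or all(byte >= 0x80 for byte in encoded)

samples = ["\u00e9", "\u20ac", "\u65e5", "\U0001F600"]
assert all(multibyte_is_ascii_free(ch) for ch in samples)

# A byte-wise search for the path separator therefore cannot match in the
# middle of a character.
assert b"/" not in "\u65e5\u672c\u8a9e\u20ac".encode("utf-8")
assert b"\x00" not in "\u65e5\u672c\u8a9e\u20ac".encode("utf-8")
```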

> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE 
> F0 80 80 AE F0 80 80 AE 
> F8 80 80 80 AE F8 80 80 80 AE 
> FC 80 80 80 80 AE FC 80 80 80 80 AE

No, they don't.

The first represents "..", the rest are illegal encodings and
do not decode to anything.

Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard.  The
only acceptable decoding is to throw an error, or use an out-of-band
encoding mechanism to denote "bad bytecode."
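
To make the hazard concrete, here is a Python sketch. dwim_decode is a
hypothetical, deliberately broken decoder written for this illustration
(it handles only 1- and 2-byte forms); strict_decode just wraps Python's
built-in decoder, which rejects overlong forms:

```python
def dwim_decode(data: bytes) -> str:
    """Hypothetical lenient decoder that accepts overlong 2-byte forms.

    Shown only to illustrate the hazard; do not use.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # No check that the value really needed two bytes: this is the bug.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte 0x{b:02X} at offset {i}")
    return "".join(out)

def strict_decode(data: bytes) -> str:
    """Decode with Python's built-in decoder, which rejects overlong forms."""
    return data.decode("utf-8")  # raises UnicodeDecodeError on C0 AE

overlong = b"\xc0\xae\xc0\xae"

# A byte-level filter looking for ".." sees nothing suspicious...
assert b".." not in overlong
# ...but the lenient decoder turns the same bytes into "..".
assert dwim_decode(overlong) == ".."
```

A check done on the raw bytes and a later lenient decode disagree about
what the name is, which is exactly how a path filter gets bypassed.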

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it doesn't.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

UTF-8 requires the use of the shortest possible encoding.  An
application which doesn't obey that and tries to be "smart" is a
security hazard.

It is a bit unfortunate that the encoding doesn't exclude these by
design, as opposed to by error checking; that makes it a little too easy
for clueless programmers to skip :(

	-hpa

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                     ` H. Peter Anvin
@ 2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18  7:25                       ` bert hubert
  1 sibling, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:13 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
> Those of us who have been involved with the issue have fought
> *extremely* hard against DWIM decoders which try to decode the latter
> sequences into ".." -- it's incorrect, and a security hazard.  The
> only acceptable decoding is to throw an error, or use an out-of-band
> encoding mechanism to denote "bad bytecode."

Somebody correctly pointed out that you do not need any out-of-band 
encoding mechanism - the very fact that it's an invalid sequence is in 
itself a perfectly fine flag. No out-of-band signalling required.

The only thing you should make sure of is to not try to normalize it (that 
would hide the error). Just keep carrying the bad sequence along, and 
everybody is happy. Including the filesystem functions that get the "bad" 
name and match it exactly to what it should be matched against.

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                       ` Linus Torvalds
@ 2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18  3:30                           ` Linus Torvalds
  2004-02-18 11:24                           ` Jamie Lokier
  2004-02-18 11:33                         ` Jamie Lokier
  1 sibling, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  3:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
> 
>>Those of us who have been involved with the issue have fought
>>*extremely* hard against DWIM decoders which try to decode the latter
>>sequences into ".." -- it's incorrect, and a security hazard.  The
>>only acceptable decoding is to throw an error, or use an out-of-band
>>encoding mechanism to denote "bad bytecode."
> 
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.
> 
> The only thing you should make sure of is to not try to normalize it (that 
> would hide the error). Just keep carrying the bad sequence along, and 
> everybody is happy. Including the filesystem functions that get the "bad" 
> name and match it exactly to what it should be matched against.
> 

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes.  Consider a UTF-8 decoder which uses
values in the 0x800000xx range to encode "bogus bytes"; that way it
wouldn't alias to anything else, but the bogus sequence "C0 AE" could be
represented as 0x800000C0 0x800000AE and displayed to the user as
\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80
C2 AE).  This would make it possible to figure out in, for example, an
ls listing, what those broken filenames are actually composed of.
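
Python has a comparable mechanism, mentioned here as an analogy and not
as the 0x800000xx scheme itself: the "surrogateescape" error handler
maps each undecodable byte 0xXX to the lone surrogate U+DCXX, which
valid UTF-8 can never produce. A short sketch (the helper show is
hypothetical):

```python
bogus = b"\xc0\xae"                      # overlong, invalid UTF-8
legit = "\u00c0\u00ae".encode("utf-8")   # bytes C3 80 C2 AE

as_text = bogus.decode("utf-8", errors="surrogateescape")
# Each bad byte becomes one escape code point, U+DCC0 and U+DCAE here.
assert [hex(ord(c)) for c in as_text] == ["0xdcc0", "0xdcae"]
# It does not alias the legitimate two-character string.
assert as_text != legit.decode("utf-8")
# Encoding with the same handler restores the original bytes exactly.
assert as_text.encode("utf-8", errors="surrogateescape") == bogus

def show(text: str) -> str:
    """Render escaped bytes as \\xNN for display, as in an ls listing."""
    return "".join(
        f"\\x{ord(c) - 0xDC00:02X}" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )
```

With this, show(as_text) yields the four-character-per-byte form
\xC0\xAE, while ordinary characters pass through unchanged.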

There are some advantages to being able to represent all possible byte
sequences and present them to the user, even if they're bogus.

	-hpa

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                         ` H. Peter Anvin
@ 2004-02-18  3:30                           ` Linus Torvalds
  2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 11:24                           ` Jamie Lokier
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18  3:30 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes. 

I'd suggest just doing that when you convert the utf-8 format to printable 
format _anyway_.  At that point you just make the "printable" 
representation be the binary escape sequence (which you have to have for 
other non-printable utf-8 characters anyway).

And if you do things right (ie you allow user input in that same escaped 
output format), you can allow users to re-create the exact "broken utf-8". 
Which is actually important just so that the user can fix it up (ie 
imagine the user noticing that the filename is broken, and now needs to do 
a "mv broken-name fixed-name" - the user needs some way to re-create the 
brokenness).

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:30                           ` Linus Torvalds
@ 2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 10:29                               ` Robin Rosenberg
  2004-02-18 15:35                               ` Linus Torvalds
  0 siblings, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18  5:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
> 
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
>>Well, the reason you'd want an out-of-band mechanism is to be able to
>>display it as some kind of escapes. 
> 
> 
> I'd suggest just doing that when you convert the utf-8 format to printable 
> format _anyway_.  At that point you just make the "printable" 
> representation be the binary escape sequence (which you have to have for 
> other non-printable utf-8 characters anyway).
> 

What does "printable" mean in this context?  Typically you have to 
convert it to UCS-4 first, so you can index into your font tables, then 
you have to create the right composition, apply the bidirectional text 
algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it 
layered.  What I described was the first step of that -- mostly trying 
to show that "throwing an error" doesn't necessarily mean "produce no 
output."  What you shouldn't do, though, is alias it with legitimate input.

> And if you do things right (ie you allow user input in that same escaped 
> output format), you can allow users to re-create the exact "broken utf-8". 
> Which is actually important just so that the user can fix it up (ie 
> imagine the user noticing that the filename is broken, and now needs to do 
> a "mv broken-name fixed-name" - the user needs some way to re-create the 
> brokenness).

Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
think this is a good UI for shells to follow.  The \x representation 
then doesn't stand for characters but for bytes.  It may be desirable to 
disallow encoding of *valid* UTF-8 characters this way, though.

	-hpa

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:06                                   ` Alex Belits
  2004-02-17 21:47                                     ` Jamie Lokier
@ 2004-02-18  7:23                                     ` Marc Lehmann
  1 sibling, 0 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-18  7:23 UTC (permalink / raw)
  To: linux-kernel

On Tue, Feb 17, 2004 at 02:06:21PM -0700, Alex Belits <abelits@phobos.illtel.denver.co.us> wrote:
>   UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I

rxvt-unicode (mixed fonts, bad complex script), and mlterm (no mixed
fonts, very good complex script support), are not at all bloated, have a
_much_ smaller memory footprint than xterm and are even faster on text
output and scrolling complex scripts than xterm (by a factor of two).

(Of course, gnome-terminal is bloated. loading it requires 45MB of main
memory here and then it's still 5-10 times slower than xterm).

That UTF-8/Unicode in any way means bloated (I know you did not directly
imply this) is a widely circulating but wrong idea nowadays.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:58                     ` H. Peter Anvin
  2004-02-18  3:13                       ` Linus Torvalds
@ 2004-02-18  7:25                       ` bert hubert
  1 sibling, 0 replies; 50+ messages in thread
From: bert hubert @ 2004-02-18  7:25 UTC (permalink / raw)
  To: linux-kernel

On Wed, Feb 18, 2004 at 02:58:42AM +0000, H. Peter Anvin wrote:

> Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
> for "filesystem safe Unicode transformation format."  

That might explain a few things.

> > F8 80 80 80 AE F8 80 80 80 AE 
> > FC 80 80 80 80 AE FC 80 80 80 80 AE
> 
> No, they don't.

Serves me right for trusting a random site, apologies. 

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                             ` H. Peter Anvin
@ 2004-02-18 10:29                               ` Robin Rosenberg
  2004-02-18 11:49                                 ` Tomas Szepe
  2004-02-18 15:35                               ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 10:29 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

On Wednesday 18 February 2004 06.30, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> >>Well, the reason you'd want an out-of-band mechanism is to be able to
> >>display it as some kind of escapes. 
> > I'd suggest just doing that when you convert the utf-8 format to printable 
> > format _anyway_.  At that point you just make the "printable" 
> > representation be the binary escape sequence (which you have to have for 
> > other non-printable utf-8 characters anyway).
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described was the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

I think you can use libicu here. Converting to UCS-4 to determine the
character type doesn't mean you will ever have actual strings of UCS-4. It could
be done character by character just for the lookup, so you can keep the out-of-band
error flags internally.

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

Agree. \u80808080 I would assume represents a valid character, while \x80\x80\x80\x80
does not. A problem with invalid sequences I just noted is that they break some of 
the nice properties of UTF-8, that people will assume apply, i.e. that you can parse it 
backwards. With UTF-8 (i.e. well-formed utf-8) you can point at a byte and figure "this is 
not the first byte", let's skip backwards to find the start. If invalid sequences can ever occur
you must read from the start of the string.
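
The backward scan for the well-formed case can be sketched in Python as
follows. The name char_start is hypothetical, and the function assumes
its input really is valid UTF-8:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i].

    Assumes data is well-formed UTF-8.
    """
    # Continuation bytes have the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u00e9\u20ac".encode("utf-8")   # bytes: 61 | C3 A9 | E2 82 AC

assert char_start(data, 5) == 3   # last byte of the 3-byte character
assert char_start(data, 2) == 1   # second byte of the 2-byte character
assert char_start(data, 0) == 0   # ASCII is its own start
```

In well-formed input the loop steps back at most three bytes, since no
character is longer than four bytes.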

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  2:08                                 ` Robin Rosenberg
@ 2004-02-18 11:06                                   ` Jamie Lokier
  0 siblings, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:06 UTC (permalink / raw)
  To: Robin Rosenberg
  Cc: Hans Reiser, Stefan Smietanowski, Linus Torvalds, Marc Lehmann,
	viro, Linux kernel

Robin Rosenberg wrote:
> A character is the simplest form of image so it should always look the same.

People who need the computer to _speak_ names need language or
phonetic information attached to a name, for it to be spoken properly.

On this, Alex Belits has a good point.  It's all very well
standardising on UTF-8 so every name can be displayed nicely.  That is
incomplete for a user who needs "ls" to work audibly, though.

In practice, such a user configures their machine to assume a
particular language, or guess it with bias to the one they use most often.

That is, in some ways, the same problem as having a mixture of
filenames in an unknown character encoding, except that UTF-8 doesn't
solve it.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:22                         ` H. Peter Anvin
  2004-02-18  3:30                           ` Linus Torvalds
@ 2004-02-18 11:24                           ` Jamie Lokier
  1 sibling, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:24 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

H. Peter Anvin wrote:
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes.

As soon as you go to "display", you need a mechanism to escape lots of
characters, not just malformed UTF-8.  Consider: \u0000, \u001B,
\u0007 and such need to be escaped too.

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  3:13                       ` Linus Torvalds
  2004-02-18  3:22                         ` H. Peter Anvin
@ 2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18 16:47                           ` H. Peter Anvin
  2004-02-18 19:59                           ` Linus Torvalds
  1 sibling, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel

Linus Torvalds wrote:
> Somebody correctly pointed out that you do not need any out-of-band 
> encoding mechanism - the very fact that it's an invalid sequence is in 
> itself a perfectly fine flag. No out-of-band signalling required.

Technically this is almost(*) correct, however a _lot_ of code exists
which assumes logical properties of UTF-8.  (See, for example, the
"stty utf8" patch).

Perl, for example, allows you to pass around invalid sequences in
exactly the way you describe.  It works, right up until you do
something like length() or substr() or a regex match.  Then Perl
screws up the answer, because it sees something like 0xfd and just
assumes it can skip the next 5 bytes, without checking them.
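
The failure mode is easy to reproduce. The Python sketch below is not
Perl's code; naive_length is a hypothetical function that trusts the
lead byte the way the paragraph above describes, and careful_length
counts each undecodable byte as one unit instead:

```python
def naive_length(data: bytes) -> int:
    """Count characters by trusting each lead byte's implied length."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0xC0:
            step = 1      # ASCII, or a stray continuation byte
        elif b < 0xE0:
            step = 2
        elif b < 0xF0:
            step = 3
        elif b < 0xF8:
            step = 4
        elif b < 0xFC:
            step = 5      # obsolete 5- and 6-byte forms
        else:
            step = 6
        count += 1
        i += step         # skips the following bytes without checking them
    return count

def careful_length(data: bytes) -> int:
    """Count valid characters, plus one unit per undecodable byte."""
    return len(data.decode("utf-8", errors="surrogateescape"))

name = b"\xfdabcdeX"
assert naive_length(name) == 2     # 0xFD swallows "abcde"
assert careful_length(name) == 7   # 0xFD counted alone, then six ASCII bytes
```

On valid input the two functions agree; they only diverge once a
malformed sequence is carried along.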

hpa's suggestion that invalid bytes are treated as 0x800000xx works
very nicely, *iff* a program is absolutely consistent about its
treatment of bytes in that way.  When there's a mixture of code which
interprets malformed UTF-8 in different ways, then it's messy and
sometimes a security hazard.

-- Jamie

(*) - It's fine until you concatenate two malformed strings.  Then the
      out-of-band signal is lost if the combination is valid UTF-8.
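
A minimal Python illustration of the footnote (is_valid_utf8 is a
hypothetical helper; the byte strings are made up for the example):

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

head = b"caf\xc3"   # ends with a lead byte that has no continuation
tail = b"\xa9"      # a stray continuation byte

assert not is_valid_utf8(head)
assert not is_valid_utf8(tail)
# Joined, the two malformed halves form a perfectly valid string.
assert is_valid_utf8(head + tail)
assert (head + tail).decode("utf-8") == "caf\u00e9"
```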

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 10:29                               ` Robin Rosenberg
@ 2004-02-18 11:49                                 ` Tomas Szepe
  2004-02-18 11:59                                   ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 11:49 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 11:29 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

[snip]
> not the first byte", let's skip backwards to find the start. If invalid sequences can ever occur
[snip]

Would you _please_ read the lkml FAQ and stop posting e-mails with lines
longer than 80 characters?  Thank you.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:49                                 ` Tomas Szepe
@ 2004-02-18 11:59                                   ` Robin Rosenberg
  2004-02-18 12:05                                     ` Tomas Szepe
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 11:59 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> longer than 80 characters?  Thank you.

As soon as someone asks nicely... I thought any decent mail client simply
wrapped the lines. Hmm, remember some old system with 3270 access that
didn't.

I'll try to remember that.

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:59                                   ` Robin Rosenberg
@ 2004-02-18 12:05                                     ` Tomas Szepe
  2004-02-18 12:34                                       ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 12:05 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: linux-kernel

On Feb-18 2004, Wed, 12:59 +0100
Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:

> On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > longer than 80 characters?  Thank you.
> 
> As soon as someone asks nicely...  I thought any decent mail client simply
> wrapped the lines.

1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.

2)  A mail client that will wrap the lines will make your posts look like this:

<cut>
Having to put up with the existence of Windows day in and out is the reason I'm
still on
an eight-bit encoding.  Sorry for not explaining the REAL problem, but only a
partial
problem. I need to support all kinds of clients on Windows with protocols that  
convey no
character set info. With samba that's no problem. Having to put up with a Unix
world running
<cut>

> I'll try to remember that.

Thanks again.

-- 
Tomas Szepe <szepe@pinerecords.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 12:05                                     ` Tomas Szepe
@ 2004-02-18 12:34                                       ` Robin Rosenberg
  0 siblings, 0 replies; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 12:34 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel

On Wednesday 18 February 2004 13.05, Tomas Szepe wrote:
> On Feb-18 2004, Wed, 12:59 +0100
> Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:
> 
> > On Wednesday 18 February 2004 12.49, Tomas Szepe wrote:
> > > Would you _please_ read the lkml FAQ and stop posting e-mails with lines
> > > longer than 80 characters?  Thank you.
> > 
> > As soon as someone asks nicely...  I thought any decent mail client simply
> > wrapped the lines.
> 
> 1)  Quite the contrary.  Any _decent_ mail client will _not_ wrap the lines.
>
> 2)  A mail client that will wrap the lines will make your posts look like 
this:
> 
> <cut>
> Having to put up with the existence of Windows day in and out is the reason 
I'm
> still on
> an eight-bit encoding.  Sorry for not explaining the REAL problem, but only 
a
> partial
> problem. I need to support all kinds of clients on Windows with protocols 
that  
> convey no
> character set info. With samba that's no problem. Having to put up with a 
Unix
> world running
> <cut>

That's what happens when the sender wraps the lines at column 80 and your 
client wraps at 72 (or a similar situation), which is just another reason not
to wrap when sending and to let the user's client do whatever the user thinks
is fine.

In order not to wrap and destroy information I have the autowrap feature off 
when composing mail, because wrapped and cut stack traces, cuts from log files, 
etc. are a pain. 

BTW, the 80 character rule is only mentioned with respect to signatures.

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18  5:30                             ` H. Peter Anvin
  2004-02-18 10:29                               ` Robin Rosenberg
@ 2004-02-18 15:35                               ` Linus Torvalds
  2004-02-18 19:47                                 ` Tomas Szepe
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 15:35 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Kernel Mailing List

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> 
> What does "printable" mean in this context?  Typically you have to 
> convert it to UCS-4 first, so you can index into your font tables, then 
> you have to create the right composition, apply the bidirectional text 
> algorithm, and so forth.

Not all characters _have_ font entries. And even when they have font 
entries, they may need escaping for other reasons (ie you may want to 
marshall UTF-8 as plain ASCII just because you want to use a portable 
format for transfer).

Think about the simple (hex) string x0A x00. That's a well-defined UTF-8
string, yet if you want to print it as a filename on the console, you
should obviously print it as "\n" or some similar escaped sequence 
(actually, that's a bad example, since it's a special case, and it would 
probably be better to use the example string x7F x00, which would be shown 
as \u177 or something).

The same is true for a _lot_ of perfectly fine UTF-8 sequences, no?

That implies that you have to use an escaped sequence _anyway_. So as you 
go along, turning the string into something printable, you might as well 
escape the invalid UTF-8 sequences.

In other words: you walk the utf-8 string one character at a time, 
converting it to whatever format (eg UCS-4) you have for font lookup, but 
you also escape characters that you don't have font entries for or that 
aren't in proper UTF-8 format.

When converting to UCS-2, you have to check for the proper format 
_anyway_, so none of this is in any way "extra work". Instead of just 
aborting on an invalid UTF-8 character, you quote it, exactly the same way 
you'd have to quote a _valid_ one that you can't just show as a string.

> Rendering general Unicode text is complex enough that you really want it 
> layered.  What I described was the first step of that -- mostly trying 
> to show that "throwing an error" doesn't necessarily mean "produce no 
> output."  What you shouldn't do, though, is alias it with legitimate input.

Exactly. And since you need an escape sequence anyway, what's the problem?

> > And if you do things right (ie you allow user input in that same escaped 
> > output format), you can allow users to re-create the exact "broken utf-8". 
> > Which is actually important just so that the user can fix it up (ie 
> > imagine the user noticing that the filename is broken, and now needs to do 
> > a "mv broken-name fixed-name" - the user needs some way to re-create the 
> > brokenness).
> 
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or 
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I 
> think this is a good UI for shells to follow.  The \x representation 
> then doesn't stand for characters but for bytes.  It may be desirable to 
> disallow encoding of *valid* UTF-8 characters this way, though.

You need to encode even valid UTF-8, since you may not find a font entry 
for the character, or the character just isn't appropriate in that context 
(ie you can't show a newline).

But it makes perfect sense to use a policy of:
 - escape valid UTF-8 characters as '\u7777'
 - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
   '\xC0\x80\x80', whatever)
 - (and, obviously, escape the valid UTF-8 character '\' as '\\').

Don't you agree? It clearly allows all the cases, and you can re-generate 
the _exact_ original stream of bytes from the above (ie it is nicely 
reversible, which in my opinion is a requirement).
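
The policy above can be sketched in Python. This is one possible
reading, not an existing tool: escape_name and unescape_name are
hypothetical names, printable ASCII is passed through unescaped, every
other valid character becomes \uXXXX (or \UXXXXXXXX above U+FFFF), and
each invalid byte becomes \xXX.

```python
def escape_name(raw: bytes) -> str:
    """Escape a filename to printable ASCII, reversibly."""
    out = []
    i = 0
    while i < len(raw):
        ch = None
        for n in (1, 2, 3, 4):              # UTF-8 characters are 1 to 4 bytes
            try:
                ch = raw[i:i + n].decode("utf-8")
                break
            except UnicodeDecodeError:
                continue
        if ch is None:
            out.append(f"\\x{raw[i]:02X}")  # invalid: escape one raw byte
            i += 1
            continue
        i += n
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif 0x20 <= cp < 0x7F:
            out.append(ch)                  # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\U{cp:08X}")
    return "".join(out)

def unescape_name(text: str) -> bytes:
    """Invert escape_name, returning the exact original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out += text[i].encode("ascii")
            i += 1
            continue
        kind = text[i + 1]
        if kind == "\\":
            out += b"\\"
            i += 2
        elif kind == "x":
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:                               # 'u' takes 4 hex digits, 'U' takes 8
            width = {"u": 4, "U": 8}[kind]
            out += chr(int(text[i + 2:i + 2 + width], 16)).encode("utf-8")
            i += 2 + width
    return bytes(out)
```

Because valid characters only ever come back from \u and invalid bytes
only from \x, unescape_name(escape_name(raw)) returns the original
bytes for any input, including names that are not valid UTF-8.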

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                         ` Jamie Lokier
@ 2004-02-18 16:47                           ` H. Peter Anvin
  2004-02-18 19:59                           ` Linus Torvalds
  1 sibling, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 16:47 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel

Jamie Lokier wrote:
> 
> hpa's suggestion that invalid bytes are treated as 0x800000xx works
> very nicely, *iff* a program is absolutely consistent about its
> treatment of bytes in that way.  When there's a mixture of code which
> interprets malformed UTF-8 in different ways, then it's messy and
> sometimes a security hazard.
> 

Absolutely.  It has to be considered very carefully.

	-hpa

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 15:35                               ` Linus Torvalds
@ 2004-02-18 19:47                                 ` Tomas Szepe
  2004-02-18 20:01                                   ` H. Peter Anvin
  0 siblings, 1 reply; 50+ messages in thread
From: Tomas Szepe @ 2004-02-18 19:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Kernel Mailing List

On Feb-18 2004, Wed, 07:35 -0800
Linus Torvalds <torvalds@osdl.org> wrote:

> But it makes perfect sense to use a policy of:
>  - escape valid UTF-8 characters as '\u7777'
>  - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>    '\xC0\x80\x80', whatever)
>  - (and, obviously, escape the valid UTF-8 character '\' as '\\').
> 
> Don't you agree? It clearly allows all the cases, and you can re-generate 
> the _exact_ original stream of bytes from the above (ie it is nicely 
> reversible, which in my opinion is a requirement).

I really really hope this is _exactly_ what we're going to see in practice.

-- 
Tomas Szepe <szepe@pinerecords.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 11:33                         ` Jamie Lokier
  2004-02-18 16:47                           ` H. Peter Anvin
@ 2004-02-18 19:59                           ` Linus Torvalds
  2004-02-18 20:08                             ` H. Peter Anvin
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 19:59 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: H. Peter Anvin, linux-kernel

On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band 
> > encoding mechanism - the very fact that it's an invalid sequence is in 
> > itself a perfectly fine flag. No out-of-band signalling required.
> 
> Technically this is almost(*) correct,
> 
> (*) - It's fine until you concatenate two malformed strings.  Then the
>       out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says 
"this stuff is wrong, because it was wrong at some point in the past", and 
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
in a variety of circumstances, all basically having to do with lower 
levels not knowing about higher-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes 
as a stream of bytes, with blocking rules that have _nothing_ to do with 
the content (timing, mtu's, extended TCP headers etc etc). That doesn't 
mean that you shouldn't be able to
 - work on the partial results and show them to the user as UTF-8
 - be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea. If 
you get a valid UTF-8 stream as a result of concatenation, you should 
consider that to be the correct behaviour, or you should CHECK BEFOREHAND 
if you think it is illegal.
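
To make the web-page case concrete, here is a small Python sketch. The
chunk boundaries and the helper name is_valid_utf8 are made up for
illustration; the point is only that two pieces that are each malformed
on their own join into perfectly valid UTF-8.

```python
import codecs

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Two network chunks that split the two-byte sequence for U+00E9.
first = b"caf\xc3"
second = b"\xa9 au lait"

each_valid = (is_valid_utf8(first), is_valid_utf8(second))
joined_valid = is_valid_utf8(first + second)

# An incremental decoder shows partial results as they arrive and
# holds back the dangling lead byte until the next chunk completes it.
decoder = codecs.getincrementaldecoder("utf-8")()
shown = [decoder.decode(first), decoder.decode(second)]
```

Neither chunk is valid alone, the concatenation is, and the incremental
decoder displays "caf" first and the rest once the second chunk arrives.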

		Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:47                                 ` Tomas Szepe
@ 2004-02-18 20:01                                   ` H. Peter Anvin
  2004-02-18 21:22                                     ` Robin Rosenberg
  0 siblings, 1 reply; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:01 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: Linus Torvalds, Kernel Mailing List

Tomas Szepe wrote:
> On Feb-18 2004, Wed, 07:35 -0800
> Linus Torvalds <torvalds@osdl.org> wrote:
> 
>>But it makes perfect sense to use a policy of:
>> - escape valid UTF-8 characters as '\u7777'

[And e.g. \U00017777 for characters above \uFFFF]

>> - escape _invalid_ UTF-8 characters as their hex byte sequence (ie 
>>   '\xC0\x80\x80', whatever)
>> - (and, obviously, escape the valid UTF-8 character '\' as '\\').
>>
>>Don't you agree? It clearly allows all the cases, and you can re-generate 
>>the _exact_ original stream of bytes from the above (ie it is nicely 
>>reversible, which in my opinion is a requirement).
> 
> I really really hope this is _exactly_ what we're going to see in practice.
> 

Same here.  This is clearly The Right Thing[TM].

	-hpa


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 19:59                           ` Linus Torvalds
@ 2004-02-18 20:08                             ` H. Peter Anvin
  0 siblings, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel

Linus Torvalds wrote:
> 
> But that's what you _want_. Having a real out-of-band signal that says 
> "this stuff is wrong, because it was wrong at some point in the past", and 
> not allowing concatenation of blocks of utf-8 bytes would be _bad_.
> 

Indeed.  What it does mean, however, is that you have to consider your
concatenation issues if you perform the concatenation in UCS-4 space,
for example, a string that ends in whatever code you have chosen for
<BOGUS-C8> that gets concatenated with <BOGUS-80> needs to get converted
to a valid <U+0200>.  This is of course not an issue if you do the
concatenation in UTF-8 space and don't do round-trip conversion.
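
A sketch of that pitfall in Python. As a stand-in for "whatever code you
have chosen for <BOGUS-xx>", it uses Python's own "surrogateescape"
convention (byte 0xNN becomes U+DCNN); that choice is an assumption for
illustration and is not the 0x800000xx scheme mentioned earlier. The
helper names are hypothetical.

```python
def to_codepoints(raw: bytes) -> str:
    """Decode UTF-8, keeping each invalid byte as a bogus code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def to_bytes(text: str) -> bytes:
    """Encode back, turning bogus code points into their original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

def renormalize(text: str) -> str:
    """Round-trip through bytes so adjacent bogus codes can merge."""
    return to_codepoints(to_bytes(text))

left = to_codepoints(b"\xc8")    # one bogus code point for byte C8
right = to_codepoints(b"\x80")   # one bogus code point for byte 80

naive = left + right             # still two bogus code points
fixed = renormalize(naive)       # the single valid character U+0200
```

Concatenating in code-point space leaves two bogus codes even though the
underlying bytes C8 80 are a valid encoding of U+0200; the extra
renormalize step is what restores the valid character.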

None of this is hard, it just takes thinking about it rather than
automatically doing the obvious thing.

> The thing is, concatenating two malformed UTF-8 strings is normal behaviour 
> in a variety of circumstances, all basically having to do with lower 
> levels not knowing about higher-level concepts.

Indeed.

	-hpa

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 20:01                                   ` H. Peter Anvin
@ 2004-02-18 21:22                                     ` Robin Rosenberg
  2004-02-18 21:42                                       ` H. Peter Anvin
  0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 21:22 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> [And e.g. \U00017777 for characters above \uFFFF]

Isn't that octal :-)

-- robin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-18 21:22                                     ` Robin Rosenberg
@ 2004-02-18 21:42                                       ` H. Peter Anvin
  0 siblings, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 21:42 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List

Robin Rosenberg wrote:
> On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote:
> 
>>[And e.g. \U00017777 for characters above \uFFFF]
> 
> Isn't that octal :-)
> 

No.

	-hpa


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-17 21:47                                     ` Jamie Lokier
@ 2004-02-22 15:32                                       ` Eric W. Biederman
  2004-02-22 16:28                                         ` Jamie Lokier
  0 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 15:32 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Alex Belits wrote:
> > > No, I think hacking the terminal I/O is the best bet here.  Then _all_
> > > programs which currently work with UTF-8 terminals, which is rapidly
> > > becoming most of them, will work the same with both kinds of terminal,
> > > and the illusion of perfection will be complete and beautiful.
> > 
> >   UTF-8 terminals (and variable-encoding terminals) already exist,
> > gnome-terminal is one of them. They are, of course, bloated pigs, but I
> > would rather have the bloat and idiosyncrasy in the user interface where
> > it belongs.
> 
> Yes, I am using it right now.  The fancy characters work well in it.
> Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
> naturally like to access my files in the same way.

Basically I think this is just a matter of modifying telnetd and
sshd so that for the display they follow the user's locale,
at least in cooked mode.

Does anyone have a good grasp of what the exact semantics should be and
where the translation should happen?  I know we need to delay the
translation as long as possible so we can get binary streams flowing
through these protocols.

I guess my question is when do we know the information is going to
a terminal so we should translate it?

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 15:32                                       ` Eric W. Biederman
@ 2004-02-22 16:28                                         ` Jamie Lokier
  2004-02-22 21:53                                           ` Eric W. Biederman
  0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-22 16:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Eric W. Biederman wrote:
> I guess my question is when do we know the information is going to
> a terminal so we should translate it?

When a program is writing to a terminal device, then we know it's
going to a terminal _or_ to a program which is pretending to be one
(pseudo-terminal).  Either way, the behaviour should be the same.
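
From user space the check looks roughly like this (a Python sketch; the
function name goes_to_terminal is hypothetical, and the pseudo-terminal
part assumes a Unix-like system where os.openpty is available):

```python
import os

def goes_to_terminal(fd: int) -> bool:
    """True if fd refers to a terminal device, real or pseudo."""
    return os.isatty(fd)

# A pipe is not a terminal, so no translation would be applied.
read_end, write_end = os.pipe()
pipe_result = goes_to_terminal(write_end)
os.close(read_end)
os.close(write_end)

# The slave side of a pseudo-terminal looks exactly like a terminal.
try:
    master, slave = os.openpty()
except OSError:
    pty_result = None            # no pty available in this environment
else:
    pty_result = goes_to_terminal(slave)
    os.close(master)
    os.close(slave)
```

The program cannot tell a real terminal from a pseudo-terminal this way,
which matches the point that the behaviour should be the same for both.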

The "screen" program can be used to do translation, although it's a
rather cumbersome way to go about it, and it has other effects which
are annoying (at least one key is always designated for "screen" commands).

-- Jamie

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
  2004-02-22 16:28                                         ` Jamie Lokier
@ 2004-02-22 21:53                                           ` Eric W. Biederman
  0 siblings, 0 replies; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 21:53 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:

> Eric W. Biederman wrote:
> > I guess my question is when do we know the information is going to
> > a terminal so we should translate it?
> 
> When a program is writing to a terminal device, then we know it's
> going to a terminal _or_ to a program which is pretending to be one
> (pseudo-terminal).  Either way, the behaviour should be the same.
> 
> The "screen" program can be used to do translation, although it's a
> rather cumbersome way to go about it, and it has other effects which
> are annoying (at least one key is always designated for "screen" commands).

Right.  At this point I am not worried about temporary solutions.  I
want to pin down how things should be implemented.  So the user space
programs can be fixed.  Pardon me while I think aloud to frame the problem.

First it is worth noting that the existing practice is that ttys 
always use the character set encoding of the user.  Even X cut and
paste frequently abuses the iso8859-1 range, using the native
character set encoding instead of iso8859-1.

Now the work is how to get multiple locales to play nicely with each
other.  utf-8 and unicode are convenient for that as they preserve the
existing assumptions that terminals, filenames, and text files are
all using the same character set encoding, even when multiple locales
are involved.

So within one machine utf-8 solves the multiple locale problem.  The
problem has now moved to interoperability between machines.  Since
multiple machines have different upgrade cycles, and are in different
administrative domains everyone does not move to utf-8 at the same
time.

When we add the assertion that all I/O going through a terminal device
is in the native locale we break 8bit transparency.  This holds true
even in some instances where both sides use the same character set
encoding, such as utf8.

There are some mitigating factors to this.  ssh already documents
pseudo tty's as potentially breaking 8 bit transparency.  And
applications that require ttys for stdin/stdout are most likely
interactive.  Interactive programs are either character based, or
broken.

Being an unclean channel for pipes will affect at least XMODEM,
YMODEM, and ZMODEM protocols, and possibly ppp.  These programs
already know how to avoid problem characters and because ascii is a
common subset of most character set encodings the effect should be no
worse than a line that is not 8 bit clean.

ssh at least has explicit options to allocate or not allocate a
pseudo-tty so getting an 8 bit clean data path is not a problem with
ssh.

The rule ``All data passing through a pseudo-tty is in the
character set encoding specified by the locale of the owner of the
tty'' seems both reasonable and no significant change from the current
status quo.

Now how does this get implemented?

On the wire between two machines I recommend passing unicode
characters.  Unicode guarantees no round trip loss for any of its
member character sets, and it reduces everything to one set of
translation tables.

By convention glibc stores unicode values in wchar_t.  mbsrtowcs will
convert multibyte strings to internal wide characters, based on the
current locale. wcstombs will do the opposite.  So going between
unicode and the character set encoding of the current locale is
straightforward.
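
The same idea sketched in Python rather than with the C library's
multibyte functions: Unicode is the pivot, and each side only needs a
conversion between its own encoding and Unicode.  The function name
transcode and the choice of EUC-JP and Shift_JIS as the two "locales"
are assumptions for illustration.

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Convert between two encodings, pivoting through Unicode."""
    text = data.decode(src_encoding)     # multibyte -> Unicode
    return text.encode(dst_encoding)     # Unicode -> multibyte

# The same three characters as seen by two differently configured machines.
unicode_text = "\u65e5\u672c\u8a9e"
as_euc = unicode_text.encode("euc_jp")
as_sjis = transcode(as_euc, "euc_jp", "shift_jis")
back_to_euc = transcode(as_sjis, "shift_jis", "euc_jp")
```

For text that both encodings can represent, the bytes differ on each
side but the round trip gives back the original.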

How do we convert the applications?

There are only four cases I can think of where we connect to a remote
system with terminal semantics.
1) Directly connected serial terminals.
2) telnetd
3) rshd
4) sshd

To my knowledge all of their protocols just pass through characters
and are neutral.  So changing these feels like a protocol extension,
ouch!  Those are the programs that bridge multiple administrative
domains, and they do deal with pseudo ttys so they are where something
needs to happen, to support different character set encodings on
different machines. 

If everyone just switches over to using utf-8 even the above cases are
fine.  So if there is a reasonable expectation that everyone will
change to using utf-8 in the near future even those programs don't
need to change.

Given the delay in changing protocols I propose 2 simple programs.
sh-utf8 and utf8-tty.  The first runs a command converting stdout and
stderr from utf8 to the current locale, and converting stdin into
utf8.  The second creates a pseudo tty and relays to its controlling
tty, assuming the controlling tty uses utf8 and its tty uses the
current locale.
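
To show how small the core of such a filter could be, here is a rough
Python sketch of the conversion loop only (no process spawning, no pty
handling).  The function name relay, its parameters, and the use of "?"
for unrepresentable characters are all assumptions, not a description
of any existing tool.

```python
import codecs
import io

def relay(src, dst, src_encoding, dst_encoding, chunk_size=4096):
    """Copy bytes from src to dst, converting the character encoding.

    Incremental codecs are used so that a multibyte character split
    across two reads is still converted correctly.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)(errors="replace")
    encoder = codecs.getincrementalencoder(dst_encoding)(errors="replace")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(encoder.encode(decoder.decode(chunk)))
    # Flush whatever the codecs are still holding at end of input.
    tail = decoder.decode(b"", final=True)
    dst.write(encoder.encode(tail, final=True))

# Simulate a utf8 program talking to an iso8859-1 terminal,
# reading one byte at a time to exercise the split-character case.
text = "na\u00efve caf\u00e9"
source = io.BytesIO(text.encode("utf-8"))
sink = io.BytesIO()
relay(source, sink, "utf-8", "iso8859-1", chunk_size=1)
converted = sink.getvalue()
```

A real sh-utf8 would wrap this loop around the child's stdin, stdout and
stderr and take the target encoding from the current locale.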

Looking around there already is a TTYConv program that seems to fill
this niche, except you must specify the character set encodings
manually.
http://bedroomlan.dyndns.org/~alexios/coding_ttyconv.html

Comments?

Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
@ 2004-02-23 11:35 Norman Diamond
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
  0 siblings, 1 reply; 50+ messages in thread
From: Norman Diamond @ 2004-02-23 11:35 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel

Eric W. Biederman wrote:

> First it is worth noting that the existing practice is that ttys
> always use the character set encoding of the user.

Each tty uses the character set encoding of that tty's user.  There were
times when I needed to have some tty windows open using EUC (ordinary work
on that Linux machine) and some tty windows open using SJIS (editing files
which would be sent to cellular telephones), in the same X session.  They
worked.

> Even X cut and paste frequently abuses the iso8859-1 range,

I'll take your word for it.  I've copied and pasted EUC strings, I've copied
and pasted SJIS strings, I don't know if X copy and paste abused EUC or SJIS
ranges, but it worked.

One thing I never thought of trying to test is to copy and paste between one
tty using EUC and one tty using SJIS.

> Now the work is how to get multiple locales to play nicely with each
> other.  utf-8 and unicode are convenient for that as they preserve the
> existing assumptions that terminals, filenames, and text files are
> all using the same character set encoding, even when multiple locales
> are involved.
>
> So within one machine utf-8 solves the multiple locale problem.

That preserves a nice fiction.  If you depend on assuming that fiction,
you'll get useless results.

> The rule ``All data passing through a pseudo-tty is in the
> character set encoding specified by the locale of the owner of the
> tty'' seems both reasonable and no significant change from the current
> status quo.

Yes, that is a return to usability.

> On the wire between two machines I recommend passing unicode
> characters.

Why should the wire get a different encoding than the user set in the
pseudo-tty?  Consider TeraTerm.  The user tells TeraTerm what character set
is in use on the wire, which is the same as the character set in use on the
remote side (where sshd or whatever server provides the pseudo-tty).
TeraTerm converts between that and the local character set (where the
TeraTerm program and window and user get the character set decided for them
by someone in Sasazuka or Redmond).

> By convention glibc stores unicode values in wchar_t.

That is hard to believe.  glibc existed before Unicode did and wchar_t
existed before Unicode did.  I sure thought that glibc existed in Japan at
the time, but I could be wrong, I didn't say this is impossible but merely
hard to believe.  In commercial Unix systems, wchar_t held either EUC or
SJIS depending on the vendor.

As usual I do not even have time to keep up with this thread, so if you have
questions then please CC me personally, though I don't know if I'll have
time to investigate anything that needs it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: UTF-8 practically vs. theoretically in the VFS API
       [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
@ 2004-02-23 19:13   ` Junio C Hamano
  0 siblings, 0 replies; 50+ messages in thread
From: Junio C Hamano @ 2004-02-23 19:13 UTC (permalink / raw)
  To: Norman Diamond; +Cc: linux-kernel

>>>>> "ND" == Norman Diamond <ndiamond@wta.att.ne.jp> writes:

ND> Eric W. Biederman wrote:
>> Even X cut and paste frequently abuses the iso8859-1 range,

ND> I'll take your word for it.  I've copied and pasted EUC
ND> strings, I've copied and pasted SJIS strings, I don't know
ND> if X copy and paste abused EUC or SJIS ranges, but it
ND> worked.

I do not know what Eric means by "abusing the iso8859-1 range",
but passing X selection between traditional X clients IIRC uses
compound text, which is an encoding vaguely similar to ISO-2022,
so clients like kterm can convert it back and forth with EUC or
SJIS as needed.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2004-02-23 19:13 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-23 11:35 UTF-8 practically vs. theoretically in the VFS API Norman Diamond
     [not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
2004-02-23 19:13   ` Junio C Hamano
     [not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
2004-02-14 23:06 ` JFS default behavior Robin Rosenberg
2004-02-14 23:29   ` viro
2004-02-15  0:07     ` Robin Rosenberg
2004-02-15  2:41       ` Linus Torvalds
2004-02-16 18:36         ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 18:49           ` Linus Torvalds
2004-02-16 19:26             ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
2004-02-16 19:48               ` John Bradford
2004-02-16 19:48                 ` Linus Torvalds
2004-02-16 20:20                   ` Marc Lehmann
2004-02-16 20:26                     ` Linus Torvalds
2004-02-18  2:49                     ` Rob Landley
2004-02-16 20:21                   ` bert hubert
2004-02-16 20:33                     ` Marc Lehmann
2004-02-18  2:58                     ` H. Peter Anvin
2004-02-18  3:13                       ` Linus Torvalds
2004-02-18  3:22                         ` H. Peter Anvin
2004-02-18  3:30                           ` Linus Torvalds
2004-02-18  5:30                             ` H. Peter Anvin
2004-02-18 10:29                               ` Robin Rosenberg
2004-02-18 11:49                                 ` Tomas Szepe
2004-02-18 11:59                                   ` Robin Rosenberg
2004-02-18 12:05                                     ` Tomas Szepe
2004-02-18 12:34                                       ` Robin Rosenberg
2004-02-18 15:35                               ` Linus Torvalds
2004-02-18 19:47                                 ` Tomas Szepe
2004-02-18 20:01                                   ` H. Peter Anvin
2004-02-18 21:22                                     ` Robin Rosenberg
2004-02-18 21:42                                       ` H. Peter Anvin
2004-02-18 11:24                           ` Jamie Lokier
2004-02-18 11:33                         ` Jamie Lokier
2004-02-18 16:47                           ` H. Peter Anvin
2004-02-18 19:59                           ` Linus Torvalds
2004-02-18 20:08                             ` H. Peter Anvin
2004-02-18  7:25                       ` bert hubert
2004-02-16 20:16                 ` Marc Lehmann
2004-02-16 20:20                   ` Jeff Garzik
2004-02-16 21:10                   ` viro
2004-02-17  7:18                   ` jw schultz
2004-02-17  7:42                   ` Nick Piggin
2004-02-16 20:03             ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 20:23               ` Linus Torvalds
2004-02-16 22:26                 ` Jamie Lokier
2004-02-16 22:40                   ` Linus Torvalds
2004-02-17  7:14                     ` Lehmann 
2004-02-17 11:20                       ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
2004-02-17 15:56                       ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
     [not found]                         ` <20040217161111.GE8231@schmorp.de>
2004-02-17 16:32                           ` Linus Torvalds
2004-02-17 16:46                             ` Jamie Lokier
2004-02-17 19:00                               ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 20:57                                 ` Jamie Lokier
2004-02-17 21:06                                   ` Alex Belits
2004-02-17 21:47                                     ` Jamie Lokier
2004-02-22 15:32                                       ` Eric W. Biederman
2004-02-22 16:28                                         ` Jamie Lokier
2004-02-22 21:53                                           ` Eric W. Biederman
2004-02-18  7:23                                     ` Marc Lehmann
2004-02-17 21:23                                   ` Matthew Kirkwood
2004-02-17 16:54                             ` Stefan Smietanowski
2004-02-18  1:27                               ` Hans Reiser
2004-02-18  2:08                                 ` Robin Rosenberg
2004-02-18 11:06                                   ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox