public inbox for linux-kernel@vger.kernel.org
* JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-11  6:39 ` Tim Connors
@ 2004-02-11 16:35   ` Dave Kleikamp
  2004-02-12  0:45     ` Andy Isaacson
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Kleikamp @ 2004-02-11 16:35 UTC (permalink / raw)
  To: Tim Connors; +Cc: linux-kernel, JFS Discussion

On Wed, 2004-02-11 at 00:39, Tim Connors wrote:
> I submitted a bug to the jfs people, because jfs incorrectly returns
> -EINVAL (this isn't even documented in man pages as a valid return
> from open()) from an open() on a filename with UTF-8 in it.
> 
> See http://www-124.ibm.com/developerworks/bugs/?func=detailbug&bug_id=3838&group_id=35
> and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=229308
> 
> This was triggered just by upgrading the console-utils package in
> debian (the problem existed all along, except that when I first made
> the filesystem a jfs one, I reinstalled from backups, rather than
> reinstalling debian from scratch)

Yeah, JFS has poor default behavior based on CONFIG_NLS_DEFAULT.  I
attempted to explain why it works that way in the first bug listed above
if anyone is curious.

I think the right thing for JFS to do is to change the default behavior
to simply store the bytes as they are seen, and to only do charset
conversion when the iocharset mount option is explicitly set.  This may
impact some current users, but they will be able to get the old behavior
by setting iocharset to whatever CONFIG_NLS_DEFAULT is set to in the
running kernel.

I intend to make this change soon if there are no objections.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-11 16:35   ` JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Dave Kleikamp
@ 2004-02-12  0:45     ` Andy Isaacson
  2004-02-12  1:19       ` Tim Connors
                         ` (4 more replies)
  0 siblings, 5 replies; 68+ messages in thread
From: Andy Isaacson @ 2004-02-12  0:45 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: linux-kernel

On Wed, Feb 11, 2004 at 10:35:10AM -0600, Dave Kleikamp wrote:
> Yeah, JFS has poor default behavior based on CONFIG_NLS_DEFAULT.  I
> attempted to explain why it works that way in the first bug listed above
> if anyone is curious.

I think your suggested fix is good, but it begs the question:

Why on earth is JFS worried about the filename, anyways?  Why has it
*ever* had *any* behavior other than "string of bytes, delimited with /,
terminated with \0" ?

I read your response about OS/2, and maybe I'm just slow, but I don't
see what that has to do with anything.

Does JFS on AIX have the same buggy behavior?

What behavior was the code originally designed to implement, on OS/2?
Why was that behavior chosen rather than "filenames are a string of
bytes"?

Feel free to point to a "Design of the OS/2 JFS interface" document if
such exists and answers my question. :)

-andy


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  0:45     ` Andy Isaacson
@ 2004-02-12  1:19       ` Tim Connors
  2004-02-12  3:54       ` jw schultz
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 68+ messages in thread
From: Tim Connors @ 2004-02-12  1:19 UTC (permalink / raw)
  To: linux-kernel

Andy Isaacson <adi@hexapodia.org> said on Wed, 11 Feb 2004 18:45:32 -0600:
> On Wed, Feb 11, 2004 at 10:35:10AM -0600, Dave Kleikamp wrote:
> > Yeah, JFS has poor default behavior based on CONFIG_NLS_DEFAULT.  I
> > attempted to explain why it works that way in the first bug listed above
> > if anyone is curious.
> 
> I think your suggested fix is good, but it begs the question:
> 
> Why on earth is JFS worried about the filename, anyways?  Why has it
> *ever* had *any* behavior other than "string of bytes, delimited with /,
> terminated with \0" ?

Thanks for wording my question better. That was *precisely* the
question I was trying to ask :)

-- 
TimC -- http://astronomy.swin.edu.au/staff/tconnors/
Disclaimer: This post owned by the owner


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  0:45     ` Andy Isaacson
  2004-02-12  1:19       ` Tim Connors
@ 2004-02-12  3:54       ` jw schultz
  2004-02-12 12:03         ` Robin Rosenberg
  2004-02-12  8:54       ` Jamie Lokier
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 68+ messages in thread
From: jw schultz @ 2004-02-12  3:54 UTC (permalink / raw)
  To: linux-kernel

On Wed, Feb 11, 2004 at 06:45:32PM -0600, Andy Isaacson wrote:
> On Wed, Feb 11, 2004 at 10:35:10AM -0600, Dave Kleikamp wrote:
> > Yeah, JFS has poor default behavior based on CONFIG_NLS_DEFAULT.  I
> > attempted to explain why it works that way in the first bug listed above
> > if anyone is curious.
> 
> I think your suggested fix is good, but it begs the question:
> 
> Why on earth is JFS worried about the filename, anyways?  Why has it
> *ever* had *any* behavior other than "string of bytes, delimited with /,
> terminated with \0" ?
> 
> I read your response about OS/2, and maybe I'm just slow, but I don't
> see what that has to do with anything.
> 
> Does JFS on AIX have the same buggy behavior?
> 
> What behavior was the code originally designed to implement, on OS/2?
> Why was that behavior chosen rather than "filenames are a string of
> bytes"?
> 
> Feel free to point to a "Design of the OS/2 JFS interface" document if
> such exists and answers my question. :)

His first link almost explains it.

| In OS/2, the kernel had access to each process's locale
| information, and converting the pathnames from the user's
| charset to unicode made access to the filesystem very
| transparent, even when users used different character sets
| on the same computer. 
| 
| Unfortunately, in Linux the kernel has no per-process
| information to go on, so it uses the charset specified by
| CONFIG_NLS_DEFAULT when the kernel is built. Obviously,
| this is neither intuitive nor generally useful. 
| 
| I am considering changing the default behavior to
| trivially convert the user-supplied pathnames to utf-16
| when stored on disk. This default behavior could be
| overridden by specifying the iocharset= mount flag.
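
One way to read "trivially convert" is that each byte of the name is
widened to one 16-bit unit, with no charset table involved. The sketch
below assumes that reading; the function names are hypothetical and this
is not taken from the JFS source.

```python
def bytes_to_utf16_trivial(name: bytes) -> bytes:
    """Widen each byte of a name to one little-endian 16-bit unit."""
    return b"".join(bytes([b, 0]) for b in name)


def utf16_to_bytes_trivial(stored: bytes) -> bytes:
    """Narrow each 16-bit unit back to one byte; reject units above 0xFF."""
    units = [int.from_bytes(stored[i:i + 2], "little")
             for i in range(0, len(stored), 2)]
    if any(u > 0xFF for u in units):
        raise ValueError("code unit does not fit in one byte")
    return bytes(units)
```

Under this assumption any byte string round-trips unchanged, whatever
encoding userspace used for it, which is the "store the bytes as they
are seen" behavior.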

Apparently in OS/2 they implemented a policy of utf-16 in
the kernel so that applications would not have to be as
locale aware.  This could be called kernel pollution.

For Linux there is no policy except perhaps in userspace.
It is up to userspace to determine what the policy will be
regarding charset for filename storage.  Common practice
seems to be utf-8.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  0:45     ` Andy Isaacson
  2004-02-12  1:19       ` Tim Connors
  2004-02-12  3:54       ` jw schultz
@ 2004-02-12  8:54       ` Jamie Lokier
  2004-02-12 15:55         ` Robin Rosenberg
  2004-02-12 13:28       ` Dave Kleikamp
  2004-02-12 15:26       ` Valdis.Kletnieks
  4 siblings, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-12  8:54 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Dave Kleikamp, linux-kernel

Andy Isaacson wrote:
> Why on earth is JFS worried about the filename, anyways?  Why has it
> *ever* had *any* behavior other than "string of bytes, delimited with /,
> terminated with \0" ?

Perhaps for the same reason that these other in-tree filesystems are
sensitive to the character encoding:

   Joliet (ISO-9660 extension), FAT/VFAT, NTFS, BeFS, SMBFS, CIFS.

Those filesystems will also fail, or give unexpected behaviour (such
as bytes being changed to '?'), if you pass them names which are not
in the appropriate encoding.

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  3:54       ` jw schultz
@ 2004-02-12 12:03         ` Robin Rosenberg
  0 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 12:03 UTC (permalink / raw)
  To: jw schultz, linux-kernel

On Thursday 12 February 2004 04.54, jw schultz wrote:
> For Linux there is no policy except perhaps in userspace.
> It is up to userspace to determine what the policy will be
> regarding charset for filename storage.  Common practice
> seems to be utf-8.

Isn't it the user's locale, whatever that is? I believe my file names
use ISO-8859-1 (except ntfs, vfat). In northern/western europe 
ISO-8859-1 is common. (Sometimes ISO-8859-15 which for all
practical purposes is backwards compatible with 8859-1). UTF-8 is
gaining terrain though since it is now the default in some distributions
even for Nordic languages (causing big problems for those not
expecting it).

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  0:45     ` Andy Isaacson
                         ` (2 preceding siblings ...)
  2004-02-12  8:54       ` Jamie Lokier
@ 2004-02-12 13:28       ` Dave Kleikamp
  2004-02-12 15:26       ` Valdis.Kletnieks
  4 siblings, 0 replies; 68+ messages in thread
From: Dave Kleikamp @ 2004-02-12 13:28 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: linux-kernel

On Wed, 2004-02-11 at 18:45, Andy Isaacson wrote:

> Why on earth is JFS worried about the filename, anyways?  Why has it
> *ever* had *any* behavior other than "string of bytes, delimited with /,
> terminated with \0" ?

The problem that was addressed in OS/2 was that one user using locale A
would create some files using non-ascii characters.  Then a user using
locale B would access these files, but the characters in those names did
not make sense in his locale.  Storing the file names in unicode allowed
the characters to always translate to the correct characters in the
user's locale, when the charset allowed it.  I'm not familiar enough
with the European locales to give specific examples.  It was never an
issue in the U.S. :^)
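
As a rough illustration of the scheme described above: the name is
converted to Unicode using the creator's charset and converted back
using the reader's charset. The charsets (cp850, iso-8859-1), the file
name and the helper names are assumptions chosen for the example, not
details from OS/2.

```python
def store(name_bytes: bytes, creator_charset: str) -> str:
    """Convert the bytes the creating process supplied into Unicode."""
    return name_bytes.decode(creator_charset)


def present(stored: str, reader_charset: str) -> bytes:
    """Convert the stored Unicode name into the reading process's charset.

    Raises UnicodeEncodeError when the reader's charset lacks a character,
    which is the "when the charset allowed it" caveat.
    """
    return stored.encode(reader_charset)


typed_in_cp850 = b"r\x82sum\x82.txt"            # e-acute is 0x82 in cp850
stored = store(typed_in_cp850, "cp850")
seen_in_latin1 = present(stored, "iso-8859-1")  # e-acute is 0xE9 here
```

Both steps need to know the charset of the calling process, which is
the per-process information the Linux kernel does not have.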

The OS/2 kernel has locale information for each process, so this
actually works very well there.  I will admit that it was a mistake not
to change the default behavior when we ported this to Linux.

> I read your response about OS/2, and maybe I'm just slow, but I don't
> see what that has to do with anything.
> 
> Does JFS on AIX have the same buggy behavior?

I know that JFS1 did not.  I'm not sure about JFS2, since it was ported
from the same OS/2 code base.

> What behavior was the code originally designed to implement, on OS/2?
> Why was that behavior chosen rather than "filenames are a string of
> bytes"?

I hope I explained that well enough above.

> Feel free to point to a "Design of the OS/2 JFS interface" document if
> such exists and answers my question. :)
> 
> -andy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  0:45     ` Andy Isaacson
                         ` (3 preceding siblings ...)
  2004-02-12 13:28       ` Dave Kleikamp
@ 2004-02-12 15:26       ` Valdis.Kletnieks
  2004-02-12 15:41         ` Dave Kleikamp
  4 siblings, 1 reply; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-12 15:26 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Dave Kleikamp, linux-kernel


On Wed, 11 Feb 2004 18:45:32 CST, Andy Isaacson said:

> Does JFS on AIX have the same buggy behavior?

Nope, it's been tolerant of all 254 bit patterns except \0 and '/'
since at least AIX 3.1.2 back in the early 90s.  It doesn't even have
a concept of "UTF-8 filename" - it considers that a userspace issue.

Now, over the last 15 years I've tripped over a number of *userspace*
things that did really stupid things when handed non-ASCII filenames,
but that's a different issue...




* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 15:26       ` Valdis.Kletnieks
@ 2004-02-12 15:41         ` Dave Kleikamp
  0 siblings, 0 replies; 68+ messages in thread
From: Dave Kleikamp @ 2004-02-12 15:41 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Andy Isaacson, linux-kernel

On Thu, 2004-02-12 at 09:26, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 11 Feb 2004 18:45:32 CST, Andy Isaacson said:
> Now, over the last 15 years I've tripped over a number of *userspace*
> things that did really stupid things when handed non-ASCII filenames,
> but that's a different issue...

That's the problem that OS/2 addressed.  In OS/2 each application would
see the correct charset for its locale, no matter what the locale of the
application that created the file was.  In Linux, the file system simply
doesn't have the information needed to do this, so it was a mistake to
try to imitate it.
-- 
David Kleikamp
IBM Linux Technology Center



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12  8:54       ` Jamie Lokier
@ 2004-02-12 15:55         ` Robin Rosenberg
  2004-02-12 16:17           ` John Bradford
  2004-02-13  0:38           ` Jamie Lokier
  0 siblings, 2 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 15:55 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

On Thursday 12 February 2004 09.54, you wrote:
> Andy Isaacson wrote:
> > Why on earth is JFS worried about the filename, anyways?  Why has it
> > *ever* had *any* behavior other than "string of bytes, delimited with /,
> > terminated with \0" ?
> 
> Perhaps for the same reason that these other in-tree filesystems are
> sensitive to the character encoding:
> 
>    Joliet (ISO-9660 extension), FAT/VFAT, NTFS, BeFS, SMBFS, CIFS.
> 
> Those filesystems will also fail, or give unexpected behaviour (such
> as bytes being changed to '?'), if you pass them names which are not
> in the appropriate encoding.

Definitely a good reason.  It seems many assume file names are a local thing,
but this is not so. Now consider the case of an external firewire
disk or memory stick created on a machine with iso-8859-1 as the system character
set and e.g. xfs as the file system. What happens when I hook it up to a new redhat
installation that thinks file names are best stored as utf8? Most non-ascii
file names aren't even legal in utf8.
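
A small check of that last claim, using Python's strict UTF-8 decoder.
The file names are made up for the example.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1_name = "sm\u00f6rg\u00e5sbord.txt".encode("iso-8859-1")
utf8_name = "sm\u00f6rg\u00e5sbord.txt".encode("utf-8")
# A Latin-1 name that happens to be well-formed UTF-8 (bytes C3 A9):
accidental = "\u00c3\u00a9".encode("iso-8859-1")
```

Here is_valid_utf8(latin1_name) is False and is_valid_utf8(utf8_name) is
True. The third name shows why it is "most" rather than "all": some
Latin-1 byte sequences are valid UTF-8 by accident and would silently
display as different characters.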

-- robin



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 15:55         ` Robin Rosenberg
@ 2004-02-12 16:17           ` John Bradford
  2004-02-12 16:40             ` Robin Rosenberg
  2004-02-13  0:17             ` Jamie Lokier
  2004-02-13  0:38           ` Jamie Lokier
  1 sibling, 2 replies; 68+ messages in thread
From: John Bradford @ 2004-02-12 16:17 UTC (permalink / raw)
  To: Robin Rosenberg, Jamie Lokier; +Cc: Linux kernel

> Definitely a good reason.  It seems many assume file names are a local thing,
> but this is not so. Now consider the case of an external firewire
> disk or memory stick created on a machine with iso-8859-1 as the system character
> set and e.g. xfs as the file system. What happens when I hook it up to a new redhat
> installation that thinks file names are best stored as utf8? Most non-ascii
> file names aren't even legal in utf8.

Another thing to consider is that you can encode the same character in
several ways using utf8, so two filenames could have different byte
strings, but evaluate to the same set of unicode characters.

John.


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 16:17           ` John Bradford
@ 2004-02-12 16:40             ` Robin Rosenberg
  2004-02-12 17:16               ` John Bradford
  2004-02-13  0:17             ` Jamie Lokier
  1 sibling, 1 reply; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 16:40 UTC (permalink / raw)
  To: John Bradford; +Cc: Linux kernel

On Thursday 12 February 2004 17.17, you wrote:
> Another thing to consider is that you can encode the same character in
> several ways using utf8, so two filenames could have different byte
> strings, but evaluate to the same set of unicode characters.

No. That's not UTF-8.

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
@ 2004-02-12 16:50 Nicolas Mailhot
  2004-02-12 18:12 ` Robin Rosenberg
  2004-02-13  3:03 ` Jamie Lokier
  0 siblings, 2 replies; 68+ messages in thread
From: Nicolas Mailhot @ 2004-02-12 16:50 UTC (permalink / raw)
  To: linux-kernel


Not specifying the file name encoding (either per fs type, per partition
or per filename) is plain dangerous. It is not a userspace problem -
flash/hotplug disks move, users on the same system can have different
locales and try to share files, a user can change his locale to another
one (hear the screams of RH users forcibly converted to utf8 who had
to fix years of storage whose filenames were suddenly borked).

See also the sun zip encoding bug - everyone uses zip files in Java, zip
authors thought a filename is "just a bunch of bytes" and didn't put
filename encoding info in the zip format, and now java zip handling goes
boom since numerous encodings are unicode-incompatible. It's slowly
getting its way to the top-25 most reported java bugs.

(of course as usual US users/coders  are not hit and do not feel
concerned)

The only reason we got by with it so far is linux localisation was poor,
and systems didn't scale high enough to permit high number of users per
system (reducing locale collision risks)

The only reason we might get by in the future is everyone will be using
utf8.

But that's not a reason not to fix the core problem - I don't want to
spend hours fixing filenames next time someone comes up with a new
encoding. Please put valid encoding info somewhere or declare filenames
are utf-8 or utf-16 only - changing user locale should not corrupt old
data.

Cheers,

-- 
Nicolas Mailhot



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 16:40             ` Robin Rosenberg
@ 2004-02-12 17:16               ` John Bradford
  2004-02-12 18:06                 ` Robin Rosenberg
  0 siblings, 1 reply; 68+ messages in thread
From: John Bradford @ 2004-02-12 17:16 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

Quote from Robin Rosenberg <robin.rosenberg.lists@dewire.com>:
> On Thursday 12 February 2004 17.17, you wrote:
> > Another thing to consider is that you can encode the same character in
> > several ways using utf8, so two filenames could have different byte
> > strings, but evaluate to the same set of unicode characters.
> 
> No. That's not UTF-8.

Please don't break the CC list on replies.

I'm not sure whether it's valid UTF-8 or not, but it's certainly
possible to code, for example, an 'A', (decimal 65), via an escape to
a 31-bit character representation.  Presumably the majority of UTF-8
parsers would decode the sequence as 65, rather than emit an error.

Also, even ignoring that, how do you handle things like accented
characters which can be represented as single characters, or as
sequences containing combining characters?  Some applications might
convert the sequence containing combining characters in to the single
character, and others might not.
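
To make the combining-character case concrete, here is a short sketch
using Python's unicodedata module. The two spellings of "a with grave"
are different byte strings, so a byte-oriented filesystem treats them as
two distinct names, while Unicode normalization maps one onto the other.

```python
import unicodedata

precomposed = "\u00e0"    # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"    # 'a' followed by COMBINING GRAVE ACCENT

precomposed_bytes = precomposed.encode("utf-8")   # b"\xc3\xa0"
decomposed_bytes = decomposed.encode("utf-8")     # b"a\xcc\x80"

# An application that normalizes names folds the two spellings together.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```

Whether a given application normalizes before comparing or creating
names is exactly the userspace behavior that varies.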

John.


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 17:16               ` John Bradford
@ 2004-02-12 18:06                 ` Robin Rosenberg
  2004-02-12 19:08                   ` John Bradford
  0 siblings, 1 reply; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 18:06 UTC (permalink / raw)
  To: John Bradford; +Cc: Linux kernel

On Thursday 12 February 2004 18.16, John Bradford wrote:
> I'm not sure whether it's valid UTF-8 or not, but it's certainly
> possible to code, for example, an 'A', (decimal 65), via an escape to
> a 31-bit character representation.  Presumably the majority of UTF-8
> parsers would decode the sequence as 65, rather than emit an error.

There are many ways of getting things wrong. The algorithm for encoding 
UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4 
character with code 0-0x7F must result in a one-byte sequence, and the same principle goes 
for every other character: the Unicode standard forbids the use of anything
but the shortest possible sequence.
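
A minimal sketch of a shortest-form encoder, to show that the length of
the sequence is fixed by the value of the code point. It assumes the
modern range (up to U+10FFFF, no surrogates) rather than the older
31-bit definition, and the function name is made up for the example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the shortest UTF-8 sequence."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

There is no branch in which 65 could come out as two bytes; the range
tests pick exactly one length per code point.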

> Also, even ignoring that, how do you handle things like accented
> characters which can be represented as single characters, or as
> sequences containing combining characters?  Some applications might
> convert the sequence containing combining characters in to the single
> character, and others might not.

In UTF-8 you cannot represent à as `a. I can have both in a file name and they
are different. An application that assumes `a is the same as à (in UTF-8) is broken
and should be fixed. 

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 16:50 JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Nicolas Mailhot
@ 2004-02-12 18:12 ` Robin Rosenberg
  2004-02-13  3:03 ` Jamie Lokier
  1 sibling, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 18:12 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: linux-kernel

On Thursday 12 February 2004 17.50, you wrote:
> But that's not a reason not to fix the core problem - I don't want to
> spend hours fixing filenames next time someone comes up with a new
> encoding. Please put valid encoding info somewhere or declare filenames
> are utf-8 or utf-16 only - changing user locale should not corrupt old
> data.

Yes! 

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 18:06                 ` Robin Rosenberg
@ 2004-02-12 19:08                   ` John Bradford
  2004-02-12 19:39                     ` Robin Rosenberg
  2004-02-14 15:24                     ` Eduard Bloch
  0 siblings, 2 replies; 68+ messages in thread
From: John Bradford @ 2004-02-12 19:08 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel


> > I'm not sure whether it's valid UTF-8 or not, but it's certainly
> > possible to code, for example, an 'A', (decimal 65), via an escape to
> > a 31-bit character representation.  Presumably the majority of UTF-8
> > parsers would decode the sequence as 65, rather than emit an error.
> 
> There are many ways of getting things wrong. The algorithm for encoding 
> UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4 
> character with code 0-0x7F must result in a one-byte sequence, and the same principle goes 
> for every other character: the Unicode standard forbids the use of anything
> but the shortest possible sequence.

The recommended encoding algorithm forbids anything but the shortest
sequence, yes, but what will the majority of decoders do?  I suspect
that at least some will follow the usual networking rule of be liberal
in what you accept, which for filenames may well cause all sorts of
security holes.

> > Also, even ignoring that, how do you handle things like accented
> > characters which can be represented as single characters, or as
> > sequences containing combining characters?  Some applications might
> > convert the sequence containing combining characters in to the single
> > character, and others might not.
> 
> In UTF-8 you cannot represent à as `a. I can have both in a file name and they
> are different. An application that assumes `a is the same as à (in UTF-8) is broken
> and should be fixed. 

Well, as long as every userspace implementation gets it correct, we'll
be OK.  Personally, I doubt they all will, especially those that
convert from legacy encodings to Unicode, although quite possibly the
above scenario with combining characters is not likely to happen for
filenames.  Or is it?  What about copying a file from a filesystem
with a UTF-8 encoding to a filesystem with a legacy encoding, and then
back again?

However, I am less concerned about this second scenario than the first.

John.


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 19:08                   ` John Bradford
@ 2004-02-12 19:39                     ` Robin Rosenberg
  2004-02-12 21:13                       ` John Bradford
  2004-02-14 15:24                     ` Eduard Bloch
  1 sibling, 1 reply; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 19:39 UTC (permalink / raw)
  To: John Bradford; +Cc: Linux kernel

On Thursday 12 February 2004 20.08, you wrote:
> > There are many ways of getting things wrong. The algorithm for encoding 
> > UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4 
> > character with code 0-0x7F must result in a one-byte sequence, and the same principle goes 
> > for every other character: the Unicode standard forbids the use of anything
> > but the shortest possible sequence.
> 
> The recommended encoding algorithm forbids anything but the shortest
That algorithm is the /definition/ of UTF-8, not just an example. Sure you can actually 
do it another way, but the result is uniquely defined (or else it's not UTF-8).

> Well, as long as every userspace implementation gets it correct, we'll
> be OK.  Personally, I doubt they all will, especially those that
> convert from legacy encodings to Unicode, although quite possibly the
> above scenario with combining characters is not likely to happen for
> filenames.  Or is it?  What about copying a file from a filesystem
> with a UTF-8 encoding to a filesystem with a legacy encoding, and then
> back again?

Sounds like you think we want to invent a new problem. The problem is
here and it's real (not in the U.S., but in the rest of the world). There are 
network file systems (samba in particular), partitions belonging to other 
OSes (ntfs, fat or even another Linux installation on the same machine), 
removable devices etc. etc.

Microsoft introduced a kludge for managing long file names in a short filename 
context. Since Linux doesn't have the length limit, a nicer kludge could be used
to represent unicode as non-unicode in userspace, like a Uxxxxx escape. When there is
a mismatch there has to be a kludge, but it's still many times better than a bunch
of characters that look like garbage (and cause legacy applications to choke).
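
A rough sketch of what such an escape could look like. The "\Uxxxxx"
format, the function name and the choice of iso-8859-1 as the legacy
charset are all assumptions made for illustration; no existing kernel or
glibc interface works this way.

```python
def to_legacy(name: str, charset: str) -> bytes:
    """Encode a Unicode name for a legacy charset, escaping what won't fit.

    Characters the charset cannot represent become \\Uxxxxx (five hex
    digits). A real scheme would also need to escape the backslash
    itself so that the mapping stays reversible.
    """
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out += f"\\U{ord(ch):05X}".encode("ascii")
    return bytes(out)


legacy = to_legacy("prix-\u20ac-\u00e0.txt", "iso-8859-1")
```

The euro sign has no iso-8859-1 code and is escaped, while the à maps
directly to a single byte, so only the mismatched characters look odd.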

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 19:39                     ` Robin Rosenberg
@ 2004-02-12 21:13                       ` John Bradford
  2004-02-12 22:29                         ` Robin Rosenberg
  2004-02-13  3:15                         ` Jamie Lokier
  0 siblings, 2 replies; 68+ messages in thread
From: John Bradford @ 2004-02-12 21:13 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

Quote from Robin Rosenberg <robin.rosenberg.lists@dewire.com>:
> On Thursday 12 February 2004 20.08, you wrote:
> > > There are many ways of getting things wrong. The algorithm for encoding 
> > > UTF-8 doesn't give you the option of encoding 65 as two bytes; any UCS-4 
> > > character with code 0-0x7F must result in a one-byte sequence, and the same principle goes 
> > > for every other character: the Unicode standard forbids the use of anything
> > > but the shortest possible sequence.
> > 
> > The recommended encoding algorithm forbids anything but the shortest
> That algorithm is the /definition/ of UTF-8, not just an example. Sure you can actually 
> do it another way, but the result is uniquely defined (or else it's not UTF-8).

I know what you're saying, there is only one way to encode the data
correctly.  I totally agree with that.

However, we both know that UTF-8 provides escapes from the 7-bit
encoding, and although it goes against the standard to encode 7-bit
characters using such sequences, in the real world don't you think
that there will be a lot of decoders which decode the multi-byte
sequence back, rather than report an error?  This is not something
that will be happening in the kernel - it will be up to userspace to
do it, so there may well be many different implementations.

Imagine you have two files, with the following filename bytes:

11000001 10000001 00000000

01000001 00000000

..and a _real world_ application, which is not necessarily completely
UTF-8 conformant, tries to open the file with filename 'A'.  Which one
is it going to open?
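
The two byte strings above can be fed to a strict decoder and to a
deliberately lax one to see the difference. The lax decoder below is a
made-up example of the kind of non-conformant code being worried about;
Python's built-in decoder stands in for a conformant one.

```python
def lax_decode_2byte(seq: bytes) -> int:
    """Decode a 2-byte sequence WITHOUT the shortest-form check (unsafe)."""
    b0, b1 = seq
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 pattern")
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def strict_accepts(seq: bytes) -> bool:
    """Return True if a conformant decoder accepts the sequence."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


overlong_a = bytes([0b11000001, 0b10000001])   # first name, without the NUL
plain_a = bytes([0b01000001])                  # second name, without the NUL
```

The lax decoder maps overlong_a to 65, the same value as plain_a, so an
application built on it could treat two distinct on-disk names as the
same file. The strict decoder rejects overlong_a outright.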

> > Well, as long as every userspace implementation gets it correct, we'll
> > be OK.  Personally, I doubt they all will, especially those that
> > convert from legacy encodings to Unicode, although quite possibly the
> > above scenario with combining characters is not likely to happen for
> > filenames.  Or is it?  What about copying a file from a filesystem
> > with a UTF-8 encoding to a filesystem with a legacy encoding, and then
> > back again?
> 
> Sounds like you think we want to invent a new problem.

I am aware that similar problems already exist.  However, most legacy
encodings don't suffer from the first issue we discussed above, where
multiple byte sequences could be decoded to the same character codes.
I don't think that the issue with combining characters is likely to be
an issue, I only mentioned it as an example.  As you pointed out a
single accented character, and a two character combination are
distinct, and converting the combination to the corresponding single
character in a filename would definitely be wrong, in my opinion.
However, that doesn't mean that software won't do it.

John.


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 21:13                       ` John Bradford
@ 2004-02-12 22:29                         ` Robin Rosenberg
  2004-02-12 22:50                           ` Valdis.Kletnieks
  2004-02-13  2:58                           ` Jamie Lokier
  2004-02-13  3:15                         ` Jamie Lokier
  1 sibling, 2 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-12 22:29 UTC (permalink / raw)
  To: John Bradford; +Cc: Linux kernel

On Thursday 12 February 2004 22.13, you wrote:
> I know what you're saying, there is only one way to encode the data
> correctly.  I totally agree with that.
> 
> However, we both know that UTF-8 provides escapes from the 7-bit
> encoding, and although it goes against the standard to encode 7-bit
> characters using such sequences, in the real world don't you think
> that there will be a lot of decoders which decode the multi-byte
> sequence back, rather than report an error?  This is not something
> that will be happening in the kernel - it will be up to userspace to
> do it, so there may well be many different implementations.

Oh, I wasn't thinking of fixing *every* application out there, but making
the kernel api's convert between the user locale and the file system locale,
thus restricting the problems to places that can be fixed.

An alternative would be glibc since it's used by most apps, but then there
could be funny and inefficient interactions with filesystems that already
do the job. The "future" common case would be utf-utf conversion for all
native file systems, i.e. no work.

[... ]

> I don't think that the issue with combining characters is likely to be
> an issue, I only mentioned it as an example.  As you pointed out a
> single accented character, and a two character combination are
> distinct, and converting the combination to the corresponding single
> character in a filename would definitely be wrong, in my opinion.
> However, that doesn't mean that software won't do it.

Some applications break if I use any non-ascii characters, but they are few
enough that I can afford the loss. Most shell scripts break if I even have
a space in a filename.  This shouldn't be any worse than that. The space
issue is really serious (but I don't think that can be fixed other than teaching
people to program properly, and possibly improving bash's knowledge of the 
difference between a space and argument separator).

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 22:29                         ` Robin Rosenberg
@ 2004-02-12 22:50                           ` Valdis.Kletnieks
  2004-02-13  2:58                           ` Jamie Lokier
  1 sibling, 0 replies; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-12 22:50 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

[-- Attachment #1: Type: text/plain, Size: 877 bytes --]

On Thu, 12 Feb 2004 23:29:11 +0100, Robin Rosenberg said:

> a space in a filename.  This shouldn't be any worse than that. The space
> issue is really serious (but I don't think that can be fixed other than teaching
> people to program properly, and possibly improving bash's knowledge of the 
> difference between a space and argument separator).

Other than allocating a key and bytecode for non-breaking-white-space as a
separator (Hmm.. allocate 'left-windows' purely for ironic value? ;), how do
you propose to actually improve its knowledge of the distinction?  The basic
problem is that we're overloading x'20' as both space and separator, and then
end up disambiguating based on context and syntax.  And quite frankly, I don't
see much hope for improving things as long as x'20' is overloaded.

Could go the VMS command/this/that/the/other/thing route, I guess?  :)


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 16:17           ` John Bradford
  2004-02-12 16:40             ` Robin Rosenberg
@ 2004-02-13  0:17             ` Jamie Lokier
  1 sibling, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  0:17 UTC (permalink / raw)
  To: John Bradford; +Cc: Robin Rosenberg, Linux kernel

John Bradford wrote:
> > Definitely a good reason.  It seem many assume file names are a local thing,
> > but this is not so. Now consider the case with an external firewire
> > disk or memory stick created on a machine with iso-8859-1 as the system character
> > set and e.g xfs as the file system. What happens when I hook it up to a new redhat
> > installation that thinks file names are best stored as utf8? Most non-ascii
> > file names aren't even legal in utf8.
> 
> Another thing to consider is that you can encode the same character in
> several ways using utf8,

No, you can't.  Only the shortest encoding of a character is valid
UTF-8, and any program which claims to comply with Unicode is
_required_ to reject all other encodings, citing security as the main
reason.

That means any code which transcodes UTF-8 to another encoding (such
as iso-8859-1) must reject the non-minimal forms as invalid
characters, in whatever way that is done.

If there's any transcoding code in Linux which doesn't do that, it's a
potential security hole and should be fixed.
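
To illustrate the point, here is a minimal Python sketch. It assumes
CPython's built-in "utf-8" codec as a stand-in for a strict decoder (it is
not any kernel transcoding code), and lenient_decode_2byte() is a
hypothetical sloppy decoder written only for contrast:

```python
def lenient_decode_2byte(raw: bytes) -> str:
    """Hypothetical sloppy decoder: accepts any 110xxxxx 10xxxxxx pair
    without checking that the result needed two bytes."""
    lead, cont = raw
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(raw: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_slash = b"\xc0\xaf"  # non-minimal two-byte form of "/"

assert lenient_decode_2byte(overlong_slash) == "/"  # the security problem
assert strict_decode(overlong_slash) is None        # rejected, as required
assert strict_decode(b"/") == "/"                   # the only valid form
```

A path check that looks for the byte 0x2F before decoding would miss the
overlong form if the lenient decoder ran afterwards, which is why the strict
behaviour matters.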

> so two filenames could have different byte strings, but evaluate to
> the same set of unicode characters.

That's true in some other encodings I think (the iso-2022 ones), but
not UTF-8.

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 15:55         ` Robin Rosenberg
  2004-02-12 16:17           ` John Bradford
@ 2004-02-13  0:38           ` Jamie Lokier
  2004-02-13  1:16             ` Robin Rosenberg
  1 sibling, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  0:38 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

Robin Rosenberg wrote:
> Now consider the case with an external firewire
> disk or memory stick created on a machine with iso-8859-1 as the system character
> set and e.g xfs as the file system. What happens when I hook it up to a new redhat
> installation that thinks file names are best stored as utf8? Most non-ascii
> file names aren't even legal in utf8.

It goes wrong.  This happens both with filesystems that know nothing
about encodings, e.g. ext3, and filesystems that need to be told what
to transcode to/from utf-8, e.g. ntfs.
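
A small Python sketch of why it goes wrong, assuming a made-up filename
("garçon.txt") and using Python's codecs in place of whatever the two
machines actually run:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "gar\u00e7on.txt"                   # hypothetical name with c-cedilla
latin1_bytes = name.encode("iso-8859-1")   # what the old machine stored
utf8_bytes = name.encode("utf-8")          # what the new machine would store

assert latin1_bytes == b"gar\xe7on.txt"
assert utf8_bytes == b"gar\xc3\xa7on.txt"
assert not is_valid_utf8(latin1_bytes)     # 0xE7 followed by "o" is malformed
assert is_valid_utf8(utf8_bytes)
```

A filesystem that stores bytes untouched hands the first byte string to a
UTF-8 application unchanged, and the application then has to cope with a
name it cannot decode.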

It is also a problem that some applications access the filesystem
assuming utf-8 and some don't.  Nothing in the filesystem can make the
different applications cooperate regarding these.  E.g. I have
filenames that look fine in "ls" containing things like c-cedilla, but
xmms displays them wrongly.

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  0:38           ` Jamie Lokier
@ 2004-02-13  1:16             ` Robin Rosenberg
  2004-02-13  1:23               ` Jamie Lokier
  2004-02-13  2:29               ` viro
  0 siblings, 2 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13  1:16 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

On Friday 13 February 2004 01.38, Jamie Lokier wrote:
> Robin Rosenberg wrote:
>  Now consider the case with an external firewire
> > disk or memory stick created on a machine with iso-8859-1 as the system character
> > set and e.g xfs as the file system. What happens when I hook it up to a new redhat
> > installation that thinks file names are best stored as utf8? Most non-ascii
> > file names aren't even legal in utf8.
> 
> It goes wrong.  This happens both with filesystems that know nothing
> about encodings, e.g. ext3, and filesystems that need to be told what
> to transcode to/from utf-8, e.g. ntfs.

Yes, so ext3&co. should be equipped with charset options just like the others so
it can be fixed by the user or in some cases the mount tools. 

Is there a place to store character set information in these file systems?

> It is also a problem that some applications access the filesystem
> assuming utf-8 and some don't.  Nothing in the filesystem can make the
> different applications cooperate regarding these.  E.g. I have
> filenames that look fine in "ls" containing things like c-cedilla, but
> xmms displays them wrongly.

Some apps simply don't think non-ascii is relevant. Xmms is one, although
it doesn't crash at least. My guess was that it was a font problem since it
looks like XMMS uses some special fonts. Even new apps (like gedit) have 
character set problems. These apps have to be fixed since they don't work
properly anywhere outside the US. But that is a pure userspace problem, not 
a kernel one. 

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  1:16             ` Robin Rosenberg
@ 2004-02-13  1:23               ` Jamie Lokier
  2004-02-13  1:46                 ` Robin Rosenberg
  2004-02-13  2:29               ` viro
  1 sibling, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  1:23 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Linux kernel

Robin Rosenberg wrote:
> Is there a place to store character set information in these file systems?

Please don't confuse character set with character encoding.  The
problem we are talking about here is about character encoding.

Once upon a time the two were muddled; that's why MIME and HTTP use
"charset" to mean character encoding.

And the answer is: yes, you can store it wherever you want :)

> Some apps simply don't think non-ascii is relevant. Xmms is one, although
> it doesn't crash at least. My guess was that it was a font problem since it
> looks like XMMS uses some special fonts.

It's not a font problem.  XMMS simply displays each byte as a separate
character because that's what it assumes it should do.  No font will fix that.
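
The effect can be modelled in a few lines of Python. This is only a sketch
of the behaviour described above, assuming the name is stored as UTF-8 and
that the application maps each byte to one iso-8859-1 character; it is not
taken from the XMMS source:

```python
stored = "\u00e7".encode("utf-8")        # c-cedilla stored as bytes C3 A7

as_utf8 = stored.decode("utf-8")         # what "ls" on a UTF-8 terminal shows
per_byte = stored.decode("iso-8859-1")   # one character per byte

assert stored == b"\xc3\xa7"
assert as_utf8 == "\u00e7" and len(as_utf8) == 1
assert per_byte == "\u00c3\u00a7" and len(per_byte) == 2   # "Ã§"
```

Both resulting characters exist in any iso-8859-1 font, so the font renders
exactly what it was asked to render.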

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  1:23               ` Jamie Lokier
@ 2004-02-13  1:46                 ` Robin Rosenberg
  0 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13  1:46 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

On Friday 13 February 2004 02.23, Jamie Lokier wrote:
> Robin Rosenberg wrote:
> > Is there a place to store character set information in these file systems?
> 
> Please don't confuse character set with character encoding.  The
> problem we are talking about here is about character encoding.
> Once upon a time the two were muddled; that's why MIME and HTTP use
> "charset" to mean character encoding.
I shall try not to mix them in the future. The reason for the name in MIME is
probably because a (mime)charset does specify a character set (+encoding),
while the mime-encoding only specifies raw bytes.

> And the answer is: yes, you can store it wherever you want :)
I was thinking of the file system meta data so mount or the kernel or the fs could 
handle it.

> > Some apps simply don't think non-ascii is relevant. Xmms is one, although
> > it doesn't crash at least. My guess was that it was a font problem since it
> > looks like XMMS uses some special fonts.
> 
> It's not a font problem.  XMMS simply displays each byte as a separate
> character because that's what it assumes it should do.  No font will fix that.
I assumed a font problem because my machine is using ISO-8859-1 and 
XMMS doesn't display those non-ascii characters I use; of course it could be both.

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  1:16             ` Robin Rosenberg
  2004-02-13  1:23               ` Jamie Lokier
@ 2004-02-13  2:29               ` viro
  2004-02-13  3:23                 ` Jamie Lokier
  2004-02-13 10:03                 ` Robin Rosenberg
  1 sibling, 2 replies; 68+ messages in thread
From: viro @ 2004-02-13  2:29 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: Jamie Lokier, Linux kernel

On Fri, Feb 13, 2004 at 02:16:53AM +0100, Robin Rosenberg wrote:
> Yes, so ext3&co. should be equipped with charset options just like the others so
> it can be fixed by the user or in some cases the mount tools. 
> 
> Is there a place to store character set information in these file systems?

Bullshit.  Just as there is no timezone common for all users, there is no
charset common for all of them.  Charset of _machine_ doesn't make any sense
at all - toy operating systems notwithstanding.


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 22:29                         ` Robin Rosenberg
  2004-02-12 22:50                           ` Valdis.Kletnieks
@ 2004-02-13  2:58                           ` Jamie Lokier
  2004-02-13  9:48                             ` Robin Rosenberg
  1 sibling, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  2:58 UTC (permalink / raw)
  To: Robin Rosenberg; +Cc: John Bradford, Linux kernel

Robin Rosenberg wrote:
> Most shell scripts break if I even have a space in a filename.  This
> shouldn't be any worse than that. The space issue is really serious
> (but I don't think that can be fixed other than teaching people to
> program properly, and possibly improving bash's knowledge of the
> difference between a space and argument separator).

Space works fine for me.  Completion, wildcard expansion, variable
substitution etc. all fine.  Bash doesn't need changing - your scripts do.
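
The difference between a script that breaks and one that doesn't is quoting.
A rough model in Python, assuming the standard shlex module as an
approximation of POSIX shell word splitting (it is not bash itself) and a
made-up filename:

```python
import shlex

filename = "my file.txt"   # hypothetical name containing a space

# Unquoted expansion: the space is treated as an argument separator.
unquoted = shlex.split("cat " + filename)

# Quoted expansion: the space stays part of a single argument.
quoted = shlex.split("cat " + shlex.quote(filename))

assert unquoted == ["cat", "my", "file.txt"]
assert quoted == ["cat", "my file.txt"]
```

In a shell script the equivalent fix is writing "$f" instead of $f.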

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 16:50 JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Nicolas Mailhot
  2004-02-12 18:12 ` Robin Rosenberg
@ 2004-02-13  3:03 ` Jamie Lokier
  2004-02-13 10:07   ` Robin Rosenberg
  2004-02-13 18:06   ` Nicolas Mailhot
  1 sibling, 2 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  3:03 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: linux-kernel

Nicolas Mailhot wrote:
> But that's not a reason not to fix the core problem - I don't want to
> spend hours fixing filenames next time someone comes up with a new
> encoding. Please put valid encoding info somewhere or declare filenames
> are utf-8 or utf-16 only - changing user locale should not corrupt old
> data.

If you attach encoding to names for a whole filesystem, you will get
really unpleasant bugs including security holes because some names
won't be writable, so the fs will either return error codes when those
names are used, or silently alter the names.

-- Jamie



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 21:13                       ` John Bradford
  2004-02-12 22:29                         ` Robin Rosenberg
@ 2004-02-13  3:15                         ` Jamie Lokier
  1 sibling, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  3:15 UTC (permalink / raw)
  To: John Bradford; +Cc: Robin Rosenberg, Linux kernel

John Bradford wrote:
> in the real world don't you think that there will be a lot of
> decoders which decode the multi-byte sequence back, rather than
> report an error?

There will be decoders which convert ASCII "a" to "A" too.  We can't
fix broken code; at least we can make it clear to anyone writing a
decoder what is acceptable, and that being "liberal" in what's decoded
is not acceptable and considered a security flaw.

An app author only writes the UTF-8 decoder once; it isn't at all hard
to convert non-minimal forms to the replacement char U+FFFD.
(Although that could be a security hole in some cases, it's much
better than allowing non-zero characters to decode to NUL or "/" or
".").  Rejecting a non-minimal form is often hard, because the UTF-8
decoder is often used in a place which cannot flag errors.
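
As a sketch of the replacement approach, assuming Python's "utf-8" codec
with errors="replace" as the decoder (one possible implementation, chosen
here only for illustration):

```python
REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER


def decode_with_replacement(raw: bytes) -> str:
    """Decode UTF-8, turning malformed input into U+FFFD instead of failing."""
    return raw.decode("utf-8", errors="replace")


overlong_nul = decode_with_replacement(b"\xc0\x80")     # non-minimal NUL
overlong_slash = decode_with_replacement(b"\xc0\xaf")   # non-minimal "/"

# Neither malformed sequence is allowed to turn into the dangerous character.
assert "\x00" not in overlong_nul and set(overlong_nul) == {REPLACEMENT}
assert "/" not in overlong_slash and set(overlong_slash) == {REPLACEMENT}

# Well-formed input passes through unchanged.
assert decode_with_replacement(b"a/b") == "a/b"
```

The caller never sees an error, which suits code paths that have no way to
report one.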

> Imagine you have two files, with the following filename bytes:
> 
> 11000001 10000001 00000000
> 01000001 00000000
> 
> ..and a _real world_ application, which is not necessarily completely
> UTF-8 conformant, tries to open the file with filename 'A'.  Which one
> is it going to open?

The one which "ls" and other programs show as "A".
The other one will typically show as "?" or a diamond or something.

> I don't think that the issue with combining characters is likely to be
> an issue, I only mentioned it as an example.  As you pointed out a
> single accented character, and a two character combination are
> distinct, and converting the combination to the corresponding single
> character in a filename would definitely be wrong, in my opinion.
> However, that doesn't mean that software won't do it.

Indeed some software will do it, and worse than that: they may look
the same in an editor or file selector.  (See recent problems with
misleading URLs for why that sort of thing can be a security hole).
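
A short Python sketch of the combining-character case, assuming the standard
unicodedata module and "é" as the example character:

```python
import unicodedata

composed = "\u00e9"      # single accented character: e with acute
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Two different names as far as a byte-oriented filesystem is concerned...
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc3\xa9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"

# ...but software that normalizes will silently turn one into the other.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Both forms normally render as the same glyph, so a file selector can show
two entries that look identical.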

The combining char problem is similar to case folding: some
filesystems and programs treat "a" and "A" as equivalent too.  If the
kernel had an encoding converter, and the filesystem stored iso-8859-1
while userspace was presented with utf-8, it is likely that several
Unicode characters would be mapped to "a", causing similar problems to
automatic case folding in filesystems.

In other words, there is no clear solution to this problem.

-- Jamie


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  2:29               ` viro
@ 2004-02-13  3:23                 ` Jamie Lokier
  2004-02-14 15:09                   ` Eduard Bloch
  2004-02-13 10:03                 ` Robin Rosenberg
  1 sibling, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13  3:23 UTC (permalink / raw)
  To: viro; +Cc: Robin Rosenberg, Linux kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Fri, Feb 13, 2004 at 02:16:53AM +0100, Robin Rosenberg wrote:
> > Yes, so ext3&co. should be equipped with charset options just like the others so
> > it can be fixed by the user or in some cases the mount tools. 
> > 
> > Is there a place to store character set information in these file systems?
> 
> Bullshit.  Just as there is no timezone common for all users, there is no
> charset common for all of them.  Charset of _machine_ doesn't make any sense
> at all - toy operating systems notwithstanding.

Charset of a filename does make sense, though.  That's not per user,
it's per filename.

A name which one user entered as "£10.txt" should ideally display as
that sequence of characters to all users who want to display the name.

I already have this problem on my filesystems: some programs show the
names assuming UTF-8, other programs show them assuming
iso-8859-1.

But it's worse than that.  On my filesystem, names are stored in UTF-8
as is recommended these days.  "ls" on some terminals shows the names
as I wrote them.  But on other terminals it shows the wrong names.

If I create a file using a shell command, what I get depends on which
terminal I used to create it.  If I am using a terminal which displays
UTF-8 but ssh to another machine, the other machine assumes the
terminal is displaying iso-8859-1 even though the other machine's
default locale is UTF-8.  And so on.

-- Jamie



* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  2:58                           ` Jamie Lokier
@ 2004-02-13  9:48                             ` Robin Rosenberg
  0 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13  9:48 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: John Bradford, Linux kernel

On Friday 13 February 2004 03.58, Jamie Lokier wrote:
> Robin Rosenberg wrote:
> > Most shell scripts break if I even have a space in a filename.  This
> > shouldn't be any worse than that. The space issue is really serious
> > (but I don't think that can be fixed other than teaching people to
> > program properly, and possibly improving bash's knowledge of the
> > difference between a space and argument separator).
> 
> Space works fine for me.  Completion, wildcard expansion, variable
> substitution etc. all fine.  Bash doesn't need changing - your scripts do.

I'm thinking about many scripts in the wild. My own scripts (usually) handle spaces
well, but it's awkward sometimes, although quoting usually resolves the issue (never mind what
happens with filenames with quotes, newlines and other garbage, but even those work sometimes).
Fortunately these are rare, very rare, and usually the result of a programming mistake elsewhere :-)

On the command line there is no problem. 

With other script languages I use this is rarely an issue.

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  2:29               ` viro
  2004-02-13  3:23                 ` Jamie Lokier
@ 2004-02-13 10:03                 ` Robin Rosenberg
  2004-02-13 10:22                   ` vda
  1 sibling, 1 reply; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13 10:03 UTC (permalink / raw)
  To: viro; +Cc: Jamie Lokier, Linux kernel

On Friday 13 February 2004 03.29, you wrote:
> On Fri, Feb 13, 2004 at 02:16:53AM +0100, Robin Rosenberg wrote:
> > Yes, so ext3&co. should be equipped with charset options just like the others so
> > it can be fixed by the user or in some cases the mount tools. 
> > 
> > Is there a place to store character set information in these file systems?
> 
> Bullshit.  Just as there is no timezone common for all users, there is no
> charset common for all of them.  Charset of _machine_ doesn't make any sense
> at all - toy operating systems notwithstanding.

For us using toy languages, we see characters in filenames, not byte sequences, and
whenever possible users should see the same name regardless of locale.

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  3:03 ` Jamie Lokier
@ 2004-02-13 10:07   ` Robin Rosenberg
  2004-02-13 18:06   ` Nicolas Mailhot
  1 sibling, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13 10:07 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Nicolas Mailhot, linux-kernel

On Friday 13 February 2004 04.03, Jamie Lokier wrote:
> Nicolas Mailhot wrote:
> > But that's not a reason not to fix the core problem - I don't want to
> > spend hours fixing filenames next time someone comes up with a new
> > encoding. Please put valid encoding info somewhere or declare filenames
> > are utf-8 or utf-16 only - changing user locale should not corrupt old
> > data.
> 
> If you attach encoding to names for a whole filesystem, you will get
> really unpleasant bugs including security holes because some names
> won't be writable, so the fs will either return error codes when those
> names are used, or silently alter the names.

Depends on how those undecodable file names are handled. Non-ascii filenames are
probably a security issue (negative characters) with some apps. Making them inaccessible
is definitely not ok. I proposed one version, although it might be a good idea to look at those file
systems that handle the problem already so a uniform solution could be used that makes all filenames
accessible regardless of which characters are used and doesn't cause unnecessary
confusion as to what the name is.

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 10:03                 ` Robin Rosenberg
@ 2004-02-13 10:22                   ` vda
  2004-02-13 10:29                     ` Robin Rosenberg
  0 siblings, 1 reply; 68+ messages in thread
From: vda @ 2004-02-13 10:22 UTC (permalink / raw)
  To: Robin Rosenberg, viro; +Cc: Jamie Lokier, Linux kernel

On Friday 13 February 2004 12:03, Robin Rosenberg wrote:
> On Friday 13 February 2004 03.29, you wrote:
> > On Fri, Feb 13, 2004 at 02:16:53AM +0100, Robin Rosenberg wrote:
> > > Yes, so ext3&co. should be equipped with charset options just like the
> > > others so it can be fixed by the user or in some cases the mount tools.
> > >
> > > Is there a place to store character set information in these file
> > > systems?
> >
> > Bullshit.  Just as there is no timezone common for all users, there is no
> > charset common for all of them.  Charset of _machine_ doesn't make any
> > sense at all - toy operating systems notwithstanding.
>
> For us using toy languages, we see characters in filenames, not byte
> sequences, and whenever possible users should see the same name
> regardless of locale.

Al says that there can be a hundred users on the box _simultaneously_,
each with a different locale. The fs should store filenames
in a locale-agnostic way.
-- 
vda


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
       [not found] <04Feb13.015940est.41760@gpu.utcc.utoronto.ca>
@ 2004-02-13 10:26 ` Robin Rosenberg
  0 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13 10:26 UTC (permalink / raw)
  To: Chris Siebenmann; +Cc: Linux kernel

On Friday 13 February 2004 07.59, Chris Siebenmann wrote:
> You write:
> | Oh, I wasn't thinking of fixing *every* application out there, but
> | making the kernel api's convert between the user locale and the file
> | system locale, thus restricting the problems to places that can be
> | fixed.
> 
>  Why should the kernel have any idea about locales, or care? (We've just
> had an illustration, in this very thread, about why bits of the kernel
> caring about locales is dangerous.)
We have also explained why it's a problem and why "something" should care. 
The problem is clear, the solution is less clear. Some file systems already try
to handle the issue because the fs itself defines the character set. That's the
argument for solving the issue with other file systems the same way. It's also only 
the fs media that can reliably know this since media are movable these days.

>  Making the kernel convert between character sets also requires as a
> corollary that the kernel know about all of the character sets, which is
> both dangerous and liable to expand one's kernel impressively.
That's NLS support, which is already there. Conceivably this could be
a compile-time option for the file systems that, for legacy reasons, do not
state what character set/encoding is to be used, so the system could be tuned
for use in an environment that is homogeneous w.r.t. locale.

>  Declaring that the kernel operates in a fixed locale amounts to
> declaring that it will reject certain byte sequences for filenames
> because it doesn't like how they smell, without clear technical need
> for it. People generally object to their kernel restricting them for
> such reasons.

The needs are not "technical", they are "user" needs. (I hear them laughing
in Redmond).

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 10:22                   ` vda
@ 2004-02-13 10:29                     ` Robin Rosenberg
  0 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13 10:29 UTC (permalink / raw)
  To: vda; +Cc: viro, Jamie Lokier, Linux kernel

On Friday 13 February 2004 11.22, vda wrote:
> Al says that there can be a hundred users on the box _simultaneously_,
> each with a different locale. The fs should store filenames
> in a locale-agnostic way.

I assume we agree then :-)

-- robin


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
       [not found] <04Feb13.024659est.41760@gpu.utcc.utoronto.ca>
@ 2004-02-13 17:57 ` Nicolas Mailhot
  0 siblings, 0 replies; 68+ messages in thread
From: Nicolas Mailhot @ 2004-02-13 17:57 UTC (permalink / raw)
  To: chris.siebenmann; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1653 bytes --]

On Fri, 13/02/2004 at 02:46 -0500, Chris Siebenmann wrote:
> You write:
> | Please put valid encoding info somewhere  [...]
> 
>  There is no place for encoding information in the Unix API;

Big surprise;)

>  you would
> have to implement a new one. Even if the kernel is informed of process
> locale when a process creates files, a new API that returns filename
> encoding alongside the file name itself is necessary. And relying on
> process locale on creation leads to undesirable results in some cases.
> 
> | [...] or declare filenames are utf-8 or utf-16 only - changing user
> | locale should not corrupt old data.
> 
>  Since not all byte sequences are valid UTF-8, this immediately means
> that some old files are inaccessible since their filenames are now
> illegal[*]. This also screws everyone who has no desire to work in
> UTF-8, and it screws everyone completely if ever UTF-8 is decided to not
> be the solution to the world's problems.

So what ?
Do you think an app that expects utf-8 filenames won't crash today when
served a byte sequence that's invalid UTF-8 ? (or an app that expects
ascii when served utf-8 oddities)

The problem exists now - putting encoding info somewhere or agreeing on
a common convention won't solve the legacy mess. What it will do is
keep us from getting stuck the same way in a decade.

As long as an FS is shared by multiple apps/users agreeing on what the
filenames mean exactly should not be revolutionary. And btw I don't care
if it's UTF-8, UCS or something else. I just want a common ground so
people and apps can communicate sanely.

Cheers,

-- 
Nicolas Mailhot

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  3:03 ` Jamie Lokier
  2004-02-13 10:07   ` Robin Rosenberg
@ 2004-02-13 18:06   ` Nicolas Mailhot
  2004-02-13 18:15     ` viro
  1 sibling, 1 reply; 68+ messages in thread
From: Nicolas Mailhot @ 2004-02-13 18:06 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

On Fri, 13/02/2004 at 03:03 +0000, Jamie Lokier wrote:
> Nicolas Mailhot wrote:
> > But that's not a reason not to fix the core problem - I don't want to
> > spend hours fixing filenames next time someone comes up with a new
> > encoding. Please put valid encoding info somewhere or declare filenames
> > are utf-8 or utf-16 only - changing user locale should not corrupt old
> > data.
> 
> If you attach encoding to names for a whole filesystem, you will get
> really unpleasant bugs including security holes because some names
> won't be writable, so the fs will either return error codes when those
> names are used, or silently alter the names.

You can have security holes now just by tricking an app into reading
files written by another app which disagreed on the locale.

And as for the filename problems :
- just mangle existing invalid filenames when a default encoding is
agreed upon
- refuse to write new files with invalid filenames just like you would
with the few names forbidden in ascii - apps will learn to cope.

Some convention is needed; expecting it to materialise without OS
enforcement is deluding oneself. Getting a change like this in place
will definitely be painful, but the current situation is far from
painless for a lot of people.

Regards,

-- 
Nicolas Mailhot

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:06   ` Nicolas Mailhot
@ 2004-02-13 18:15     ` viro
  2004-02-13 18:24       ` Valdis.Kletnieks
  2004-02-13 18:31       ` Richard B. Johnson
  0 siblings, 2 replies; 68+ messages in thread
From: viro @ 2004-02-13 18:15 UTC (permalink / raw)
  To: Nicolas Mailhot; +Cc: Jamie Lokier, linux-kernel

On Fri, Feb 13, 2004 at 07:06:46PM +0100, Nicolas Mailhot wrote:
> And as for the filename problems :
> - just mangle existing invalid filenames when a default encoding is
> agreed upon
> - refuse to write new files with invalid filenames just like you would
> with the few names forbidden in ascii - apps will learn to cope.

What names forbidden in ASCII?


* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:15     ` viro
@ 2004-02-13 18:24       ` Valdis.Kletnieks
  2004-02-13 18:31         ` viro
  2004-02-13 18:31       ` Richard B. Johnson
  1 sibling, 1 reply; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-13 18:24 UTC (permalink / raw)
  To: viro; +Cc: Nicolas Mailhot, Jamie Lokier, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 150 bytes --]

On Fri, 13 Feb 2004 18:15:42 GMT, viro@parcelfarce.linux.theplanet.co.uk said:

> What names forbidden in ASCII?

Anything with a / or a \0 in it. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:24       ` Valdis.Kletnieks
@ 2004-02-13 18:31         ` viro
  2004-02-13 20:27           ` Jamie Lokier
  0 siblings, 1 reply; 68+ messages in thread
From: viro @ 2004-02-13 18:31 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Nicolas Mailhot, Jamie Lokier, linux-kernel

On Fri, Feb 13, 2004 at 01:24:33PM -0500, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 13 Feb 2004 18:15:42 GMT, viro@parcelfarce.linux.theplanet.co.uk said:
> 
> > What names forbidden in ASCII?
> 
> Anything with a / or a \0 in it. ;)

You try and pass something _without_ \0 in it to the kernel ;-)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:15     ` viro
  2004-02-13 18:24       ` Valdis.Kletnieks
@ 2004-02-13 18:31       ` Richard B. Johnson
  2004-02-13 18:50         ` JFS default behavior Ulrich Drepper
  2004-02-13 22:39         ` JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Robin Rosenberg
  1 sibling, 2 replies; 68+ messages in thread
From: Richard B. Johnson @ 2004-02-13 18:31 UTC (permalink / raw)
  To: viro; +Cc: Nicolas Mailhot, Jamie Lokier, linux-kernel

On Fri, 13 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Fri, Feb 13, 2004 at 07:06:46PM +0100, Nicolas Mailhot wrote:
> > And as for the filename problems :
> > - just mangle existing invalid filenames when a default encoding is
> > agreed upon
> > - refuse to write new files with invalid filenames just like you would
> > with the few names forbidden in ascii - apps will learn to cope.
>
> What names forbidden in ASCII?

I think that all ASCII characters below 0x20 are forbidden in
Unix file-names and others shown in the reference cited and
"disapproved".

http://www.med.nyu.edu/rcr/rcr/nyu_vms/unixfileanddirectorynames.htm


Cheers,
Dick Johnson
Penguin : Linux version 2.4.24 on an i686 machine (797.90 BogoMips).
            Note 96.31% of all statistics are fiction.



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior
  2004-02-13 18:31       ` Richard B. Johnson
@ 2004-02-13 18:50         ` Ulrich Drepper
  2004-02-13 22:39         ` JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Robin Rosenberg
  1 sibling, 0 replies; 68+ messages in thread
From: Ulrich Drepper @ 2004-02-13 18:50 UTC (permalink / raw)
  To: root; +Cc: viro, Nicolas Mailhot, Jamie Lokier, linux-kernel

Richard B. Johnson wrote:

> I think that all ASCII characters below 0x20 are forbidden in
> Unix file-names

Not true.  Filenames in Unix are defined as

3.169 Filename
  A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
  characters composing the name may be selected from the set of all
  character values excluding the slash character and the null byte. The
  filenames dot and dot-dot have special meaning. A filename is
  sometimes referred to as a "pathname component".


Only NUL and / are special.
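
As a rough illustration of the definition quoted above, here is a minimal
Python sketch. The function name is_valid_filename and the NAME_MAX value
of 255 are assumptions made for this example; the real limit depends on
the filesystem.

```python
# Assumed limit for illustration; the real NAME_MAX depends on the filesystem.
NAME_MAX = 255


def is_valid_filename(name: bytes) -> bool:
    """Check one pathname component against the definition quoted above."""
    if not 1 <= len(name) <= NAME_MAX:
        return False
    # Only the slash character and the null byte are excluded.
    return b"/" not in name and b"\x00" not in name


print(is_valid_filename(b"tab\there"))  # control characters are allowed
print(is_valid_filename(b"a/b"))        # a slash is not
```

Note that bytes below 0x20 and bytes above 0x7f pass this check, which is
the point being made: the definition says nothing about encodings.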

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:31         ` viro
@ 2004-02-13 20:27           ` Jamie Lokier
  0 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-13 20:27 UTC (permalink / raw)
  To: viro; +Cc: Valdis.Kletnieks, Nicolas Mailhot, linux-kernel

viro@parcelfarce.linux.theplanet.co.uk wrote:
> You try and pass something _without_ \0 in it to the kernel ;-)

:)

But seriously, even that is a security issue when someone requests a
URL containing "%00", or some text contains a filename to operate on
and the name contains \0.

For example, if I write a Perl regular expression to reject paths from
the outside world containing "..": m{(?:/|^)\.\.(?:/|\z)}, it will
fail to notice when given the path "..\0" that the kernel will treat
it identically to "..".  Potential security hole, depending on the context.
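
The same effect can be sketched in Python rather than Perl. The helper
names is_rejected and kernel_view are made up for this illustration, and
kernel_view only simulates what a NUL-terminated C string interface
would see; it is not a real system call.

```python
import re

# Python equivalent of the Perl pattern above; \Z in Python matches only
# at the very end of the string, like \z in Perl.
DOTDOT = re.compile(r"(?:/|^)\.\.(?:/|\Z)")


def is_rejected(path: str) -> bool:
    """True if the filter spots a '..' component in the path."""
    return DOTDOT.search(path) is not None


def kernel_view(path: str) -> str:
    """Simulate a C string interface: everything up to the first NUL."""
    return path.split("\0", 1)[0]


print(is_rejected(".."))           # True
print(is_rejected("..\0"))         # False: the filter is fooled
print(kernel_view("..\0") == "..")  # True
```

Python's own os functions raise ValueError on an embedded NUL, so the
truncation has to be simulated here; a language that passes the string
straight through to the C library would not give that protection.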

-- Jamie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13 18:31       ` Richard B. Johnson
  2004-02-13 18:50         ` JFS default behavior Ulrich Drepper
@ 2004-02-13 22:39         ` Robin Rosenberg
  1 sibling, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-13 22:39 UTC (permalink / raw)
  To: root; +Cc: viro, Nicolas Mailhot, Jamie Lokier, linux-kernel

On Friday 13 February 2004 19.31, Richard B. Johnson wrote:
> On Fri, 13 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> > On Fri, Feb 13, 2004 at 07:06:46PM +0100, Nicolas Mailhot wrote:
> > > And as for the filename problems :
> > > - just mangle existing invalid filenames when a default encoding is
> > > agreed upon
> > > - refuse to write new files with invalid filenames just like you would
> > > with the few names forbidden in ascii - apps will learn to cope.
> >
> > What names forbidden in ASCII?
> 
> I think that all ASCII characters below 0x20 are forbidden in
> Unix file-names and others shown in the reference cited and
> "disapproved".
> 
> http://www.med.nyu.edu/rcr/rcr/nyu_vms/unixfileanddirectorynames.htm

That's not really a formal definition of what's allowed. It's a recommendation
for users on how to avoid running into applications that cannot handle all file names,
i.e. buggy applications. Try

	touch "$(/bin/ls -1|head)"

and you will find apps that can handle the nice filename and those that cannot. I'm
definitely not endorsing such names, and it would probably be wise to implement a system
policy that allows administrators to ban them, as they represent security holes and all
sorts of problems.

Some filesystems forbid these names, but unix doesn't.

-- robin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-13  3:23                 ` Jamie Lokier
@ 2004-02-14 15:09                   ` Eduard Bloch
  2004-02-15  1:01                     ` Jamie Lokier
  0 siblings, 1 reply; 68+ messages in thread
From: Eduard Bloch @ 2004-02-14 15:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

#include <hallo.h>
* Jamie Lokier [Fri, Feb 13 2004, 03:23:05AM]:

> If I create a file using a shell command, what I get depends on which
> terminal I used to create it.  If I am using a terminal which displays
> UTF-8 but ssh to another machine, the other machine assumes the
> terminal is displaying iso-8859-1 even though the other machine's
> default locale is UTF-8.  And so on.

Then you have something wrong in the shell configuration of the remote
machine. I do not see any problems in having a ssh shell opened from a
UTF-8 terminal to a machine where the shell environment is also
configured to use UTF-8 environment.

The only problem that may appear is if you deliberately configured the
user environment on the other side for latin1; then you would have to
fix it in some way, e.g. by configuring LANG depending on SSH* variables
in .bashrc.

Regards,
Eduard.
-- 
The mark of a small person is that he becomes arrogant when
he notices that he is needed.
		-- Friedl Beutelrock

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-12 19:08                   ` John Bradford
  2004-02-12 19:39                     ` Robin Rosenberg
@ 2004-02-14 15:24                     ` Eduard Bloch
  1 sibling, 0 replies; 68+ messages in thread
From: Eduard Bloch @ 2004-02-14 15:24 UTC (permalink / raw)
  To: John Bradford; +Cc: Linux kernel

#include <hallo.h>
* John Bradford [Thu, Feb 12 2004, 07:08:06PM]:

> Well, as long as every userspace implementation gets it correct, we'll
> be OK.  Personally, I doubt they all will, especially those that
> convert from legacy encodings to Unicode, although quite possibly the
> above scenario with combining characters is not likely to happen for
> filenames.  Or is it?  What about copying a file from a filesystem
> with a UTF-8 encoding to a filesystem with a legacy encoding, and then
> back again?

I always wondered why there is no "iocharset" option for unixoid
filesystems. IMO there could be an easy migration path for existing
installations to UTF-8:

 - convert all filenames to UTF-8 (or any other Unicode encoding)
 - mount the FS with "iocharset=UTF-8,charset=latin1" (for current
   Latin1 users). Users can continue to use their latin1 names while
   they are stored in Unicode on the disk (this is what currently
   happens with VFAT, a very nice solution IMHO)
 - when enough applications are ready for multibyte encodings, remove
   the charset/iocharset workaround and make people use .UTF-8 locales

Though, the ultimate solution for the steps 2. and 3. would be the
Microsoft-like way:

 - convert the filenames in libc (from $locale to UTF-8), depending on
   which locale the user has set

This sounds like cheating but would allow us to be most flexible and most
compatible with encoding-ignoring applications.

Eduard.
-- 
We are nothing; what we seek is everything.
		-- Johann Christian Friedrich Hölderlin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-14 15:09                   ` Eduard Bloch
@ 2004-02-15  1:01                     ` Jamie Lokier
  2004-02-16 14:03                       ` Eduard Bloch
  0 siblings, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-15  1:01 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Linux kernel

Eduard Bloch wrote:
> > If I create a file using a shell command, what I get depends on which
> > terminal I used to create it.  If I am using a terminal which displays
> > UTF-8 but ssh to another machine, the other machine assumes the
> > terminal is displaying iso-8859-1 even though the other machine's
> > default locale is UTF-8.  And so on.
> 
> Then you have something wrong in the shell configuration of the remote
> machine. I do not see any problems in having a ssh shell opened from a
> UTF-8 terminal to a machine where the shell environment is also
> configured to use UTF-8 environment.

Of course that's fine.  What goes wrong is when you connect to that
same machine from another terminal which is not UTF-8.

There are in fact two different problems, and you have ignored them both :)

Firstly, "ls", editors, filenames:

   The shell configuration is irrelevant.  If I create a file name
   like "£100.txt" (that's POUND followed by "100.txt") when I'm
   connected from a UTF-8 terminal, it creates a filename encoded in
   UTF-8 and displays it fine.

   If I then log in to the same machine from another terminal which
   displays latin1, then "ls" will _not_ display the name correctly
   _regardless_ of shell or locale configuration.

   If I then create a file called "£100.txt" (same name) using the
   terminal which displays latin1, it creates a filename encoded in
   latin1.

   When I log in using the UTF-8 terminal, "ls" won't display the
   second name as it was entered.  Neither will GNOME or KDE.

   Unfortunately, to be compatible with shell utilities, programs like
   Mutt and Emacs which _are_ aware of the display and input encodings
   will use the current terminal's encoding when accessing the
   filesystem.  So even those programs create file names with the
   wrong encoding, if you log in from the wrong kind of terminal.

   When I open a file in Emacs, and the file contains UTF-8, that
   displays just fine on either kind of terminal (provided the terminal
   can display the characters).  But Emacs, and many other programs,
   will display the wrong file _names_ when logged in from the wrong
   kind of terminal.
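
   The byte-level picture of the two "£100.txt" files can be sketched in
   Python. This is only an illustration of the encodings involved, not of
   any particular terminal or shell; ascii() is used so the output is the
   same whatever the local terminal encoding happens to be.

```python
name = "\u00a3100.txt"  # POUND SIGN followed by "100.txt"

utf8_name = name.encode("utf-8")      # as created from the UTF-8 terminal
latin1_name = name.encode("latin-1")  # as created from the latin1 terminal

print(utf8_name)    # b'\xc2\xa3100.txt'
print(latin1_name)  # b'\xa3100.txt'

# The byte strings differ, so the directory holds two distinct names.
print(utf8_name == latin1_name)  # False

# A latin1 terminal showing the UTF-8 name renders an extra character.
shown_on_latin1 = utf8_name.decode("latin-1")
print(ascii(shown_on_latin1))  # '\xc2\xa3100.txt'

# A UTF-8 terminal cannot decode the latin1 name; U+FFFD stands in here.
shown_on_utf8 = latin1_name.decode("utf-8", errors="replace")
print(ascii(shown_on_utf8))  # '\ufffd100.txt'
```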


Secondly, message locale and the shell:

   There is no mechanism for SSH to convey which character encoding
   the remote machine must use for displaying and inputting text, yet
   client terminals come in different flavours.  That is the problem.

   (On my laptop, for example, which is a standard RH9, Gnome terminal
   windows are UTF-8 but console is latin1.  These are both fine
   locally.  There is no configuration on a remote machine which is right
   for both of them, though.)

   I think this is because the character encoding used by the terminal
   should be in the TERM environment variable, but it is in LANG instead.

> The only problem that may appear is if you deliberately configured the
> user environment on the other side for latin1; then you would have to
> fix it in some way, e.g. by configuring LANG depending on SSH* variables
> in .bashrc.

No.  If I have a plain shell with no configuration at all, then
both charset-aware programs like Mutt and Emacs, and
non-charset-aware code like filename display from "ls" do _not_
automatically display filenames properly on both kinds of client
terminal.

In the former case it is because SSH does not automatically convey
the appropriate setting for LANG, which (rather dubiously) includes
whether to use UTF-8 for display.

In the latter case, "ls" and such, there is nothing SSH can do.

(And that's what makes this relevant to linux-kernel - "ls" has no
way to display names correctly on both terminal types precisely
because it does not have any information about the character
encoding of the filenames returned by readdir()).

The result of all this is that everything works fine as long as you
only log in from the kind of terminal which matches the remote machine.

Unfortunately, while the modern GUIs all use UTF-8 (this is a good
thing in the long run), both the default Linux console, and most
non-Linux terminals, do not use UTF-8.

Therefore file names are generally created and displayed in UTF-8 when
using any of the modern GUIs, including GUI terminals, but file names
are generally created and displayed in a locale-specific encoding
(usually iso-8859-1) when using any console, external terminal, or ssh
from an older client.


Btw, as a practical matter, it took me about a year before I figured
out how to enter a "£" (POUND) symbol into a message being edited with
Mutt and Emacs on a remote server.  Until I learned to explicitly set
"LANG=en_GB.utf8" on the remote server when I logged in from GNOME
Terminal (it was a RH9 box which by default set LANG=en_GB, which is
_correct_ for most clients), typing "£" just didn't enter anything.


Third problem (a straightforward Linux bug):

   I just did unicode_start on the console, which turns on UTF-8 for
   that virtual terminal - for display and for keyboard input.

   Then I did unicode_stop.  Guess what: it put the display back in
   iso-8859-1 for that virtual terminal, but the keyboard remained stuck
   in UTF-8 for _all_ virtual terminals.  Once in that state, I had
   difficulty typing the pound sign which appears earlier in this
   message, and in fact I don't know how to restore the console without
   rebooting the client machine.  "reset" doesn't work; using a
   different virtual terminal doesn't work.

-- Jamie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
       [not found]             ` <1piXj-1d3-3@gated-at.bofh.it>
@ 2004-02-15 14:26               ` Pascal Schmidt
       [not found]               ` <1pRLy-21o-31@gated-at.bofh.it>
  1 sibling, 0 replies; 68+ messages in thread
From: Pascal Schmidt @ 2004-02-15 14:26 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel

On Sun, 15 Feb 2004 02:10:05 +0100, you wrote in linux.kernel:

>    Then I did unicode_stop.  Guess what: it put the display back in
>    iso-8859-1 for that virtual terminal, but the keyboard remained stuck
>    in UTF-8 for _all_ virtual terminals.

kbd_mode -a to reset to ASCII mode.

-- 
Ciao,
Pascal

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-15  1:01                     ` Jamie Lokier
@ 2004-02-16 14:03                       ` Eduard Bloch
  2004-02-16 14:28                         ` Jamie Lokier
                                           ` (3 more replies)
  0 siblings, 4 replies; 68+ messages in thread
From: Eduard Bloch @ 2004-02-16 14:03 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

#include <hallo.h>
* Jamie Lokier [Sun, Feb 15 2004, 01:01:50AM]:

> > Then you have something wrong in the shell configuration of the remote
> > machine. I do not see any problems in having a ssh shell opened from a
> > UTF-8 terminal to a machine where the shell environment is also
> > configured to use UTF-8 environment.
> 
> Of course that's fine.  What goes wrong is when you connect to that
> same machine from another terminal which is not UTF-8.
> 
> There are in fact two different problems, and you have ignored them both :)
> 
> Firstly, "ls", editors, filenames:
> 
>    The shell configuration is irrelevant.  If I create a file name
>    like "£100.txt" (that's POUND followed by "100.txt") when I'm

Sure, sure, I can read it since I use UTF-8 too.

>    connected from a UTF-8 terminal, it creates a filename encoded in
>    UTF-8 and displays it fine.
> 
>    If I then log in to the same machine from another terminal which
>    displays latin1, then "ls" will _not_ display the name correctly
>    _regardless_ of shell or locale configuration.

I know what you mean and that is why I already proposed a radical
solution. Let me repeat it:

 - convert all files from the previous charset to UTF-8 overnight.
   If the previous charset was unknown, first make sure that you can
   guess it for all users and contact users that have files with
   suspicious filenames (e.g. not convertible from Latin1), or look
   through their shell/X config files (*)
 
 - in libc, implement a recoding function to convert file names from
   LC_CTYPE to the underlying UTF-8 encoding

Done.

(*) There is no other way. Linux developers ignored the diversity of
charset/encodings over many years and now the needed information is lost
(not stored anywhere in the filesystem)

>    If I then create a file called "£100.txt" (same name) using the
>    terminal which displays latin1, it creates a filename encoded in
>    latin1.

Of course. That is why the conversion should be done in userspace
(libc). The kernel itself does not know about the locale in use.

>    Unfortunately, to be compatible with shell utilities, programs like
>    Mutt and Emacs which _are_ aware of the display and input encodings
>    will use the current terminal's encoding when accessing the

That is the correct way, though.

>    filesystem.  So even those programs create file names with the
>    wrong encoding, if you log in from the wrong kind of terminal.

It is the _right_ encoding at the moment when they create it.

> Secondly, message locale and the shell:
> 
>    There is no mechanism for SSH to convey which character encoding
>    the remote machine must use for displaying and inputting text, yet
>    client terminals come in different flavours.  That is the problem.
> 
>    (On my laptop, for example, which is a standard RH9, Gnome terminal
>    windows are UTF-8 but console is latin1.  These are both fine
>    locally.  There is no configuration on a remote machine which is right
>    for both of them, though.)

Yup, I know that problem. At least to display them correctly, you can
run unicode_start (to enable the console's own conversion), which
sucks when there are chars from completely different language groups, e.g.
Latin and Cyrillic. I used dynafont for a while, which worked well for
displaying characters.

>    I think this is because the character encoding used by the terminal
>    should be in the TERM environment variable, but it is in LANG instead.

No. TERM does not have anything to do with locales (LANG).

Regards,
Eduard.
-- 
Selflessness is fully matured egoism.
		-- Herbert Spencer

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 14:03                       ` Eduard Bloch
@ 2004-02-16 14:28                         ` Jamie Lokier
  2004-02-16 19:22                           ` Eduard Bloch
  2004-02-16 15:18                         ` Valdis.Kletnieks
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-16 14:28 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Linux kernel

Eduard Bloch wrote:
> >    I think this is because the character encoding used by the terminal
> >    should be in the TERM environment variable, but it is in LANG instead.
> 
> No. TERM does not have anything to do with locales (LANG).

No.  The locale should not have anything to do with the appropriate
byte sequences needed to make the terminal display characters.

It is wrong that LANG must have a different value depending on whether
I log in using a DEC VT100 or a Gnome Terminal, even though I wish to
see exactly the same language, dialect, messages, number formats,
currency formats, dates and times.

It is acceptable that LANG may control the encoding stored in files
and filenames, but this should be independent of the terminal type.

It is especially wrong that libraries which should be
locale-independent - such as curses, slang and readline - must read
the LANG variable in addition to TERM.  If curses does not read and
parse LANG, simple things like the box around a dialog will not line
up correctly.  This is wrong - it is a terminal characteristic.

-- Jamie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 14:03                       ` Eduard Bloch
  2004-02-16 14:28                         ` Jamie Lokier
@ 2004-02-16 15:18                         ` Valdis.Kletnieks
  2004-02-16 15:32                           ` Jamie Lokier
  2004-02-16 15:46                           ` John Bradford
  2004-02-16 15:27                         ` Jamie Lokier
  2004-02-16 15:44                         ` Robin Rosenberg
  3 siblings, 2 replies; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-16 15:18 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Jamie Lokier, Linux kernel

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

On Mon, 16 Feb 2004 15:03:38 +0100, Eduard Bloch said:

>  - convert all files from the previous charset to UTF-8 overnight.
>    If the previous charset was unknown, first make sure that you can
>    guess it for all users and contact users that have files with
>    suspicious filenames (e.g. not convertible from Latin1), or look
>    through their shell/X config files (*)

Hazardous.

>  - in libc, implement a recoding function to convert file names from
>    LC_CTYPE to the underlying UTF-8 encoding

Hmm.. could be fun if somebody is calling 'open', and the UTF-8 encoding
requires the insertion of extra characters to encode it - what do you do then?
That looks like a security hole just waiting to happen.  Probably has lots of
lurking corner cases too - what if you creat() a file, then do a readdir() and
strcmp() each entry looking for your file (while comparing a filename smashed
to UTF-8 to the original unsmashed string)?
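
One concrete corner case of the "extra characters" problem is length: a
name that fits within the limit in latin1 may not fit once recoded. A
minimal Python sketch, assuming a NAME_MAX of 255 bytes (the real value
depends on the filesystem):

```python
NAME_MAX = 255  # assumed limit for illustration

# 200 accented letters: one byte each in latin1.
latin1_name = ("\u00e9" * 200).encode("latin-1")

# The same name recoded to UTF-8 needs two bytes per letter.
utf8_name = latin1_name.decode("latin-1").encode("utf-8")

print(len(latin1_name))            # 200
print(len(utf8_name))              # 400
print(len(utf8_name) <= NAME_MAX)  # False
```

Whatever does the recoding would have to decide between failing the
call and truncating the name in this situation.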


[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 14:03                       ` Eduard Bloch
  2004-02-16 14:28                         ` Jamie Lokier
  2004-02-16 15:18                         ` Valdis.Kletnieks
@ 2004-02-16 15:27                         ` Jamie Lokier
  2004-02-16 15:44                         ` Robin Rosenberg
  3 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-16 15:27 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Linux kernel

Eduard Bloch wrote:
> Yup, I know that problem. At least to display them correctly, you can
> run unicode_start (to enable the console's own conversion), which
> sucks when there are chars from completely different language groups, e.g.
> Latin and Cyrillic. I used dynafont for a while, which worked well for
> displaying characters.

Sorry, unicode_start doesn't work on most terminals (e.g. the VT100
downstairs or the Putty in the internet cafe), and it's also very
antisocial to do when I log in from someone else's Linux console.

-- Jamie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:18                         ` Valdis.Kletnieks
@ 2004-02-16 15:32                           ` Jamie Lokier
  2004-02-16 19:13                             ` Eduard Bloch
  2004-02-16 15:46                           ` John Bradford
  1 sibling, 1 reply; 68+ messages in thread
From: Jamie Lokier @ 2004-02-16 15:32 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: Eduard Bloch, Linux kernel

Valdis.Kletnieks@vt.edu wrote:
> >  - in libc, implement a recoding function to convert file names from
> >    LC_CTYPE to the underlying UTF-8 encoding
> 
> Hmm.. could be fun if somebody is calling 'open', and the UTF-8 encoding
> requires the insertion of extra characters to encode it - what do you do then?

> That looks like a security hole just waiting to happen.  Probably
> has lots of lurking corner cases too - what if you creat() a file,
> then do a readdir() and strcmp() each entry looking for your file
> (while comparing a filename smashed to UTF-8 to the original
> unsmashed string)?

Actually, following Eduard's proposal, that would work fine.  The file
name would be passed to libc in the current encoding, created in
UTF-8, libc's readdir() would convert it back (which is always
possible without mangling), and strcmp() would be fine.
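
A minimal Python sketch of that round trip. The functions to_disk and
from_disk are hypothetical stand-ins for the proposed libc hooks, and a
plain set stands in for the on-disk directory; nothing here is an
existing libc interface.

```python
def to_disk(name: bytes, locale_charset: str) -> bytes:
    """Hypothetical open()/creat() side: locale encoding to UTF-8."""
    return name.decode(locale_charset).encode("utf-8")


def from_disk(name: bytes, locale_charset: str) -> bytes:
    """Hypothetical readdir() side: UTF-8 back to the locale encoding."""
    return name.decode("utf-8").encode(locale_charset)


directory = set()        # stands in for the on-disk directory
wanted = b"\xa3100.txt"  # latin1 bytes the application passes to creat()
directory.add(to_disk(wanted, "latin-1"))

# What the application would see from readdir() under the same locale.
listing = [from_disk(entry, "latin-1") for entry in directory]
print(wanted in listing)  # True: the strcmp() finds the file
```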

The real problem comes when you readdir() a directory which contains
non-UTF-8 names.  Even if you change your local filesystem, when you
go travelling a remotely-mounted filesystem elsewhere may have them.
What does Eduard's libc do then?  Ignore the names?  Mangle them?
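
The failing case is easy to reproduce in a Python sketch. The function
from_disk is a hypothetical stand-in for the readdir() side of the
proposed conversion, assuming it decodes strictly:

```python
def from_disk(name: bytes, locale_charset: str) -> bytes:
    """Hypothetical readdir() side: UTF-8 on disk to the locale encoding."""
    return name.decode("utf-8").encode(locale_charset)


foreign_entry = b"\xa3100.txt"  # latin1 bytes stored as-is, never converted

try:
    from_disk(foreign_entry, "latin-1")
    outcome = "converted"
except UnicodeDecodeError:
    outcome = "not valid UTF-8"

print(outcome)  # not valid UTF-8
```

A strict decoder has nothing sensible to return here, so the conversion
layer is left with exactly the choices above: skip the entry or mangle it.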

Not to mention the extremely unpleasant performance implications.

-- Jamie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 14:03                       ` Eduard Bloch
                                           ` (2 preceding siblings ...)
  2004-02-16 15:27                         ` Jamie Lokier
@ 2004-02-16 15:44                         ` Robin Rosenberg
  3 siblings, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-16 15:44 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Jamie Lokier, Linux kernel

On Monday 16 February 2004 15.03, Eduard Bloch wrote:
> I know what you mean and that is why I already proposed a radical
> solution. Let me repeat it:
> 
>  - convert all files from the previous charset to UTF-8 overnight.
>    If the previous charset was unknown, first make sure that you can
>    guess it for all users and contact users that have files with
>    suspicious filenames (e.g. not convertible from Latin1), or look
>    through their shell/X config files (*)

Thankfully isolatin-1 (and all other encodings in use AFAIK) can be converted to UTF-8.
IsoLatin1 is also extremely simple to convert.
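
To show how simple, here is a minimal Python sketch of the conversion done
by hand. The function name latin1_to_utf8 is made up for this example; in
practice a library codec or iconv would be used instead.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO-8859-1 bytes to UTF-8 without using a codec."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # ASCII passes through unchanged
        else:
            out.append(0xC0 | (b >> 6))    # lead byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))  # continuation byte
    return bytes(out)


print(latin1_to_utf8(b"\xa3100.txt"))  # b'\xc2\xa3100.txt'
```

This works because the 256 Latin-1 code points are the first 256 Unicode
code points, so every high byte maps to exactly one two-byte sequence.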

-- robin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
       [not found]                 ` <1pSRf-31Z-5@gated-at.bofh.it>
@ 2004-02-16 15:44                   ` Pascal Schmidt
  2004-02-16 15:59                     ` Valdis.Kletnieks
  0 siblings, 1 reply; 68+ messages in thread
From: Pascal Schmidt @ 2004-02-16 15:44 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

On Mon, 16 Feb 2004 16:30:13 +0100, you wrote in linux.kernel:

> lurking corner cases too - what if you creat() a file, then do a
> readdir() and strcmp() each entry looking for your file (while
> comparing a filename smashed to UTF-8 to the original unsmashed string)?

That's broken on multitasking systems anyway. Even if you find the
same name, somebody (root process for example) might have unlinked your
file and created another with the same name between you calling creat()
and doing the readdir(). What would be the use of this, anyway?

-- 
Ciao,
Pascal

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:18                         ` Valdis.Kletnieks
  2004-02-16 15:32                           ` Jamie Lokier
@ 2004-02-16 15:46                           ` John Bradford
  2004-02-16 15:48                             ` viro
  2004-02-16 16:25                             ` Robin Rosenberg
  1 sibling, 2 replies; 68+ messages in thread
From: John Bradford @ 2004-02-16 15:46 UTC (permalink / raw)
  To: Valdis.Kletnieks, Eduard Bloch; +Cc: Jamie Lokier, Linux kernel

> >  - convert all files from the previous charset to UTF-8 overnight.
> >    If the previous charset was unknown, first make sure that you can
> >    guess it for all users and contact users that have files with
> >    suspicious filenames (e.g. not convertible from Latin1), or look
> >    through their shell/X config files (*)
> 
> Hazardous.
> 
> >  - in libc, implement a recoding function to convert file names from
> >    LC_CTYPE to the underlying UTF-8 encoding
> 
> Hmm.. could be fun if somebody is calling 'open', and the UTF-8 encoding
> requires the insertion of extra characters to encode it - what do you do then?
> That looks like a security hole just waiting to happen.  Probably has lots of
> lurking corner cases too - what if you creat() a file, then do a readdir() and
> strcmp() each entry looking for your file (while comparing a filename smashed
> to UTF-8 to the original unsmashed string)?

The current situation is that so many applications simply treat
filenames as arbitrary sequences of bytes.  With many encodings, this
simply happens to work, and an encoding mis-match will result in some
incorrect characters being displayed for byte values > 127.  However,
some encodings, such as UTF-8, are simply _not_ compatible with the
'you can also treat it like an arbitrary byte string model', and there
is a very real potential for security holes in bad implementations if
we go down the "it's an arbitrary byte string, but you _should_ store
UTF-8 there" route.

Maybe we should forget filename encoding altogether, and start
thinking of filenames as arbitrary sequences of _32-bit words_.
Existing applications can store their arbitrary byte sequences in the
low byte, and new calls can be added to provide Unicode-aware
userspace applications with access to the 32-bit space, which _must_
be used for UCS-4.
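
One reading of that idea, sketched in Python. The little-endian packing is
an assumption for illustration only; the proposal does not specify a byte
order or any concrete interface.

```python
import struct

name = "\u00a3100.txt"
words = [ord(ch) for ch in name]  # one 32-bit word per character

# Packed as 32-bit words the name is full of zero bytes, so it could not
# pass through the existing NUL-terminated interfaces; hence "new calls".
packed = struct.pack("<%dI" % len(words), *words)
print(len(packed))        # 32
print(b"\x00" in packed)  # True

# The low byte of each word reproduces the legacy byte string.
low_bytes = bytes(w & 0xFF for w in words)
print(low_bytes == name.encode("latin-1"))  # True
```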

John.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:46                           ` John Bradford
@ 2004-02-16 15:48                             ` viro
  2004-02-16 16:43                               ` John Bradford
  2004-02-16 16:25                             ` Robin Rosenberg
  1 sibling, 1 reply; 68+ messages in thread
From: viro @ 2004-02-16 15:48 UTC (permalink / raw)
  To: John Bradford; +Cc: Valdis.Kletnieks, Eduard Bloch, Jamie Lokier, Linux kernel

On Mon, Feb 16, 2004 at 03:46:21PM +0000, John Bradford wrote:
> The current situation is that so many applications simply treat
> filenames as arbitrary sequences of bytes.  With many encodings, this
> simply happens to work, and an encoding mis-match will result in some
> incorrect characters being displayed for byte values > 127.  However,
> some encodings, such as UTF-8, are simply _not_ compatible with the
> 'you can also treat it like an arbitrary byte string model', and there

Excuse me?  Would you fscking mind explaining what, in your opinion,
UTF-8 is and what makes it "simply _not_ compatible" with the
aforementioned model?

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:44                   ` Pascal Schmidt
@ 2004-02-16 15:59                     ` Valdis.Kletnieks
  0 siblings, 0 replies; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-16 15:59 UTC (permalink / raw)
  To: Pascal Schmidt; +Cc: linux-kernel

On Mon, 16 Feb 2004 16:44:48 +0100, Pascal Schmidt said:

> file and created another with the same name between you calling creat()
> and doing the readdir(). What would be the use of this, anyway?

How does the shell do 'echo foo*'?

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:46                           ` John Bradford
  2004-02-16 15:48                             ` viro
@ 2004-02-16 16:25                             ` Robin Rosenberg
  1 sibling, 0 replies; 68+ messages in thread
From: Robin Rosenberg @ 2004-02-16 16:25 UTC (permalink / raw)
  To: John Bradford; +Cc: Valdis.Kletnieks, Eduard Bloch, Jamie Lokier, Linux kernel

On Monday 16 February 2004 16.46, John Bradford wrote:
> Maybe we should forget filename encoding altogether, and start
> thinking of filenames as arbitrary sequences of _32-bit words_.
> Existing applications can store their arbitrary byte sequences in the
> low byte, and new calls can be added to provide Unicode-aware
> userspace applications with access to the 32-bit space, which _must_
> be used for UCS-4.

You forgot a :-). Right :-/

-- robin

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:48                             ` viro
@ 2004-02-16 16:43                               ` John Bradford
  0 siblings, 0 replies; 68+ messages in thread
From: John Bradford @ 2004-02-16 16:43 UTC (permalink / raw)
  To: viro; +Cc: Valdis.Kletnieks, Eduard Bloch, Jamie Lokier, Linux kernel, john

Quote from viro@parcelfarce.linux.theplanet.co.uk:
> On Mon, Feb 16, 2004 at 03:46:21PM +0000, John Bradford wrote:
> > The current situation is that so many applications simply treat
> > filenames as arbitrary sequences of bytes.  With many encodings, this
> > simply happens to work, and an encoding mis-match will result in some
> > incorrect characters being displayed for byte values > 127.  However,
> > some encodings, such as UTF-8, are simply _not_ compatible with the
> > 'you can also treat it like an arbitrary byte string model', and there
> 
> Excuse me?  Would you fscking mind explaining what, in your opinion,
> UTF-8 is

Read the UTF-8 manual page.

> and what makes it "simply _not_ compatible" with the aforementioned
> model?

Byte values > 127 in UTF-8 don't map to single characters, but instead
many of them form part of an escape to a larger set of values.

The net effect is that if you have filenames in an existing 8-bit
encoding, such as any of the ISO-8859- encodings, and treat them as
being in another, similar encoding, you may get some incorrect
characters.  This is not ideal, of course, but it is not very
confusing for end users.  You can more or less store arbitrary bytes
in the filename, and usually at least get a displayable, re-typeable,
somewhat usable result out.

Note that _many_ current applications expect to be able to do just
that.

However, with UTF-8, random bytes > 127 may not map to any valid
character sequence at all, or may map to a sequence that is not
permitted by the spec, but which is, for example, a 31-bit
representation of a value < 128.  These are a potential source of
security vulnerabilities for badly written decoders.
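
To make the point about badly written decoders concrete, here is a
minimal sketch of a strict well-formedness check in C.  The name
utf8_is_valid() is hypothetical, and the sketch assumes the modern
four-byte limit (U+10FFFF), so the longer 31-bit forms mentioned above
are simply rejected along with overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are well-formed
 * UTF-8, 0 otherwise.  Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates and values above U+10FFFF. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (b < 0x80) {                 /* plain 7-bit ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;                   /* continuation byte or 0xF8..0xFF as lead */
        }

        if (n - i <= extra)
            return 0;                   /* sequence runs off the end */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min)
            return 0;                   /* overlong, e.g. 0xC0 0xAF for '/' */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

With this check, the two-byte sequence 0xC0 0xAF (an overlong '/') and
a lone 0xA3 (a Latin-1 pound sign) are both rejected, while 0xC2 0xA3
is accepted.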

Now, this problem is not limited to UTF-8 - many 16 bit encodings may
have similar issues with 'random' byte streams.

However, with my proposed solution, Unicode-aware applications can be
adapted to write their filenames as UCS-4.  Existing applications,
which continue to see 8-bit byte streams that they can interpret as
they like, will see 7-bit ASCII for characters which can be
represented in it, and a random character for those which can't.
Assuming that those applications treat the byte sequence as an
ISO-8859- type character set (not UTF-8, or a 16-bit character set),
this shouldn't be too much of a problem, except where the low byte of
the UCS-4 character is \0 or /.  We can work around this by replacing
such bytes with another character in the 8-bit read routine, which
isn't expected to deal with anything other than 7-bit ASCII 100%
correctly.  That is no worse than what we have at the moment, as far
as I can see.

Applications which do treat the 8-bit byte stream as UTF-8 or an
existing 16-bit encoding should have only one additional thing to deal
with over what they have to deal with today, and that is the potential
for a filename created from truncated 32-bit UCS-4 values to contain
\0 or /.  I suggested above that the kernel could deal with that by
substituting another value, but obviously UTF-8 and 16-bit encodings
are more sensitive to what that substitute value is, than ISO-8859-
type encodings are.
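
A rough sketch of the 8-bit read path described above, in C.  The
function name narrow_name() and the choice of substitute character are
assumptions for illustration only; nothing like this exists in the
kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: present a name stored as n 32-bit values to a
 * legacy caller by keeping only the low byte of each value.  A low
 * byte of \0 or '/' would break path handling, so it is replaced by
 * subst.  out must have room for n + 1 bytes.  Returns n. */
size_t narrow_name(const uint32_t *wide, size_t n, char *out, char subst)
{
    size_t i;

    assert(wide != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char low = (unsigned char)(wide[i] & 0xFF);

        if (low == 0x00 || low == '/')
            low = (unsigned char)subst;
        out[i] = (char)low;
    }
    out[n] = '\0';
    return n;
}
```

For example, U+012F has the low byte 0x2F ('/') and U+0100 has the low
byte 0x00, so both come out as the substitute, while U+20AC comes out
as the single byte 0xAC.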

John.

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
       [not found]                     ` <1pTu7-3Ce-7@gated-at.bofh.it>
@ 2004-02-16 17:26                       ` Pascal Schmidt
  2004-02-16 17:58                         ` Valdis.Kletnieks
  0 siblings, 1 reply; 68+ messages in thread
From: Pascal Schmidt @ 2004-02-16 17:26 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

On Mon, 16 Feb 2004 17:10:23 +0100, you wrote in linux.kernel:

>> file and created another with the same name between you calling creat()
>> and doing the readdir(). What would be the use of this, anyway?
> How does the shell do 'echo foo*'?

I fail to see the connection with creat() followed by readdir(). The shell
is surely not expecting the names that follow from the glob expansion to
have any relationship with previous shell operations.

-- 
Ciao,
Pascal

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 17:26                       ` Pascal Schmidt
@ 2004-02-16 17:58                         ` Valdis.Kletnieks
  2004-02-16 19:48                           ` Pascal Schmidt
  0 siblings, 1 reply; 68+ messages in thread
From: Valdis.Kletnieks @ 2004-02-16 17:58 UTC (permalink / raw)
  To: Pascal Schmidt; +Cc: linux-kernel

On Mon, 16 Feb 2004 18:26:47 +0100, Pascal Schmidt said:
> On Mon, 16 Feb 2004 17:10:23 +0100, you wrote in linux.kernel:
> 
> >> file and created another with the same name between you calling creat()
> >> and doing the readdir(). What would be the use of this, anyway?
> > How does the shell do 'echo foo*'?
> 
> I fail to see the connection with creat() followed by readdir(). The shell
> is surely not expecting the names that follow from the glob expansion to
> have any relationship with previous shell operations

Oh?

% rm *
% touch foo1 bar1    # this calls creat() or open() or similar
% touch foo2 bar2	# as will this...
% echo foo*	# and this will do a readdir(), presumably

Do you have any expectations what the echo will do?  Obviously the glob
DOES have a relationship with previous shell operations.

The point is that *if* we assume that glibc is going to do some magic
conversion when creating a file, we are assuming that glibc will *always* keep
the conversion hidden. No matter what.  Because the user now has expectations
of what that file was called when he created it - the string he passed to
open()/creat().  If what gets handed to the kernel is something different, we
have to make sure that the user never finds out about it.

And if there are special iso8859-* chars in the filename, this means that the magic
handwave to convert to utf-8 inside glibc will either have to do it in-place (mangling
the user-supplied filename, and bad karma) or it gets to call malloc() to get a work
space (can't use a 'static char[MAXPATHLEN]', that's not thread-safe).

This gets *very* interesting if the malloc() fails.. ;)
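
To illustrate why the conversion cannot be done in place, here is a
minimal sketch in C of an ISO-8859-1 to UTF-8 recoder.  The name
latin1_to_utf8() is hypothetical and this is not how glibc does
anything; it only shows that every byte above 127 grows to two bytes,
so the output needs a separate buffer that can overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 name to
 * UTF-8 in a caller-supplied buffer of outsz bytes.  Returns the
 * UTF-8 length (excluding the NUL), or -1 if the result does not fit. */
long latin1_to_utf8(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return -1;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        size_t need = (c < 0x80) ? 1 : 2;

        if (o + need >= outsz)          /* keep one byte for the NUL */
            return -1;
        if (c < 0x80) {
            out[o++] = (char)c;
        } else {
            /* U+0080..U+00FF encode as 110000xx 10xxxxxx */
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return (long)o;
}
```

A caller could pass an automatic (stack) buffer, which avoids both the
static buffer and malloc(), at the price of having to handle the -1
case somewhere.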

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 15:32                           ` Jamie Lokier
@ 2004-02-16 19:13                             ` Eduard Bloch
  0 siblings, 0 replies; 68+ messages in thread
From: Eduard Bloch @ 2004-02-16 19:13 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Valdis.Kletnieks, Linux kernel

#include <hallo.h>
* Jamie Lokier [Mon, Feb 16 2004, 03:32:24PM]:

> > That looks like a security hole just waiting to happen.  Probably
> > has lots of lurking corner cases too - what if you creat() a file,
> > then do a readdir() and strcmp() each entry looking for your file
> > (while comparing a filename smashed to UTF-8 to the original
> > unsmashed string)?
> 
> Actually, following Eduard's proposal, that would work fine.  The file
> name would be passed to libc in the current encoding, created in
> UTF-8, libc's readdir() would convert it back (which is always
> possible without mangling), and strcmp() would be fine.
> 
> The real problem comes when you readdir() a directory which contains
> non-UTF-8 names.  Even if you change your local filesystem, when you
> go travelling a remotely-mounted filesystem elsewhere may have them.
> What does Eduard's libc do then?  Ignore the names?  Mangle them?

Just pass the unconverted strings then. Please note that this is exactly
what happens today - every application running in a UTF-8 locale and
facing incompatible filenames has to deal with this problem. I wonder
why so many people pretend that the current situation is "more or less
okay".

> Not to mention the extremely unpleasant performance implications.

You always lose a bit of performance when dealing with Unicode. Just
accept it.

Regards,
Eduard.
-- 
Long is the way through teaching, short and effective through examples.
		-- Lucius Annaeus Seneca (4-65 AD)

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 14:28                         ` Jamie Lokier
@ 2004-02-16 19:22                           ` Eduard Bloch
  2004-02-16 21:44                             ` Jamie Lokier
  0 siblings, 1 reply; 68+ messages in thread
From: Eduard Bloch @ 2004-02-16 19:22 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Linux kernel

#include <hallo.h>
* Jamie Lokier [Mon, Feb 16 2004, 02:28:07PM]:

> > >    I think this is because the character encoding used by the terminal
> > >    should be in the TERM environment variable, but it is in LANG instead.
> > 
> > No. TERM does not have anything to do with locales (LANG).
> 
> No.  The locale should not have anything to do with the appropriate
> byte sequences needed to make the terminal display characters.

Heh. It would be very nice if we had the situation that you describe,
but that is actually not the case. TERM specifies the general
capabilities of the terminal. It does _not_ tell the application inside
which FONT encoding is used, nor whether it is compatible with multibyte
input.

> It is wrong that LANG must have a different value depending on whether
> I log in using a DEC VT100 or a Gnome Terminal, even though I wish to
> see exactly the same language, dialect, messages, number formats,
> currency formats, dates and times.

Nonsense, sorry. How should your application know how to encode its
output? How should it know which font is used? I have heard about some
magic strings that an application can send to the xterm (when TERM=xterm)
to tell it to change the font encoding (similar to the string used to
set the window title used by mc, for example). But this is an extension,
not mandatory for a general implementation of a "terminal".

> It is acceptable that LANG may control the encoding stored in files
> and filenames, but this should be independent of the terminal type.

And what controls the font setting? (see above)

> It is especially wrong that libraries which should be
> locale-independent - such as curses, slang and readline - must read
> the LANG variable in addition to TERM.  If curses does not read and

See above. Especially since different chars are used to draw graphical
characters (lines, boxes, ...), they _must_ know which font encoding
they have to expect.

Regards,
Eduard.
-- 
Coincidences are the means by which fate carries out its most
important plans for us.
		-- Charles Tschopp

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 17:58                         ` Valdis.Kletnieks
@ 2004-02-16 19:48                           ` Pascal Schmidt
  0 siblings, 0 replies; 68+ messages in thread
From: Pascal Schmidt @ 2004-02-16 19:48 UTC (permalink / raw)
  To: Valdis.Kletnieks; +Cc: linux-kernel

On Mon, 16 Feb 2004 Valdis.Kletnieks@vt.edu wrote:

> Oh?
>
> % rm *
> % touch foo1 bar1    # this calls creat() or open() or similar
> % touch foo2 bar2	# as will this...
> % echo foo*	# and this will do a readdir(), presumably
>
> Do you have any expectations what the echo will do?  Obviously the glob
> DOES have a relationship with previous shell operations.

Yes, and? One may expect the echo to give "foo1 foo2", but that depends
on a lot of other conditions, such as no other process doing things in
the current directory. The same is true in a program - if you need
to know whether you could create a file, the only sane way is to use
creat() from an application and look at the return value. No other
method is meaningful - arbitrary things can happen between creating
a file and running readdir().
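
A minimal sketch of "look at the return value" in C.  Note one swap:
plain creat() succeeds and truncates when the file already exists, so
this sketch uses open() with O_CREAT | O_EXCL instead, which fails
with EEXIST in that case.  The helper name create_exclusive() is made
up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if this call created path, or a
 * negative errno value (for example -EEXIST) if it did not. */
int create_exclusive(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -errno;      /* creation failed; errno says why */
    close(fd);
    return 0;
}
```

The answer comes from the one system call that did the work, so
nothing another process does in the directory afterwards can change
it, which is not true of a later readdir() scan.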

> The point is that *if* we assume that glibc is going to do some magic
> conversion when creating a file, we are assuming that glibc will
> *always* keep the conversion hidden. No matter what.  Because the user
> now has expectations of what that file was called when he created it -
> the string he passed to open()/creat().  If what gets handed to the
> kernel is something different, we have to make sure that the user never
> finds out about it.

That way lies madness, I agree. The sane thing (though it breaks
existing applications) would be to reject any filename that is not valid
UTF-8, returning -EINVAL. I don't think *that* is going to happen,
though. ;)

-- 
Ciao,
Pascal

* Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
  2004-02-16 19:22                           ` Eduard Bloch
@ 2004-02-16 21:44                             ` Jamie Lokier
  0 siblings, 0 replies; 68+ messages in thread
From: Jamie Lokier @ 2004-02-16 21:44 UTC (permalink / raw)
  To: Eduard Bloch; +Cc: Linux kernel

Eduard Bloch wrote:
> TERM specifies the general capabilities of the terminal. It does
> _not_ tell the application inside which FONT encoding is used, nor
> whether it is compatible with multibyte input.

It should - especially the multibyte encoding.

The font is irrelevant; our trouble here is *character encoding* which
has nothing to do with fonts.  Please don't use the incorrect term as
there is widespread confusion over it already.

That isn't just about which glyph is displayed in response to each
byte.  UTF-8 affects terminal escape sequence parsing, and also the
relationship between number of non-control bytes transmitted and the
distance moved by the cursor.

If I write a UTF-8 string to a VT220-like terminal (such as xterm
approximates), some text characters are interpreted as terminal
commands.  (Hint: 0x9b (which can occur in UTF-8 text) is equivalent
to 0x1b 0x5b, the control sequence introducer; there are others too).
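
A small sketch in C of how such a byte shows up in ordinary text.  The
helper name has_c1_byte() is hypothetical; it just scans for bytes in
the 8-bit C1 control range 0x80..0x9F, which a terminal that is not in
UTF-8 mode may act on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if any of the n bytes at s falls in
 * the C1 control range 0x80..0x9F, 0 otherwise. */
int has_c1_byte(const unsigned char *s, size_t n)
{
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i] >= 0x80 && s[i] <= 0x9F)
            return 1;
    }
    return 0;
}
```

For instance U+00DB (capital U with circumflex) is encoded as 0xC3
0x9B, so its second byte is exactly the 8-bit CSI, whereas the pound
sign (0xC2 0xA3) contains no C1 byte.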

When you edit a line with the unix terminal line editor, when you type
DEL, it writes BACKSPACE-SPACE-BACKSPACE and removes one byte from the
input.  That utterly fails to do the right thing on UTF-8 terminals.
For example, run the command "cat" by itself, then type "£££", then
hit DEL twice - it will show one pound sterling sign.  Press enter,
and cat will echo the line containing _two_ pound sterling signs.
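
The difference can be sketched in a few lines of C.  Both function
names are made up for this example and this is not the kernel's tty
code; it only contrasts erasing one byte with erasing one UTF-8
character from the pending input line.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-wise erase: drop one byte from a line of length len. */
size_t erase_last_byte(size_t len)
{
    return len > 0 ? len - 1 : 0;
}

/* UTF-8-aware erase: drop trailing continuation bytes (10xxxxxx) and
 * then the lead byte in front of them.  Returns the new length. */
size_t erase_last_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len > 0 && (buf[len - 1] & 0xC0) == 0x80)
        len--;              /* skip back over continuation bytes */
    if (len > 0)
        len--;              /* remove the lead (or ASCII) byte */
    return len;
}
```

Three pound signs are six bytes (0xC2 0xA3 three times).  Two
byte-wise erases leave four bytes, which is two pound signs, matching
what cat echoes; two UTF-8-aware erases leave two bytes, which is the
single pound sign still shown on the screen.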

No setting of LANG or TERM makes that behave correctly.

So, do you think the kernel's line editor should be locale-aware too? :)

> > It is wrong that LANG must have a different value depending on whether
> > I log in using a DEC VT100 or a Gnome Terminal, even though I wish to
> > see exactly the same language, dialect, messages, number formats,
> > currency formats, dates and times.

NB: It's wrong because LANG should be for terminal-independent locale
properties, such as which languages I want to use and how I want text
files stored.

If I log into a remote machine, I want characters displayed according
to the local terminal's requirements, but I want text files and
filenames to use the remote machine's locale, naturally.

> Nonsense, sorry. How should your application know how to encode its
> output?

Increasingly I'm thinking UTF-8-ness should be a terminal capability,
like ocrnl.  The kernel's own line editor needs to know this property
anyway, and it would really help with moving filenames and everything
else over to UTF-8 - with no change to the simple unix programs such
as the shell utilities.

> > It is especially wrong that libraries which should be
> > locale-independent - such as curses, slang and readline - must
> > read the LANG variable in addition to TERM.
> 
> See above. Especially since different chars are used to draw graphical
> characters (lines, boxes, ...), they _must_ know which font encoding
> they have to expect.

See "acsc" in the terminfo(5) database.  Line & box drawing characters
have been treated as a terminal capability for a long time.  Case made :)

-- Jamie

end of thread, other threads:[~2004-02-16 21:44 UTC | newest]

Thread overview: 68+ messages
2004-02-12 16:50 JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Nicolas Mailhot
2004-02-12 18:12 ` Robin Rosenberg
2004-02-13  3:03 ` Jamie Lokier
2004-02-13 10:07   ` Robin Rosenberg
2004-02-13 18:06   ` Nicolas Mailhot
2004-02-13 18:15     ` viro
2004-02-13 18:24       ` Valdis.Kletnieks
2004-02-13 18:31         ` viro
2004-02-13 20:27           ` Jamie Lokier
2004-02-13 18:31       ` Richard B. Johnson
2004-02-13 18:50         ` JFS default behavior Ulrich Drepper
2004-02-13 22:39         ` JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Robin Rosenberg
     [not found] <1pvrI-8bq-29@gated-at.bofh.it>
     [not found] ` <1pvrI-8bq-31@gated-at.bofh.it>
     [not found]   ` <1pvrJ-8bq-33@gated-at.bofh.it>
     [not found]     ` <1pvrJ-8bq-35@gated-at.bofh.it>
     [not found]       ` <1pvrJ-8bq-37@gated-at.bofh.it>
     [not found]         ` <1pvrJ-8bq-39@gated-at.bofh.it>
     [not found]           ` <1pvrJ-8bq-41@gated-at.bofh.it>
     [not found]             ` <1pvrJ-8bq-43@gated-at.bofh.it>
     [not found]               ` <1pTay-3hc-13@gated-at.bofh.it>
     [not found]                 ` <1pTay-3hc-15@gated-at.bofh.it>
     [not found]                   ` <1pTay-3hc-11@gated-at.bofh.it>
     [not found]                     ` <1pTu7-3Ce-7@gated-at.bofh.it>
2004-02-16 17:26                       ` Pascal Schmidt
2004-02-16 17:58                         ` Valdis.Kletnieks
2004-02-16 19:48                           ` Pascal Schmidt
     [not found] <1nioI-5Re-1@gated-at.bofh.it>
     [not found] ` <1orqh-6gs-47@gated-at.bofh.it>
     [not found]   ` <1ozGR-60N-1@gated-at.bofh.it>
     [not found]     ` <1oAa3-6pR-37@gated-at.bofh.it>
     [not found]       ` <1oBpi-7pO-1@gated-at.bofh.it>
     [not found]         ` <1oCbM-8oW-9@gated-at.bofh.it>
     [not found]           ` <1p9Kl-7BV-1@gated-at.bofh.it>
     [not found]             ` <1piXj-1d3-3@gated-at.bofh.it>
2004-02-15 14:26               ` Pascal Schmidt
     [not found]               ` <1pRLy-21o-31@gated-at.bofh.it>
     [not found]                 ` <1pSRf-31Z-5@gated-at.bofh.it>
2004-02-16 15:44                   ` Pascal Schmidt
2004-02-16 15:59                     ` Valdis.Kletnieks
     [not found] <04Feb13.024659est.41760@gpu.utcc.utoronto.ca>
2004-02-13 17:57 ` Nicolas Mailhot
     [not found] <04Feb13.015940est.41760@gpu.utcc.utoronto.ca>
2004-02-13 10:26 ` Robin Rosenberg
  -- strict thread matches above, loose matches on Subject: below --
2004-02-09 11:58 UTF-8 in file systems? xfs/extfs/etc Nico Schottelius
2004-02-11  6:39 ` Tim Connors
2004-02-11 16:35   ` JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.) Dave Kleikamp
2004-02-12  0:45     ` Andy Isaacson
2004-02-12  1:19       ` Tim Connors
2004-02-12  3:54       ` jw schultz
2004-02-12 12:03         ` Robin Rosenberg
2004-02-12  8:54       ` Jamie Lokier
2004-02-12 15:55         ` Robin Rosenberg
2004-02-12 16:17           ` John Bradford
2004-02-12 16:40             ` Robin Rosenberg
2004-02-12 17:16               ` John Bradford
2004-02-12 18:06                 ` Robin Rosenberg
2004-02-12 19:08                   ` John Bradford
2004-02-12 19:39                     ` Robin Rosenberg
2004-02-12 21:13                       ` John Bradford
2004-02-12 22:29                         ` Robin Rosenberg
2004-02-12 22:50                           ` Valdis.Kletnieks
2004-02-13  2:58                           ` Jamie Lokier
2004-02-13  9:48                             ` Robin Rosenberg
2004-02-13  3:15                         ` Jamie Lokier
2004-02-14 15:24                     ` Eduard Bloch
2004-02-13  0:17             ` Jamie Lokier
2004-02-13  0:38           ` Jamie Lokier
2004-02-13  1:16             ` Robin Rosenberg
2004-02-13  1:23               ` Jamie Lokier
2004-02-13  1:46                 ` Robin Rosenberg
2004-02-13  2:29               ` viro
2004-02-13  3:23                 ` Jamie Lokier
2004-02-14 15:09                   ` Eduard Bloch
2004-02-15  1:01                     ` Jamie Lokier
2004-02-16 14:03                       ` Eduard Bloch
2004-02-16 14:28                         ` Jamie Lokier
2004-02-16 19:22                           ` Eduard Bloch
2004-02-16 21:44                             ` Jamie Lokier
2004-02-16 15:18                         ` Valdis.Kletnieks
2004-02-16 15:32                           ` Jamie Lokier
2004-02-16 19:13                             ` Eduard Bloch
2004-02-16 15:46                           ` John Bradford
2004-02-16 15:48                             ` viro
2004-02-16 16:43                               ` John Bradford
2004-02-16 16:25                             ` Robin Rosenberg
2004-02-16 15:27                         ` Jamie Lokier
2004-02-16 15:44                         ` Robin Rosenberg
2004-02-13 10:03                 ` Robin Rosenberg
2004-02-13 10:22                   ` vda
2004-02-13 10:29                     ` Robin Rosenberg
2004-02-12 13:28       ` Dave Kleikamp
2004-02-12 15:26       ` Valdis.Kletnieks
2004-02-12 15:41         ` Dave Kleikamp
