* Re: UTF-8 practically vs. theoretically in the VFS API
@ 2004-02-23 11:35 Norman Diamond
From: Norman Diamond @ 2004-02-23 11:35 UTC (permalink / raw)
To: Eric W. Biederman, linux-kernel
Eric W. Biederman wrote:
> First it is worth noting that the existing practice is that ttys
> always use the character set encoding of the user.
Each tty uses the character set encoding of that tty's user. There were
times when I needed to have some tty windows open using EUC (ordinary work
on that Linux machine) and some tty windows open using SJIS (editing files
which would be sent to cellular telephones), in the same X session. They
worked.
> Even X cut and paste frequently abuses the iso8859-1 range,
I'll take your word for it. I've copied and pasted EUC strings, I've copied
and pasted SJIS strings, I don't know if X copy and paste abused EUC or SJIS
ranges, but it worked.
One thing I never thought of trying to test is to copy and paste between one
tty using EUC and one tty using SJIS.
> Now the work is how to get multiple locales to play nicely with each
> other. utf-8 and unicode are convenient for that as they preserve the
> existing assumptions that terminals, filenames, and text files are
> all using the same character set encoding, even when multiple locales
> are involved.
>
> So within one machine utf-8 solves the multiple locale problem.
That preserves a nice fiction. If you depend on assuming that fiction,
you'll get useless results.
> The rule ``All data passing through a pseudo-tty is in the
> character set encoding specified by the locale of the owner of the
> tty'' seems both reasonable and no significant change from the current
> status quo.
Yes, that is a return to usability.
> On the wire between two machines I recommend passing unicode
> characters.
Why should the wire get a different encoding than the user set in the
pseudo-tty? Consider TeraTerm. The user tells TeraTerm what character set
is in use on the wire, which is the same as the character set in use on the
remote side (where sshd or whatever server provides the pseudo-tty).
TeraTerm converts between that and the local character set (where the
TeraTerm program and window and user get the character set decided for them
by someone in Sasazuka or Redmond).
> By convention glibc stores unicode values in wchar_t.
That is hard to believe. glibc existed before Unicode did and wchar_t
existed before Unicode did. I thought that glibc existed in Japan at
the time, though I could be wrong; I am not saying this is impossible, merely
hard to believe. In commercial Unix systems, wchar_t held either EUC or
SJIS depending on the vendor.
As usual I do not even have time to keep up with this thread, so if you have
questions then please CC me personally, though I don't know if I'll have
time to investigate anything that needs it.
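The TeraTerm-style conversion described above (wire charset on one side, local charset on the other) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `reencode` is a hypothetical helper name, and the codec names `euc_jp` and `shift_jis` are Python's names for EUC-JP and SJIS, not anything TeraTerm itself uses.

```python
def reencode(data: bytes, wire_charset: str, local_charset: str) -> bytes:
    """Convert bytes received in wire_charset into local_charset.

    Hypothetical helper: decodes to abstract text first, then encodes again.
    """
    text = data.decode(wire_charset)    # raises UnicodeDecodeError on bad input
    return text.encode(local_charset)   # raises UnicodeEncodeError if unrepresentable


sample = "日本語"
euc_bytes = sample.encode("euc_jp")                       # what the remote pseudo-tty emits
sjis_bytes = reencode(euc_bytes, "euc_jp", "shift_jis")   # what the local side displays
```

The same text has different byte sequences on each side, which is why the user has to tell the terminal program which charset is on the wire.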
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Junio C Hamano @ 2004-02-23 19:13 UTC
To: Norman Diamond; +Cc: linux-kernel

>>>>> "ND" == Norman Diamond <ndiamond@wta.att.ne.jp> writes:

ND> Eric W. Biederman wrote:
>> Even X cut and paste frequently abuses the iso8859-1 range,

ND> I'll take your word for it. I've copied and pasted EUC
ND> strings, I've copied and pasted SJIS strings, I don't know
ND> if X copy and paste abused EUC or SJIS ranges, but it
ND> worked.

I do not know what Eric means by "abusing the iso8859-1 range", but
passing X selection between traditional X clients IIRC uses compound
text, which is an encoding vaguely similar to ISO-2022, so clients
like kterm can convert it back and forth with EUC or SJIS as needed.
* Re: JFS default behavior
From: Robin Rosenberg @ 2004-02-14 23:06 UTC
To: viro; +Cc: Linux kernel

On Saturday 14 February 2004 16.40, you wrote:
> The same goes for file names. Filename is a sequence of bytes, no more and
> no less. Anything beyond that belongs to applications.

Should be a sequence of characters since humans are supposed to use them and
it should be the same characters whenever possible regardless of user's locale.
The "sequence of bytes" idea is a legacy from prehistoric times when
byte == character was true. That is no longer the case and actually hasn't
been for quite a while in some parts of the world.

Interchange is important. The application cannot handle this since it cannot
know what characters a byte string represents. Fixing it in the kernel is the
simple solution since it knows the locale. It's also a small change I believe.

Having an iocharset option for all file systems makes it backward compatible
and creates a migration path to UTF-8 as system default locale.

-- robin
* Re: JFS default behavior
From: viro @ 2004-02-14 23:29 UTC
To: Robin Rosenberg; +Cc: Linux kernel

On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> On Saturday 14 February 2004 16.40, you wrote:
> > The same goes for file names. Filename is a sequence of bytes, no more and
> > no less. Anything beyond that belongs to applications.
>
> Should be a sequence of characters since humans are supposed to use them and
> it should be the same characters whenever possible regardless of user's locale.
> The "sequence of bytes" idea is a legacy from prehistoric times when
> byte == character was true.

Bullshit. It has _nothing_ to do with characters, wide or not. For the
system, filenames are opaque. The only things that have special meanings are:
	octet 0x2f ('/') splits the pathname into components
	"." as a component has a special meaning
	".." as a component has a special meaning.
That's it. The rest is never interpreted by the kernel.

> Having an iocharset option for all file systems makes it backward compatible
> and creates a migration path to UTF-8 as system default locale.

Try to realize that different users CAN HAVE DIFFERENT LOCALES. On the same
system. And have files on the same fs. Moreover, homedirs that used to be
on different filesystems can end up on the same fs. What iocharset would
you use, then?

Sigh... Again, there is no such thing as iocharset of filesystem - it varies
between users and users can and do share filesystems. Think of /home; think
of /tmp.

It isn't feasible. At all. Just as timezone doesn't belong in kernel, locales
have no place there.
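The rule stated in the message above (only 0x2f, "." and ".." mean anything) can be sketched in Python. This is a minimal illustration, not the kernel's lookup code; `split_path` and `is_special` are hypothetical helper names, and collapsing repeated slashes is a simplification made for the sketch.

```python
def split_path(path: bytes):
    """Split a pathname on octet 0x2f only; all other byte values pass through."""
    return [component for component in path.split(b"/") if component]


def is_special(component: bytes) -> bool:
    """Only "." and ".." carry meaning; everything else is opaque bytes."""
    return component in (b".", b"..")


# 0xe9 is not valid UTF-8 on its own, but the splitting logic does not care.
components = split_path(b"/home//user/../caf\xe9")
special = [c for c in components if is_special(c)]
```

Nothing here needs to know which charset the bytes are in, which is the point being made.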
* Re: JFS default behavior
From: Robin Rosenberg @ 2004-02-15 0:07 UTC
To: viro; +Cc: Linux kernel

On Sunday 15 February 2004 00.29, you wrote:
> On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> > The "sequence of bytes" idea is a legacy from prehistoric times when
> > byte == character was true.
>
> Bullshit. It has _nothing_ to do with characters, wide or not. For the
> system, filenames are opaque. The only things that have special meanings are:
>	octet 0x2f ('/') splits the pathname into components
>	"." as a component has a special meaning
>	".." as a component has a special meaning.
> That's it. The rest is never interpreted by the kernel.

I know how it is (to some degree), and it's wrong. The user sees inside the
filename and sees a string of characters, not a byte sequence.

> Try to realize that different users CAN HAVE DIFFERENT LOCALES. On the same
> system. And have files on the same fs. Moreover, homedirs that used to be
> on different filesystems can end up on the same fs. What iocharset would
> you use, then? Sigh...

Ok, I've got the iocharset option wrong, god knows why. The problem however
remains.

It seems you simply don't want to understand the problem, which is that users
CAN HAVE DIFFERENT LOCALES on the same system and on different systems.
Sigh...

I am less concerned with which solution than that a solution should be found.
So it seems no file system has a solution today. Still an iocharset option
would relieve the problem for removable media and multi-boot systems.

Most linux machines are essentially single user and have either the same
locale for all users or all users are using UTF-8 with their locale. It's not
the locale, but the charset used for encoding the locale. The rest cannot be
helped.

-- robin
* Re: JFS default behavior
From: Linus Torvalds @ 2004-02-15 2:41 UTC
To: Robin Rosenberg; +Cc: viro, Linux kernel

On Sun, 15 Feb 2004, Robin Rosenberg wrote:
> >
> > Bullshit. It has _nothing_ to do with characters, wide or not. For the
> > system, filenames are opaque. The only things that have special meanings are:
> >	octet 0x2f ('/') splits the pathname into components
> >	"." as a component has a special meaning
> >	".." as a component has a special meaning.
> > That's it. The rest is never interpreted by the kernel.
>
> I know how it is (to some degree), and it's wrong. The user sees inside the
> filename and sees a string of characters, not a byte sequence.

Yes, the user sees a string of characters, but the octet 0x2f ('/') and the
terminating NUL character '\0' are still perfectly normal characters and
there is no confusion.

The reason: UTF-8. It's the only sane encoding (apart from a pure extended
ASCII setup, which is also sane, but is obviously unacceptable for a large
portion of the world).

If some misguided person has told you about UCS-2 and horrors like UTF-9,
just ignore them. They are crazy and deluded, and - perhaps more
importantly - stupid.

In short: the kernel talks bytestreams, and that implies that if you want
to talk to the kernel, you HAVE TO USE UTF-8.

At which point there are no locale issues any more. The only locale issue
you can have is user space mistaking a stream of bytes as extended ASCII,
which will cause all your pretty UTF-8 characters to be shown as strange
latin1 (or other) squiggles.

> It seems you simply don't want to understand the problem, which is that users
> CAN HAVE DIFFERENT LOCALES on the same system and on different systems.
> Sigh...

People understand the problem.

And UTF-8 is the solution.

It's getting there. I think even Microsoft has seen the light, and is
phasing out their crapola (UCS-2LE? Whatever).

> I am less concerned with which solution than that a solution should be found.
> So it seems no file system has a solution today. Still an iocharset option
> would relieve the problem for removable media and multi-boot systems.

No. Things like "iocharset" are not the solution. They are literally the
_problem_. The solution is to use something that not only acts as ASCII,
but also has a wide enough range to cover the whole required space (UCS-2
fails _both_ of these fundamental tests). At which point "iocharset" makes
no sense any more, and only exists as a way to translate legacy crap into
the one true format.

And that one true format is UTF-8. End of story. If you try to talk to the
kernel in UCS-2 or anything else, you _will_ fail.

		Linus
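The "acts as ASCII" property claimed above can be checked with a short Python sketch. It assumes Python's `utf-16-le` codec as a stand-in for UCS-2LE (the two agree for the characters used here), and `has_reserved_octets` is a hypothetical helper name.

```python
def has_reserved_octets(encoded: bytes) -> bool:
    """True if the encoded form contains '/' (0x2f) or NUL (0x00) octets."""
    return 0x2F in encoded or 0x00 in encoded


name = "räksmörgås 日本語"           # the text itself contains no '/' and no NUL
utf8 = name.encode("utf-8")
ucs2 = name.encode("utf-16-le")     # stand-in for UCS-2LE

# U+2F2F is a single character whose UCS-2LE form is the two octets 0x2f 0x2f.
tricky_ucs2 = "\u2f2f".encode("utf-16-le")
tricky_utf8 = "\u2f2f".encode("utf-8")
```

In UTF-8 every octet of a non-ASCII character is 0x80 or above, so a byte-oriented kernel never sees a spurious '/' or NUL; the two-byte encoding offers no such guarantee.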
* UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
From: Marc Lehmann @ 2004-02-16 18:36 UTC
To: Linus Torvalds; +Cc: viro, Linux kernel

[I may be a bit late in response, but AFAICS these points have not yet been
mentioned]

On Sat, Feb 14, 2004 at 06:41:20PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:

[discussion on why UTF-8 is the only sane encoding, which I absolutely
agree with, removed]

> In short: the kernel talks bytestreams, and that implies that if you want
> to talk to the kernel, you HAVE TO USE UTF-8.

This is not the problem at all. It's perfectly easy to write
applications that talk UTF-8 and just UTF-8 with the kernel.

The problem is that the kernel does not use UTF-8, i.e. applications in
the current linux model have to deal with the fact that the kernel
happily breaks the assumed protocol of using UTF-8 by delivering illegal
byte sequences to applications.

There is no way for applications to handle UTF-8 and illegal-utf8 in
a sane way, so most apps will either eat the illegal bytes, skip the
filename, or crash (the latter case is clearly a bug in the app, the
former cases aren't).

Fixing the VFS to actually enforce what linus claims ("filenames are
utf-8") is a very good idea, imho.

As I understand it, the reason linux currently doesn't, is that this
utf-8 rule was obviously non-enforceable in practice in recent years,
since UTF-8 simply wasn't widespread (even today, applications such as
bash or grep are clearly not UTF-8 ready, as they start to crawl in UTF-8
locales without special patches, and even with special patches).

So the only sane way to implement this enforcement is using an additional
mount flag, e.g. "force-utf8".

An encoding=xyz mount flag OTOH would be total overkill, as the plan
must be to switch to UTF-8 in the long run, while allowing deviating
behaviour in the short run.

Conversely, filesystems such as NTFS, VFAT etc. need to convert from the
fs encoding to UTF-8 and vice versa automatically, at least when this
flag is specified. It should become the default in some future linux
version.

> People understand the problem. And UTF-8 is the solution.

The kernel needs to fully implement it. Just as a kernel accepting:

   open ("directory", O_WRONLY); write (dirfd, ...)...
   open ("/some/file", ...)
   mkdir ("../some/file", ...)

is considered rather broken behaviour from unix kernels (although these
might have been allowed in some dialects or versions of unix) today, this:

   mkdir ("</ encoded using illegal multibyte sequence>", ...)

will be considered broken behaviour in the future. The RFC defining UTF-8
clearly considers this a bug in UTF-8 implementations, and the kernel in
fact does NOT implement UTF-8 right now. Although some people claim that
the kernel accepting UTF-8 (and more) is correct behaviour, it isn't
according to the RFC.

> It's getting there. I think even Microsoft has seen the light, and is
> phasing out their crapola (UCS-2LE? Whatever).

Microsoft and Java officially use UTF-16 nowadays. The funny thing is
that "next character" iterators in both languages skip to the next word
in UCS-2, so the claim of both parties of UTF-16 support is basically a
marketing lie.

> No. Things like "iocharset" are not the solution. They are literally the
> _problem_. The solution is to use something that not only acts as ASCII,

[full agreement]

> And that one true format is UTF-8. End of story. If you try to talk to the
> kernel in UCS-2 or anything else, you _will_ fail.

Just that the kernel does not support UTF-8. It delivers and accepts
non-UTF-8 strings such as \xc0\x80.

The kernel clearly should not deliver broken characters when the
official stanza is that the linux VFS API is UTF-8 only (see 3.2,
Chapter 3, C12, conformance, on why it currently isn't UTF-8).

--
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |
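The kind of check a hypothetical "force-utf8" mount flag would have to perform on each filename can be sketched in Python. No such flag is confirmed by this thread to exist; `is_valid_utf8` is an assumed helper name, and the sketch relies only on Python's strict UTF-8 decoder.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True only if name is well-formed UTF-8 (strict decoding)."""
    try:
        name.decode("utf-8")   # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


candidates = {
    "overlong NUL": b"\xc0\x80",                 # the example from the message above
    "latin1 name": b"r\xe4ksm\xf6rg\xe5s",       # legacy 8-bit encoding
    "proper UTF-8": "日本語".encode("utf-8"),
}
verdicts = {label: is_valid_utf8(raw) for label, raw in candidates.items()}
```

A kernel that enforced this would reject the first two names at creation time; the kernel described in this thread accepts all three.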
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
From: Linus Torvalds @ 2004-02-16 18:49 UTC
To: Marc Lehmann; +Cc: viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> > In short: the kernel talks bytestreams, and that implies that if you want
> > to talk to the kernel, you HAVE TO USE UTF-8.
>
> This is not the problem at all. It's perfectly easy to write
> applications that talk UTF-8 and just UTF-8 with the kernel.
>
> The problem is that the kernel does not use UTF-8, i.e. applications in
> the current linux model have to deal with the fact that the kernel
> happily breaks the assumed protocol of using UTF-8 by delivering illegal
> byte sequences to applications.

You didn't read what I said.

READ MY POSTING. You even quoted it, but you didn't understand it.

I'm saying that "the kernel talks bytestreams". I have never claimed that
the kernel really talks UTF-8, and indeed, I would say that such a kernel
would be terminally and horribly broken.

The kernel is _agnostic_ in what it does. As it should be. It doesn't
really care AT ALL what you feed it, as long as it is a byte-stream.

Now, that implies that if you want to have extended characters, then YOU
HAVE TO USE UTF-8. That's what I'm saying.

I am _not_ saying that the kernel uses UTF-8. The kernel doesn't care one
way or the other. As far as the kernel is concerned, you could uuencode
all the stuff, and the kernel wouldn't think you're crazy.

The kernel _only_ cares about byte streams. And that is as it should be.

> There is no way for applications to handle UTF-8 and illegal-utf8 in
> a sane way, so most apps will either eat the illegal bytes, skip the
> filename, or crash (the latter case is clearly a bug in the app, the
> former cases aren't).

What you're complaining about are bad user applications. It has _zero_ to
do with the kernel.

> Fixing the VFS to actually enforce what linus claims ("filenames are
> utf-8") is a very good idea, imho.

No. Read my claim again. You obviously do not understand it AT ALL.

What you suggest would be a horribly idiotic and bad idea.

The kernel doesn't set policy. The kernel says "this is what I can do, you
set policy". And UTF-8 just happens to be the only sane policy for
encoding complex characters into a byte stream.

But it is not the only policy. Another sane policy is to say "byte streams
are latin1". It's not an acceptable policy for encoding _complex_
characters, but it is a policy. And it's a perfectly sane one.

In short: filenames are byte streams. Nothing more. They don't even have a
"character set". They literally are just a series of bytes.

And when I say that you have to talk to the kernel using UTF-8, I'm only
claiming that it is the only sane way to encode extended characters in a
byte stream. Nothing more.

		Linus
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Jeff Garzik @ 2004-02-16 19:26 UTC
To: Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Linus Torvalds wrote:
> In short: filenames are byte streams. Nothing more. They don't even have a
> "character set". They literally are just a series of bytes.
>
> And when I say that you have to talk to the kernel using UTF-8, I'm only
> claiming that it is the only sane way to encode extended characters in a
> byte stream. Nothing more.

Nod. Maybe it helps Marc to point out the key difference between
characters and bytes, in UTF8.

In UTF8, the number of characters in a string is less-than-or-equal-to
the number of bytes in the string.

And the kernel just cares about bytes.

This is the whole benefit to UTF8, right here in this thread. UTF8 was
designed such that ten-year-old C code using standard C strings would
function just fine. No need to rip up large swaths of your code just to
call multi-byte versions of the standard string functions. Most code
that doesn't deal with locale-specific details like uppercase/lowercase
Just Works(tm).

	Jeff
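The characters-versus-bytes relation described above is easy to check. The following Python sketch assumes that "characters" means Unicode code points (which is what Python's `len` counts on a `str`); `char_and_byte_counts` is a hypothetical helper name.

```python
def char_and_byte_counts(text: str):
    """Return (number of code points, number of UTF-8 octets) for text."""
    encoded = text.encode("utf-8")
    return len(text), len(encoded)


counts = {
    sample: char_and_byte_counts(sample)
    for sample in ("document.doc", "räksmörgås", "日本語")
}

# Byte-oriented code keeps working: splitting on '/' never cuts a character.
path = b"/tmp/" + "日本語.txt".encode("utf-8")
basename = path.rsplit(b"/", 1)[1]
```

The last two lines illustrate the "ten-year-old C code" point: a plain byte-level split on '/' still yields an intact filename.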
* Re: UTF-8 practically vs. theoretically in the VFS API
From: John Bradford @ 2004-02-16 19:48 UTC
To: Jeff Garzik, Marc Lehmann; +Cc: Linus Torvalds, viro, Linux kernel

Quote from Jeff Garzik <jgarzik@pobox.com>:
> Linus Torvalds wrote:
> > In short: filenames are byte streams. Nothing more. They don't even have a
> > "character set". They literally are just a series of bytes.
> >
> > And when I say that you have to talk to the kernel using UTF-8, I'm only
> > claiming that it is the only sane way to encode extended characters in a
> > byte stream. Nothing more.
>
>
> Nod. Maybe it helps Marc to point out the key difference between
> characters and bytes, in UTF8.
>
> In UTF8, the number of characters in a string is less-than-or-equal-to
> the number of bytes in the string.
>
> And the kernel just cares about bytes.
>
> This is the whole benefit to UTF8, right here in this thread. UTF8 was
> designed such that ten-year-old C code using standard C strings would
> function just fine. No need to rip up large swaths of your code just to
> call multi-byte versions of the standard string functions. Most code
> that doesn't deal with locale-specific details like uppercase/lowercase
> Just Works(tm).

The real problem is with mis-configured userspaces, where buggy UTF-8
decoders are trying to make sense of data in legacy encodings
containing essentially random bytes > 127, which are not part of valid
UTF-8 sequences.

None of this is a real problem, if everything is set up correctly and
bug free. Unfortunately the Just Works thing falls apart in the,
(frequent), instances that it's not :-(.

John.
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Linus Torvalds @ 2004-02-16 19:48 UTC
To: John Bradford; +Cc: Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, 16 Feb 2004, John Bradford wrote:
>
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
>
> None of this is a real problem, if everything is set up correctly and
> bug free. Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you
really have to. Always leave the string in utf-8 "raw bytestring" mode as
long as possible, and convert to character sets only when actually
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being erroneous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

		Linus
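The "keep raw bytes, decode only when printing" approach described above might look like this in Python. It is a sketch under assumptions: `display_name` is a hypothetical helper, and the `backslashreplace` error handler is just one possible way of marking the erroneous bytes for the user.

```python
def display_name(raw: bytes) -> str:
    """Decode only for printing; invalid octets are shown as \\xNN escapes."""
    return raw.decode("utf-8", errors="backslashreplace")


# Names as they might come back from a directory listing in bytes mode.
raw_names = ["café.txt".encode("utf-8"), b"caf\xe9.txt"]   # second one is latin1

# The program keys everything on the raw bytes and only decodes for display.
shown = {raw: display_name(raw) for raw in raw_names}
```

Lookups and deletions would keep using the `bytes` keys (for example via `os.listdir(b".")`, which returns `bytes` names), so a name that is not proper UTF-8 looks odd on screen but still works.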
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Marc Lehmann @ 2004-02-16 20:20 UTC
To: Linus Torvalds; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> works on the raw byte sequence and isn't confused). Basically accept the
> fact that UTF-8 strings can contain "garbage", and don't try to fix it up.

But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
well-defined and is always proper UTF-8. It's a tautology.

The very idea of "UTF-8 with garbage in it" doesn't make sense.

> And no, I'm not claiming that it's wonderfully clean and that we should
> all love it.

It's also a totally useless idiom...

> And it largely works today.
> Linus

On ascii-only-systems, it works fine. My system is largely ascii-only,
with only very few filenames (japanese and german ones mostly) in
UTF-8. Sometimes in EUC-JP, but that's a bug in rar.

It also works fine in single-user environments where the user just
forces everything to be in her locale.

It does fail miserably on multi-user systems.
It does fail miserably in ISO-C's locale model.
It does fail miserably with gnu shellutils, fileutils and most other apps.

It fails, because it's not at all well supported by the kernel.

Claiming that it largely works today is simply not true for most
non-ascii-users (which increasingly includes the US).

--
Marc Lehmann
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Linus Torvalds @ 2004-02-16 20:26 UTC
To: Marc Lehmann; +Cc: John Bradford, Jeff Garzik, viro, Linux kernel

On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
>
> The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
 (a) there is only one standard
 (b) people read it
 (c) people actually follow it and never have bugs

I've got news for you: none of the above is true.

Which means that IN PRACTICE you will find strings that you think are
UTF-8-encoded, but that don't end up being proper UTF-8.

That's the difference between real world and theory.

And you can either write your programs to be "theoretically correct", or
you can write them to "work". It's your choice. I know which program I'd
prefer to use.

		Linus
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Rob Landley @ 2004-02-18 2:49 UTC
To: Marc Lehmann; +Cc: Linux kernel

On Monday 16 February 2004 14:20, Marc Lehmann wrote:
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it
> > up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.

Would you please learn the difference between "you are wrong" and "I
disagree"?

Rob
* Re: UTF-8 practically vs. theoretically in the VFS API
From: bert hubert @ 2004-02-16 20:21 UTC
To: Linus Torvalds; +Cc: John Bradford, Jeff Garzik, Marc Lehmann, viro, Linux kernel

On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds wrote:
> The way to handle that is to aim to never _ever_ decode utf-8 unless you
> really have to. Always leave the string in utf-8 "raw bytestring" mode as
> long as possible, and convert to character sets only when actually
> printing.

Additional good news is that following octets in a utf-8 character sequence
always have the highest order bit set, precluding / or \x0 from appearing,
confusing the kernel.

The remaining zit is that all these represent '..':

	2E 2E
	C0 AE C0 AE
	E0 80 AE E0 80 AE
	F0 80 80 AE F0 80 80 AE
	F8 80 80 80 AE F8 80 80 80 AE
	FC 80 80 80 80 AE FC 80 80 80 80 AE

This in itself is not a problem, the kernel will only recognize 2E 2E as the
real .., but it does show that 'document.doc' might be encoded in a myriad
ways.

So some guidance about using only the simplest possible encoding might be
sensible, if we don't want the kernel to know about utf-8.

> And it largely works today.

Indeed.

--
http://www.PowerDNS.com      Open source, database driven DNS Software
http://lartc.org             Linux Advanced Routing & Traffic Control HOWTO
* Re: UTF-8 practically vs. theoretically in the VFS API
From: Marc Lehmann @ 2004-02-16 20:33 UTC
To: bert hubert; +Cc: linux-kernel

On Mon, Feb 16, 2004 at 09:21:42PM +0100, bert hubert <ahu@ds9a.nl> wrote:
> The remaining zit is that all these represent '..':

No, they don't. Read the UTF-8 definition...

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it can only be encoded in exactly one way *in UTF-8*. It can of course
be encoded differently in other encodings, but in UTF-8, there is only a
single representation. There are no ambiguities.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

Fortunately, this has all already been taken care of, and is not a
problem. I mean, the _definition_ of UTF-8 works. Whether specific
applications (whether in the kernel or apps) work is a different
question. But at least the specification is rather clear.

Compare this to the URL definition, which only hints that you don't
know the encoding, and therefore, the interpretation as text, of a URL
unless you have an extra channel that communicates it. While possible,
this channel does not exist in practice, creating big problems for
people writing i18n-ized web applications.

The thing is that the kernel certainly _works_ on a very basic level,
but I think the situation can be improved by making it clear how to
interpret filenames, which currently is not the case.

--
Marc Lehmann
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-16 20:21 ` bert hubert
2004-02-16 20:33 ` Marc Lehmann
@ 2004-02-18 2:58 ` H. Peter Anvin
2004-02-18 3:13 ` Linus Torvalds
2004-02-18 7:25 ` bert hubert
1 sibling, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 2:58 UTC (permalink / raw)
To: linux-kernel

Followup to: <20040216202142.GA5834@outpost.ds9a.nl>
By author: bert hubert <ahu@ds9a.nl>
In newsgroup: linux.dev.kernel
>
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
>

Indeed. The original name for the encoding was, in fact, "FSS-UTF",
for "filesystem safe Unicode transformation format."

> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE
> F0 80 80 AE F0 80 80 AE
> F8 80 80 80 AE F8 80 80 80 AE
> FC 80 80 80 80 AE FC 80 80 80 80 AE

No, they don't. The first represents "..", the rest are illegal
encodings and do not decode to anything.

Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard. The
only acceptable decodings are to throw an error, or to use an out-of-band
encoding mechanism to denote "bad bytecode."

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it doesn't.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

UTF-8 requires the use of the shortest possible encoding. An
application which doesn't obey that and tries to be "smart" is a
security hazard.

It is a bit unfortunate that the encoding doesn't exclude these by
design, as opposed to by error checking; it makes it a little too easy
for clueless programmers to skip :(

-hpa
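A minimal sketch of the shortest-form rule described above, assuming
Python 3, whose built-in UTF-8 decoder is strict and rejects overlong
forms. The names CANDIDATES and strict_decode are illustrative only and
do not come from the thread.

```python
# The six byte strings from the table quoted above.
CANDIDATES = [
    bytes.fromhex("2E2E"),
    bytes.fromhex("C0AEC0AE"),
    bytes.fromhex("E080AEE080AE"),
    bytes.fromhex("F08080AEF08080AE"),
    bytes.fromhex("F8808080AEF8808080AE"),
    bytes.fromhex("FC80808080AEFC80808080AE"),
]


def strict_decode(raw):
    """Return the decoded text, or None if raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None


# Only the shortest form decodes; the overlong forms are errors, not "..".
results = [strict_decode(candidate) for candidate in CANDIDATES]
for candidate, text in zip(CANDIDATES, results):
    print(candidate.hex(" ").upper(), "->", repr(text))
```

A decoder that quietly accepted the longer forms would make a byte-level
check for ".." disagree with what the application later sees, which is
the hazard the message warns about.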
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 2:58 ` H. Peter Anvin @ 2004-02-18 3:13 ` Linus Torvalds 2004-02-18 3:22 ` H. Peter Anvin 2004-02-18 11:33 ` Jamie Lokier 2004-02-18 7:25 ` bert hubert 1 sibling, 2 replies; 50+ messages in thread From: Linus Torvalds @ 2004-02-18 3:13 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-kernel On Wed, 18 Feb 2004, H. Peter Anvin wrote: > > Those of us who have been involved with the issue have fought > *extremely* hard against DWIM decoders which try to decode the latter > sequences into ".." -- it's incorrect, and a security hazard. The > only acceptable decodings is to throw an error, or use an out-of-band > encoding mechanism to denote "bad bytecode." Somebody correctly pointed out that you do not need any out-of-band encoding mechanism - the very fact that it's an invalid sequence is in itself a perfectly fine flag. No out-of-band signalling required. The only thing you should make sure of is to not try to normalize it (that would hide the error). Just keep carrying the bad sequence along, and everybody is happy. Including the filesystem functions that get the "bad" name and match it exactly to what it should be matched against. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 3:13 ` Linus Torvalds @ 2004-02-18 3:22 ` H. Peter Anvin 2004-02-18 3:30 ` Linus Torvalds 2004-02-18 11:24 ` Jamie Lokier 2004-02-18 11:33 ` Jamie Lokier 1 sibling, 2 replies; 50+ messages in thread From: H. Peter Anvin @ 2004-02-18 3:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel Linus Torvalds wrote: > > On Wed, 18 Feb 2004, H. Peter Anvin wrote: > >>Those of us who have been involved with the issue have fought >>*extremely* hard against DWIM decoders which try to decode the latter >>sequences into ".." -- it's incorrect, and a security hazard. The >>only acceptable decodings is to throw an error, or use an out-of-band >>encoding mechanism to denote "bad bytecode." > > Somebody correctly pointed out that you do not need any out-of-band > encoding mechanism - the very fact that it's an invalid sequence is in > itself a perfectly fine flag. No out-of-band signalling required. > > The only thing you should make sure of is to not try to normalize it (that > would hide the error). Just keep carrying the bad sequence along, and > everybody is happy. Including the filesystem functions that get the "bad" > name and match it exactly to what it should be matched against. > Well, the reason you'd want an out-of-band mechanism is to be able to display it as some kind of escapes. Consider a UTF-8 decoder which uses values in the 0x800000xx range to encode "bogus bytes"; that way it wouldn't alias to anything else, but the bogus sequence "C0 AE" could be represented as 0x800000C0 0x800000AE and displayed to the user as \xC0\xAE\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80 C2 AE). This would make it possible to figure out in, for example, an ls listing, what those broken filenames are actually composed of. There are some advantages to being able to represent all possible byte sequences and present them to the user, even if they're bogus. 
-hpa
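A rough sketch of the "bogus byte" idea above, assuming Python 3. It
leans on Python's own "surrogateescape" error handler to find the bytes
that are not valid UTF-8 and then remaps them into the 0x800000xx range
mentioned in the message. The function name decode_with_bogus is
hypothetical, and this is only one possible way to realise the scheme.

```python
def decode_with_bogus(raw):
    """Decode raw bytes to a list of code points.

    Bytes that are not part of well-formed UTF-8 come back as
    0x80000000 | byte, so they cannot alias any real character.
    """
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DC00 + 0xNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(0x80000000 | (cp - 0xDC00))
        else:
            out.append(cp)
    return out


bogus = decode_with_bogus(b"\xc0\xae")          # overlong, not valid UTF-8
valid = decode_with_bogus(b"\xc3\x80\xc2\xae")  # U+00C0 U+00AE, valid
print([hex(cp) for cp in bogus])
print([hex(cp) for cp in valid])
```

The two inputs stay distinguishable after decoding, which is what makes
it possible to show the broken name to the user as byte escapes.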
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 3:22 ` H. Peter Anvin @ 2004-02-18 3:30 ` Linus Torvalds 2004-02-18 5:30 ` H. Peter Anvin 2004-02-18 11:24 ` Jamie Lokier 1 sibling, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2004-02-18 3:30 UTC (permalink / raw) To: H. Peter Anvin; +Cc: linux-kernel On Tue, 17 Feb 2004, H. Peter Anvin wrote: > > Well, the reason you'd want an out-of-band mechanism is to be able to > display it as some kind of escapes. I'd suggest just doing that when you convert the utf-8 format to printable format _anyway_. At that point you just make the "printable" representation be the binary escape sequence (which you have to have for other non-printable utf-8 characters anyway). And if you do things right (ie you allow user input in that same escaped output format), you can allow users to re-create the exact "broken utf-8". Which is actually important just so that the user can fix it up (ie imagine the user noticing that the filename is broken, and now needs to do a "mv broken-name fixed-name" - the user needs some way to re-create the brokenness). Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 3:30 ` Linus Torvalds
@ 2004-02-18 5:30 ` H. Peter Anvin
2004-02-18 10:29 ` Robin Rosenberg
2004-02-18 15:35 ` Linus Torvalds
0 siblings, 2 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 5:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
>
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
>>Well, the reason you'd want an out-of-band mechanism is to be able to
>>display it as some kind of escapes.
>
>
> I'd suggest just doing that when you convert the utf-8 format to printable
> format _anyway_. At that point you just make the "printable"
> representation be the binary escape sequence (which you have to have for
> other non-printable utf-8 characters anyway).
>

What does "printable" mean in this context? Typically you have to
convert it to UCS-4 first, so you can index into your font tables, then
you have to create the right composition, apply the bidirectional text
algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it
layered. What I described was the first step of that -- mostly trying
to show that "throwing an error" doesn't necessarily mean "produce no
output." What you shouldn't do, though, is alias it with legitimate input.

> And if you do things right (ie you allow user input in that same escaped
> output format), you can allow users to re-create the exact "broken utf-8".
> Which is actually important just so that the user can fix it up (ie
> imagine the user noticing that the filename is broken, and now needs to do
> a "mv broken-name fixed-name" - the user needs some way to re-create the
> brokenness).

Indeed. The C language has gone with \x77 for bytes and \u7777 or
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
think this is a good UI for shells to follow. The \x representation
then doesn't stand for characters but for bytes. It may be desirable to
disallow encoding of *valid* UTF-8 characters this way, though.

-hpa
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 5:30 ` H. Peter Anvin
@ 2004-02-18 10:29 ` Robin Rosenberg
2004-02-18 11:49 ` Tomas Szepe
2004-02-18 15:35 ` Linus Torvalds
1 sibling, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 10:29 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel

On Wednesday 18 February 2004 06.30, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> >>Well, the reason you'd want an out-of-band mechanism is to be able to
> >>display it as some kind of escapes.
> > I'd suggest just doing that when you convert the utf-8 format to printable
> > format _anyway_. At that point you just make the "printable"
> > representation be the binary escape sequence (which you have to have for
> > other non-printable utf-8 characters anyway).
> What does "printable" mean in this context? Typically you have to
> convert it to UCS-4 first, so you can index into your font tables, then
> you have to create the right composition, apply the bidirectional text
> algorithm, and so forth.
> Rendering general Unicode text is complex enough that you really want it
> layered. What I described was the first step of that -- mostly trying
> to show that "throwing an error" doesn't necessarily mean "produce no
> output." What you shouldn't do, though, is alias it with legitimate input.

I think you can use libicu here. Converting to UCS-4 to determine the
character type doesn't mean you will ever have actual strings of UCS-4.
It could be done character by character just for looking it up, so you
can have the out-of-band error flags internally.

> > And if you do things right (ie you allow user input in that same escaped
> > output format), you can allow users to re-create the exact "broken utf-8".
> > Which is actually important just so that the user can fix it up (ie
> > imagine the user noticing that the filename is broken, and now needs to do
> > a "mv broken-name fixed-name" - the user needs some way to re-create the
> > brokenness).
>
> Indeed. The C language has gone with \x77 for bytes and \u7777 or
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
> think this is a good UI for shells to follow. The \x representation
> then doesn't stand for characters but for bytes. It may be desirable to
> disallow encoding of *valid* UTF-8 characters this way, though.

Agree. \u80808080 I would assume represents a valid character, while
\x80\x80\x80\x80 does not.

A problem with invalid sequences I just noted is that they break some of
the nice properties of UTF-8 that people will assume apply, i.e. that you
can parse it backwards. With UTF-8 (i.e. well-formed utf-8) you can point
at a byte and figure "this is not the first byte", let's skip backwards to
find the start. If invalid sequences can ever occur you must read from the
start of the string.

-- robin
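The backward-parsing property mentioned above can be sketched as
follows, assuming Python 3. In well-formed UTF-8 every continuation byte
has the bit pattern 10xxxxxx, so stepping left until a byte does not
match that pattern lands on the start of the character. The helper name
char_start is hypothetical.

```python
def char_start(buf, i):
    """Index of the first byte of the character that contains buf[i].

    Only meaningful if buf is well-formed UTF-8.
    """
    while i > 0 and (buf[i] & 0xC0) == 0x80:  # 10xxxxxx: continuation byte
        i -= 1
    return i


good = "a\u20acb".encode("utf-8")  # b'a\xe2\x82\xacb', euro sign is 3 bytes
bad = b"\x80\x80"                  # two stray continuation bytes

print(char_start(good, 3))  # last byte of the euro sign, steps back to 1
print(char_start(bad, 1))   # stops at index 0, which is not a lead byte
```

With the malformed input the scan ends on a byte that is itself a
continuation byte, so the caller can no longer trust the answer without
validating from a known-good position.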
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 10:29 ` Robin Rosenberg @ 2004-02-18 11:49 ` Tomas Szepe 2004-02-18 11:59 ` Robin Rosenberg 0 siblings, 1 reply; 50+ messages in thread From: Tomas Szepe @ 2004-02-18 11:49 UTC (permalink / raw) To: Robin Rosenberg; +Cc: linux-kernel On Feb-18 2004, Wed, 11:29 +0100 Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote: [snip] > not the first byte", lets skip backwards to find the start. If invalid sequences can ever occur [snip] Would you _please_ read the lkml FAQ and stop posting e-mails with lines longer than 80 characters? Thank you. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 11:49 ` Tomas Szepe @ 2004-02-18 11:59 ` Robin Rosenberg 2004-02-18 12:05 ` Tomas Szepe 0 siblings, 1 reply; 50+ messages in thread From: Robin Rosenberg @ 2004-02-18 11:59 UTC (permalink / raw) To: Tomas Szepe; +Cc: linux-kernel On Wednesday 18 February 2004 12.49, Tomas Szepe wrote: > Would you _please_ read the lkml FAQ and stop posting e-mails with lines > longer than 80 characters? Thank you. As soon as someone asks nicely... I thought any decent mail client simply wrapped the lines. Hmm, remember some old system with 3270 access that didn't. I'll try to remember that. -- robin ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 11:59 ` Robin Rosenberg @ 2004-02-18 12:05 ` Tomas Szepe 2004-02-18 12:34 ` Robin Rosenberg 0 siblings, 1 reply; 50+ messages in thread From: Tomas Szepe @ 2004-02-18 12:05 UTC (permalink / raw) To: Robin Rosenberg; +Cc: linux-kernel On Feb-18 2004, Wed, 12:59 +0100 Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote: > On Wednesday 18 February 2004 12.49, Tomas Szepe wrote: > > Would you _please_ read the lkml FAQ and stop posting e-mails with lines > > longer than 80 characters? Thank you. > > As soon as someone asks nicely... I thought any decent mail client simply > wrapped the lines. 1) Quite the contrary. Any _decent_ mail client will _not_ wrap the lines. 2) A mail client that will wrap the lines will make your posts look like this: <cut> Having to put up with the existence of Windows day in and out is the reason I'm still on an eight-bit encoding. Sorry for not explaining the REAL problem, but only a partial problem. I need to support all kinds of clients on Windows with protocols that convey no character set info. With samba that's no problem. Having to put up with a Unix world running <cut> > I'll try to remember that. Thanks again. -- Tomas Szepe <szepe@pinerecords.com> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 12:05 ` Tomas Szepe @ 2004-02-18 12:34 ` Robin Rosenberg 0 siblings, 0 replies; 50+ messages in thread From: Robin Rosenberg @ 2004-02-18 12:34 UTC (permalink / raw) To: Tomas Szepe; +Cc: linux-kernel On Wednesday 18 February 2004 13.05, Tomas Szepe wrote: > On Feb-18 2004, Wed, 12:59 +0100 > Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote: > > > On Wednesday 18 February 2004 12.49, Tomas Szepe wrote: > > > Would you _please_ read the lkml FAQ and stop posting e-mails with lines > > > longer than 80 characters? Thank you. > > > > As soon as someone asks nicely... I thought any decent mail client simply > > wrapped the lines. > > 1) Quite the contrary. Any _decent_ mail client will _not_ wrap the lines. > > 2) A mail client that will wrap the lines will make your posts look like this: > > <cut> > Having to put up with the existence of Windows day in and out is the reason I'm > still on > an eight-bit encoding. Sorry for not explaining the REAL problem, but only a > partial > problem. I need to support all kinds of clients on Windows with protocols that > convey no > character set info. With samba that's no problem. Having to put up with a Unix > world running > <cut> That's what happens when the sender wraps the lines at column 80 and your client wraps at 72 (or similar situation), just another reason not to wrap when sending and let the users client do whatever the user think is fine. In order not to wrap and destroy information I have the autowrap feature off when composing mail, becase wrapped and cut stack traces, cuts from log files etc are a pain. BTW The 80 character rule is only mention wrt to signatures. -- robin ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 5:30 ` H. Peter Anvin
2004-02-18 10:29 ` Robin Rosenberg
@ 2004-02-18 15:35 ` Linus Torvalds
2004-02-18 19:47 ` Tomas Szepe
1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 15:35 UTC (permalink / raw)
To: H. Peter Anvin; +Cc: Kernel Mailing List

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
> What does "printable" mean in this context? Typically you have to
> convert it to UCS-4 first, so you can index into your font tables, then
> you have to create the right composition, apply the bidirectional text
> algorithm, and so forth.

Not all characters _have_ font entries. And even when they have font
entries, they may need escaping for other reasons (ie you may want to
marshall UTF-8 as plain ASCII just because you want to use a portable
format for transfer).

Think about the simple (hex) string x0A x00. That's a well-defined UTF-8
string, yet if you want to print it as a filename on the console, you
should obviously print it as "\n" or some similar escaped sequence
(actually, that's a bad example, since it's a special case, and it would
probably be better to use the example string x7F x00, which would be shown
as \u007F or something).

The same is true for a _lot_ of perfectly fine UTF-8 sequences, no?

That implies that you have to use an escaped sequence _anyway_. So as you
go along, turning the string into something printable, you might as well
escape the invalid UTF-8 sequences.

In other words: you walk the utf-8 string one character at a time,
converting it to whatever format (eg UCS-4) you have for font lookup, but
you also escape characters that you don't have font entries for or that
aren't in proper UTF-8 format. When converting to UCS-2, you have to check
for the proper format _anyway_, so none of this is in any way "extra
work". Instead of just aborting on an invalid UTF-8 character, you quote
it, exactly the same way you'd have to quote a _valid_ one that you can't
just show as a string.

> Rendering general Unicode text is complex enough that you really want it
> layered. What I described was the first step of that -- mostly trying
> to show that "throwing an error" doesn't necessarily mean "produce no
> output." What you shouldn't do, though, is alias it with legitimate input.

Exactly. And since you need an escape sequence anyway, what's the problem?

> > And if you do things right (ie you allow user input in that same escaped
> > output format), you can allow users to re-create the exact "broken utf-8".
> > Which is actually important just so that the user can fix it up (ie
> > imagine the user noticing that the filename is broken, and now needs to do
> > a "mv broken-name fixed-name" - the user needs some way to re-create the
> > brokenness).
>
> Indeed. The C language has gone with \x77 for bytes and \u7777 or
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
> think this is a good UI for shells to follow. The \x representation
> then doesn't stand for characters but for bytes. It may be desirable to
> disallow encoding of *valid* UTF-8 characters this way, though.

You need to encode even valid UTF-8, since you may not find a font entry
for the character, or the character just isn't appropriate in that context
(ie you can't show a newline).

But it makes perfect sense to use a policy of:
 - escape valid UTF-8 characters as '\u7777'
 - escape _invalid_ UTF-8 characters as their hex byte sequence (ie
   '\xC0\x80\x80', whatever)
 - (and, obviously, escape the valid UTF-8 character '\' as '\\').

Don't you agree? It clearly allows all the cases, and you can re-generate
the _exact_ original stream of bytes from the above (ie it is nicely
reversible, which in my opinion is a requirement).

Linus
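The three-rule policy above could be sketched like this, assuming
Python 3. Two things here are assumptions rather than anything stated in
the thread: str.isprintable() stands in for "has a font entry and is
safe to show", and the names escape_name and unescape_name are made up
for the example. The \U form for characters above U+FFFF follows the C
convention mentioned earlier in the thread.

```python
import re


def escape_name(raw):
    """Render raw bytes as text that can be turned back into the same bytes."""
    out = []
    # surrogateescape marks each invalid byte 0xNN as U+DC00 + 0xNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:      # invalid UTF-8: escape the byte
            out.append("\\x%02X" % (cp - 0xDC00))
        elif ch == "\\":                # the escape character itself
            out.append("\\\\")
        elif ch.isprintable():          # assumed stand-in for "displayable"
            out.append(ch)
        elif cp <= 0xFFFF:              # valid but not displayable
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


_TOKEN = re.compile(
    r"\\\\|\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
)


def unescape_name(text):
    """Inverse of escape_name for any string that escape_name produced."""
    out = bytearray()
    pos = 0
    for m in _TOKEN.finditer(text):
        out += text[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:      # \xHH is a raw byte
            out.append(int(m.group(1), 16))
        elif m.group(2) is not None:    # \uHHHH is a character
            out += chr(int(m.group(2), 16)).encode("utf-8")
        elif m.group(3) is not None:    # \UHHHHHHHH is a character
            out += chr(int(m.group(3), 16)).encode("utf-8")
        else:                           # \\ is a literal backslash
            out += b"\\"
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


name = b"ok\\\xc0\xae\n\xc3\xa9"  # backslash, overlong '.', newline, e-acute
shown = escape_name(name)
print(shown)
print(unescape_name(shown) == name)
```

Because \x always means a byte and \u always means a character, the
escaped form of an invalid sequence can never be confused with the
escaped form of a valid one, which is what keeps the mapping reversible.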
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 15:35 ` Linus Torvalds @ 2004-02-18 19:47 ` Tomas Szepe 2004-02-18 20:01 ` H. Peter Anvin 0 siblings, 1 reply; 50+ messages in thread From: Tomas Szepe @ 2004-02-18 19:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: H. Peter Anvin, Kernel Mailing List On Feb-18 2004, Wed, 07:35 -0800 Linus Torvalds <torvalds@osdl.org> wrote: > But it makes perfect sense to use a policy of: > - escape valid UTF-8 characters as '\u7777' > - escape _invalid_ UTF-8 characters as their hex byte sequence (ie > '\xC0\x80\x80', whatever) > - (and, obviously, escape the valid UTF-8 character '\' as '\\'). > > Don't you agree? It clearly allows all the cases, and you can re-generate > the _exact_ original stream of bytes from the above (ie it is nicely > reversible, which in my opinion is a requirement). I really really hope this is _exactly_ what we're going to see in practice. -- Tomas Szepe <szepe@pinerecords.com> ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 19:47 ` Tomas Szepe @ 2004-02-18 20:01 ` H. Peter Anvin 2004-02-18 21:22 ` Robin Rosenberg 0 siblings, 1 reply; 50+ messages in thread From: H. Peter Anvin @ 2004-02-18 20:01 UTC (permalink / raw) To: Tomas Szepe; +Cc: Linus Torvalds, Kernel Mailing List Tomas Szepe wrote: > On Feb-18 2004, Wed, 07:35 -0800 > Linus Torvalds <torvalds@osdl.org> wrote: > >>But it makes perfect sense to use a policy of: >> - escape valid UTF-8 characters as '\u7777' [And e.g. \U00017777 for characters above \uFFFF] >> - escape _invalid_ UTF-8 characters as their hex byte sequence (ie >> '\xC0\x80\x80', whatever) >> - (and, obviously, escape the valid UTF-8 character '\' as '\\'). >> >>Don't you agree? It clearly allows all the cases, and you can re-generate >>the _exact_ original stream of bytes from the above (ie it is nicely >>reversible, which in my opinion is a requirement). > > I really really hope this is _exactly_ what we're going to see in practice. > Same here. This is clearly The Right Thing[TM]. -hpa ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 20:01 ` H. Peter Anvin @ 2004-02-18 21:22 ` Robin Rosenberg 2004-02-18 21:42 ` H. Peter Anvin 0 siblings, 1 reply; 50+ messages in thread From: Robin Rosenberg @ 2004-02-18 21:22 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote: > [And e.g. \U00017777 for characters above \uFFFF] Isn't that octal :-) -- robin ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 21:22 ` Robin Rosenberg @ 2004-02-18 21:42 ` H. Peter Anvin 0 siblings, 0 replies; 50+ messages in thread From: H. Peter Anvin @ 2004-02-18 21:42 UTC (permalink / raw) To: Robin Rosenberg; +Cc: Tomas Szepe, Linus Torvalds, Kernel Mailing List Robin Rosenberg wrote: > On Wednesday 18 February 2004 21.01, H. Peter Anvin wrote: > >>[And e.g. \U00017777 for characters above \uFFFF] > > Isn't that octal :-) > No. -hpa ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 3:22 ` H. Peter Anvin 2004-02-18 3:30 ` Linus Torvalds @ 2004-02-18 11:24 ` Jamie Lokier 1 sibling, 0 replies; 50+ messages in thread From: Jamie Lokier @ 2004-02-18 11:24 UTC (permalink / raw) To: H. Peter Anvin; +Cc: Linus Torvalds, linux-kernel H. Peter Anvin wrote: > Well, the reason you'd want an out-of-band mechanism is to be able to > display it as some kind of escapes. As soon as you go to "display", you need a mechanism to escape lots of characters, not just malformed UTF-8. Consider: \u0000, \u001B, \u0007 and such need to be escaped too. -- Jamie ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 3:13 ` Linus Torvalds 2004-02-18 3:22 ` H. Peter Anvin @ 2004-02-18 11:33 ` Jamie Lokier 2004-02-18 16:47 ` H. Peter Anvin 2004-02-18 19:59 ` Linus Torvalds 1 sibling, 2 replies; 50+ messages in thread From: Jamie Lokier @ 2004-02-18 11:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: H. Peter Anvin, linux-kernel Linus Torvalds wrote: > Somebody correctly pointed out that you do not need any out-of-band > encoding mechanism - the very fact that it's an invalid sequence is in > itself a perfectly fine flag. No out-of-band signalling required. Technically this is almost(*) correct, however a _lot_ of code exists which assumes logical properties of UTF-8. (See, for example, the "stty utf8" patch). Perl, for example, allows you to pass around invalid sequences in exactly the way you describe. It works, right up until you do something like length() or substr() or a regex match. Then Perl screws up the answer, because it sees something like 0xfd and just assumes it can skip the next 5 bytes, without checking them. hpa's suggestion that invalid bytes are treated as 0x800000xx works very nicely, *iff* a program is absolutely consistent about its treatment of bytes in that way. When there's a mixture of code which interprets malformed UTF-8 in different ways, then it's messy and sometimes a security hazard. -- Jamie (*) - It's fine until you concatenate two malformed strings. Then the out-of-band signal is lost if the combination is valid UTF-8. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 11:33 ` Jamie Lokier @ 2004-02-18 16:47 ` H. Peter Anvin 2004-02-18 19:59 ` Linus Torvalds 1 sibling, 0 replies; 50+ messages in thread From: H. Peter Anvin @ 2004-02-18 16:47 UTC (permalink / raw) To: Jamie Lokier; +Cc: Linus Torvalds, linux-kernel Jamie Lokier wrote: > > hpa's suggestion that invalid bytes are treated as 0x800000xx works > very nicely, *iff* a program is absolutely consistent about its > treatment of bytes in that way. When there's a mixture of code which > interprets malformed UTF-8 in different ways, then it's messy and > sometimes a security hazard. > Absolutely. It has to be considered very carefully. -hpa ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 11:33 ` Jamie Lokier
2004-02-18 16:47 ` H. Peter Anvin
@ 2004-02-18 19:59 ` Linus Torvalds
2004-02-18 20:08 ` H. Peter Anvin
1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2004-02-18 19:59 UTC (permalink / raw)
To: Jamie Lokier; +Cc: H. Peter Anvin, linux-kernel

On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band
> > encoding mechanism - the very fact that it's an invalid sequence is in
> > itself a perfectly fine flag. No out-of-band signalling required.
>
> Technically this is almost(*) correct,
>
> (*) - It's fine until you concatenate two malformed strings. Then the
> out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says
"this stuff is wrong, because it was wrong at some point in the past", and
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing is, concatenating two malformed UTF-8 strings is normal behaviour
in a variety of circumstances, all basically having to do with lower
levels not knowing about higher-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes
as a stream of bytes, with blocking rules that have _nothing_ to do with
the content (timing, mtu's, extended TCP headers etc etc).

That doesn't mean that you shouldn't be able to
 - work on the partial results and show them to the user as UTF-8
 - be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea.

If you get a valid UTF-8 stream as a result of concatenation, you should
consider that to be the correct behaviour, or you should CHECK BEFOREHAND
if you think it is illegal.

Linus
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 19:59 ` Linus Torvalds
@ 2004-02-18 20:08 ` H. Peter Anvin
0 siblings, 0 replies; 50+ messages in thread
From: H. Peter Anvin @ 2004-02-18 20:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jamie Lokier, linux-kernel

Linus Torvalds wrote:
>
> But that's what you _want_. Having a real out-of-band signal that says
> "this stuff is wrong, because it was wrong at some point in the past", and
> not allowing concatenation of blocks of utf-8 bytes would be _bad_.
>

Indeed. What it does mean, however, is that you have to consider your
concatenation issues if you perform the concatenation in UCS-4 space.
For example, a string that ends in whatever code you have chosen for
<BOGUS-C8> that gets concatenated with <BOGUS-80> needs to get converted
to a valid <U+0200>. This is of course not an issue if you do the
concatenation in UTF-8 space and don't do round-trip conversion.

None of this is hard, it just takes thinking about rather than
automatically doing the obvious things.

> The thing is, concatenating two malformed UTF-8 strings is normal behaviour
> in a variety of circumstances, all basically having to do with lower
> levels not knowing about higher-level concepts.

Indeed.

-hpa
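The <BOGUS-C8> plus <BOGUS-80> case above can be reproduced in a few
lines, assuming Python 3. Python's "surrogateescape" error handler is
used here as a stand-in for "whatever code you have chosen" for bogus
bytes; it is not the 0x800000xx scheme from the thread, and the helper
names to_text and to_bytes are hypothetical.

```python
def to_text(raw):
    """Decode UTF-8, keeping undecodable bytes as lone-surrogate markers."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text):
    """Exact inverse of to_text."""
    return text.encode("utf-8", errors="surrogateescape")


left = to_text(b"\xc8")   # lead byte with nothing after it: one marker
right = to_text(b"\x80")  # stray continuation byte: one marker

# Concatenating in decoded space leaves two markers side by side ...
joined = left + right

# ... although the underlying bytes C8 80 are now valid UTF-8 for U+0200.
# A round trip through bytes performs the conversion the message asks for.
normalized = to_text(to_bytes(joined))

print(ascii(joined))
print(ascii(normalized))
```

Concatenating the raw bytes and decoding once gives the same U+0200
directly, which matches the remark that the issue does not arise in
UTF-8 space.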
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-18 2:58 ` H. Peter Anvin 2004-02-18 3:13 ` Linus Torvalds @ 2004-02-18 7:25 ` bert hubert 1 sibling, 0 replies; 50+ messages in thread From: bert hubert @ 2004-02-18 7:25 UTC (permalink / raw) To: linux-kernel On Wed, Feb 18, 2004 at 02:58:42AM +0000, H. Peter Anvin wrote: > Indeed. The original name for the encoding was, in fact, "FSS-UTF", > for "filesystem safe Unicode transformation format." That might explain a few things. > > F8 80 80 80 AE F8 80 80 80 AE > > FC 80 80 80 80 AE FC 80 80 80 80 AE > > No, they don't. Serves me right for trusting a random site, apologies. -- http://www.PowerDNS.com Open source, database driven DNS Software http://lartc.org Linux Advanced Routing & Traffic Control HOWTO ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-16 19:48 ` John Bradford
2004-02-16 19:48 ` Linus Torvalds
@ 2004-02-16 20:16 ` Marc Lehmann
2004-02-16 20:20 ` Jeff Garzik
` (3 more replies)
1 sibling, 4 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-16 20:16 UTC (permalink / raw)
To: John Bradford; +Cc: Jeff Garzik, Linus Torvalds, viro, Linux kernel

On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote:
> Quote from Jeff Garzik <jgarzik@pobox.com>:
> None of this is a real problem, if everything is set up correctly and
> bug free. Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

And this is the whole point.

BTW, to people trying to explain some properties of UTF-8 to me. I don't
think ad-hominem attacks like assuming that I don't understand UTF-8
(without any indication that this is so) are useful.

The point here is that the kernel does, in a very narrow interpretation,
not support the use of UTF-8, because proper support of UTF-8 means that
no illegal byte sequences will be produced.

Of course, I can feed the kernel UTF-8, and if everybody does that, it
will generally work quite fine.

However, Windows surely works fine if every program only feeds allowed
values into system calls. And even unix dialects without memory protection
work, as long as everybody plays fair.

The point is, however, that this is highly undesirable, and it would be
nice to have a kernel that would (optionally) fully support a UTF-8
environment in which applications can feed UTF-8 and _expect_ UTF-8 in
return, which _is_ a security issue.

It's very desirable to have a kernel that actively supports this. It is
clearly not _required_, of course. But then again, process abstraction
is also not required...

--
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-16 20:16 ` Marc Lehmann @ 2004-02-16 20:20 ` Jeff Garzik 2004-02-16 21:10 ` viro ` (2 subsequent siblings) 3 siblings, 0 replies; 50+ messages in thread From: Jeff Garzik @ 2004-02-16 20:20 UTC (permalink / raw) To: Marc Lehmann; +Cc: John Bradford, Linus Torvalds, viro, Linux kernel Marc Lehmann wrote: > The point here is that the kernel does, in a very narrow interpretation, > not support the use of UTF-8, because proper support of UTF-8 means that > no illegal byte sequences will be produced. Incorrect. Byte stream transports need not care about their contents. The only places that need to care about illegal UTF8 byte sequences are things like CONFIG_NLS_UTF8. Jeff ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-16 20:16 ` Marc Lehmann 2004-02-16 20:20 ` Jeff Garzik @ 2004-02-16 21:10 ` viro 2004-02-17 7:18 ` jw schultz 2004-02-17 7:42 ` Nick Piggin 3 siblings, 0 replies; 50+ messages in thread From: viro @ 2004-02-16 21:10 UTC (permalink / raw) To: John Bradford, Jeff Garzik, Linus Torvalds, Linux kernel On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote: > The point is, however, that this is highly undesirable, and it would be > nice to have a kernel that would (optionally) fully support a UTF-8 > environment in where applications can feed UTF-8 and _expect_ UTF-8 in > return, which _is_ a security issue. > > It's very desirable to have a kernel that actively supports this. IT is > clearly not _required_, of course. But then again, process abstraction > is also not required... Mind taking the demagogy elsewhere? Note that the same handwaving applies to e.g. file contents. Care to explain what makes read() and write() different in that respect? ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-16 20:16 ` Marc Lehmann 2004-02-16 20:20 ` Jeff Garzik 2004-02-16 21:10 ` viro @ 2004-02-17 7:18 ` jw schultz 2004-02-17 7:42 ` Nick Piggin 3 siblings, 0 replies; 50+ messages in thread From: jw schultz @ 2004-02-17 7:18 UTC (permalink / raw) To: Linux kernel; +Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro On Mon, Feb 16, 2004 at 09:16:10PM +0100, Marc Lehmann wrote: > On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote: > > Quote from Jeff Garzik <jgarzik@pobox.com>: > > None of this is a real problem, if everything is set up correctly and > > bug free. Unfortunately the Just Works thing falls apart in the, > > (frequent), instances that it's not :-(. > > And this is the whole point. > > BTW, to people trying to explain some properties of UTF-8 to me. I don't > think ad-hominem attacks like assuming that I don't understand UTF-8 > (without any indication that this is so) are useful. > > The point here is that the kernel does, in a very narrow interpretation, > not support the use of UTF-8, because proper support of UTF-8 means that > no illegal byte sequences will be produced. That "interpretation" is so narrow as to be unrealistic. The kernel supports UTF-8 the same way a stage supports rock musicians. You confuse support with enforce, rather like confusing tolerance with endorsement. And it should be noted that the kernel doesn't produce file names. It only passes them along. > Of course, I can feed the kernel UTF-8, and if everybody does that, it > will generally work quite fine. However, Windows surely works fine if > every program only feeds allowed values into system calls. And even unix > dialects without memory protection work, as long as everybody plays > fair. > > The point is, however, that this is highly undesirable, and it would be > nice to have a kernel that would (optionally) fully support a UTF-8 You mean enforce again. 
That enhancement request has been rejected repeatedly because such a thing would be highly undesirable. What might be a convenient but unnecessary restriction today is too likely to become an unbearable restriction tomorrow. I don't want the kernel to have to care about what is or isn't valid UTF-8. I certainly don't want to have the kernel loaded with outdated character tables. > environment in where applications can feed UTF-8 and _expect_ UTF-8 in > return, which _is_ a security issue. I want an environment where applications can feed bytestreams and expect the same bytestream in return. I see enough problems as a result of filesystems that don't do that. > It's very desirable to have a kernel that actively supports this. IT is You mean enforces again. Kernel as police, next thing you will want is a kernel that prevents undesirable character sequences. > clearly not _required_, of course. But then again, process abstraction > is also not required... I'll tell you what. Patch libc. You can add UTF-8 filename enforcement to libc. There are only a few system calls that would need to have their wrappers enlarged. I'm sure the libc people will direct you to someplace very warm if you ask them for this enhancement. -- ________________________________________________________________ J.W. Schultz Pegasystems Technologies email address: jw@pegasys.ws Remember Cernan and Schmitt ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-16 20:16 ` Marc Lehmann ` (2 preceding siblings ...) 2004-02-17 7:18 ` jw schultz @ 2004-02-17 7:42 ` Nick Piggin 3 siblings, 0 replies; 50+ messages in thread From: Nick Piggin @ 2004-02-17 7:42 UTC (permalink / raw) To: Marc Lehmann Cc: John Bradford, Jeff Garzik, Linus Torvalds, viro, Linux kernel Marc Lehmann wrote: >On Mon, Feb 16, 2004 at 07:48:19PM +0000, John Bradford <john@grabjohn.com> wrote: > >>Quote from Jeff Garzik <jgarzik@pobox.com>: >>None of this is a real problem, if everything is set up correctly and >>bug free. Unfortunately the Just Works thing falls apart in the, >>(frequent), instances that it's not :-(. >> > >And this is the whole point. > >BTW, to people trying to explain some properties of UTF-8 to me. I don't >think ad-hominem attacks like assuming that I don't understand UTF-8 >(without any indication that this is so) are useful. > >The point here is that the kernel does, in a very narrow interpretation, >not support the use of UTF-8, because proper support of UTF-8 means that >no illegal byte sequences will be produced. > > So does the kernel support the English language? Does your email client? ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-16 18:49 ` Linus Torvalds 2004-02-16 19:26 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik @ 2004-02-16 20:03 ` Marc Lehmann 2004-02-16 20:23 ` Linus Torvalds 1 sibling, 1 reply; 50+ messages in thread From: Marc Lehmann @ 2004-02-16 20:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: viro, Linux kernel On Mon, Feb 16, 2004 at 10:49:48AM -0800, Linus Torvalds <torvalds@osdl.org> wrote: > > The problem is that the kernel does not use UTF-8, i.e. applications in > > the current linux model have to deal with the fact that the kernel > > happily breaks the assumed protocol of using UTF-8 by delivering illegal > > byte sequences to applications. > > You didn't read what I said. I read it. > READ MY POSTING. You even quoted it, but you didn't understand it. You were able to explain it clearly enough for me, I think. > I'm saying that "the kernel talks bytestreams". And I am saying that this is not good, which is my sole point. > I have never claimed that the kernel really talks UTF-8, and indeed, I > would say that such a kernel would be terminally and horribly broken. And I'd say such a kernel would be highly useful, as it would standardize the encoding of filenames, just as unix standardizes on "mostly ascii" (i.e. the SuS). However, just as POSIX is a nice but very limited base, (mostly) ASCII is a nice and very limited base. UTF-8 would also be a good base. 8-bit bytes as filenames is not a good base, however, since they enforce a different layer of interpretation between the user and the kernel, and this interpretation cannot be based on the locale nor the filesystem itself, as there is no way to find out what encoding the filename is in. 8-bit bytes is convenient, but not useful for i18n environments. In the past, it was also convenient and nobody cared, since everything was either 8-bit or double-byte, and nobody exchanged files. 
This, however, is going to change, and the current methodology of "just guess, you might be right" is a hindrance to this. > The kernel is _agnostic_ in what it does. No, it's not. If at all, the kernel specifies a specially-interpreted (ascii sans / and \0) byte-stream, as you say yourself. However, just as with URLs (which are byte-streams, too), byte-streams are useless to store text. You need bytestreams + known encoding. If filenames were not names, but just binary id's I would agree, but this is not at all how filenames are used nor how their use is applied. Filenames are composed of text, but the kernel gives no indication on how to interpret this text, and as a matter of fact, nothing else gives this indication. glib etc. uses G_BROKEN_FILENAMES to force locale-encoding. But as others have said, one man's locale is unlike another man's locale. > really care AT ALL what you feed it, as long as it is a byte-stream. > > Now, that implies that if you want to have extended characters, then YOU > HAVE TO USE UTF-8. You say so, but there is no logical connection between these two statements. I can store latin1 easily in a bytestream, as I can store iso-2022-jp or euc-jp. But they are incompatible to UTF-8. You are yelling at me for no good reason. "YOU HAVE TO USE UTF-8". Why should this be? The kernel certainly enforces this. Even you claim that I don't have to, as the kernel doesn't care. However, if you think so violently that it has to be UTF-8 that you even yell it, then why doesn't the kernel comply to this rule? Why should an application "HAVE TO" use utf-8 for input when the kernel doesn't even try to comply and hands out illegal output? This is just like mmap sometimes returning a page number and sometimes a byte address... this would also not be useful unless you know the unit that mmap returns (addresses in multiples of 1). > That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. 
But you are saying that you have to feed UTF-8 into the kernel, which is not the case either. I certainly don't have to..., and, what's worse, you haven't given any indication of why one has to. Just because you say so? Or is there actually a reason? If there is a reason, why doesn't the kernel, in return, also follow this reasoning? > kernel doesn't care one way or the other. As far as the kernel is It doesn't. But the point is that it should. If the kernel would do everything we want it to do there would be no point in enhancing it. > concerned, you could uuencode all the stuff, and the kernel wouldn't (Yes, because uuencode has the peculiar property of neither generating \0 nor /. It doesn't work in general with byte-streams). > think you're crazy. The kernel _only_ cares about byte streams. The kernel _interprets_ the byte-stream already. And some byte-streams are not valid filenames _already_. > > There is no way for applications to handle UTF-8 and illegal-utf8 in > > a sane way, so most apps will either eat the illegal bytes, skip the > > filename, or crash (the latter case is clearly a bug in the app, the > > former cases aren't). > > What you're complaining about are bad user applications. It has _zero_ to > do with the kernel. Could you elaborate on why these apps are bad? What I am interested in is how to fix them, since there is simply no way to interpret the names returned by the kernel as the corresponding meta-information is missing. Consider an OS that allows different characters for path-separators (unix only allows '/'). Without the knowledge of the path separator it would be impossible to interpret paths. And without knowledge about encoding it's impossible (but slightly less dramatic) to correctly interpret filenames. > > Fixing the VFS to actually enforce what linus claims ("filenames are > > utf-8") is a very good idea, imho. > > No. Read my claim again. You obviously do not understand it AT ALL. ... 
> What you suggest would be a horribly idiotic and bad idea. Why? > The kernel doesn't set policy. The kernel says "this is what I can do, > you set policy". Exactly. The kernel could specify the API to use UTF-8. This is not more policy than it currently enforces. Or do you suggest that the ability to change the source and replace all occurrences of '/' by '\\' means that '/' is not enforced as policy on path separators? We basically seem to disagree on what, exactly, policy is. Policy (to me) is something that differentiates between several incompatible alternatives. Choosing policy means to rule out other (useful) alternatives. One could argue that '/' is a policy because it precludes the '/' character from being used in filenames, something some filesystems or operating systems support. I'd say (probably as much as you) that this policy is not a real policy, it's just idiotic. But enforcing other restrictions on filenames should magically be real policy? This is obviously not idiotic at all, and should be carefully explored. > And UTF-8 just happens to be the only sane policy for encoding complex > characters into a byte stream. But it is not the only policy. Just as '/' is not the only possible path separator. If that is your point, you should explain why enforcing this is ok while supporting utf-8 (not enforcing, just supporting, meaning having the ability to rule out non-utf-8 sequences when the admin wants this) is not. > Another sane policy is to say "byte streams are latin1". It's not an > acceptable policy for encoding _complex_ characters, but it is a policy. > And it's a perfectly sane one. I agree that it is sane. But it is not very useful for the future, as people who want russian filenames are plainly unable to use the other filenames in a sensible way. There is no way to know the encoding. > In short: filenames are byte streams. Nothing more. Right now, they aren't. Not all sequences of bytes are valid filenames already, and I think this is perfectly o.k. 
> And when I say that you have to talk to the kernel using UTF-8, I'm only > claiming that it is the only sane way to encode extended characters in a > byte stream. Nothing more. And i fully agree. -- -----==- | ----==-- _ | ---==---(_)__ __ ____ __ Marc Lehmann +-- --==---/ / _ \/ // /\ \/ / pcg@goof.com |e| -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+ The choice of a GNU generation | | ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-16 20:03 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann @ 2004-02-16 20:23 ` Linus Torvalds 2004-02-16 22:26 ` Jamie Lokier 0 siblings, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2004-02-16 20:23 UTC (permalink / raw) To: Marc Lehmann; +Cc: viro, Linux kernel On Mon, 16 Feb 2004, Marc Lehmann wrote: > > > I'm saying that "the kernel talks bytestreams". > > And I am saying that this is not good, which is my sole point. Fair enough. However, that's where the unix philosophy comes in. The unix philosophy has always been to not try to understand the data that the user passes around - and that "everything is a bytestream" is very much encoded in the basic principles of how unix should work. That agnosticism has a lot of advantages. It literally means that the basic operating system doesn't set arbitrary limitations, which means that you can do things that you couldn't necessarily otherwise easily do. It does mean that you can do "strange" things too, and it does mean that user space basically has a lot of choice in how to interpret those byte streams. And yes, it can cause confusion. You don't like the confusion, so you argue that it shouldn't be allowed. It's a valid argument, but it's an argument that assumes that choice is bad. If you want to _force_ everybody to use UTF-8, then yes, the kernel could enforce that readdir() would never pass through a broken UTF-8 string, and all the path lookup functions also would never accept a broken string. It's not technically impossible to do, although it would add a certain amount of pain and overhead. But the thing is, not everyone uses UTF-8. The big distributions have only recently started moving to UTF-8, and it will take _years_ before UTF-8 is ubiquitous. And even then it might be the wrong thing to disallow clever people from doing clever things. 
Encoding other information in filenames might be proper for a number of applications. > And I'd say such a kernel would be highly useful, as it would standardize > the encoding of filenames, just as unix standardizes on "mostly ascii" > (i.e. the SuS). It would also be very painful, since it would mean that when you mount an old disk, you may be totally unable to read the files, because they have filenames that such a kernel would never accept. > > The kernel is _agnostic_ in what it does. > > No, it's not. If at all, the kernel specifies a specially-interpreted > (ascii sans / and \0) byte-stream, as you say yourself. > > However, just as with URLs (which are byte-streams, too), byte-streams are > useless to store text. You need bytestreams + known encoding. You don't "need" a known encoding. The kernel clearly doesn't need one. It's a container, and the encoding comes from the outside. And that's what I mean by agnostic - you can make your own encoding. Most of the time (but not always) these days UTF-8 is the only sane encoding to use. But let people do what they want to do. Choice is _inherently_ good. Trying to force a world-view is bad. You should be able to tell people what they should do to avoid confusion ("use UTF-8"), but you should not _force_ them to that if they have good reasons not to (and "backwards compatibility" is a better reason than just about anything else). > But you are saying that you have to feed UTF-8 into the kernel, which is > not the case either. No. I'm saying that (a) "if you want to use complex character sets" then (b) "you really have to use UTF-8" to talk to the kernel. Note the two parts. You're hung up on (b), while I have tried to make it clear that (a) is a prerequisite for (b). Not everybody cares about (a). There are still people who use extended ASCII, simply because they DO NOT CARE about complex character sets. And if they don't care, and (a) isn't true, then (b) has no meaning any more. 
(In all fairness, some people will disagree with (b) even when (a) is true and like things like UCS-2. Those people are crazy, but I guess I'd just mention that possibility anyway). And this is why I say that the kernel only cares about byte streams, and having it filter to only accept proper UTF-8 sequences would be a horribly bad idea. Because it _assumes_ (a). That's what "making policy" is all about. The kernel should not assume that everybody cares about complex character sets. This may change, btw. I'm nothing if not pragmatic. In another twenty years, maybe everybody _literally_ uses complex character sets, and this whole discussion is totally silly, and the kernel may enforce UTF-8 or Klingon or whatever. At some point assumptions become _so_ ingrained that they are no longer policy any more, they are just "fact". Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
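The distinction between what the kernel accepts and what happens to be UTF-8 can be written down as two separate predicates. This is only a sketch of the rules named in the thread (no '/' and no NUL inside a path component); both function names are illustrative, and real kernels add details such as length limits and the special names "." and ".." that are ignored here.

```python
def kernel_accepts_component(name: bytes) -> bool:
    """Simplified model: a path component is any non-empty byte string
    that contains neither '/' nor NUL. No encoding is assumed."""
    return len(name) > 0 and b"/" not in name and b"\0" not in name


def is_utf8(name: bytes) -> bool:
    """Separate, user-space question: does the byte string decode as UTF-8?"""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_name = b"\xe5\xe4\xf6"            # latin1 bytes, as in "echo > ..." later in the thread
utf8_name = b"\xc3\xa5\xc3\xa4\xc3\xb6"  # the same three letters encoded as UTF-8
```

Both names pass the first predicate; only the second one passes `is_utf8`. A kernel that folded the two checks together would be making the assumption labelled (a) above on behalf of every user.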
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-16 20:23 ` Linus Torvalds @ 2004-02-16 22:26 ` Jamie Lokier 2004-02-16 22:40 ` Linus Torvalds 0 siblings, 1 reply; 50+ messages in thread From: Jamie Lokier @ 2004-02-16 22:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel Linus Torvalds wrote: > It would also be very painful, since it would mean that when you mount an > old disk, you may be totally unable to read the files, because they have > filenames that such a kernel would never accept. Alas, once userspace has migrated to doing everything in UTF-8, you won't be able to read those files because userspace will barf on them. Then you'll be glad to have a mount option which converts iso-8859-1 to UTF-8 :) (Even if the old disk is actually not iso-8859-1, at least you'll be able to read its mangled filenames, rather than userspace tripping over them). -- Jamie ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-16 22:26 ` Jamie Lokier @ 2004-02-16 22:40 ` Linus Torvalds 2004-02-17 7:14 ` Lehmann 0 siblings, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2004-02-16 22:40 UTC (permalink / raw) To: Jamie Lokier; +Cc: Marc Lehmann, viro, Linux kernel On Mon, 16 Feb 2004, Jamie Lokier wrote: > > Alas, once userspace has migrated to doing everything in UTF-8, you > won't be able to read those files because userspace will barf on them. Nope. Read my other email. Done right, user space will _not_ barf on them, because it won't try to "normalize" any UTF-8 strings. If the string has garbage in it, user space should just pass the garbage through. We've had this _exact_ issue before. Long before people worried about UTF-8, people worried about the fact that programs like "ls" shouldn't print out the extended ASCII characters as-is, because that would cause bad problems on a terminal as they'd be seen as terminal control characters. Does that mean that unix tools like "rm" cannot remove those files? Hell no! It just means that when you do "rm -i *", the filename that is printed may not have special characters in it that you don't see. Same goes for UTF-8. A "broken" UTF-8 string (ie something that isn't really UTF-8 at all, but just extended ASCII) won't _print_ right, but that doesn't mean that the tools won't work. You'll still be able to edit the file. Try it with a regular C locale. Do a simple echo > åäö (that's latin1), and do a "rm -i åäö", and see what it says. Right: it does the _right_ thing, and it prints out: torvalds@home:~> rm -i åäö rm: remove regular file `\345\344\366'? In other words, you have a program that doesn't understand a couple of the characters (because they don't make sense in its "locale"), but it still _works_. It just can't print them. 
Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8 program should do when it sees broken UTF-8. It can still access the file, it can still do everything else with it, but it can't print out the filename, and it should use some kind of escape sequence to show that fact. The two cases are 100% equivalent. We've gone through this before. There is a bit of pain involved, but it's not something new, or something fundamentally impossible. It's very straightforward indeed. Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
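The `\345\344\366` output quoted above can be approximated in a few lines. This is a sketch of the idea only, not the actual quoting code used by rm: the function name `escape_c_locale` is made up for illustration, and it simply prints every byte outside printable ASCII (plus the backslash itself) as a three-digit octal escape.

```python
def escape_c_locale(name: bytes) -> str:
    """Render a filename for display the way a C-locale tool might:
    printable ASCII passes through, everything else becomes \\ooo."""
    out = []
    for byte in name:
        if 0x20 <= byte < 0x7F and byte != 0x5C:  # printable ASCII except backslash
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)           # octal escape, always three digits
    return "".join(out)


shown = escape_c_locale(b"\xe5\xe4\xf6")          # the latin1 bytes from the example
```

The bytes on disk are never touched; only the displayed form changes, which is why the file can still be opened or removed.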
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-16 22:40 ` Linus Torvalds @ 2004-02-17 7:14 ` Lehmann 2004-02-17 11:20 ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting 2004-02-17 15:56 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds 0 siblings, 2 replies; 50+ messages in thread From: Lehmann @ 2004-02-17 7:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote: > Try it with a regular C locale. Do a simple > > echo > åäö Just for your info, though. You can't even input these characters in a C locale, since your libc (and/or xlib) is unable to handle them (lots of ISO C functions will barf on this one). C is 7 bit only. > Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8 > program should do when it sees broken UTF-8. The problem is that the very common C language makes it a pain to use this in i18n programs. multibyte functions or iconv will not accept these, so programs wanting to do what you are expecting to do need to re-implement most if not all of the character handling of your typical libc. Yes, it's possible.... > The two cases are 100% equivalent. We've gone through this before. There > is a bit of pain involved, but it's not something new, or something > fundamentally impossible. It's very straightforward indeed. The "bit" is enormous, as you can't use your libc for text processing anymore. Yes, it works in non-i18n programs, but right now most programs get i18n support, which means they will all fail to properly handle non-locale characters. -- -----==- | ----==-- _ | ---==---(_)__ __ ____ __ Marc Lehmann +-- --==---/ / _ \/ // /\ \/ / pcg@goof.com |e| -=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+ The choice of a GNU generation | | ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API 2004-02-17 7:14 ` Lehmann @ 2004-02-17 11:20 ` Helge Hafting 2004-02-17 15:56 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds 1 sibling, 0 replies; 50+ messages in thread From: Helge Hafting @ 2004-02-17 11:20 UTC (permalink / raw) Cc: Linus Torvalds, Jamie Lokier, Marc Lehmann, viro, Linux kernel pcg@goof.com (Marc A. Lehmann) wrote: > On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote: > >>Try it with a regular C locale. Do a simple >> >> echo > åäö > > > Just for your info, though. You can't even input these characters in a C > locale, since your libc (and/or xlib) is unable to handle them (lots of ISO > C functions will barf on this one). C is 7 bit only. > > >>Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8 >>program should do when it sees broken UTF-8. > > > The problem is that the very common C language makes it a pain to use > this in i18n programs. multibyte functions or iconv will not accept > these, so programs wanting to do what you are expecting to do need to > re-implement most if not all of the character handling of your typical > libc. > > Yes, it's possible.... All you need is a possible_garbage_to_properly_escaped_utf8(char *string) in libc. Any program that wants to display filenames it got straight from readdir (or any binary file contents) will simply feed the string through that and get back a string with escapes for anything that isn't utf8. It is a write-once, use everywhere thing. Once upon a time, there were serious problems when someone created filenames like "; rm -fr *" Today we use tab completion and get bash to present the filename with proper escapes. It is then harmless. Bad utf8 can be handled the same way. > The "bit" is enormous, as you can't use your libc for text processing > anymore. Not the current libc, but libc can be improved upon. 
The same happened to silly code that wasn't 8-bit clean. Helge Hafting ^ permalink raw reply [flat|nested] 50+ messages in thread
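A rough idea of what such a helper could look like, written in Python rather than as a libc function. This is a sketch under assumptions: the name mirrors the one proposed above but the `\xNN` escape syntax is an arbitrary choice, and control characters are left alone for brevity. It relies on Python's standard `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to a code point in U+DC80..U+DCFF.

```python
def possible_garbage_to_escaped_utf8(raw: bytes) -> str:
    """Return text that is always safe to encode as UTF-8: valid sequences
    are kept, every undecodable byte is shown as \\xNN, and a literal
    backslash is doubled so the result stays unambiguous."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:           # marker for one undecodable byte
            out.append("\\x%02x" % (code - 0xDC00))
        elif ch == "\\":
            out.append("\\\\")
        else:
            out.append(ch)
    return "".join(out)
```

A program listing a directory would call this only when displaying a name and would keep the original bytes for every system call, so nothing is lost by the escaping.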
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) 2004-02-17 7:14 ` Lehmann 2004-02-17 11:20 ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting @ 2004-02-17 15:56 ` Linus Torvalds [not found] ` <20040217161111.GE8231@schmorp.de> 1 sibling, 1 reply; 50+ messages in thread From: Linus Torvalds @ 2004-02-17 15:56 UTC (permalink / raw) To: Marc; +Cc: Jamie Lokier, Marc Lehmann, viro, Linux kernel On Tue, 17 Feb 2004, Marc wrote: > On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote: > > Try it with a regular C locale. Do a simple > > > > echo > åäö > > Just for your info, though. You can't even input these characters in a C > locale, since your libc (and/or xlib) is unable to handle them (lots of SO > C functions will barf on this one). C is 7 bit only. Ehh.. It's pointless to tell me that I can't do it. I just did. The C locale is _not_ 7-bit only. The C locale is the traditional "byte locale" for UNIX. It will happily collate 8-bit-characters in their (numerical) order. Anything else would be seriously broken. > > Which, if you think about is, is 100% EXACTLY equivalent to what a UTF-8 > > program should do when it sees broken UTF-8. > > The problem is that the very common C language makes it a pain to use > this in i18n programs. multibyte functions or iconv will no accept > these, so programs wanting to do what you are expecting to do need to > re-implement most if not all of the character handling of your typical > libc. These are all teething problems. The thing is, true multi-locale programs haven't been around long enough that people take the problems for granted. A lot of them work today, but "work" is different from "always does the right thing". These things take a _long_ time for people to sort out the full implications of. (Analogy time: how many people _still_ use "find ... | xargs xxx", even though that can lead to problems and is thus wrong? 
You should really use "find ... -print0 | xargs -0 xxx" to get it _right_, but most people ignore that, because the common form works for most cases.) The process is complicated by the fact that most of the people who really care about UTF-8 and locales are very strict about it: they have been hitting their heads against latin1 users for a long time, and they are frustrated and _tired_ of it, and so they often hate single-byte usage with a passion, and consider it not only wrong but EVIL. Which is obviously silly, but hey, I understand why they can feel a bit put off by the problem. So the multi-byte people often stare at the standards, and then _refuse_ to touch anything that isn't standards-compliant. When they see something incorrect, they'd rather dump core (or just truncate it) than try to handle it gracefully, because they want the whole world to see how incorrect it is. Which flies in the face of "Be strict in what you generate, be liberal in what you accept". A lot of the functions are _not_ willing to be liberal in what they accept. Which sometimes just makes the problem worse, for no good reason. The fact is, you shouldn't use "iconv()" unless you controlled the input. It's a bit like "gets()" - unsafe to use unless you generated the damn thing yourself and you _know_ it fits in the buffer. But we just don't have the functions (yet) to do it _right_, and to escape the input some way (yeah, yeah, I know you can do it with iconv() and a lot of cruft around it - the point is that nobody does it, because it's too painful). Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
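The find/xargs analogy comes down to which byte is used as the record separator. The sketch below models only that one aspect (real xargs without -0 also splits on blanks and interprets quotes); the function names are illustrative.

```python
def split_on_newline(stream: bytes) -> list:
    """Model of the common form: one filename per line."""
    return [name for name in stream.split(b"\n") if name]


def split_on_nul(stream: bytes) -> list:
    """Model of -print0 / -0: NUL can never occur inside a filename."""
    return [name for name in stream.split(b"\0") if name]


names = [b"plain.txt", b"two\nlines.txt"]       # the second name contains a newline

newline_stream = b"".join(n + b"\n" for n in names)
nul_stream = b"".join(n + b"\0" for n in names)

broken = split_on_newline(newline_stream)       # the odd name is cut in two
intact = split_on_nul(nul_stream)               # round-trips exactly
```

Since NUL is one of the two bytes a filename can never contain, it is the only separator that works for every possible name.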
[parent not found: <20040217161111.GE8231@schmorp.de>]
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) [not found] ` <20040217161111.GE8231@schmorp.de> @ 2004-02-17 16:32 ` Linus Torvalds 2004-02-17 16:46 ` Jamie Lokier 2004-02-17 16:54 ` Stefan Smietanowski 0 siblings, 2 replies; 50+ messages in thread From: Linus Torvalds @ 2004-02-17 16:32 UTC (permalink / raw) To: Marc Lehmann; +Cc: Jamie Lokier, viro, Linux kernel On Tue, 17 Feb 2004, Marc Lehmann wrote: > > Because there is a fundamental difference between file contents and > filenames. Filenames are supposed to be text. I think this is actually the fundamental point where we disagree. You think of filenames as something the user types in, and that is "readable text". And I don't. I think the filenames are just ways for a _program_ to look up stuff, and the human readability is a secondary thing (it's "polite", but not a fundamental part of their meaning). So the same way I think text is good in config files and I dislike binary blobs (hey, look at /proc), I think readable filenames are good. But that doesn't mean that they have to be readable. I can well imagine encoding meta-data in the filename for some database that uses the filesystem as its backing store and generates files for large blobs. And then there would be little if any "goodness" to keeping the filenames readable. That's also a situation where case-insensitivity can _really_ screw you (just one of the many). It may be rare, but unlike you, I don't think there is anything "wrong" with considering path components to be just "data". Linus ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
2004-02-17 16:32 ` Linus Torvalds
@ 2004-02-17 16:46 ` Jamie Lokier
2004-02-17 19:00 ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 16:54 ` Stefan Smietanowski
1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 16:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Marc Lehmann, viro, Linux kernel

Linus Torvalds wrote:
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).

Politeness is nice. I'm sure there's a pragmatic reason most
filenames are meaningful text in some human language :)

I'd like a way to type something like "touch zöe.txt" on an ordinary
latin1 terminal and get a UTF-8 filename in my filesystem. Thanks :)

-- Jamie
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 16:46 ` Jamie Lokier
@ 2004-02-17 19:00 ` Måns Rullgård
2004-02-17 20:57 ` Jamie Lokier
0 siblings, 1 reply; 50+ messages in thread
From: Måns Rullgård @ 2004-02-17 19:00 UTC (permalink / raw)
To: linux-kernel

Jamie Lokier <jamie@shareable.org> writes:
> Linus Torvalds wrote:
>> I think the filenames are just ways for a _program_ to look up stuff, and
>> the human readability is a secondary thing (it's "polite", but not a
>> fundamental part of their meaning).
>
> Politeness is nice. I'm sure there's a pragmatic reason most
> filenames are meaningful text in some human language :)
>
> I'd like a way to type something like "touch zöe.txt" on an ordinary
> latin1 terminal and get a UTF-8 filename in my filesystem. Thanks :)

Then hack either bash (or whatever shell you use) or touch to do just that.

--
Måns Rullgård
mru@kth.se
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 19:00 ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
@ 2004-02-17 20:57 ` Jamie Lokier
2004-02-17 21:06 ` Alex Belits
2004-02-17 21:23 ` Matthew Kirkwood
0 siblings, 2 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 20:57 UTC (permalink / raw)
To: Måns Rullgård; +Cc: linux-kernel

Måns Rullgård wrote:
> > I'd like a way to type something like "touch zöe.txt" on an ordinary
> > latin1 terminal and get a UTF-8 filename in my filesystem. Thanks :)
>
> Then hack either bash (or whatever shell you use) or touch to do just that.

Hacking touch is obviously useless - I'd need to hack all the other
2000 shell utilities to get any useful behaviour.

Hacking bash -- actually readline -- is a much better idea. Then you
can enter names and they'll be created right. The only flaw in this
is that "ls" won't be useful, so that'll need to be hacked as well. etc.

No, I think hacking the terminal I/O is the best bet here. Then _all_
programs which currently work with UTF-8 terminals, which is rapidly
becoming most of them, will work the same with both kinds of terminal,
and the illusion of perfection will be complete and beautiful.

-- Jamie
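A minimal sketch of what translating at the terminal I/O layer amounts to, assuming programs and the filesystem use UTF-8 while the terminal speaks some other encoding. The function names from_terminal and to_terminal are hypothetical and stand in for whatever layer (tty proxy, line discipline, ssh) would actually do the work. Both directions are needed, which is the point made above about "ls".

```python
def from_terminal(typed: bytes, term_encoding: str) -> bytes:
    """Keystrokes in the terminal's encoding -> UTF-8 bytes for the program."""
    return typed.decode(term_encoding).encode("utf-8")


def to_terminal(output: bytes, term_encoding: str) -> bytes:
    """UTF-8 program output -> bytes the terminal can display.

    Characters the terminal cannot represent become '?' rather than
    aborting the whole write.
    """
    text = output.decode("utf-8", errors="replace")
    return text.encode(term_encoding, errors="replace")


# Typing "touch z\u00f6e.txt" on a latin1 terminal sends the byte 0xf6;
# the program behind the translation layer sees the UTF-8 pair 0xc3 0xb6.
command = from_terminal(b"touch z\xf6e.txt", "latin-1")

# Listing the file sends the name back in the terminal's own encoding.
listing = to_terminal(b"z\xc3\xb6e.txt", "latin-1")

# A name outside latin1 degrades to placeholders instead of failing.
degraded = to_terminal("\u65e5\u672c".encode("utf-8"), "latin-1")
```

The degraded case shows the limit of the illusion: a latin1 terminal can display such a name only approximately, and cannot type it back.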
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 20:57 ` Jamie Lokier
@ 2004-02-17 21:06 ` Alex Belits
2004-02-17 21:47 ` Jamie Lokier
2004-02-18 7:23 ` Marc Lehmann
2004-02-17 21:23 ` Matthew Kirkwood
1 sibling, 2 replies; 50+ messages in thread
From: Alex Belits @ 2004-02-17 21:06 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:
> No, I think hacking the terminal I/O is the best bet here. Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

UTF-8 terminals (and variable-encoding terminals) already exist,
gnome-terminal is one of them. They are, of course, bloated pigs, but I
would rather have the bloat and idiosyncrasy in the user interface where
it belongs.

-- Alex
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 21:06 ` Alex Belits
@ 2004-02-17 21:47 ` Jamie Lokier
2004-02-22 15:32 ` Eric W. Biederman
2004-02-18 7:23 ` Marc Lehmann
1 sibling, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:47 UTC (permalink / raw)
To: Alex Belits; +Cc: Måns Rullgård, linux-kernel

Alex Belits wrote:
> > No, I think hacking the terminal I/O is the best bet here. Then _all_
> > programs which currently work with UTF-8 terminals, which is rapidly
> > becoming most of them, will work the same with both kinds of terminal,
> > and the illusion of perfection will be complete and beautiful.
>
> UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I
> would rather have the bloat and idiosyncrasy in the user interface where
> it belongs.

Yes, I am using it right now. The fancy characters work well in it.
Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
naturally like to access my files in the same way.

-- Jamie
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 21:47 ` Jamie Lokier
@ 2004-02-22 15:32 ` Eric W. Biederman
2004-02-22 16:28 ` Jamie Lokier
0 siblings, 1 reply; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 15:32 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:
> Alex Belits wrote:
> > > No, I think hacking the terminal I/O is the best bet here. Then _all_
> > > programs which currently work with UTF-8 terminals, which is rapidly
> > > becoming most of them, will work the same with both kinds of terminal,
> > > and the illusion of perfection will be complete and beautiful.
> >
> > UTF-8 terminals (and variable-encoding terminals) already exist,
> > gnome-terminal is one of them. They are, of course, bloated pigs, but I
> > would rather have the bloat and idiosyncrasy in the user interface where
> > it belongs.
>
> Yes, I am using it right now. The fancy characters work well in it.
> Problem is, sometimes I have to use a non-UTF-8 terminal, and I would
> naturally like to access my files in the same way.

Basically I think this is just a matter of modifying telnetd and sshd
so that for the display they follow the user's locale, at least in
cooked mode.

Does anyone have a good grasp what the exact semantics should be and
where the translation should happen? I know we need to delay the
translation as long as possible so we can get binary streams flowing
through these protocols.

I guess my question is when do we know the information is going to
a terminal so we should translate it?

Eric
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-22 15:32 ` Eric W. Biederman
@ 2004-02-22 16:28 ` Jamie Lokier
2004-02-22 21:53 ` Eric W. Biederman
0 siblings, 1 reply; 50+ messages in thread
From: Jamie Lokier @ 2004-02-22 16:28 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Eric W. Biederman wrote:
> I guess my question is when do we know the information is going to
> a terminal so we should translate it?

When a program is writing to a terminal device, then we know it's
going to a terminal _or_ to a program which is pretending to be one
(pseudo-terminal). Either way, the behaviour should be the same.

The "screen" program can be used to do translation, although it's a
rather cumbersome way to go about it, and it has other effects which
are annoying (at least one key is always designated for "screen" commands).

-- Jamie
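From a user-space program's side, the test described above (is this file descriptor a terminal device, real or pseudo?) is the standard isatty() check. A small sketch, with the helper name should_translate being hypothetical; where the translation itself would live (kernel, sshd, a proxy) is exactly what the thread leaves open.

```python
import os


def should_translate(fd: int) -> bool:
    """Return True when fd refers to a terminal device (tty or pty)."""
    return os.isatty(fd)


# A pipe is not a terminal, so bytes written to it would be passed
# through untouched, keeping the channel 8-bit clean.
read_end, write_end = os.pipe()
pipe_needs_translation = should_translate(write_end)
os.close(read_end)
os.close(write_end)
```

A program would typically call this on its standard output, for example should_translate(1), and translate only when the answer is true.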
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-22 16:28 ` Jamie Lokier
@ 2004-02-22 21:53 ` Eric W. Biederman
0 siblings, 0 replies; 50+ messages in thread
From: Eric W. Biederman @ 2004-02-22 21:53 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Alex Belits, Måns Rullgård, linux-kernel

Jamie Lokier <jamie@shareable.org> writes:
> Eric W. Biederman wrote:
> > I guess my question is when do we know the information is going to
> > a terminal so we should translate it?
>
> When a program is writing to a terminal device, then we know it's
> going to a terminal _or_ to a program which is pretending to be one
> (pseudo-terminal). Either way, the behaviour should be the same.
>
> The "screen" program can be used to do translation, although it's a
> rather cumbersome way to go about it, and it has other effects which
> are annoying (at least one key is always designated for "screen" commands).

Right. At this point I am not worried about temporary solutions. I
want to pin down how things should be implemented. So the user space
programs can be fixed.

Pardon me while I think aloud to frame the problem.

First it is worth noting that the existing practice is that ttys
always use the character set encoding of the user. Even X cut and
paste frequently abuses the iso8859-1 range, and uses the native
character set encoding instead of iso8859-1.

Now the work is how to get multiple locales to play nicely with each
other. utf-8 and unicode are convenient for that as they preserve the
existing assumptions that terminals, filenames, and text files are
all using the same character set encoding, even when multiple locales
are involved.

So within one machine utf-8 solves the multiple locale problem.

The problem has now moved to interoperability between machines. Since
multiple machines have different upgrade cycles, and are in different
administrative domains everyone does not move to utf-8 at the same
time.

When we add the assertion that all I/O going through a terminal device
is in the native locale we break 8bit transparency. This holds true
in some instances when both sides use the same character set encoding,
such as utf8.

There are some mitigating factors to this. ssh already documents
pseudo tty's as potentially breaking 8 bit transparency. And
applications that require ttys for stdin/stdout are most likely
interactive. Interactive programs are either character based, or
broken.

Being an unclean channel for pipes will affect at least XMODEM,
YMODEM, and ZMODEM protocols, and possibly ppp. These programs
already know how to avoid problem characters and because ascii is a
common subset of most character set encodings the effect should be no
worse than a line that is not 8 bit clean. ssh at least has explicit
options to allocate or not allocate a pseudo-tty so getting an 8 bit
clean data path is not a problem with ssh.

The rule ``All data passing through a pseudo-tty is in the
character set encoding specified by the locale of the owner of the
tty'' seems both reasonable and no significant change from the current
status quo.

Now how does this get implemented?

On the wire between two machines I recommend passing unicode
characters. Unicode guarantees no round trip loss for any of its
member character sets, and it reduces everything to one set of
translation tables.

By convention glibc stores unicode values in wchar_t. mbsrtowcs will
convert multibyte strings to internal wide characters, based on the
current locale. wcstombs will do the opposite. So going between
unicode and the character set encoding of the current locale is
straightforward.

How do we convert the applications?

There are only four cases I can think of where we connect to a remote
system with terminal semantics.
1) Directly connected serial terminals.
2) telnetd
3) rshd
4) sshd

To my knowledge all of their protocols just pass through characters
and are neutral.
So changing these feels like a protocol extension, ouch!

Those are the programs that bridge multiple administrative domains,
and they do deal with pseudo ttys so they are where something needs to
happen, to support different character set encodings on different
machines.

If everyone just switches over to using utf-8 even the above cases are
fine. So if there is a reasonable expectation that everyone will
change to using utf-8 in the near future even those programs don't
need to change.

Given the delay in changing protocols I propose 2 simple programs:
sh-utf8 and utf8-tty. The first runs a command converting stdout and
stderr from utf8 to the current locale, and converting stdin into
utf8. The second creates a pseudo tty and relays to its controlling
tty, assuming the controlling tty uses utf8 and its tty uses the
current locale.

Looking around there already is a TTYConv program that seems to fill
this niche, except you must specify the character set encodings
manually.
http://bedroomlan.dyndns.org/~alexios/coding_ttyconv.html

Comments?

Eric
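The stream conversion that an sh-utf8 style filter would perform can be sketched as follows. This is an assumption-laden illustration, not the proposed program: the function name transcode_chunks is hypothetical, the encodings are passed in explicitly rather than taken from the locale, and Python's codecs module stands in for the glibc conversion routines. The detail worth showing is that a relay reads arbitrary-sized chunks, so a multibyte character may be split across two reads; an incremental decoder buffers the partial sequence instead of reporting an error.

```python
import codecs


def transcode_chunks(chunks, src_encoding, dst_encoding):
    """Yield chunks re-encoded from src_encoding to dst_encoding.

    Unicode is the pivot: each chunk is decoded to text, then encoded
    again. Undecodable or unencodable data is replaced, not fatal.
    """
    decoder = codecs.getincrementaldecoder(src_encoding)(errors="replace")
    encoder = codecs.getincrementalencoder(dst_encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)  # may hold back an incomplete tail
        if text:
            yield encoder.encode(text)
    # Flush whatever is still buffered once the input ends.
    tail = encoder.encode(decoder.decode(b"", final=True), final=True)
    if tail:
        yield tail


# UTF-8 "z\u00f6e" with the two-byte sequence split across reads.
latin1_out = b"".join(
    transcode_chunks([b"z\xc3", b"\xb6e"], "utf-8", "latin-1"))

# EUC-JP to Shift_JIS via Unicode, split in the middle of a character.
euc = "\u65e5\u672c\u8a9e".encode("euc_jp")
sjis_out = b"".join(
    transcode_chunks([euc[:1], euc[1:]], "euc_jp", "shift_jis"))
```

A real relay would additionally have to handle both directions at once (stdin one way, stdout and stderr the other) and pick the encodings from the locale on each side.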
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 21:06 ` Alex Belits
2004-02-17 21:47 ` Jamie Lokier
@ 2004-02-18 7:23 ` Marc Lehmann
1 sibling, 0 replies; 50+ messages in thread
From: Marc Lehmann @ 2004-02-18 7:23 UTC (permalink / raw)
To: linux-kernel

On Tue, Feb 17, 2004 at 02:06:21PM -0700, Alex Belits <abelits@phobos.illtel.denver.co.us> wrote:
> UTF-8 terminals (and variable-encoding terminals) already exist,
> gnome-terminal is one of them. They are, of course, bloated pigs, but I

rxvt-unicode (mixed fonts, bad complex script), and mlterm (no mixed
fonts, very good complex script support), are not at all bloated, have a
_much_ smaller memory footprint than xterm and are even faster on text
output and scrolling complex scripts than xterm (by a factor of two).

(Of course, gnome-terminal is bloated. loading it requires 45MB of main
memory here and then it's still 5-10 times slower than xterm).

That UTF-8/Unicode in any way means bloated (I know you did not directly
imply this) is a widely circulating but wrong idea nowadays.

--
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 20:57 ` Jamie Lokier
2004-02-17 21:06 ` Alex Belits
@ 2004-02-17 21:23 ` Matthew Kirkwood
1 sibling, 0 replies; 50+ messages in thread
From: Matthew Kirkwood @ 2004-02-17 21:23 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Måns Rullgård, linux-kernel

On Tue, 17 Feb 2004, Jamie Lokier wrote:
> No, I think hacking the terminal I/O is the best bet here. Then _all_
> programs which currently work with UTF-8 terminals, which is rapidly
> becoming most of them, will work the same with both kinds of terminal,
> and the illusion of perfection will be complete and beautiful.

Yep. A charset-translating tty proxy, a little like screen or
detachtty is what you want.

I wonder if there's an SSH client or server which can do that.

Matthew.
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 16:32 ` Linus Torvalds
2004-02-17 16:46 ` Jamie Lokier
@ 2004-02-17 16:54 ` Stefan Smietanowski
2004-02-18 1:27 ` Hans Reiser
1 sibling, 1 reply; 50+ messages in thread
From: Stefan Smietanowski @ 2004-02-17 16:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Marc Lehmann, Jamie Lokier, viro, Linux kernel

Hi Linus.

>>Because there is a fundamental difference between file contents and
>>filenames. Filenames are supposed to be text.
>
> I think this is actually the fundamental point where we disagree.
>
> You think of filenames as something the user types in, and that is
> "readable text". And I don't.
>
> I think the filenames are just ways for a _program_ to look up stuff, and
> the human readability is a secondary thing (it's "polite", but not a
> fundamental part of their meaning).
>
> So the same way I think text is good in config files and I dislike binary
> blobs (hey, look at /proc), I think readable filenames are good. But that
> doesn't mean that they have to be readable. I can well imagine encoding
> meta-data in the filename for some database that uses the filesystem as
> its backing store and generates files for large blobs. And then there
> would be little if any "goodness" to keeping the filenames readable.

Just look at Mozilla's cache... They may have turned the blob into
ascii but it's still a blob.

// Stefan
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-17 16:54 ` Stefan Smietanowski
@ 2004-02-18 1:27 ` Hans Reiser
2004-02-18 2:08 ` Robin Rosenberg
0 siblings, 1 reply; 50+ messages in thread
From: Hans Reiser @ 2004-02-18 1:27 UTC (permalink / raw)
To: Stefan Smietanowski
Cc: Linus Torvalds, Marc Lehmann, Jamie Lokier, viro, Linux kernel

ReiserFS 6 plans to allow files to be associated with arbitrary files
and found by those associations. Some of those files will consist of
ascii keywords, some will be icon images, etc..... Human readability
should not be considered fundamental to a name component, especially
since programs with no interest in readability may be the only direct
users of the name.

Hans
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 1:27 ` Hans Reiser
@ 2004-02-18 2:08 ` Robin Rosenberg
2004-02-18 11:06 ` Jamie Lokier
0 siblings, 1 reply; 50+ messages in thread
From: Robin Rosenberg @ 2004-02-18 2:08 UTC (permalink / raw)
To: Hans Reiser
Cc: Stefan Smietanowski, Linus Torvalds, Marc Lehmann, Jamie Lokier, viro, Linux kernel

On Wednesday 18 February 2004 02.27, Hans Reiser wrote:
> ReiserFS 6 plans to allow files to be associated with arbitrary files
> and found by those associations. Some of those files will consist of
> ascii keywords, some will be icon images, etc..... Human readability
> should not be considered fundamental to a name component, especially
> since programs with no interest in readability may be the only direct
> users of the name.

If the user never sees a name, it doesn't matter. However the user actually
sees and reads the filenames in /home, portable media, network devices and
lots of places.

However, when a user has named a component those characters are those that
are important to the user because those form an "image" (since you
introduced the term) or "sound" that the user remembers and associates with
the content.

A character is the simplest form of image so it should always look the same.

-- robin
* Re: UTF-8 practically vs. theoretically in the VFS API
2004-02-18 2:08 ` Robin Rosenberg
@ 2004-02-18 11:06 ` Jamie Lokier
0 siblings, 0 replies; 50+ messages in thread
From: Jamie Lokier @ 2004-02-18 11:06 UTC (permalink / raw)
To: Robin Rosenberg
Cc: Hans Reiser, Stefan Smietanowski, Linus Torvalds, Marc Lehmann, viro, Linux kernel

Robin Rosenberg wrote:
> A character is the simplest form of image so it should always look the same.

People who need the computer to _speak_ names need language or
phonetic information attached to a name, for it to be spoken properly.

On this, Alex Belits has a good point. It's all very well
standardising on UTF-8 so every name can be displayed nicely. That is
incomplete for a user who needs "ls" to work audibly, though.

In practice, such a user configures their machine to assume a
particular language, or guess it with bias to the one they use most
often.

That is, in some ways, the same problem as having a mixture of
filenames in an unknown character encoding, except that UTF-8 doesn't
solve it.

-- Jamie
Thread overview: 50+ messages
2004-02-23 11:35 UTF-8 practically vs. theoretically in the VFS API Norman Diamond
[not found] ` <fa.ip45pqg.i26oru@ifi.uio.no>
2004-02-23 19:13 ` Junio C Hamano
[not found] <04Feb13.163954est.41760@gpu.utcc.utoronto.ca>
2004-02-14 23:06 ` JFS default behavior Robin Rosenberg
2004-02-14 23:29 ` viro
2004-02-15 0:07 ` Robin Rosenberg
2004-02-15 2:41 ` Linus Torvalds
2004-02-16 18:36 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 18:49 ` Linus Torvalds
2004-02-16 19:26 ` UTF-8 practically vs. theoretically in the VFS API Jeff Garzik
2004-02-16 19:48 ` John Bradford
2004-02-16 19:48 ` Linus Torvalds
2004-02-16 20:20 ` Marc Lehmann
2004-02-16 20:26 ` Linus Torvalds
2004-02-18 2:49 ` Rob Landley
2004-02-16 20:21 ` bert hubert
2004-02-16 20:33 ` Marc Lehmann
2004-02-18 2:58 ` H. Peter Anvin
2004-02-18 3:13 ` Linus Torvalds
2004-02-18 3:22 ` H. Peter Anvin
2004-02-18 3:30 ` Linus Torvalds
2004-02-18 5:30 ` H. Peter Anvin
2004-02-18 10:29 ` Robin Rosenberg
2004-02-18 11:49 ` Tomas Szepe
2004-02-18 11:59 ` Robin Rosenberg
2004-02-18 12:05 ` Tomas Szepe
2004-02-18 12:34 ` Robin Rosenberg
2004-02-18 15:35 ` Linus Torvalds
2004-02-18 19:47 ` Tomas Szepe
2004-02-18 20:01 ` H. Peter Anvin
2004-02-18 21:22 ` Robin Rosenberg
2004-02-18 21:42 ` H. Peter Anvin
2004-02-18 11:24 ` Jamie Lokier
2004-02-18 11:33 ` Jamie Lokier
2004-02-18 16:47 ` H. Peter Anvin
2004-02-18 19:59 ` Linus Torvalds
2004-02-18 20:08 ` H. Peter Anvin
2004-02-18 7:25 ` bert hubert
2004-02-16 20:16 ` Marc Lehmann
2004-02-16 20:20 ` Jeff Garzik
2004-02-16 21:10 ` viro
2004-02-17 7:18 ` jw schultz
2004-02-17 7:42 ` Nick Piggin
2004-02-16 20:03 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Marc Lehmann
2004-02-16 20:23 ` Linus Torvalds
2004-02-16 22:26 ` Jamie Lokier
2004-02-16 22:40 ` Linus Torvalds
2004-02-17 7:14 ` Lehmann
2004-02-17 11:20 ` UTF-8 practically vs. theoretically in the VFS API Helge Hafting
2004-02-17 15:56 ` UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior) Linus Torvalds
[not found] ` <20040217161111.GE8231@schmorp.de>
2004-02-17 16:32 ` Linus Torvalds
2004-02-17 16:46 ` Jamie Lokier
2004-02-17 19:00 ` UTF-8 practically vs. theoretically in the VFS API Måns Rullgård
2004-02-17 20:57 ` Jamie Lokier
2004-02-17 21:06 ` Alex Belits
2004-02-17 21:47 ` Jamie Lokier
2004-02-22 15:32 ` Eric W. Biederman
2004-02-22 16:28 ` Jamie Lokier
2004-02-22 21:53 ` Eric W. Biederman
2004-02-18 7:23 ` Marc Lehmann
2004-02-17 21:23 ` Matthew Kirkwood
2004-02-17 16:54 ` Stefan Smietanowski
2004-02-18 1:27 ` Hans Reiser
2004-02-18 2:08 ` Robin Rosenberg
2004-02-18 11:06 ` Jamie Lokier