* Re: UTF-8 filenames
@ 2004-02-22 12:30 Norman Diamond
2004-02-22 20:45 ` Jamie Lokier
0 siblings, 1 reply; 9+ messages in thread
From: Norman Diamond @ 2004-02-22 12:30 UTC (permalink / raw)
To: linux-kernel
kernel@mikebell.org wrote:
> So then, just about everyone agrees that if you've got a filename with
> non-ASCII characters, you should pass it to creat() as UTF-8. You have
> to pass it as something, individual encodings like BIG5 and EUC-JP
> are unacceptable, and UCS-4's benefits over UTF-8 (simplicity and in
> VERY rare cases storage size reductions) aren't worth the stuff it
> breaks. Correct?
Correct except for the following cases. Unix users for more than 20 years
have been creating filenames encoded in EUC-JP or SJIS (yes sadly some Unix
systems used SJIS). I don't know how long BIG5 and Korean filenames have
been supported in Unix but it's probably not much different. Consider
converting all your ASCII filenames to UTF-16. Let everyone share the
short-term pain for the long-term gain. When you get everyone to agree on
UTF-16, it will be ugly, but it will be equal for everyone.
By the way, another subthread mentioned that stty puts some stuff in the
kernel that could be done in user space. In Unix systems the same is true
for IMEs, stty options specify the encoding of the output of an IME (e.g.
EUC-JP or SJIS, which then gets forwarded as input to shells, applications,
etc.), and whether a single backspace (or whatever character deletion
character) deletes an entire input character instead of just deleting a
single byte, etc. I keep forgetting to see if Linux has the same stty
options. I haven't needed to set them with stty because if I need to use a
different locale then I just open a new terminal emulator window using that
locale.
I don't have time even to follow all of this thread, so if anyone has
questions then CC me personally. I don't know if I'll have time to answer
either, but I'll try.
> As I see it, there's no way for the kernel to deal with all the legacy
> filenames out there. There's no way the kernel can magically fix them.
That's true. Some options of mount and some options of stty can be moved to
user space, but they will always need to be available.
By the way in Windows 98 it's really neat to share a disk folder across the
network and let clients with different code pages create files. The host
where the folder is stored can't even delete some of the files that get
created.
* Re: UTF-8 filenames
2004-02-22 12:30 UTF-8 filenames Norman Diamond
@ 2004-02-22 20:45 ` Jamie Lokier
2004-02-22 23:35 ` Norman Diamond
0 siblings, 1 reply; 9+ messages in thread
From: Jamie Lokier @ 2004-02-22 20:45 UTC (permalink / raw)
To: Norman Diamond; +Cc: linux-kernel
Norman Diamond wrote:
> Consider
> converting all your ASCII filenames to UTF-16. Let everyone share the
> short-term pain for the long-term gain. When you get everyone to agree on
> UTF-16, it will be ugly, but it will be equal for everyone.
UTF-8 is the only sane universal encoding in unix.
UTF-16 is not an option; it's not POSIX compatible, it won't work with
the assumptions made by _all_ unix programs that deal with paths, and
it won't be useful at all in a unix environment without rewriting
*every single program*.
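One concrete instance of those path assumptions: POSIX path APIs take NUL-terminated byte strings and split components on the '/' byte (0x2F). The following is only an illustrative sketch in Python, not part of the original message; the helper name `path_unsafe_bytes` is made up for this example. It shows that UTF-16 output can contain both bytes, while UTF-8 output for the same text contains neither.

```python
def path_unsafe_bytes(name: str, encoding: str) -> set:
    """Return which of the bytes 0x00 (NUL) and 0x2F ('/') occur in the encoded name."""
    encoded = name.encode(encoding)
    return {b for b in (0x00, 0x2F) if b in encoded}


ascii_name = "readme"
latin_name = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK, chosen for illustration

# UTF-16 pads every ASCII character with a zero byte, and some non-ASCII
# characters contain the byte 0x2F as one half of their 16-bit code unit.
utf16_ascii = path_unsafe_bytes(ascii_name, "utf-16-le")
utf16_latin = path_unsafe_bytes(latin_name, "utf-16-le")

# In UTF-8, ASCII text is unchanged and every byte of a multi-byte sequence
# is 0x80 or higher, so neither NUL nor '/' can appear unexpectedly.
utf8_ascii = path_unsafe_bytes(ascii_name, "utf-8")
utf8_latin = path_unsafe_bytes(latin_name, "utf-8")

print(utf16_ascii, utf16_latin, utf8_ascii, utf8_latin)
```

A program that stops reading a name at the first NUL byte would see only the first byte of a UTF-16 name.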
Also, what would be the point? UTF-16 as an encoding is about as
complex as UTF-8 (characters in UTF-16 are 2-4 bytes long depending on
the character), so it's equally hard to program with correctly.
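As a small sketch of the variable-length point (Python is used purely for illustration, and the sample characters and the name `encoded_lengths` are arbitrary choices, not from the thread), the byte length of single characters in both encodings can be compared like this:

```python
# Sample characters, one from each UTF-8 length class.
samples = {
    "A": "\u0041",            # ASCII letter
    "e-acute": "\u00e9",      # Latin-1 range
    "kanji": "\u65e5",        # CJK ideograph
    "g-clef": "\U0001d11e",   # outside the Basic Multilingual Plane
}


def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


lengths = {label: encoded_lengths(ch) for label, ch in samples.items()}

for label, (utf8_len, utf16_len) in lengths.items():
    print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Neither encoding gives a fixed number of bytes per character, so code that indexes by byte offset has to handle multi-unit sequences in both.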
> By the way, another subthread mentioned that stty puts some stuff in the
> kernel that could be done in user space. In Unix systems the same is true
> for IMEs, stty options specify the encoding of the output of an IME (e.g.
> EUC-JP or SJIS, which then gets forwarded as input to shells, applications,
> etc.), and whether a single backspace (or whatever character deletion
> character) deletes an entire input character instead of just deleting a
> single byte, etc. I keep forgetting to see if Linux has the same stty
> options. I haven't needed to set them with stty because if I need to use a
> different locale then I just open a new terminal emulator window using that
> locale.
Do you have a list or description of the specific stty options that
are used?
-- Jamie
* Re: UTF-8 filenames
2004-02-22 20:45 ` Jamie Lokier
@ 2004-02-22 23:35 ` Norman Diamond
2004-02-23 6:10 ` Robin Rosenberg
2004-02-23 12:00 ` Jamie Lokier
0 siblings, 2 replies; 9+ messages in thread
From: Norman Diamond @ 2004-02-22 23:35 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-kernel
Jamie Lokier replied to me:
> > Consider
> > converting all your ASCII filenames to UTF-16. Let everyone share the
> > short-term pain for the long-term gain. When you get everyone to agree on
> > UTF-16, it will be ugly, but it will be equal for everyone.
>
> UTF-8 is the only sane universal encoding in unix.
That's a bit beside the point. I was replying to the assertion that
everyone agreed to use UTF-8. (And particularly, for large character sets.)
> UTF-16 is not an option;
Of course. Perhaps my use of reductio al absurdum was unclear. I was
trying to show that UTF-8, despite its sanity, is not universally agreeable.
The actual reason is because it came late to the scene (around 20 years ago)
and it is not backwards compatible. But to make the point, I compared it
with UTF-16 which is equally not universally agreeable.
> it's not POSIX compatible,
OK, UTF-8 has one less reason than UTF-16 has, for being not universally
agreeable. But the biggest reason still remains, as mentioned above.
> > By the way, another subthread mentioned that stty puts some stuff in the
> > kernel that could be done in user space. In Unix systems the same is true
> > for IMEs, stty options specify the encoding of the output of an IME (e.g.
> > EUC-JP or SJIS, which then gets forwarded as input to shells, applications,
> > etc.), and whether a single backspace (or whatever character deletion
> > character) deletes an entire input character instead of just deleting a
> > single byte, etc. I keep forgetting to see if Linux has the same stty
> > options. I haven't needed to set them with stty because if I need to use a
> > different locale then I just open a new terminal emulator window using that
> > locale.
>
> Do you have a list or description of the specific stty options that
> are used?
Well, I thought I described them as I saw them used in Unix. I no longer
have access to machines running commercial Unix systems, but some of the
stty options were the way I did describe. I have a feeling that System V
might have implemented them slightly differently from BSD-based systems, but
regardless, the same functionality was pretty much "universally" needed and
implemented.
If you're asking whether I noticed similar stty options in Linux, I didn't
notice because of the reason mentioned (I just opened another terminal
emulator window using the locale that I temporarily needed). But I'll try
to remember to look next weekend. Sorry, I'm leaving for work in a minute
and can't look now.
* Re: UTF-8 filenames
2004-02-22 23:35 ` Norman Diamond
@ 2004-02-23 6:10 ` Robin Rosenberg
2004-02-23 11:34 ` Norman Diamond
2004-02-23 12:00 ` Jamie Lokier
1 sibling, 1 reply; 9+ messages in thread
From: Robin Rosenberg @ 2004-02-23 6:10 UTC (permalink / raw)
To: Norman Diamond; +Cc: Jamie Lokier, linux-kernel
On Monday 23 February 2004 00.35, Norman Diamond wrote:
> Of course. Perhaps my use of reductio al absurdum was unclear. I was
> trying to show that UTF-8, despite its sanity, is not universally agreeable.
> The actual reason is because it came late to the scene (around 20 years ago)
> and it is not backwards compatible.
Even later, it's from 1992 I believe and a standard even later. That is long after
we went from national variants of ASCII to ISO-Latin-1. If I recall it correctly it was
the years around 1987 that we started having multiple encodings for text.
-- robin
* Re: UTF-8 filenames
2004-02-23 6:10 ` Robin Rosenberg
@ 2004-02-23 11:34 ` Norman Diamond
2004-02-23 12:15 ` Robin Rosenberg
0 siblings, 1 reply; 9+ messages in thread
From: Norman Diamond @ 2004-02-23 11:34 UTC (permalink / raw)
To: Robin Rosenberg; +Cc: Jamie Lokier, linux-kernel
Robin Rosenberg replied to me:
> > Of course. Perhaps my use of reductio al absurdum was unclear.
Actually Mark Hahn got me on that, it should be ad, but I get too many ads
already. Now reminiscing about the days before I posted to LKML, because my
inbox was less than 75% spams in those days.
> > I was trying to show that UTF-8, despite its sanity, is not universally
> > agreeable. The actual reason is because it came late to the scene
> > (around 20 years ago) and it is not backwards compatible.
>
> Even later, it's from 1992 I believe
Oh, then it was even later to the scene than I thought.
> and a standard even later. That is long after we went from national
> variants of ASCII to ISO-Latin-1. If I recall it correctly it was the
> years around 1987 that we started having multiple encodings for text.
SJIS and EUC both existed before 1987.
* Re: UTF-8 filenames
2004-02-22 23:35 ` Norman Diamond
2004-02-23 6:10 ` Robin Rosenberg
@ 2004-02-23 12:00 ` Jamie Lokier
2004-02-23 23:42 ` Norman Diamond
1 sibling, 1 reply; 9+ messages in thread
From: Jamie Lokier @ 2004-02-23 12:00 UTC (permalink / raw)
To: Norman Diamond; +Cc: linux-kernel
Norman Diamond wrote:
> > Do you have a list or description of the specific stty options that
> > are used?
>
> [...] I'll try to remember to look next weekend.
Thanks; that would be useful.
-- Jamie
* Re: UTF-8 filenames
2004-02-23 11:34 ` Norman Diamond
@ 2004-02-23 12:15 ` Robin Rosenberg
0 siblings, 0 replies; 9+ messages in thread
From: Robin Rosenberg @ 2004-02-23 12:15 UTC (permalink / raw)
To: Norman Diamond; +Cc: Jamie Lokier, linux-kernel
On Monday 23 February 2004 12.34, Norman Diamond wrote:
> SJIS and EUC both existed before 1987.
Well, I'm narrow minded, but one bit wider than the
ASCII world :-)
-- robin
* Re: UTF-8 filenames
2004-02-23 12:00 ` Jamie Lokier
@ 2004-02-23 23:42 ` Norman Diamond
2004-02-24 0:02 ` Jamie Lokier
0 siblings, 1 reply; 9+ messages in thread
From: Norman Diamond @ 2004-02-23 23:42 UTC (permalink / raw)
To: Jamie Lokier; +Cc: linux-kernel
Jamie Lokier wrote:
> > > Do you have a list or description of the specific stty options that
> > > are used?
> >
> > [...] I'll try to remember to look next weekend.
>
> Thanks; that would be useful.
Actually not. I had a few minutes to look yesterday at a Red Hat 7.3 system
at work, and it seems that Linux stty had none of the necessary options. In
keyboard input with an IME active, the backspace key deleted a single byte
instead of an entire character, exactly one of the problems that commercial
Unix (and MS-DOS) systems solved 20 years ago.
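The byte-versus-character erase problem described here can be sketched as follows. This is a user-space illustration in Python only, not how any tty driver or stty option is implemented; the function names are hypothetical, and "character" in this sketch means one code point.

```python
def erase_last_byte(buf: bytes) -> bytes:
    """Erase as a byte-oriented line editor would: drop one byte."""
    return buf[:-1]


def erase_last_char(buf: bytes, encoding: str) -> bytes:
    """Erase one whole character, which requires knowing the input encoding."""
    text = buf.decode(encoding)
    return text[:-1].encode(encoding)


# Two kanji typed through an IME that emits EUC-JP: two bytes per character.
line = "\u65e5\u672c".encode("euc_jp")

after_char_erase = erase_last_char(line, "euc_jp")
after_byte_erase = erase_last_byte(line)

print(len(line), len(after_char_erase), len(after_byte_erase))

# The byte-wise result ends in half a character and no longer decodes.
try:
    after_byte_erase.decode("euc_jp")
    byte_erase_is_valid = True
except UnicodeDecodeError:
    byte_erase_is_valid = False

print(byte_erase_is_valid)
```

The character-aware version has to be told the encoding, which is the information the stty options on those Unix systems supplied.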
The reason I think my checking was not useful is that I think you already
knew that Linux didn't have it :-)
Sorry I cannot check details on which bits are used by commercial Unix
systems. As mentioned previously, I no longer have access to any such
systems. Some vendors documented the options in the "man stty" pages in
both Japanese and English, but other vendors only documented them in the
"man stty" page in Japanese.
* Re: UTF-8 filenames
2004-02-23 23:42 ` Norman Diamond
@ 2004-02-24 0:02 ` Jamie Lokier
0 siblings, 0 replies; 9+ messages in thread
From: Jamie Lokier @ 2004-02-24 0:02 UTC (permalink / raw)
To: Norman Diamond; +Cc: linux-kernel
Norman Diamond wrote:
> The reason I think my checking was not useful is that I think you already
> knew that Linux didn't have it :-)
Yes, it already came up earlier in the thread. Linux is very likely
to get an option to make the delete key work.
> Sorry I cannot check details on which bits are used by commercial Unix
> systems. As mentioned previously, I no longer have access to any such
> systems. Some vendors documented the options in the "man stty" pages in
> both Japanese and English, but other vendors only documented them in the
> "man stty" page in Japanese.
Do you remember which systems you used which had English "man stty"
pages with these options? Such pages can often be looked up online.
The reason I ask is that it makes sense for any option which is added
to Linux to copy existing options on another unix flavour, if they are
sensible.
-- Jamie