* Re: JFS default behavior
@ 2004-02-15 23:03 Nicolas Mailhot
2004-02-16 3:45 ` Jan Knutar
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Nicolas Mailhot @ 2004-02-15 23:03 UTC (permalink / raw)
To: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 784 bytes --]
| Linus Torvalds pointed the way of Tux :
| In short: the kernel talks bytestreams, and that implies that if you
| want to talk to the kernel, you HAVE TO USE UTF-8.
In that case :
- should the kernel allow apps to write filenames that are invalid
UTF-8 and will crash UTF-8 apps ?
- should this UTF-8 rule be noted somewhere (in a FAQ/man page/LSB spec/
whatever) so apps authors know they are supposed to read and write UTF-8
filenames and not apply locale rules to kernel objects ?
- what happens to already existing invalid UTF-8 filenames ? Should the
kernel forcibly rewrite them (in 2.7.0...) to remove legacy mess ? What
should happen if someone plug an unconverted FS in such a system
afterwards ?
These are the questions people have been asking.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: JFS default behavior
2004-02-15 23:03 JFS default behavior Nicolas Mailhot
@ 2004-02-16 3:45 ` Jan Knutar
2004-02-16 8:30 ` Nicolas Mailhot
2004-02-16 6:21 ` jw schultz
2004-02-19 10:59 ` JFS default behavior / UTF-8 filenames kernel
2 siblings, 1 reply; 18+ messages in thread
From: Jan Knutar @ 2004-02-16 3:45 UTC (permalink / raw)
To: Nicolas Mailhot, linux-kernel
> - what happens to already existing invalid UTF-8 filenames ? Should
> the kernel forcibly rewrite them (in 2.7.0...) to remove legacy mess
> ? What should happen if someone plug an unconverted FS in such a
> system afterwards ?
What I would like would be a userspace tool, that would recurse and
convert filename encodings from specified locale to UTF-8. Something
like "any2utf8 -from iso8859-1 -recurse /mnt/myoldmp3disk".
Does anyone know if such a tool exists already?
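For illustration only, a minimal Python sketch of what such a tool might do. The names `to_utf8` and `convert_tree` are hypothetical, not an existing utility, and the sketch assumes every non-UTF-8 name really is in the one encoding you specify. It works on bytes throughout, walks bottom-up so children are renamed before their parent directories, and defaults to a dry run.

```python
import os


def to_utf8(name: bytes, source_encoding: str) -> bytes:
    """Return name re-encoded as UTF-8, assuming it is in source_encoding."""
    try:
        name.decode("utf-8")
        return name  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        return name.decode(source_encoding).encode("utf-8")


def convert_tree(root: bytes, source_encoding: str, dry_run: bool = True):
    """Collect (old, new) path pairs under root; rename unless dry_run."""
    planned = []
    # Bottom-up walk: a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new = to_utf8(name, source_encoding)
            if new == name:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, new)
            planned.append((src, dst))
            # Never overwrite an existing entry.
            if not dry_run and not os.path.lexists(dst):
                os.rename(src, dst)
    return planned
```

Usage would be along the lines of `convert_tree(b"/mnt/myoldmp3disk", "iso8859-1")`. A legacy name that happens to be valid UTF-8 by accident is left alone, so the result still needs a human look.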
* Re: JFS default behavior
2004-02-15 23:03 JFS default behavior Nicolas Mailhot
2004-02-16 3:45 ` Jan Knutar
@ 2004-02-16 6:21 ` jw schultz
2004-02-16 15:55 ` Jamie Lokier
2004-02-19 10:59 ` JFS default behavior / UTF-8 filenames kernel
2 siblings, 1 reply; 18+ messages in thread
From: jw schultz @ 2004-02-16 6:21 UTC (permalink / raw)
To: linux-kernel
On Mon, Feb 16, 2004 at 12:03:03AM +0100, Nicolas Mailhot wrote:
> | Linus Torvalds pointed the way of Tux :
>
> | In short: the kernel talks bytestreams, and that implies that if you
> | want to talk to the kernel, you HAVE TO USE UTF-8.
>
> In that case :
> - should the kernel allow apps to write filenames that are invalid
> UTF-8 and will crash UTF-8 apps ?
Yes. The kernel interface specifies it as a bytestream with
0x00 and 0x2f having special meaning. That is a constraint,
not a policy. It is user space that determines the policy
of UTF-8.
> UTF-8 and will crash UTF-8 apps ?
Fix the broken apps. Crashing because of "invalid" UTF-8 is
no more excusable than crashing because of a string longer
than expected (buffer overrun). Filenames as read from the
filesystem should be treated just like any other untrusted
input.
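As a rough sketch of the "untrusted input" approach (Python, with hypothetical helper names `is_valid_utf8` and `display_name`): validate the bytes before treating them as text, and fall back to an escaped form for display instead of failing.

```python
def is_valid_utf8(name: bytes) -> bool:
    """True if the raw filename bytes form well-formed UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def display_name(name: bytes) -> str:
    """Printable form of a filename; invalid bytes become \\xNN escapes."""
    return name.decode("utf-8", errors="backslashreplace")
```

The raw bytes stay the real identity of the file; the decoded string is for display only and should not be passed back to open() or rename().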
> - should this UTF-8 rule be noted somewhere (in a FAQ/man page/LSB spec/
> whatever) so apps authors know they are supposed to read and write UTF-8
> filenames and not apply locale rules to kernel objects ?
Since the LSB spec describes user space it might be a
suitable place.
> - what happens to already existing invalid UTF-8 filenames ? Should the
> kernel forcibly rewrite them (in 2.7.0...) to remove legacy mess ? What
If you have a filesystem with filenames that don't conform
to your policy write userspace tools to detect and/or fix
them. If you have programs creating non-conforming
filenames, fix or rm those programs.
> kernel forcibly rewrite them (in 2.7.0...) to remove legacy mess ? What
> should happen if someone plug an unconverted FS in such a system
> afterwards ?
The kernel won't care. Any user space code that treats the
filenames as something other than bytestreams should be able
to cope with any sequence of bytes.
> These are the questions people have been asking.
OK. The questions have been asked and answered.
Asking again and again and again won't change the answer.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt
* Re: JFS default behavior
2004-02-16 3:45 ` Jan Knutar
@ 2004-02-16 8:30 ` Nicolas Mailhot
2004-02-16 8:54 ` Valdis.Kletnieks
0 siblings, 1 reply; 18+ messages in thread
From: Nicolas Mailhot @ 2004-02-16 8:30 UTC (permalink / raw)
To: Jan Knutar; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 997 bytes --]
On Mon, 16/02/2004 at 05:45 +0200, Jan Knutar wrote:
> > - what happens to already existing invalid UTF-8 filenames ? Should
> > the kernel forcibly rewrite them (in 2.7.0...) to remove legacy mess
> > ? What should happen if someone plug an unconverted FS in such a
> > system afterwards ?
>
> What I would like would be a userspace tool, that would recurse and
> convert filename encodings from specified locale to UTF-8. Something
> like "any2utf8 -from iso8859-1 -recurse /mnt/myoldmp3disk".
> Does anyone know if such a tool exists already?
One can do find + recode magic now.
The question is :
- can this be automated ?
- how can one recognise an unconverted fs ?
- how can one guess the encoding(s) that have been used before on such
an fs ?
You're assuming the situation is merely an iso8859-1 to utf-8 migration.
Far from it. The core problem is that everyone wrote whatever they damn
well pleased without considering future readers.
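To make the guessing problem concrete, here is a small Python sketch. `classify_name` and its candidate list are assumptions for illustration, not an existing tool. It can only report the first encoding in which a name happens to decode, which is much weaker than knowing what the author intended.

```python
def classify_name(name: bytes, candidates=("euc-jp", "big5", "iso8859-1")) -> str:
    """Best-effort label for the encoding of one filename."""
    if all(b < 0x80 for b in name):
        return "ascii"  # identical in every encoding considered here
    try:
        name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for encoding in candidates:
        try:
            name.decode(encoding)
            return encoding  # decodes cleanly, which is not proof
        except UnicodeDecodeError:
            continue
    return "unknown"
```

Note that iso8859-1 accepts every byte sequence, so as a last candidate it never fails, and several multi-byte encodings accept the same bytes with different meanings. That is why the order of `candidates` decides the answer, and why this cannot be fully automated.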
Cheers,
--
Nicolas Mailhot
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: JFS default behavior
2004-02-16 8:30 ` Nicolas Mailhot
@ 2004-02-16 8:54 ` Valdis.Kletnieks
0 siblings, 0 replies; 18+ messages in thread
From: Valdis.Kletnieks @ 2004-02-16 8:54 UTC (permalink / raw)
To: Nicolas Mailhot; +Cc: Jan Knutar, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1325 bytes --]
On Mon, 16 Feb 2004 09:30:41 +0100, Nicolas Mailhot said:
> You're assuming the situation is merely an iso8859-1 to utf-8 migration.
> Far from it. The core problem is that everyone wrote whatever they damn
> well pleased without considering future readers.
Given the fact that there isn't in general any way for the kernel to know what
was intended, I don't see how any kernel policy other than "NUL and / are
special, but if you use anything other than UTF-8 it will eventually come back
to haunt you" can possibly be made to work.
For that matter, I have seen actual production code that intentionally created
fairly deep directory trees and terminal file names that were basically hashes
written in radix-254 and blatted out in binary. Lots of them. The original
problem report I got was along the lines of "We installed XYZ, and the file
system appears corrupted - ls -R weirds the screen out, and 'find | wc -l' is
127,000 different than what 'df -i' reports".
I was ready to strangle the guilty party - radix-64 wouldn't have been a big
efficiency hit and at least the uuencode/base-64 charset doesn't weird your
terminal out. :)
So it's not even always possible to make the assumption that the filename is
supposed to make sense in *any* charset. This one requires fixing in some
combination of userspace and meatspace....
[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]
* Re: JFS default behavior
2004-02-16 6:21 ` jw schultz
@ 2004-02-16 15:55 ` Jamie Lokier
2004-02-17 6:47 ` jw schultz
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2004-02-16 15:55 UTC (permalink / raw)
To: jw schultz, linux-kernel
jw schultz wrote:
> If you have a filesystem with filenames that don't conform
> to your policy write userspace tools to detect and/or fix
> them. If you have programs creating non-conforming
> filenames, fix or rm those programs.
You do understand that GNU coreutils, bash etc. are among those
programs, right? As in "touch zöe.txt" creates a non-conforming
filename...
> OK. The questions have been asked and answered.
> Asking again and again and again won't change the answer.
The question of what a program like this should do has not been
answered:
perl -e 'for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }'
(NB: The prefix string is N WITH CEDILLA followed by "i-").
Hint: it mangles perfectly fine non-ASCII file names, instead of just
prefixing the prefix string. If you change the program to correctly
prepend the prefix string, then it mangles non-UTF-8 names, which is
arguably correct, but can result in you losing some files.
This _is_ a userspace problem, but it is a genuine problem for which
no good answer is yet apparent.
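One possible approach, sketched in Python rather than perl and not claimed to be the good answer: do the whole operation on bytes. The helper names `prefixed` and `prefix_all` are hypothetical. The prefix is encoded to UTF-8 once, and the existing name is never decoded, so valid UTF-8 names come out correctly prefixed and legacy names keep their bytes.

```python
import os

# N WITH CEDILLA followed by "i-", encoded once as UTF-8.
PREFIX = "ņi-".encode("utf-8")


def prefixed(name: bytes, prefix: bytes = PREFIX) -> bytes:
    """Prepend prefix without ever decoding the existing name."""
    return prefix + name


def prefix_all(directory: bytes = b".", prefix: bytes = PREFIX) -> None:
    """Rename every non-hidden entry in directory, like the glob "*" loop."""
    # A bytes argument makes os.listdir return raw bytes names.
    for name in os.listdir(directory):
        if name.startswith(b"."):
            continue
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, prefixed(name, prefix)))
```

The trade-off is that a legacy name ends up with a UTF-8 prefix and a non-UTF-8 remainder, so it is still not valid UTF-8 as a whole. No file is lost, though.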
-- Jamie
* Re: JFS default behavior
2004-02-16 15:55 ` Jamie Lokier
@ 2004-02-17 6:47 ` jw schultz
2004-02-17 21:37 ` Jamie Lokier
0 siblings, 1 reply; 18+ messages in thread
From: jw schultz @ 2004-02-17 6:47 UTC (permalink / raw)
To: linux-kernel
On Mon, Feb 16, 2004 at 03:55:34PM +0000, Jamie Lokier wrote:
> jw schultz wrote:
> > If you have a filesystem with filenames that don't conform
> > to your policy write userspace tools to detect and/or fix
> > them. If you have programs creating non-conforming
> > filenames, fix or rm those programs.
>
> You do understand that GNU coreutils, bash etc. are among those
Doesn't matter where they come from.
> programs, right? As in "touch zöe.txt" creates a non-conforming
> filename...
Your concrete example is a good one. Where did that
filename come from? It would seem to have come from the
keyboard via a tty (or simulator) which also had to display
it. I'd say this is an argument for the terminal to display
UTF-8 and convert input into UTF-8. That is something that
seems to be not consistently done as yet. Ultimately it
seems to be a responsibility of the user interface, whether
tty or GUI. Until that happens the shells might be able to
fill the gap, however poorly.
Perhaps the utilities that don't attempt to interpret
filenames should treat filenames exactly like the kernel
does.
> > OK. The questions have been asked and answered.
> > Asking again and again and again won't change the answer.
>
> The question of what a program like this should do has not been
> answered:
>
> perl -e 'for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }'
>
> (NB: The prefix string is N WITH CEDILLA followed by "i-").
>
> Hint: it mangles perfectly fine non-ASCII file names, instead of just
> prefixing the prefix string. If you change the program to correctly
> prepend the prefix string, then it mangles non-UTF-8 names, which is
> arguably correct, but can result in you losing some files.
Then, if there is incorrect behavior, is it the shell, the tty or perl
that is getting things wrong here?
> This _is_ a userspace problem, but it is a genuine problem for which
> no good answer is yet apparent.
I'll buy that. Then the first question to ask is "what is
the correct forum for resolving this".
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@pegasys.ws
Remember Cernan and Schmitt
* Re: JFS default behavior
2004-02-17 6:47 ` jw schultz
@ 2004-02-17 21:37 ` Jamie Lokier
2004-02-17 22:12 ` Linus Torvalds
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2004-02-17 21:37 UTC (permalink / raw)
To: jw schultz, linux-kernel
jw schultz wrote:
> Your concrete example is a good one. Where did that
> filename come from? It would seem to have come from the
> keyboard via a tty (or simulator) which also had to display
> it. I'd say this is an argument for the terminal to display
> UTF-8 and convert input into UTF-8. That is something that
> seems to be not consistently done as yet. Ultimately it
> seems to be a responsibility of the user interface, whether
> tty or GUI. Until that happens the shells might be able to
> fill the gap, however poorly.
Many terminals will not ever display UTF-8. Think: all the serial terminals.
This is why I think "stty utf8" or something along those lines would
be useful. The terminal itself doesn't have to talk UTF-8; however,
the applications talking with /dev/tty would always see UTF-8.
That seems to solve most of the practical user interface problems of
the command line, in one single clean place.
-- Jamie
* Re: JFS default behavior
2004-02-17 21:37 ` Jamie Lokier
@ 2004-02-17 22:12 ` Linus Torvalds
2004-02-18 9:59 ` Jamie Lokier
0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2004-02-17 22:12 UTC (permalink / raw)
To: Jamie Lokier; +Cc: jw schultz, linux-kernel
On Tue, 17 Feb 2004, Jamie Lokier wrote:
>
> Many terminals will not ever display UTF-8. Think: all the serial terminals.
>
> This is why I think "stty utf8" or something along those lines would
> be useful. The terminal itself doesn't have to talk UTF-8; however,
> the applications talking with /dev/tty would always see UTF-8.
>
> That seems to solve most of the practical user interface problems of
> the command line, in one single clean place.
Doesn't "screen" already do this? I don't think you want to have the
locale handling in the kernel, along with translation of multi-key
characters (and from things like CJK terminals? I don't know what format
they send). Sounds like you should use a user-mode thing that knows about
locales...
Linus
* Re: JFS default behavior
2004-02-17 22:12 ` Linus Torvalds
@ 2004-02-18 9:59 ` Jamie Lokier
2004-02-18 15:54 ` Linus Torvalds
0 siblings, 1 reply; 18+ messages in thread
From: Jamie Lokier @ 2004-02-18 9:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: jw schultz, linux-kernel
Linus Torvalds wrote:
> Doesn't "screen" already do this? I don't think you want to have the
> locale handling in the kernel, along with translation of multi-key
> characters (and from things like CJK terminals? I don't know what format
> they send). Sounds like you should use a user-mode thing that knows about
> locales...
Yes. I was thinking in a rather DEC VT100/Putty/xterm-centric way
for a moment; please excuse the slip.
It's irritating that logging in from the wrong kind of terminal
doesn't just provide the right "user experience" for the command line
automatically. It's also a pain that ssh doesn't inform the remote
end whether the local terminal is UTF-8, so everything seems to be
working fine until one day you discover typing "£" in an editor just
beeps. Grr.. Oh well.
These are all solvable in userspace. Then again, so were most of the
other stty options; didn't stop them from being implemented in the kernel :)
-- Jamie
* Re: JFS default behavior
2004-02-18 9:59 ` Jamie Lokier
@ 2004-02-18 15:54 ` Linus Torvalds
2004-02-18 23:58 ` Jamie Lokier
0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2004-02-18 15:54 UTC (permalink / raw)
To: Jamie Lokier; +Cc: jw schultz, linux-kernel
On Wed, 18 Feb 2004, Jamie Lokier wrote:
>
> It's irritating that logging in from the wrong kind of terminal
> doesn't just provide the right "user experience" for the command line
> automatically.
Well, you should be able to just start something "screen"-equivalent
directly by just making it your default shell or have a fix to "login".
The thing is, the kernel tty layer is happy to work with utf-8 (well,
modulo the issues of erase etc - and Andries posted that patch already,
and there are probably others like it) if your terminal supports it, but
if your terminal doesn't have CJK support internally, then you need
something to do the multi-character translations anyway in order to be
able to input them in the first place.
And that is _not_ an stty option.
Btw, from the screen man-page it appears that screen is not able to do
that either. You can put screen into utf-8 mode, but it sounds like it
just means that it passes UTF-8 through, not that it does any translation
from "latin1 vt100 to utf-8".
I think there are a few editors that actually do ("mined" looks like it
should do it).
Linus
* Re: JFS default behavior
2004-02-18 15:54 ` Linus Torvalds
@ 2004-02-18 23:58 ` Jamie Lokier
0 siblings, 0 replies; 18+ messages in thread
From: Jamie Lokier @ 2004-02-18 23:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: jw schultz, linux-kernel
Linus Torvalds wrote:
> Btw, from the screen man-page it appears that screen is not able to do
> that either. You can put screen into utf-8 mode, but it sounds like it
> just means that it passes UTF-8 through, not that it does any translation
> from "latin1 vt100 to utf-8".
Screen works nicely. Do this:
echo 'defutf8 on' >> ~/.screenrc
Then screen presents a UTF-8 interface to the shell and other
programs, regardless of what kind of terminal you connect from :)
(It's a bit overkill, no actually it's a lot overkill, and you have the
annoyance of screen intercepting at least one commonly used editing key.)
(Just remember to set the LANG environment variable to include
".UTF-8" so that screen-oriented programs know to display properly. I
do it automatically using a script which queries the current terminal,
to work around ssh not forwarding LANG).
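This is not the script mentioned above, only a hedged Python sketch of the check a program might make to decide whether the environment asks for UTF-8. The function name `locale_is_utf8` is made up. It follows the usual precedence of LC_ALL over LC_CTYPE over LANG, and reads the codeset out of the language_TERRITORY.codeset@modifier form.

```python
def locale_is_utf8(env) -> bool:
    """True if the first set locale variable names a UTF-8 codeset."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Drop any "@modifier", then take what follows the first ".".
            codeset = value.partition("@")[0].partition(".")[2]
            return codeset.lower().replace("-", "") == "utf8"
    return False
```

A caller would normally pass `os.environ`; taking the mapping as a parameter keeps the function easy to try out with hand-made values.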
> I think there are a few editors that actually do ("mined" looks like it
> should do it).
Emacs does, of course.
-- Jamie
* Re: JFS default behavior / UTF-8 filenames
2004-02-15 23:03 JFS default behavior Nicolas Mailhot
2004-02-16 3:45 ` Jan Knutar
2004-02-16 6:21 ` jw schultz
@ 2004-02-19 10:59 ` kernel
2004-02-19 14:05 ` Dave Kleikamp
2 siblings, 1 reply; 18+ messages in thread
From: kernel @ 2004-02-19 10:59 UTC (permalink / raw)
To: linux-kernel
So then, just about everyone agrees that if you've got a filename with
non-ASCII characters, you should pass it to creat() as UTF-8. You have
to pass it as something, individual encodings like BIG5 and EUC-JP
are unacceptable, and UCS-4's benefits over UTF-8 (simplicity and in
VERY rare cases storage size reductions) aren't worth the stuff it
breaks. Correct?
As I see it, there's no way for the kernel to deal with all the legacy
filenames out there. There's no way the kernel can magically fix them.
So the only thing the kernel could do for those who want to see valid
unicode is have an option to make UTF-8 only filesystems. Best would be
if it was done at mkfs time and always enforced from then on in so that
a non-UTF8 filename can never be created. Because if you want the kernel
to not pass non-UTF8 filenames back to userspace, the ONLY clean way to
do that is to make sure they're not there in the first place. You could
maybe try it with a mount=utf8only flag, but the only thing that could
do then would be to make the files with invalid filenames "disappear".
For filesystems like JFS and NTFS, I think this is the best way in the
long run, have the kernel output as UTF-8 by default, assume UTF-8
inputs, and reject non-UTF8 filenames because they can't really store
the arbitrary string of bytes model anyway.
For others which can, maybe leave it up to the filesystem creator
whether to reject non-UTF8 filenames or to accept invalid ones as well?
Either way, a well-written userspace app shouldn't barf on receiving
invalid UTF-8 from the kernel, we'll have legacy filenames around for a
good long time yet, and it's the only way to be portable to older
linuxes and other UNIXes where you definitely would not be guaranteed
valid UTF-8 no matter what new linux kernels decide.
In any case, the important part is to make sure userspace stops writing
filenames in BIG5 as soon as possible. I don't know if this can be done
nicely in libc, with libc automagically transforming the BIG5 filename
in open() to UTF-8 and the UTF-8 in readdir() to BIG5 based on the
locale, or if we have to rely on every userspace app to store filenames
in UTF-8 by themselves. But that's a decision for the glibc guys. It
doesn't affect that filenames need to start being written to the
filesystem in UTF-8 rather than other encodings, and that the only
decision the kernel has to make is whether or not to reject attempts to
create filenames which are invalid UTF-8.
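A sketch of the transcoding idea, in Python for brevity. `to_fs_name` and `from_fs_name` are hypothetical names and say nothing about how glibc would or should do it. The first converts a name from the locale encoding to UTF-8 on the way in (the open() direction), the second converts back on the way out (the readdir() direction).

```python
def to_fs_name(name: bytes, locale_encoding: str) -> bytes:
    """Locale-encoded name from the application -> UTF-8 for the filesystem."""
    return name.decode(locale_encoding).encode("utf-8")


def from_fs_name(name: bytes, locale_encoding: str) -> bytes:
    """UTF-8 name from the filesystem -> locale encoding for the application.

    Raises UnicodeDecodeError if the on-disk name is not valid UTF-8, and
    UnicodeEncodeError if the locale encoding cannot represent it.
    """
    return name.decode("utf-8").encode(locale_encoding)
```

The two error cases are the hard part of doing this transparently: a legacy name on disk, or a UTF-8 name the user's locale has no characters for, leaves the wrapper with nothing sensible to return.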
* Re: JFS default behavior / UTF-8 filenames
2004-02-19 10:59 ` JFS default behavior / UTF-8 filenames kernel
@ 2004-02-19 14:05 ` Dave Kleikamp
2004-02-19 23:47 ` kernel
0 siblings, 1 reply; 18+ messages in thread
From: Dave Kleikamp @ 2004-02-19 14:05 UTC (permalink / raw)
To: kernel; +Cc: linux-kernel
On Thu, 2004-02-19 at 04:59, kernel@mikebell.org wrote:
> For filesystems like JFS and NTFS, I think this is the best way in the
> long run, have the kernel output as UTF-8 by default, assume UTF-8
> inputs, and reject non-UTF8 filenames because they can't really store
> the arbitrary string of bytes model anyway.
Actually, I just submitted a patch to fix the default behavior of JFS to
always treat the name as an arbitrary string. The previous default
depended on the value of CONFIG_NLS_DEFAULT. Setting the mount option
iocharset=utf8 will reject non-utf8 filenames as you propose.
The arbitrary string of bytes is treated as the latin1 charset in that
it is stored as 0x00nn (in UTF2), but JFS really doesn't care what the
character set is.
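The mapping described here can be illustrated as follows. This is a Python sketch of the idea only, with made-up names `store_name` and `load_name`; it is not the JFS code. Each byte nn becomes the 16-bit unit 0x00nn, and reading back keeps only units that fit in one byte.

```python
def store_name(name: bytes) -> list:
    """Map each byte nn of a filename to the 16-bit code unit 0x00nn."""
    if b"\x00" in name or b"/" in name:
        raise ValueError("NUL and '/' are not allowed in a filename")
    return list(name)  # iterating bytes yields ints 0..255


def load_name(units) -> bytes:
    """Inverse mapping; only defined when every unit is at most 0x00ff."""
    if any(unit > 0xFF for unit in units):
        raise ValueError("name contains a code unit above 0x00ff")
    return bytes(units)
```

Because the mapping is one-to-one on 0x00 through 0xff, any byte string the caller passes in comes back unchanged, whatever character set it was meant to be in.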
> For others which can, maybe leave it up to the filesystem creator
> whether to reject non-UTF8 filenames or to accept invalid ones as well?
It's been said before, but a posix-compliant file system should accept
any bytes other than NUL and '/'.
Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: JFS default behavior / UTF-8 filenames
2004-02-19 14:05 ` Dave Kleikamp
@ 2004-02-19 23:47 ` kernel
2004-02-20 15:00 ` Dave Kleikamp
0 siblings, 1 reply; 18+ messages in thread
From: kernel @ 2004-02-19 23:47 UTC (permalink / raw)
To: Dave Kleikamp; +Cc: linux-kernel
On Thu, Feb 19, 2004 at 08:05:06AM -0600, Dave Kleikamp wrote:
> The arbitrary string of bytes is treated as the latin1 charset in that
> it is stored as 0x00nn (in UTF2), but JFS really doesn't care what the
> character set is.
While I don't really care one way or the other about the whole
"rejecting non-UTF8 filenames" thing, trying to store 8bit strings in
UTF2 (no such thing, is there? Is JFS UCS-2 or UTF-16?) seems really
ugly. In general at least, maybe it's not so bad in JFS's case
specifically because of there not being much sharing of JFS filesystems
between linux and non-linux systems.
But if JFS uses that "make the high byte zero and return the low byte
only" scheme, what does it do when it encounters a UCS-2 filename that
has a non-NUL high byte on an existing filesystem? I can't see any ways
of dealing with this that aren't much more horribly broken than merely
refusing to create filenames that aren't valid in the current encoding.
If it throws the high byte away then you've made it impossible to open
said files, and up to 256 files per character of the filename can now
appear to have the same filename.
So what does JFS do in its "throw away the high byte and store binary
character strings in the low byte" mode? How does it deal with an
existing filesystem that has filenames that don't conform to said rule?
* Re: JFS default behavior / UTF-8 filenames
2004-02-19 23:47 ` kernel
@ 2004-02-20 15:00 ` Dave Kleikamp
2004-02-22 19:22 ` kernel
0 siblings, 1 reply; 18+ messages in thread
From: Dave Kleikamp @ 2004-02-20 15:00 UTC (permalink / raw)
To: kernel; +Cc: linux-kernel
On Thu, 2004-02-19 at 17:47, kernel@mikebell.org wrote:
> While I don't really care one way or the other about the whole
> "rejecting non-UTF8 filenames" thing, trying to store 8bit strings in
> UTF2 (no such thing, is there? Is JFS UCS-2 or UTF-16?)
UCS-2 - I can't keep this stuff straight.
> seems really
> ugly. In general at least, maybe it's not so bad in JFS's case
> specifically because of there not being much sharing of JFS filesystems
> between linux and non-linux systems.
>
> But if JFS uses that "make the high byte zero and return the low byte
> only" scheme, what does it do when it encounters a UCS-2 filename that
> has a non-NUL high byte on an existing filesystem? I can't see any ways
> of dealing with this that aren't much more horribly broken than merely
> refusing to create filenames that aren't valid in the current encoding.
> If it throws the high byte away then you've made it impossible to open
> said files, and up to 256 files per character of the filename can now
> appear to have the same filename.
>
> So what does JFS do in its "throw away the high byte and store binary
> character strings in the low byte" mode? How does it deal with an
> existing filesystem that has filenames that don't conform to said rule?
With no iocharset specified, a filename with such a character will be
inaccessible. Probably the best thing for readdir to do is to
substitute a '?' and print a message to the syslog to mount the volume
with iocharset=utf8 to be able to access the file. Of course I would
limit the number of printk's to something small. I'll submit a patch to
do this.
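The proposed readdir behavior might look roughly like this. It is a Python model for illustration, not the patch itself: `make_readdir_filter`, the message text and the limit of 3 are all assumptions, since the mail only says the number of printk's would be "something small".

```python
MAX_WARNINGS = 3  # assumed limit, standing in for "something small"


def make_readdir_filter(log):
    """Return a function turning stored 16-bit units into visible bytes."""
    warnings_left = MAX_WARNINGS

    def visible_name(units) -> bytes:
        nonlocal warnings_left
        out = bytearray()
        substituted = False
        for unit in units:
            if unit <= 0xFF:
                out.append(unit)      # fits in one byte: pass through
            else:
                out.append(ord("?"))  # not representable: substitute
                substituted = True
        # Warn at most MAX_WARNINGS times, one message per affected name.
        if substituted and warnings_left > 0:
            warnings_left -= 1
            log("name not representable; mount with iocharset=utf8")
        return bytes(out)

    return visible_name
```

For example, `make_readdir_filter(print)([0x4E2D, 0x2E, 0x74])` would show the name as `?.t` and log one message. The '?' form is for listing only, since it no longer identifies the file.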
--
David Kleikamp
IBM Linux Technology Center
* Re: JFS default behavior / UTF-8 filenames
2004-02-20 15:00 ` Dave Kleikamp
@ 2004-02-22 19:22 ` kernel
2004-02-24 14:44 ` Dave Kleikamp
0 siblings, 1 reply; 18+ messages in thread
From: kernel @ 2004-02-22 19:22 UTC (permalink / raw)
To: Dave Kleikamp; +Cc: linux-kernel
On Fri, Feb 20, 2004 at 09:00:58AM -0600, Dave Kleikamp wrote:
> With no iocharset specified, a filename with such a character will be
> inaccessible. Probably the best thing for readdir to do is to
> substitute a '?' and print a message to the syslog to mount the volume
> with iocharset=utf8 to be able to access the file. Of course I would
> limit the number of printk's to something small. I'll submit a patch to
> do this.
And that's why I was saying I think UTF-8 mode is the "least broken" for
any filesystem that stores filenames in a specific encoding rather than
"as the client submitted it". And most especially for UCS-2/UTF-16
filesystems.
I think the default for a filesystem should be something that absolutely
will not disappear your files. So for NTFS/JFS, it should be UTF-8. And
if a traditional UNIX filesystem wants to do a UTF-8 only mode, I think
ideally it should be done at mkfs time rather than mount time.
* Re: JFS default behavior / UTF-8 filenames
2004-02-22 19:22 ` kernel
@ 2004-02-24 14:44 ` Dave Kleikamp
0 siblings, 0 replies; 18+ messages in thread
From: Dave Kleikamp @ 2004-02-24 14:44 UTC (permalink / raw)
To: kernel; +Cc: linux-kernel
On Sun, 2004-02-22 at 13:22, kernel@mikebell.org wrote:
>
> And that's why I was saying I think UTF-8 mode is the "least broken" for
> any filesystem that stores filenames in a specific encoding rather than
> "as the client submitted it". And most especially for UCS-2/UTF-16
> filesystems.
I receive a lot of complaints when JFS does not accept names because
they contain an "invalid" character. Defaulting to UTF-8 will cause
some non-utf-8 filenames to be rejected. The change I made makes the
default behavior sane and posix-compliant. It won't make everybody
happy, but it will provide predictable, sane behavior.
> I think the default for a filesystem should be something that absolutely
> will not disappear your files. So for NTFS/JFS, it should be UTF-8. And
> if a traditional UNIX filesystem wants to do a UTF-8 only mode, I think
> ideally it should be done at mkfs time rather than mount time.
The biggest problem with changing the default now is that the behavior
was unpredictable before. Now, the default behavior will not allow
filenames to be stored with UCS-2 characters greater than 0x00ff, so
there won't be inaccessible files unless the iocharset option has been
used. This allows the average user to get sane behavior, but allows the
flexibility of accessing the file system in a specific character set for
those users who know what they are doing.
--
David Kleikamp
IBM Linux Technology Center
Thread overview: 18+ messages
2004-02-15 23:03 JFS default behavior Nicolas Mailhot
2004-02-16 3:45 ` Jan Knutar
2004-02-16 8:30 ` Nicolas Mailhot
2004-02-16 8:54 ` Valdis.Kletnieks
2004-02-16 6:21 ` jw schultz
2004-02-16 15:55 ` Jamie Lokier
2004-02-17 6:47 ` jw schultz
2004-02-17 21:37 ` Jamie Lokier
2004-02-17 22:12 ` Linus Torvalds
2004-02-18 9:59 ` Jamie Lokier
2004-02-18 15:54 ` Linus Torvalds
2004-02-18 23:58 ` Jamie Lokier
2004-02-19 10:59 ` JFS default behavior / UTF-8 filenames kernel
2004-02-19 14:05 ` Dave Kleikamp
2004-02-19 23:47 ` kernel
2004-02-20 15:00 ` Dave Kleikamp
2004-02-22 19:22 ` kernel
2004-02-24 14:44 ` Dave Kleikamp