From: tridge@samba.org
To: Linus Torvalds <torvalds@osdl.org>
Cc: Kernel Mailing List <linux-kernel@vger.kernel.org>,
Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Subject: Re: UTF-8 and case-insensitivity
Date: Wed, 18 Feb 2004 10:20:00 +1100
Message-ID: <16434.41376.453823.260362@samba.org>
In-Reply-To: <Pine.LNX.4.58.0402170704210.2154@home.osdl.org>
Linus,
> Yes, we could add context sensitivity to the dcache with a context
> bitmask.
>
> However, it's _not_ correct.
>
> It assumes that there is only one way to do lower/upper case, which just
> isn't true. What about different locales that have different case rules?
> Your "one bit per dentry" becomes "one bit per locale per dentry". That's
> just horribly hard to do.
I think you're making it sound much harder than it really is.

We just add a VFS hook in the filesystems. The filesystem chooses the
encoding-specific comparison function. If the filesystem doesn't
provide one then don't do case insensitivity. If the filesystem does
provide one (for example NTFS, JFS) then use it. Then all I need to do
is convince one of the filesystem maintainers to add a mount-time
option to specify the case table (for example by specifying the name
of a file in the filesystem that holds it).

So, all the really ugly stuff is then in the per-filesystem code, and
all the VFS and dcache have to do is know about a single context bit
per dentry.
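
As a rough userspace sketch of the hook described above: every name here
(fs_sketch, name_cmp_fn, names_match, want_ci) is hypothetical and is not
an actual VFS or dcache interface, and the ASCII-only fold merely stands
in for whatever case table a real filesystem would load at mount time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparison hook: returns 0 when the two names are equal. */
typedef int (*name_cmp_fn)(const char *a, size_t alen,
                           const char *b, size_t blen);

/* Hypothetical per-filesystem descriptor, not a real kernel structure. */
struct fs_sketch {
    const char *name;
    name_cmp_fn ci_compare; /* NULL: no case-insensitive support */
};

/* ASCII-only fold; a real filesystem would consult its own case table. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

int ascii_ci_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return 1;
    for (size_t i = 0; i < alen; i++)
        if (fold_ascii((unsigned char)a[i]) != fold_ascii((unsigned char)b[i]))
            return 1;
    return 0;
}

/*
 * Generic lookup-side check. `want_ci` plays the role of the single
 * context bit: when it is set and the filesystem supplies a hook, use
 * the hook; otherwise fall back to an exact byte comparison.
 */
int names_match(const struct fs_sketch *fs, int want_ci,
                const char *a, const char *b)
{
    size_t alen = strlen(a), blen = strlen(b);

    assert(fs != NULL);
    if (want_ci && fs->ci_compare)
        return fs->ci_compare(a, alen, b, blen) == 0;
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* Two illustrative filesystems: one provides the hook, one does not. */
const struct fs_sketch fs_with_hook = { "with-hook", ascii_ci_compare };
const struct fs_sketch fs_without_hook = { "without-hook", NULL };
```

The generic code never looks inside the case rules; it only checks the
context flag and whether a hook exists, so a filesystem without a hook
stays strictly case sensitive even when the flag is set.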
> I don't know how Windows does it, so maybe this thing is hardcoded, and
> you don't even want "true" case insensitivity.
NTFS has a 128k table on disk, created at mkfs time and indexed by the
UCS2 character. The interesting thing about this table is that it
doesn't seem to vary between different locales as one might expect. I
have checked 3 locales so far (Swedish, Japanese and English) and all
have the same 128k table. I should check a few more locales to see if
it really is the same everywhere. Contact me off-list if you have an
NTFS filesystem created in a different locale and would be willing to
run a test program against it to see if the table is different from
the one we have in Samba.
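
To make the size concrete: a table indexed by a 16-bit UCS-2 value has
65536 entries, and at 2 bytes per entry that is 131072 bytes, i.e. 128k.
The sketch below assumes that layout; the function names are hypothetical,
the table contents are an ASCII-only stand-in rather than the real on-disk
data, and the table-diff helper only illustrates the kind of check such a
test program might do (it is not the Samba code).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UPCASE_ENTRIES 65536u /* one entry per 16-bit UCS-2 value */

/* 65536 entries * 2 bytes each = 131072 bytes = 128 KiB. */
typedef uint16_t upcase_table[UPCASE_ENTRIES];

/* Stand-in contents: identity mapping, except ASCII a-z maps to A-Z. */
void upcase_table_init_ascii(uint16_t *t)
{
    assert(t != NULL);
    for (uint32_t c = 0; c < UPCASE_ENTRIES; c++)
        t[c] = (uint16_t)c;
    for (uint16_t c = 'a'; c <= 'z'; c++)
        t[c] = (uint16_t)(c - 'a' + 'A');
}

/* Case-insensitive equality: upcase each character by direct indexing. */
int ucs2_ci_equal(const uint16_t *t,
                  const uint16_t *a, size_t alen,
                  const uint16_t *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (t[a[i]] != t[b[i]])
            return 0;
    return 1;
}

/* Index of the first entry where two tables differ, or -1 if identical. */
long upcase_table_first_diff(const uint16_t *x, const uint16_t *y)
{
    for (uint32_t c = 0; c < UPCASE_ENTRIES; c++)
        if (x[c] != y[c])
            return (long)c;
    return -1;
}
```

Because the lookup is a plain array index, the comparison cost does not
depend on the locale at all; only the table contents would.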
There is stuff in the charset handling of every locale that does vary
in Windows, but it isn't the case table, it's the "valid characters"
map used to determine what characters are allowed when converting
strings into legacy multi-byte encodings. Even I don't think that the
kernel will ever have to deal with that crap unless someone is foolish
enough to port Samba into the kernel (several people have actually
done that despite the insanity of the idea, but they all did an
absolutely terrible job of it and certainly didn't take care to get
all the charset handling right).
> How "correct" is Windows?
From my rather limited point of view I always have to assume that
Windows is "correct", unless I can show that its behaviour leads to
data loss, a security hole or something equally extreme.
Cheers, Tridge