From: tridge@samba.org
To: hpa@zytor.com
Cc: Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: UTF-8 and case-insensitivity
Date: Wed, 18 Feb 2004 15:08:00 +1100
Message-ID: <16434.58656.381712.241116@samba.org>
In-Reply-To: <c0uj52$3mg$1@terminus.zytor.com>
Hpa,
> So you're hosed if anyone uses characters outside the UCS-2 character
> set...
I've heard they are redefining all those 16-bit numbers to be UTF-16
instead of UCS-2 for exactly that reason. This is rather similar to
the move in the Unix community to start using UTF-8.
Note that I am not at all proposing that we use UCS-2 in the Linux
kernel (except in places where you have to, like the NTFS
filesystem). I am proposing that the filesystems be able to offer a
case-insensitive hash function to the dcache, and I would expect that
this function would be based on UTF-8.
The function might operate internally by converting UTF-8 to UCS-2, or
it might use a sparse mapping table. It would almost certainly have a
fast-path that looked first to see if there are any bytes with the top
bit set, and if there are none then it can do a really easy 7-bit
table-based hash, which would make this really fast for most users.
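A minimal user-space sketch of that fast path, in plain C rather than kernel code. Everything named here (ci_hash, fold_table, hash_slow) is hypothetical, the FNV-1a mixing step is an arbitrary choice, and the slow path is only a stand-in: it folds ASCII and passes high bytes through untouched, where a real version would decode UTF-8 and consult a proper case table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-entry table mapping each 7-bit character to its
 * upper-case form. Filled lazily; not thread-safe, sketch only. */
static unsigned char fold_table[128];
static int fold_table_ready;

static void fold_table_init(void)
{
    for (int c = 0; c < 128; c++)
        fold_table[c] = (unsigned char)((c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c);
    fold_table_ready = 1;
}

/* One FNV-1a step; the choice of hash is an assumption of this sketch. */
static uint32_t mix(uint32_t h, unsigned char b)
{
    return (h ^ b) * 16777619u;
}

/* Stand-in slow path: folds ASCII bytes only and leaves bytes with the
 * top bit set as they are. It does NOT case-fold non-ASCII characters. */
static uint32_t hash_slow(const unsigned char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = mix(h, s[i] < 128 ? fold_table[s[i]] : s[i]);
    return h;
}

uint32_t ci_hash(const char *name, size_t len)
{
    const unsigned char *s = (const unsigned char *)name;
    size_t i;

    if (!fold_table_ready)
        fold_table_init();

    /* Look first for any byte with the top bit set. */
    for (i = 0; i < len; i++)
        if (s[i] & 0x80)
            return hash_slow(s, len);

    /* Pure 7-bit name: one table lookup per byte. */
    uint32_t h = 2166136261u;
    for (i = 0; i < len; i++)
        h = mix(h, fold_table[s[i]]);
    return h;
}
```

With this sketch, ci_hash("README", 6) and ci_hash("readme", 6) give the same value, and a name containing any non-ASCII byte is routed to hash_slow.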
The point is that the kernel proper (the VFS and dcache in particular)
won't have to care how this hash works. They're just consumers of it.
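To illustrate the "just consumers" point, here is a small user-space sketch in which the caller sees only a function pointer. It is similar in spirit to a filesystem supplying its own d_hash to the dcache, but the struct fs_name_ops, bucket_for and both hash functions below are invented for this example and are not the kernel's interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-filesystem hook: the consumer only ever calls ->hash. */
struct fs_name_ops {
    uint32_t (*hash)(const char *name, size_t len);
};

/* Case-sensitive hash (FNV-1a, chosen arbitrarily for the sketch). */
static uint32_t hash_exact(const char *name, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ (unsigned char)name[i]) * 16777619u;
    return h;
}

/* ASCII-only case-insensitive hash; a real one would handle UTF-8. */
static uint32_t hash_ascii_ci(const char *name, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (c >= 'a' && c <= 'z')
            c = (unsigned char)(c - 'a' + 'A');
        h = (h ^ c) * 16777619u;
    }
    return h;
}

#define NBUCKETS 64u

/* The consumer: picks a bucket without knowing how the hash works. */
unsigned bucket_for(const struct fs_name_ops *ops, const char *name, size_t len)
{
    return ops->hash(name, len) % NBUCKETS;
}
```

A case-insensitive filesystem would hand over { hash_ascii_ci } and a case-sensitive one { hash_exact }; bucket_for is identical in both cases.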
> There is a "standard" table, which is published by the Unicode
> consortium.
The table used in Windows is not exactly the same as the one on
unicode.org. Which is "correct" I will leave up to the pedants to
discuss, as all that Samba cares about is that it uses the same table
as w2k.
> However, the "standard" table isn't what you want in certain
> locales, e.g. Turkish.
I'd really like someone to confirm this for me by volunteering to run
a tool I provide on a Turkish NTFS filesystem or sending me a
compressed empty Turkish NTFS volume (please ask first by email - I
only need one of these). Up to now I have only ever seen the one 128k
table used across all Windows locales. If this table really *is*
different in some locales then I need to know.
Cheers, Tridge