Re: RFC: Case-insensitive support for XFS


* Re: RFC: Case-insensitive support for XFS
       [not found] ` <op.tzpbqspl3jf8g2@pc-bnaujok.melbourne.sgi.com>
@ 2007-10-05 15:44   ` Christoph Hellwig
  2007-10-05 18:52     ` Nicholas Miell
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Christoph Hellwig @ 2007-10-05 15:44 UTC
  To: Barry Naujok; +Cc: xfs@oss.sgi.com, linux-fsdevel, urban

[Adding -fsdevel because some of the things touched here might be of
 broader interest and Urban because his name is on nls_utf8.c]

On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
> 
> On its own, Linux only provides case conversion for old-style
> character sets - 8 bit sequences only. A lot of distros are
> now defaulting to UTF-8 and the Linux NLS code does not support
> case conversion for any Unicode sets.

The lack of case tables in nls_utf8.c definitely seems odd to me.
Urban, is there a reason for that?  The only thing that comes to
mind is that these tables might be quite large.

> NTFS in Linux also implements its own dcache and NTFS also

					^^^^^^^ dentry operations?

> stores its unicode case table on disk. This allows the filesystem
> to migrate to newer forms of Unicode at the time of formatting
> the filesystem. Eg. Windows Vista now supports Unicode 5.0
> while older version would support an earlier version of
> Unicode. Linux's version of NTFS case table is implemented
> in fs/ntfs/upcase.c defined as default_upcase.

Because ntfs uses 16bit wide chars it prefers to use its own tables.
I'm not sure that's a good idea.  JFS also has wide-char names on
disk but at least partially uses the generic nls support, so there must
be some trade-offs.

> It will be proposed that in the future, XFS may default to
> UTF-8 on disk, and that the old format must be requested
> explicitly with a mkfs.xfs option. Two superblock bits will be
> used: one for case-insensitive (which generates lowercase hashes
> on disk), which already exists on IRIX filesystems, and a new one
> for UTF-8 filenames. Any combination of the two bits can be
> used and the dentry_operations will be adjusted accordingly.

I don't think arbitrary combinations make sense.  Without case insensitive
support a unix filesystem couldn't care less what charset the filenames
are in: except for the terminating 0, '/', '.' and '..', a name is an
entirely opaque stream of bytes.  So choosing a charset only makes sense
with the case insensitive filename option.
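
To illustrate the point, here is a minimal userspace sketch of the only
checks a case-sensitive filesystem needs to make on a name. The helper
name_is_creatable() is hypothetical, made up for this example, and is not
an existing kernel function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): returns true if the
 * len bytes at name could be used as a directory entry name on a
 * case-sensitive Unix filesystem.  Only NUL, '/', "." and ".." are
 * special; every other byte value is accepted without interpretation.
 */
static bool name_is_creatable(const unsigned char *name, size_t len)
{
    assert(name != NULL);

    if (len == 0)
        return false;
    if (len == 1 && name[0] == '.')
        return false;
    if (len == 2 && name[0] == '.' && name[1] == '.')
        return false;
    /* No charset knowledge needed: just reject the two reserved bytes. */
    return memchr(name, '\0', len) == NULL &&
           memchr(name, '/', len) == NULL;
}
```

Note that the byte sequence 0xff 0xfe, which is not valid UTF-8, passes
this check just as well as a UTF-8 encoded name does.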

> So, in regards to the UTF-8 case-conversion/folding table, we
> have several options to choose from:
>    - Use the HFS+ method as-is.
>    - Use an NTFS scheme with an on-disk table.
>    - Pick a current table and stick with it (similar to HFS+).
>    - How much of Unicode do we support? Just the "Basic
>      Multilingual Plane" (U+0000 - U+FFFF) or the entire set?
>      (anything above U+FFFF won't have case-conversion
>       requirements). Seems that all the other filesystems
>       just support the "BMP".
>    - UTF-8, UTF-16 or UCS-2.
> 
> With the last point, UTF-8 has several advantages IMO:
>    - xfs_repair can easily detect UTF-8 sequences in filenames
>      and also validate UTF-8 sequences.
>    - char based structures don't change
>    - "nulls" in filenames.
>    - no endian conversions required.

I think the right approach is to use the fs/nls/ code and allow the
user to select any table with a mount option, as at least in Russia
and eastern Europe some non-utf8 charsets still seem to be preferred.
The default should of course be utf8, and support for utf8 case
conversion should be added to fs/nls/.

> Internally, the names will probably be converted to "u16"s for
> efficient processing. Conversion between UTF-8 and UTF-16/UCS-2
> is very straightforward.

Do we really need that?  And if so please make sure this only happens
for filesystems created with the case insensitivity option so normal
filesystems don't have to pay for these bloated strings.


* Re: RFC: Case-insensitive support for XFS
  2007-10-05 15:44   ` RFC: Case-insensitive support for XFS Christoph Hellwig
@ 2007-10-05 18:52     ` Nicholas Miell
  2007-10-08  5:07       ` Barry Naujok
  2007-10-05 19:10     ` Anton Altaparmakov
  2007-10-08  0:33     ` Barry Naujok
  2 siblings, 1 reply; 12+ messages in thread
From: Nicholas Miell @ 2007-10-05 18:52 UTC
  To: Christoph Hellwig; +Cc: Barry Naujok, xfs@oss.sgi.com, linux-fsdevel, urban

On Fri, 2007-10-05 at 16:44 +0100, Christoph Hellwig wrote:
> [Adding -fsdevel because some of the things touched here might be of
>  broader interest and Urban because his name is on nls_utf8.c]
> 
> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
> > 
> > On it's own, linux only provides case conversion for old-style
> > character sets - 8 bit sequences only. A lot of distos are
> > now defaulting to UTF-8 and Linux NLS stuff does not support
> > case conversion for any unicode sets.
> 
> The lack of case tables in nls_utf8.c defintively seems odd to me.
> Urban, is there a reason for that?  The only thing that comes to
> mind is that these tables might be quite large.
> 

Case conversion in Unicode is locale dependent. The legacy 8-bit
character encodings don't code for enough characters to run into the
ambiguities, so they can get away with fixed case conversion tables.
Unicode can't.

I'd point you to the Unicode technical report which explains how to do
it, but unicode.org seems to be offline right now.

> > NTFS in Linux also implements it's own dcache and NTFS also
> 
> 					^^^^^^^ dentry operations?
> 
> > stores its unicode case table on disk. This allows the filesystem
> > to migrate to newer forms of Unicode at the time of formatting
> > the filesystem. Eg. Windows Vista now supports Unicode 5.0
> > while older version would support an earlier version of
> > Unicode. Linux's version of NTFS case table is implemented
> > in fs/ntfs/upcase.c defined as default_upcase.
> 
> Because ntfs uses 16bit wide chars it prefers to use it's own tables.
> I'm not sure it's a that good idea.  

Well, Windows uses those on-disk tables, so the Linux driver has to
as well. I don't see how that's a bad idea, or any way to avoid it and
remain compatible.

-- 
Nicholas Miell <nmiell@comcast.net>



* Re: RFC: Case-insensitive support for XFS
  2007-10-05 15:44   ` RFC: Case-insensitive support for XFS Christoph Hellwig
  2007-10-05 18:52     ` Nicholas Miell
@ 2007-10-05 19:10     ` Anton Altaparmakov
  2007-10-06  6:37       ` Brad Boyer
  2007-10-08  0:43       ` Barry Naujok
  2007-10-08  0:33     ` Barry Naujok
  2 siblings, 2 replies; 12+ messages in thread
From: Anton Altaparmakov @ 2007-10-05 19:10 UTC
  To: Christoph Hellwig; +Cc: Barry Naujok, xfs@oss.sgi.com, linux-fsdevel, urban

Hi,

On 5 Oct 2007, at 16:44, Christoph Hellwig wrote:
> [Adding -fsdevel because some of the things touched here might be of
>  broader interest and Urban because his name is on nls_utf8.c]
>
> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
>> On it's own, linux only provides case conversion for old-style
>> character sets - 8 bit sequences only. A lot of distos are
>> now defaulting to UTF-8 and Linux NLS stuff does not support
>> case conversion for any unicode sets.
>
> The lack of case tables in nls_utf8.c defintively seems odd to me.
> Urban, is there a reason for that?  The only thing that comes to
> mind is that these tables might be quite large.
>
>> NTFS in Linux also implements it's own dcache and NTFS also
>
> 					^^^^^^^ dentry operations?

Where did that come from?  NTFS does not have its own dcache!  It  
doesn't have its own dentry operations either...  NTFS uses the  
default ones...

All the case insensitivity handling "cleverness" is done inside  
ntfs_lookup(), i.e. the NTFS directory inode operation ->lookup.

>> stores its unicode case table on disk. This allows the filesystem
>> to migrate to newer forms of Unicode at the time of formatting
>> the filesystem. Eg. Windows Vista now supports Unicode 5.0
>> while older version would support an earlier version of
>> Unicode. Linux's version of NTFS case table is implemented
>> in fs/ntfs/upcase.c defined as default_upcase.

The one in the current kernel is the Windows NT4/2000/XP one.

Windows Vista uses a different table (the content is actually
significantly different).  My NTFS driver, which I am not yet allowed
to release, uses the Vista table by default.

But the default does not matter for NTFS.  At mount time, the upcase
table stored on the volume is read into memory and compared to the
default one.  If they match perfectly the default one is used (it is
reference counted and discarded when not in use) and if they do not
match the one from the volume is used.  So we support both NT4/2k/XP
and Vista style volumes fine no matter what default table we use...
The only thing is that for each non-default table we waste 128kiB of
vmalloc()ed kernel memory, thus if you mount 10 NTFS volumes with
non-default tables we are wasting more than 1MiB of memory...
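
The mount-time decision described above can be sketched roughly as
follows. This is a simplified userspace illustration, not the actual
fs/ntfs code: select_upcase() and UPCASE_ENTRIES are hypothetical names,
and the 65536-entry size is an assumption derived from the 128kiB figure
(65536 entries of 2 bytes each).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UPCASE_ENTRIES 65536u   /* assumed: one 16-bit entry per code unit */

/*
 * Hypothetical helper: pick the table to use for a volume.  If the
 * on-disk table is identical to the built-in default, return the
 * shared default so the caller can free the volume's private copy;
 * otherwise keep the volume's own table.
 */
static const uint16_t *select_upcase(const uint16_t *volume_table,
                                     const uint16_t *default_table)
{
    assert(volume_table != NULL && default_table != NULL);

    if (memcmp(volume_table, default_table,
               UPCASE_ENTRIES * sizeof(uint16_t)) == 0)
        return default_table;
    return volume_table;
}
```

Reference counting of the shared default is left out of this sketch.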

> Because ntfs uses 16bit wide chars it prefers to use it's own tables.
> I'm not sure it's a that good idea.

The upcase table is used during the case insensitive ->lookup and if  
you have the wrong table it will make the traversal in the directory  
b-tree go wrong and so you may not find files that actually exist  
when doing a ->lookup!

So yes it is not only a good idea but an absolutely essential idea!
You have to use the same upcase table for a volume as the one with
which the names on the volume were created, otherwise your b-trees
are screwed as soon as names use any character whose upper casing
differs between the table used when writing and the table used when
doing the lookup.
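
A small sketch of why a mismatched table breaks the b-tree traversal:
the sort order of two names depends on the table. The function
collate_upcased() is a hypothetical illustration and is not the
collation code used by the NTFS driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical collation helper: compare two names made of 16-bit code
 * units after mapping each unit through upcase (a 65536-entry table).
 * Returns <0, 0 or >0 like memcmp().
 */
static int collate_upcased(const uint16_t *a, size_t a_len,
                           const uint16_t *b, size_t b_len,
                           const uint16_t *upcase)
{
    size_t i, min_len = a_len < b_len ? a_len : b_len;

    assert(a != NULL && b != NULL && upcase != NULL);

    for (i = 0; i < min_len; i++) {
        uint16_t ca = upcase[a[i]];
        uint16_t cb = upcase[b[i]];

        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    if (a_len == b_len)
        return 0;
    return a_len < b_len ? -1 : 1;
}
```

With a table that maps 'a' to 'A', the name "a" sorts before "Z"; with a
table that leaves 'a' alone, it sorts after "Z". A tree built with one
ordering and searched with the other can descend into the wrong subtree.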

> JFS also has wide-char names on
> disk but at least partially uses the generic nls support, so there  
> must
> be some trade-offs.
>
>> It will be proposed that in the future, XFS may default to
>> UTF-8 on disk and to go for the old format, explicitily
>> use a mkfs.xfs option. Two superbits will be used: one for
>> case-insensitive (which generates lowercase hashes on disk)
>> and that already exists on IRIX filesystems and a new one
>> for UTF-8 filenames. Any combination of the two bits can be
>> used and the dentry_operations will be adjusted accordingly.
>
> I don't think arbitrary combinations make sense.  Without case  
> insensitive
> support a unix filesystem couldn't care less what charset the  
> filenames
> are in, except for the terminating 0 and '/', '.', '..' it's an  
> entirely
> opaqueue stream of bytes.  So chosing a charset only makes sense
> with the case insensitive filename option.
>
>> So, in regards to the UTF-8 case-conversion/folding table, we
>> have several options to choose from:
>>    - Use the HFS+ method as-is.
>>    - Use an NTFS scheme with an on-disk table.
>>    - Pick a current table and stick with it (similar to HFS+).
>>    - How much of Unicode to we support? Just the the "Basic
>>      Multilingual Plane" (U+0000 - U+FFFF) or the entire set?
>>      (anything above U+FFFF won't have case-conversion
>>       requirements). Seems that all the other filesystems
>>       just support the "BMP".
>>    - UTF-8, UTF-16 or UCS-2.
>>
>> With the last point, UTF-8 has several advantages IMO:
>>    - xfs_repair can easily detect UTF-8 sequences in filenames
>>      and also validate UTF-8 sequences.
>>    - char based structures don't change
>>    - "nulls" in filenames.
>>    - no endian conversions required.
>
> I think the right approach is to use the fs/nls/ code and allow the
> user to select any table with a mount option as at least in russia
> and eastern europe some non-utf8 charsets still seem to be prefered.
> The default should of course be utf8 and support for utf8 case
> conversion should be added to fs/nls/
>
>> Internally, the names will probably be converted to "u16"s for
>> efficient processing. Conversion between UTF-8 and UTF-16/UCS-2
>> is very straight forward.
>
> Do we really need that?  And if so please make sure this only happens
> for filesystems created with the case insensitivity option so normal
> filesystems don't have to pay for these bloated strings.

There is nothing efficient about using u16 in memory AFAIK.  In fact
for the majority of the time it just means you use twice the memory
per string...
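
A quick way to check the memory claim for a given name, as a userspace
sketch. utf16_size_of_utf8() is a hypothetical helper and assumes its
input is well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes the NUL-terminated, well-formed
 * UTF-8 string s would occupy as UTF-16 (terminator excluded).
 */
static size_t utf16_size_of_utf8(const char *s)
{
    size_t units = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte */
        units += (c >= 0xF0) ? 2 : 1;   /* 4-byte sequence: surrogate pair */
    }
    return units * 2;
}
```

An ASCII name such as "Makefile" takes 8 bytes as UTF-8 and 16 as
UTF-16. The balance only tips the other way for names dominated by
characters that need three UTF-8 bytes, such as CJK text.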

FWIW Mac OS X uses utf8 in the kernel and so does HFS(+) and I can't  
see anything wrong with that.  And Windows uses u16 (little endian)  
and so does NTFS.  So there is precedent for doing both internally...

What are the reasons for suggesting that it would be more efficient  
to use u16 internally?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


* Re: RFC: Case-insensitive support for XFS
  2007-10-05 19:10     ` Anton Altaparmakov
@ 2007-10-06  6:37       ` Brad Boyer
  2007-10-06 13:00         ` Anton Altaparmakov
  2007-10-08  0:43       ` Barry Naujok
  1 sibling, 1 reply; 12+ messages in thread
From: Brad Boyer @ 2007-10-06  6:37 UTC
  To: Anton Altaparmakov
  Cc: Christoph Hellwig, Barry Naujok, xfs@oss.sgi.com, linux-fsdevel,
	urban

On Fri, Oct 05, 2007 at 08:10:23PM +0100, Anton Altaparmakov wrote:
> But the default does not matter for NTFS.  At mount time, the upcase  
> table stored on the volume is read into memory and compared to the  
> default one.  If they match perfectly the default one is used (it is  
> reference counted and discarded when not in use) and if they do not  
> match the one from the volume is used.  So we support both NT4/2k/XP  
> and Vista style volumes fine no matter what default table we use...   
> The only thing is that for each non-default table we waste 128kiB of  
> vmalloc()ed kernel memory thus if you mount 10 NTFS volumes with non- 
> default table we are wasting 1MiB of data...

For HFS+, there is a single case conversion table that is defined in
the on-disk format. It's in fs/hfsplus/tables.c with the data taken
directly from Apple's documentation.

> The upcase table is used during the case insensitive ->lookup and if  
> you have the wrong table it will make the traversal in the directory  
> b-tree go wrong and so you may not find files that actually exist  
> when doing a ->lookup!

I had the same issue in HFS+. If the case conversion isn't handled
right, the key matching doesn't work and the code wanders off into
nowhere in the catalog btree on any catalog lookup. Since everything
in HFS+ goes through the catalog in one way or another, losing this
would make most of the filesystem inaccessible. Even a lookup by
inode number to satisfy iget() goes through the same search code.

> So yes it is not only a good idea but an absolutely essential idea!   
> You have to use the same upcase table for a volume as the upcase  
> table with which the names on the volume were created otherwise your  
> b-trees are screwed if they use any characters where the upper casing  
> between the upcase table used when writing and the upcase table used  
> when doing the lookup are not matched.

The HFS+ unicode handling is a hard-coded mess of tables and offsets
for this exact reason. It handles manual decomposition and case
folding in exactly the method from the official documentation. Any
other way wouldn't properly support a filesystem with non-ASCII
file names.

> FWIW Mac OS X uses utf8 in the kernel and so does HFS(+) and I can't  
> see anything wrong with that.  And Windows uses u16 (little endian)  
> and so does NTFS.  So there is precedent for doing both internally...

Apple may use utf8 internally in OSX, but HFS+ uses UTF16 on disk. Just
look at the definition of struct hfsplus_unistr in hfsplus_raw.h. The
utf8 <=> utf16 conversion is the one place the hfsplus module uses
the nls code directly. If you want to talk about original HFS, Apple
never supported the use of unicode and converts in the driver to the
encoding used on the individual HFS volume. The Linux implementation
of HFS uses the nls code in a pretty traditional way to do this.

> What are the reasons for suggesting that it would be more efficient  
> to use u16 internally?

At least for HFS+, it's easiest to use a u16 to track the characters
because that is what is on disk. That's not a very generic reason,
obviously.

	Brad Boyer
	flar@allandria.com


* Re: RFC: Case-insensitive support for XFS
  2007-10-06  6:37       ` Brad Boyer
@ 2007-10-06 13:00         ` Anton Altaparmakov
  0 siblings, 0 replies; 12+ messages in thread
From: Anton Altaparmakov @ 2007-10-06 13:00 UTC
  To: Brad Boyer
  Cc: Christoph Hellwig, Barry Naujok, xfs@oss.sgi.com, linux-fsdevel,
	urban

Hi,

On 6 Oct 2007, at 07:37, Brad Boyer wrote:
> On Fri, Oct 05, 2007 at 08:10:23PM +0100, Anton Altaparmakov wrote:
>> But the default does not matter for NTFS.  At mount time, the upcase
>> table stored on the volume is read into memory and compared to the
>> default one.  If they match perfectly the default one is used (it is
>> reference counted and discarded when not in use) and if they do not
>> match the one from the volume is used.  So we support both NT4/2k/XP
>> and Vista style volumes fine no matter what default table we use...
>> The only thing is that for each non-default table we waste 128kiB of
>> vmalloc()ed kernel memory thus if you mount 10 NTFS volumes with non-
>> default table we are wasting 1MiB of data...
>
> For HFS+, there is a single case conversion table that is defined in
> the on-disk format. It's in fs/hfsplus/tables.c with the data taken
> directly from Apple's documentation.
>
>> The upcase table is used during the case insensitive ->lookup and if
>> you have the wrong table it will make the traversal in the directory
>> b-tree go wrong and so you may not find files that actually exist
>> when doing a ->lookup!
>
> I had the same issue in HFS+. If the case conversion isn't handled
> right, the key matching doesn't work and the code wanders off into
> nowhere in the catalog btree on any catalog lookup. Since everything
> in HFS+ goes through the catalog in one way or another, losing this
> would make most of the filesystem inaccessible. Even a lookup by
> inode number to satisfy iget() goes through the same search code.
>
>> So yes it is not only a good idea but an absolutely essential idea!
>> You have to use the same upcase table for a volume as the upcase
>> table with which the names on the volume were created otherwise your
>> b-trees are screwed if they use any characters where the upper casing
>> between the upcase table used when writing and the upcase table used
>> when doing the lookup are not matched.
>
> The HFS+ unicode handling is a hard-coded mess of tables and offsets
> for this exact reason. It handles manual decomposition and case
> folding in exactly the method from the official documentation. Any
> other way wouldn't properly support a filesystem with non-ASCII
> file names.
>
>> FWIW Mac OS X uses utf8 in the kernel and so does HFS(+) and I can't
>> see anything wrong with that.  And Windows uses u16 (little endian)
>> and so does NTFS.  So there is precedent for doing both internally...
>
> Apple may use utf8 internally in OSX, but HFS+ uses UTF16 on disk.  
> Just

Ah, oops, sorry.  I had never looked at that bit of the HFS+ code.  I
just assumed that HFS+ must use the same encoding on disk as inside
the VFS on OS X but you are quite right that it does not do so.

> look at the definition of struct hfsplus_unistr in hfsplus_raw.h. The
> utf8 <=> utf16 conversion is the one place the hfsplus module uses
> the nls code directly. If you want to talk about original HFS, Apple
> never supported the use of unicode and converts in the driver to the
> encoding used on the individual HFS volume. The Linux implementation
> of HFS uses the nls code in a pretty traditional way to do this.
>
>> What are the reasons for suggesting that it would be more efficient
>> to use u16 internally?
>
> At least for HFS+, it's easiest to use a u16 to track the characters
> because that is what is on disk. That's not a very generic reason,
> obviously.

Not a reason at all actually.  It does not matter whether you use u16
or utf8 because in both cases you have to do character-by-character
translation/handling for HFS+ (because of precomposed vs decomposed
Unicode, and thus strings having to match even when they are not
byte-for-byte identical).  Once you are doing that sort of parsing and
conversion you might as well convert to utf8, which IMHO is more
efficient in the general case, not least because it uses less memory.

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




* Re: RFC: Case-insensitive support for XFS
  2007-10-05 15:44   ` RFC: Case-insensitive support for XFS Christoph Hellwig
  2007-10-05 18:52     ` Nicholas Miell
  2007-10-05 19:10     ` Anton Altaparmakov
@ 2007-10-08  0:33     ` Barry Naujok
  2 siblings, 0 replies; 12+ messages in thread
From: Barry Naujok @ 2007-10-08  0:33 UTC
  To: Christoph Hellwig; +Cc: xfs@oss.sgi.com, linux-fsdevel

On Sat, 06 Oct 2007 01:44:42 +1000, Christoph Hellwig <hch@infradead.org>  
wrote:

> [Adding -fsdevel because some of the things touched here might be of
>  broader interest and Urban because his name is on nls_utf8.c]
>
> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
>>
>> It will be proposed that in the future, XFS may default to
>> UTF-8 on disk and to go for the old format, explicitily
>> use a mkfs.xfs option. Two superbits will be used: one for
>> case-insensitive (which generates lowercase hashes on disk)
>> and that already exists on IRIX filesystems and a new one
>> for UTF-8 filenames. Any combination of the two bits can be
>> used and the dentry_operations will be adjusted accordingly.
>
> I don't think arbitrary combinations make sense.  Without case  
> insensitive
> support a unix filesystem couldn't care less what charset the filenames
> are in, except for the terminating 0 and '/', '.', '..' it's an entirely
> opaqueue stream of bytes.  So chosing a charset only makes sense
> with the case insensitive filename option.

I was thinking along the lines of the iocharset mount option, which
specifies the 8-bit codepage that should be converted to/from UTF-8.
In the end, I suppose it ends up as an "opaque stream of bytes"
for a case sensitive filesystem. I've started implementing the
changes to XFS, and for that case UTF8 and the old format have no
differences.

>> So, in regards to the UTF-8 case-conversion/folding table, we
>> have several options to choose from:
>>    - Use the HFS+ method as-is.
>>    - Use an NTFS scheme with an on-disk table.
>>    - Pick a current table and stick with it (similar to HFS+).
>>    - How much of Unicode to we support? Just the the "Basic
>>      Multilingual Plane" (U+0000 - U+FFFF) or the entire set?
>>      (anything above U+FFFF won't have case-conversion
>>       requirements). Seems that all the other filesystems
>>       just support the "BMP".
>>    - UTF-8, UTF-16 or UCS-2.
>>
>> With the last point, UTF-8 has several advantages IMO:
>>    - xfs_repair can easily detect UTF-8 sequences in filenames
>>      and also validate UTF-8 sequences.
>>    - char based structures don't change
>>    - "nulls" in filenames.
>>    - no endian conversions required.
>
> I think the right approach is to use the fs/nls/ code and allow the
> user to select any table with a mount option as at least in russia
> and eastern europe some non-utf8 charsets still seem to be prefered.
> The default should of course be utf8 and support for utf8 case
> conversion should be added to fs/nls/
>
>> Internally, the names will probably be converted to "u16"s for
>> efficient processing. Conversion between UTF-8 and UTF-16/UCS-2
>> is very straight forward.
>
> Do we really need that?  And if so please make sure this only happens
> for filesystems created with the case insensitivity option so normal
> filesystems don't have to pay for these bloated strings.

Sort of, as the NLS conversions use wchar_t. From that, I can
convert straight back to utf8 anyway.
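
The conversion back is indeed small. A minimal userspace sketch for a
single 16-bit character follows; bmp_to_utf8() is a hypothetical helper,
not the kernel NLS interface, and it only handles BMP code points that
are not surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one BMP code point (no surrogates) as
 * UTF-8 into out, which must have room for 3 bytes.  Returns the
 * number of bytes written.
 */
static size_t bmp_to_utf8(uint16_t cp, unsigned char *out)
{
    assert(out != NULL);
    assert(cp < 0xD800 || cp > 0xDFFF);

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```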

Barry.


* Re: RFC: Case-insensitive support for XFS
  2007-10-05 19:10     ` Anton Altaparmakov
  2007-10-06  6:37       ` Brad Boyer
@ 2007-10-08  0:43       ` Barry Naujok
  1 sibling, 0 replies; 12+ messages in thread
From: Barry Naujok @ 2007-10-08  0:43 UTC
  To: Anton Altaparmakov, Christoph Hellwig; +Cc: xfs@oss.sgi.com, linux-fsdevel

On Sat, 06 Oct 2007 05:10:23 +1000, Anton Altaparmakov <aia21@cam.ac.uk>  
wrote:

> Hi,
>
> On 5 Oct 2007, at 16:44, Christoph Hellwig wrote:
>> [Adding -fsdevel because some of the things touched here might be of
>>  broader interest and Urban because his name is on nls_utf8.c]
>>
>> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
>>> On it's own, linux only provides case conversion for old-style
>>> character sets - 8 bit sequences only. A lot of distos are
>>> now defaulting to UTF-8 and Linux NLS stuff does not support
>>> case conversion for any unicode sets.
>>
>> The lack of case tables in nls_utf8.c defintively seems odd to me.
>> Urban, is there a reason for that?  The only thing that comes to
>> mind is that these tables might be quite large.
>>
>>> NTFS in Linux also implements it's own dcache and NTFS also
>>
>> 					^^^^^^^ dentry operations?
>
> Where did that come from?  NTFS does not have its own dcache!  It  
> doesn't have its own dentry operations either...  NTFS uses the default  
> ones...
>
> All the case insensitivity handling "cleverness" is done inside  
> ntfs_lookup(), i.e. the NTFS directory inode operation ->lookup.

Sorry if I got this wrong. I derived my comment from fs/ntfs/namei.c:

  * In order to handle the case insensitivity issues of NTFS with regards to
  * the dcache and the dcache requiring only one dentry per directory, we deal
  * with dentry aliases that only differ in case in ->ntfs_lookup() while
  * maintaining a case sensitive dcache.

Reading it again, I misinterpreted it :)

>>> Internally, the names will probably be converted to "u16"s for
>>> efficient processing. Conversion between UTF-8 and UTF-16/UCS-2
>>> is very straight forward.
>>
>> Do we really need that?  And if so please make sure this only happens
>> for filesystems created with the case insensitivity option so normal
>> filesystems don't have to pay for these bloated strings.
>
> There is nothing efficient about using u16 in memory AFAIK.  In fact for  
> majority of the time it just means you use twice the memory per string...
>
> FWIW Mac OS X uses utf8 in the kernel and so does HFS(+) and I can't see  
> anything wrong with that.  And Windows uses u16 (little endian) and so  
> does NTFS.  So there is precedent for doing both internally...
>
> What are the reasons for suggesting that it would be more efficient to  
> use u16 internally?

As I said to Christoph before, the only reason is the nls conversions
use wchar_t. As I don't have any case tables yet (one of the primary
points for discussion), I haven't settled on which method to use.

If I do use u16, it will only be used temporarily for case comparison.

Regards,
barry.


* Re: RFC: Case-insensitive support for XFS
  2007-10-05 18:52     ` Nicholas Miell
@ 2007-10-08  5:07       ` Barry Naujok
  2007-10-08  5:44         ` Nicholas Miell
  0 siblings, 1 reply; 12+ messages in thread
From: Barry Naujok @ 2007-10-08  5:07 UTC
  To: Nicholas Miell, Christoph Hellwig; +Cc: xfs@oss.sgi.com, linux-fsdevel, urban

On Sat, 06 Oct 2007 04:52:18 +1000, Nicholas Miell <nmiell@comcast.net>  
wrote:

> On Fri, 2007-10-05 at 16:44 +0100, Christoph Hellwig wrote:
>> [Adding -fsdevel because some of the things touched here might be of
>>  broader interest and Urban because his name is on nls_utf8.c]
>>
>> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
>> >
>> > On it's own, linux only provides case conversion for old-style
>> > character sets - 8 bit sequences only. A lot of distos are
>> > now defaulting to UTF-8 and Linux NLS stuff does not support
>> > case conversion for any unicode sets.
>>
>> The lack of case tables in nls_utf8.c defintively seems odd to me.
>> Urban, is there a reason for that?  The only thing that comes to
>> mind is that these tables might be quite large.
>>
>
> Case conversion in Unicode is locale dependent. The legacy 8-bit
> character encodings don't code for enough characters to run into the
> ambiguities, so they can get away with fixed case conversion tables.
> Unicode can't.

Based on http://www.unicode.org/reports/tr21/tr21-5.html and
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt:

Doing case comparison using that table should cater for most
circumstances, with a few exceptions. It should be enough
to satisfy a locale independent case-insensitive filesystem
(ie. the C + F case folding option).

Is normalization required after case-folding? What I read
implies it is not necessary for this purpose (and would
slow things down and bloat the code more).

Now I suppose it's just a question of a fixed table in the
kernel driver (HFS+ style), or data stored in a special
inode on-disk (NTFS style, shared refcounted in memory
when the same). With the on-disk approach, the table can be
generated by mkfs.xfs.
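
As a rough sketch of what a fixed C + F folding table could look like
(userspace only): struct fold_entry and fold_cp() are hypothetical
names, and the table holds just four rows from CaseFolding.txt for
illustration. A real table would be far larger and would want a faster
lookup than a linear scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fold_entry {
    uint16_t from;   /* code point to fold */
    uint8_t  len;    /* number of code points in to[] */
    uint16_t to[3];  /* folded form; F rows can expand to up to three */
};

/* Illustrative subset only: three C (common) rows and one F (full) row. */
static const struct fold_entry fold_table[] = {
    { 0x0041, 1, { 0x0061 } },          /* C: A -> a */
    { 0x00C9, 1, { 0x00E9 } },          /* C: É -> é */
    { 0x00DF, 2, { 0x0073, 0x0073 } },  /* F: ß -> ss */
    { 0x0391, 1, { 0x03B1 } },          /* C: Α -> α */
};

/*
 * Hypothetical helper: fold cp into out (room for 3 units) and return
 * the number of units written.  Unlisted code points fold to themselves.
 */
static size_t fold_cp(uint16_t cp, uint16_t out[3])
{
    size_t i, j;

    assert(out != NULL);

    for (i = 0; i < sizeof(fold_table) / sizeof(fold_table[0]); i++) {
        if (fold_table[i].from != cp)
            continue;
        for (j = 0; j < fold_table[i].len; j++)
            out[j] = fold_table[i].to[j];
        return fold_table[i].len;
    }
    out[0] = cp;
    return 1;
}
```

Because F rows change the length of a name, comparison has to be done
on the folded sequences rather than position by position on the raw
names.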

Regards,
Barry.




* Re: RFC: Case-insensitive support for XFS
  2007-10-08  5:07       ` Barry Naujok
@ 2007-10-08  5:44         ` Nicholas Miell
  2007-10-08  6:17           ` Barry Naujok
                             ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Nicholas Miell @ 2007-10-08  5:44 UTC
  To: Barry Naujok; +Cc: Christoph Hellwig, xfs@oss.sgi.com, linux-fsdevel, urban

On Mon, 2007-10-08 at 15:07 +1000, Barry Naujok wrote:
> On Sat, 06 Oct 2007 04:52:18 +1000, Nicholas Miell <nmiell@comcast.net>  
> wrote:
> 
> > On Fri, 2007-10-05 at 16:44 +0100, Christoph Hellwig wrote:
> >> [Adding -fsdevel because some of the things touched here might be of
> >>  broader interest and Urban because his name is on nls_utf8.c]
> >>
> >> On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
> >> >
> >> > On it's own, linux only provides case conversion for old-style
> >> > character sets - 8 bit sequences only. A lot of distos are
> >> > now defaulting to UTF-8 and Linux NLS stuff does not support
> >> > case conversion for any unicode sets.
> >>
> >> The lack of case tables in nls_utf8.c defintively seems odd to me.
> >> Urban, is there a reason for that?  The only thing that comes to
> >> mind is that these tables might be quite large.
> >>
> >
> > Case conversion in Unicode is locale dependent. The legacy 8-bit
> > character encodings don't code for enough characters to run into the
> > ambiguities, so they can get away with fixed case conversion tables.
> > Unicode can't.
> 
> Based on http://www.unicode.org/reports/tr21/tr21-5.html and
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> 
> Doing case comparison using that table should cater for most
> circumstances except a few exceptions. It should be enough
> to satisfy a locale independent case-insensitive filesystem
> (ie. the C + F case folding option).
> 
> Is normalization required after case-folding? What I read
> implies it is not necessary for this purpose (and would
> slow things down and bloat the code more).
> 
> Now I suppose, it's just a question of a fixed table in the
> kernel driver (HFS+ style), or data stored in a special
> inode on-disk (NTFS style, shared and refcounted in memory
> when identical). With the on-disk option, the table can be
> generated by mkfs.xfs.

You also have to decide whether to screw over people who speak Turkic
languages and expect an 'I' to 'ı' mapping or everybody else who expects
an 'I' to 'i' mapping.

Although, if you're content to ignore the kernel's native NLS case
mapping tables (which expect a locale-independent 1-to-1 mapping), you
could just uppercase everything and map both 'i' and 'ı' to 'I'.

Then you have to decide whether things like 'ê' map to 'E' or 'Ê', which
is also locale dependent.
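
As a rough illustration of the two choices above, the following sketch
compares lower-casing under default and Turkic rules, and then the
"uppercase everything" alternative. It relies on Python's built-in,
locale-independent str.lower() and str.upper(); turkic_lower() and
same_name_upper() are hypothetical helpers, not kernel NLS functions.

```python
DOTLESS_I = "\u0131"      # ı
DOTTED_CAP_I = "\u0130"   # İ

# Assumed Turkic overrides for the two capital I's; everything else
# falls back to the default lower-casing.
TURKIC = {"I": DOTLESS_I, DOTTED_CAP_I: "i"}


def turkic_lower(s):
    """Lower-case using Turkic rules for I and İ, default rules otherwise."""
    return "".join(TURKIC.get(ch, ch.lower()) for ch in s)


def same_name_upper(a, b):
    """Compare two names by upper-casing both, so 'i' and 'ı' both become 'I'."""
    return a.upper() == b.upper()


default_form = "FILE".lower()       # default rules: I -> i
turkic_form = turkic_lower("FILE")  # Turkic rules:  I -> ı
```

The two lower-cased forms differ, so a name created under one rule is not
found under the other, while same_name_upper() treats them as equal at the
cost of also equating 'i' with 'ı' for Turkic users.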

-- 
Nicholas Miell <nmiell@comcast.net>

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RFC: Case-insensitive support for XFS
  2007-10-08  5:44         ` Nicholas Miell
@ 2007-10-08  6:17           ` Barry Naujok
  2007-10-08  7:00           ` Barry Naujok
  2007-10-10  2:27           ` Barry Naujok
  2 siblings, 0 replies; 12+ messages in thread
From: Barry Naujok @ 2007-10-08  6:17 UTC (permalink / raw)
  To: Nicholas Miell; +Cc: Christoph Hellwig, xfs@oss.sgi.com, linux-fsdevel, urban

On Mon, 08 Oct 2007 15:44:48 +1000, Nicholas Miell <nmiell@comcast.net>  
wrote:

> You also have to decide whether to screw over people who speak Turkic
> languages and expect an 'I' to 'ı' mapping or everybody else who expects
> an 'I' to 'i' mapping.

Is there some way in the kernel, that I'm unaware of, to find out what
the user's current language and/or codepage locale is set to?

The only thing I've found is the iocharset option that the other
filesystems use or the default_nls_table() if one isn't specified.
The default one seems to be a CONFIG option.

> Although, if you're content to ignore the kernel's native NLS case
> mapping tables (which expect a locale-independent 1-to-1 mapping), you
> could just uppercase everything and map both 'i' and 'ı' to 'I'.
>
> Then you have to decide whether things like 'ê' map to 'E' or 'Ê', which
> is also locale dependent.

Looking at case-folding, it would generate the lower-case equivalent
characters, i.e. nls->charset2lower.

Barry.

* Re: RFC: Case-insensitive support for XFS
  2007-10-08  5:44         ` Nicholas Miell
  2007-10-08  6:17           ` Barry Naujok
@ 2007-10-08  7:00           ` Barry Naujok
  2007-10-10  2:27           ` Barry Naujok
  2 siblings, 0 replies; 12+ messages in thread
From: Barry Naujok @ 2007-10-08  7:00 UTC (permalink / raw)
  To: Nicholas Miell; +Cc: Christoph Hellwig, xfs@oss.sgi.com, linux-fsdevel, urban

On Mon, 08 Oct 2007 15:44:48 +1000, Nicholas Miell <nmiell@comcast.net>  
wrote:

> You also have to decide whether to screw over people who speak Turkic
> languages and expect an 'I' to 'ı' mapping or everybody else who expects
> an 'I' to 'i' mapping.

I suspect they would be used to the false case-insensitive match. I
tested it on Windows XP with NTFS: İ (U+0130) did not match I or i
or ı (U+0131). I also tested it with the Turkish language/keyboard set.

Once it's set in a filesystem, the handling of it can't really be
swapped back and forth either, otherwise, you may lose access to
that file.

There is no practical way that I can see of supporting this
fully, even when using the NLS tables. The on-disk hashes have
to remain consistent regardless of what language is specified.
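
A small sketch of why the hashes have to stay consistent: if a directory
is indexed by a hash of the folded name, a lookup that folds with a
different table computes a different hash and misses the entry. SHA-256
is only a stand-in here and is not the XFS directory hash; the tables,
fold(), name_hash() and lookup() are hypothetical.

```python
import hashlib

DEFAULT = {"I": "i"}        # default rule: I -> i
TURKIC = {"I": "\u0131"}    # Turkic rule:  I -> ı


def fold(name, table):
    """Fold with the table's overrides, default lower-casing otherwise."""
    return "".join(table.get(ch, ch.lower()) for ch in name)


def name_hash(name, table):
    """Stand-in directory hash computed over the folded name."""
    return hashlib.sha256(fold(name, table).encode("utf-8")).hexdigest()


# The entry is created while the default table is in effect.
directory = {name_hash("INBOX", DEFAULT): "INBOX"}


def lookup(name, table):
    """Return the stored name, or None if the hash does not match."""
    return directory.get(name_hash(name, table))
```

With the default table, lookup("inbox", DEFAULT) finds the entry. Switching
to the Turkic table makes lookup("INBOX", TURKIC) miss it, even though the
name is byte-for-byte what was stored.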

Barry.

* Re: RFC: Case-insensitive support for XFS
  2007-10-08  5:44         ` Nicholas Miell
  2007-10-08  6:17           ` Barry Naujok
  2007-10-08  7:00           ` Barry Naujok
@ 2007-10-10  2:27           ` Barry Naujok
  2 siblings, 0 replies; 12+ messages in thread
From: Barry Naujok @ 2007-10-10  2:27 UTC (permalink / raw)
  To: Nicholas Miell; +Cc: Christoph Hellwig, xfs@oss.sgi.com, linux-fsdevel, urban

On Mon, 08 Oct 2007 15:44:48 +1000, Nicholas Miell <nmiell@comcast.net>  
wrote:

> You also have to decide whether to screw over people who speak Turkic
> languages and expect an 'I' to 'ı' mapping or everybody else who expects
> an 'I' to 'i' mapping.

I have had a thought about this. If the case table is stored on-disk like
NTFS, then mkfs.xfs can specify whether to use Turkic I's or not.

That guarantees consistent case folding for the filesystem. mkfs.xfs can
default to a Turkic case table if the user's locale is tr/az and the
"default case table" if not. mkfs.xfs will have to highlight this setting
if the user specifies the generic case-insensitive option. mkfs.xfs
should also allow the user to specify which of the case tables to use.
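
The selection logic could look roughly like the sketch below. It assumes
the locale arrives as a POSIX-style string such as "tr_TR.UTF-8"; the
function name pick_case_table(), the table names "turkic" and "default",
and the override parameter are all hypothetical and do not correspond to
existing mkfs.xfs options.

```python
def pick_case_table(locale_name, override=None):
    """Choose which case table to write at mkfs time.

    An explicit user choice wins; otherwise Turkish (tr) and
    Azerbaijani (az) locales get the Turkic table.
    """
    if override is not None:
        return override
    # Strip encoding, territory and modifier: "tr_TR.UTF-8" -> "tr"
    lang = locale_name.split(".")[0].split("_")[0].split("@")[0].lower()
    return "turkic" if lang in ("tr", "az") else "default"
```

Because the result is fixed at mkfs time and stored with the filesystem,
later changes to the user's locale do not affect how names are folded.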

Barry.

Thread overview: 12+ messages
     [not found] <op.ty6361ut3jf8g2@pc-bnaujok.melbourne.sgi.com>
     [not found] ` <op.tzpbqspl3jf8g2@pc-bnaujok.melbourne.sgi.com>
2007-10-05 15:44   ` RFC: Case-insensitive support for XFS Christoph Hellwig
2007-10-05 18:52     ` Nicholas Miell
2007-10-08  5:07       ` Barry Naujok
2007-10-08  5:44         ` Nicholas Miell
2007-10-08  6:17           ` Barry Naujok
2007-10-08  7:00           ` Barry Naujok
2007-10-10  2:27           ` Barry Naujok
2007-10-05 19:10     ` Anton Altaparmakov
2007-10-06  6:37       ` Brad Boyer
2007-10-06 13:00         ` Anton Altaparmakov
2007-10-08  0:43       ` Barry Naujok
2007-10-08  0:33     ` Barry Naujok
