public inbox for linux-xfs@vger.kernel.org
From: Olaf Weber <olaf@sgi.com>
To: Andi Kleen <andi@firstfloor.org>, Ben Myers <bpm@sgi.com>
Cc: linux-fsdevel@vger.kernel.org, tinguely@sgi.com, xfs@oss.sgi.com
Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
Date: Tue, 23 Sep 2014 18:13:11 +0200
Message-ID: <54219C17.3090104@sgi.com>
In-Reply-To: <20140922192958.GJ4120@two.firstfloor.org>

On 22-09-14 21:29, Andi Kleen wrote:
>>> So 250kB bloat -- and what does this fix exactly?
>>
>> We're trying to address the size issue by only loading the module when
>
> I'm not sure this is really addressing it.

You only pay the space cost if you use it, similar to the nls tables.

>> it's needed, but yeah it's big.  Open to suggestions on how best to deal
>> with that.  I understand the sticker shock.
>
> I don't even understand why you need the whole table.
>
> You want to not compare some special symbols, and a few other symbols
> are equivalent to others.  But most symbols are only identical to themselves.
>
> Couldn't you have a much smaller table that only expresses
> the exceptions?

The trie tells you whether a given sequence of bytes is a UTF-8 encoded
Unicode code point, and if so, it gives the Unicode version in which the
code point was assigned an interpretation (if any), the canonical combining
class (required for normalization), and the decomposition and case fold (if
any).
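
To make the per-code-point payload concrete, here is a minimal sketch in
Python rather than the kernel C of the patch set. The function name
leaf_info and the dictionary layout are made up for illustration and are not
the trie's actual leaf format. The Unicode version in which a code point was
assigned is left out, because the standard unicodedata module does not seem
to expose it.

```python
import unicodedata

def leaf_info(ch):
    """Collect the kind of per-code-point data the trie is described as holding.

    Hypothetical helper: the real trie packs this into its leaves, it does
    not build dictionaries.
    """
    decomp = unicodedata.decomposition(ch)  # "" when there is no decomposition
    folded = ch.casefold()
    return {
        "utf8": ch.encode("utf-8"),           # byte sequence used as the lookup key
        "ccc": unicodedata.combining(ch),     # canonical combining class
        "decomposition": decomp or None,      # e.g. "0065 0301" for U+00E9
        "casefold": folded if folded != ch else None,
    }

print(leaf_info("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE
print(leaf_info("\u0301"))  # COMBINING ACUTE ACCENT
```

Most code points come back with a combining class of 0 and no decomposition
or case fold, which is the "only identical to themselves" case raised above.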

A big part of the table holds the decompositions for Korean: eliminating the
Hangul decompositions removes 156320 bytes, leaving 89936 bytes.

Each Hangul decomposition is stored as a UTF-8 string of two or three
Unicode code points plus a terminating NUL byte. The code points each
require a three-byte UTF-8 sequence, so the total is 7 bytes per 2-part
decomposition, and 10 bytes per 3-part decomposition.
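
The 7-byte and 10-byte figures can be checked with a short Python sketch.
It assumes the stored decomposition is the NFD form of the syllable, and the
helper name stored_size is made up for the example.

```python
import unicodedata

def stored_size(syllable):
    """Return (code points, bytes) for the NFD form of a Hangul syllable,
    counted as a NUL-terminated UTF-8 string."""
    jamo = unicodedata.normalize("NFD", syllable)
    return len(jamo), len(jamo.encode("utf-8")) + 1  # +1 for the NUL byte

print(stored_size("\uac00"))  # U+AC00 -> U+1100 U+1161: (2, 7)
print(stored_size("\uac01"))  # U+AC01 -> U+1100 U+1161 U+11A8: (3, 10)
```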

With that in mind, the 156320 additional bytes spent on Hangul are accounted 
for as follows:

  22344 bytes : 11172 leaves * 2 byte leaf size
   2793 bytes :   399 2-part decompositions at 7 bytes each
 107730 bytes : 10773 3-part decompositions at 10 bytes each

This adds up to 132867 bytes of data, with the remainder, 23453 bytes, spent 
on additional internal trie nodes.
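
The breakdown above can be reproduced by walking the Hangul syllable block
(U+AC00 through U+D7A3) and counting 2-part and 3-part decompositions. This
is a sketch in Python: the function name hangul_cost and the constants are
illustrative, while the 2-byte leaf size and the 156320-byte total are taken
from the figures above.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
LEAF_SIZE = 2          # bytes per leaf, from the breakdown above
HANGUL_TOTAL = 156320  # bytes saved by dropping Hangul, from the text above

def hangul_cost():
    """Return (2-part count, 3-part count, data bytes, remaining node bytes)."""
    two = three = 0
    for cp in range(HANGUL_FIRST, HANGUL_LAST + 1):
        # NFD of a Hangul syllable is always two or three jamo.
        if len(unicodedata.normalize("NFD", chr(cp))) == 2:
            two += 1
        else:
            three += 1
    data = (two + three) * LEAF_SIZE + two * 7 + three * 10
    return two, three, data, HANGUL_TOTAL - data

print(hangul_cost())  # (399, 10773, 132867, 23453)
```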

>> As far as telling the customer "don't do that", my guess is that they
>> would just go elsewhere.  There are several other options for
>> filesystems that support unicode.
>
> They could put some code into their user app that generates
> an unique representation.

This assumes a single app, and that they control the source of that app.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                            Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com
