From: Olaf Weber <olaf@sgi.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: linux-fsdevel@vger.kernel.org, Ben Myers <bpm@sgi.com>,
tinguely@sgi.com, xfs@oss.sgi.com
Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
Date: Wed, 24 Sep 2014 13:07:36 +0200 [thread overview]
Message-ID: <5422A5F8.5040703@sgi.com> (raw)
In-Reply-To: <20140923201540.GB15923@two.firstfloor.org>
On 23-09-14 22:15, Andi Kleen wrote:
>> A big part of the table does decompositions for Korean: eliminating
>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>
> Are there regular ranges or other redundancies in the Korean encoding
> that could be used to compress paths?
Yes, though at the expense of more complicated code and interfaces. in
particular, lookups that want a normalized string would need to provide a
10-byte buffer to store it in.
> Doing some basic research other people already answered this:
>
> Please use the ICU or google tables referenced below. Apparently
> smaller is possible too, but 40-50k seems more reasonable.
Riffing off the http://macchiato.com/unicode/normalization_footprint.htm
link you provided, looking at the NFKD case. For Unicode 3.0.0 that link
gives 3483 NFKD normalizations (exlcuding Hangul), and gives 26,918 bytes as
the size of a simple lookup table (key/offset pairs, with the offset
pointing into a string table).
In Unicode 7.0.0 I count 5721 NFKD normalizations (again excluding Hangul).
As NUL-terminated UTF-8 strings these take 23390 bytes. Using a
key-offset table I need 3 bytes for the key (code points are 21 bits) and 2
bytes for an offset. Total is 5721 * (3 + 2) + 23390 = 51995 bytes. Stealing
4 bits from the key field and 1 from the offset to store the size of the
normalized string I can remove the NUL bytes from the string table and
reduce total size to 46274 bytes.
The trie implementation used here would use 66283 bytes to store the same
information, but it also provides unicode version and canonical combining
class for all codepoints. There are 10268 leaves in this case, with the size
of each leaf being 1 byte for version, one for ccc, plus the size of the
decomposition, if any. So a quick estimate on the space used just for the
NFKD data is some 45747 bytes.
I'm pretty certain that a trie that only stores the NFKD would be smaller
than 45747 bytes as it would need fewer internal nodes, but didn't do the
experiment. But as you can see, for just the NFKD part and excluding Hangul,
the size of the trie is within the ballpark of the numbers you gave.
Case folding adds a partial trie that forwards to the "main" trie for parts
that are identical. This adds 2672 extra leaves and 20171 extra bytes. The
data for the normalization corrections adds another 3328 bytes. With a bit
of rounding, total size comes to 89840 bytes.
> I'm just gonna make the claim that whatever performance you
> get from a larger table is dwarfed by the cache miss overhead.
That's possible, though it seems plausible you'd only suffer from this doing
actual Hangul decomposition: all the data related to this (trie nodes, trie
leaves, and strings) sits in one contiguous block in memory.
Olaf
--
Olaf Weber SGI Phone: +31(0)30-6696796
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-6796
Storage Software The Netherlands Email: olaf@sgi.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-09-24 11:07 UTC|newest]
Thread overview: 65+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-09-18 19:56 [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 20:08 ` [PATCH 01/10] xfs: return the first match during case-insensitive lookup Ben Myers
2014-09-18 20:09 ` [PATCH 02/10] xfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-18 20:09 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:10 ` [PATCH 04/10] xfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-18 20:11 ` [PATCH 05/10] xfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-18 20:13 ` [PATCH 03/10] xfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:14 ` [PATCH 06/10] xfs: add unicode character database files Ben Myers
2014-09-22 20:54 ` Dave Chinner
2014-09-26 17:09 ` Christoph Hellwig
2014-09-18 20:15 ` [PATCH 07/10] xfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-22 20:57 ` Dave Chinner
2014-09-23 18:57 ` Ben Myers
2014-09-26 17:10 ` Christoph Hellwig
2014-09-18 20:16 ` [PATCH 08/10] xfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-18 20:17 ` [PATCH 09/10] xfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-18 20:18 ` [PATCH 10/10] xfs: implement demand load of utf8norm.ko Ben Myers
2014-09-18 20:31 ` [PATCH 00/13] xfsprogs: Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 20:33 ` [PATCH 01/13] libxfs: return the first match during case-insensitive lookup Ben Myers
2014-09-18 20:33 ` [PATCH 02/13] libxfs: rename XFS_CMP_CASE to XFS_CMP_MATCH Ben Myers
2014-09-18 20:34 ` [PATCH 03/13] libxfs: add xfs_nameops.normhash Ben Myers
2014-09-18 20:35 ` [PATCH 04/13] libxfs: change interface of xfs_nameops.normhash Ben Myers
2014-09-18 20:36 ` [PATCH 05/13] libxfs: add a superblock feature bit to indicate UTF-8 support Ben Myers
2014-09-18 20:37 ` [PATCH 06/13] xfsprogs: add unicode character database files Ben Myers
2014-09-18 20:38 ` [PATCH 07/13] libxfs: add trie generator and supporting code for UTF-8 Ben Myers
2014-09-18 20:38 ` [PATCH 08/13] libxfs: add xfs_nameops for utf8 and utf8+casefold Ben Myers
2014-09-18 20:39 ` [PATCH 09/13] libxfs: apply utf-8 normalization rules to user extended attribute names Ben Myers
2014-09-18 20:40 ` [PATCH 10/13] xfsprogs: add utf8 support to growfs Ben Myers
2014-09-18 20:41 ` [PATCH 11/13] xfsprogs: add utf8 support to mkfs.xfs Ben Myers
2014-09-18 20:42 ` [PATCH 12/13] xfsprogs: add utf8 support to xfs_repair Ben Myers
2014-09-18 20:43 ` [PATCH 13/13] xfsprogs: add a preliminary test for utf8 support Ben Myers
2014-09-19 16:06 ` [PATCH 07a/13] xfsprogs: add trie generator for UTF-8 Ben Myers
2014-09-23 18:34 ` Roger Willcocks
2014-09-24 23:11 ` Ben Myers
2014-09-19 16:07 ` [PATCH 07b/13] libxfs: add supporting code " Ben Myers
2014-09-18 21:10 ` [RFC v2] Unicode/UTF-8 support for XFS Ben Myers
2014-09-18 21:24 ` Zach Brown
2014-09-18 22:23 ` Ben Myers
2014-09-19 16:03 ` [PATCH 07a/10] xfs: add trie generator for UTF-8 Ben Myers
2014-09-19 16:04 ` [PATCH 07b/10] xfs: add supporting code " Ben Myers
2014-09-22 14:55 ` [RFC v2] Unicode/UTF-8 support for XFS Andi Kleen
2014-09-22 18:41 ` Ben Myers
2014-09-22 19:29 ` Andi Kleen
2014-09-23 16:13 ` Olaf Weber
2014-09-23 20:15 ` Andi Kleen
2014-09-23 20:45 ` Ben Myers
2014-09-24 11:07 ` Olaf Weber [this message]
2014-09-26 14:06 ` Olaf Weber
2014-09-23 13:01 ` Olaf Weber
2014-09-23 20:02 ` Andi Kleen
2014-09-22 22:26 ` Dave Chinner
2014-09-24 13:21 ` Olaf Weber
2014-09-24 23:10 ` Dave Chinner
2014-09-25 13:33 ` Zuckerman, Boris
2014-09-26 14:50 ` Olaf Weber
2014-09-26 16:56 ` Christoph Hellwig
2014-09-26 17:04 ` Jeremy Allison
2014-09-26 17:06 ` Christoph Hellwig
2014-09-26 17:13 ` Jeremy Allison
2014-09-26 19:37 ` Olaf Weber
2014-09-26 19:46 ` Jeremy Allison
2014-09-26 20:03 ` Olaf Weber
2014-09-29 20:16 ` J. Bruce Fields
2014-09-29 11:06 ` Christoph Hellwig
2014-09-26 17:30 ` Ben Myers
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5422A5F8.5040703@sgi.com \
--to=olaf@sgi.com \
--cc=andi@firstfloor.org \
--cc=bpm@sgi.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tinguely@sgi.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox