From: Olaf Weber
Subject: Re: [RFC v2] Unicode/UTF-8 support for XFS
Date: Tue, 23 Sep 2014 15:01:20 +0200
Message-ID: <54216F20.1090302@sgi.com>
References: <20140918195650.GI19952@sgi.com> <87lhpbhfgg.fsf@tassilo.jf.intel.com>
In-Reply-To: <87lhpbhfgg.fsf@tassilo.jf.intel.com>
To: Andi Kleen, Ben Myers
Cc: linux-fsdevel@vger.kernel.org, tinguely@sgi.com, xfs@oss.sgi.com

On 22-09-14 16:55, Andi Kleen wrote:
> Ben Myers writes:
>>
>> Strings are normalized using a trie that stores the relevant
>> information.  The trie itself is about 250kB in size, and lives in a
>> separate module.
>
> So 250kB bloat -- and what does this fix exactly?
>
> Someone putting random ligatures into their file names and expecting
> the file to be the same as before. Can't they just not do that?

I like the 'office' example because it is applicable to English and easy to
explain. Once you move away from English examples are much easier to come
by.

Take a Dutch name like 'Renée Soutendijk'.

These two forms both spell Renée in UTF-8:
   0x52 0x65 0x6E 0xC3 0xA9 0x65
   0x52 0x65 0x6E 0x65 0xCC 0x81 0x65
The difference is
   LATIN SMALL LETTER E WITH ACUTE (U+00E9)
   LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301)
and corresponds to the difference between NFC and NFD.

These two forms both spell Soutendijk in UTF-8:
   0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0x69 0x6A 0x6B
   0x53 0x6F 0x75 0x74 0x65 0x6E 0x64 0xC4 0xB3 0x6B
The difference is
   LATIN SMALL LETTER I (U+0069) LATIN SMALL LETTER J (U+006A)
   LATIN SMALL LIGATURE IJ (U+0133)
and the former is the compatibility decomposition of the latter, the 'K' in
NFKC/NFKD.
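
The byte sequences above can be checked from user space. The following is only a minimal sketch using Python's standard unicodedata module; it is not the XFS trie code, and the helper names from_utf8 and same_name are made up for illustration. It shows that canonical normalization (NFC/NFD) is enough to make the two spellings of Renée compare equal, while the two spellings of Soutendijk only compare equal under the compatibility forms (NFKC/NFKD).

```python
import unicodedata


def from_utf8(*octets: int) -> str:
    """Decode a sequence of UTF-8 octets into a string (illustrative helper)."""
    return bytes(octets).decode("utf-8")


# The two spellings of "Renée" quoted in the message.
renee_composed = from_utf8(0x52, 0x65, 0x6E, 0xC3, 0xA9, 0x65)          # U+00E9
renee_decomposed = from_utf8(0x52, 0x65, 0x6E, 0x65, 0xCC, 0x81, 0x65)  # U+0065 U+0301

# The two spellings of "Soutendijk" quoted in the message.
dijk_letters = from_utf8(0x53, 0x6F, 0x75, 0x74, 0x65, 0x6E, 0x64,
                         0x69, 0x6A, 0x6B)                              # U+0069 U+006A
dijk_ligature = from_utf8(0x53, 0x6F, 0x75, 0x74, 0x65, 0x6E, 0x64,
                          0xC4, 0xB3, 0x6B)                             # U+0133


def same_name(a: str, b: str, form: str) -> bool:
    """Compare two names after normalizing both to the given Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form,
              "Renée equal:", same_name(renee_composed, renee_decomposed, form),
              "Soutendijk equal:", same_name(dijk_letters, dijk_ligature, form))
```

Without any normalization the strings in each pair differ byte for byte, so a filesystem that compares raw bytes treats them as different names.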

Do accented letters count as random ligatures that people should just not
use?

The bulk of the table deals with Korean.

Olaf

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@sgi.com