From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S261909AbUBRGsM (ORCPT ); Wed, 18 Feb 2004 01:48:12 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S262425AbUBRGsM (ORCPT ); Wed, 18 Feb 2004 01:48:12 -0500
Received: from islay.mach.uni-karlsruhe.de ([129.13.162.92]:26292
	"EHLO mailout.schmorp.de") by vger.kernel.org with ESMTP
	id S261909AbUBRGsI (ORCPT ); Wed, 18 Feb 2004 01:48:08 -0500
Date: Wed, 18 Feb 2004 07:48:05 +0100
From: Marc Lehmann
To: linux-kernel@vger.kernel.org
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS default behavior)
Message-ID: <20040218064805.GC1146@schmorp.de>
Mail-Followup-To: linux-kernel@vger.kernel.org
References: <20040217071448.GA8846@schmorp.de>
	<20040217163613.GA23499@mail.shareable.org>
	<20040217175209.GO8858@parcelfarce.linux.theplanet.co.uk>
	<20040217192917.GA24311@mail.shareable.org>
	<20040217195348.GQ8858@parcelfarce.linux.theplanet.co.uk>
	<200402172035.i1HKZM4j000154@81-2-122-30.bradfords.org.uk>
	<20040217204024.GE24311@mail.shareable.org>
	<200402172050.i1HKoLPG000210@81-2-122-30.bradfords.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
X-Operating-System: Linux version 2.4.24 (root@cerebro) (gcc version 2.95.4 20011002 (Debian prerelease))
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 17, 2004 at 01:04:14PM -0800, Linus Torvalds wrote:
> Admittedly you might need up to six octets for the worst case, but hey,
> since you only need one for the most common case (by _far_), who cares?

Being a fan of UTF-8, I still have to remark that this is a rather
imperialistic view that only happens to work in many western countries. It
starts to fail in Greece, Russia and Asian countries, where text size goes
up by a factor of 1.5 to 3.
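To put rough numbers on that factor, here is a minimal Python sketch that
encodes a few short words once in a legacy 8-bit or multibyte charset and
once in UTF-8. The sample words and the choice of legacy codecs are my own
assumptions for illustration; they are not taken from the thread.

```python
# Compare the byte size of short sample words in a legacy charset and in UTF-8.
# The words and codec pairings below are illustrative assumptions.
SAMPLES = [
    ("German",   "iso-8859-1", "Größe"),
    ("Greek",    "iso-8859-7", "αλφα"),
    ("Russian",  "koi8-r",     "кириллица"),
    ("Japanese", "euc-jp",     "日本語"),
    ("Thai",     "tis-620",    "ไทย"),
]


def inflation(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes, utf8_bytes / legacy_bytes)."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return legacy, utf8, utf8 / legacy


if __name__ == "__main__":
    for name, codec, text in SAMPLES:
        legacy, utf8, factor = inflation(text, codec)
        print(f"{name:9} {codec:11} {legacy:3d} -> {utf8:3d} bytes (x{factor:.2f})")
```

With these samples the Japanese word comes out at a factor of 1.5, the Greek
and Russian ones at 2, and the Thai one at 3. Real text mixes in ASCII
spaces, digits and punctuation, so the overall factor for a whole document
will usually be somewhat lower.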
This was _one_ of the major obstacles that UTF-8 had to overcome in Asian
countries. Personally, I think that's not a big problem (memory for text
storage is cheap etc. :), but I am living in an ISO-8859-1 world with only
occasional voyages elsewhere.

> And with the same UTF-8 encoding, you could some day encode UCS-8 too if
> the idiotic standards bodies some day decide that 4 billion characters
> isn't enough because of all the in-fighting.

Four billion glyphs will not be reached, of course, but it's not impossible
that some codeset space inflation will happen due to the introduction of
extra planes for strange purposes.

> Of course, since you like UCS-4, you don't care about backwards
> compatibility.

While UCS-2 is obviously useless, UCS-4 is useful in rare cases where you
either need fixed character sizes or the inflation to 5 or 6 byte values
becomes a problem (which should be never).

Using UCS-4 for filenames is just evil (of course :)

UTF-8 was invented for the purpose of mapping Unicode to filenames, and it
certainly is the most sane encoding so far, since it doesn't share the
"artificial" limitations to 16, 21 or 32 bits that other Unicode encodings
have.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |