From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S266694AbUBMDPR (ORCPT ); Thu, 12 Feb 2004 22:15:17 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S266695AbUBMDPR (ORCPT ); Thu, 12 Feb 2004 22:15:17 -0500
Received: from mail.shareable.org ([81.29.64.88]:25474 "EHLO mail.shareable.org") by vger.kernel.org with ESMTP id S266694AbUBMDPH (ORCPT ); Thu, 12 Feb 2004 22:15:07 -0500
Date: Fri, 13 Feb 2004 03:15:02 +0000
From: Jamie Lokier
To: John Bradford
Cc: Robin Rosenberg, Linux kernel
Subject: Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)
Message-ID: <20040213031502.GG25499@mail.shareable.org>
References: <20040209115852.GB877@schottelius.org> <200402121906.54699.robin.rosenberg.lists@dewire.com> <200402121908.i1CJ86NC000167@81-2-122-30.bradfords.org.uk> <200402122039.19143.robin.rosenberg.lists@dewire.com> <200402122113.i1CLDqoB000179@81-2-122-30.bradfords.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200402122113.i1CLDqoB000179@81-2-122-30.bradfords.org.uk>
User-Agent: Mutt/1.4.1i
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

John Bradford wrote:
> in the real world don't you think that there will be a lot of
> decoders which decode the multi-byte sequence back, rather than
> report an error?

There will be decoders which convert ASCII "a" to "A" too.  We can't
fix broken code; at least we can make it clear to anyone writing a
decoder what is acceptable, and that being "liberal" in what's decoded
is not acceptable and considered a security flaw.

An app author only writes the UTF-8 decoder once; it isn't at all hard
to convert non-minimal forms to the replacement char U+FFFD.
(Although that could be a security hole in some cases, it's much
better than allowing non-zero characters to decode to NUL or "/" or
".").
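
To make the "convert non-minimal forms to U+FFFD" point concrete, here
is a minimal sketch in Python.  It is only an illustration under my own
assumptions, not code from any particular application: the name
decode_utf8_strict and the MIN_FOR_LENGTH table are hypothetical, and
emitting exactly one U+FFFD per malformed sequence is one policy among
several (other decoders emit one per bad byte).

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs a sequence of each length.
# Anything below this for its length is a non-minimal ("overlong") form.
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, mapping every malformed sequence to U+FFFD.

    Hypothetical helper for illustration; it never raises, so it can be
    used in places that cannot flag errors.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            length, cp = 2, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            length, cp = 3, b & 0x0F
        elif 0xF0 <= b <= 0xF7:
            length, cp = 4, b & 0x07
        else:                             # stray continuation or bad lead byte
            out.append(REPLACEMENT)
            i += 1
            continue

        # Consume as many continuation bytes (10xxxxxx) as the lead promises.
        j = i + 1
        while j < i + length and j < n and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1

        ok = (
            j - i == length                   # not truncated
            and cp >= MIN_FOR_LENGTH[length]  # minimal form only
            and cp <= 0x10FFFF                # inside the Unicode range
            and not 0xD800 <= cp <= 0xDFFF    # no surrogates
        )
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch the bytes C1 81 (a two-byte encoding of "A") and C0 AF
(a two-byte encoding of "/") both come out as U+FFFD, while the single
byte 41 still decodes to "A", so the two names can never be confused.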
Rejecting a non-minimal form is often hard, because the UTF-8 decoder
is often used in a place which cannot flag errors.

> Imagine you have two files, with the following filename bytes:
>
> 11000001 10000001 00000000
> 01000001 00000000
>
> ..and a _real world_ application, which is not necessarily completely
> UTF-8 conformant, tries to open the file with filename 'A'.  Which one
> is it going to open?

The one which "ls" and other programs show as "A".  The other one will
typically show as "?" or a diamond or something.

> I don't think that the issue with combining characters is likely to be
> an issue, I only mentioned it as an example.  As you pointed out a
> single accented character, and a two character combination are
> distinct, and converting the combination to the corresponding single
> character in a filename would definitely be wrong, in my opinion.
> However, that doesn't mean that software won't do it.

Indeed some software will do it, and worse than that: they may look
the same in an editor or file selector.  (See recent problems with
misleading URLs for why that sort of thing can be a security hole).

The combining char problem is similar to case folding: some
filesystems and programs treat "a" and "A" as equivalent too.

If the kernel had an encoding converter, and the filesystem stored
iso-8859-1 while userspace was presented with utf-8, it is likely that
several Unicode characters would be mapped to "a", causing similar
problems to automatic case folding in filesystems.

In other words, there is no clear solution to this problem.

-- 
Jamie