From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anton Altaparmakov <aia21@cam.ac.uk>
Subject: Re: [PATCH] Full NLS support for HFS (classic) filesystem
Date: Tue, 31 May 2005 15:49:18 +0100
Message-ID: <1117550958.8073.30.camel@imp.csi.cam.ac.uk>
References: <429B1E35.2040905@rambler.ru>
	 <Pine.LNX.4.61.0505301337040.3743@scrub.home> <429C68A0.20003@rambler.ru>
	 <Pine.LNX.4.61.0505311156520.3728@scrub.home> <429CBC75.2030605@rambler.ru>
	 <Pine.LNX.4.61.0505311401290.3728@scrub.home> <429CD545.1070308@rambler.ru>
	 <Pine.LNX.4.61.0505311550080.3728@scrub.home>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: Pavel Fedin <sonic_amiga@rambler.ru>, linux-fsdevel@vger.kernel.org
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from ppsw-0.csi.cam.ac.uk ([131.111.8.130]:18622 "EHLO
	ppsw-0.csi.cam.ac.uk") by vger.kernel.org with ESMTP
	id S261504AbVEaOt3 (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 31 May 2005 10:49:29 -0400
To: Roman Zippel <zippel@linux-m68k.org>
In-Reply-To: <Pine.LNX.4.61.0505311550080.3728@scrub.home>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Hi,

On Tue, 2005-05-31 at 15:59 +0200, Roman Zippel wrote:
> On Tue, 31 May 2005, Pavel Fedin wrote:
> > > If the names were translated correctly, HFS would have found them. You need
> > > to give me an example, which should have worked, but failed.
> > 
> >  I can't produce exact russian string (don't remember), but it was about 50%
> > of all russian names.
> 
> Without an example I can't reproduce, what you're trying to say here (at 
> least the "HFS doesn't find the file, even though it's correctly 
> translated" part). 

There are lots of characters that cannot be translated.  I have this
problem with NTFS, too.  My solution is to just ignore file names that
cannot be translated.  If users complain that they cannot see some
filenames, I tell them to use utf8 for their encoding, which always
works for translation.
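
A minimal sketch of the "ignore what cannot be translated" policy, for
illustration only.  It is not the ntfs driver's code: the function name
visible_names, the sample names, and the use of Python's codecs in place
of the kernel NLS tables are all assumptions.

```python
def visible_names(names, iocharset):
    """Return (name, encoded) pairs for names that encode cleanly.

    Names with no mapping in `iocharset` are skipped, which mirrors the
    "just ignore them" policy described above (hypothetical helper).
    """
    shown = []
    for name in names:
        try:
            shown.append((name, name.encode(iocharset)))
        except UnicodeEncodeError:
            # No mapping in this charset: hide the entry rather than
            # inventing a substitute name.
            continue
    return shown


# Assumed sample directory: ASCII, Cyrillic, and a CJK ideograph.
on_disk = ["readme.txt", "\u0444\u0430\u0439\u043b.txt", "\u8c48.txt"]

print(len(visible_names(on_disk, "koi8_r")))  # the CJK name is hidden
print(len(visible_names(on_disk, "utf-8")))   # every name is visible
```

With an 8-bit charset the listing silently shrinks; with utf8 nothing is
dropped, which is why switching the encoding resolves the complaint.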

NLS is fundamentally broken so there is no point in trying to use clever
dynamic tables to do it.  Just ignoring such names is IMO the correct way.

Btw, not having mappings is not even the biggest problem.  It gets much
worse, and even Pavel's dynamic mappings are not actually going to work.
For example, there are some characters in Asian languages which, when
translated, give a character in the code page, but when you then reverse
translate that character you end up with something that is not the same
as the starting character.

This is a fundamental flaw with the whole NLS codepages approach because
there are symbols in Unicode which have identical meaning but two
different Unicode values.  You get this for the "exact" ideographs and
the "simplified" ideographs (e.g. CJK and compatibility ideographs - see
Unicode standard and various NLS pages for details).  If you don't
believe me I can dig out the old emails which have concrete examples of
how you convert a CJK ideograph to some codepage and then back and you
end up with a compatibility CJK ideograph instead of the original one.
Of course if you start with the compatibility CJK ideograph and do the
translation + reverse translation, you end up with the same
compatibility ideograph.  But that doesn't help you when the "real"
ideograph is the one in use, as seems to be the case with Windows: a lot
of Asian people have complained to me about ntfs when used with
codepages.  All of them went away happy when I told them to tell the
ntfs driver to use utf8...
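
The round-trip problem can be sketched with a toy pair of tables.  The
two code points are real (U+F900 is a compatibility ideograph whose
canonical equivalent is U+8C48), but the byte value and both tables are
hypothetical and do not reproduce any actual code page.

```python
import unicodedata

UNIFIED = "\u8c48"  # CJK unified ideograph
COMPAT = "\uf900"   # compatibility ideograph with the same meaning

# Hypothetical tables: the code page has a single slot for the glyph, so
# both code points translate to it, and the reverse table has to pick one.
UNI2CHAR = {UNIFIED: b"\xa1\xa1", COMPAT: b"\xa1\xa1"}
CHAR2UNI = {b"\xa1\xa1": COMPAT}


def round_trip(ch):
    """Translate to the toy code page and back again."""
    return CHAR2UNI[UNI2CHAR[ch]]


# Unicode itself treats the two as equivalent under normalization.
print(unicodedata.normalize("NFC", COMPAT) == UNIFIED)  # True
print(round_trip(UNIFIED) == UNIFIED)                   # False
print(round_trip(COMPAT) == COMPAT)                     # True
```

Whichever code point the reverse table picks, the other one cannot
survive the round trip, so a name stored with it no longer compares
equal after translation.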

So unless you use UTF8, any other conversion using NLS/code pages will
always have failure cases...

(btw. I first suspected bugs in the code pages but I verified them on
the MS website and they were correct...)

> > > Create the tables in a nls module and you can do whatever you want in the
> > > uni2char/char2uni functions.
> > 
> >  Huh...
> >  The problem is: when using 8-bit iocharset and 8-bit codepage char2uni from
> > codepage always gives the result but AFTER THIS uni2char to iocharset does NOT
> > necessarily gives the result. There are characters in codepage which have no
> > equivalents in iocharset. They will be lost, you suggest to turn them into
> > '?'. But how to reverse this in order to supply to hfs_strcmp()?
> 
> So create two functions uni2char/char2uni, which provide perfect reverse 
> mapping. Sorry, but I don't understand what your problem is here.
> It seems you're making it more complex than it really is.

That is impossible due to the problems with compatibility characters I
explained above, which is why I would agree with you that such magic
conversions should never happen.  Just put "use utf8 if you have
problems" in the mount man page...
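
The 8-bit case Pavel describes in the quoted text can be checked the
same way.  This is only a sketch: mac_cyrillic and koi8_r are assumed
stand-ins for the on-disk codepage and the iocharset, and Python's
codecs stand in for the kernel's char2uni/uni2char.

```python
def unmappable(codepage, iocharset):
    """List (byte, char) pairs that decode from `codepage` but have no
    encoding in `iocharset` (hypothetical helper)."""
    lost = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(codepage)  # plays the role of char2uni
        except UnicodeDecodeError:
            continue                          # byte undefined in codepage
        try:
            ch.encode(iocharset)              # plays the role of uni2char
        except UnicodeEncodeError:
            lost.append((b, ch))
    return lost


lost = unmappable("mac_cyrillic", "koi8_r")
print(len(lost) > 0)                               # True: some chars are lost
print(len(unmappable("mac_cyrillic", "utf-8")))    # 0: utf8 keeps them all

# Substituting '?' maps every lost character to the same byte, so the
# original cannot be recovered for a later name comparison.
print({ch.encode("koi8_r", errors="replace") for _, ch in lost})
```

Once distinct characters collapse to '?', nothing is left from which to
rebuild the on-disk name, whereas the utf8 run loses nothing.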

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/