Re: Fwd: NLS mappings for iso-8859-* encodings


* Re: Fwd: NLS mappings for iso-8859-* encodings
@ 2002-05-07 23:07 Petr Vandrovec
  2002-05-08 18:17 ` Anton Altaparmakov
  0 siblings, 1 reply; 4+ messages in thread
From: Petr Vandrovec @ 2002-05-07 23:07 UTC (permalink / raw)
  To: Urban Widmark; +Cc: linux-kernel

On  8 May 02 at 0:08, Urban Widmark wrote:
> On Tue, 7 May 2002, Petr Vandrovec wrote:
> 
> ncpfs should perhaps not use iso8859-x to read filenames in some cp*
> encoding. The default nls you can specify is strange, is it the default
> for chars on the filesystem or the default to use for display?
> 
> isofs uses it for display (and has no need for a second nls table).
> smbfs uses it for display and has a second default for the remote chars.
> ncpfs uses it as default for both display and remote.
> vfat also uses it for both on-disk and display.
> 
> I think ncpfs should demand that the user sets two defaults and if that
> isn't done no default translation is made (just do a memcpy in ncp__vol2io
> and ncp__io2vol). That's what smbfs does anyway.

Yes, it looks like a good idea.

> In unicode the 0x80-0x9F does not contain any printable characters, but
> they are defined. I know one table for iso8859-1 that lists that part as
> being empty/undefined, but it's not an iso document.
> 
> For someone setting their default to iso8859-1 that patch is probably ok,
> but what happens when someone sets it to a variable length encoding? (sjis)

They still have a problem - but they'll probably know what to do as they
had to change default NLS from iso8859-1 to something else.

> But if you have checked that you are not mapping two values to the same
> thing (which would break the back-and-forth translation that smbfs does) I
> don't see how that patch can harm anything.

Yes, I checked it. After changing iso*, all single-byte encodings except
cp874 contain a unique mapping for every byte value (cp874 is unique too,
but some of its values are unmappable).
                                    Thanks,
                                            Petr Vandrovec
                                            vandrove@vc.cvut.cz
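
The uniqueness check mentioned above can be sketched roughly as follows.
This is only an illustration in Python, not the script that was actually
used: the function name check_table and the 256-entry list layout (index =
byte value, entry = Unicode code point or None) are assumptions made for
the example, and the sample tables are hypothetical rather than copied from
the kernel.

```python
from collections import defaultdict

def check_table(char2uni):
    """Report collisions and holes in a single-byte table.

    char2uni: 256 entries, index = byte value, entry = Unicode code point,
    or None when the byte is unmappable (assumed layout, for illustration).
    """
    if len(char2uni) != 256:
        raise ValueError("expected one entry per byte value")
    seen = defaultdict(list)
    unmappable = []
    for byte, code_point in enumerate(char2uni):
        if code_point is None:
            unmappable.append(byte)
        else:
            seen[code_point].append(byte)
    # A code point reached from more than one byte breaks the way back.
    collisions = {cp: bs for cp, bs in seen.items() if len(bs) > 1}
    return collisions, unmappable

# Identity table: every byte maps to the code point of the same value.
identity = list(range(256))
print(check_table(identity))

# Hypothetical table with one hole: still unique, but not total.
holey = list(range(256))
holey[0xDB] = None
print(check_table(holey))
```

An empty collisions dict means back-and-forth translation cannot confuse
two bytes; a non-empty unmappable list is the "unique, but some values are
unmappable" case.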
                                            


* Re: Fwd: NLS mappings for iso-8859-* encodings
  2002-05-07 23:07 Fwd: NLS mappings for iso-8859-* encodings Petr Vandrovec
@ 2002-05-08 18:17 ` Anton Altaparmakov
  0 siblings, 0 replies; 4+ messages in thread
From: Anton Altaparmakov @ 2002-05-08 18:17 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: Urban Widmark, linux-kernel

At 00:07 08/05/02, Petr Vandrovec wrote:
>On  8 May 02 at 0:08, Urban Widmark wrote:
> > On Tue, 7 May 2002, Petr Vandrovec wrote:
> > But if you have checked that you are not mapping two values to the same
> > thing (which would break the back-and-forth translation that smbfs does) I
> > don't see how that patch can harm anything.
>
>Yes, I checked it. After changing iso* all singlebyte encodings except
>cp874 contain unique mapping for all byte values (cp874 is unique, but
>some values are unmappable).

Wrong. The NLS tables do not guarantee unique mapping. So all fs which do 
"back-and-forth" translation are broken, the only encoding which really 
works is UTF-8.

We found out the hard way in ntfs. An example: take CP936 (GB2312).

Take a Unicode character between 0x4e00 and 0x9fa5, i.e. the CJK Ideograph 
range (yes we found examples using these characters on various (chinese?) 
websites).

Convert to gb2312 using NLS and then back to Unicode and you end up with a 
Unicode character in the range 0xF900-0xFA2D, i.e. the CJK Compatibility 
Ideographs.

Concrete example we ran into with ntfs, Unicode character 0x884C (a CJK 
Ideograph). Translates to gb2312 character sequence 0xD0, 0xD0, and then 
back to Unicode character 0xFA08 (a CJK Compatibility Ideograph).

I double checked the translation manually and I also checked the original
translation tables on the microsoft website and this is indeed what
happens. If you look at the translation table there are several Unicode
characters mapping to the gb2312 character sequence 0xD0, 0xD0, but
obviously this only maps back to a single Unicode character.
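
The loss can be reproduced with a toy model. The sketch below is Python
and uses two small dicts as stand-ins for the two directions of an NLS
table; the names UNI2CHAR, CHAR2UNI and round_trip are made up for the
example, and the only values in it are the ones quoted in this message,
not entries read from a real table.

```python
# Unicode -> bytes: two code points share one byte sequence.
UNI2CHAR = {
    0x884C: b"\xd0\xd0",   # CJK Ideograph
    0xFA08: b"\xd0\xd0",   # CJK Compatibility Ideograph
}

# bytes -> Unicode: the reverse direction can only keep one answer.
CHAR2UNI = {
    b"\xd0\xd0": 0xFA08,
}

def round_trip(code_point):
    """Convert to the byte encoding and back again."""
    return CHAR2UNI[UNI2CHAR[code_point]]

def lossy_code_points():
    """Code points that do not come back as themselves."""
    return sorted(cp for cp in UNI2CHAR if round_trip(cp) != cp)

print([hex(cp) for cp in lossy_code_points()])   # ['0x884c']
```

Any table in which the forward direction is many-to-one behaves like this,
whatever the reverse direction chooses.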

Also, if you look up the Unicode Character Database 2.1 (I checked rev 2.1.8)
from the Unicode Consortium, it specifies this as correct:

[snip]
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
[snip]
FA08;CJK COMPATIBILITY IDEOGRAPH-FA08;Lo;0;L;884C;;;;N;;;;;
[snip]

This means that Unicode itself is not a one-to-one mapping. Apparently 
multiple characters have the same meanings... )))-:

I never imagined I would find something so braindamaged in Unicode but 
there you go!

Basically this means, at least for NTFS, but I think it is the same for all
file systems, that on directory lookups we have two options. Either we
search the directory by looking at EVERY directory entry, converting each
to the current NLS and comparing it for identity with the name being
searched for, or we use UTF-8, as that guarantees to preserve
back-and-forth mappings one-to-one (I believe).
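
The two lookup strategies can be compared in a small Python sketch. It is
a toy model under stated assumptions, not NTFS code: the directory is a
plain dict from Unicode name to inode number, the table holds only the two
characters from the example above, and every helper name (to_nls, from_nls,
lookup_by_scan, lookup_by_reverse, lookup_utf8) is hypothetical.

```python
# Toy table: two Unicode characters share one byte sequence; ASCII is itself.
TO_NLS = {"\u884c": b"\xd0\xd0", "\ufa08": b"\xd0\xd0"}
FROM_NLS = {b"\xd0\xd0": "\ufa08"}

def to_nls(name):
    return b"".join(TO_NLS[ch] if ch in TO_NLS else ch.encode("ascii")
                    for ch in name)

def from_nls(raw):
    out, i = [], 0
    while i < len(raw):
        width = 2 if raw[i] >= 0x80 else 1   # toy rule: high bit starts a pair
        chunk = raw[i:i + width]
        out.append(FROM_NLS[chunk] if width == 2 else chunk.decode("ascii"))
        i += width
    return "".join(out)

def lookup_by_reverse(directory, raw):
    """Convert the request back to Unicode, then do one exact lookup."""
    return directory.get(from_nls(raw))

def lookup_by_scan(directory, raw):
    """Convert EVERY entry to the NLS encoding and compare the bytes."""
    for name, inode in directory.items():
        if to_nls(name) == raw:
            return inode
    return None

def lookup_utf8(directory, raw):
    """UTF-8 decodes back to exactly the code points that were encoded."""
    return directory.get(raw.decode("utf-8"))

directory = {"\u884c.txt": 42}              # on-disk Unicode name -> inode
request = to_nls("\u884c.txt")              # what user space passes in

print(lookup_by_reverse(directory, request))                    # None
print(lookup_by_scan(directory, request))                       # 42
print(lookup_utf8(directory, "\u884c.txt".encode("utf-8")))     # 42
```

The scan finds the file but touches every entry; the reverse conversion is
a single lookup but lands on the wrong character; only the UTF-8 path is
both exact and cheap in this model.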

Doing a directory lookup where the whole directory is scanned linearly is
incredibly slow, and the overhead of having to convert every single
directory entry to compare it every time a lookup() happens is very large,
so I don't want to implement that on NTFS. If anyone complains to me
about their character translation not working properly, they will just hear
"use UTF-8 and it will work".

But you will have to find UTF-8 fonts and user space support code in order 
to see the correct output displayed. Otherwise you just see random 
characters in your filenames... But at least It Works(TM).

Best regards,

         Anton

-- 
   "I've not lost my mind. It's backed up on tape somewhere." - Unknown
-- 
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


* Fwd: NLS mappings for iso-8859-* encodings
@ 2002-05-07 16:13 Petr Vandrovec
  2002-05-07 22:08 ` Urban Widmark
  0 siblings, 1 reply; 4+ messages in thread
From: Petr Vandrovec @ 2002-05-07 16:13 UTC (permalink / raw)
  To: linux-kernel

Hi,
  I sent the message below to linux-fsdevel yesterday, but I received no
feedback. Meanwhile I also created a patch which makes the changes proposed
below (map 0x80-0x9F to Unicode 0x80-0x9F for ISO encodings).
The patch is available at http://platan.vc.cvut.cz/nls3.patch (39KB).

  If I do not receive any feedback, I plan to send it to Linus soon.
Currently, if you mount an NCP filesystem with accented characters
without proper iocharset/codepage options, you will not see filenames
with accented characters at all, as they will not pass through
char2uni of the default (iso8859-1) NLS (there was a warning printk,
but it was a way to DoS...).

  I do not want to do it the way SMB does (map unknown characters to
a :x## string), as it is not trivial to map them back. But if you
think that it is correct that some NLS tables contain characters
without Unicode equivalents...
					Thanks,
						Petr Vandrovec
						vandrove@vc.cvut.cz
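
Why the escape approach is hard to reverse can be seen in a short Python
sketch. It assumes the escape is a literal ":x" followed by two hex digits,
which is only a reading of the ":x##" notation above and not taken from the
smbfs source; escape_unmappable and unescape are hypothetical names.

```python
HEX = set("0123456789abcdefABCDEF")

def escape_unmappable(raw, mappable):
    """Replace each byte not in `mappable` with ':x' plus two hex digits."""
    return "".join(chr(b) if b in mappable else ":x%02x" % b for b in raw)

def unescape(name):
    """Naive inverse: turn every ':x##' back into a single byte."""
    out = bytearray()
    i = 0
    while i < len(name):
        digits = name[i + 2:i + 4]
        if name.startswith(":x", i) and len(digits) == 2 and set(digits) <= HEX:
            out.append(int(digits, 16))
            i += 4
        else:
            out.append(ord(name[i]))   # assumes only single-byte characters
            i += 1
    return bytes(out)

ASCII = set(range(0x20, 0x7F))          # pretend only printable ASCII maps

a = escape_unmappable(b"a\x85", ASCII)   # byte 0x85 gets escaped
b = escape_unmappable(b"a:x85", ASCII)   # a name that already looks escaped
print(a, b, a == b)
print(unescape(b))
```

Two different on-server names end up with the same displayed name, so the
reverse step cannot tell which one was meant unless the escape character
itself is also escaped.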

----- Forwarded (typos cleared) message -----

Resent-Message-Id: <200205071658.RAA26606@zikova.cvut.cz>
From: "Petr Vandrovec" <VANDROVE@vc.cvut.cz>
Organization:  CC CTU Prague
To: linux-fsdevel@vger.kernel.org
Subject:       NLS mappings for iso-8859-* encodings
X-Mailing-List: 	linux-fsdevel@vger.kernel.org

Hi,
  today it was pointed out to me (see Debian bug report #145654,
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=145654) that
all nls_iso8859-* mappings available in the kernel refuse to map
characters in the range 0x80-0x9F to anything reasonable.

  This behavior means that with the NLS default set to any
iso8859-* encoding (including the default iso-8859-1), filesystems
which contain data in the cp850/852/437 codepages will have bad
problems, as the majority of accented characters live in the 0x80-0x9F
range in these codepages.

  And worse, old 2.2.x kernels defaulted to a 1:1 mapping,
so people were used to seeing wrong accented characters, but at least
all filenames. Now they see nothing :-(

  Is there any reason why the 0x80-0x9F range is not mapped identically
to the 0x80-0x9F Unicode range? I believe that Unicode is even defined
as having its first 256 characters identical to iso8859-1.
                                                Thanks,
                                                    Petr Vandrovec
                                                    vandrove@vc.cvut.cz
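
The identity mapping asked about above can be illustrated with Python's
standard codecs. They are used here only as a stand-in and are not the
kernel's NLS tables: Python's latin-1 codec happens to map every byte to
the code point of the same value, which is the behavior the proposal wants
for the 0x80-0x9F range.

```python
# Bytes 0x80-0x9F as they could appear in a cp852 filename.
high = bytes(range(0x80, 0xA0))

# Under cp852 these bytes are printable characters, mostly accented letters.
as_cp852 = high.decode("cp852")

# Under an identity mapping the same bytes land on U+0080-U+009F.
# Wrong glyphs for a cp852 name, but nothing is dropped...
as_latin1 = high.decode("latin-1")
code_points = [ord(c) for c in as_latin1]

# ...and converting back returns the original bytes unchanged.
restored = as_latin1.encode("latin-1")

print(as_cp852)
print(code_points == list(range(0x80, 0xA0)), restored == high)
```

With a table that refuses these bytes there is no string to display at
all, which matches the "now they see nothing" symptom described above.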


----- End forwarded message -----


* Re: Fwd: NLS mappings for iso-8859-* encodings
  2002-05-07 16:13 Petr Vandrovec
@ 2002-05-07 22:08 ` Urban Widmark
  0 siblings, 0 replies; 4+ messages in thread
From: Urban Widmark @ 2002-05-07 22:08 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

On Tue, 7 May 2002, Petr Vandrovec wrote:

>   If I'll not receive any feedback, I plan to send it to Linus soon.
> Currently if you'll mount NCP filesystem with accented characters
> without proper iocharset/codepage options, you'll not see filenames
> with accented characters at all, as they will not pass through
> char2uni of default (iso8859-1) NLS (there was warning printk,
> but it was way to DoS...).

ncpfs should perhaps not use iso8859-x to read filenames in some cp*
encoding. The default nls you can specify is strange: is it the default
for chars on the filesystem or the default to use for display?

isofs uses it for display (and has no need for a second nls table).
smbfs uses it for display and has a second default for the remote chars.
ncpfs uses it as default for both display and remote.
vfat also uses it for both on-disk and display.

I think ncpfs should demand that the user sets two defaults and if that
isn't done no default translation is made (just do a memcpy in ncp__vol2io
and ncp__io2vol). That's what smbfs does anyway.
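
A rough Python sketch of that policy, for illustration only: vol2io and
io2vol here are stand-in functions named after the ncpfs helpers, not the
kernel code, and Python codec names are assumed in place of NLS tables.
When either charset is missing, the bytes are passed through untouched,
which is the memcpy case.

```python
def vol2io(name, vol_charset=None, io_charset=None):
    """Server-side name -> name shown to user space."""
    if vol_charset is None or io_charset is None:
        return bytes(name)                       # no defaults: plain copy
    return name.decode(vol_charset).encode(io_charset)

def io2vol(name, vol_charset=None, io_charset=None):
    """Name from user space -> name sent to the server."""
    if vol_charset is None or io_charset is None:
        return bytes(name)                       # no defaults: plain copy
    return name.decode(io_charset).encode(vol_charset)

on_server = "\u010d.txt".encode("cp852")         # a name stored as cp852

# No charsets given: the bytes go through unchanged in both directions.
print(vol2io(on_server) == on_server)

# Both charsets given: translate for display, then back for the server.
shown = vol2io(on_server, "cp852", "iso8859_2")
print(io2vol(shown, "cp852", "iso8859_2") == on_server)
```

The pass-through branch never loses a filename; whether it displays
correctly is then entirely up to the user's settings.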

In Unicode the 0x80-0x9F range does not contain any printable characters,
but the code points are defined. I know one table for iso8859-1 that lists
that part as being empty/undefined, but it's not an ISO document.

For someone setting their default to iso8859-1 that patch is probably ok,
but what happens when someone sets it to a variable length encoding? (sjis)

The other definition is that the linux side is always utf8 and that the
default therefore must be what the other end writes. I still haven't seen
any setup where (eg) X is configured to do that (with fonts and all) but
it was stated as the official encoding by a bearded senior member of this
list.

But if you have checked that you are not mapping two values to the same
thing (which would break the back-and-forth translation that smbfs does) I
don't see how that patch can harm anything.

/Urban
