Re: NLS: utf8 conversions - Clemens Ladisch

From: Clemens Ladisch <clemens@ladisch.de>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Kernel development list <linux-kernel@vger.kernel.org>,
	USB list <linux-usb@vger.kernel.org>
Subject: Re: NLS: utf8 conversions
Date: Mon, 27 Apr 2009 10:09:19 +0200
Message-ID: <49F5682F.20300@ladisch.de>
In-Reply-To: <Pine.LNX.4.44L0.0904241534110.4531-100000@iolanthe.rowland.org>

Alan Stern wrote:
> Although nobody seems to have made a big deal about it, the conversions
> between utf8 and utf16 done by fs/nls/nls_base.c are wrong in a couple
> of important respects:
> 
> 	They don't handle Unicode code points larger than U+FFFF.
> 
> 	They don't detect invalid values, in particular, surrogate
> 	code points.
> 
> The problems stem from the fact the characters at issue can't be
> represented by a single 16-bit wchar_t.  But that's no excuse for
> performing an incorrect conversion to or from utf16.
> 
> Are there any definite thoughts on how this should be handled?  I don't 
> see any way for the single-character conversion routines (utf8_mbtowc 
> and utf8_wctomb) to come to grips with these issues, except perhaps for 
> returning an error when a character would be invalid or too big to fit 
> in 16 bits.
> 
> The string-oriented routines (utf8_mbstowcs and utf8_wcstombs) could be 
> adapted to deal with these issues properly.
> 
> Any comments or suggestions for other approaches?

The single-character utf8_* routines in nls_base.c are just special
cases of the NLS API for the UTF-8 encoding; the string-oriented
routines, as far as I can see, are actually only used to do conversions
between UTF-8 and UTF-16, not wchar_t, so they probably should be
renamed.

As for the NLS API itself: If we want to be able to handle code points
larger than U+FFFF, the obvious answer is to make wchar_t a 32-bit type.
This should not be too large a problem because the FS NLS API is
designed so that wchar_t is only used for temporary values, i.e.,
characters are converted from some on-disk encoding to wchar_t, then
from wchar_t to some I/O encoding (usually UTF-8); and the conversions
are done one code point at a time.
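
To illustrate the per-code-point step with a 32-bit value, here is a
minimal user-space sketch, not the nls_base.c code.  The name
sketch_utf32_to_utf8 and its signature are made up for this example.
It assumes the temporary value is a uint32_t and rejects the two cases
raised above: surrogate code points and values beyond U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or -1 if the code
 * point is invalid or the output buffer is too small.
 */
static int sketch_utf32_to_utf8(uint32_t cp, unsigned char *out, size_t outlen)
{
	/* Surrogates are not valid scalar values. */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return -1;
	/* Nothing above U+10FFFF is a valid code point. */
	if (cp > 0x10FFFF)
		return -1;

	if (cp < 0x80) {
		if (outlen < 1)
			return -1;
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		if (outlen < 2)
			return -1;
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (outlen < 3)
			return -1;
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	/* U+10000..U+10FFFF: this range does not fit a 16-bit wchar_t. */
	if (outlen < 4)
		return -1;
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

With a 16-bit temporary, the last branch could never be reached, which
is why widening the type is the simplest way to cover the whole range.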

The file systems that use some form of UTF-16 (VFAT, NTFS, CIFS, UDF,
etc.) use the NLS API in a different way: they treat the individual
UTF-16 values as wchar_t values and do only the conversion from wchar_t
to the I/O encoding.  Here we'd need to introduce an additional
conversion step between UTF-16 and wchar_t, i.e., treat UTF-16 like any
other multibyte encoding.
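
A rough sketch of that additional step, again in user-space C and with
a made-up name (sketch_utf16_to_utf32), might look like the following.
It assumes the 16-bit units have already been converted to host byte
order and that the result is stored in a uint32_t.  Unpaired
surrogates are reported as errors instead of being passed through.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one code point from UTF-16.
 * Returns the number of 16-bit units consumed (1 or 2), or -1 on
 * empty input, a lone surrogate, or a truncated surrogate pair.
 */
static int sketch_utf16_to_utf32(const uint16_t *in, size_t inlen, uint32_t *cp)
{
	uint16_t hi, lo;

	if (inlen < 1)
		return -1;

	hi = in[0];
	/* Anything outside the surrogate range maps to itself. */
	if (hi < 0xD800 || hi > 0xDFFF) {
		*cp = hi;
		return 1;
	}
	/* A low surrogate must not come first. */
	if (hi > 0xDBFF)
		return -1;
	/* A high surrogate needs a following unit. */
	if (inlen < 2)
		return -1;

	lo = in[1];
	if (lo < 0xDC00 || lo > 0xDFFF)
		return -1;

	/* Combine the two 10-bit halves and add the 0x10000 offset. */
	*cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
			 (uint32_t)(lo - 0xDC00));
	return 2;
}
```

The caller would advance its input pointer by the return value, the
same way it would for any other multibyte encoding.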

Best regards,
Clemens
