public inbox for linux-kernel@vger.kernel.org
From: Clemens Ladisch <clemens@ladisch.de>
To: Alan Stern <stern@rowland.harvard.edu>
Cc: Kernel development list <linux-kernel@vger.kernel.org>,
	USB list <linux-usb@vger.kernel.org>
Subject: Re: NLS: utf8 conversions
Date: Mon, 27 Apr 2009 10:09:19 +0200
Message-ID: <49F5682F.20300@ladisch.de>
In-Reply-To: <Pine.LNX.4.44L0.0904241534110.4531-100000@iolanthe.rowland.org>

Alan Stern wrote:
> Although nobody seems to have made a big deal about it, the conversions
> between utf8 and utf16 done by fs/nls/nls_base.c are wrong in a couple
> of important respects:
> 
> 	They don't handle Unicode code points larger than U+FFFF.
> 
> 	They don't detect invalid values, in particular, surrogate
> 	code points.
> 
> The problems stem from the fact the characters at issue can't be
> represented by a single 16-bit wchar_t.  But that's no excuse for
> performing an incorrect conversion to or from utf16.
> 
> Are there any definite thoughts on how this should be handled?  I don't 
> see any way for the single-character conversion routines (utf8_mbtowc 
> and utf8_wctomb) to come to grips with these issues, except perhaps for 
> returning an error when a character would be invalid or too big to fit 
> in 16 bits.
> 
> The string-oriented routines (utf8_mbstowcs and utf8_wcstombs) could be 
> adapted to deal with these issues properly.
> 
> Any comments or suggestions for other approaches?

The single-character utf8_* routines in nls_base.c are just special
cases of the NLS API for the UTF-8 encoding; the string-oriented
routines, as far as I can see, are actually only used to do conversions
between UTF-8 and UTF-16, not wchar_t, so they probably should be
renamed.

As for the NLS API itself: If we want to be able to handle code points
larger than U+FFFF, the obvious answer is to make wchar_t a 32-bit type.
This should not be too large a problem because the FS NLS API is
designed so that wchar_t is only used for temporary values, i.e.,
characters are converted from some on-disk encoding to wchar_t, then
from wchar_t to some I/O encoding (usually UTF-8); and the conversions
are done one code point at a time.
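To illustrate what a one-code-point-at-a-time conversion into a 32-bit
value could look like, here is a minimal user-space sketch.  It is not
the nls_base.c code: the name utf8_to_cp and its signature are made up
for this example, and uint32_t stands in for a widened wchar_t.  It
accepts code points above U+FFFF and rejects surrogates, overlong forms,
and values beyond U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): decode one UTF-8 sequence
 * into a 32-bit code point.  Returns the number of bytes consumed,
 * or -1 for truncated, overlong, surrogate, or out-of-range input.
 */
static int utf8_to_cp(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* Smallest code point that legitimately needs 1..4 bytes. */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	int n, i;

	if (len == 0)
		return -1;

	if (s[0] < 0x80) {
		c = s[0];
		n = 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		c = s[0] & 0x1F;
		n = 2;
	} else if ((s[0] & 0xF0) == 0xE0) {
		c = s[0] & 0x0F;
		n = 3;
	} else if ((s[0] & 0xF8) == 0xF0) {
		c = s[0] & 0x07;
		n = 4;
	} else {
		return -1;	/* stray continuation or invalid lead byte */
	}

	if (len < (size_t)n)
		return -1;	/* sequence is cut short */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return -1;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min_cp[n])
		return -1;	/* overlong encoding */
	if (c >= 0xD800 && c <= 0xDFFF)
		return -1;	/* surrogate code point */
	if (c > 0x10FFFF)
		return -1;	/* beyond the Unicode range */

	*cp = c;
	return n;
}
```

With a 16-bit result type, the four-byte case could only be reported as
an error; with a 32-bit type it becomes an ordinary value.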

The file systems that use some form of UTF-16 (VFAT, NTFS, CIFS, UDF,
etc.) use the NLS API in a different way: they treat the individual
UTF-16 values as wchar_t values and do only the conversion from wchar_t
to the I/O encoding.  Here we'd need to introduce an additional
conversion step between UTF-16 and wchar_t, i.e., treat UTF-16 like any
other multibyte encoding.
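
The additional step could be sketched roughly as follows.  Again this is
only an illustration under stated assumptions: utf16_to_cp is a
hypothetical name, the input is assumed to be already in host byte order
(on-disk formats would need an endianness conversion first), and
uint32_t stands in for a 32-bit wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): decode one code point from
 * host-endian UTF-16.  Returns the number of 16-bit units consumed
 * (1 or 2), or -1 for an unpaired surrogate or empty input.
 */
static int utf16_to_cp(const uint16_t *s, size_t len, uint32_t *cp)
{
	if (len == 0)
		return -1;

	/* Anything outside the surrogate range is a code point by itself. */
	if (s[0] < 0xD800 || s[0] > 0xDFFF) {
		*cp = s[0];
		return 1;
	}

	/* A low surrogate must not come first. */
	if (s[0] >= 0xDC00)
		return -1;

	/* A high surrogate must be followed by a low surrogate. */
	if (len < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF)
		return -1;

	*cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) |
			 (uint32_t)(s[1] - 0xDC00));
	return 2;
}
```

The result would then be passed to the existing wchar_t-to-I/O-encoding
step, exactly as for the other on-disk encodings.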


Best regards,
Clemens

Thread overview: 5+ messages
2009-04-24 20:02 NLS: utf8 conversions Alan Stern
2009-04-27  8:09 ` Clemens Ladisch [this message]
2009-04-27 15:51   ` Alan Stern
2009-04-28  6:51     ` Clemens Ladisch
2009-04-28 15:51       ` Alan Stern
