From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=49766 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Pq43n-0007e9-Rv
	for qemu-devel@nongnu.org; Thu, 17 Feb 2011 08:38:24 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Pq43m-0003Kq-OZ
	for qemu-devel@nongnu.org; Thu, 17 Feb 2011 08:38:19 -0500
Received: from mail-qy0-f173.google.com ([209.85.216.173]:48334)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <anthony@codemonkey.ws>) id 1Pq43m-0003II-HH
	for qemu-devel@nongnu.org; Thu, 17 Feb 2011 08:38:18 -0500
Received: by mail-qy0-f173.google.com with SMTP id 38so5067612qyl.4
	for <qemu-devel@nongnu.org>; Thu, 17 Feb 2011 05:38:18 -0800 (PST)
Message-ID: <4D5D24B2.30500@codemonkey.ws>
Date: Thu, 17 Feb 2011 07:37:54 -0600
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] KVM call minutes for Feb 15
References: <20110215162629.GN21720@x200.localdomain>	<4D5B0889.4030303@codemonkey.ws>	<4D5BA5E9.90307@redhat.com>	<4D5BD259.3080804@codemonkey.ws>	<4D5CE9AB.2030503@redhat.com>	<4D5D10C1.9010209@codemonkey.ws>	<4D5D133F.4050801@redhat.com>
	<4D5D1E54.1070704@codemonkey.ws> <4D5D21C1.80009@redhat.com>
In-Reply-To: <4D5D21C1.80009@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Avi Kivity <avi@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>, qemu-devel@nongnu.org, kvm@vger.kernel.org

On 02/17/2011 07:25 AM, Avi Kivity wrote:
> On 02/17/2011 03:10 PM, Anthony Liguori wrote:
>> On 02/17/2011 06:23 AM, Avi Kivity wrote:
>>> On 02/17/2011 02:12 PM, Anthony Liguori wrote:
>>>>> (btw what happens in a non-UTF-8 locale? I guess we should just 
>>>>> reject unencodable strings).
>>>>
>>>>
>>>> While QEMU is mostly ASCII internally, for the purposes of the JSON 
>>>> parser, we always encode and decode UTF-8.  We reject invalid UTF-8 
>>>> sequences.  But since JSON is string-encoded unicode, we can always 
>>>> decode a JSON string to valid UTF-8 as long as the string is well 
>>>> formed.
>>>
>>> That is wrong.  If the user passes a Unicode filename it is expected 
>>> to be translated to the current locale encoding for the purpose of, 
>>> say, filename lookup.
>>
>> QEMU does not support anything but UTF-8.
>
> Since when?
>
> AFAICT, JSON string conversion is the only place where there is any 
> dependency on UTF-8.  Anything else should just work.
>
>>
>> That's pretty common with Unix software.  I don't think any modern 
>> Unix platform actually uses UCS2 or UTF-16.  It's either ascii or UTF-8.
>
> Most/all Linux distributions support UTF-8 as well as a zillion other 
> encodings (single-byte ASCII + another charset, or multi-byte charsets 
> for languages with many characters.

An application has to explicitly support an encoding.  It is not 
transparent.  UCS2/UTF-16 means that strings are not 'const char *' but 
'const wchar_t *', where wchar_t is a 16-bit type (typedef unsigned short 
wchar_t;, as on Windows).

QEMU assumes in lots of places that strings are single-byte and NUL 
terminated.  Basically, any use of snprintf, printf, strcpy, strlen, 
etc. pretty much ties you to ASCII/UTF-8, because a valid UCS2 string can 
contain a single NUL byte as part of a character.
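
To make the NUL byte point concrete, here is a minimal sketch in C.  It is 
not taken from the QEMU tree; the buffer and function names are made up 
for illustration, and the UCS2 data is assumed to be little-endian.  It 
compares what a char-based function like strlen sees with what a scan for 
a 16-bit terminator sees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "QEMU" as ASCII/UTF-8: one byte per character, one trailing NUL byte. */
static const char utf8_name[] = "QEMU";

/*
 * The same text as UCS2 code units, written out as raw little-endian
 * bytes (an assumption for this example).  Every ASCII character
 * carries a 0x00 high byte.
 */
static const unsigned char ucs2le_name[] = {
    'Q', 0x00, 'E', 0x00, 'M', 0x00, 'U', 0x00,
    0x00, 0x00 /* 16-bit terminator */
};

/* Length in bytes as seen by the char-based C string functions. */
static size_t byte_string_length(const void *buf)
{
    assert(buf != NULL);
    return strlen((const char *)buf);
}

/* Length in 16-bit code units, scanning for a 16-bit zero terminator. */
static size_t ucs2le_length(const unsigned char *buf)
{
    size_t n = 0;

    assert(buf != NULL);
    while (buf[2 * n] != 0 || buf[2 * n + 1] != 0) {
        n++;
    }
    return n;
}
```

With these buffers, byte_string_length(utf8_name) is 4, while 
byte_string_length(ucs2le_name) stops at the high byte of 'Q' and 
returns 1.  Only ucs2le_length(ucs2le_name) recovers all 4 characters, 
which is why char-based string code cannot be reused for UCS2 as is.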

>> The only place it even matters is Windows and Windows has ASCII and 
>> UTF-16 versions of their APIs.  So on Windows, non-ASCII characters 
>> won't be handled correctly (yet another one of the many issues with 
>> Windows support in QEMU).  UTF-8 is self-recovering though so it 
>> degrades gracefully.
>
> It matters on Linux with el_GR.iso88597, for example.

The whole iso8859 series of 8-bit encodings is officially abandoned 
in favor of UCS and the encodings that cover the full UCS repertoire 
(UTF-8/UTF-16).

I see no strong reason to try and support deprecated encodings when 
there are perfectly valid replacements like el_GR.utf8.

Regards,

Anthony Liguori