From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=38431 helo=eggs.gnu.org)
  by lists.gnu.org with esmtp (Exim 4.43) id 1Pq4VN-0003Qw-SS
  for qemu-devel@nongnu.org; Thu, 17 Feb 2011 09:06:51 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
  (envelope-from ) id 1Pq4VM-0001i1-3a
  for qemu-devel@nongnu.org; Thu, 17 Feb 2011 09:06:49 -0500
Received: from mx1.redhat.com ([209.132.183.28]:49606)
  by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Pq4VL-0001ht-Ra
  for qemu-devel@nongnu.org; Thu, 17 Feb 2011 09:06:48 -0500
Message-ID: <4D5D2B71.9090201@redhat.com>
Date: Thu, 17 Feb 2011 16:06:41 +0200
From: Avi Kivity
MIME-Version: 1.0
Subject: Re: [Qemu-devel] KVM call minutes for Feb 15
References: <20110215162629.GN21720@x200.localdomain>
  <4D5B0889.4030303@codemonkey.ws> <4D5BA5E9.90307@redhat.com>
  <4D5BD259.3080804@codemonkey.ws> <4D5CE9AB.2030503@redhat.com>
  <4D5D10C1.9010209@codemonkey.ws> <4D5D133F.4050801@redhat.com>
  <4D5D1E54.1070704@codemonkey.ws> <4D5D21C1.80009@redhat.com>
  <4D5D2496.8030900@codemonkey.ws>
In-Reply-To: <4D5D2496.8030900@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Anthony Liguori
Cc: Chris Wright, qemu-devel@nongnu.org, kvm@vger.kernel.org

On 02/17/2011 03:37 PM, Anthony Liguori wrote:
> On 02/17/2011 07:25 AM, Avi Kivity wrote:
>> On 02/17/2011 03:10 PM, Anthony Liguori wrote:
>>> On 02/17/2011 06:23 AM, Avi Kivity wrote:
>>>> On 02/17/2011 02:12 PM, Anthony Liguori wrote:
>>>>>> (btw what happens in a non-UTF-8 locale? I guess we should just
>>>>>> reject unencodable strings).
>>>>>
>>>>>
>>>>> While QEMU is mostly ASCII internally, for the purposes of the
>>>>> JSON parser, we always encode and decode UTF-8. We reject invalid
>>>>> UTF-8 sequences.
>>>>> But since JSON is string-encoded unicode, we can
>>>>> always decode a JSON string to valid UTF-8 as long as the string
>>>>> is well formed.
>>>>
>>>> That is wrong. If the user passes a Unicode filename it is
>>>> expected to be translated to the current locale encoding for the
>>>> purpose of, say, filename lookup.
>>>
>>> QEMU does not support anything but UTF-8.
>>
>> Since when?
>>
>> AFAICT, JSON string conversion is the only place where there is any
>> dependency on UTF-8. Anything else should just work.
>>
>>>
>>> That's pretty common with Unix software. I don't think any modern
>>> Unix platform actually uses UCS2 or UTF-16. It's either ascii or
>>> UTF-8.
>>
>> Most/all Linux distributions support UTF-8 as well as a zillion other
>> encodings (single-byte ASCII + another charset, or multi-byte
>> charsets for languages with many characters.
>
> Maybe there's some confusion here. UTF-8 is an encoding, not a locale.
>
> The common encodings are ASCII, UTF-8, UCS2, UTF-16, and UTF-32.

ASCII is a character set and encoding. The rest are encodings for
Unicode. There are lots of other encodings, say latin-1.

>
> An application has to explicitly support an encoding. It is not
> transparent.

It is fully transparent until you do wire conversions (like we do with
qmp which is explicitly UTF-8).

> UCS2/UTF-16 means that strings are not 'const char *'s but 'const
> wchar_t *' where typedef unsigned short wchar_t;.
>
> QEMU assumes, in lots of places that strings are single-byte NUL
> terminated. Basically, any use of snprintf, printf, strcpy, strlen,
> etc. pretty much tie you to ASCII/UTF-8. You can have a single NUL
> byte as part of a valid UCS2 string.

We're tied to single- or multiple- byte encodings, and can't do
wchar_t. But that's very different from ASCII/UTF-8 only.

>
>>> The only place it even matters is Windows and Windows has ASCII and
>>> UTF-16 versions of their APIs.
>>> So on Windows, non-ASCII characters
>>> won't be handled correctly (yet another one of the many issues with
>>> Windows support in QEMU). UTF-8 is self-recovering though so it
>>> degrades gracefully.
>>
>> It matters on Linux with el_GR.iso88597, for example.
>
> The whole series of iso8859 (8-bit encodings) are officially abandoned
> in favor of UCS and encodings that support the full UCS code page
> (UTF-8/UTF-16).
>
> I see no strong reason to try and support deprecated encodings when
> there are perfectly valid replacements like el_GR.utf8.

All it takes is a call to iconv(3). I agree it's unlikely to happen in
practice.

-- 
error compiling committee.c: too many arguments to function
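
For illustration only, here is a minimal sketch of what the iconv(3) call mentioned above could look like. It is not QEMU code: utf8_to_encoding() and utf8_to_locale() are hypothetical helper names, and the sketch assumes a POSIX system with iconv_open()/iconv() and nl_langinfo(CODESET). It converts a UTF-8 string (as decoded from JSON) to a target encoding and rejects strings that are invalid or cannot be represented, which is the "reject unencodable strings" behaviour raised earlier in the thread.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of QEMU): convert the NUL-terminated UTF-8
 * string `in` to the encoding named `to`, storing a NUL-terminated result
 * in `out`, which has room for `outsz` bytes.
 *
 * Returns 0 on success, or -1 if the input is not valid UTF-8, cannot be
 * represented exactly in `to`, or does not fit in `out`.
 */
int utf8_to_encoding(const char *to, const char *in, char *out, size_t outsz)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft;
    size_t rc;

    if (outsz == 0) {
        return -1;
    }
    outleft = outsz - 1;         /* keep one byte for the terminating NUL */

    cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1) {
        return -1;               /* conversion not supported on this system */
    }

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc == 0) {
        /* Emit any closing shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }
    iconv_close(cd);

    /*
     * (size_t)-1 signals an error; a positive count means some characters
     * were converted irreversibly. Both are treated as "unencodable".
     */
    if (rc != 0) {
        return -1;
    }
    *outp = '\0';
    return 0;
}

/*
 * Hypothetical wrapper: convert to the codeset of the current locale,
 * e.g. ISO-8859-7 under el_GR.iso88597. The caller is assumed to have
 * run setlocale(LC_ALL, "") beforehand.
 */
int utf8_to_locale(const char *in, char *out, size_t outsz)
{
    return utf8_to_encoding(nl_langinfo(CODESET), in, out, outsz);
}
```

A caller would pass the UTF-8 filename from the wire to utf8_to_locale() before handing it to open(2), and report an error to the client when it returns -1. Which encodings iconv_open() accepts depends on the C library and the installed conversion modules.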