From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [Qemu-devel] KVM call minutes for Feb 15
Date: Thu, 17 Feb 2011 15:25:21 +0200
Message-ID: <4D5D21C1.80009@redhat.com>
References: <20110215162629.GN21720@x200.localdomain> <4D5B0889.4030303@codemonkey.ws> <4D5BA5E9.90307@redhat.com> <4D5BD259.3080804@codemonkey.ws> <4D5CE9AB.2030503@redhat.com> <4D5D10C1.9010209@codemonkey.ws> <4D5D133F.4050801@redhat.com> <4D5D1E54.1070704@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Chris Wright , qemu-devel@nongnu.org, kvm@vger.kernel.org
To: Anthony Liguori
Return-path: 
Received: from mx1.redhat.com ([209.132.183.28]:39687 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751824Ab1BQNZa (ORCPT ); Thu, 17 Feb 2011 08:25:30 -0500
In-Reply-To: <4D5D1E54.1070704@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID: 

On 02/17/2011 03:10 PM, Anthony Liguori wrote:
> On 02/17/2011 06:23 AM, Avi Kivity wrote:
>> On 02/17/2011 02:12 PM, Anthony Liguori wrote:
>>>> (btw what happens in a non-UTF-8 locale? I guess we should just
>>>> reject unencodable strings).
>>>
>>> While QEMU is mostly ASCII internally, for the purposes of the JSON
>>> parser, we always encode and decode UTF-8. We reject invalid UTF-8
>>> sequences. But since JSON is string-encoded unicode, we can always
>>> decode a JSON string to valid UTF-8 as long as the string is well
>>> formed.
>>
>> That is wrong. If the user passes a Unicode filename it is expected
>> to be translated to the current locale encoding for the purpose of,
>> say, filename lookup.
>
> QEMU does not support anything but UTF-8.

Since when? AFAICT, JSON string conversion is the only place where there
is any dependency on UTF-8. Anything else should just work.

> That's pretty common with Unix software. I don't think any modern
> Unix platform actually uses UCS2 or UTF-16. It's either ascii or
> UTF-8.
Most (if not all) Linux distributions support UTF-8 as well as a zillion
other encodings (single-byte ASCII plus another charset, or multi-byte
charsets for languages with many characters).

> The only place it even matters is Windows and Windows has ASCII and
> UTF-16 versions of their APIs. So on Windows, non-ASCII characters
> won't be handled correctly (yet another one of the many issues with
> Windows support in QEMU). UTF-8 is self-recovering though so it
> degrades gracefully.

It matters on Linux with el_GR.iso88597, for example. If you take a JSON
string and blindly translate it to UTF-8, you'll get garbage when you
feed the result to system calls.

Practically everyone uses UTF-8 these days, so the impact is minimal,
but it is more correct (as well as simpler) to ask the system libraries
to encode using the current locale.

-- 
error compiling committee.c: too many arguments to function