From: Eric Blake
Date: Tue, 20 Nov 2018 16:01:00 -0600
Subject: Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
To: John Snow, qemu-devel@nongnu.org
Cc: qemu-trivial@nongnu.org, Markus Armbruster
Message-ID: <1671d82c-da21-de1e-58c4-dd22696f9a62@redhat.com>
References: <20181120203628.2367003-1-eblake@redhat.com>

[adding Markus in CC, since git didn't do it automatically from the
'Reported-by']

On 11/20/18 3:28 PM, John Snow wrote:
>
>
> On 11/20/18 3:36 PM, Eric Blake wrote:
>> While most developers are now using UTF-8 environments, it's
>> harder to guarantee that error messages will be output to
>> a multibyte locale. Rather than risking error messages that
>> get corrupted into mojibake when the user runs qemu in a
>> non-multibyte locale, let's stick to straight ASCII error
>> messages, rather than assuming that our use of UTF-8 in source
>> code string constants will work unchanged in other locales.
>>
>> Reported-by: Markus Armbruster
>> Signed-off-by: Eric Blake
>> ---
>>  hw/misc/tmp105.c | 2 +-
>>  hw/misc/tmp421.c | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> Do we have any policy in place to prohibit this in the future?
> (Presumably a policy that is automatic and won't interfere with QEMU
> localization efforts which may rightly attempt to use UTF-8 for those
> locales.)

Not that I know of.

>
> Do you have a script or trick to find utf-8 containing strings in our
> source?

Markus found these two, probably by reading over a list resulting from
his claim of finding 217 out of 6455 files (53 of them binary, which
don't count):

https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html

My quick and dirty attempt, which does not quite reproduce his numbers:

$ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
    279     279    7490

Thus, by forcing a unibyte locale (where encoding errors are impossible)
with sane range expressions (POSIX says only the C locale is required to
interpret regex ranges according to byte value - all bets are off in
other locales) and using $'' to type non-UTF-8 bytes into my search, I
found 279 files with at least one byte outside of ASCII. But the use of
-l has no easy way to filter which of those files are binary; while
dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to
scroll through, especially considering there ARE binary files in the
mix.

Narrowing the search to a more specific pattern:

$ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
    129     685    8808

is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc
(false positive hits, due to interesting? comments), in po/ (which
doesn't count), or in scripts/ for python.

And the proof for THIS patch:

$ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
origin:hw/misc/tmp105.c
origin:hw/misc/tmp421.c

>
> Only curious, don't hold this patch up on my account. I'm not raising a
> challenge.

Maybe checkpatch.pl could be taught to do a similar check?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
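
For anyone who wants to experiment with an automated check along the
lines of the grep above, here is a minimal Python sketch. It is not
existing QEMU tooling and not how checkpatch.pl works; the function name
find_non_ascii_strings and the "contains a NUL byte means binary"
heuristic are assumptions made for illustration. It reads files as raw
bytes, so like LC_ALL=C it cannot hit encoding errors, and it applies
the same per-line pattern as the git grep command: a double quote, any
bytes, one byte in the range 0x80-0xff, any bytes, a double quote.

```python
import re
import sys

# Same idea as the grep pattern ".*[\x80-\xff].*" applied per line.
# Working on bytes means byte values are compared directly, which is
# what forcing the C locale achieves for grep.
NON_ASCII_IN_QUOTES = re.compile(rb'".*[\x80-\xff].*"')


def find_non_ascii_strings(data: bytes):
    """Return (line_number, line) pairs for lines that have a byte
    outside ASCII somewhere between two double quotes.

    Hypothetical helper for illustration only. Like the grep it
    mirrors, it is a heuristic: it does not parse C, so escaped quotes
    or text sitting between two separate literals can produce false
    positives.
    """
    # Assumption: treat anything containing a NUL byte as binary.
    if b"\0" in data:
        return []
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        if NON_ASCII_IN_QUOTES.search(line):
            hits.append((lineno, line))
    return hits


if __name__ == "__main__":
    # Usage: python3 this_script.py file1.c file2.h ...
    status = 0
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            for lineno, line in find_non_ascii_strings(f.read()):
                status = 1
                text = line.decode("utf-8", "replace")
                print(f"{path}:{lineno}: {text}")
    sys.exit(status)
```

Run over the output of git ls-files '*.[ch]', this should report roughly
the same lines as the last grep above, with binary files skipped instead
of filtered out afterwards. Comments containing non-ASCII text are left
alone unless the same line also has a pair of double quotes around them.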