* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 20:36 [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages Eric Blake
@ 2018-11-20 21:28 ` John Snow
2018-11-20 22:01 ` Eric Blake
2018-11-21 6:03 ` Thomas Huth
2018-11-22 12:27 ` [Qemu-devel] [Qemu-trivial] " Laurent Vivier
2 siblings, 1 reply; 8+ messages in thread
From: John Snow @ 2018-11-20 21:28 UTC (permalink / raw)
To: Eric Blake, qemu-devel; +Cc: qemu-trivial
On 11/20/18 3:36 PM, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
>
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> hw/misc/tmp105.c | 2 +-
> hw/misc/tmp421.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
> return;
> }
> if (temp >= 128000 || temp < -128000) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
> }
>
> if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
>
Do we have any policy in place to prohibit this in the future?
(Presumably a policy that is automatic and won't interfere with QEMU
localization efforts which may rightly attempt to use UTF-8 for those
locales.)
Do you have a script or trick to find utf-8 containing strings in our
source?
Only curious, don't hold this patch up on my account. I'm not raising a
challenge.
--js
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 21:28 ` John Snow
@ 2018-11-20 22:01 ` Eric Blake
2018-11-20 22:43 ` John Snow
2018-11-21 11:39 ` Philippe Mathieu-Daudé
0 siblings, 2 replies; 8+ messages in thread
From: Eric Blake @ 2018-11-20 22:01 UTC (permalink / raw)
To: John Snow, qemu-devel; +Cc: qemu-trivial, Markus Armbruster
[adding Markus in CC, since git didn't do it automatically from the
'Reported-by']
On 11/20/18 3:28 PM, John Snow wrote:
>
>
> On 11/20/18 3:36 PM, Eric Blake wrote:
>> While most developers are now using UTF-8 environments, it's
>> harder to guarantee that error messages will be output to
>> a multibyte locale. Rather than risking error messages that
>> get corrupted into mojibake when the user runs qemu in a
>> non-multibyte locale, let's stick to straight ASCII error
>> messages, rather than assuming that our use of UTF-8 in source
>> code string constants will work unchanged in other locales.
>>
>> Reported-by: Markus Armbruster <armbru@redhat.com>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>> hw/misc/tmp105.c | 2 +-
>> hw/misc/tmp421.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> Do we have any policy in place to prohibit this in the future?
> (Presumably a policy that is automatic and won't interfere with QEMU
> localization efforts which may rightly attempt to use UTF-8 for those
> locales.)
Not that I know of.
>
> Do you have a script or trick to find utf-8 containing strings in our
> source?
Markus found these two, probably by reading over a list resulting from
his claim of finding 217 out of 6455 files (53 of them binary, which
don't count):
https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html
My quick and dirty attempt, which does not quite reproduce his numbers:
$ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
279 279 7490
Thus, by forcing a unibyte locale (where encoding errors are impossible)
with sane range expressions (POSIX says only the C locale is required to
interpret regex ranges according to byte value - all bets are off in
other locales) and using $'' to type non-UTF-8 bytes into my search, I
found 279 files with at least one byte outside of ASCII. But the use of
-l has no easy way to filter which of those files are binary; while
dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to
scroll through, especially considering there ARE binary files in the mix.
Narrowing the search to a more specific pattern:
$ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
129 685 8808
is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc
(false positive hits, due to interesting? comments), in po/ (which
doesn't count), or in scripts/ for python. And the proof for THIS patch:
$ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
origin:hw/misc/tmp105.c
origin:hw/misc/tmp421.c
>
> Only curious, don't hold this patch up on my account. I'm not raising a
> challenge.
Maybe checkpatch.pl could be taught to do a similar check?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 22:01 ` Eric Blake
@ 2018-11-20 22:43 ` John Snow
2018-11-21 11:39 ` Philippe Mathieu-Daudé
1 sibling, 0 replies; 8+ messages in thread
From: John Snow @ 2018-11-20 22:43 UTC (permalink / raw)
To: Eric Blake, qemu-devel; +Cc: qemu-trivial, Markus Armbruster
On 11/20/18 5:01 PM, Eric Blake wrote:
> [adding Markus in CC, since git didn't do it automatically from the
> 'Reported-by']
>
> On 11/20/18 3:28 PM, John Snow wrote:
>>
>>
>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>> While most developers are now using UTF-8 environments, it's
>>> harder to guarantee that error messages will be output to
>>> a multibyte locale. Rather than risking error messages that
>>> get corrupted into mojibake when the user runs qemu in a
>>> non-multibyte locale, let's stick to straight ASCII error
>>> messages, rather than assuming that our use of UTF-8 in source
>>> code string constants will work unchanged in other locales.
>>>
>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>> hw/misc/tmp105.c | 2 +-
>>> hw/misc/tmp421.c | 2 +-
>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>>
>> Do we have any policy in place to prohibit this in the future?
>> (Presumably a policy that is automatic and won't interfere with QEMU
>> localization efforts which may rightly attempt to use UTF-8 for those
>> locales.)
>
> Not that I know of.
>
>>
>> Do you have a script or trick to find utf-8 containing strings in our
>> source?
>
> Markus found these two, probably by reading over a list resulting from
> his claim of finding 217 out of 6455 files (53 of them binary, which
> don't count):
> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html
>
> My quick and dirty attempt, which does not quite reproduce his numbers:
>
> $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
> 279 279 7490
>
> Thus, by forcing a unibyte locale (where encoding errors are impossible)
> with sane range expressions (POSIX says only the C locale is required to
> interpret regex ranges according to byte value - all bets are off in
> other locales) and using $'' to type non-UTF-8 bytes into my search, I
> found 279 files with at least one byte outside of ASCII. But the use of
> -l has no easy way to filter which of those files are binary; while
> dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to
> scroll through, especially considering there ARE binary files in the mix.
>
> Narrowing the search to a more specific pattern:
>
> $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
> 129 685 8808
>
> is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc
> (false positive hits, due to interesting? comments), in po/ (which
> doesn't count), or in scripts/ for python. And the proof for THIS patch:
>
> $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
> origin:hw/misc/tmp105.c
> origin:hw/misc/tmp421.c
>
Aha!
Thanks. I was surprised we've managed to introduce only two cases of
this so far. Your patch seems complete in this case.
Since I took the time to read along, I may as well:
Reviewed-by: John Snow <jsnow@redhat.com>
>>
>> Only curious, don't hold this patch up on my account. I'm not raising a
>> challenge.
>
> Maybe checkpatch.pl could be taught to do a similar check?
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 22:01 ` Eric Blake
2018-11-20 22:43 ` John Snow
@ 2018-11-21 11:39 ` Philippe Mathieu-Daudé
2018-11-21 16:20 ` Markus Armbruster
1 sibling, 1 reply; 8+ messages in thread
From: Philippe Mathieu-Daudé @ 2018-11-21 11:39 UTC (permalink / raw)
To: Eric Blake, John Snow, qemu-devel; +Cc: qemu-trivial, Markus Armbruster
On 20/11/18 23:01, Eric Blake wrote:
> [adding Markus in CC, since git didn't do it automatically from the
> 'Reported-by']
>
> On 11/20/18 3:28 PM, John Snow wrote:
>>
>>
>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>> While most developers are now using UTF-8 environments, it's
>>> harder to guarantee that error messages will be output to
>>> a multibyte locale. Rather than risking error messages that
>>> get corrupted into mojibake when the user runs qemu in a
>>> non-multibyte locale, let's stick to straight ASCII error
>>> messages, rather than assuming that our use of UTF-8 in source
>>> code string constants will work unchanged in other locales.
>>>
>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>> hw/misc/tmp105.c | 2 +-
>>> hw/misc/tmp421.c | 2 +-
>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>>
>> Do we have any policy in place to prohibit this in the future?
>> (Presumably a policy that is automatic and won't interfere with QEMU
>> localization efforts which may rightly attempt to use UTF-8 for those
>> locales.)
>
> Not that I know of.
>
>>
>> Do you have a script or trick to find utf-8 containing strings in our
>> source?
>
> Markus found these two, probably by reading over a list resulting from
> his claim of finding 217 out of 6455 files (53 of them binary, which
> don't count):
> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html
>
> My quick and dirty attempt, which does not quite reproduce his numbers:
>
> $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
> 279 279 7490
>
> Thus, by forcing a unibyte locale (where encoding errors are impossible)
> with sane range expressions (POSIX says only the C locale is required to
> interpret regex ranges according to byte value - all bets are off in
> other locales) and using $'' to type non-UTF-8 bytes into my search, I
> found 279 files with at least one byte outside of ASCII. But the use of
> -l has no easy way to filter which of those files are binary; while
> dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to
> scroll through, especially considering there ARE binary files in the mix.
>
> Narrowing the search to a more specific pattern:
>
> $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
> 129 685 8808
>
> is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc
> (false positive hits, due to interesting? comments), in po/ (which
> doesn't count), or in scripts/ for python. And the proof for THIS patch:
>
> $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
> origin:hw/misc/tmp105.c
> origin:hw/misc/tmp421.c
Can we add the last 3 lines in the commit message?
>
>>
>> Only curious, don't hold this patch up on my account. I'm not raising a
>> challenge.
>
> Maybe checkpatch.pl could be taught to do a similar check?
It looks easier in shell than perl...
We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl
$@' or similar?
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-21 11:39 ` Philippe Mathieu-Daudé
@ 2018-11-21 16:20 ` Markus Armbruster
0 siblings, 0 replies; 8+ messages in thread
From: Markus Armbruster @ 2018-11-21 16:20 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: Eric Blake, John Snow, qemu-devel, qemu-trivial
Philippe Mathieu-Daudé <philmd@redhat.com> writes:
> On 20/11/18 23:01, Eric Blake wrote:
>> [adding Markus in CC, since git didn't do it automatically from the
>> 'Reported-by']
>>
>> On 11/20/18 3:28 PM, John Snow wrote:
>>>
>>>
>>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>>> While most developers are now using UTF-8 environments, it's
>>>> harder to guarantee that error messages will be output to
>>>> a multibyte locale. Rather than risking error messages that
>>>> get corrupted into mojibake when the user runs qemu in a
>>>> non-multibyte locale, let's stick to straight ASCII error
>>>> messages, rather than assuming that our use of UTF-8 in source
>>>> code string constants will work unchanged in other locales.
>>>>
>>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>>> ---
>>>> hw/misc/tmp105.c | 2 +-
>>>> hw/misc/tmp421.c | 2 +-
>>>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>>>
>>> Do we have any policy in place to prohibit this in the future?
>>> (Presumably a policy that is automatic and won't interfere with QEMU
>>> localization efforts which may rightly attempt to use UTF-8 for those
>>> locales.)
>>
>> Not that I know of.
We already outlaw newline and trailing punctuation, we could amend that
to outlaw non-ASCII.
$ git-grep 'no newline'
include/qapi/error.h: * The resulting message should be a single phrase, with no newline or
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * single phrase, with no newline or trailing punctuation.
[...]
>> Maybe checkpatch.pl could be taught to do a similar check?
>
> It looks easier in shell than perl...
>
> We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl
> $@' or similar?
Congratulations, you found yet another way to make our checkpatch
program less readable.
Back to serious. checkpatch.pl already flags error messages containing
newlines (search for "should not contain newlines"). Extending that to
flag non-ASCII characters shouldn't be hard.
> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 20:36 [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages Eric Blake
2018-11-20 21:28 ` John Snow
@ 2018-11-21 6:03 ` Thomas Huth
2018-11-22 12:27 ` [Qemu-devel] [Qemu-trivial] " Laurent Vivier
2 siblings, 0 replies; 8+ messages in thread
From: Thomas Huth @ 2018-11-21 6:03 UTC (permalink / raw)
To: Eric Blake, qemu-devel; +Cc: qemu-trivial
On 2018-11-20 21:36, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
>
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> hw/misc/tmp105.c | 2 +-
> hw/misc/tmp421.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
> return;
> }
> if (temp >= 128000 || temp < -128000) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
> }
>
> if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
>
Reviewed-by: Thomas Huth <thuth@redhat.com>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [Qemu-trivial] [PATCH] misc: Avoid UTF-8 in error messages
2018-11-20 20:36 [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages Eric Blake
2018-11-20 21:28 ` John Snow
2018-11-21 6:03 ` Thomas Huth
@ 2018-11-22 12:27 ` Laurent Vivier
2 siblings, 0 replies; 8+ messages in thread
From: Laurent Vivier @ 2018-11-22 12:27 UTC (permalink / raw)
To: Eric Blake, qemu-devel; +Cc: qemu-trivial
On 20/11/2018 21:36, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
>
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
> hw/misc/tmp105.c | 2 +-
> hw/misc/tmp421.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
> return;
> }
> if (temp >= 128000 || temp < -128000) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
> }
>
> if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
> temp / 1000, temp % 1000);
> return;
> }
>
Applied to qemu-trivial with updated message log (command to find the
non-ASCII characters).
Thanks,
Laurent
^ permalink raw reply [flat|nested] 8+ messages in thread