From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, Bandan Das <bsd@redhat.com>,
qemu-devel@nongnu.org,
Samuel Thibault <samuel.thibault@ens-lyon.org>,
Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2] vl: set LC_CTYPE early in main() for all code
Date: Tue, 16 Apr 2019 17:09:27 +0100 [thread overview]
Message-ID: <20190416160927.GT31311@redhat.com> (raw)
In-Reply-To: <87zhoq3pn9.fsf@dusky.pond.sub.org>
On Tue, Apr 16, 2019 at 06:01:46PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Tue, Apr 16, 2019 at 09:49:09AM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> > The main thing I can see would be filenames.
> >
> > Though having said it is UTF-8 on looking more closely I think QEMU is
> > probably 8-bit clean in its handling, so will just be blindly passing
> > whatever filename string it get from libvirt straight on to the kernel
> > with no interpretation.
>
> Sounds good to me.
>
> > Libvirt has enabled UTF-8 validation in its JSON library when encoding
> > data it sends to QEMU, so any data libvirt is sending will be a valid
> > UTF-8 byte sequence at least. Libvirt doesn't axctually do any charset
> > conversion though, so if libvirt runs in a non-UTF8 locale it will
> > likely trip over this UTF-8 validation.
>
> QMP input must be encoded in UTF-8. Converting from other encodings to
> UTF-8 is the QMP client's problem.
Ok, so consider the host OS is globally running in a non-UTF-8 locale
such as ISO8859-1. This means that any multibyte filenames in the
filesystem are assumed to be in ISO8859-1 encoding.
Since QMP input must be UTF-8, libvirt must convert the filename
from the current locale (ISO8859-1) to UTF-8 otherwise it might
be putting an invalid UTF-8 sequence in the JSON.
For QEMU to be able to open the file, QEMU must be honouring the
host OS LC_CTYPE, and converting from UTF-8 back to LC_CTYPE
character set.
>
> The more interesting direction is the one I inquired about: QMP output.
> If locale-dependent text gets sent to QMP, converting it to UTF-8 is
> QEMU's problem.
>
> On closer look, anything but JSON string contents is plain ASCII by
> construction. JSON string contents gets assembled in to_json() case
> QTYPE_QSTRING. It expects QString to use UTF-8[*]. You can have any
> locale as long as it uses ASCII or UTF-8.
IOW
>
> >> > + *
> >> > + * - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> >> > + * C locale sorting behaviour. Most QEMU usage should likely be
> >> > + * changed to g_ascii_is{upper,lower,alnum...} to match code
> >> > + * assumptions, without being broken by locale settnigs.
> >> > + *
> >> > + * We do still have two requirements
> >> > + *
> >> > + * - Ability to correct display translated text according to the
> >> > + * user's locale
> >> > + *
> >> > + * - Ability to handle multibyte characters, ideally according to
> >> > + * user's locale specified character set. This affects ability
> >> > + * of usb-mtp to correctly convert filenames to UCS16 and curses
> >> > + * & GTK frontends wide character display.
> >> > + *
> >> > + * The second requirement would need LC_CTYPE to be honoured, but
> >> > + * this conflicts with the 2nd & 3rd problems listed earlier. For
> >> > + * now we make a tradeoff, trying to set an explicit UTF-8 localee
> >> > + *
> >> > + * Note we can't set LC_MESSAGES here, since mingw doesn't define
> >> > + * this constant in locale.h Fortunately we only need it for the
> >> > + * GTK frontend and that uses gi18n.h which pulls in a definition
> >> > + * of LC_MESSAGES.
> >> > + */
> >> > + setlocale(LC_CTYPE, "C.UTF-8");
> >> > +
> >> > module_call_init(MODULE_INIT_TRACE);
> >> >
> >> > qemu_init_cpu_list();
> >>
> >> We should've stayed out of the GUI business.
> >
> > This isn't only a GUI problem as above, it affects USB MTP.
>
> I believe setlocale() in QEMU is basically wrong. Finding all the
> places that rely on the current locale when they shouldn't and
> converting them to locale-independent alternatives is a huge amount of
> work. Even if we managed to complete it, it wouldn't stay complete.
>
> Instead, find the places that have reason to use the locale, and fix
> them to uselocale().
I think that's fundamentally the wrong way around. Most stuff *should*
be locale dependant, otherwise any interaction with the host OS is
likely to use incorrect localization. It isn't practical to put a
uselocale() call around every place that opens a filename. There are
a few places where QEMU should be locale indepandant such as the QMP
and guest OS ABI sensitive things, which should take account of it.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
WARNING: multiple messages have this Message-ID (diff)
From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Markus Armbruster <armbru@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, Bandan Das <bsd@redhat.com>,
Gerd Hoffmann <kraxel@redhat.com>,
qemu-devel@nongnu.org,
Samuel Thibault <samuel.thibault@ens-lyon.org>
Subject: Re: [Qemu-devel] [PATCH v2] vl: set LC_CTYPE early in main() for all code
Date: Tue, 16 Apr 2019 17:09:27 +0100 [thread overview]
Message-ID: <20190416160927.GT31311@redhat.com> (raw)
Message-ID: <20190416160927.j0K1obF9fwCg7jGnT9yU22_xIuKXF5XcK83u2ksWpuY@z> (raw)
In-Reply-To: <87zhoq3pn9.fsf@dusky.pond.sub.org>
On Tue, Apr 16, 2019 at 06:01:46PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Tue, Apr 16, 2019 at 09:49:09AM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> > The main thing I can see would be filenames.
> >
> > Though having said it is UTF-8 on looking more closely I think QEMU is
> > probably 8-bit clean in its handling, so will just be blindly passing
> > whatever filename string it get from libvirt straight on to the kernel
> > with no interpretation.
>
> Sounds good to me.
>
> > Libvirt has enabled UTF-8 validation in its JSON library when encoding
> > data it sends to QEMU, so any data libvirt is sending will be a valid
> > UTF-8 byte sequence at least. Libvirt doesn't axctually do any charset
> > conversion though, so if libvirt runs in a non-UTF8 locale it will
> > likely trip over this UTF-8 validation.
>
> QMP input must be encoded in UTF-8. Converting from other encodings to
> UTF-8 is the QMP client's problem.
Ok, so consider the host OS is globally running in a non-UTF-8 locale
such as ISO8859-1. This means that any multibyte filenames in the
filesystem are assumed to be in ISO8859-1 encoding.
Since QMP input must be UTF-8, libvirt must convert the filename
from the current locale (ISO8859-1) to UTF-8 otherwise it might
be putting an invalid UTF-8 sequence in the JSON.
For QEMU to be able to open the file, QEMU must be honouring the
host OS LC_CTYPE, and converting from UTF-8 back to LC_CTYPE
character set.
>
> The more interesting direction is the one I inquired about: QMP output.
> If locale-dependent text gets sent to QMP, converting it to UTF-8 is
> QEMU's problem.
>
> On closer look, anything but JSON string contents is plain ASCII by
> construction. JSON string contents gets assembled in to_json() case
> QTYPE_QSTRING. It expects QString to use UTF-8[*]. You can have any
> locale as long as it uses ASCII or UTF-8.
IOW
>
> >> > + *
> >> > + * - Lots of codes uses is{upper,lower,alnum,...} functions, expecting
> >> > + * C locale sorting behaviour. Most QEMU usage should likely be
> >> > + * changed to g_ascii_is{upper,lower,alnum...} to match code
> >> > + * assumptions, without being broken by locale settnigs.
> >> > + *
> >> > + * We do still have two requirements
> >> > + *
> >> > + * - Ability to correct display translated text according to the
> >> > + * user's locale
> >> > + *
> >> > + * - Ability to handle multibyte characters, ideally according to
> >> > + * user's locale specified character set. This affects ability
> >> > + * of usb-mtp to correctly convert filenames to UCS16 and curses
> >> > + * & GTK frontends wide character display.
> >> > + *
> >> > + * The second requirement would need LC_CTYPE to be honoured, but
> >> > + * this conflicts with the 2nd & 3rd problems listed earlier. For
> >> > + * now we make a tradeoff, trying to set an explicit UTF-8 localee
> >> > + *
> >> > + * Note we can't set LC_MESSAGES here, since mingw doesn't define
> >> > + * this constant in locale.h Fortunately we only need it for the
> >> > + * GTK frontend and that uses gi18n.h which pulls in a definition
> >> > + * of LC_MESSAGES.
> >> > + */
> >> > + setlocale(LC_CTYPE, "C.UTF-8");
> >> > +
> >> > module_call_init(MODULE_INIT_TRACE);
> >> >
> >> > qemu_init_cpu_list();
> >>
> >> We should've stayed out of the GUI business.
> >
> > This isn't only a GUI problem as above, it affects USB MTP.
>
> I believe setlocale() in QEMU is basically wrong. Finding all the
> places that rely on the current locale when they shouldn't and
> converting them to locale-independent alternatives is a huge amount of
> work. Even if we managed to complete it, it wouldn't stay complete.
>
> Instead, find the places that have reason to use the locale, and fix
> them to uselocale().
I think that's fundamentally the wrong way around. Most stuff *should*
be locale dependant, otherwise any interaction with the host OS is
likely to use incorrect localization. It isn't practical to put a
uselocale() call around every place that opens a filename. There are
a few places where QEMU should be locale indepandant such as the QMP
and guest OS ABI sensitive things, which should take account of it.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
next prev parent reply other threads:[~2019-04-16 16:09 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-15 14:15 [Qemu-devel] [PATCH v2] vl: set LC_CTYPE early in main() for all code Daniel P. Berrangé
2019-04-15 14:15 ` Daniel P. Berrangé
2019-04-15 14:25 ` Peter Maydell
2019-04-15 14:25 ` Peter Maydell
2019-04-15 15:25 ` Eric Blake
2019-04-15 15:25 ` Eric Blake
2019-04-15 15:46 ` Peter Maydell
2019-04-15 15:46 ` Peter Maydell
2019-04-15 15:51 ` Daniel P. Berrangé
2019-04-15 15:51 ` Daniel P. Berrangé
2019-04-16 7:49 ` Markus Armbruster
2019-04-16 7:49 ` Markus Armbruster
2019-04-16 9:03 ` Daniel P. Berrangé
2019-04-16 9:03 ` Daniel P. Berrangé
2019-04-16 16:01 ` Markus Armbruster
2019-04-16 16:01 ` Markus Armbruster
2019-04-16 16:09 ` Daniel P. Berrangé [this message]
2019-04-16 16:09 ` Daniel P. Berrangé
2019-04-16 19:19 ` Markus Armbruster
2019-04-16 19:19 ` Markus Armbruster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190416160927.GT31311@redhat.com \
--to=berrange@redhat.com \
--cc=armbru@redhat.com \
--cc=bsd@redhat.com \
--cc=kraxel@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=samuel.thibault@ens-lyon.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.