From: David Gibson <dgibson@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>,
qemu-devel@nongnu.org, dgilbert@redhat.com, wehuang@redhat.com,
drjones@redhat.com, amit.shah@redhat.com
Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes
Date: Thu, 14 Apr 2016 13:39:51 +1000 [thread overview]
Message-ID: <20160414133951.1bb4cd69@voom.fritz.box> (raw)
In-Reply-To: <20160413145835-mutt-send-email-mst@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 9364 bytes --]
On Wed, 13 Apr 2016 16:15:25 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> > The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> > as length parameter, and an address which is directly based on
> > the page address supplied by the guest. Since the virtio-balloon
> > protocol is always based on 4k based addresses/sizes, no matter
> > what the host and guest are using as page size, this has a couple
> > of issues which could even lead to data loss in the worst case.
> >
> > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> > 4k, we also destroy the 4k areas after the current one - which
> > might be wrong since the guest did not want free that area yet (in
> > case the guest used as smaller MMU page size than the hard-coded
> > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size
> > parameter for the madvise() call instead.
>
> this makes absolute sense.
>
>
> > Then, there's yet another problem: If the host page size is bigger
> > than the 4k balloon page size, we can not simply call madvise() on
> > each of the 4k balloon addresses that we get from the guest - since
> > the madvise() always evicts the whole host page, not only a 4k area!
>
> Does it really round length up?
> Wow, it does:
> len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>
> which seems to be undocumented, but has been there forever.
>
>
> > So in this case, we've got to track the 4k fragments of a host page
> > and only call madvise(DONTNEED) when all fragments have been collected.
> > This of course only works fine if the guest sends consecutive 4k
> > fragments - which is the case in the most important scenarios that
> > I try to address here (like a ppc64 guest with 64k page size that
> > is running on a ppc64 host with 64k page size). In case the guest
> > uses a page size that is smaller than the host page size, we might
> > need to add some more additional logic here later to increase the
> > probability of being able to release memory, but at least the guest
> > should now not crash anymore due to unintentionally evicted pages.
> >
> > Signed-off-by: Thomas Huth <thuth@redhat.com>
> > ---
> > I've tested this patch with both, a 4k page size ppc64 guest
> > and a 64k page size ppc64 guest on a 64k page size ppc64 host.
> > With this patch applied, I was not able to crash to crash the
> > guests anymore (the 4k guest easily crashes without this patch).
> > And looking at the output of the "free" command on the host,
> > the ballooning also still works as expected.
> >
> > hw/virtio/virtio-balloon.c | 68 ++++++++++++++++++++++++++++++++++----
> > include/hw/virtio/virtio-balloon.h | 3 ++
> > 2 files changed, 65 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index c74101e..886faa8 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -35,13 +35,56 @@
> > #include "hw/virtio/virtio-bus.h"
> > #include "hw/virtio/virtio-access.h"
> >
> > -static void balloon_page(void *addr, int deflate)
> > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT)
> > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1)
> > +
> > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
> > {
> > #if defined(__linux__)
> > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> > - kvm_has_sync_mmu())) {
> > - qemu_madvise(addr, TARGET_PAGE_SIZE,
> > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > + size_t host_page_size;
> > + void *aligned_addr;
> > +
> > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> > + return;
> > + }
> > +
> > + host_page_size = getpagesize();
> > + if (host_page_size <= BALLOON_PAGE_SIZE) {
> > + qemu_madvise(addr, BALLOON_PAGE_SIZE,
> > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> > + return;
> > + }
> > +
> > + /* Host page size > ballon page size ==> use aligned host address */
> > + aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> > + if (deflate) {
> > + /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> > + * also be able to use the memory again without this call, so let's
> > + * only do it for the first, aligned fragment of a host page and
> > + * ignore it for the others.
> > + */
> > + if (addr == aligned_addr) {
> > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> > + }
> > + s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > + } else {
> > + const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> > + /* If we end up here, that means we want to evict balloon pages, but
> > + * the host's page size is bigger than the 4k pages from the balloon.
> > + * Since madvise() only works on the granularity of the host page size,
> > + * we've got to collect all the 4k fragments from the guest first
> > + * before we can issue the MADV_DONTNEED call.
> > + */
> > + if (aligned_addr != s->current_addr) {
> > + memset(s->fragment_bits, 0, s->fragment_bits_size);
> > + s->current_addr = aligned_addr;
> > + }
> > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> > + if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {
> > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> > + memset(s->fragment_bits, 0, s->fragment_bits_size);
> > + s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > + }
> > }
> > #endif
> > }
> > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> > /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> > should be OK because we only want a single page. */
> > addr = section.offset_within_region;
> > - balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> > + balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
> > !!(vq == s->dvq));
> > memory_region_unref(section.mr);
> > }
> > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> > s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> > s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> >
> > + if (getpagesize() > BALLOON_PAGE_SIZE) {
> > + s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> > + + sizeof(long) * 8 - 1) / 8;
> > + s->fragment_bits = g_malloc0(s->fragment_bits_size);
> > + s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > + }
> > +
> > reset_stats(s);
> >
> > register_savevm(dev, "virtio-balloon", -1, 1,
> > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
> > VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> > VirtIOBalloon *s = VIRTIO_BALLOON(dev);
> >
> > + g_free(s->fragment_bits);
> > balloon_stats_destroy_timer(s);
> > qemu_remove_balloon_handler(s);
> > unregister_savevm(dev, "virtio-balloon", s);
> > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
> > g_free(s->stats_vq_elem);
> > s->stats_vq_elem = NULL;
> > }
> > +
> > + if (s->fragment_bits) {
> > + memset(s->fragment_bits, 0, s->fragment_bits_size);
> > + s->current_addr = BALLOON_NO_CURRENT_PAGE;
> > + }
> > }
> >
> > static void virtio_balloon_instance_init(Object *obj)
> > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> > index 35f62ac..04b7c0c 100644
> > --- a/include/hw/virtio/virtio-balloon.h
> > +++ b/include/hw/virtio/virtio-balloon.h
> > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> > int64_t stats_last_update;
> > int64_t stats_poll_interval;
> > uint32_t host_features;
> > + void *current_addr;
> > + unsigned long *fragment_bits;
> > + int fragment_bits_size;
> > } VirtIOBalloon;
> >
> > #endif
> > --
> > 1.8.3.1
>
>
> It looks like fragment_bits would have to be migrated.
Hmm.. do they? If we just ignore migration, isn't the worse that
happens that we just keep one extra page allocated.
> Which is a lot of complexity.
> And work arounds for specific guest behaviour are really ugly.
> There are patches on-list to maintain a balloon bitmap -
> that will enable fixing it cleanly.
> How about we just skip madvise if host page size is > balloon
> page size, for 2.6?
>
> --
> MST
--
David Gibson <dgibson@redhat.com>
Senior Software Engineer, Virtualization, Red Hat
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
next prev parent reply other threads:[~2016-04-14 3:38 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-04-13 11:52 [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes Thomas Huth
2016-04-13 12:37 ` Andrew Jones
2016-04-13 13:15 ` Michael S. Tsirkin
2016-04-13 14:51 ` Thomas Huth
2016-04-13 17:07 ` Michael S. Tsirkin
2016-04-13 17:38 ` Thomas Huth
2016-04-13 17:55 ` Michael S. Tsirkin
2016-04-13 18:11 ` Thomas Huth
2016-04-13 18:14 ` Michael S. Tsirkin
2016-04-14 3:45 ` David Gibson
2016-04-13 18:21 ` Andrew Jones
2016-04-14 11:47 ` Dr. David Alan Gilbert
2016-04-14 12:19 ` Thomas Huth
2016-04-14 18:34 ` Dr. David Alan Gilbert
2016-04-15 4:26 ` David Gibson
2016-05-23 6:25 ` Jitendra Kolhe
2016-04-14 3:39 ` David Gibson [this message]
2016-04-14 3:37 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160414133951.1bb4cd69@voom.fritz.box \
--to=dgibson@redhat.com \
--cc=amit.shah@redhat.com \
--cc=dgilbert@redhat.com \
--cc=drjones@redhat.com \
--cc=mst@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=thuth@redhat.com \
--cc=wehuang@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.