qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Gibson <dgibson@redhat.com>
To: Thomas Huth <thuth@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel@nongnu.org, dgilbert@redhat.com, wehuang@redhat.com,
	drjones@redhat.com, amit.shah@redhat.com
Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes
Date: Thu, 14 Apr 2016 13:37:22 +1000	[thread overview]
Message-ID: <20160414133722.258bbab7@voom.fritz.box> (raw)
In-Reply-To: <1460548364-27469-1-git-send-email-thuth@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 9087 bytes --]

On Wed, 13 Apr 2016 13:52:44 +0200
Thomas Huth <thuth@redhat.com> wrote:

> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> as length parameter, and an address which is directly based on
> the page address supplied by the guest. Since the virtio-balloon
> protocol is always based on 4k based addresses/sizes, no matter
> what the host and guest are using as page size, this has a couple
> of issues which could even lead to data loss in the worst case.
> 
> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> 4k, we also destroy the 4k areas after the current one - which
> might be wrong since the guest did not want free that area yet (in
> case the guest used as smaller MMU page size than the hard-coded
> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> called BALLOON_PAGE_SIZE (which is 4096) to use this as the size
> parameter for the madvise() call instead.
> 
> Then, there's yet another problem: If the host page size is bigger
> than the 4k balloon page size, we can not simply call madvise() on
> each of the 4k balloon addresses that we get from the guest - since
> the madvise() always evicts the whole host page, not only a 4k area!
> So in this case, we've got to track the 4k fragments of a host page
> and only call madvise(DONTNEED) when all fragments have been collected.
> This of course only works fine if the guest sends consecutive 4k
> fragments - which is the case in the most important scenarios that
> I try to address here (like a ppc64 guest with 64k page size that
> is running on a ppc64 host with 64k page size). In case the guest
> uses a page size that is smaller than the host page size, we might
> need to add some more additional logic here later to increase the
> probability of being able to release memory, but at least the guest
> should now not crash anymore due to unintentionally evicted pages.

Yeah, this is a serious bug.  I think the patch is basically ok, except
for a couple of minor points noted below.

It's a bit more complex than I'd generally like to apply during hard
freeze, but given the seriousness of the problem, I think we should put
this in for qemu-2.6.

> 
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  I've tested this patch with both, a 4k page size ppc64 guest
>  and a 64k page size ppc64 guest on a 64k page size ppc64 host.
>  With this patch applied, I was not able to crash to crash the
>  guests anymore (the 4k guest easily crashes without this patch).
>  And looking at the output of the "free" command on the host,
>  the ballooning also still works as expected.
> 
>  hw/virtio/virtio-balloon.c         | 68 ++++++++++++++++++++++++++++++++++----
>  include/hw/virtio/virtio-balloon.h |  3 ++
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index c74101e..886faa8 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -35,13 +35,56 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
>  
> -static void balloon_page(void *addr, int deflate)
> +#define BALLOON_PAGE_SIZE        (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define BALLOON_NO_CURRENT_PAGE  ((void *)-1)
> +
> +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate)
>  {
>  #if defined(__linux__)
> -    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> -                                         kvm_has_sync_mmu())) {
> -        qemu_madvise(addr, TARGET_PAGE_SIZE,
> -                deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +    size_t host_page_size;
> +    void *aligned_addr;
> +
> +    if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_mmu())) {
> +        return;
> +    }
> +
> +    host_page_size = getpagesize();

I think you actually want getrampagesize() here (or rather some wrapper
around that to handle the non-kvm cases properly) - to cover the case
where the VM has been built using hugepages in the host.  For the
normal case there's already the qemu_real_host_page_size global, so you
shouldn't need to call getpagesize() yourself.

> +    if (host_page_size <= BALLOON_PAGE_SIZE) {
> +        qemu_madvise(addr, BALLOON_PAGE_SIZE,
> +                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> +        return;
> +    }
> +
> +    /* Host page size > ballon page size ==> use aligned host address */
> +    aligned_addr = (void *)((uintptr_t)addr & ~(host_page_size - 1));
> +    if (deflate) {
> +        /* MADV_WILLNEED is just a hint for the host kernel, the guest should
> +         * also be able to use the memory again without this call, so let's
> +         * only do it for the first, aligned fragment of a host page and
> +         * ignore it for the others.
> +         */
> +        if (addr == aligned_addr) {
> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEED);
> +        }
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    } else {
> +        const int max_frags = host_page_size / BALLOON_PAGE_SIZE;
> +        /* If we end up here, that means we want to evict balloon pages, but
> +         * the host's page size is bigger than the 4k pages from the balloon.
> +         * Since madvise() only works on the granularity of the host page size,
> +         * we've got to collect all the 4k fragments from the guest first
> +         * before we can issue the MADV_DONTNEED call.
> +         */
> +        if (aligned_addr != s->current_addr) {

I'd suggest a (non fatal) error_report() here if s->current_addr !=
BALLOON_NO_CURRENT_PAGE.

> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = aligned_addr;
> +        }
> +        set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_bits);
> +        if (find_first_zero_bit(s->fragment_bits, max_frags) == max_frags) {

If you invert the sense of the bitmap you can just check for an
all-zero bitmap here which might be a little simpler.

> +            qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEED);
> +            memset(s->fragment_bits, 0, s->fragment_bits_size);
> +            s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +        }
>      }
>  #endif
>  }
> @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>              /* Using memory_region_get_ram_ptr is bending the rules a bit, but
>                 should be OK because we only want a single page.  */
>              addr = section.offset_within_region;
> -            balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> +            balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr,
>                           !!(vq == s->dvq));
>              memory_region_unref(section.mr);
>          }
> @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>      s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>      s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
>  
> +    if (getpagesize() > BALLOON_PAGE_SIZE) {
> +        s->fragment_bits_size = (getpagesize() / BALLOON_PAGE_SIZE
> +                                 + sizeof(long) * 8 - 1) / 8;
> +        s->fragment_bits = g_malloc0(s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
> +
>      reset_stats(s);
>  
>      register_savevm(dev, "virtio-balloon", -1, 1,
> @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      VirtIOBalloon *s = VIRTIO_BALLOON(dev);
>  
> +    g_free(s->fragment_bits);
>      balloon_stats_destroy_timer(s);
>      qemu_remove_balloon_handler(s);
>      unregister_savevm(dev, "virtio-balloon", s);
> @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
>          g_free(s->stats_vq_elem);
>          s->stats_vq_elem = NULL;
>      }
> +
> +    if (s->fragment_bits) {
> +        memset(s->fragment_bits, 0, s->fragment_bits_size);
> +        s->current_addr = BALLOON_NO_CURRENT_PAGE;
> +    }
>  }
>  
>  static void virtio_balloon_instance_init(Object *obj)
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..04b7c0c 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
>      int64_t stats_last_update;
>      int64_t stats_poll_interval;
>      uint32_t host_features;
> +    void *current_addr;
> +    unsigned long *fragment_bits;
> +    int fragment_bits_size;
>  } VirtIOBalloon;
>  
>  #endif
> -- 
> 1.8.3.1
> 


-- 
David Gibson <dgibson@redhat.com>
Senior Software Engineer, Virtualization, Red Hat

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

      parent reply	other threads:[~2016-04-14  3:36 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-13 11:52 [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes Thomas Huth
2016-04-13 12:37 ` Andrew Jones
2016-04-13 13:15 ` Michael S. Tsirkin
2016-04-13 14:51   ` Thomas Huth
2016-04-13 17:07     ` Michael S. Tsirkin
2016-04-13 17:38       ` Thomas Huth
2016-04-13 17:55         ` Michael S. Tsirkin
2016-04-13 18:11           ` Thomas Huth
2016-04-13 18:14             ` Michael S. Tsirkin
2016-04-14  3:45               ` David Gibson
2016-04-13 18:21             ` Andrew Jones
2016-04-14 11:47     ` Dr. David Alan Gilbert
2016-04-14 12:19       ` Thomas Huth
2016-04-14 18:34         ` Dr. David Alan Gilbert
2016-04-15  4:26           ` David Gibson
2016-05-23  6:25     ` Jitendra Kolhe
2016-04-14  3:39   ` David Gibson
2016-04-14  3:37 ` David Gibson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160414133722.258bbab7@voom.fritz.box \
    --to=dgibson@redhat.com \
    --cc=amit.shah@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=drjones@redhat.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=thuth@redhat.com \
    --cc=wehuang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).