From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:47810) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqY4n-0004sL-NJ for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:36:19 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aqY4h-0004ov-H5 for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:36:17 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42577) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqY4h-0004oq-8h for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:36:11 -0400 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 8E676C049D5C for ; Thu, 14 Apr 2016 03:36:09 +0000 (UTC) Date: Thu, 14 Apr 2016 13:37:22 +1000 From: David Gibson Message-ID: <20160414133722.258bbab7@voom.fritz.box> In-Reply-To: <1460548364-27469-1-git-send-email-thuth@redhat.com> References: <1460548364-27469-1-git-send-email-thuth@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; boundary="Sig_/GiK9vCyuyJ9+pl=yc/aE5kA"; protocol="application/pgp-signature" Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Thomas Huth Cc: "Michael S. Tsirkin" , qemu-devel@nongnu.org, dgilbert@redhat.com, wehuang@redhat.com, drjones@redhat.com, amit.shah@redhat.com --Sig_/GiK9vCyuyJ9+pl=yc/aE5kA Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Wed, 13 Apr 2016 13:52:44 +0200 Thomas Huth wrote: > The balloon code currently calls madvise() with TARGET_PAGE_SIZE > as length parameter, and an address which is directly based on > the page address supplied by the guest. Since the virtio-balloon > protocol is always based on 4k based addresses/sizes, no matter > what the host and guest are using as page size, this has a couple > of issues which could even lead to data loss in the worst case. >=20 > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than > 4k, we also destroy the 4k areas after the current one - which > might be wrong since the guest did not want free that area yet (in > case the guest used as smaller MMU page size than the hard-coded > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size > parameter for the madvise() call instead. >=20 > Then, there's yet another problem: If the host page size is bigger > than the 4k balloon page size, we can not simply call madvise() on > each of the 4k balloon addresses that we get from the guest - since > the madvise() always evicts the whole host page, not only a 4k area! > So in this case, we've got to track the 4k fragments of a host page > and only call madvise(DONTNEED) when all fragments have been collected. > This of course only works fine if the guest sends consecutive 4k > fragments - which is the case in the most important scenarios that > I try to address here (like a ppc64 guest with 64k page size that > is running on a ppc64 host with 64k page size). In case the guest > uses a page size that is smaller than the host page size, we might > need to add some more additional logic here later to increase the > probability of being able to release memory, but at least the guest > should now not crash anymore due to unintentionally evicted pages. Yeah, this is a serious bug. I think the patch is basically ok, except for a couple of minor points noted below. It's a bit more complex than I'd generally like to apply during hard freeze, but given the seriousness of the problem, I think we should put this in for qemu-2.6. >=20 > Signed-off-by: Thomas Huth > --- > I've tested this patch with both, a 4k page size ppc64 guest > and a 64k page size ppc64 guest on a 64k page size ppc64 host. > With this patch applied, I was not able to crash to crash the > guests anymore (the 4k guest easily crashes without this patch). > And looking at the output of the "free" command on the host, > the ballooning also still works as expected. >=20 > hw/virtio/virtio-balloon.c | 68 ++++++++++++++++++++++++++++++++= ++---- > include/hw/virtio/virtio-balloon.h | 3 ++ > 2 files changed, 65 insertions(+), 6 deletions(-) >=20 > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index c74101e..886faa8 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -35,13 +35,56 @@ > #include "hw/virtio/virtio-bus.h" > #include "hw/virtio/virtio-access.h" > =20 > -static void balloon_page(void *addr, int deflate) > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1) > + > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate) > { > #if defined(__linux__) > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() || > - kvm_has_sync_mmu())) { > - qemu_madvise(addr, TARGET_PAGE_SIZE, > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > + size_t host_page_size; > + void *aligned_addr; > + > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync_m= mu())) { > + return; > + } > + > + host_page_size =3D getpagesize(); I think you actually want getrampagesize() here (or rather some wrapper around that to handle the non-kvm cases properly) - to cover the case where the VM has been built using hugepages in the host. For the normal case there's already the qemu_real_host_page_size global, so you shouldn't need to call getpagesize() yourself. > + if (host_page_size <=3D BALLOON_PAGE_SIZE) { > + qemu_madvise(addr, BALLOON_PAGE_SIZE, > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > + return; > + } > + > + /* Host page size > ballon page size =3D=3D> use aligned host addres= s */ > + aligned_addr =3D (void *)((uintptr_t)addr & ~(host_page_size - 1)); > + if (deflate) { > + /* MADV_WILLNEED is just a hint for the host kernel, the guest s= hould > + * also be able to use the memory again without this call, so le= t's > + * only do it for the first, aligned fragment of a host page and > + * ignore it for the others. > + */ > + if (addr =3D=3D aligned_addr) { > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLNEE= D); > + } > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > + } else { > + const int max_frags =3D host_page_size / BALLOON_PAGE_SIZE; > + /* If we end up here, that means we want to evict balloon pages,= but > + * the host's page size is bigger than the 4k pages from the bal= loon. > + * Since madvise() only works on the granularity of the host pag= e size, > + * we've got to collect all the 4k fragments from the guest first > + * before we can issue the MADV_DONTNEED call. > + */ > + if (aligned_addr !=3D s->current_addr) { I'd suggest a (non fatal) error_report() here if s->current_addr !=3D BALLOON_NO_CURRENT_PAGE. > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr =3D aligned_addr; > + } > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment_b= its); > + if (find_first_zero_bit(s->fragment_bits, max_frags) =3D=3D max_= frags) { If you invert the sense of the bitmap you can just check for an all-zero bitmap here which might be a little simpler. > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTNEE= D); > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > + } > } > #endif > } > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevice= *vdev, VirtQueue *vq) > /* Using memory_region_get_ram_ptr is bending the rules a bi= t, but > should be OK because we only want a single page. */ > addr =3D section.offset_within_region; > - balloon_page(memory_region_get_ram_ptr(section.mr) + addr, > + balloon_page(s, memory_region_get_ram_ptr(section.mr) + addr, > !!(vq =3D=3D s->dvq)); > memory_region_unref(section.mr); > } > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceStat= e *dev, Error **errp) > s->dvq =3D virtio_add_queue(vdev, 128, virtio_balloon_handle_output); > s->svq =3D virtio_add_queue(vdev, 128, virtio_balloon_receive_stats); > =20 > + if (getpagesize() > BALLOON_PAGE_SIZE) { > + s->fragment_bits_size =3D (getpagesize() / BALLOON_PAGE_SIZE > + + sizeof(long) * 8 - 1) / 8; > + s->fragment_bits =3D g_malloc0(s->fragment_bits_size); > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > + } > + > reset_stats(s); > =20 > register_savevm(dev, "virtio-balloon", -1, 1, > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceSta= te *dev, Error **errp) > VirtIODevice *vdev =3D VIRTIO_DEVICE(dev); > VirtIOBalloon *s =3D VIRTIO_BALLOON(dev); > =20 > + g_free(s->fragment_bits); > balloon_stats_destroy_timer(s); > qemu_remove_balloon_handler(s); > unregister_savevm(dev, "virtio-balloon", s); > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevice= *vdev) > g_free(s->stats_vq_elem); > s->stats_vq_elem =3D NULL; > } > + > + if (s->fragment_bits) { > + memset(s->fragment_bits, 0, s->fragment_bits_size); > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > + } > } > =20 > static void virtio_balloon_instance_init(Object *obj) > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virti= o-balloon.h > index 35f62ac..04b7c0c 100644 > --- a/include/hw/virtio/virtio-balloon.h > +++ b/include/hw/virtio/virtio-balloon.h > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon { > int64_t stats_last_update; > int64_t stats_poll_interval; > uint32_t host_features; > + void *current_addr; > + unsigned long *fragment_bits; > + int fragment_bits_size; > } VirtIOBalloon; > =20 > #endif > --=20 > 1.8.3.1 >=20 --=20 David Gibson Senior Software Engineer, Virtualization, Red Hat --Sig_/GiK9vCyuyJ9+pl=yc/aE5kA Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJXDxByAAoJEGw4ysog2bOSk9kQAJosnYZF1TjMMu2uUnvciB7w pHUlbsLmVL2IlQqYE5faXMEIQJ03PTZLcKn3q542U10/fxQkwr141J2zcWRgt+/i NwcipjagD/poNitivROf9x6SofTWm5gUwcxMkALRd+fRNd4NrIxpayillJfspHhn 4uS1D3mCT0H0prg5l3wzBNswvewGtjouApnurkIvcrxszrrdq+wQRep5EVCKrCKd AOMMloYvDk4V2qar6LA4AGfIM1VH0FnXzCB1mshMS+I7D4L5MdchF39cqFS6WJx+ J31AG8AsHHl4AatwNuGONg4F/nc2wAwPiDMTTlT97wM4kjIKeEOam99QncJVPU5Z Jdf7u72LZPViftEpwu+TiU2z+RhzqszzVMq2L7OIGy6pQkax7FU/LplJCTIAJBLM 60UZt8v0Bp71fv569/MN0qQ+W9Vxnc5lqMYmmDm3mweZxc5+323MlW4E8d4rJ2rY jCO8biaePmAOCHcXB/S4EtFdpb9cjKOnt8FNu0TCTTYnIRN2b9SgzUiRmFJk+fFF Ym7vMzm+c9uqqApnCEJYI1eQh6iG5TNODX6lPqE1213IC6ALzPoj/lQZRCoNtoyN b8CSaSwzrFlBw8chEQgMtNKFpeyVTCaL+t0cbG7coE15mYjDQWayTc1nGSlu6dW6 zCp5fmYohMnxn0q4mmNu =lBAi -----END PGP SIGNATURE----- --Sig_/GiK9vCyuyJ9+pl=yc/aE5kA--