From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48106) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqY77-00088a-2X for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:38:42 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aqY73-0005Lq-OD for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:38:41 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47089) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqY73-0005Lm-GM for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:38:37 -0400 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 28925C084AC1 for ; Thu, 14 Apr 2016 03:38:37 +0000 (UTC) Date: Thu, 14 Apr 2016 13:39:51 +1000 From: David Gibson Message-ID: <20160414133951.1bb4cd69@voom.fritz.box> In-Reply-To: <20160413145835-mutt-send-email-mst@redhat.com> References: <1460548364-27469-1-git-send-email-thuth@redhat.com> <20160413145835-mutt-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; boundary="Sig_/JJyq7fg2ZYOqlQsDm6Ny2KJ"; protocol="application/pgp-signature" Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Michael S. Tsirkin" Cc: Thomas Huth , qemu-devel@nongnu.org, dgilbert@redhat.com, wehuang@redhat.com, drjones@redhat.com, amit.shah@redhat.com --Sig_/JJyq7fg2ZYOqlQsDm6Ny2KJ Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Wed, 13 Apr 2016 16:15:25 +0300 "Michael S. Tsirkin" wrote: > On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote: > > The balloon code currently calls madvise() with TARGET_PAGE_SIZE > > as length parameter, and an address which is directly based on > > the page address supplied by the guest. Since the virtio-balloon > > protocol is always based on 4k based addresses/sizes, no matter > > what the host and guest are using as page size, this has a couple > > of issues which could even lead to data loss in the worst case. > >=20 > > TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that > > value for the madvise() call. If TARGET_PAGE_SIZE is bigger than > > 4k, we also destroy the 4k areas after the current one - which > > might be wrong since the guest did not want free that area yet (in > > case the guest used as smaller MMU page size than the hard-coded > > TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define > > called BALLOON_PAGE_SIZE (which is 4096) to use this as the size > > parameter for the madvise() call instead. =20 >=20 > this makes absolute sense. >=20 >=20 > > Then, there's yet another problem: If the host page size is bigger > > than the 4k balloon page size, we can not simply call madvise() on > > each of the 4k balloon addresses that we get from the guest - since > > the madvise() always evicts the whole host page, not only a 4k area! =20 >=20 > Does it really round length up? > Wow, it does: > len =3D (len_in + ~PAGE_MASK) & PAGE_MASK; >=20 > which seems to be undocumented, but has been there forever. >=20 >=20 > > So in this case, we've got to track the 4k fragments of a host page > > and only call madvise(DONTNEED) when all fragments have been collected. > > This of course only works fine if the guest sends consecutive 4k > > fragments - which is the case in the most important scenarios that > > I try to address here (like a ppc64 guest with 64k page size that > > is running on a ppc64 host with 64k page size). In case the guest > > uses a page size that is smaller than the host page size, we might > > need to add some more additional logic here later to increase the > > probability of being able to release memory, but at least the guest > > should now not crash anymore due to unintentionally evicted pages. > >=20 > > Signed-off-by: Thomas Huth > > --- > > I've tested this patch with both, a 4k page size ppc64 guest > > and a 64k page size ppc64 guest on a 64k page size ppc64 host. > > With this patch applied, I was not able to crash to crash the > > guests anymore (the 4k guest easily crashes without this patch). > > And looking at the output of the "free" command on the host, > > the ballooning also still works as expected. > >=20 > > hw/virtio/virtio-balloon.c | 68 ++++++++++++++++++++++++++++++= ++++---- > > include/hw/virtio/virtio-balloon.h | 3 ++ > > 2 files changed, 65 insertions(+), 6 deletions(-) > >=20 > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > > index c74101e..886faa8 100644 > > --- a/hw/virtio/virtio-balloon.c > > +++ b/hw/virtio/virtio-balloon.c > > @@ -35,13 +35,56 @@ > > #include "hw/virtio/virtio-bus.h" > > #include "hw/virtio/virtio-access.h" > > =20 > > -static void balloon_page(void *addr, int deflate) > > +#define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > > +#define BALLOON_NO_CURRENT_PAGE ((void *)-1) > > + > > +static void balloon_page(VirtIOBalloon *s, void *addr, int deflate) > > { > > #if defined(__linux__) > > - if (!qemu_balloon_is_inhibited() && (!kvm_enabled() || > > - kvm_has_sync_mmu())) { > > - qemu_madvise(addr, TARGET_PAGE_SIZE, > > - deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED); > > + size_t host_page_size; > > + void *aligned_addr; > > + > > + if (qemu_balloon_is_inhibited() || (kvm_enabled() && !kvm_has_sync= _mmu())) { > > + return; > > + } > > + > > + host_page_size =3D getpagesize(); > > + if (host_page_size <=3D BALLOON_PAGE_SIZE) { > > + qemu_madvise(addr, BALLOON_PAGE_SIZE, > > + deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED= ); > > + return; > > + } > > + > > + /* Host page size > ballon page size =3D=3D> use aligned host addr= ess */ > > + aligned_addr =3D (void *)((uintptr_t)addr & ~(host_page_size - 1)); > > + if (deflate) { > > + /* MADV_WILLNEED is just a hint for the host kernel, the guest= should > > + * also be able to use the memory again without this call, so = let's > > + * only do it for the first, aligned fragment of a host page a= nd > > + * ignore it for the others. > > + */ > > + if (addr =3D=3D aligned_addr) { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_WILLN= EED); > > + } > > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > > + } else { > > + const int max_frags =3D host_page_size / BALLOON_PAGE_SIZE; > > + /* If we end up here, that means we want to evict balloon page= s, but > > + * the host's page size is bigger than the 4k pages from the b= alloon. > > + * Since madvise() only works on the granularity of the host p= age size, > > + * we've got to collect all the 4k fragments from the guest fi= rst > > + * before we can issue the MADV_DONTNEED call. > > + */ > > + if (aligned_addr !=3D s->current_addr) { > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr =3D aligned_addr; > > + } > > + set_bit((addr - aligned_addr) / BALLOON_PAGE_SIZE, s->fragment= _bits); > > + if (find_first_zero_bit(s->fragment_bits, max_frags) =3D=3D ma= x_frags) { > > + qemu_madvise(aligned_addr, host_page_size, QEMU_MADV_DONTN= EED); > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > > + } > > } > > #endif > > } > > @@ -240,7 +283,7 @@ static void virtio_balloon_handle_output(VirtIODevi= ce *vdev, VirtQueue *vq) > > /* Using memory_region_get_ram_ptr is bending the rules a = bit, but > > should be OK because we only want a single page. */ > > addr =3D section.offset_within_region; > > - balloon_page(memory_region_get_ram_ptr(section.mr) + addr, > > + balloon_page(s, memory_region_get_ram_ptr(section.mr) + ad= dr, > > !!(vq =3D=3D s->dvq)); > > memory_region_unref(section.mr); > > } > > @@ -455,6 +498,13 @@ static void virtio_balloon_device_realize(DeviceSt= ate *dev, Error **errp) > > s->dvq =3D virtio_add_queue(vdev, 128, virtio_balloon_handle_outpu= t); > > s->svq =3D virtio_add_queue(vdev, 128, virtio_balloon_receive_stat= s); > > =20 > > + if (getpagesize() > BALLOON_PAGE_SIZE) { > > + s->fragment_bits_size =3D (getpagesize() / BALLOON_PAGE_SIZE > > + + sizeof(long) * 8 - 1) / 8; > > + s->fragment_bits =3D g_malloc0(s->fragment_bits_size); > > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > > + } > > + > > reset_stats(s); > > =20 > > register_savevm(dev, "virtio-balloon", -1, 1, > > @@ -466,6 +516,7 @@ static void virtio_balloon_device_unrealize(DeviceS= tate *dev, Error **errp) > > VirtIODevice *vdev =3D VIRTIO_DEVICE(dev); > > VirtIOBalloon *s =3D VIRTIO_BALLOON(dev); > > =20 > > + g_free(s->fragment_bits); > > balloon_stats_destroy_timer(s); > > qemu_remove_balloon_handler(s); > > unregister_savevm(dev, "virtio-balloon", s); > > @@ -480,6 +531,11 @@ static void virtio_balloon_device_reset(VirtIODevi= ce *vdev) > > g_free(s->stats_vq_elem); > > s->stats_vq_elem =3D NULL; > > } > > + > > + if (s->fragment_bits) { > > + memset(s->fragment_bits, 0, s->fragment_bits_size); > > + s->current_addr =3D BALLOON_NO_CURRENT_PAGE; > > + } > > } > > =20 > > static void virtio_balloon_instance_init(Object *obj) > > diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/vir= tio-balloon.h > > index 35f62ac..04b7c0c 100644 > > --- a/include/hw/virtio/virtio-balloon.h > > +++ b/include/hw/virtio/virtio-balloon.h > > @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon { > > int64_t stats_last_update; > > int64_t stats_poll_interval; > > uint32_t host_features; > > + void *current_addr; > > + unsigned long *fragment_bits; > > + int fragment_bits_size; > > } VirtIOBalloon; > > =20 > > #endif > > --=20 > > 1.8.3.1 =20 >=20 >=20 > It looks like fragment_bits would have to be migrated. Hmm.. do they? If we just ignore migration, isn't the worse that happens that we just keep one extra page allocated. > Which is a lot of complexity. > And work arounds for specific guest behaviour are really ugly. > There are patches on-list to maintain a balloon bitmap - > that will enable fixing it cleanly. > How about we just skip madvise if host page size is > balloon > page size, for 2.6? >=20 > --=20 > MST --=20 David Gibson Senior Software Engineer, Virtualization, Red Hat --Sig_/JJyq7fg2ZYOqlQsDm6Ny2KJ Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJXDxEHAAoJEGw4ysog2bOSnPgP/1F07PEh3fYlivLBxlNt9qqa faSI67rwGe5dPENzqycS8XHtldqh7sB+T/TLaPY11x9XfI2LgFwKujBK+Ijawf9T qCnGThUlFReRzSZHvn8PuQ1YCbuvvd+0BZQvkwYol68263duTuoyqMpnDAKbu9Dc oqh8W+HfwIAXRFXfWMuSo0OsWykS0+01EhorQKJA3JZ3mjBLByGA4ZHyTTFwZe8T pTfvqwyt/gq14zcZWgqSuvjcJsAsFiq41XQjNnyiOmmuQ8lOvF9WS28j6MhMw+nY CNU19jNIa0+Su7y8sioNJT8UiOELKzhK21PLyaWMBZ58He8NA4EeBHgAXy4Ye8py xCUEOzX6kyjePF5CRDaodvpWzvXqzk/63HkOC6TAOd+c41k9rN791mBE+dE3Atmr RCbXh+2vI14TdtUzP/6JL2lBsuJR0SdazNGY4HEwqmEnpa8Wi1Xm8WTK0O0zOKsF +sdEDHHOjzpdQg/1OOQmnou4+4fSRrLkxf2ZgKJb2aES36GNICHiV5Wr8UUGHHUo DwHAeusj73iv38BU0M6xyxNboZn2pLUN4VCgqraCUwXkxXt1O55oiTMbab0IFHzv QqbVCTjX4TC2psv9jwWkM3lbOxjZd4Xr3lMXhQ27CeJtzeU/akIYyWzFkKoDPfdr X1VD4Y7iSbYGlCfJvsIL =xeKX -----END PGP SIGNATURE----- --Sig_/JJyq7fg2ZYOqlQsDm6Ny2KJ--