From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49010) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqYCb-0007NF-94 for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:44:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aqYCX-0006TR-LY for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:44:21 -0400 Received: from mx1.redhat.com ([209.132.183.28]:34309) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aqYCX-0006TM-Dr for qemu-devel@nongnu.org; Wed, 13 Apr 2016 23:44:17 -0400 Date: Thu, 14 Apr 2016 13:45:32 +1000 From: David Gibson Message-ID: <20160414134532.23802ad5@voom.fritz.box> In-Reply-To: <20160413181444.GB14022@redhat.com> References: <1460548364-27469-1-git-send-email-thuth@redhat.com> <20160413145835-mutt-send-email-mst@redhat.com> <570E5D05.2030507@redhat.com> <20160413200618-mutt-send-email-mst@redhat.com> <570E8404.10503@redhat.com> <20160413205524-mutt-send-email-mst@redhat.com> <570E8BD8.9000008@redhat.com> <20160413181444.GB14022@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; boundary="Sig_/njy_7JjJ9yNl+DthYE4i09X"; protocol="application/pgp-signature" Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Michael S. Tsirkin" Cc: Thomas Huth , drjones@redhat.com, qemu-devel@nongnu.org, dgilbert@redhat.com, jitendra.kolhe@hpe.com, wehuang@redhat.com, amit.shah@redhat.com --Sig_/njy_7JjJ9yNl+DthYE4i09X Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Wed, 13 Apr 2016 21:14:44 +0300 "Michael S. Tsirkin" wrote: > On Wed, Apr 13, 2016 at 08:11:36PM +0200, Thomas Huth wrote: > > On 13.04.2016 19:55, Michael S. Tsirkin wrote: =20 > > > On Wed, Apr 13, 2016 at 07:38:12PM +0200, Thomas Huth wrote: =20 > > >> On 13.04.2016 19:07, Michael S. Tsirkin wrote: =20 > > >>> On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote: =20 > > >>>> On 13.04.2016 15:15, Michael S. Tsirkin wrote: =20 > > >>>>> On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote: =20 > > >> ... =20 > > >>>>>> Then, there's yet another problem: If the host page size is bigg= er > > >>>>>> than the 4k balloon page size, we can not simply call madvise() = on > > >>>>>> each of the 4k balloon addresses that we get from the guest - si= nce > > >>>>>> the madvise() always evicts the whole host page, not only a 4k a= rea! > > >>>>>> > > >>>>>> So in this case, we've got to track the 4k fragments of a host p= age > > >>>>>> and only call madvise(DONTNEED) when all fragments have been col= lected. > > >>>>>> This of course only works fine if the guest sends consecutive 4k > > >>>>>> fragments - which is the case in the most important scenarios th= at > > >>>>>> I try to address here (like a ppc64 guest with 64k page size that > > >>>>>> is running on a ppc64 host with 64k page size). In case the guest > > >>>>>> uses a page size that is smaller than the host page size, we mig= ht > > >>>>>> need to add some more additional logic here later to increase the > > >>>>>> probability of being able to release memory, but at least the gu= est > > >>>>>> should now not crash anymore due to unintentionally evicted page= s. =20 > > >>>> ... =20 > > >>>>>> static void virtio_balloon_instance_init(Object *obj) > > >>>>>> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/vir= tio/virtio-balloon.h > > >>>>>> index 35f62ac..04b7c0c 100644 > > >>>>>> --- a/include/hw/virtio/virtio-balloon.h > > >>>>>> +++ b/include/hw/virtio/virtio-balloon.h > > >>>>>> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon { > > >>>>>> int64_t stats_last_update; > > >>>>>> int64_t stats_poll_interval; > > >>>>>> uint32_t host_features; > > >>>>>> + void *current_addr; > > >>>>>> + unsigned long *fragment_bits; > > >>>>>> + int fragment_bits_size; > > >>>>>> } VirtIOBalloon; > > >>>>>> =20 > > >>>>>> #endif =20 > > >>>>> > > >>>>> It looks like fragment_bits would have to be migrated. > > >>>>> Which is a lot of complexity. =20 > > >> ... =20 > > >>>>> How about we just skip madvise if host page size is > balloon > > >>>>> page size, for 2.6? =20 > > >>>> > > >>>> That would mean a regression compared to what we have today. Curre= ntly, > > >>>> the ballooning is working OK for 64k guests on a 64k ppc host - ra= ther > > >>>> by chance than on purpose, but it's working. The guest is always s= ending > > >>>> all the 4k fragments of a 64k page, and QEMU is trying to call mad= vise() > > >>>> for every one of them, but the kernel is ignoring madvise() on > > >>>> non-64k-aligned addresses, so we end up with a situation where the > > >>>> madvise() frees a whole 64k page which is also declared as free by= the > > >>>> guest. > > >>>> > > >>>> I think we should either take this patch as it is right now (witho= ut > > >>>> adding extra code for migration) and later update it to the bitmap= code > > >>>> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken= ) and > > >>>> fix it properly after the bitmap code has been applied. But disabl= ing > > >>>> the balloon code for 64k guests on 64k hosts completely does not s= ound > > >>>> very appealing to me. What do you think? =20 > > >>> > > >>> True. As simple a hack - how about disabling madvise when host page= size > > > >>> target page size? =20 > > >> > > >> That could work - but is there a generic way in QEMU to get the curr= ent > > >> page size from a guest (since this might differ from TARGET_PAGE_SIZ= E)? > > >> Or would that mean to pollute the virtio-balloon code with ugly #ifd= efs? =20 > > >=20 > > > let's just use TARGET_PAGE_SIZE, that's the best I can think of. =20 > >=20 > > That won't work - at least not on ppc: TARGET_PAGE_SIZE is always > > defined to 4096 here. The Linux kernel then switches the real page size > > during runtime to 65536. So we'd need a way to detect this automaticall= y... > >=20 > > Thomas =20 >=20 > I see. I don't know how to do this. If we can't find a quick fix, leave > this part around for now then? Just fix the page size. I'm not sure what you're suggesting by "fix the page size". TARGET_PAGE_SIZE remains 4kiB because that's the default hardware page size. However, modern Linux guests always use 64kiB pages in practice. This isn't a global setting, it just means it uses 64kiB pages for every mapping it establishes, so there's really no sane way for qemu to determine the "guest page size" because that's not really a well-defined concept. Even having some flag that's cleared if the guest ever makes a < 64kiB mapping won't work, because I think there are a few edge cases where a "64kiB" guest which uses 64kiB pages for all RAM mappings will use 4kiB pages for certain IO mappings. --=20 David Gibson Senior Software Engineer, Virtualization, Red Hat --Sig_/njy_7JjJ9yNl+DthYE4i09X Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJXDxJcAAoJEGw4ysog2bOS4swP/jDAlxq5CTmphPSmDBxwwDou INxVj2uN/4ZaRo7LKhVRggMIvOaH73q6DZ7NUJz2hjF3HnVQk2ZCBaIHHnUGZarw iSguYFGQix8H11UkVwXTaBggK9L3APWz/Qc5dYy9UIO3mfnwxipgQA26OIOJTEsN Tg1nQ2bszO+Hnj4VymWAcmdlFpE1SawG74y9NKPcPSinh8ng0XePxLU76xoL/TLr YwrrMDHt8v3o/3AzwiAm2dFZnCnmdaVoRaQt+CD74GnxfPqYukle1La2NfH6ZnBl 7j97Qs6hTHOTFtPloBbVCZOkZaA9ukIvHKRYgTgVLhy4d8PBgY2mgA33CtDgsx9w e0Q0Sdk+gEDKuudNIZRLQFpRryR0VilN2mU6oyrndaLkl3TB95NVSTMvJizawGtP U+g7KRsORPGaEibfdKFYD70Bf530nnGSok2k9VbjSTVp0v/W3vUK8GuVT+0KQE4M g9kqj5sfoVT07HlpxOeiFwjxR64V7r3pmYdJzHQIDghyWtV6cHyYXrnzJg1rorLj N4u3NfDOQQjdNgVKX4zumJ7XI8npiEobQNotR4AnvoBIFXAeRwMGCnmk8L/o/mjV GOK/9G1Bif6FSE7kpWKS9SlYO3s399WWJh7QbGKXso9YVWpaedP6qTveKaRhmcFn 8PhknApB297CTH8g6nyX =wz0k -----END PGP SIGNATURE----- --Sig_/njy_7JjJ9yNl+DthYE4i09X--