From: "Michael S. Tsirkin" <mst@redhat.com>
To: Thomas Huth <thuth@redhat.com>
Cc: qemu-devel@nongnu.org, dgibson@redhat.com, dgilbert@redhat.com,
wehuang@redhat.com, drjones@redhat.com, amit.shah@redhat.com,
jitendra.kolhe@hpe.com
Subject: Re: [Qemu-devel] [PATCH] hw/virtio/balloon: Fixes for different host page sizes
Date: Wed, 13 Apr 2016 20:07:51 +0300
Message-ID: <20160413200618-mutt-send-email-mst@redhat.com>
In-Reply-To: <570E5D05.2030507@redhat.com>
On Wed, Apr 13, 2016 at 04:51:49PM +0200, Thomas Huth wrote:
> On 13.04.2016 15:15, Michael S. Tsirkin wrote:
> > On Wed, Apr 13, 2016 at 01:52:44PM +0200, Thomas Huth wrote:
> >> The balloon code currently calls madvise() with TARGET_PAGE_SIZE
> >> as length parameter, and an address which is directly based on
> >> the page address supplied by the guest. Since the virtio-balloon
> >> protocol is always based on 4k based addresses/sizes, no matter
> >> what the host and guest are using as page size, this has a couple
> >> of issues which could even lead to data loss in the worst case.
> >>
> >> TARGET_PAGE_SIZE might not be 4k, so it is wrong to use that
> >> value for the madvise() call. If TARGET_PAGE_SIZE is bigger than
> >> 4k, we also destroy the 4k areas after the current one - which
> >> might be wrong since the guest did not want to free that area yet (in
> >> case the guest used a smaller MMU page size than the hard-coded
> >> TARGET_PAGE_SIZE). So to fix this issue, introduce a proper define
> >> called BALLOON_PAGE_SIZE (which is 4096) to use this as the size
> >> parameter for the madvise() call instead.
> >
> > this makes absolute sense.
> >
> >> Then, there's yet another problem: If the host page size is bigger
> >> than the 4k balloon page size, we can not simply call madvise() on
> >> each of the 4k balloon addresses that we get from the guest - since
> >> the madvise() always evicts the whole host page, not only a 4k area!
> >
> > Does it really round length up?
> > Wow, it does:
> > len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> >
> > which seems to be undocumented, but has been there forever.
>
> Yes, that's ugly - I also had to take a look at the kernel sources to
> understand what this call is supposed to do when being called with a
> size < PAGE_SIZE.
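For the record, the rounding is easy to reproduce outside the kernel. This is only a rough standalone sketch of the expression quoted above; round_len_to_page() is a made-up name and the page size is passed in as a parameter instead of coming from PAGE_MASK:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not kernel code: applies the same rounding as
 *     len = (len_in + ~PAGE_MASK) & PAGE_MASK;
 * for an arbitrary power-of-two page size. */
static size_t round_len_to_page(size_t len_in, size_t page_size)
{
    /* The mask trick only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    size_t page_mask = ~(page_size - 1);      /* plays the role of PAGE_MASK */
    return (len_in + ~page_mask) & page_mask; /* round up to a page multiple */
}
```

With a 64k page size, round_len_to_page(4096, 65536) gives 65536, so a 4k request covers the whole host page. With a 4k page size the length stays at 4096.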
>
> >> So in this case, we've got to track the 4k fragments of a host page
> >> and only call madvise(DONTNEED) when all fragments have been collected.
> >> This of course only works fine if the guest sends consecutive 4k
> >> fragments - which is the case in the most important scenarios that
> >> I try to address here (like a ppc64 guest with 64k page size that
> >> is running on a ppc64 host with 64k page size). In case the guest
> >> uses a page size that is smaller than the host page size, we might
> >> need to add some additional logic here later to increase the
> >> probability of being able to release memory, but at least the guest
> >> should now not crash anymore due to unintentionally evicted pages.
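To make sure I read the idea correctly, here is a minimal sketch of the fragment tracking described above. It is not the code from the patch: FragmentTracker, tracker_init() and tracker_add() are hypothetical names, the bitmap is a single 64-bit word (so at most 64 fragments per host page), and instead of calling madvise() the function only reports when a host page is complete:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BALLOON_PAGE_SIZE 4096u

/* Hypothetical tracker, not the structure from the patch: remembers which
 * 4k fragments of one host page the guest has reported so far. */
typedef struct FragmentTracker {
    uintptr_t host_page_size; /* power of two, at most 64 * 4k in this sketch */
    uintptr_t current_page;   /* base address of the host page being collected */
    uint64_t  seen;           /* bit i set: fragment i has been reported */
    bool      active;         /* false until the first fragment arrives */
} FragmentTracker;

static void tracker_init(FragmentTracker *t, uintptr_t host_page_size)
{
    assert(host_page_size >= BALLOON_PAGE_SIZE);
    assert((host_page_size & (host_page_size - 1)) == 0);
    assert(host_page_size / BALLOON_PAGE_SIZE <= 64);
    t->host_page_size = host_page_size;
    t->current_page = 0;
    t->seen = 0;
    t->active = false;
}

/* Record one 4k balloon address. Returns true once every fragment of the
 * enclosing host page has been seen, i.e. when the caller could issue
 * madvise(MADV_DONTNEED) on the whole host page. */
static bool tracker_add(FragmentTracker *t, uintptr_t addr)
{
    uintptr_t base = addr & ~(t->host_page_size - 1);
    unsigned nfrags = (unsigned)(t->host_page_size / BALLOON_PAGE_SIZE);
    unsigned idx = (unsigned)((addr - base) / BALLOON_PAGE_SIZE);
    uint64_t full = nfrags == 64 ? UINT64_MAX
                                 : ((UINT64_C(1) << nfrags) - 1);

    if (!t->active || base != t->current_page) {
        /* Guest moved on to another host page: drop the partial state. */
        t->current_page = base;
        t->seen = 0;
        t->active = true;
    }

    t->seen |= UINT64_C(1) << idx;
    if (t->seen == full) {
        t->seen = 0;
        t->active = false;
        return true;
    }
    return false;
}
```

With a 64k host page, sixteen consecutive 4k addresses from one host page make the last call return true. If the guest jumps to a different host page first, the partial state is dropped and nothing is released, which is the "might miss some memory" case mentioned above.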
> ...
> >> static void virtio_balloon_instance_init(Object *obj)
> >> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> >> index 35f62ac..04b7c0c 100644
> >> --- a/include/hw/virtio/virtio-balloon.h
> >> +++ b/include/hw/virtio/virtio-balloon.h
> >> @@ -43,6 +43,9 @@ typedef struct VirtIOBalloon {
> >> int64_t stats_last_update;
> >> int64_t stats_poll_interval;
> >> uint32_t host_features;
> >> + void *current_addr;
> >> + unsigned long *fragment_bits;
> >> + int fragment_bits_size;
> >> } VirtIOBalloon;
> >>
> >> #endif
> >> --
> >> 1.8.3.1
> >
> >
> > It looks like fragment_bits would have to be migrated.
> > Which is a lot of complexity.
>
> I think we could simply omit this for now. In case somebody migrates the
> VM while the ballooning is going on, we'd lose the information for one
> host page, so we might miss one madvise(DONTNEED), but I think we could
> live with that.
>
> > And workarounds for specific guest behaviour are really ugly.
> > There are patches on-list to maintain a balloon bitmap -
> > that will enable fixing it cleanly.
>
> Ah, good to know, I wasn't aware of them yet, so that will be a chance
> for a really proper final solution, I hope.
>
> > How about we just skip madvise if host page size is > balloon
> > page size, for 2.6?
>
> That would mean a regression compared to what we have today. Currently,
> the ballooning is working OK for 64k guests on a 64k ppc host - rather
> by chance than on purpose, but it's working. The guest is always sending
> all the 4k fragments of a 64k page, and QEMU is trying to call madvise()
> for every one of them, but the kernel is ignoring madvise() on
> non-64k-aligned addresses, so we end up with a situation where the
> madvise() frees a whole 64k page which is also declared as free by the
> guest.
>
> I think we should either take this patch as it is right now (without
> adding extra code for migration) and later update it to the bitmap code
> by Jitendra Kolhe, or omit it completely (leaving 4k guests broken) and
> fix it properly after the bitmap code has been applied. But disabling
> the balloon code for 64k guests on 64k hosts completely does not sound
> very appealing to me. What do you think?
>
> Thomas
True. As a simple hack - how about disabling madvise when host page size >
target page size?
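Roughly something like this, as an untested sketch. balloon_may_madvise() is a made-up name, and both page sizes are plain parameters here, since how they would be obtained inside QEMU is left open:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate for the hack suggested above: only allow the
 * madvise() call when the host page is not larger than the target page,
 * so that discarding one host page never reaches past the target page. */
static bool balloon_may_madvise(size_t host_page_size, size_t target_page_size)
{
    return host_page_size <= target_page_size;
}
```

So equal page sizes keep working as they do today, and a 64k host page with a 4k target page would skip the call.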
--
MST