Re: [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, cohuck@redhat.com,
	mst@redhat.com, david@redhat.com
Subject: Re: [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container
Date: Wed, 8 Aug 2018 11:22:19 +0800	[thread overview]
Message-ID: <20180808032219.GA24415@xz-mi> (raw)
In-Reply-To: <20180807103505.4c182f4d@t450s.home>

On Tue, Aug 07, 2018 at 10:35:05AM -0600, Alex Williamson wrote:
> On Tue, 7 Aug 2018 21:10:21 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote:
> > > We use a VFIOContainer to associate an AddressSpace to one or more
> > > VFIOGroups.  The VFIOContainer represents the DMA context for that
> > > AdressSpace for those VFIOGroups and is synchronized to changes in
> > > that AddressSpace via a MemoryListener.  For IOMMU backed devices,
> > > maintaining the DMA context for a VFIOGroup generally involves
> > > pinning a host virtual address in order to create a stable host
> > > physical address and then mapping a translation from the associated
> > > guest physical address to that host physical address into the IOMMU.
> > > 
> > > While the above maintains the VFIOContainer synchronized to the QEMU
> > > memory API of the VM, memory ballooning occurs outside of that API.
> > > Inflating the memory balloon (ie. cooperatively capturing pages from
> > > the guest for use by the host) simply uses MADV_DONTNEED to "zap"
> > > pages from QEMU's host virtual address space.  The page pinning and
> > > IOMMU mapping above remains in place, negating the host's ability to
> > > reuse the page, but the host virtual to host physical mapping of the
> > > page is invalidated outside of QEMU's memory API.
> > > 
> > > When the balloon is later deflated, attempting to cooperatively
> > > return pages to the guest, the page is simply freed by the guest
> > > balloon driver, allowing it to be used in the guest and incurring a
> > > page fault when that occurs.  The page fault maps a new host physical
> > > page backing the existing host virtual address, meanwhile the
> > > VFIOContainer still maintains the translation to the original host
> > > physical address.  At this point the guest vCPU and any assigned
> > > devices will map different host physical addresses to the same guest
> > > physical address.  Badness.
> > > 
> > > The IOMMU typically does not have page level granularity with which
> > > it can track this mapping without also incurring inefficiencies in
> > > using page size mappings throughout.  MMU notifiers in the host
> > > kernel also provide indicators for invalidating the mapping on
> > > balloon inflation, not for updating the mapping when the balloon is
> > > deflated.  For these reasons we assume a default behavior that the
> > > mapping of each VFIOGroup into the VFIOContainer is incompatible
> > > with memory ballooning and increment the balloon inhibitor to match
> > > the attached VFIOGroups.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  hw/vfio/common.c |    5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index fb396cf00ac4..4881b691a659 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -32,6 +32,7 @@
> > >  #include "hw/hw.h"
> > >  #include "qemu/error-report.h"
> > >  #include "qemu/range.h"
> > > +#include "sysemu/balloon.h"
> > >  #include "sysemu/kvm.h"
> > >  #include "trace.h"
> > >  #include "qapi/error.h"
> > > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >              group->container = container;
> > >              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> > >              vfio_kvm_device_add_group(group);
> > > +            qemu_balloon_inhibit(true);  
> > 
> > [1]
> > 
> > >              return 0;
> > >          }
> > >      }
> > > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
> > >      }
> > >  
> > >      vfio_kvm_device_add_group(group);
> > > +    qemu_balloon_inhibit(true);  
> > 
> > AFAIU there is a very critical information that this
> > qemu_balloon_inhibit() call must be before the call to:
> > 
> >     memory_listener_register(&container->listener, container->space->as);
> > 
> > Since the memory listener registeration is the point when we do the
> > pinning of the pages.  So to make sure we won't have stale pages we
> > must call qemu_balloon_inhibit() before memory_listener_register()
> > (which is what this patch does).  However this is not that obvious,
> > not sure whether that might worth a comment.
> > 
> > Considering this, not sure whether we can just do this per-container
> > instead of per-group, then we also don't need to bother with extra
> > group-add paths like [1].
> > 
> > No matter what, this patch looks good to me (and it is correct AFAIK),
> > so I'm leaving r-b and I'll leave Alex to decide:
> 
> Thanks Peter.  I agree, I'll add more commentary.  A minor correction,
> we won't have "stale" pages at the time we pin, the act of pinning will
> make those valid (same as I discussed in reply to mst why we don't need
> to worry about pages ballooned before the device is added), but once a
> page is pinned, we need to make sure it's not madvised dontneed, so we
> need to be sure there's no possible race there, which effectively means
> inhibiting before the memory listener can do any pinning.

Yes.

> 
> The reason I chose to inhibit per group is that it becomes easier to
> allow endpoint drivers to opt-in.  For instance if we could have ccw
> and vfio-pci in the same VM, they would by default share a container.
> If ccw releases the inhibit, we'd need to somehow reinstate it for the
> vfio-pci device and remember which did what if one is hot unplugged.
> Doing the inhibit at the group level resolves this, the ccw group adds
> an inhibit by default, then releases it, the vfio-pci group adds an
> inhibit and maintains it so long as attached.  I struggled with whether
> this should actually be a per-device inhibit, but then there's a gap
> that the container listener is active before the device is retrieved,
> so again the per-group inhibit was a better fit.  Thanks,

Thanks for explaining.  I didn't look into the ccw patch, but it
sounds reasonable to me now.

Regards,

-- 
Peter Xu

next prev parent reply	other threads:[~2018-08-08  3:22 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-30 23:13 [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction Alex Williamson
2018-07-30 23:13 ` [Qemu-devel] [PATCH v2 1/4] balloon: Allow nested inhibits Alex Williamson
2018-07-31  8:25   ` David Hildenbrand
2018-08-07 12:56   ` Peter Xu
2018-08-07 14:20   ` Cornelia Huck
2018-07-30 23:14 ` [Qemu-devel] [PATCH v2 2/4] kvm: Use inhibit to prevent ballooning without synchronous mmu Alex Williamson
2018-07-31  8:24   ` David Hildenbrand
2018-08-07 12:56   ` Peter Xu
2018-08-07 14:24   ` Cornelia Huck
2018-07-30 23:14 ` [Qemu-devel] [PATCH v2 3/4] vfio: Inhibit ballooning based on group attachment to a container Alex Williamson
2018-08-07 13:10   ` Peter Xu
2018-08-07 16:35     ` Alex Williamson
2018-08-08  3:22       ` Peter Xu [this message]
2018-07-30 23:14 ` [Qemu-devel] [PATCH v2 4/4] vfio/ccw/pci: Allow devices to opt-in for ballooning Alex Williamson
2018-08-07 14:15   ` Cornelia Huck
2018-07-31 12:29 ` [Qemu-devel] [PATCH v2 0/4] Balloon inhibit enhancements, vfio restriction Michael S. Tsirkin
2018-07-31 14:44   ` Alex Williamson
2018-07-31 15:07     ` Dr. David Alan Gilbert
2018-07-31 21:50       ` Alex Williamson
2018-08-03 18:42         ` Michael S. Tsirkin
2018-08-03 20:12           ` Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180808032219.GA24415@xz-mi \
    --to=peterx@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=david@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).