qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>,
	libvir-list@redhat.com, qemu-devel@nongnu.org,
	qemu-ppc@nongnu.org, aik@ozlabs.ru, abologna@redhat.com,
	laine@laine.org, pkrempa@redhat.com, mprivozn@redhat.com,
	sbhat@linux.vnet.ibm.com
Subject: Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
Date: Thu, 29 Jun 2017 13:28:11 -0600	[thread overview]
Message-ID: <20170629132811.77cd8144@w520.home> (raw)
In-Reply-To: <20170629083319.GA19941@redhat.com>

On Thu, 29 Jun 2017 09:33:19 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> > 
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).  

I've been trying to tackle this from the kernel side too, currently
Linux's driver model really neither allows vfio bus drivers to nak
unbinding a device from an in-use group nor nak binding a device
from an in-use group to an incompatible driver.  The issue you identify
in QEMU/libvirt exacerbates the problem as QEMU has not yet finalized
the device/group references before libvirt tries to unbind the device
from the vfio bus driver and attach it to a host driver.  I'd love to
solve this from both sides by allowing the kernel to prevent driver
binds that we'd consider compromising and also introduce a bit of
patience in the QEMU/libvirt path to avoid the kernel needing to impose
that driver blocking.

> Why on earth does QEMU's device finalization take 6 seconds to complete.
> That feels very broken to me, unless QEMU is not being schedled due to
> host being overcomitted. If that's not the case, then we have a bug to
> investigate in QEMU to find out why cleanup is delayed so long.

I wouldn't necessarily jump to the conclusion that this is a bug, if
it's relating to tearing down the IOMMU mappings for the device, gigs
of mappings can take non-trivial time.  Is that the scenario here?  Is
that 6s somehow proportional to guest memory size?
 
> From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> frontend has gone *and* the corresponding backend has gone. Aside from
> cleaning the VFIO group, we use this as a trigger for all other device
> related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> backend is not guaranteed to be closed in QEMU when this emit is emitted
> then either QEMU needs to delay the event until it is really cleaned up,
> or QEMU needs to add a further event to emit when the backend is clean.

Clearly libvirt and QEMU's idea of what DEVICE_DELETED means don't
align.

> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > until all the devices in the group have been detached, at which point all
> > the hostdevs are rebound as a group. Until that point, the devices are traced
> > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > assigned to VFIO via the nodedev-detach interface.

There are certainly some benefits to group-awareness here, currently
an admin user like libvirt can trigger a BUG_ON by trying to bind a
device back to a host driver when a group is still in use, at best we
might improve that to rejecting the compromising bind.  

> > Patch 5 addresses 2) by adding an additional check that, when the last device
> > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > time out, we abandon rebinding the hostdevs back to the host.  
> 
> That is just gross - it is tieing libvirt to details of the QEMU internal
> implementation. I really don't think we should be doing that. So NACK to
> this from my POV.

It seems a little silly for QEMU to emit the event while it's still in
use, clearly emitting the event at the right point would negate any
need for snooping around in proc.

> > There are a couple alternatives to Patch 5 that might be worth considering:
> > 
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> >    favor of minimal changes to libvirt's event handlers.
> > 
> >    The downsides are:
> >     - that we'd incur some latency for all device-detach calls, but it's not
> >       apparent to me whether this delay is significant for anything outside
> >       of VFIO.
> >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> >       is not guaranteed, and I'm not sure QOM would encourage such
> >       expectations even if that's currently the case.
> > 
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the handling
> >    of rebinding to host driver based on receival of this event.
> > 
> >    The downsides are:
> >     - this would only work for newer versions of QEMU, though we could use
> >       the poll-wait in patch 5 as a fallback.
> >     - synchronizing sync/async device-detach threads with sync/async
> >       handlers for this would be a bit hairy, but I have a WIP in progress
> >       that seems *fairly reasonable*
> > 
> > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> >    something else, or just sticking with that for now.
> > 
> > d) ???  
> 
> Fix DEVICE_DELETE so its only emitted when the backend associated with
> the device is fully cleaned up.

Adding a FINALIZE seems to require a two-step fix, fix QEMU then fix
libvirt, whereas moving DELETE to the correct location automatically
fixes the behavior with existing libvirt.  I don't know that a
GROUP_DELETED makes much sense, libvirt can know about groups on its
own and it just leads to a vfio specific path.  Thanks,

Alex

  parent reply	other threads:[~2017-06-29 19:28 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-29  0:24 [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate() Michael Roth
2017-06-29  0:24 ` [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group Michael Roth
2017-06-29  0:25 ` [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind Michael Roth
2017-06-29  8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
2017-06-29 18:22   ` Michael Roth
2017-06-29 19:31     ` Alex Williamson
2017-06-29 19:28   ` Alex Williamson [this message]
2017-06-29 20:21     ` Michael Roth
2017-06-29 18:50 ` Laine Stump
2017-06-29 19:44   ` Alex Williamson
2017-06-30  2:27     ` Laine Stump
2017-06-29 20:59   ` Michael Roth
2017-06-30  6:59     ` Andrea Bolognani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170629132811.77cd8144@w520.home \
    --to=alex.williamson@redhat.com \
    --cc=abologna@redhat.com \
    --cc=aik@ozlabs.ru \
    --cc=berrange@redhat.com \
    --cc=laine@laine.org \
    --cc=libvir-list@redhat.com \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=mprivozn@redhat.com \
    --cc=pkrempa@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=sbhat@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).