From: Michael Roth <mdroth@linux.vnet.ibm.com>
To: libvir-list@redhat.com
Cc: qemu-devel@nongnu.org, qemu-ppc@nongnu.org, aik@ozlabs.ru,
alex.williamson@redhat.com, abologna@redhat.com, laine@laine.org,
pkrempa@redhat.com, berrange@redhat.com, mprivozn@redhat.com,
sbhat@linux.vnet.ibm.com
Subject: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host
Date: Wed, 28 Jun 2017 19:24:55 -0500 [thread overview]
Message-ID: <1498695900-1648-1-git-send-email-mdroth@linux.vnet.ibm.com> (raw)
Hi everyone. Hoping to get some feedback on this approach, or some
alternatives proposed below, to the following issue:
Currently libvirt immediately attempts to rebind a managed device back to the
host driver when it receives a DEVICE_DELETED event from QEMU. This is
problematic for 2 reasons:
1) If multiple devices from a group are attached to a guest, this can move
the group into a "non-viable" state where some devices are assigned to
the host and some to the guest.
2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
where additional cleanup occurs. In most cases libvirt can ignore this
cleanup, but in the case of VFIO devices this is where closing of a VFIO
group FD occurs, and failing to wait before rebinding the device to the
host driver can result in unexpected behavior. In the case of powernv
hosts at least, this can lead to a host driver crashing due to the default
DMA windows not having been fully-restored yet. The window between this is
and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
seen host dumps with Mellanox CX4 VFs being rebound to host driver during
this period (on powernv hosts).
Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
until all the devices in the group have been detached, at which point all
the hostdevs are rebound as a group. Until that point, the devices are traced
by the drvManager's inactiveList in a similar manner to hostdevs that are
assigned to VFIO via the nodedev-detach interface.
Patch 5 addresses 2) by adding an additional check that, when the last device
from a group is detached, polls /proc for open FDs referencing the VFIO group
path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
time out, we abandon rebinding the hostdevs back to the host.
There are a couple alternatives to Patch 5 that might be worth considering:
a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
favor of minimal changes to libvirt's event handlers.
The downsides are:
- that we'd incur some latency for all device-detach calls, but it's not
apparent to me whether this delay is significant for anything outside
of VFIO.
- there may be cases where finalization after DEVICE_DELETE/unparent are
is not guaranteed, and I'm not sure QOM would encourage such
expectations even if that's currently the case.
b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
direct solution. With this we could completely separate out the handling
of rebinding to host driver based on receival of this event.
The downsides are:
- this would only work for newer versions of QEMU, though we could use
the poll-wait in patch 5 as a fallback.
- synchronizing sync/async device-detach threads with sync/async
handlers for this would be a bit hairy, but I have a WIP in progress
that seems *fairly reasonable*
c) Take the approach in Patch 5, either as a precursor to implementing b) or
something else, or just sticking with that for now.
d) ???
Personally I'm leaning toward c), but I'm wondering if that's "good enough"
for now, or if we should pursue something more robust from the start, or
perhaps something else entirely?
Any feedback is greatly appreciated!
src/libvirt_private.syms | 5 ++
src/qemu/qemu_hostdev.c | 16 +++++
src/qemu/qemu_hostdev.h | 4 ++
src/qemu/qemu_hotplug.c | 56 ++++++++++++++----
src/util/virfile.c | 52 +++++++++++++++++
src/util/virfile.h | 1 +
src/util/virhostdev.c | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
src/util/virhostdev.h | 16 +++++
src/util/virpci.c | 69 ++++++++++++++++++----
src/util/virpci.h | 4 ++
10 files changed, 360 insertions(+), 36 deletions(-)
next reply other threads:[~2017-06-29 0:26 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-29 0:24 Michael Roth [this message]
2017-06-29 0:24 ` [Qemu-devel] [RFC PATCH 1/5] virhostdev: factor release out from reattach and export it for use later Michael Roth
2017-06-29 0:24 ` [Qemu-devel] [RFC PATCH 2/5] qemu_hotplug: squash qemuDomainRemovePCIHostDevice into caller Michael Roth
2017-06-29 0:24 ` [Qemu-devel] [RFC PATCH 3/5] virpci: introduce virPCIIOMMUGroupIterate() Michael Roth
2017-06-29 0:24 ` [Qemu-devel] [RFC PATCH 4/5] qemu: hotplug: unbind VFIO devices as a group Michael Roth
2017-06-29 0:25 ` [Qemu-devel] [RFC PATCH 5/5] qemu: hotplug: wait for VFIO group FD close before unbind Michael Roth
2017-06-29 8:33 ` [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host Daniel P. Berrange
2017-06-29 18:22 ` Michael Roth
2017-06-29 19:31 ` Alex Williamson
2017-06-29 19:28 ` Alex Williamson
2017-06-29 20:21 ` Michael Roth
2017-06-29 18:50 ` Laine Stump
2017-06-29 19:44 ` Alex Williamson
2017-06-30 2:27 ` Laine Stump
2017-06-29 20:59 ` Michael Roth
2017-06-30 6:59 ` Andrea Bolognani
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1498695900-1648-1-git-send-email-mdroth@linux.vnet.ibm.com \
--to=mdroth@linux.vnet.ibm.com \
--cc=abologna@redhat.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=berrange@redhat.com \
--cc=laine@laine.org \
--cc=libvir-list@redhat.com \
--cc=mprivozn@redhat.com \
--cc=pkrempa@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
--cc=sbhat@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).