From: Mark McLoughlin <markmc@redhat.com>
To: kvm <kvm@vger.kernel.org>,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Cc: Chris Wright <chrisw@redhat.com>,
"Dugger, Donald D" <donald.d.dugger@intel.com>,
"Kay, Allen M" <allen.m.kay@intel.com>
Subject: KVM PCI device assignment issues
Date: Fri, 13 Feb 2009 16:32:47 +0000
Message-ID: <1234542767.23746.81.camel@blaa>
Hi,
KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:
- Unbinding devices from any existing device driver before assignment
- Resetting devices before and after assignment
- Helping users figure out which devices can actually be assigned
This gets confusing, so some background constraints first:
- Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
bridge must be assigned to the same VT-d domain - i.e. given device
A (0000:0f:1.0) and device B (0000:0f:2.0), if you assign
device A to a guest, you cannot then use device B in the host or
another guest.
- Some newer PCIe devices (and newer conventional PCI devices too via
PCI Advanced Features) support Function Level Reset (FLR). This
allows a PCI function to be reset without affecting any other
functions on that device, or any other devices. This feature is not
widespread yet AFAIK - e.g. I've seen it on an audio controller,
and it must also be supported by SR-IOV devices.
- Secondary Bus Reset (SBR) allows software to trigger a reset on all
devices (and functions) behind a PCI bridge.
- A PCI Power Management D-state transition (D3hot to D0) can be used
to reset a device (all functions).
- Some PCI devices don't have page aligned MMIO BARs. These devices
(all functions) cannot be safely assigned to guests.
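As a rough illustration of the last constraint, the sketch below checks
BAR alignment from userspace by reading a device's sysfs "resource" file
(one "start end flags" line per resource, the first six being the BARs).
The function name bars_page_aligned, the 4096 byte default page size and
the choice to test both the start address and the size of each memory
BAR are assumptions of this sketch.

```sh
#!/bin/sh
# Return 0 if every memory BAR listed in a sysfs "resource" file is
# page aligned, 1 otherwise.
# Usage: bars_page_aligned <resource-file> [page_size]
# (hypothetical helper; the page size defaults to 4096 as an assumption)
bars_page_aligned() {
    bpa_file=$1
    bpa_page=${2:-4096}
    bpa_line=0
    while read -r start end flags; do
        bpa_line=$((bpa_line + 1))
        # Only the first six entries are the standard BARs.
        [ "$bpa_line" -gt 6 ] && break
        # Skip unused BARs and I/O port BARs (IORESOURCE_MEM is 0x200).
        [ $(($flags & 0x200)) -eq 0 ] && continue
        size=$(($end - $start + 1))
        if [ $(($start % bpa_page)) -ne 0 ] ||
           [ $((size % bpa_page)) -ne 0 ]; then
            return 1
        fi
    done < "$bpa_file"
    return 0
}
```

It could be called as e.g.
bars_page_aligned /sys/bus/pci/devices/0000:00:19.0/resource
and, since the constraint applies to all functions, would need to be
repeated for each function of the device.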
Driver Unbinding
================
Before a device is assigned to a guest, we should make sure that no host
device driver is currently bound to the device.
We can do that with e.g.
$> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
$> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
$> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
One minor problem with this scheme is that, once the dynamic ID has been
added, you can't unbind from pci-stub, trigger a re-probe and have
e1000e bind to the device again, since pci-stub still matches it via
that ID. In order to support that, we need a "remove_id" interface to
remove the dynamic ID.
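The bind-to-pci-stub sequence above can be generalised to any device by
reading the IDs and the current driver from sysfs rather than hard-coding
them. This is only a sketch: the function name stub_device and the
optional sysfs root argument (handy for trying it against a scratch
directory) are my own inventions, and it does nothing about the
"remove_id" problem.

```sh
#!/bin/sh
# Hand a PCI device over to pci-stub.
# Usage: stub_device <domain:bus:slot.func> [sysfs_root]
# (hypothetical helper; sysfs_root defaults to /sys)
stub_device() {
    sd_bdf=$1
    sd_sysfs=${2:-/sys}
    sd_dev="$sd_sysfs/bus/pci/devices/$sd_bdf"
    sd_stub="$sd_sysfs/bus/pci/drivers/pci-stub"

    [ -d "$sd_dev" ] || { echo "no such device: $sd_bdf" >&2; return 1; }
    [ -d "$sd_stub" ] || { echo "pci-stub is not loaded" >&2; return 1; }

    # Nothing to do if pci-stub already owns the device.
    sd_cur=$(readlink "$sd_dev/driver" 2>/dev/null)
    [ "${sd_cur##*/}" = "pci-stub" ] && return 0

    # sysfs reports e.g. "0x8086"; new_id expects "8086 10de".
    sd_vendor=$(cat "$sd_dev/vendor") || return 1
    sd_device=$(cat "$sd_dev/device") || return 1
    printf '%s %s' "${sd_vendor#0x}" "${sd_device#0x}" \
        > "$sd_stub/new_id" || return 1

    # Unbind from whichever host driver currently owns the device.
    if [ -n "$sd_cur" ]; then
        printf '%s' "$sd_bdf" > "$sd_dev/driver/unbind" || return 1
    fi
    printf '%s' "$sd_bdf" > "$sd_stub/bind"
}
```

Run as root, "stub_device 0000:00:19.0" would perform the same three
writes as the commands above.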
What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc.
Permanent unbinding isn't necessarily needed, but it might help provide
a solution to some of the nastier issues below.
Device Reset
============
Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. In
testing, skipping the reset has led to e.g. "TX Unit Hang" errors
with e1000e devices.
FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.
If we're assigning devices from behind a PCI/PCI-X bridge (remember all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that all devices
behind that bridge are not in use before doing the reset. We could
implement this with a "reset" sysfs interface for pci-stub - it would
only reset a device using SBR if all devices behind that bridge were
bound to pci-stub.
Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.
Furthermore, we would need to prevent pci-stub from resetting a device
it is bound to where the device is already assigned to a guest. To
achieve this, we would want KVM to explicitly call in to pci-stub to
mark a device as in use.
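To make the proposed policy concrete, here is a userspace sketch of the
decision only; it performs no reset. Everything in it is hypothetical:
the function names, passing FLR support and "behind a PCI/PCI-X bridge"
as yes/no arguments instead of probing for them, and treating "same
domain:bus prefix" as "behind the same bridge" (which ignores any
subordinate buses). It also does not model the "in use by a guest"
marking described above.

```sh
#!/bin/sh
# Return 0 if every PCI device whose address starts with <prefix> is
# bound to pci-stub. Unbound devices count as "not pci-stub".
all_bound_to_stub() {
    ab_prefix=$1
    ab_sysfs=${2:-/sys}
    for ab_dev in "$ab_sysfs"/bus/pci/devices/"$ab_prefix"*; do
        [ -e "$ab_dev" ] || continue
        ab_drv=$(readlink "$ab_dev/driver" 2>/dev/null) || return 1
        [ "${ab_drv##*/}" = "pci-stub" ] || return 1
    done
    return 0
}

# Print the reset method the policy would pick: flr, sbr, d-state or none.
# Usage: pick_reset <bdf> <has_flr: yes|no> <behind_pci_bridge: yes|no>
#                   [sysfs_root]
pick_reset() {
    pr_bdf=$1
    pr_flr=$2
    pr_bridge=$3
    pr_sysfs=${4:-/sys}
    if [ "$pr_flr" = yes ]; then
        # FLR affects only this function, so no sibling checks are needed.
        echo flr
    elif [ "$pr_bridge" = yes ] &&
         all_bound_to_stub "${pr_bdf%:*}:" "$pr_sysfs"; then
        # SBR hits everything on the bus: all of it must be in pci-stub.
        echo sbr
    elif [ "$pr_bridge" != yes ] &&
         all_bound_to_stub "${pr_bdf%.*}." "$pr_sysfs"; then
        # D3hot->D0 hits all functions of the device.
        echo d-state
    else
        echo none
        return 1
    fi
}
```

For example, "pick_reset 0000:0f:01.0 no yes" would only print "sbr"
once every device on bus 0000:0f is bound to pci-stub.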
The alternatives to such an approach are:
a) Only support FLR capable devices
b) Cross our fingers and hope that devices work without a device reset
c) Allow a driver to be permanently unbound from a device and require
the user to reboot after unbinding before assigning
Filtering
=========
In order to support a sane user interface in management tools, it should
be possible to list all PCI devices available on a host and filter
out those which cannot be assigned to a guest.
Furthermore, it should be possible to do this without actually affecting
any of the devices - i.e. a "try to unbind and see if we oops" approach
clearly isn't great.
Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.
This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".
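For what it's worth, the read-only part of this is easy to do from
userland today. The sketch below lists each PCI device with its current
driver without touching the device, and skips anything named in a
reservation file. The function name list_candidates and the reservation
file (one PCI address per line, maintained by the management tool) are
hypothetical; the assignability checks themselves would still have to
be layered on top.

```sh
#!/bin/sh
# Print "<bdf> <driver>" for every PCI device that is not reserved.
# Devices without a driver are printed with "-".
# Usage: list_candidates <reserved_file> [sysfs_root]
# (hypothetical helper; reserved_file lists one PCI address per line)
list_candidates() {
    lc_reserved=$1
    lc_sysfs=${2:-/sys}
    for lc_dev in "$lc_sysfs"/bus/pci/devices/*; do
        [ -e "$lc_dev" ] || continue
        lc_bdf=${lc_dev##*/}
        # Skip devices reserved for a (possibly inactive) guest.
        if [ -f "$lc_reserved" ] &&
           grep -qxF -- "$lc_bdf" "$lc_reserved"; then
            continue
        fi
        # Only read the driver symlink; nothing is unbound or reset.
        if lc_drv=$(readlink "$lc_dev/driver" 2>/dev/null); then
            lc_drv=${lc_drv##*/}
        else
            lc_drv=-
        fi
        printf '%s %s\n' "$lc_bdf" "$lc_drv"
    done
}
```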
Conclusions
===========
Only supporting devices with FLR restricts our user pool far too
severely.
Permanent unbinding is not supportable.
SBR and D-state reset support is doable with the addition of a "reset"
interface to pci-stub and some logic to check that a reset does not
affect devices not already bound to pci-stub.
KVM would need to be able to mark pci-stub bound devices as in use when
assigned to a guest.
We need the opposite to "new_id" to allow dynids to be removed.
The filtering abilities available to userland via kernel interfaces will
be limited. Further logic will need to be implemented in userland.
Cheers,
Mark.