From: Laszlo Ersek <lersek@redhat.com>
To: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device
Date: Wed, 01 Jun 2011 12:05:44 +0200 [thread overview]
Message-ID: <4DE60EF8.5060902@redhat.com> (raw)
Hi,
this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the situation captured in RHBZ#688673, in a nutshell:
- let's say we have an Intel 82576 card (igb),
- the card exports virtual functions (igbvf),
- one virtual function is passed through to a PV guest,
- the igbvf driver, co-operating with pcifront, pciback, and the hypervisor, sets up three MSI-X vectors for the virtual function PCI device,
- when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors() during shutdown,
- therefore configuration of the VF during the next boot of such a guest doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are already allocated/mapped for the device)
dom0 (msix_capability_init(), output beautified a bit):
msix entry 0 for dev 01:10:0 are not freed before acquire again.
msix entry 1 for dev 01:10:0 are not freed before acquire again.
msix entry 2 for dev 01:10:0 are not freed before acquire again.
guest:
Determining IP information for eth1...
Failed to obtain physical IRQ 255
Failed to obtain physical IRQ 254
Failed to obtain physical IRQ 253
- if the device or the full igbvf module is removed before shutdown in the guest (in case the boot was successful to begin with!), then pci_disable_msix() is called and everything works fine; the next boot / ifup succeeds.
I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is not an abrupt termination (destroy), but a contolled shutdown. Here's what I suspect happens (in RHEL-5) when device_shutdown() walks the list of devices in the PV domU and calls the appropriate shutdown hook for igbvf:
domU: igbvf_shutdown()
igbvf_suspend()
pci_disable_device()
pcibios_disable_device()
pcibios_disable_resources()
pci_write_config_word(
dev, PCI_COMMAND,
orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
)
...
pcifront_bus_write()
do_pci_op(XEN_PCI_OP_conf_write)
dom0: pciback_do_op()
pciback_config_write()
conf_space_write()
command_write() [via PCI_COMMAND funcptr]
pci_disable_device()
disable_msi_mode()
dev->msix_enabled = 0;
The final assignment above precludes c/s 1070 from doing the job.
Of course igbvf_shutdown() could invoke igbvf_remove() instead (or another lower-level function that ends up in dom0's pci_disable_msix()), but it should not depend on the guest playing nice.
Below's my proposed patch for RHEL-5, ported to upstream.
----v----
introduce msi_prune_pci_irq_vectors()
extend pciback_reset_device() with the following functionality (msi_prune_pci_irq_vectors()):
- remove all (MSI-X vector, entry number) pairs that might still belong, in dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might otherwise go leaked
The function, modeled after msi_remove_pci_irq_vectors(), doesn't touch the hypervisor, nor the PCI device; it only trims the list used by dom0 for filtering. Justification being: when this is called, either the owner domain is gone already (or very close to being gone), or the device was already correctly detached.
----^----
Now I think one comment in the patch below does not hold for upstream: "msi_dev_head is only maintained in dom0". According to c/s 680, this doesn't seem to be the case for upstream.
Is the case described above possible at all in upstream? Do you think the fix I propose is correct? It seemed to do the job with RHEL-5 in my testing.
dom0:
PCI: 0000:01:10.0: cleaning up MSI-X entry 0
PCI: 0000:01:10.0: cleaning up MSI-X entry 1
PCI: 0000:01:10.0: cleaning up MSI-X entry 2
PCI: 0000:01:10.0: cleaning up mask_base
Just in case:
drivers/pci/msi-xen.c | 46 ++++++++++++++++++++++++++++++++++++++
drivers/xen/pciback/pciback_ops.c | 5 ++++
include/linux/pci.h | 1
3 files changed, 52 insertions(+)
Signed-off-by: Laszlo Ersek<lersek@redhat.com>
Thank you very much!
lacos
diff -r 876a5aaac026 drivers/pci/msi-xen.c
--- a/drivers/pci/msi-xen.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/pci/msi-xen.c Tue May 31 19:17:26 2011 +0200
@@ -894,6 +894,51 @@ void msi_remove_pci_irq_vectors(struct p
dev->irq = msi_dev_entry->default_irq;
}
+void msi_prune_pci_irq_vectors(struct pci_dev *dev)
+{
+ unsigned long flags;
+ struct msi_dev_list *msi_dev_scan, *msi_dev_entry = NULL;
+ struct msi_pirq_entry *pirq_entry, *tmp;
+
+ if (!pci_msi_enable || !dev)
+ return;
+
+ /* msi_dev_head is only maintained in dom0 */
+ BUG_ON(!is_initial_xendomain());
+
+ /* search for the set of MSI-X vectors, don't extend list */
+ spin_lock_irqsave(&msi_dev_lock, flags);
+ list_for_each_entry(msi_dev_scan, &msi_dev_head, list)
+ if (msi_dev_scan->dev == dev) {
+ msi_dev_entry = msi_dev_scan;
+ break;
+ }
+ spin_unlock_irqrestore(&msi_dev_lock, flags);
+ if (msi_dev_entry == NULL)
+ return;
+
+ /* Empty the set. No PHYSDEVOP_unmap_pirq hypercalls: the domU is
+ * already gone, or the device is already unplugged.
+ */
+ spin_lock_irqsave(&msi_dev_entry->pirq_list_lock, flags);
+ if (!list_empty(&msi_dev_entry->pirq_list_head))
+ list_for_each_entry_safe(pirq_entry, tmp,
+ &msi_dev_entry->pirq_list_head, list) {
+ printk(KERN_INFO "PCI: %s: cleaning up MSI-X entry "
+ "%d\n", pci_name(dev), pirq_entry->entry_nr);
+ list_del(&pirq_entry->list);
+ kfree(pirq_entry);
+ }
+ spin_unlock_irqrestore(&msi_dev_entry->pirq_list_lock, flags);
+
+ if (msi_dev_entry->mask_base != NULL) {
+ printk(KERN_INFO "PCI: %s: cleaning up mask_base\n",
+ pci_name(dev));
+ iounmap(msi_dev_entry->mask_base);
+ msi_dev_entry->mask_base = NULL;
+ }
+}
+
void pci_no_msi(void)
{
pci_msi_enable = 0;
@@ -906,5 +951,6 @@ EXPORT_SYMBOL(pci_disable_msix);
#ifdef CONFIG_XEN
EXPORT_SYMBOL(register_msi_get_owner);
EXPORT_SYMBOL(unregister_msi_get_owner);
+EXPORT_SYMBOL(msi_prune_pci_irq_vectors);
#endif
diff -r 876a5aaac026 drivers/xen/pciback/pciback_ops.c
--- a/drivers/xen/pciback/pciback_ops.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/xen/pciback/pciback_ops.c Tue May 31 19:17:26 2011 +0200
@@ -29,6 +29,11 @@ void pciback_reset_device(struct pci_dev
pci_disable_msix(dev);
if (dev->msi_enabled)
pci_disable_msi(dev);
+
+ /* After a controlled shutdown or the crash fixup above, prune
+ * dom0's MSI-X vector list for the device.
+ */
+ msi_prune_pci_irq_vectors(dev);
#endif
pci_disable_device(dev);
diff -r 876a5aaac026 include/linux/pci.h
--- a/include/linux/pci.h Thu May 26 12:33:41 2011 +0100
+++ b/include/linux/pci.h Tue May 31 19:17:26 2011 +0200
@@ -640,6 +640,7 @@ extern void msi_remove_pci_irq_vectors(s
#ifdef CONFIG_XEN
extern int register_msi_get_owner(int (*func)(struct pci_dev *dev));
extern int unregister_msi_get_owner(int (*func)(struct pci_dev *dev));
+extern void msi_prune_pci_irq_vectors(struct pci_dev *dev);
#endif
#endif
next reply other threads:[~2011-06-01 10:05 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-06-01 10:05 Laszlo Ersek [this message]
2011-06-01 11:02 ` [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device Laszlo Ersek
2011-06-01 14:31 ` Konrad Rzeszutek Wilk
2011-06-01 16:18 ` Paolo Bonzini
2011-06-01 17:32 ` Laszlo Ersek
2011-06-01 18:13 ` 2.6.38 (FC15) with PCI passthrough fails mysteriously with iommu=soft Konrad Rzeszutek Wilk
2011-06-02 7:42 ` Laszlo Ersek
2011-06-02 20:49 ` Pasi Kärkkäinen
2011-06-01 12:56 ` [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device Jan Beulich
2011-06-01 14:12 ` Laszlo Ersek
2011-06-01 14:59 ` Jan Beulich
2011-06-01 15:27 ` Laszlo Ersek
2011-06-01 16:07 ` Jan Beulich
2011-06-01 16:25 ` Andrew Jones
2011-06-01 16:27 ` Paolo Bonzini
2011-06-01 16:16 ` Paolo Bonzini
2011-06-01 14:51 ` Konrad Rzeszutek Wilk
2011-06-01 15:01 ` Konrad Rzeszutek Wilk
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4DE60EF8.5060902@redhat.com \
--to=lersek@redhat.com \
--cc=xen-devel@lists.xensource.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).