All of lore.kernel.org
 help / color / mirror / Atom feed
From: Laszlo Ersek <lersek@redhat.com>
To: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device
Date: Wed, 01 Jun 2011 12:05:44 +0200	[thread overview]
Message-ID: <4DE60EF8.5060902@redhat.com> (raw)

Hi,

this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the situation captured in RHBZ#688673, in a nutshell:

- let's say we have an Intel 82576 card (igb),
- the card exports virtual functions (igbvf),
- one virtual function is passed through to a PV guest,
- the igbvf driver, co-operating with pcifront, pciback, and the hypervisor, sets up three MSI-X vectors for the virtual function PCI device,
- when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors() during shutdown,
- therefore configuration of the VF during the next boot of such a guest doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are already allocated/mapped for the device)

  dom0 (msix_capability_init(), output beautified a bit):
    msix entry 0 for dev 01:10:0 are not freed before acquire again.
    msix entry 1 for dev 01:10:0 are not freed before acquire again.
    msix entry 2 for dev 01:10:0 are not freed before acquire again.

  guest:
      Determining IP information for eth1...
      Failed to obtain physical IRQ 255
      Failed to obtain physical IRQ 254
      Failed to obtain physical IRQ 253

- if the device or the full igbvf module is removed before shutdown in the guest (in case the boot was successful to begin with!), then pci_disable_msix() is called and everything works fine; the next boot / ifup succeeds.

I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is not an abrupt termination (destroy), but a contolled shutdown. Here's what I suspect happens (in RHEL-5) when device_shutdown() walks the list of devices in the PV domU and calls the appropriate shutdown hook for igbvf:

domU: igbvf_shutdown()
        igbvf_suspend()
          pci_disable_device()
            pcibios_disable_device()
              pcibios_disable_resources()
                pci_write_config_word(
                    dev, PCI_COMMAND,
                    orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
                )
                  ...
                    pcifront_bus_write()
                      do_pci_op(XEN_PCI_OP_conf_write)
dom0:                   pciback_do_op()
                          pciback_config_write()
                            conf_space_write()
                              command_write() [via PCI_COMMAND funcptr]
                                pci_disable_device()
                                  disable_msi_mode()
                                    dev->msix_enabled = 0;

The final assignment above precludes c/s 1070 from doing the job.

Of course igbvf_shutdown() could invoke igbvf_remove() instead (or another lower-level function that ends up in dom0's pci_disable_msix()), but it should not depend on the guest playing nice.

Below's my proposed patch for RHEL-5, ported to upstream. 

----v----
introduce msi_prune_pci_irq_vectors()

extend pciback_reset_device() with the following functionality (msi_prune_pci_irq_vectors()):
- remove all (MSI-X vector, entry number) pairs that might still belong, in dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might otherwise go leaked

The function, modeled after msi_remove_pci_irq_vectors(), doesn't touch the hypervisor, nor the PCI device; it only trims the list used by dom0 for filtering. Justification being: when this is called, either the owner domain is gone already (or very close to being gone), or the device was already correctly detached.
----^----

Now I think one comment in the patch below does not hold for upstream: "msi_dev_head is only maintained in dom0". According to c/s 680, this doesn't seem to be the case for upstream.

Is the case described above possible at all in upstream? Do you think the fix I propose is correct? It seemed to do the job with RHEL-5 in my testing.

  dom0:
    PCI: 0000:01:10.0: cleaning up MSI-X entry 0
    PCI: 0000:01:10.0: cleaning up MSI-X entry 1
    PCI: 0000:01:10.0: cleaning up MSI-X entry 2
    PCI: 0000:01:10.0: cleaning up mask_base

Just in case:

 drivers/pci/msi-xen.c             |   46 ++++++++++++++++++++++++++++++++++++++
 drivers/xen/pciback/pciback_ops.c |    5 ++++
 include/linux/pci.h               |    1 
 3 files changed, 52 insertions(+)

Signed-off-by: Laszlo Ersek<lersek@redhat.com>

Thank you very much!
lacos


diff -r 876a5aaac026 drivers/pci/msi-xen.c
--- a/drivers/pci/msi-xen.c	Thu May 26 12:33:41 2011 +0100
+++ b/drivers/pci/msi-xen.c	Tue May 31 19:17:26 2011 +0200
@@ -894,6 +894,51 @@ void msi_remove_pci_irq_vectors(struct p
 	dev->irq = msi_dev_entry->default_irq;
 }
 
+void msi_prune_pci_irq_vectors(struct pci_dev *dev)
+{
+	unsigned long flags;
+	struct msi_dev_list *msi_dev_scan, *msi_dev_entry = NULL;
+	struct msi_pirq_entry *pirq_entry, *tmp;
+
+	if (!pci_msi_enable || !dev)
+		return;
+
+	/* msi_dev_head is only maintained in dom0 */
+	BUG_ON(!is_initial_xendomain());
+
+	/* search for the set of MSI-X vectors, don't extend list */
+	spin_lock_irqsave(&msi_dev_lock, flags);
+	list_for_each_entry(msi_dev_scan, &msi_dev_head, list)
+		if (msi_dev_scan->dev == dev) {
+			msi_dev_entry = msi_dev_scan;
+			break;
+		}
+	spin_unlock_irqrestore(&msi_dev_lock, flags);
+	if (msi_dev_entry == NULL)
+		return;
+
+	/* Empty the set. No PHYSDEVOP_unmap_pirq hypercalls: the domU is
+	 * already gone, or the device is already unplugged.
+	 */
+	spin_lock_irqsave(&msi_dev_entry->pirq_list_lock, flags);
+	if (!list_empty(&msi_dev_entry->pirq_list_head))
+		list_for_each_entry_safe(pirq_entry, tmp,
+		                         &msi_dev_entry->pirq_list_head, list) {
+			printk(KERN_INFO "PCI: %s: cleaning up MSI-X entry "
+			       "%d\n", pci_name(dev), pirq_entry->entry_nr);
+			list_del(&pirq_entry->list);
+			kfree(pirq_entry);
+		}
+	spin_unlock_irqrestore(&msi_dev_entry->pirq_list_lock, flags);
+
+	if (msi_dev_entry->mask_base != NULL) {
+		printk(KERN_INFO "PCI: %s: cleaning up mask_base\n",
+		       pci_name(dev));
+		iounmap(msi_dev_entry->mask_base);
+		msi_dev_entry->mask_base = NULL;
+	}
+}
+
 void pci_no_msi(void)
 {
 	pci_msi_enable = 0;
@@ -906,5 +951,6 @@ EXPORT_SYMBOL(pci_disable_msix);
 #ifdef CONFIG_XEN
 EXPORT_SYMBOL(register_msi_get_owner);
 EXPORT_SYMBOL(unregister_msi_get_owner);
+EXPORT_SYMBOL(msi_prune_pci_irq_vectors);
 #endif
 
diff -r 876a5aaac026 drivers/xen/pciback/pciback_ops.c
--- a/drivers/xen/pciback/pciback_ops.c	Thu May 26 12:33:41 2011 +0100
+++ b/drivers/xen/pciback/pciback_ops.c	Tue May 31 19:17:26 2011 +0200
@@ -29,6 +29,11 @@ void pciback_reset_device(struct pci_dev
 			pci_disable_msix(dev);
 		if (dev->msi_enabled)
 			pci_disable_msi(dev);
+
+		/* After a controlled shutdown or the crash fixup above, prune
+		 * dom0's MSI-X vector list for the device.
+		 */
+		msi_prune_pci_irq_vectors(dev);
 #endif
 		pci_disable_device(dev);
 
diff -r 876a5aaac026 include/linux/pci.h
--- a/include/linux/pci.h	Thu May 26 12:33:41 2011 +0100
+++ b/include/linux/pci.h	Tue May 31 19:17:26 2011 +0200
@@ -640,6 +640,7 @@ extern void msi_remove_pci_irq_vectors(s
 #ifdef CONFIG_XEN
 extern int register_msi_get_owner(int (*func)(struct pci_dev *dev));
 extern int unregister_msi_get_owner(int (*func)(struct pci_dev *dev));
+extern void msi_prune_pci_irq_vectors(struct pci_dev *dev);
 #endif
 #endif

             reply	other threads:[~2011-06-01 10:05 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-01 10:05 Laszlo Ersek [this message]
2011-06-01 11:02 ` [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device Laszlo Ersek
2011-06-01 14:31   ` Konrad Rzeszutek Wilk
2011-06-01 16:18     ` Paolo Bonzini
2011-06-01 17:32     ` Laszlo Ersek
2011-06-01 18:13       ` 2.6.38 (FC15) with PCI passthrough fails mysteriously with iommu=soft Konrad Rzeszutek Wilk
2011-06-02  7:42         ` Laszlo Ersek
2011-06-02 20:49         ` Pasi Kärkkäinen
2011-06-01 12:56 ` [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device Jan Beulich
2011-06-01 14:12   ` Laszlo Ersek
2011-06-01 14:59     ` Jan Beulich
2011-06-01 15:27       ` Laszlo Ersek
2011-06-01 16:07         ` Jan Beulich
2011-06-01 16:25           ` Andrew Jones
2011-06-01 16:27             ` Paolo Bonzini
2011-06-01 16:16       ` Paolo Bonzini
2011-06-01 14:51 ` Konrad Rzeszutek Wilk
2011-06-01 15:01   ` Konrad Rzeszutek Wilk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4DE60EF8.5060902@redhat.com \
    --to=lersek@redhat.com \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.