IOMMU faults

All of lore.kernel.org
 help / color / mirror / Atom feed

* IOMMU faults
@ 2011-06-16  9:25 Tim Deegan
  2011-06-16  9:47 ` Jean Guyader
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Tim Deegan @ 2011-06-16  9:25 UTC (permalink / raw)
  To: Allen Kay, Wei Wang; +Cc: xen-devel, Jean Guyader

Hi, IOMMU maintainers,

What should Xen do when an IOMMU fault happens?  As far as I can
see both the AMD and Intel code clears the error in the IOMMU and
carries on, but I suspect some more vigorous action is appropriate.
I've seen traces from an Intel machine that seemed to be livelocked on
IOMMU faults from a passed-through VGA card, until it was killed by the
watchdog.  I think I can see two things that contribute to that:

 - The Intel IOMMU fault handler prints quite a lot of info in interrupt
   context, making it easier to livelock.  Still I think the general
   problem applies on AMD too.
 - Domain destruction re-assigns passed though cards to dom0, but the
   cards don't seem to get reset.  So there's nothing to stop a card
   battering away at DMA in the meantime.  That seems like a problem
   independent of livelock, actually.

In any case, it seems like it would be a good idea to stop a
broken/malicious/deassigned card from flooding Xen with IOMMU faults.

I was considering just writing 0 to the faulting card's PCI command
register, but I'm told that's not always enough to properly deactivate
a card, and it might be a little over-zealous to do it on the first
offence. 

Ideas?

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: IOMMU faults
  2011-06-16  9:25 IOMMU faults Tim Deegan
@ 2011-06-16  9:47 ` Jean Guyader
  2011-06-16 10:07   ` Tim Deegan
  2011-06-16 14:30 ` Wei Wang2
  2011-06-16 19:21 ` Kay, Allen M
  2 siblings, 1 reply; 12+ messages in thread
From: Jean Guyader @ 2011-06-16  9:47 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Wei Wang, xen-devel@lists.xensource.com, Allen Kay, Jean Guyader

[-- Attachment #1: Type: text/plain, Size: 2254 bytes --]

On 16/06 10:25, Tim Deegan wrote:
> Hi, IOMMU maintainers,
> 
> What should Xen do when an IOMMU fault happens?  As far as I can
> see both the AMD and Intel code clears the error in the IOMMU and
> carries on, but I suspect some more vigorous action is appropriate.
> I've seen traces from an Intel machine that seemed to be livelocked on
> IOMMU faults from a passed-through VGA card, until it was killed by the
> watchdog.  I think I can see two things that contribute to that:
> 
>  - The Intel IOMMU fault handler prints quite a lot of info in interrupt
>    context, making it easier to livelock.  Still I think the general
>    problem applies on AMD too.
>  - Domain destruction re-assigns passed though cards to dom0, but the
>    cards don't seem to get reset.  So there's nothing to stop a card
>    battering away at DMA in the meantime.  That seems like a problem
>    independent of livelock, actually.
> 
> In any case, it seems like it would be a good idea to stop a
> broken/malicious/deassigned card from flooding Xen with IOMMU faults.
> 
> I was considering just writing 0 to the faulting card's PCI command
> register, but I'm told that's not always enough to properly deactivate
> a card, and it might be a little over-zealous to do it on the first
> offence. 
> 
> Ideas?
> 

Hi Tim,

We have seed such behavior when we were testing GPU assignement especially
the Intel GPU. The problem is that domain destruction in Xen is assynchronous
and right now the pci device reset is done in dom0 with some help of the toolstack.

In the Intel GPU case we need to make sure that the guest memory and the IOMMU
are still in place while we perform to reset otherwise the device drift into
an unstable state.

There is probably other ways to do that in a cleaner way but we decided to move
the pci reset code into Xen, so we are sure we perform the reset while the device
is in a known state (functionning state).

Attached is the patch we have in XenClient that move the pci reset into Xen.
The modifications we have made to the VT-d code should go in the IOMMU generic
section. I appologise but this patch is based on Xen 3.4, if we think this is
the right way to do it, I can submit a proper patch against unstable and 4.1.

Regards,
Jean

[-- Attachment #2: iommu-vtd-pci-reset-on-reassign --]
[-- Type: text/plain, Size: 6846 bytes --]

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index feda3ba..2f5a69a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -510,8 +510,6 @@ void arch_domain_destroy(struct domain *d)
 
     pci_release_devices(d);
     free_domain_pirqs(d);
-    if ( !is_idle_domain(d) )
-        iommu_domain_destroy(d);
 
     paging_final_teardown(d);
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 32f1d48..a7b0207 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -385,6 +385,7 @@ int domain_kill(struct domain *d)
 	v4v_destroy(d);
         evtchn_destroy(d);
         gnttab_release_mappings(d);
+        iommu_domain_destroy(d);
         /* fallthrough */
     case DOMDYING_dying:
         rc = domain_relinquish_resources(d);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 723cff0..e90a335 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -75,6 +75,81 @@ struct pci_dev *pci_get_pdev(int bus, int devfn)
     return NULL;
 }
 
+struct pci_dev *pci_get_bridge(u8 secondary)
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+    {
+        u8 type = pci_conf_read8(pdev->bus,
+                                 PCI_SLOT(pdev->devfn),
+                                 PCI_FUNC(pdev->devfn),
+                                 PCI_HEADER_TYPE);
+        if ( (type & 3) == PCI_HEADER_TYPE_BRIDGE )
+        {
+            u8 sec = pci_conf_read8(pdev->bus,
+                                    PCI_SLOT(pdev->devfn),
+                                    PCI_FUNC(pdev->devfn),
+                                    PCI_SECONDARY_BUS);
+            if ( sec == secondary )
+            {
+                m = pdev;
+                goto out;
+            }
+         }
+    }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+}
+
+struct pci_dev *pci_get_gfx(struct domain *d)
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+    {
+        if ( d == NULL || pdev->domain == d )
+        {
+            u16 class = pci_conf_read16(pdev->bus,
+                                       PCI_SLOT(pdev->devfn),
+                                       PCI_FUNC(pdev->devfn),
+                                       PCI_CLASS_DEVICE);
+
+            if ( class == 0x0300 )
+            {
+                m = pdev;
+                goto out;
+            }
+        }
+    }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+
+}
+
+struct pci_dev *pci_match_pdev(int (*match)(struct pci_dev *))
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+        if ( match(pdev) )
+        {
+            m = pdev;
+            goto out;
+        }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+}
+
 struct pci_dev *pci_get_pdev_by_domain(struct domain *d, int bus, int devfn)
 {
     struct pci_dev *pdev = NULL;
diff --git a/xen/drivers/passthrough/vtd/Makefile b/xen/drivers/passthrough/vtd/Makefile
index 0e6f163..00ba797 100644
--- a/xen/drivers/passthrough/vtd/Makefile
+++ b/xen/drivers/passthrough/vtd/Makefile
@@ -6,3 +6,4 @@ obj-y += dmar.o
 obj-y += utils.o
 obj-y += qinval.o
 obj-y += intremap.o
+obj-y += pci_reset.o
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 687f3fd..df88dc7 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -21,7 +21,6 @@
 
 #include <xen/irq.h>
 #include <xen/sched.h>
-#include <xen/xmalloc.h>
 #include <xen/mm.h>
 #include <xen/paging.h>
 #include <xen/domain_page.h>
@@ -32,11 +31,13 @@
 #include <xen/pci.h>
 #include <xen/pci_regs.h>
 #include <xen/keyhandler.h>
+#include <xen/xmalloc.h>
 #include <asm/msi.h>
 #include "iommu.h"
 #include "dmar.h"
 #include "extern.h"
 #include "vtd.h"
+#include "pci_reset.h"
 
 #define domain_iommu_domid(d) ((d)->arch.hvm_domain.hvm_iommu.iommu_domid)
 
@@ -1503,6 +1504,15 @@ static int reassign_device_ownership(
 
     if ( (drhd = acpi_find_matched_drhd_unit(pdev)) == NULL )
         return -ENODEV;
+
+    /* Only do the reset here if target isn't dom0.
+    ** For the reassign to dom0 it's done in the iommu_domain_teardown
+    ** function because reassign_device_ownership happen too late in
+    ** the process.
+    */
+    if (target->domain_id != 0)
+        pci_reset_device(bus, devfn);
+
     pdev_iommu = drhd->iommu;
     domain_context_unmap(source, bus, devfn);
 
@@ -1534,10 +1544,14 @@ static int reassign_device_ownership(
 void iommu_domain_teardown(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
+    struct pci_dev *pdev = NULL;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
+    if ((pdev = pci_get_gfx(d)))
+        pci_reset_device(pdev->bus, pdev->devfn);
+
     spin_lock(&hd->mapping_lock);
     iommu_free_pagetable(hd->pgd_maddr, agaw_to_level(hd->agaw));
     hd->pgd_maddr = 0;
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 6827e0d..9ed6f60 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -77,6 +77,9 @@ int pci_remove_device(u8 bus, u8 devfn);
 int pci_add_device_ext(u8 bus, u8 devfn, struct pci_dev_info *info);
 struct pci_dev *pci_get_pdev(int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(struct domain *d, int bus, int devfn);
+struct pci_dev *pci_match_pdev(int (*match)(struct pci_dev *));
+struct pci_dev *pci_get_bridge(u8 secondary);
+struct pci_dev *pci_get_gfx(struct domain *d);
 
 uint8_t pci_conf_read8(
     unsigned int bus, unsigned int dev, unsigned int func, unsigned int reg);
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index 361554c..20b00e8 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -210,6 +210,7 @@
 #define  PCI_CAP_ID_AGP3	0x0E	/* AGP Target PCI-PCI bridge */
 #define  PCI_CAP_ID_EXP 	0x10	/* PCI Express */
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
+#define  PCI_CAP_ID_AF          0x13    /* PCI Advanced Features */
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -315,6 +316,17 @@
 #define  PCI_CHSWP_EXT		0x40	/* ENUM# status - extraction */
 #define  PCI_CHSWP_INS		0x80	/* ENUM# status - insertion */
 
+/* PCI Advanced Feature registers */
+
+#define PCI_AF_LENGTH		2
+#define PCI_AF_CAP		3
+#define  PCI_AF_CAP_TP		0x01
+#define  PCI_AF_CAP_FLR		0x02
+#define PCI_AF_CTRL		4
+#define  PCI_AF_CTRL_FLR	0x01
+#define PCI_AF_STATUS		5
+#define  PCI_AF_STATUS_TP	0x01
+
 /* PCI-X registers */
 
 #define PCI_X_CMD		2	/* Modes & Features */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-16  9:47 ` Jean Guyader
@ 2011-06-16 10:07   ` Tim Deegan
  2011-06-16 10:28     ` Jean Guyader
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Deegan @ 2011-06-16 10:07 UTC (permalink / raw)
  To: Jean Guyader
  Cc: Wei Wang, Jean, xen-devel@lists.xensource.com, Allen Kay, Guyader

Hi Jean, 

At 10:47 +0100 on 16 Jun (1308221277), Jean Guyader wrote:
> We have seed such behavior when we were testing GPU assignement especially
> the Intel GPU. The problem is that domain destruction in Xen is assynchronous
> and right now the pci device reset is done in dom0 with some help of the toolstack.
> 
> In the Intel GPU case we need to make sure that the guest memory and the IOMMU
> are still in place while we perform to reset otherwise the device drift into
> an unstable state.
> 
> There is probably other ways to do that in a cleaner way but we decided to move
> the pci reset code into Xen, so we are sure we perform the reset while the device
> is in a known state (functionning state).
> 
> Attached is the patch we have in XenClient that move the pci reset into Xen.

Thanks, Jean.  This sounds like a good idea to me, though I'd like to
hear Wei and Allen's opinions.

The patch is incomplete (missing the new pci_reset.[ch] files) but I get
the general idea.  A few questions:

 - Why the special handling for one graphics device on each domain? 
   (And if one, why not all?)
 - Why not reset when the target is dom0?  It seems like it can do no
   harm and should protect dom0 from assigning itself an active PCI
   card. 

Of course, even with this patch, my original question still stands:
should Xen do something more assertive in the IOMMU fault handler?

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-16 10:07   ` Tim Deegan
@ 2011-06-16 10:28     ` Jean Guyader
  2011-06-24 13:32       ` Tim Deegan
  0 siblings, 1 reply; 12+ messages in thread
From: Jean Guyader @ 2011-06-16 10:28 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Wei Wang, xen-devel@lists.xensource.com, Allen Kay, Jean Guyader

[-- Attachment #1: Type: text/plain, Size: 2050 bytes --]

On 16/06 11:07, Tim Deegan wrote:
> Hi Jean, 
> 

Reply below,

> At 10:47 +0100 on 16 Jun (1308221277), Jean Guyader wrote:
> > We have seed such behavior when we were testing GPU assignement especially
> > the Intel GPU. The problem is that domain destruction in Xen is assynchronous
> > and right now the pci device reset is done in dom0 with some help of the toolstack.
> > 
> > In the Intel GPU case we need to make sure that the guest memory and the IOMMU
> > are still in place while we perform to reset otherwise the device drift into
> > an unstable state.
> > 
> > There is probably other ways to do that in a cleaner way but we decided to move
> > the pci reset code into Xen, so we are sure we perform the reset while the device
> > is in a known state (functionning state).
> > 
> > Attached is the patch we have in XenClient that move the pci reset into Xen.
> 
> Thanks, Jean.  This sounds like a good idea to me, though I'd like to
> hear Wei and Allen's opinions.
> 
> The patch is incomplete (missing the new pci_reset.[ch] files) but I get
> the general idea.  A few questions:

Reattach the full patch.

> 
>  - Why the special handling for one graphics device on each domain? 
>    (And if one, why not all?)

No good reason really just a limitation of the patch, we can trivially
get ride of the limitation.

>  - Why not reset when the target is dom0?  It seems like it can do no
>    harm and should protect dom0 from assigning itself an active PCI
>    card. 

Reset could be quiet expensive (couple of seconds sometimes).
We did that to avoid a double reset on domain reboot.
I agree that we should remove that, or extend the IOMMU API
so we can reassign from domU to domU without going through
dom0.

> 
> Of course, even with this patch, my original question still stands:
> should Xen do something more assertive in the IOMMU fault handler?
> 

What we really want to achive here is to stop DMA on this device.
One way of doing it is to perform a proper PCI reset (FLR, secondary
bus reset, ...) when that happens.

Jean

[-- Attachment #2: iommu-vtd-pci-reset-on-reassign --]
[-- Type: text/plain, Size: 19643 bytes --]

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index feda3ba..2f5a69a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -510,8 +510,6 @@ void arch_domain_destroy(struct domain *d)
 
     pci_release_devices(d);
     free_domain_pirqs(d);
-    if ( !is_idle_domain(d) )
-        iommu_domain_destroy(d);
 
     paging_final_teardown(d);
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 32f1d48..a7b0207 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -385,6 +385,7 @@ int domain_kill(struct domain *d)
 	v4v_destroy(d);
         evtchn_destroy(d);
         gnttab_release_mappings(d);
+        iommu_domain_destroy(d);
         /* fallthrough */
     case DOMDYING_dying:
         rc = domain_relinquish_resources(d);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 723cff0..e90a335 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -75,6 +75,81 @@ struct pci_dev *pci_get_pdev(int bus, int devfn)
     return NULL;
 }
 
+struct pci_dev *pci_get_bridge(u8 secondary)
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+    {
+        u8 type = pci_conf_read8(pdev->bus,
+                                 PCI_SLOT(pdev->devfn),
+                                 PCI_FUNC(pdev->devfn),
+                                 PCI_HEADER_TYPE);
+        if ( (type & 3) == PCI_HEADER_TYPE_BRIDGE )
+        {
+            u8 sec = pci_conf_read8(pdev->bus,
+                                    PCI_SLOT(pdev->devfn),
+                                    PCI_FUNC(pdev->devfn),
+                                    PCI_SECONDARY_BUS);
+            if ( sec == secondary )
+            {
+                m = pdev;
+                goto out;
+            }
+         }
+    }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+}
+
+struct pci_dev *pci_get_gfx(struct domain *d)
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+    {
+        if ( d == NULL || pdev->domain == d )
+        {
+            u16 class = pci_conf_read16(pdev->bus,
+                                       PCI_SLOT(pdev->devfn),
+                                       PCI_FUNC(pdev->devfn),
+                                       PCI_CLASS_DEVICE);
+
+            if ( class == 0x0300 )
+            {
+                m = pdev;
+                goto out;
+            }
+        }
+    }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+
+}
+
+struct pci_dev *pci_match_pdev(int (*match)(struct pci_dev *))
+{
+    struct pci_dev *pdev = NULL, *m = NULL;
+
+    spin_lock(&pcidevs_lock);
+
+    list_for_each_entry ( pdev, &alldevs_list, alldevs_list )
+        if ( match(pdev) )
+        {
+            m = pdev;
+            goto out;
+        }
+out:
+    spin_unlock(&pcidevs_lock);
+    return m;
+}
+
 struct pci_dev *pci_get_pdev_by_domain(struct domain *d, int bus, int devfn)
 {
     struct pci_dev *pdev = NULL;
diff --git a/xen/drivers/passthrough/vtd/Makefile b/xen/drivers/passthrough/vtd/Makefile
index 0e6f163..00ba797 100644
--- a/xen/drivers/passthrough/vtd/Makefile
+++ b/xen/drivers/passthrough/vtd/Makefile
@@ -6,3 +6,4 @@ obj-y += dmar.o
 obj-y += utils.o
 obj-y += qinval.o
 obj-y += intremap.o
+obj-y += pci_reset.o
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 687f3fd..df88dc7 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -21,7 +21,6 @@
 
 #include <xen/irq.h>
 #include <xen/sched.h>
-#include <xen/xmalloc.h>
 #include <xen/mm.h>
 #include <xen/paging.h>
 #include <xen/domain_page.h>
@@ -32,11 +31,13 @@
 #include <xen/pci.h>
 #include <xen/pci_regs.h>
 #include <xen/keyhandler.h>
+#include <xen/xmalloc.h>
 #include <asm/msi.h>
 #include "iommu.h"
 #include "dmar.h"
 #include "extern.h"
 #include "vtd.h"
+#include "pci_reset.h"
 
 #define domain_iommu_domid(d) ((d)->arch.hvm_domain.hvm_iommu.iommu_domid)
 
@@ -1503,6 +1504,15 @@ static int reassign_device_ownership(
 
     if ( (drhd = acpi_find_matched_drhd_unit(pdev)) == NULL )
         return -ENODEV;
+
+    /* Only do the reset here if target isn't dom0.
+    ** For the reassign to dom0 it's done in the iommu_domain_teardown
+    ** function because reassign_device_ownership happen too late in
+    ** the process.
+    */
+    if (target->domain_id != 0)
+        pci_reset_device(bus, devfn);
+
     pdev_iommu = drhd->iommu;
     domain_context_unmap(source, bus, devfn);
 
@@ -1534,10 +1544,14 @@ static int reassign_device_ownership(
 void iommu_domain_teardown(struct domain *d)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
+    struct pci_dev *pdev = NULL;
 
     if ( list_empty(&acpi_drhd_units) )
         return;
 
+    if ((pdev = pci_get_gfx(d)))
+        pci_reset_device(pdev->bus, pdev->devfn);
+
     spin_lock(&hd->mapping_lock);
     iommu_free_pagetable(hd->pgd_maddr, agaw_to_level(hd->agaw));
     hd->pgd_maddr = 0;
diff --git a/xen/drivers/passthrough/vtd/pci_reset.c b/xen/drivers/passthrough/vtd/pci_reset.c
new file mode 100644
index 0000000..f0614f9
--- /dev/null
+++ b/xen/drivers/passthrough/vtd/pci_reset.c
@@ -0,0 +1,389 @@
+/*
+ * Copyright (c) 2011, Citrix Systems. 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Julian Pidancet <julian.pidancet@citrix.com>
+ * Copyright (C) Jean Guyader <jean.guyader@citrix.com>
+ */
+
+#include <xen/sched.h>
+#include <xen/pci.h>
+#include <xen/pci_regs.h>
+#include <asm/delay.h>
+#include <xen/delay.h>
+
+#define INTEL_VID               0x8086
+#define INTEL_DID_GMA           0x2a00
+#define INTEL_DID_CALPELLA      0x0046
+#define INTEL_DID_ARRANDALE     0x0042
+#define INTEL_DID_SNB1          0x0126
+#define INTEL_DID_SNB2          0x0102
+
+#define NVIDIA_VID              0x10de
+#define ATI_VID                 0x1002
+
+#define IS_MISCGFX(vid, class) ( ((vid) == NVIDIA_VID) &&               \
+                                 (class) == 0x300                       \
+                               )
+
+#define IS_GMA(vid, did) ( (vid) == INTEL_VID &&                        \
+                          ((did) & 0xff00) == INTEL_DID_GMA)
+
+#define IS_IRONLAKE(vid, did) ( (vid) == INTEL_VID &&                   \
+                                ((did) == INTEL_DID_CALPELLA ||         \
+                                 (did) == INTEL_DID_SNB1 ||              \
+                                 (did) == INTEL_DID_SNB2 ||             \
+                                 (did) == INTEL_DID_ARRANDALE)          \
+                              )
+
+#define PCI_BDF_FMT "%02x:%02x.%x"
+
+#define PCI_COMMAND_VALID (PCI_COMMAND_IO | PCI_COMMAND_MEMORY | \
+                           PCI_COMMAND_MASTER)
+
+#define PCI_CONFIG_SIZE 0x100
+
+#define ypcmem(a,b,c) memcpy(b,a,c)
+
+static int reg_len[PCI_CONFIG_SIZE] = {
+/*00*/ 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
+/*10*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*20*/ 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*30*/ 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
+
+/*40*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*50*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*60*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*70*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+
+/*80*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*90*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*a0*/ 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*b0*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+
+/*c0*/ 0, 1, 2, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*d0*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*e0*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*f0*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+};
+
+static int reg_compare_len[PCI_CONFIG_SIZE] = {
+/*00*/ 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
+/*10*/ 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0,
+/*20*/ 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*30*/ 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
+
+/*40*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*50*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*60*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*70*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+
+/*80*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*90*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*a0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*b0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+
+/*c0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*d0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*e0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+/*f0*/ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0,
+};
+
+static void
+read_config_space (u8 b, u16 devfn, u8 * cfg)
+{
+  u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+  u32 cfg32;
+  u16 cfg16;
+  int i;
+
+  for (i = 0; i < PCI_CONFIG_SIZE; ++i)
+    {
+      switch (reg_len[i])
+        {
+        case 1:
+          cfg[i] = pci_conf_read8(b, d, f, i);
+          break;
+        case 2:
+          cfg16 = pci_conf_read16(b, d, f, i);
+          memcpy (&cfg[i], &cfg16, sizeof (cfg16));
+          break;
+        case 4:
+          cfg32 = pci_conf_read32(b, d, f, i);
+          memcpy (&cfg[i], &cfg32, sizeof (cfg32));
+          break;
+        }
+    }
+}
+
+static void
+write_config_space (u8 b, u16 devfn, u8 * cfg)
+{
+  u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+  u32 cfg32;
+  u16 cfg16;
+  int i;
+
+
+  for (i = 0; i < PCI_CONFIG_SIZE; ++i)
+    {
+      switch (reg_len[i])
+        {
+        case 1:
+          pci_conf_write8(b, d, f, i, cfg[i]);
+          break;
+        case 2:
+          ypcmem (&cfg[i], &cfg16, sizeof (cfg16));
+          pci_conf_write16(b, d, f, i, cfg16);
+          break;
+        case 4:
+          ypcmem (&cfg[i], &cfg32, sizeof (cfg32));
+          pci_conf_write32(b, d, f, i, cfg32);
+          break;
+        }
+    }
+}
+
+static int
+compare_config_space (u8 b, u16 devfn, u8 * cfg)
+{
+  u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+  u32 cfg32;
+  u16 cfg16;
+  u8 cfg8;
+  int ret = 0;
+  int i;
+
+  for (i = 0; i < PCI_CONFIG_SIZE; ++i)
+    {
+      switch (reg_compare_len[i])
+        {
+        case 1:
+          cfg8 = pci_conf_read8(b, d, f, i);
+          if (memcmp (&cfg[i], &cfg8, sizeof (cfg8)))
+          {
+              printk("compare_config_space failed, wait a bit a restore again " PCI_BDF_FMT " %x\n",
+                      b, d, f, i);
+              ret++;
+          }
+          break;
+        case 2:
+          cfg16 = pci_conf_read16(b, d, f, i);
+          if (memcmp (&cfg[i], &cfg16, sizeof (cfg16)))
+          {
+              printk("compare_config_space failed, wait a bit a restore again " PCI_BDF_FMT " %x\n",
+                      b, d, f, i);
+              ret++;
+          }
+          break;
+        case 4:
+          cfg32 = pci_conf_read32(b, d, f, i);
+          if (memcmp (&cfg[i], &cfg32, sizeof (cfg32)))
+          {
+              printk("compare_config_space failed, wait a bit a restore again " PCI_BDF_FMT " %x\n",
+                      b, d, f, i);
+              ret++;
+          }
+          break;
+        }
+    }
+  return ret;
+}
+
+static int
+restore_config_space (u8 b, u16 devfn, u8 * cfg)
+{
+    int tries = 3;
+
+    while (tries--)
+    {
+        write_config_space(b, devfn, cfg);
+        if (!compare_config_space(b, devfn, cfg))
+            return 0;
+        cpu_relax();
+        mdelay(100); /* Wait 100 ms between restores */
+    }
+    return tries == 0;
+}
+
+static void pci_reset_function(u8 bus, u8 devfn, int force)
+{
+    s_time_t start_time;
+    u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+    u8 val;
+    int af_pos;
+    u8 af;
+
+    printk("Resetting PCI function %d:%d.%d\n", bus, d, f);
+
+    af_pos = pci_find_cap_offset(bus, d, f, PCI_CAP_ID_AF);
+    if (af_pos == 0)
+    {
+        printk("%s: PCI Advanced Features not found\n", __func__);
+        if (force) {
+            printk("%s: trying A4h\n", __func__);
+            af_pos = 0xa4;
+        }
+        else
+            BUG_ON(1);
+    }
+
+    if (!force)
+    {
+        af = pci_conf_read8(bus, d, f, af_pos + PCI_AF_CAP);
+        if (!(af & PCI_AF_CAP_FLR))
+        {
+            printk("%s: PCI device cannot FLR\n", __func__);
+            BUG_ON(1);
+        }
+    }
+
+    /* Leave CMD MEMORY set otherwise the platform can crashe during FLR */
+    pci_conf_write16(bus, d, f, PCI_COMMAND, 2);
+
+    /* Wait for bus accesses to complete */
+    start_time = NOW();
+    for ( ; ; )
+    {
+        val = pci_conf_read8(bus, d, f, af_pos + PCI_AF_STATUS);
+        if ( !(val & PCI_AF_STATUS_TP) )
+            break;
+
+        if ( NOW() > start_time + MILLISECS(1000) )
+        {
+            printk("%s: Failed to quiesce bus accesses\n", __func__);
+            BUG_ON(1);
+        }
+
+        cpu_relax();
+    }
+
+    /* Trigger the FLR */
+    pci_conf_write8(bus, d, f, af_pos + PCI_AF_CTRL, PCI_AF_CTRL_FLR);
+    mdelay(100);
+}
+
+static void force_cmd_register(u8 bus, u8 devfn)
+{
+    u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+    u8 val;
+
+    val = pci_conf_read16(bus, d, f, PCI_COMMAND);
+
+    if ((val & 0x7) == 0)
+    {
+        val |= 0x7;
+        pci_conf_write16(bus, d, f, PCI_COMMAND, val);
+    }
+}
+
+static void pci_reset_bus(u8 bus, u8 devfn)
+{
+    u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+    u8 val;
+
+    val = pci_conf_read8(bus, d, f, PCI_BRIDGE_CONTROL);
+    pci_conf_write8(bus, d, f, PCI_BRIDGE_CONTROL, val | PCI_BRIDGE_CTL_BUS_RESET);
+    udelay(20000);
+    pci_conf_write8(bus, d, f, PCI_BRIDGE_CONTROL, val);
+    udelay(20000);
+}
+
+static void pci_reset_misc_gfx(u8 bus, u8 devfn)
+{
+    u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+    u16 vendorid, deviceid, class;
+
+    vendorid = pci_conf_read16(bus, d, f, PCI_VENDOR_ID);
+    deviceid = pci_conf_read16(bus, d, f, PCI_DEVICE_ID);
+    class = pci_conf_read16(bus, d, f, PCI_CLASS_DEVICE);
+
+    if (IS_MISCGFX(vendorid, class)) {
+        struct pci_dev *bridge = pci_get_bridge(bus);
+        u8 *config;
+
+        if (bridge == NULL) {
+            printk("%s: Can't find bridge.\n", __func__);
+            return;
+        }
+
+        config = xmalloc_array(u8, PCI_CONFIG_SIZE);
+        read_config_space(bus, devfn, config);
+
+        pci_reset_bus(bridge->bus, bridge->devfn);
+
+        udelay(20000);
+
+        restore_config_space(bus, devfn, config);
+        xfree(config);
+    }
+
+}
+
+static void pci_reset_intel_gfx(u8 bus, u8 devfn)
+{
+    u8 d = PCI_SLOT(devfn), f = PCI_FUNC(devfn);
+    u16 vendorid, deviceid;
+    u8 val;
+    u8 *config;
+    int i;
+
+    vendorid = pci_conf_read16(bus, d, f, PCI_VENDOR_ID);
+    deviceid = pci_conf_read16(bus, d, f, PCI_DEVICE_ID);
+
+    /* Intel GMA */
+    if (IS_GMA(vendorid, deviceid)) {
+        config = xmalloc_array(u8, PCI_CONFIG_SIZE);
+        read_config_space(bus, devfn, config);
+
+        /* Perform GPU Media reset */
+        pci_conf_write8(bus, d, f, 0xC0, 0x0D);
+
+        for (i = 0; i < 10; i++)
+        {
+            mdelay(10);
+
+            val = pci_conf_read8(bus, d, f, 0xC0);
+            if ( !(val & 0x1) )
+                break;
+        }
+
+        pci_reset_function(bus, devfn, 1);
+        restore_config_space(bus, devfn, config);
+        xfree(config);
+
+        force_cmd_register(bus, devfn);
+    }
+
+    /* Ironlake */
+    if (IS_IRONLAKE(vendorid, deviceid)) {
+
+        config = xmalloc_array(u8, PCI_CONFIG_SIZE);
+        read_config_space(bus, devfn, config);
+
+        pci_reset_function(bus, devfn, 0);
+
+        restore_config_space(bus, devfn, config);
+        xfree(config);
+
+        force_cmd_register(bus, devfn);
+    }
+}
+
+void pci_reset_device(u8 bus, u8 devfn)
+{
+    pci_reset_intel_gfx(bus, devfn);
+    pci_reset_misc_gfx(bus, devfn);
+}
diff --git a/xen/drivers/passthrough/vtd/pci_reset.h b/xen/drivers/passthrough/vtd/pci_reset.h
new file mode 100644
index 0000000..dfd0872
--- /dev/null
+++ b/xen/drivers/passthrough/vtd/pci_reset.h
@@ -0,0 +1,27 @@
+/*
+ * Copyright (c) 2011, Citrix Systems. 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Julian Pidancet <julian.pidancet@citrix.com>
+ * Copyright (C) Jean Guyader <jean.guyader@citrix.com>
+ */
+
+
+#ifndef PCI_RESET_H_
+# define PCI_RESET_H_
+
+void pci_reset_device(u8 bus, u8 devfn);
+
+#endif /* PCI_RESET_H_ */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 6827e0d..9ed6f60 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -77,6 +77,9 @@ int pci_remove_device(u8 bus, u8 devfn);
 int pci_add_device_ext(u8 bus, u8 devfn, struct pci_dev_info *info);
 struct pci_dev *pci_get_pdev(int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(struct domain *d, int bus, int devfn);
+struct pci_dev *pci_match_pdev(int (*match)(struct pci_dev *));
+struct pci_dev *pci_get_bridge(u8 secondary);
+struct pci_dev *pci_get_gfx(struct domain *d);
 
 uint8_t pci_conf_read8(
     unsigned int bus, unsigned int dev, unsigned int func, unsigned int reg);
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index 361554c..20b00e8 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -210,6 +210,7 @@
 #define  PCI_CAP_ID_AGP3	0x0E	/* AGP Target PCI-PCI bridge */
 #define  PCI_CAP_ID_EXP 	0x10	/* PCI Express */
 #define  PCI_CAP_ID_MSIX	0x11	/* MSI-X */
+#define  PCI_CAP_ID_AF          0x13    /* PCI Advanced Features */
 #define PCI_CAP_LIST_NEXT	1	/* Next capability in the list */
 #define PCI_CAP_FLAGS		2	/* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF		4
@@ -315,6 +316,17 @@
 #define  PCI_CHSWP_EXT		0x40	/* ENUM# status - extraction */
 #define  PCI_CHSWP_INS		0x80	/* ENUM# status - insertion */
 
+/* PCI Advanced Feature registers */
+
+#define PCI_AF_LENGTH		2
+#define PCI_AF_CAP		3
+#define  PCI_AF_CAP_TP		0x01
+#define  PCI_AF_CAP_FLR		0x02
+#define PCI_AF_CTRL		4
+#define  PCI_AF_CTRL_FLR	0x01
+#define PCI_AF_STATUS		5
+#define  PCI_AF_STATUS_TP	0x01
+
 /* PCI-X registers */
 
 #define PCI_X_CMD		2	/* Modes & Features */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-16 10:28     ` Jean Guyader
@ 2011-06-24 13:32       ` Tim Deegan
  2011-06-30 10:08         ` Tim Deegan
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Deegan @ 2011-06-24 13:32 UTC (permalink / raw)
  To: Jean Guyader; +Cc: Wei Wang, xen-devel, Allen Kay

Hi, 

At 11:28 +0100 on 16 Jun (1308223697), Jean Guyader wrote:
> > Of course, even with this patch, my original question still stands:
> > should Xen do something more assertive in the IOMMU fault handler?
> 
> What we really want to achive here is to stop DMA on this device.
> One way of doing it is to perform a proper PCI reset (FLR, secondary
> bus reset, ...) when that happens.

I think that's more or less a consensus then, that we should try to
stop the device from the IOMMU fault handler.

Looking at your patch in a bit more detail, I see two things that worry
me.  The first is that the new pci_reset_device() function does nothing
at all if the device isn't one of the particular graphics cards it know
about!

The second is this comment: 

> +    /* Leave CMD MEMORY set otherwise the platform can crashe during FLR */
> +    pci_conf_write16(bus, d, f, PCI_COMMAND, 2);

which implies that my current approach of just disabling the card might
have pretty bad conequences.  Can you expand on that?  Would it be
better just to mask out PCI_COMMAND_MASTER?  And if I do that do I need
to try and issue a reset as well (i.e. are there cards that are known to
ignore this bit?)

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-24 13:32       ` Tim Deegan
@ 2011-06-30 10:08         ` Tim Deegan
  2011-06-30 10:31           ` Jean Guyader
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Deegan @ 2011-06-30 10:08 UTC (permalink / raw)
  To: Jean Guyader; +Cc: Wei Wang, xen-devel, Allen Kay

At 14:32 +0100 on 24 Jun (1308925934), Tim Deegan wrote:
> At 11:28 +0100 on 16 Jun (1308223697), Jean Guyader wrote:
> > > Of course, even with this patch, my original question still stands:
> > > should Xen do something more assertive in the IOMMU fault handler?
> > 
> > What we really want to achive here is to stop DMA on this device.
> > One way of doing it is to perform a proper PCI reset (FLR, secondary
> > bus reset, ...) when that happens.
> 
> I think that's more or less a consensus then, that we should try to
> stop the device from the IOMMU fault handler.
> 
> Looking at your patch in a bit more detail, I see two things that worry
> me.  The first is that the new pci_reset_device() function does nothing
> at all if the device isn't one of the particular graphics cards it know
> about!
> 
> The second is this comment: 
> 
> > +    /* Leave CMD MEMORY set otherwise the platform can crashe during FLR */
> > +    pci_conf_write16(bus, d, f, PCI_COMMAND, 2);
> 
> which implies that my current approach of just disabling the card might
> have pretty bad conequences.  Can you expand on that?  Would it be
> better just to mask out PCI_COMMAND_MASTER?  And if I do that do I need
> to try and issue a reset as well (i.e. are there cards that are known to
> ignore this bit?)

Ping?

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-30 10:08         ` Tim Deegan
@ 2011-06-30 10:31           ` Jean Guyader
  0 siblings, 0 replies; 12+ messages in thread
From: Jean Guyader @ 2011-06-30 10:31 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Wei Wang, xen-devel@lists.xensource.com, Allen Kay, Jean Guyader

On 30/06 11:08, Tim Deegan wrote:
> At 14:32 +0100 on 24 Jun (1308925934), Tim Deegan wrote:
> > At 11:28 +0100 on 16 Jun (1308223697), Jean Guyader wrote:
> > > > Of course, even with this patch, my original question still stands:
> > > > should Xen do something more assertive in the IOMMU fault handler?
> > > 
> > > What we really want to achive here is to stop DMA on this device.
> > > One way of doing it is to perform a proper PCI reset (FLR, secondary
> > > bus reset, ...) when that happens.
> > 
> > I think that's more or less a consensus then, that we should try to
> > stop the device from the IOMMU fault handler.
> > 
> > Looking at your patch in a bit more detail, I see two things that worry
> > me.  The first is that the new pci_reset_device() function does nothing
> > at all if the device isn't one of the particular graphics cards it know
> > about!
> > 

In our case the reset of other devices is done using the classic Xen toolstack
way in dom0. But the code could be easily changed to do a proper reset on all
the type of devices.


> > The second is this comment: 
> > 
> > > +    /* Leave CMD MEMORY set otherwise the platform can crashe during FLR */
> > > +    pci_conf_write16(bus, d, f, PCI_COMMAND, 2);
> > 
> > which implies that my current approach of just disabling the card might
> > have pretty bad conequences.  Can you expand on that?  Would it be
> > better just to mask out PCI_COMMAND_MASTER?  And if I do that do I need
> > to try and issue a reset as well (i.e. are there cards that are known to
> > ignore this bit?)

Agreed, masking out PCI_COMMAND_MASTER should be enough, Linux only do that.

Jean

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: IOMMU faults
  2011-06-16  9:25 IOMMU faults Tim Deegan
  2011-06-16  9:47 ` Jean Guyader
@ 2011-06-16 14:30 ` Wei Wang2
  2011-06-16 14:47   ` Konrad Rzeszutek Wilk
  2011-06-16 19:21 ` Kay, Allen M
  2 siblings, 1 reply; 12+ messages in thread
From: Wei Wang2 @ 2011-06-16 14:30 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel@lists.xensource.com, Allen Kay, Jean Guyader

Alberto BozzoOn Thursday 16 June 2011 11:25:09 Tim Deegan wrote:
> Hi, IOMMU maintainers,
>
> What should Xen do when an IOMMU fault happens?  As far as I can
> see both the AMD and Intel code clears the error in the IOMMU and
> carries on, but I suspect some more vigorous action is appropriate.
> I've seen traces from an Intel machine that seemed to be livelocked on
> IOMMU faults from a passed-through VGA card, until it was killed by the
> watchdog.  I think I can see two things that contribute to that:
>
>  - The Intel IOMMU fault handler prints quite a lot of info in interrupt
>    context, making it easier to livelock.  Still I think the general
>    problem applies on AMD too.

This info could still be useful for debugging, but we should only enable this 
for debug build. 

>  - Domain destruction re-assigns passed though cards to dom0, but the
>    cards don't seem to get reset.  So there's nothing to stop a card
>    battering away at DMA in the meantime.  That seems like a problem
>    independent of livelock, actually.

There should  be some FLR codes in tools (both xm and xl). But this might not 
work well with some devices...

> In any case, it seems like it would be a good idea to stop a
> broken/malicious/deassigned card from flooding Xen with IOMMU faults.

Yes, agree that. Actually I saw a lot could be improved in the fault handler. 
When iommu faults come from dma error, we should either stop the device from 
doing dma or inject errors to guest if the guest driver is able to handle io 
page fault.

> I was considering just writing 0 to the faulting card's PCI command
> register, but I'm told that's not always enough to properly deactivate
> a card, and it might be a little over-zealous to do it on the first
> offence.
> Ideas?
It seems difficult to find a generic approach to stop a device without knowing 
more device specific details... 

Thanks,
Wei
> Tim.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-16 14:30 ` Wei Wang2
@ 2011-06-16 14:47   ` Konrad Rzeszutek Wilk
  2011-06-17  8:08     ` Tim Deegan
  0 siblings, 1 reply; 12+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-06-16 14:47 UTC (permalink / raw)
  To: Wei Wang2
  Cc: xen-devel@lists.xensource.com, Allen Kay, Tim Deegan,
	Jean Guyader

> > I was considering just writing 0 to the faulting card's PCI command
> > register, but I'm told that's not always enough to properly deactivate
> > a card, and it might be a little over-zealous to do it on the first
> > offence.
> > Ideas?
> It seems difficult to find a generic approach to stop a device without knowing 
> more device specific details... 

Perhaps make something similar to the MCE fault interrupts? As in when the error
happens, the Dom0 is notified of the offending BDF and persuses whatever action
it thinks are neccessary. The action would be to tell the device driver to
turn itself off. But how it would interact with the driver.. Well how does Linux
deal with this today? Is there an extension to the device driver API (similar to
the power) to notify the driver that it has done bad things and to shut itself off?

Perhaps similar to the PCIe AER handling?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: IOMMU faults
  2011-06-16 14:47   ` Konrad Rzeszutek Wilk
@ 2011-06-17  8:08     ` Tim Deegan
  0 siblings, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-06-17  8:08 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Wang2, Jean, xen-devel@lists.xensource.com, Allen Kay,
	Guyader

At 10:47 -0400 on 16 Jun (1308221250), Konrad Rzeszutek Wilk wrote:
> Perhaps make something similar to the MCE fault interrupts? As in when
> the error happens, the Dom0 is notified of the offending BDF and
> persuses whatever action it thinks are neccessary. The action would be
> to tell the device driver to turn itself off. But how it would
> interact with the driver.. Well how does Linux deal with this today?
> Is there an extension to the device driver API (similar to the power)
> to notify the driver that it has done bad things and to shut itself
> off?

That sort of interface might be nice too, but I was worried more about
badly-behaved guests or devices.  In the livelock case the guest might
never get to run so can't do anything, and a malicious guest would just
ignore the message anyway.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: IOMMU faults
  2011-06-16  9:25 IOMMU faults Tim Deegan
  2011-06-16  9:47 ` Jean Guyader
  2011-06-16 14:30 ` Wei Wang2
@ 2011-06-16 19:21 ` Kay, Allen M
  2011-06-17  8:06   ` Tim Deegan
  2 siblings, 1 reply; 12+ messages in thread
From: Kay, Allen M @ 2011-06-16 19:21 UTC (permalink / raw)
  To: Tim Deegan, Wei Wang; +Cc: Jean, xen-devel@lists.xensource.com, Guyader

> - The Intel IOMMU fault handler prints quite a lot of info in interrupt
>   context, making it easier to livelock.  Still I think the general
>   problem applies on AMD too.

Someone at Intel looked into implementing measured rate printing in vt-d fault handler.  He encountered some complications.  I remember it had to do with measured rate printing not enabled by default (?).  For now, I think having it print out only for debug case sounds simple enough.  I will submit a patch for it.

> - Domain destruction re-assigns passed though cards to dom0, but the
>   cards don't seem to get reset.  So there's nothing to stop a card
>   battering away at DMA in the meantime.  That seems like a problem
>   independent of livelock, actually.

>From reading the code in libxl, it seems libxl__device_pci_reset() is called by both libxl__device_pci_add() and do_pci_remove().  Isn't do_pci_remove() called when the pass through device is reassigned to dom0 during a domain teardown?

Allen

-----Original Message-----
From: Tim Deegan [mailto:Tim.Deegan@citrix.com] 
Sent: Thursday, June 16, 2011 2:25 AM
To: Kay, Allen M; Wei Wang
Cc: xen-devel@lists.xensource.com; Jean Guyader
Subject: IOMMU faults

Hi, IOMMU maintainers,

What should Xen do when an IOMMU fault happens?  As far as I can
see both the AMD and Intel code clears the error in the IOMMU and
carries on, but I suspect some more vigorous action is appropriate.
I've seen traces from an Intel machine that seemed to be livelocked on
IOMMU faults from a passed-through VGA card, until it was killed by the
watchdog.  I think I can see two things that contribute to that:

 - The Intel IOMMU fault handler prints quite a lot of info in interrupt
   context, making it easier to livelock.  Still I think the general
   problem applies on AMD too.
 - Domain destruction re-assigns passed though cards to dom0, but the
   cards don't seem to get reset.  So there's nothing to stop a card
   battering away at DMA in the meantime.  That seems like a problem
   independent of livelock, actually.

In any case, it seems like it would be a good idea to stop a
broken/malicious/deassigned card from flooding Xen with IOMMU faults.

I was considering just writing 0 to the faulting card's PCI command
register, but I'm told that's not always enough to properly deactivate
a card, and it might be a little over-zealous to do it on the first
offence. 

Ideas?

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: RE: IOMMU faults
  2011-06-16 19:21 ` Kay, Allen M
@ 2011-06-17  8:06   ` Tim Deegan
  0 siblings, 0 replies; 12+ messages in thread
From: Tim Deegan @ 2011-06-17  8:06 UTC (permalink / raw)
  To: Kay, Allen M; +Cc: Wei Wang, Jean, xen-devel, Guyader

At 12:21 -0700 on 16 Jun (1308226873), Kay, Allen M wrote:
> > - The Intel IOMMU fault handler prints quite a lot of info in interrupt
> >   context, making it easier to livelock.  Still I think the general
> >   problem applies on AMD too.
> 
> Someone at Intel looked into implementing measured rate printing in vt-d fault handler.  He encountered some complications.  I remember it had to do with measured rate printing not enabled by default (?).  For now, I think having it print out only for debug case sounds simple enough.  I will submit a patch for it.
> 

That's great, thanks. 

> > - Domain destruction re-assigns passed though cards to dom0, but the
> >   cards don't seem to get reset.  So there's nothing to stop a card
> >   battering away at DMA in the meantime.  That seems like a problem
> >   independent of livelock, actually.
> 
> >From reading the code in libxl, it seems libxl__device_pci_reset() is called by both libxl__device_pci_add() and do_pci_remove().  Isn't do_pci_remove() called when the pass through device is reassigned to dom0 during a domain teardown?

Libxl could be too late, though.  When a domain is destroyed, its iommu
tables get torn down in Xen.  So if it has active devices:
 - they can start raising IOMMU faults immediately, and in some
   circumstances libxl might never get to run. 
 - since deassign is implemented as "assign to dom0" they might start 
   DMAing over dom0 memory.
If we can rely on the dom0 tools always completely resetting a domains's
devices before calling domctl_destroydomain, that should never happen.
That seems a bit fragile, though I guess dom0 can shoot itself in the foot in
enough other ways.   I prefer Jean's reset-in-xen approach; it's only a few
hundred lines of code and we could reuse some of it for resetting
badly-behaved cards from the IOMMU fault handler. 

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Xen Platform Team
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-06-30 10:31 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-16  9:25 IOMMU faults Tim Deegan
2011-06-16  9:47 ` Jean Guyader
2011-06-16 10:07   ` Tim Deegan
2011-06-16 10:28     ` Jean Guyader
2011-06-24 13:32       ` Tim Deegan
2011-06-30 10:08         ` Tim Deegan
2011-06-30 10:31           ` Jean Guyader
2011-06-16 14:30 ` Wei Wang2
2011-06-16 14:47   ` Konrad Rzeszutek Wilk
2011-06-17  8:08     ` Tim Deegan
2011-06-16 19:21 ` Kay, Allen M
2011-06-17  8:06   ` Tim Deegan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.