All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: gordan@bobich.net, xen-devel@lists.xensource.com
Subject: Dealing with non-existent BDF devices in VT-d and in the hardware.
Date: Tue, 11 Mar 2014 13:30:39 -0400	[thread overview]
Message-ID: <20140311173039.GB14684@phenom.dumpdata.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 2558 bytes --]

Hey,

I am one of those lucky folks who had purchased a motherboard that has bugs.

I figured I would post this email as way for a starting point
for some discussion on this - and perhaps have a similar as 'pci-phantom'
way of instructing the hypervisor what to do with them.

The problem I am seeing is that this device:

08:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22A IEEE-1394a-2000 Controller (PHY/Link) [iOHCI-Lynx]

Can't be passed in the guest. Or rather it can - but everytime
the guest (or domain0) tries to access I see:

(XEN) [VT-D]iommu.c:885: iommu_fault_status: Fault Overflow
(XEN) [VT-D]iommu.c:887: iommu_fault_status: Primary Pending Fault
(XEN) [VT-D]iommu.c:865: DMAR:[DMA Write] Request device [0000:08:00.0] fault addr 0, iommu reg = ffff82c3ffd53000
(XEN) DMAR:[fault reason 02h] Present bit in context entry is clear
(XEN) print_vtd_entries: iommu ffff83043dca99b0 dev 0000:08:00.0 gmfn 0
(XEN)     root_entry = ffff83043dc6b000
(XEN)     root_entry[8] = 3326b5001
(XEN)     context = ffff8303326b5000
(XEN)     context[0] = 0_0
(XEN)     ctxt_entry[0] not present


Of course the '08:00.0' device does not exist. It is rather this chipset:
07:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01)

that is buggy and using the wrong BDF when forwarding DMA requests from
devices underneath it (like this Firewire chip).

The hack I came up with was to create in the Xen code that deals with
PCI passthrough a copy of the bridge (so 07:00.0) but with a new
BDF: 08:00.0. And link it to the PCI device that I am passing to the
guest (so 08:03.0).

The end result is that when loading the driver (hack.c) one should
see:

(XEN) 0000:08:00.0 linked with 08:03.0
(XEN) [VT-D]iommu.c:1456: d0:PCI: map 0000:08:00.0
(XEN) [VT-D]iommu.c:1476: d0:PCI: map 0000:08:03.0
(XEN) PCI add link 0000:08:00.0

And when launching a guest with the BDF:
pci = ["08:03.0"]

the hypervisor will automatically also create an VT-d context for the
08:00.0 device.

To use this hack, apply the 0001-xen-pci-Introduce-a-way-to-deal-with-buggy-hardware-.patch
to your hypervisor, compile and install.

And also compile the 'hack.c' module. There is an attached 'Makefile'
that will do it for you. Make sure you edit it to set the right BDF
entries in it.

Once done install your new hypervisor, and insmod ./hack.ko and try
passing in the device to your guest (or use it normally). The
'DMAR:[DMA Write]' error should go away.

This should be generic enough for most devices. It needn't be a bridge
that is spewing out these DMAR errors.

[-- Attachment #2: 0001-xen-pci-Introduce-a-way-to-deal-with-buggy-hardware-.patch --]
[-- Type: text/plain, Size: 11683 bytes --]

>From cb165429726978952f5b9e75bece1dcb5630667f Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 19 Feb 2014 10:58:19 -0500
Subject: [PATCH] xen/pci: Introduce a way to deal with buggy hardware with
 "hidden" PCI buses.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/physdev.c              | 14 +++++-
 xen/drivers/passthrough/pci.c       | 89 +++++++++++++++++++++++++++++++++----
 xen/drivers/passthrough/vtd/iommu.c | 31 +++++++++++++
 xen/include/public/physdev.h        |  1 +
 xen/include/xen/pci.h               |  3 ++
 5 files changed, 128 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index bc0634c..f843c49 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -609,7 +609,11 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( copy_from_guest(&add, arg, 1) != 0 )
             break;
 
-        pdev_info.is_extfn = !!(add.flags & XEN_PCI_DEV_EXTFN);
+        if ( add.flags & XEN_PCI_DEV_EXTFN)
+            pdev_info.is_extfn = 1;
+        else
+            pdev_info.is_extfn = 0;
+
         if ( add.flags & XEN_PCI_DEV_VIRTFN )
         {
             pdev_info.is_virtfn = 1;
@@ -618,6 +622,14 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         }
         else
             pdev_info.is_virtfn = 0;
+
+        if ( add.flags & XEN_PCI_DEV_LINK )
+        {
+            pdev_info.is_link = 1;
+            pdev_info.physfn.bus = add.physfn.bus;
+            pdev_info.physfn.devfn = add.physfn.devfn;
+        } else
+            pdev_info.is_link = 0;
         ret = pci_add_device(add.seg, add.bus, add.devfn, &pdev_info);
         break;
     }
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2a6eaa4..0e59216 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -153,7 +153,8 @@ static void __init parse_phantom_dev(char *str) {
 }
 custom_param("pci-phantom", parse_phantom_dev);
 
-static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
+static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn,
+                                  int link, u8 orig_bus, u8 orig_devfn)
 {
     struct pci_dev *pdev;
 
@@ -169,8 +170,38 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->link = NULL;
     INIT_LIST_HEAD(&pdev->msi_list);
 
+    if ( link )
+    {
+        struct pci_dev *dev;
+        list_for_each_entry ( dev, &pseg->alldevs_list, alldevs_list )
+        {
+            if ( dev->bus == orig_bus && dev->devfn == orig_devfn )
+            {
+                /* N.B. The 'bus' passed is 'new' one, while 'orig_bus' are
+                 * the ones we expect to exist. We over-write 'bus' and
+                 * 'devfn' with the original one so that this new device
+                 * will be created with the original device properties.
+                 */
+                if ( dev->link )
+                {
+                    xfree (pdev);
+                    return NULL;
+                }
+                bus = dev->bus;
+                devfn = dev->devfn;
+                dev->link = pdev;
+                pdev->link = dev;
+                pdev->info.is_link = 1;
+                printk("%04x:%02x:%02x.%u linked with %02x:%02x.%u\n",
+                       pseg->nr, pdev->bus, PCI_SLOT(pdev->devfn),
+                       PCI_FUNC(pdev->devfn), dev->bus, PCI_SLOT(dev->devfn),
+                       PCI_FUNC(dev->devfn));
+            }
+        }
+    }
     if ( pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                              PCI_CAP_ID_MSIX) )
     {
@@ -201,12 +232,32 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
             sub_bus = pci_conf_read8(pseg->nr, bus, PCI_SLOT(devfn),
                                      PCI_FUNC(devfn), PCI_SUBORDINATE_BUS);
 
+            if ( pdev->info.is_link )
+            {
+                if ( sec_bus >= pdev->bus && pdev->bus <= sub_bus )
+                {
+#if 0
+                    u8 i = sec_bus;
+                    /* We can create an loop in bus2bridge by pointing to ourselves.
+                     * Hence destroy sec_bus up to pdev_bus values */
+                    spin_lock(&pseg->bus2bridge_lock);
+                    for ( ; i <= pdev->bus; i++ )
+                        pseg->bus2bridge[sec_bus].map = 0;
+                    spin_unlock(&pseg->bus2bridge_lock);
+                    /* And increment it so it won't cover us again*/
+                    sec_bus = pdev->bus + 1;
+                    printk("Link corrected [%02x:%02x:%u] spanning %x->%x\n", pdev->bus,
+                           PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),sec_bus, sub_bus);
+#endif
+                    break;
+                }
+            }
             spin_lock(&pseg->bus2bridge_lock);
             for ( ; sec_bus <= sub_bus; sec_bus++ )
             {
                 pseg->bus2bridge[sec_bus].map = 1;
-                pseg->bus2bridge[sec_bus].bus = bus;
-                pseg->bus2bridge[sec_bus].devfn = devfn;
+                pseg->bus2bridge[sec_bus].bus = pdev->bus;
+                pseg->bus2bridge[sec_bus].devfn = pdev->devfn;
             }
             spin_unlock(&pseg->bus2bridge_lock);
             break;
@@ -299,7 +350,7 @@ int __init pci_hide_device(int bus, int devfn)
     int rc = -ENOMEM;
 
     spin_lock(&pcidevs_lock);
-    pdev = alloc_pdev(get_pseg(0), bus, devfn);
+    pdev = alloc_pdev(get_pseg(0), bus, devfn, 0, 0, 0);
     if ( pdev )
     {
         _pci_hide_device(pdev);
@@ -317,7 +368,7 @@ int __init pci_ro_device(int seg, int bus, int devfn)
 
     if ( !pseg )
         return -ENOMEM;
-    pdev = alloc_pdev(pseg, bus, devfn);
+    pdev = alloc_pdev(pseg, bus, devfn, 0, 0, 0);
     if ( !pdev )
         return -ENOMEM;
 
@@ -458,6 +509,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *info)
     struct pci_seg *pseg;
     struct pci_dev *pdev;
     unsigned int slot = PCI_SLOT(devfn), func = PCI_FUNC(devfn);
+    u8 bus_link = 0, devfn_link = 0;
     const char *pdev_type;
     int ret;
 
@@ -474,6 +526,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *info)
             pci_add_device(seg, info->physfn.bus, info->physfn.devfn, NULL);
         pdev_type = "virtual function";
     }
+    else if (info->is_link)
+    {
+        bus_link = info->physfn.bus;
+        devfn_link = info->physfn.devfn;
+        pdev_type = "link";
+    }
     else
     {
         info = NULL;
@@ -490,7 +548,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *info)
     pseg = alloc_pseg(seg);
     if ( !pseg )
         goto out;
-    pdev = alloc_pdev(pseg, bus, devfn);
+    pdev = alloc_pdev(pseg, bus, devfn, (info && info->is_link), bus_link, devfn_link);
     if ( !pdev )
         goto out;
 
@@ -604,7 +662,7 @@ out:
 int pci_remove_device(u16 seg, u8 bus, u8 devfn)
 {
     struct pci_seg *pseg = get_pseg(seg);
-    struct pci_dev *pdev;
+    struct pci_dev *pdev, *link = NULL;
     int ret;
 
     ret = xsm_resource_unplug_pci(XSM_PRIV, (seg << 16) | (bus << 8) | devfn);
@@ -617,16 +675,29 @@ int pci_remove_device(u16 seg, u8 bus, u8 devfn)
         return -ENODEV;
 
     spin_lock(&pcidevs_lock);
+retry:
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
         if ( pdev->bus == bus && pdev->devfn == devfn )
         {
             ret = iommu_remove_device(pdev);
             if ( pdev->domain )
                 list_del(&pdev->domain_list);
-            pci_cleanup_msi(pdev);
+            if ( !pdev->info.is_link ) /* If we are not the 'fake device' */
+                pci_cleanup_msi(pdev);
+            if ( pdev->link ) {
+                /* Can be NULL if the other device was removed first. */
+                link = pdev->link;
+            }
             free_pdev(pseg, pdev);
             printk(XENLOG_DEBUG "PCI remove device %04x:%02x:%02x.%u\n",
                    seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+            if ( link )
+            {
+                bus = link->bus;
+                devfn = link->devfn;
+                link->link = NULL;
+                goto retry;
+            }
             break;
         }
 
@@ -838,7 +909,7 @@ static int __init _scan_pci_devices(struct pci_seg *pseg, void *arg)
                     continue;
                 }
 
-                pdev = alloc_pdev(pseg, bus, PCI_DEVFN(dev, func));
+                pdev = alloc_pdev(pseg, bus, PCI_DEVFN(dev, func), 0, 0, 0);
                 if ( !pdev )
                 {
                     printk("%s: alloc_pdev failed.\n", __func__);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 5f10034..a5a4664 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1468,6 +1468,25 @@ static int domain_context_mapping(
         if ( ret )
             break;
 
+        if ( pdev->link )
+        {
+            u8 bus_link, devfn_link;
+            struct pci_dev *link_dev = pdev->link;
+
+            ASSERT ( link_dev );
+
+            bus_link = link_dev->bus;
+            devfn_link = link_dev->devfn;
+
+            if ( iommu_verbose )
+                dprintk(VTDPREFIX, "d%d:PCI: map %04x:%02x:%02x.%u\n",
+                        domain->domain_id, seg, bus_link,
+                        PCI_SLOT(devfn_link), PCI_FUNC(devfn_link));
+
+
+            ret = domain_context_mapping_one(domain, drhd->iommu, bus_link, devfn_link,
+                                             pci_get_pdev(seg, bus, devfn));
+        }
         if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 )
             break;
 
@@ -1603,6 +1622,18 @@ static int domain_context_unmap(
         if ( ret )
             break;
 
+        if ( pdev->link )
+        {
+            struct pci_dev *link = pdev->link;
+
+            ASSERT(link->link == pdev);
+            tmp_bus = link->bus;
+            tmp_devfn = link->devfn;
+            if ( iommu_verbose )
+                dprintk(VTDPREFIX, "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
+                        domain->domain_id, seg, tmp_bus, PCI_SLOT(tmp_devfn), PCI_FUNC(tmp_devfn));
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+        }
         tmp_bus = bus;
         tmp_devfn = devfn;
         if ( find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, &secbus) < 1 )
diff --git a/xen/include/public/physdev.h b/xen/include/public/physdev.h
index d547928..c476175 100644
--- a/xen/include/public/physdev.h
+++ b/xen/include/public/physdev.h
@@ -281,6 +281,7 @@ DEFINE_XEN_GUEST_HANDLE(physdev_pci_mmcfg_reserved_t);
 #define XEN_PCI_DEV_EXTFN              0x1
 #define XEN_PCI_DEV_VIRTFN             0x2
 #define XEN_PCI_DEV_PXM                0x4
+#define XEN_PCI_DEV_LINK               0x8
 
 #define PHYSDEVOP_pci_device_add        25
 struct physdev_pci_device_add {
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index cadb525..b883c28 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -39,6 +39,7 @@ struct pci_dev_info {
         u8 bus;
         u8 devfn;
     } physfn;
+    bool_t is_link;
 };
 
 struct pci_dev {
@@ -75,6 +76,8 @@ struct pci_dev {
 #define PT_FAULT_THRESHOLD 10
     } fault;
     u64 vf_rlen[6];
+
+    struct pci_dev *link;
 };
 
 #define for_each_pdev(domain, pdev) \
-- 
1.8.5.3


[-- Attachment #3: hack.c --]
[-- Type: text/plain, Size: 2052 bytes --]


#include <linux/module.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/stat.h>
#include <linux/err.h>
#include <linux/ctype.h>
#include <linux/slab.h>
#include <linux/limits.h>
#include <linux/device.h>
#include <linux/pci.h>
#include <linux/device.h>

#include <linux/pci.h>

#include <xen/interface/xen.h>
#include <xen/interface/physdev.h>

#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>

#define LSI_HACK  "0.1"

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>");
MODULE_DESCRIPTION("lsi hack");
MODULE_LICENSE("GPL");
MODULE_VERSION(LSI_HACK);

/* Want to link to the device passed in the guest */
int bus = 8;
module_param(bus, int, 0644);
MODULE_PARM_DESC(bus, "bus");
int slot = 3;
module_param(slot, int, 0644);
MODULE_PARM_DESC(slot, "slot");
int func = 0;
module_param(func, int, 0644);
MODULE_PARM_DESC(func, "slot");

#define XEN_PCI_DEV_LINK 0x8
static int __init lsi_hack_init(void)
{
        int r = 0;

        struct physdev_pci_device_add add = {
			.seg	= 0,
                        .bus    = 0x8, /* The phantom bridge */
                        .devfn  = PCI_DEVFN(0,0), /* And its slot and function */
			.physfn.bus	= bus, /* The device we want to link too. */
			.physfn.devfn	= PCI_DEVFN(slot,func),
			.flags = XEN_PCI_DEV_LINK,
                };
	printk("%s: %02x:%02x.%u, %02x:%02x.%u, %x\n",
		__func__, add.bus, PCI_SLOT(add.devfn),
		PCI_FUNC(add.devfn), add.physfn.bus,
		PCI_SLOT(add.physfn.devfn), PCI_FUNC(add.physfn.devfn),
		add.flags);
        r = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);

        return r;
}

static void __exit lsi_hack_exit(void)
{
        int r = 0;
        struct physdev_manage_pci manage_pci;

        manage_pci.bus = 0x7;
        manage_pci.devfn = PCI_DEVFN(0,0);

        r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove,
                &manage_pci);
        if (r)
                printk(KERN_ERR "%s: %d\n", __FUNCTION__, r);
}

module_init(lsi_hack_init);
module_exit(lsi_hack_exit);

[-- Attachment #4: Makefile --]
[-- Type: text/plain, Size: 700 bytes --]

# Comment/uncomment the following line to disable/enable debugging
#DEBUG = y

# Add your debugging flag (or not) to CFLAGS
ifeq ($(DEBUG),y)
  DEBFLAGS = -O -g # "-O" is needed to expand inlines
else
  DEBFLAGS = -O2
endif

EXTRA_CFLAGS += $(DEBFLAGS) -I$(LDDINCDIR)

ifneq ($(KERNELRELEASE),)
# call from kernel build system

obj-m   := hack.o

else

KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD       := $(shell pwd)

default:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) LDDINCDIR=$(PWD)/../include modules

endif

clean:
	rm -rf *.o *~ core .depend .*.cmd *.ko *.mod.c .tmp_versions

depend .depend dep:
	$(CC) $(CFLAGS) -M *.c > .depend


ifeq (.depend,$(wildcard .depend))
include .depend
endif

[-- Attachment #5: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

             reply	other threads:[~2014-03-11 17:30 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-11 17:30 Konrad Rzeszutek Wilk [this message]
2014-03-11 17:36 ` Dealing with non-existent BDF devices in VT-d and in the hardware Andrew Cooper
2014-03-11 17:49   ` Konrad Rzeszutek Wilk
2014-03-14  2:18     ` Zhang, Yang Z
2014-03-14 17:51       ` Konrad Rzeszutek Wilk
2014-03-17  1:03         ` Zhang, Yang Z
2014-03-17 20:00           ` Konrad Rzeszutek Wilk
2014-03-19  0:32             ` Zhang, Yang Z
2014-03-19 12:57               ` Konrad Rzeszutek Wilk
2014-03-19 14:24                 ` Jan Beulich
2014-03-20  0:48                   ` Zhang, Yang Z
2014-03-20  7:14                     ` Gordan Bobic
2014-03-20 10:04                       ` Jan Beulich
2014-03-20  9:58                     ` Jan Beulich
2014-03-24  2:37                       ` Zhang, Yang Z
2014-03-24  7:25                         ` Jan Beulich
2014-03-12  9:17 ` Jan Beulich
2014-03-12 14:22   ` Konrad Rzeszutek Wilk
2014-03-12 17:10     ` Gordan Bobic
  -- strict thread matches above, loose matches on Subject: below --
2014-03-20  1:34 Konrad Rzeszutek Wilk
2014-03-20 10:02 ` Jan Beulich
2014-03-20 19:44   ` Konrad Rzeszutek Wilk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140311173039.GB14684@phenom.dumpdata.com \
    --to=konrad.wilk@oracle.com \
    --cc=gordan@bobich.net \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.