From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: Is: pci=assign-busses blows up Xen 4.4 Was:Re: [PATCH] x86/msi: Validate the guest-identified PCI devices in pci_prepare_msix() Date: Wed, 5 Feb 2014 15:07:08 -0500 Message-ID: <20140205200708.GA9278@phenom.dumpdata.com> References: <20140122043128.GA9931@konrad-lan.dumpdata.com> <52DFA2200200007800115B70@nat28.tlf.novell.com> <52DF9D46.7030904@citrix.com> <52DFC2DA0200007800115C79@nat28.tlf.novell.com> <20140122214034.GB9460@phenom.dumpdata.com> <52E0DFBB0200007800116041@nat28.tlf.novell.com> <20140124150128.GF12946@phenom.dumpdata.com> <52E2A0930200007800116CAE@nat28.tlf.novell.com> <20140124174349.GA15472@phenom.dumpdata.com> <20140124215652.GA18710@phenom.dumpdata.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="vkogqOf2sHV7VnPd" Return-path: Content-Disposition: inline In-Reply-To: <20140124215652.GA18710@phenom.dumpdata.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Jan Beulich Cc: George Dunlap , Andrew Cooper , Xen-devel List-Id: xen-devel@lists.xenproject.org --vkogqOf2sHV7VnPd Content-Type: text/plain; charset=us-ascii Content-Disposition: inline > And sure enough if I boot Xen without 'pci=assign-busses' it works just > fine. > > Ugh. > > I wonder how Xen 4.3 would actually do the PCI passthrough - it booted with > the 'assign-busses' - but I hadn't tried to do PCI passthrough of the > PF device (the I210). I did not work very well. Especially with PCI devices. > > If do pass in '05:00.0' (new bus number) I wonder if it will use IOMMU context > with whatever '05:00.0' was _before_ the bus re-assigment aka: > > 05:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01) (prog-if 01 [Subtractive decode]) > Flags: bus master, fast devsel, latency 0 > Bus: primary=05, secondary=06, subordinate=07, sec-latency=32 > Memory behind bridge: f1500000-f16fffff > Capabilities: [60] Subsystem: Super Micro Computer Inc Device 0805 > Capabilities: [a0] Power Management version 3 > > Which I think would confuse Xen as this is clearly labeled as bridge > not a PCI device. > > > The reason for me using 'pci=assign-busses' is that it looks to be > the only option to use SR-IOV. .. on this particular motherboard. > > Which I suppose makes sense as it tries to create VFs right after its own bus id: > > > +-01.1-[02-03]--+-[0000:03]-+-10.0 Intel Corporation 82576 Virtual Function > | | +-10.1 Intel Corporation 82576 Virtual Function > | | +-10.2 Intel Corporation 82576 Virtual Function > | | +-10.3 Intel Corporation 82576 Virtual Function > | | +-10.4 Intel Corporation 82576 Virtual Function > | | +-10.5 Intel Corporation 82576 Virtual Function > | | +-10.6 Intel Corporation 82576 Virtual Function > | | +-10.7 Intel Corporation 82576 Virtual Function > | | +-11.0 Intel Corporation 82576 Virtual Function > | | +-11.1 Intel Corporation 82576 Virtual Function > | | +-11.2 Intel Corporation 82576 Virtual Function > | | +-11.3 Intel Corporation 82576 Virtual Function > | | +-11.4 Intel Corporation 82576 Virtual Function > | | \-11.5 Intel Corporation 82576 Virtual Function > | \-[0000:02]-+-00.0 Intel Corporation 82576 Gigabit Network Connection > | \-00.1 Intel Corporation 82576 Gigabit Network Connection > > > But why does it have to have the bus _right_ after its own? Can't it > use one at the end of the its bus-space? The bus is after it is occupied > by another card (if I boot without 'pci=assign-busses'). Because you need to program the bridge to accept the bus requests for the PF and VF bus numbers. And hence the need to program it in the bridge to span more bus numbers. > > I do recall using this particular SR-IOV card on a different hardware > a year ago or so. And it did work. I think that might be because > there were no PCI cards _after_ the SR-IOV card. It was because it was a motherboard with an SRIOV aware BIOS. And a server one while this is more geared towards .. budget-servers? Anyhow, what I discovered was that the patch attached does allow me to boot with Xen. It is not pretty. But I was thinking to fix in the hypervisor and realized there are three ways of fixing it: 1). Do the hypercall to delete/add devices and let initial domain figure this out. (which the Linux attached patch does). 2). Be more aware of the bus2bridge topology when removing a PCI bridge or device. I had one bug where we ended up with this bus2bridge structure: 6 -> 06:00.0 7 -> 06:00.0 8 -> 07:01.0 Which meant that for devices on bus 8, 7 and 6 we would never find the upstream bridge. The reason is that 6 -> 06 points to itself so find_upstream_bridge ends up looping 255 times around and returns -1. Oddly enough the 06:00.0 device does get removed and then added (Via the hypercalls) and the reason for the bus2bridge having stale information is that it copies the data but does not invalidate that. I am not entirely sure I undertand why we do that. In 'free_pdev' we do this: for ( ; sec_bus <= sub_bus; sec_bus++ ) pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; and then: list_del(&pdev->alldevs_list); xfree(pdev->msix); xfree(pdev); so if the device that is being deleted is the bridge - we point the secondary and subordinate to the deleted device. But if the deleted device is the upstream bridge we end up leaving a stale bus2bridge context. That is OK normally as 'alloc_pdev' would over-write it (if the secondary and subordinate did not change). But in 'assign-busses' case they change so we are left with an 'stale' one. This means when the same device is added (but with a new bus value) we end fixing up the secondary to subordinate busses to point to us (06). But '06' which used to be a secondary bus, still retains the old value. One way to fix this is to detect it and correct: spin_lock(&pseg->bus2bridge_lock); for ( ; sec_bus <= sub_bus; sec_bus++ ) pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus]; /* Check for infinite recursion where bus2bridge would point to * itself, aka: * 6-> 06:00.0 * 7-> 06:00.0 * and we are removing 06:00.0, but may have 07:00.0 devices. * We invalidate the 6 as the upstream bridge is effectively * removed. We cannot remove the 07 as the 06:00.0 might be * added right back in. */ if ( pseg->bus2bridge[pdev->bus].map ) { u8 bus, devfn; bus = pdev->bus; if ( __find_upstream_bridge( pseg->nr, &bus, &devfn, &sec_bus, 1 ) < 0 ) { /* Recursion detected! Invalidate ourselves. */ printk("%04x:%02x:%02x.%u recursed clearing it", pseg->nr, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); if (pdev->bus != bus || devfn != pdev->devfn) printk(" %02x:%02x supplied", pdev->bus, pdev->devfn); PCI_SLOT(pseg->bus2bridge[i].devfn), PCI_FUNC(pseg->bus2bridge[i].devfn)); pseg->bus2bridge[pdev->bus].map = 0; } } spin_unlock(&pseg->bus2bridge_lock); But I am not sure if that is the right way of doing it. Anyhow there was another assumption made in which 'assign-busses' crippled Xen (see second attachment). 3). Trap on PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS writes and fixup the structures. I hadn't attempted that but that could also be done. That way Xen is aware of those changes and can update its PCI structures. Thoughts? --vkogqOf2sHV7VnPd Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-DRAFT-xen-pci-Re-add-all-PCI-devices-if-pci-assign-b.patch" >>From bee45c2613b1f827e2610d7f8d06989f3cd76907 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk Date: Wed, 5 Feb 2014 14:26:56 -0500 Subject: [PATCH] DRAFT xen/pci: Re-add all PCI devices if pci=assign-busses is used. That parameter wreaks havoc with Xen hypervisor. Its internal structures end up being confused such that 'upstream bridge' information is lost. As such, this patch re-programs the Xen hypervisor's PCI devices. It does it in three steps: 1). Before 'acpi_init' (which parses the ACPI DSDT for PCI devices) in register_xen_pci_notifier we collect all of the PCI devices BDFs that are active. 2). When 'acpi_init' has finished and has reprogrammed the bus numbers, we intersect the list of all of the PCI devices that Linux knows with the list we created in step 1). The result is an array of BDFs which are orphaned - meaning they are not present on the machine any more - but Xen hypervisor is still holding on to them - because Linux has not made the 'xen_remove_device' call on them. The reason for that is explained later in this description[*1]. With the list of orphaned PCI devices and the ones we have added - we make the hypercall to remove all the orphaned ones and all the ones that were added. At this stage Xen has no knowledge of any PCI devices. 3). We all of the PCI devices that Linux knows about. This way the view from Linux and Xen is synced when it comes to the PCI devices. [*1]. Linux seperates the PCI devices from PCI bridges in two structures. That means that PCI devices know their slot and function number. While the bus structure keeps track of the bus number. This seperation allows Linux to expand the bridge to span more bus numbers and the changes are only updated in the PCI bus structures. The PCI devices are oblivious to this. Also the notifier call chain is only executed when a PCI device is added - and since this is during early bootup - the notifier is not used to 'delete' the devices that might have existed with the old bus numbers - because Linux hasn't gotten to enumerate them. With this patch, pci=assign-busses works with Xen hypervisors. Signed-off-by: Konrad Rzeszutek Wilk --- drivers/xen/pci.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 117 insertions(+), 1 deletions(-) diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c index dd9c249..178de97 100644 --- a/drivers/xen/pci.c +++ b/drivers/xen/pci.c @@ -186,12 +186,104 @@ static struct notifier_block device_nb = { .notifier_call = xen_pci_notifier, }; +#include +static void __init walk_bus(struct pci_bus *bus, int (*fnc)(struct device *dev)) +{ + struct pci_dev *dev; + struct pci_bus *child; + + list_for_each_entry(dev, &bus->devices, bus_list) { + if (dev->subordinate) + continue; /* Scan bridges in the next loop */ + (void)fnc(&dev->dev); + } + list_for_each_entry(child, &bus->children, node) { + dev = child->self; + if (dev) + (void)fnc(&dev->dev); + walk_bus(child, fnc); + } +} +static void __init walk_tree(int (*fnc)(struct device *dev)) +{ + struct pci_bus *bus; + + down_read(&pci_bus_sem); + list_for_each_entry(bus, &pci_root_buses, node) + walk_bus(bus, fnc); + up_read(&pci_bus_sem); +} + +#include + +#define PCI_BUS(bdf) (((bdf) >> 8) & 0xff) +#define PCI_BDF(b,d,f) ((((b) & 0xff) << 8) | PCI_DEVFN(d,f)) +#define PCI_DEVFN2(bdf) ((bdf) & 0xff) +#define PCI_BDF2(b,df) ((((b) & 0xff) << 8) | ((df) & 0xff)) +static unsigned long __initdata *pci_devs; + +static void __init check_device(int bus, int slot, int func) +{ + u16 class; + + class = read_pci_config(bus, slot, func, PCI_CLASS_REVISION); + if (class == 0xffff) + return; + + set_bit(PCI_BDF(bus, slot, func), pci_devs); +} +static int __init xen_prune_pci_devs(struct device *dev) +{ + struct pci_dev *pci_dev = to_pci_dev(dev); + u16 busdevfn; + + busdevfn = PCI_BDF2(pci_dev->bus->number, pci_dev->devfn); + if (test_bit(busdevfn, pci_devs)) /* If present it is not orphaned */ + clear_bit(busdevfn, pci_devs); + return 0; +} +static void __init xen_delete_orphaned_pci_devs(void) +{ + struct physdev_manage_pci manage_pci; + unsigned int i; + + for_each_set_bit(i, pci_devs, PCI_BDF(-1, -1, -1) + 1) { + manage_pci.bus = PCI_BUS(i); + manage_pci.devfn = PCI_DEVFN2(i); + (void)HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove, + &manage_pci); + } +} + static int __init register_xen_pci_notifier(void) { + int bus, slot, func, rc = 0; + if (!xen_initial_domain()) return 0; - return bus_register_notifier(&pci_bus_type, &device_nb); + rc = bus_register_notifier(&pci_bus_type, &device_nb); + + if (!pcibios_assign_all_busses()) + return rc; + + if (!early_pci_allowed()) + return rc; + + /* 64K bits needed - we will revisit it in xen_pci_refresh */ + pci_devs = kcalloc(BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1), sizeof(unsigned long), GFP_KERNEL); + if (!pci_devs) + return rc; + + /* Poor man's PCI discovery */ + for (bus = 0; bus < 256; bus++) { + for (slot = 0; slot < 32; slot++) { + for (func = 0; func < 8; func++) { + check_device(bus, slot, func); + } + } + } + return rc; } arch_initcall(register_xen_pci_notifier); @@ -241,3 +333,27 @@ static int __init xen_mcfg_late(void) */ subsys_initcall_sync(xen_mcfg_late); #endif + +static int __init xen_pci_refresh(void) +{ + if (!xen_initial_domain()) + return 0; + + if (!pcibios_assign_all_busses()) + return 0; + + /* Update the list - so that we only have orphaned devices. */ + walk_tree(&xen_prune_pci_devs); + + /* Remove orphaned devices. */ + xen_delete_orphaned_pci_devs(); + /* Remove all existing ones */ + walk_tree(&xen_remove_device); + + /* Now the hypervisor has no PCI devices, so lets add them in */ + walk_tree(&xen_add_device); + + kfree(pci_devs); + return 0; +} +subsys_initcall_sync(xen_pci_refresh); -- 1.7.7.6 --vkogqOf2sHV7VnPd Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-pci-Don-t-assume-the-removed-device-is-a-bridge.patch" >>From 76dc10b829f3beebd23c0c99dd653e50b429c5bd Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk Date: Tue, 4 Feb 2014 11:48:16 -0500 Subject: [PATCH] pci: Don't assume the removed device is a bridge. When we are instructed to remove a PCI device it is usally done from the initial domain via PHYSDEVOP_pci_device_remove or PHYSDEVOP_manage_pci_remove. That is OK except in the case where the initial domain has re-programmed the PCI bridges with a new PCI_SUBORDINATE_BUS value causing the bus number to change. That means a device that had been addressed via say this BDF: 06:00.0 is now addressed via 09:00.0. Now assume that the device that is being deleted is a bridge and it used to be 06:00.0 - but since the bus numbers are different any reads done on the PCI_SUBORDINATE_BUS can return bogus values (as we are now addressing a completetly new device). To guard against that we save away the subordinate and secondary bus numbers the first time the device is introduced. Then when the device is deleted we use those values instead of reading from the PCI device. Signed-off-by: Konrad Rzeszutek Wilk --- xen/drivers/passthrough/pci.c | 8 ++++---- xen/include/xen/pci.h | 4 ++++ 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index 6152370..4e73427 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -201,6 +201,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn) sub_bus = pci_conf_read8(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), PCI_SUBORDINATE_BUS); + pdev->info.bus.sec = sec_bus; + pdev->info.bus.sub = sub_bus; spin_lock(&pseg->bus2bridge_lock); for ( ; sec_bus <= sub_bus; sec_bus++ ) { @@ -265,10 +267,8 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev) case DEV_TYPE_LEGACY_PCI_BRIDGE: dev = PCI_SLOT(pdev->devfn); func = PCI_FUNC(pdev->devfn); - sec_bus = pci_conf_read8(pseg->nr, pdev->bus, dev, func, - PCI_SECONDARY_BUS); - sub_bus = pci_conf_read8(pseg->nr, pdev->bus, dev, func, - PCI_SUBORDINATE_BUS); + sec_bus = pdev->info.bus.sec; + sub_bus = pdev->info.bus.sub; spin_lock(&pseg->bus2bridge_lock); for ( ; sec_bus <= sub_bus; sec_bus++ ) diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h index cadb525..c3f6ee4 100644 --- a/xen/include/xen/pci.h +++ b/xen/include/xen/pci.h @@ -39,6 +39,10 @@ struct pci_dev_info { u8 bus; u8 devfn; } physfn; + struct { + u8 sec; + u8 sub; + } bus; /* Only set if device is a bridge */ }; struct pci_dev { -- 1.7.7.6 --vkogqOf2sHV7VnPd Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --vkogqOf2sHV7VnPd--