From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
Xen-devel <xen-devel@lists.xen.org>
Subject: Re: Is: pci=assign-busses blows up Xen 4.4 Was:Re: [PATCH] x86/msi: Validate the guest-identified PCI devices in pci_prepare_msix()
Date: Wed, 5 Feb 2014 15:07:08 -0500 [thread overview]
Message-ID: <20140205200708.GA9278@phenom.dumpdata.com> (raw)
In-Reply-To: <20140124215652.GA18710@phenom.dumpdata.com>
[-- Attachment #1: Type: text/plain, Size: 7120 bytes --]
> And sure enough if I boot Xen without 'pci=assign-busses' it works just
> fine.
>
> Ugh.
>
> I wonder how Xen 4.3 would actually do the PCI passthrough - it booted with
> the 'assign-busses' - but I hadn't tried to do PCI passthrough of the
> PF device (the I210).
I did not work very well. Especially with PCI devices.
>
> If do pass in '05:00.0' (new bus number) I wonder if it will use IOMMU context
> with whatever '05:00.0' was _before_ the bus re-assigment aka:
>
> 05:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01) (prog-if 01 [Subtractive decode])
> Flags: bus master, fast devsel, latency 0
> Bus: primary=05, secondary=06, subordinate=07, sec-latency=32
> Memory behind bridge: f1500000-f16fffff
> Capabilities: [60] Subsystem: Super Micro Computer Inc Device 0805
> Capabilities: [a0] Power Management version 3
>
> Which I think would confuse Xen as this is clearly labeled as bridge
> not a PCI device.
>
>
> The reason for me using 'pci=assign-busses' is that it looks to be
> the only option to use SR-IOV.
.. on this particular motherboard.
>
> Which I suppose makes sense as it tries to create VFs right after its own bus id:
>
>
> +-01.1-[02-03]--+-[0000:03]-+-10.0 Intel Corporation 82576 Virtual Function
> | | +-10.1 Intel Corporation 82576 Virtual Function
> | | +-10.2 Intel Corporation 82576 Virtual Function
> | | +-10.3 Intel Corporation 82576 Virtual Function
> | | +-10.4 Intel Corporation 82576 Virtual Function
> | | +-10.5 Intel Corporation 82576 Virtual Function
> | | +-10.6 Intel Corporation 82576 Virtual Function
> | | +-10.7 Intel Corporation 82576 Virtual Function
> | | +-11.0 Intel Corporation 82576 Virtual Function
> | | +-11.1 Intel Corporation 82576 Virtual Function
> | | +-11.2 Intel Corporation 82576 Virtual Function
> | | +-11.3 Intel Corporation 82576 Virtual Function
> | | +-11.4 Intel Corporation 82576 Virtual Function
> | | \-11.5 Intel Corporation 82576 Virtual Function
> | \-[0000:02]-+-00.0 Intel Corporation 82576 Gigabit Network Connection
> | \-00.1 Intel Corporation 82576 Gigabit Network Connection
>
>
> But why does it have to have the bus _right_ after its own? Can't it
> use one at the end of the its bus-space? The bus is after it is occupied
> by another card (if I boot without 'pci=assign-busses').
Because you need to program the bridge to accept the bus requests for the
PF and VF bus numbers. And hence the need to program it in the bridge
to span more bus numbers.
>
> I do recall using this particular SR-IOV card on a different hardware
> a year ago or so. And it did work. I think that might be because
> there were no PCI cards _after_ the SR-IOV card.
It was because it was a motherboard with an SRIOV aware BIOS. And a
server one while this is more geared towards .. budget-servers?
Anyhow, what I discovered was that the patch attached does allow me to
boot with Xen. It is not pretty.
But I was thinking to fix in the hypervisor and realized there are three
ways of fixing it:
1). Do the hypercall to delete/add devices and let initial domain figure
this out. (which the Linux attached patch does).
2). Be more aware of the bus2bridge topology when removing a PCI bridge or
device. I had one bug where we ended up with this bus2bridge structure:
6 -> 06:00.0
7 -> 06:00.0
8 -> 07:01.0
Which meant that for devices on bus 8, 7 and 6 we would never find the
upstream bridge. The reason is that 6 -> 06 points to itself so
find_upstream_bridge ends up looping 255 times around and returns -1.
Oddly enough the 06:00.0 device does get removed and then added (Via
the hypercalls) and the reason for the bus2bridge having stale information
is that it copies the data but does not invalidate that.
I am not entirely sure I undertand why we do that. In 'free_pdev' we do
this:
for ( ; sec_bus <= sub_bus; sec_bus++ )
pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
and then:
list_del(&pdev->alldevs_list);
xfree(pdev->msix);
xfree(pdev);
so if the device that is being deleted is the bridge - we point the secondary
and subordinate to the deleted device. But if the deleted device is the
upstream bridge we end up leaving a stale bus2bridge context.
That is OK normally as 'alloc_pdev' would over-write it (if the secondary
and subordinate did not change). But in 'assign-busses' case they change so
we are left with an 'stale' one.
This means when the same device is added (but with a new bus value) we
end fixing up the secondary to subordinate busses to point to us (06).
But '06' which used to be a secondary bus, still retains the old value.
One way to fix this is to detect it and correct:
spin_lock(&pseg->bus2bridge_lock);
for ( ; sec_bus <= sub_bus; sec_bus++ )
pseg->bus2bridge[sec_bus] = pseg->bus2bridge[pdev->bus];
/* Check for infinite recursion where bus2bridge would point to
* itself, aka:
* 6-> 06:00.0
* 7-> 06:00.0
* and we are removing 06:00.0, but may have 07:00.0 devices.
* We invalidate the 6 as the upstream bridge is effectively
* removed. We cannot remove the 07 as the 06:00.0 might be
* added right back in. */
if ( pseg->bus2bridge[pdev->bus].map )
{
u8 bus, devfn;
bus = pdev->bus;
if ( __find_upstream_bridge( pseg->nr, &bus, &devfn, &sec_bus, 1 ) < 0 )
{
/* Recursion detected! Invalidate ourselves. */
printk("%04x:%02x:%02x.%u recursed clearing it",
pseg->nr, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
if (pdev->bus != bus || devfn != pdev->devfn)
printk(" %02x:%02x supplied", pdev->bus, pdev->devfn);
PCI_SLOT(pseg->bus2bridge[i].devfn),
PCI_FUNC(pseg->bus2bridge[i].devfn));
pseg->bus2bridge[pdev->bus].map = 0;
}
}
spin_unlock(&pseg->bus2bridge_lock);
But I am not sure if that is the right way of doing it. Anyhow there
was another assumption made in which 'assign-busses' crippled Xen
(see second attachment).
3). Trap on PCI_SECONDARY_BUS and PCI_SUBORDINATE_BUS writes and
fixup the structures.
I hadn't attempted that but that could also be done. That way Xen
is aware of those changes and can update its PCI structures.
Thoughts?
[-- Attachment #2: 0001-DRAFT-xen-pci-Re-add-all-PCI-devices-if-pci-assign-b.patch --]
[-- Type: text/plain, Size: 5977 bytes --]
>From bee45c2613b1f827e2610d7f8d06989f3cd76907 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 5 Feb 2014 14:26:56 -0500
Subject: [PATCH] DRAFT xen/pci: Re-add all PCI devices if pci=assign-busses
is used.
That parameter wreaks havoc with Xen hypervisor. Its internal
structures end up being confused such that 'upstream bridge'
information is lost.
As such, this patch re-programs the Xen hypervisor's PCI devices.
It does it in three steps:
1). Before 'acpi_init' (which parses the ACPI DSDT for PCI devices)
in register_xen_pci_notifier we collect all of the PCI devices
BDFs that are active.
2). When 'acpi_init' has finished and has reprogrammed the bus
numbers, we intersect the list of all of the PCI devices that
Linux knows with the list we created in step 1). The result
is an array of BDFs which are orphaned - meaning they are not
present on the machine any more - but Xen hypervisor is still
holding on to them - because Linux has not made the
'xen_remove_device' call on them. The reason for that is
explained later in this description[*1]. With the list of
orphaned PCI devices and the ones we have added - we make
the hypercall to remove all the orphaned ones and all the
ones that were added.
At this stage Xen has no knowledge of any PCI devices.
3). We all of the PCI devices that Linux knows about.
This way the view from Linux and Xen is synced when it comes
to the PCI devices.
[*1]. Linux seperates the PCI devices from PCI bridges in two
structures. That means that PCI devices know their slot and function
number. While the bus structure keeps track of the bus number. This
seperation allows Linux to expand the bridge to span more bus
numbers and the changes are only updated in the PCI bus structures.
The PCI devices are oblivious to this. Also the notifier call chain
is only executed when a PCI device is added - and since this is
during early bootup - the notifier is not used to 'delete' the devices
that might have existed with the old bus numbers - because Linux
hasn't gotten to enumerate them.
With this patch, pci=assign-busses works with Xen hypervisors.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
drivers/xen/pci.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 117 insertions(+), 1 deletions(-)
diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index dd9c249..178de97 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -186,12 +186,104 @@ static struct notifier_block device_nb = {
.notifier_call = xen_pci_notifier,
};
+#include <linux/device.h>
+static void __init walk_bus(struct pci_bus *bus, int (*fnc)(struct device *dev))
+{
+ struct pci_dev *dev;
+ struct pci_bus *child;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ if (dev->subordinate)
+ continue; /* Scan bridges in the next loop */
+ (void)fnc(&dev->dev);
+ }
+ list_for_each_entry(child, &bus->children, node) {
+ dev = child->self;
+ if (dev)
+ (void)fnc(&dev->dev);
+ walk_bus(child, fnc);
+ }
+}
+static void __init walk_tree(int (*fnc)(struct device *dev))
+{
+ struct pci_bus *bus;
+
+ down_read(&pci_bus_sem);
+ list_for_each_entry(bus, &pci_root_buses, node)
+ walk_bus(bus, fnc);
+ up_read(&pci_bus_sem);
+}
+
+#include <asm/pci-direct.h>
+
+#define PCI_BUS(bdf) (((bdf) >> 8) & 0xff)
+#define PCI_BDF(b,d,f) ((((b) & 0xff) << 8) | PCI_DEVFN(d,f))
+#define PCI_DEVFN2(bdf) ((bdf) & 0xff)
+#define PCI_BDF2(b,df) ((((b) & 0xff) << 8) | ((df) & 0xff))
+static unsigned long __initdata *pci_devs;
+
+static void __init check_device(int bus, int slot, int func)
+{
+ u16 class;
+
+ class = read_pci_config(bus, slot, func, PCI_CLASS_REVISION);
+ if (class == 0xffff)
+ return;
+
+ set_bit(PCI_BDF(bus, slot, func), pci_devs);
+}
+static int __init xen_prune_pci_devs(struct device *dev)
+{
+ struct pci_dev *pci_dev = to_pci_dev(dev);
+ u16 busdevfn;
+
+ busdevfn = PCI_BDF2(pci_dev->bus->number, pci_dev->devfn);
+ if (test_bit(busdevfn, pci_devs)) /* If present it is not orphaned */
+ clear_bit(busdevfn, pci_devs);
+ return 0;
+}
+static void __init xen_delete_orphaned_pci_devs(void)
+{
+ struct physdev_manage_pci manage_pci;
+ unsigned int i;
+
+ for_each_set_bit(i, pci_devs, PCI_BDF(-1, -1, -1) + 1) {
+ manage_pci.bus = PCI_BUS(i);
+ manage_pci.devfn = PCI_DEVFN2(i);
+ (void)HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove,
+ &manage_pci);
+ }
+}
+
static int __init register_xen_pci_notifier(void)
{
+ int bus, slot, func, rc = 0;
+
if (!xen_initial_domain())
return 0;
- return bus_register_notifier(&pci_bus_type, &device_nb);
+ rc = bus_register_notifier(&pci_bus_type, &device_nb);
+
+ if (!pcibios_assign_all_busses())
+ return rc;
+
+ if (!early_pci_allowed())
+ return rc;
+
+ /* 64K bits needed - we will revisit it in xen_pci_refresh */
+ pci_devs = kcalloc(BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1), sizeof(unsigned long), GFP_KERNEL);
+ if (!pci_devs)
+ return rc;
+
+ /* Poor man's PCI discovery */
+ for (bus = 0; bus < 256; bus++) {
+ for (slot = 0; slot < 32; slot++) {
+ for (func = 0; func < 8; func++) {
+ check_device(bus, slot, func);
+ }
+ }
+ }
+ return rc;
}
arch_initcall(register_xen_pci_notifier);
@@ -241,3 +333,27 @@ static int __init xen_mcfg_late(void)
*/
subsys_initcall_sync(xen_mcfg_late);
#endif
+
+static int __init xen_pci_refresh(void)
+{
+ if (!xen_initial_domain())
+ return 0;
+
+ if (!pcibios_assign_all_busses())
+ return 0;
+
+ /* Update the list - so that we only have orphaned devices. */
+ walk_tree(&xen_prune_pci_devs);
+
+ /* Remove orphaned devices. */
+ xen_delete_orphaned_pci_devs();
+ /* Remove all existing ones */
+ walk_tree(&xen_remove_device);
+
+ /* Now the hypervisor has no PCI devices, so lets add them in */
+ walk_tree(&xen_add_device);
+
+ kfree(pci_devs);
+ return 0;
+}
+subsys_initcall_sync(xen_pci_refresh);
--
1.7.7.6
[-- Attachment #3: 0001-pci-Don-t-assume-the-removed-device-is-a-bridge.patch --]
[-- Type: text/plain, Size: 2964 bytes --]
>From 76dc10b829f3beebd23c0c99dd653e50b429c5bd Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 4 Feb 2014 11:48:16 -0500
Subject: [PATCH] pci: Don't assume the removed device is a bridge.
When we are instructed to remove a PCI device it is usally
done from the initial domain via PHYSDEVOP_pci_device_remove or
PHYSDEVOP_manage_pci_remove. That is OK except in the case where
the initial domain has re-programmed the PCI bridges with a new
PCI_SUBORDINATE_BUS value causing the bus number to change.
That means a device that had been addressed via say this
BDF: 06:00.0 is now addressed via 09:00.0. Now assume that the
device that is being deleted is a bridge and it used to be
06:00.0 - but since the bus numbers are different any
reads done on the PCI_SUBORDINATE_BUS can return bogus values
(as we are now addressing a completetly new device).
To guard against that we save away the subordinate and secondary
bus numbers the first time the device is introduced. Then when
the device is deleted we use those values instead of reading
from the PCI device.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
xen/drivers/passthrough/pci.c | 8 ++++----
xen/include/xen/pci.h | 4 ++++
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 6152370..4e73427 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -201,6 +201,8 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
sub_bus = pci_conf_read8(pseg->nr, bus, PCI_SLOT(devfn),
PCI_FUNC(devfn), PCI_SUBORDINATE_BUS);
+ pdev->info.bus.sec = sec_bus;
+ pdev->info.bus.sub = sub_bus;
spin_lock(&pseg->bus2bridge_lock);
for ( ; sec_bus <= sub_bus; sec_bus++ )
{
@@ -265,10 +267,8 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
case DEV_TYPE_LEGACY_PCI_BRIDGE:
dev = PCI_SLOT(pdev->devfn);
func = PCI_FUNC(pdev->devfn);
- sec_bus = pci_conf_read8(pseg->nr, pdev->bus, dev, func,
- PCI_SECONDARY_BUS);
- sub_bus = pci_conf_read8(pseg->nr, pdev->bus, dev, func,
- PCI_SUBORDINATE_BUS);
+ sec_bus = pdev->info.bus.sec;
+ sub_bus = pdev->info.bus.sub;
spin_lock(&pseg->bus2bridge_lock);
for ( ; sec_bus <= sub_bus; sec_bus++ )
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index cadb525..c3f6ee4 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -39,6 +39,10 @@ struct pci_dev_info {
u8 bus;
u8 devfn;
} physfn;
+ struct {
+ u8 sec;
+ u8 sub;
+ } bus; /* Only set if device is a bridge */
};
struct pci_dev {
--
1.7.7.6
[-- Attachment #4: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
next prev parent reply other threads:[~2014-02-05 20:07 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-01-21 21:54 Regression compared to Xen 4.3, Xen 4.4-rc2 - pci_prepare_msix+0xb1/0x12 - BOOM Konrad Rzeszutek Wilk
2014-01-22 0:23 ` Andrew Cooper
2014-01-22 0:24 ` [PATCH] x86/msi: Validate the guest-identified PCI devices in pci_prepare_msix() Andrew Cooper
2014-01-22 4:31 ` Konrad Rzeszutek Wilk
2014-01-22 9:49 ` Jan Beulich
2014-01-22 10:28 ` Andrew Cooper
2014-01-22 12:08 ` Jan Beulich
2014-01-22 21:40 ` Konrad Rzeszutek Wilk
2014-01-23 8:24 ` Jan Beulich
2014-01-24 15:01 ` Konrad Rzeszutek Wilk
2014-01-24 15:55 ` Jan Beulich
2014-01-24 16:19 ` Jan Beulich
2014-01-24 17:43 ` Konrad Rzeszutek Wilk
2014-01-24 21:56 ` Is: pci=assign-busses blows up Xen 4.4 Was:Re: " Konrad Rzeszutek Wilk
2014-02-05 20:07 ` Konrad Rzeszutek Wilk [this message]
2014-02-06 9:02 ` Jan Beulich
2014-02-21 19:18 ` Konrad Rzeszutek Wilk
2014-02-24 9:15 ` Is: pci=assign-busses blows up Xen 4.4 Jan Beulich
2014-02-24 16:15 ` Konrad Rzeszutek Wilk
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140205200708.GA9278@phenom.dumpdata.com \
--to=konrad.wilk@oracle.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=george.dunlap@eu.citrix.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).