xen-devel.lists.xenproject.org archive mirror
* [PATCH v4 0/9] vpci: PCI config space emulation
@ 2017-06-30 15:01 Roger Pau Monne
  2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, julien.grall

Hello,

The following series contain an implementation of handlers for the PCI
configuration space inside of Xen. This allows Xen to detect accesses
to the PCI configuration space and react accordingly.

Why is this needed? IMHO, there are two main reasons for doing all
this emulation inside of Xen. The first is to avoid adding a bunch of
duplicated Xen PV-specific code to each OS we want to support in PVH
mode. That would just promote Xen code duplication amongst OSes, which
leads to a higher maintenance burden.

The second reason is that this code (or its functionality, to be more
precise) already exists in QEMU (and pciback to a degree), and it's
code that we already support and maintain. By moving it into the
hypervisor itself, every guest type can make use of it, and it can be
shared between them all. I know that the code in this series is not
yet suitable for DomU HVM guests in its current state, but it should
be in due time.

As usual, each patch contains a changeset summary between versions,
I'm not going to copy the list of changes here.

Patch 1 implements the generic handlers for accesses to the PCI
configuration space together with a minimal user-space test harness
that I've used during development. Currently a per-device linked list
is used in order to store the list of handlers, and they are sorted
based on their offset inside of the configuration space. Patch 1 also
adds the x86 port IO traps and wires them into the newly introduced
vPCI dispatchers. Patches 2 and 3 add handlers for the MMCFG areas (as
found in the MCFG ACPI table). Patches 4 and 5 are mostly code
movement/refactoring in order to implement support for BAR mapping in
patch 6. Finally patches 7 and 9 add support for trapping accesses
to the MSI and MSI-X capabilities respectively, so that interrupts are
properly set up on behalf of Dom0.

The branch containing the patches can be found at:

git://xenbits.xen.org/people/royger/xen.git vpci_v4

Note that this is only safe to use for the hardware domain (which is
trusted); any non-trusted domain will need a lot more traps before it
can be allowed to freely access the PCI configuration space.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-10 13:34   ` Paul Durrant
  2017-07-13 20:15   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Paul Durrant, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Introduce a set of handlers for accesses to the MMCFG areas. Those
areas are set up based on the contents of the hardware MMCFG tables,
and the list of handled MMCFG areas is stored inside the hvm_domain
struct.

Reads and writes are forwarded to the generic vpci handlers once the
address is decoded, in order to obtain the device and register the
guest is trying to access.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v3:
 - Propagate changes from previous patches: drop xen_ prefix for vpci
   functions, pass slot and func instead of devfn and fix the error
   paths of the MMCFG handlers.
 - s/ecam/mmcfg/.
 - Move the destroy code to a separate function, so the hvm_mmcfg
   struct can be private to hvm/io.c.
 - Constify the return of vpci_mmcfg_find.
 - Use d instead of v->domain in vpci_mmcfg_accept.
 - Allow 8byte accesses to the mmcfg.

Changes since v1:
 - Added locking.
---
 xen/arch/x86/hvm/dom0_build.c    |  27 ++++++
 xen/arch/x86/hvm/hvm.c           |   3 +
 xen/arch/x86/hvm/io.c            | 188 ++++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/domain.h |   3 +
 xen/include/asm-x86/hvm/io.h     |   7 ++
 5 files changed, 225 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 0e7d06be95..57db8adc8d 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -38,6 +38,8 @@
 #include <public/hvm/hvm_info_table.h>
 #include <public/hvm/hvm_vcpu.h>
 
+#include "../x86_64/mmconfig.h"
+
 /*
  * Have the TSS cover the ISA port range, which makes it
  * - 104 bytes base structure
@@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
     return 0;
 }
 
+int __init pvh_setup_mmcfg(struct domain *d)
+{
+    unsigned int i;
+    int rc;
+
+    for ( i = 0; i < pci_mmcfg_config_num; i++ )
+    {
+        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
+                                         pci_mmcfg_config[i].start_bus_number,
+                                         pci_mmcfg_config[i].end_bus_number,
+                                         pci_mmcfg_config[i].pci_segment);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 int __init dom0_construct_pvh(struct domain *d, const module_t *image,
                               unsigned long image_headroom,
                               module_t *initrd,
@@ -1090,6 +1110,13 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = pvh_setup_mmcfg(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 PCI MMCFG areas: %d\n", rc);
+        return rc;
+    }
+
     panic("Building a PVHv2 Dom0 is not yet supported.");
     return 0;
 }
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index c4176ee458..f45e2bd23d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -584,6 +584,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
     spin_lock_init(&d->arch.hvm_domain.write_map.lock);
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
@@ -729,6 +730,8 @@ void hvm_domain_destroy(struct domain *d)
         list_del(&ioport->list);
         xfree(ioport);
     }
+
+    destroy_vpci_mmcfg(&d->arch.hvm_domain.mmcfg_regions);
 }
 
 static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 4e91a485cd..bb67f3accc 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -261,11 +261,11 @@ void register_g2m_portio_handler(struct domain *d)
 static int vpci_access_check(unsigned int reg, unsigned int len)
 {
     /* Check access size. */
-    if ( len != 1 && len != 2 && len != 4 )
+    if ( len != 1 && len != 2 && len != 4 && len != 8 )
         return -EINVAL;
 
-    /* Check if access crosses a double-word boundary. */
-    if ( (reg & 3) + len > 4 )
+    /* Check if access crosses a double-word boundary or it's not aligned. */
+    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && (reg & 3) != 0) )
         return -EINVAL;
 
     return 0;
@@ -398,6 +398,188 @@ void register_vpci_portio_handler(struct domain *d)
     handler->ops = &vpci_portio_ops;
 }
 
+struct hvm_mmcfg {
+    paddr_t addr;
+    size_t size;
+    unsigned int bus;
+    unsigned int segment;
+    struct list_head next;
+};
+
+/* Handlers to trap PCI ECAM config accesses. */
+static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d,
+                                               unsigned long addr)
+{
+    const struct hvm_mmcfg *mmcfg;
+
+    ASSERT(vpci_locked(d));
+    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions, next )
+        if ( addr >= mmcfg->addr && addr < mmcfg->addr + mmcfg->size )
+            return mmcfg;
+
+    return NULL;
+}
+
+static void vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
+                                   unsigned long addr, unsigned int *bus,
+                                   unsigned int *slot, unsigned int *func,
+                                   unsigned int *reg)
+{
+    addr -= mmcfg->addr;
+    *bus = ((addr >> 20) & 0xff) + mmcfg->bus;
+    *slot = (addr >> 15) & 0x1f;
+    *func = (addr >> 12) & 0x7;
+    *reg = addr & 0xfff;
+}
+
+static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
+{
+    struct domain *d = v->domain;
+    bool found;
+
+    vpci_lock(d);
+    found = vpci_mmcfg_find(d, addr);
+    vpci_unlock(d);
+
+    return found;
+}
+
+static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
+                           unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int bus, slot, func, reg;
+
+    *data = ~(unsigned long)0;
+
+    vpci_lock(d);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
+
+    if ( vpci_access_check(reg, len) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    pcidevs_lock();
+    if ( len == 8 )
+    {
+        /*
+         * According to the PCIe 3.1A specification:
+         *  - Configuration Reads and Writes must usually be DWORD or smaller
+         *    in size.
+         *  - Because Root Complex implementations are not required to support
+         *    accesses to a RCRB that cross DW boundaries [...] software
+         *    should take care not to cause the generation of such accesses
+         *    when accessing a RCRB unless the Root Complex will support the
+         *    access.
+         *  Xen however supports 8byte accesses by splitting them into two
+         *  4byte accesses.
+         */
+        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, 4);
+        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
+                                     reg + 4, 4) << 32;
+    }
+    else
+        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, len);
+    pcidevs_unlock();
+    vpci_unlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_mmcfg_write(struct vcpu *v, unsigned long addr,
+                            unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    const struct hvm_mmcfg *mmcfg;
+    unsigned int bus, slot, func, reg;
+
+    vpci_lock(d);
+    mmcfg = vpci_mmcfg_find(d, addr);
+    if ( !mmcfg )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
+
+    if ( vpci_access_check(reg, len) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    pcidevs_lock();
+    if ( len == 8 )
+    {
+        vpci_write(mmcfg->segment, bus, slot, func, reg, 4, data);
+        vpci_write(mmcfg->segment, bus, slot, func, reg + 4, 4, data >> 32);
+    }
+    else
+        vpci_write(mmcfg->segment, bus, slot, func, reg, len, data);
+    pcidevs_unlock();
+    vpci_unlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_mmcfg_ops = {
+    .check = vpci_mmcfg_accept,
+    .read = vpci_mmcfg_read,
+    .write = vpci_mmcfg_write,
+};
+
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg)
+{
+    struct hvm_mmcfg *mmcfg;
+
+    ASSERT(is_hardware_domain(d));
+
+    vpci_lock(d);
+    if ( vpci_mmcfg_find(d, addr) )
+    {
+        vpci_unlock(d);
+        return -EEXIST;
+    }
+
+    mmcfg = xmalloc(struct hvm_mmcfg);
+    if ( !mmcfg )
+    {
+        vpci_unlock(d);
+        return -ENOMEM;
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
+        register_mmio_handler(d, &vpci_mmcfg_ops);
+
+    mmcfg->addr = addr + (start_bus << 20);
+    mmcfg->bus = start_bus;
+    mmcfg->segment = seg;
+    mmcfg->size = (end_bus - start_bus + 1) << 20;
+    list_add(&mmcfg->next, &d->arch.hvm_domain.mmcfg_regions);
+    vpci_unlock(d);
+
+    return 0;
+}
+
+void destroy_vpci_mmcfg(struct list_head *domain_mmcfg)
+{
+    while ( !list_empty(domain_mmcfg) )
+    {
+        struct hvm_mmcfg *mmcfg = list_first_entry(domain_mmcfg,
+                                                   struct hvm_mmcfg, next);
+
+        list_del(&mmcfg->next);
+        xfree(mmcfg);
+    }
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index cbf4170789..7028f93861 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -187,6 +187,9 @@ struct hvm_domain {
     /* Lock for the PCI emulation layer (vPCI). */
     spinlock_t vpci_lock;
 
+    /* List of ECAM (MMCFG) regions trapped by Xen. */
+    struct list_head mmcfg_regions;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 0af1ed14dc..4fe996fe49 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -163,6 +163,13 @@ void register_g2m_portio_handler(struct domain *d);
 /* HVM port IO handler for PCI accesses. */
 void register_vpci_portio_handler(struct domain *d);
 
+/* HVM MMIO handler for PCI MMCFG accesses. */
+int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
+                                unsigned int start_bus, unsigned int end_bus,
+                                unsigned int seg);
+/* Destroy tracked MMCFG areas. */
+void destroy_vpci_mmcfg(struct list_head *domain_mmcfg);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
-- 
2.11.0 (Apple Git-81)



* [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
  2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-14 10:32   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

So that hotplugged MMCFG regions (or MMCFG regions not present in the
MCFG ACPI table) can be added at run time by the hardware domain.

When a new MMCFG area is added to a PVH Dom0, Xen will scan it and add
the devices to the hardware domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
changes since v3:
 - New in this version.
---
 tools/tests/vpci/emul.h       |  2 --
 xen/arch/x86/hvm/hypercall.c  |  4 ++++
 xen/arch/x86/physdev.c        | 19 +++++++++++++++++++
 xen/drivers/passthrough/pci.c | 37 ++++++++++++++++++++++++++++++++++---
 xen/drivers/vpci/vpci.c       |  4 ++--
 xen/include/xen/pci.h         |  1 +
 6 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/tools/tests/vpci/emul.h b/tools/tests/vpci/emul.h
index 1b0217e7e3..047079de4c 100644
--- a/tools/tests/vpci/emul.h
+++ b/tools/tests/vpci/emul.h
@@ -58,8 +58,6 @@ extern struct pci_dev test_pdev;
 
 #include "vpci.h"
 
-#define __hwdom_init
-
 #define has_vpci(d) true
 
 /* Define our own locks. */
diff --git a/xen/arch/x86/hvm/hypercall.c b/xen/arch/x86/hvm/hypercall.c
index e7238ce293..89625d514c 100644
--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -89,6 +89,10 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         if ( !has_pirq(curr->domain) )
             return -ENOSYS;
         break;
+    case PHYSDEVOP_pci_mmcfg_reserved:
+        if ( !is_hardware_domain(curr->domain) )
+            return -ENOSYS;
+        break;
     }
 
     if ( !curr->hcall_compat )
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 0eb409758f..6b1c92fa0b 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -559,6 +559,25 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         ret = pci_mmcfg_reserved(info.address, info.segment,
                                  info.start_bus, info.end_bus, info.flags);
+        if ( ret || !is_hvm_domain(currd) )
+            break;
+
+        /*
+         * For HVM (PVH) domains try to add the newly found MMCFG to the
+         * domain.
+         */
+        ret = register_vpci_mmcfg_handler(currd, info.address, info.start_bus,
+                                          info.end_bus, info.segment);
+        if ( ret == -EEXIST )
+        {
+            ret = 0;
+            break;
+        }
+        if ( ret )
+            break;
+
+        ret = pci_scan_and_setup_segment(info.segment);
+
         break;
     }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 3208cd5d71..2d38a5a297 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -924,7 +924,7 @@ out:
     return ret;
 }
 
-bool_t __init pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func)
+bool pci_device_detect(u16 seg, u8 bus, u8 dev, u8 func)
 {
     u32 vendor;
 
@@ -971,7 +971,7 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
  * scan pci devices to add all existed PCI devices to alldevs_list,
  * and setup pci hierarchy in array bus2bridge.
  */
-static int __init _scan_pci_devices(struct pci_seg *pseg, void *arg)
+static int _scan_pci_devices(struct pci_seg *pseg, void *arg)
 {
     struct pci_dev *pdev;
     int bus, dev, func;
@@ -1050,7 +1050,7 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
                ctxt->d->domain_id, err);
 }
 
-static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
+static int _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg)
 {
     struct setup_hwdom *ctxt = arg;
     int bus, devfn;
@@ -1110,6 +1110,37 @@ void __hwdom_init setup_hwdom_pci_devices(
     pcidevs_unlock();
 }
 
+static int add_device(uint8_t devfn, struct pci_dev *pdev)
+{
+    return iommu_add_device(pdev);
+}
+
+int pci_scan_and_setup_segment(uint16_t segment)
+{
+    struct pci_seg *pseg = get_pseg(segment);
+    struct setup_hwdom ctxt = {
+        .d = current->domain,
+        .handler = add_device,
+    };
+    int ret;
+
+    if ( !pseg )
+        return -EINVAL;
+
+    pcidevs_lock();
+    ret = _scan_pci_devices(pseg, NULL);
+    if ( ret )
+        goto out;
+
+    ret = _setup_hwdom_pci_devices(pseg, &ctxt);
+    if ( ret )
+        goto out;
+
+ out:
+    pcidevs_unlock();
+    return ret;
+}
+
 #ifdef CONFIG_ACPI
 #include <acpi/acpi.h>
 #include <acpi/apei.h>
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index c54de83b82..7d4ecd5fb5 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -33,12 +33,12 @@ struct vpci_register {
     struct list_head node;
 };
 
-int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
+int vpci_add_handlers(struct pci_dev *pdev)
 {
     unsigned int i;
     int rc = 0;
 
-    if ( !has_vpci(pdev->domain) )
+    if ( !has_vpci(pdev->domain) || pdev->vpci )
         return 0;
 
     pdev->vpci = xzalloc(struct vpci);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index a9b80e330b..e550effcc9 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -131,6 +131,7 @@ struct pci_dev *pci_get_real_pdev(int seg, int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(
     struct domain *, int seg, int bus, int devfn);
 void pci_check_disable_device(u16 seg, u8 bus, u8 devfn);
+int pci_scan_and_setup_segment(uint16_t segment);
 
 uint8_t pci_conf_read8(
     unsigned int seg, unsigned int bus, unsigned int dev, unsigned int func,
-- 
2.11.0 (Apple Git-81)



* [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
  2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
  2017-06-30 15:01 ` [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-14 10:32   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

And also allow it to do non-identity mappings by adding a new
parameter.

This function will be needed in order to map the BARs from PCI devices
into the Dom0 p2m (and is also used by the x86 Dom0 builder). While
there, fix the function to use gfn_t and mfn_t instead of unsigned
long for memory addresses.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Remove the dummy modify_identity_mmio helper in dom0_build.c
 - Try to make the comment in modify MMIO less scary.
 - Clarify commit message.
 - Only build the function for x86 or if there's PCI support.

Changes since v2:
 - Use mfn_t and gfn_t.
 - Remove stray newline.
---
 xen/arch/x86/hvm/dom0_build.c | 30 ++----------------------------
 xen/common/memory.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/p2m-common.h  |  3 +++
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 57db8adc8d..6b9f76ec36 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -61,32 +61,6 @@ static struct acpi_madt_interrupt_override __initdata *intsrcovr;
 static unsigned int __initdata acpi_nmi_sources;
 static struct acpi_madt_nmi_source __initdata *nmisrc;
 
-static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
-                                       unsigned long nr_pages, const bool map)
-{
-    int rc;
-
-    for ( ; ; )
-    {
-        rc = (map ? map_mmio_regions : unmap_mmio_regions)
-             (d, _gfn(pfn), nr_pages, _mfn(pfn));
-        if ( rc == 0 )
-            break;
-        if ( rc < 0 )
-        {
-            printk(XENLOG_WARNING
-                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
-                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
-            break;
-        }
-        nr_pages -= rc;
-        pfn += rc;
-        process_pending_softirqs();
-    }
-
-    return rc;
-}
-
 /* Populate a HVM memory range using the biggest possible order. */
 static int __init pvh_populate_memory_range(struct domain *d,
                                             unsigned long start,
@@ -397,7 +371,7 @@ static int __init pvh_setup_p2m(struct domain *d)
      * Memory below 1MB is identity mapped.
      * NB: this only makes sense when booted from legacy BIOS.
      */
-    rc = modify_identity_mmio(d, 0, MB1_PAGES, true);
+    rc = modify_mmio(d, _gfn(0), _mfn(0), MB1_PAGES, true);
     if ( rc )
     {
         printk("Failed to identity map low 1MB: %d\n", rc);
@@ -964,7 +938,7 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
         nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
                           d->arch.e820[i].size);
 
-        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        rc = modify_mmio(d, _gfn(pfn), _mfn(pfn), nr_pages, true);
         if ( rc )
         {
             printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
diff --git a/xen/common/memory.c b/xen/common/memory.c
index b2066db07e..410b6e77d9 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
     return 0;
 }
 
+#if defined(CONFIG_X86) || defined(CONFIG_HAS_PCI)
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                const bool map)
+{
+    int rc;
+
+    /*
+     * ATM this function should only be used by the hardware domain
+     * because it doesn't support preemption/continuation, and as such
+     * can take a non-trivial amount of time. Note that it periodically calls
+     * process_pending_softirqs in order to avoid stalling the system.
+     */
+    ASSERT(is_hardware_domain(d));
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, gfn, nr_pages, mfn);
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_G_WARNING
+                   "Failed to %smap [%" PRI_gfn ", %" PRI_gfn ") -> "
+                   "[%" PRI_mfn ", %" PRI_mfn ") for d%d: %d\n",
+                   map ? "" : "un", gfn_x(gfn), gfn_x(gfn_add(gfn, nr_pages)),
+                   mfn_x(mfn), mfn_x(mfn_add(mfn, nr_pages)), d->domain_id,
+                   rc);
+            break;
+        }
+        nr_pages -= rc;
+        mfn = mfn_add(mfn, rc);
+        gfn = gfn_add(gfn, rc);
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+#endif
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
index 2b5696cf33..c2f9015ad8 100644
--- a/xen/include/xen/p2m-common.h
+++ b/xen/include/xen/p2m-common.h
@@ -20,4 +20,7 @@ int unmap_mmio_regions(struct domain *d,
                        unsigned long nr,
                        mfn_t mfn);
 
+int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
+                const bool map);
+
 #endif /* _XEN_P2M_COMMON_H */
-- 
2.11.0 (Apple Git-81)



* [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2017-06-30 15:01 ` [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-14 10:33   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 6/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, julien.grall, Jan Beulich

So that it can be called from outside in order to get the size of
regular PCI BARs. This will be required in order to map the BARs from
PCI devices into the PVH Dom0 p2m.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v3:
 - Rename function to size BARs to pci_size_mem_bar.
 - Change the parameters passed to the function. Pass the position and
   whether the BAR is the last one, instead of the (base, max_bars,
   *index) tuple.
 - Make the function return the number of BARs consumed (1 for 32b, 2
   for 64b BARs).
 - Change the dprintk back to printk.
 - Do not log another error message in pci_add_device in case
   pci_size_mem_bar fails.
---
 xen/drivers/passthrough/pci.c | 89 ++++++++++++++++++++++++++-----------------
 xen/include/xen/pci.h         |  3 ++
 2 files changed, 58 insertions(+), 34 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2d38a5a297..656a2a316b 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -588,6 +588,54 @@ static void pci_enable_acs(struct pci_dev *pdev)
     pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
 }
 
+int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                     unsigned int func, unsigned int pos, bool last,
+                     uint64_t *paddr, uint64_t *psize)
+{
+    uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
+    uint64_t addr, size;
+
+    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    pci_conf_write32(seg, bus, slot, func, pos, ~0);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        if ( last )
+        {
+            printk(XENLOG_WARNING
+                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
+                    seg, bus, slot, func);
+            return -EINVAL;
+        }
+        hi = pci_conf_read32(seg, bus, slot, func, pos + 4);
+        pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
+    }
+    size = pci_conf_read32(seg, bus, slot, func, pos) &
+           PCI_BASE_ADDRESS_MEM_MASK;
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        size |= (u64)pci_conf_read32(seg, bus, slot, func, pos + 4) << 32;
+        pci_conf_write32(seg, bus, slot, func, pos + 4, hi);
+    }
+    else if ( size )
+        size |= (u64)~0 << 32;
+    pci_conf_write32(seg, bus, slot, func, pos, bar);
+    size = -(size);
+    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
+
+    if ( paddr )
+        *paddr = addr;
+    if ( psize )
+        *psize = size;
+
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        return 2;
+
+    return 1;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -648,11 +696,10 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             unsigned int i;
 
             BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-            for ( i = 0; i < PCI_SRIOV_NUM_BARS; ++i )
+            for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
                 u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
-                u32 hi = 0;
 
                 if ( (bar & PCI_BASE_ADDRESS_SPACE) ==
                      PCI_BASE_ADDRESS_SPACE_IO )
@@ -663,38 +710,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                            seg, bus, slot, func, i);
                     continue;
                 }
-                pci_conf_write32(seg, bus, slot, func, idx, ~0);
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    if ( i >= PCI_SRIOV_NUM_BARS )
-                    {
-                        printk(XENLOG_WARNING
-                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
-                               " vf BAR in last slot\n",
-                               seg, bus, slot, func);
-                        break;
-                    }
-                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
-                }
-                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
-                                   PCI_BASE_ADDRESS_MEM_MASK;
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
-                                                             slot, func,
-                                                             idx + 4) << 32;
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
-                }
-                else if ( pdev->vf_rlen[i] )
-                    pdev->vf_rlen[i] |= (u64)~0 << 32;
-                pci_conf_write32(seg, bus, slot, func, idx, bar);
-                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                    ++i;
+                ret = pci_size_mem_bar(seg, bus, slot, func, idx,
+                                       i == PCI_SRIOV_NUM_BARS - 1, NULL,
+                                       &pdev->vf_rlen[i]);
+                if ( ret < 0 )
+                    break;
+                i += ret;
             }
         }
         else
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index e550effcc9..11ad185cec 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -165,6 +165,9 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
                       unsigned int *dev, unsigned int *func);
 const char *parse_pci_seg(const char *, unsigned int *seg, unsigned int *bus,
                           unsigned int *dev, unsigned int *func, bool *def_seg);
+int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                     unsigned int func, unsigned int pos, bool last,
+                     uint64_t *addr, uint64_t *size);
 
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2017-06-30 15:01 ` [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-14 15:11   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 7/9] vpci/msi: add MSI handlers Roger Pau Monne
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to emulate BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 2 (response to memory
space accesses), and maps/unmaps the BARs of the device into the guest p2m.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v3:
 - Propagate previous changes: drop xen_ prefix and use u8/u16/u32
   instead of the previous half_word/word/double_word.
 - Constify some of the parameters.
 - s/VPCI_BAR_MEM/VPCI_BAR_MEM32/.
 - Simplify the number of fields stored for each BAR, a single address
   field is stored and contains the address of the BAR both on Xen and
   in the guest.
 - Allow the guest to move the BARs around in the physical memory map.
 - Add support for expansion ROM BARs.
 - Do not cache the value of the command register.
 - Remove a label used in vpci_cmd_write.
 - Fix the calculation of the sizing mask in vpci_bar_write.
 - Check the memory decode bit in order to decide if a BAR is
   positioned or not.
 - Disable memory decoding before sizing the BARs in Xen.
 - When mapping/unmapping BARs check if there's overlap between BARs,
   in order to avoid unmapping memory required by another BAR.
 - Introduce a macro to check whether a BAR is mappable or not.
 - Add a comment regarding the lack of support for SR-IOV.
 - Remove the usage of the GENMASK macro.

Changes since v2:
 - Detect unset BARs and allow the hardware domain to position them.
---
 xen/drivers/vpci/Makefile |   2 +-
 xen/drivers/vpci/header.c | 473 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/vpci.h    |  23 +++
 3 files changed, 497 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/header.c

diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 840a906470..241467212f 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o
+obj-y += vpci.o header.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
new file mode 100644
index 0000000000..3c800c4cf7
--- /dev/null
+++ b/xen/drivers/vpci/header.c
@@ -0,0 +1,473 @@
+/*
+ * Generic functionality for handling accesses to the PCI header from the
+ * configuration space.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <xen/p2m-common.h>
+
+#define MAPPABLE_BAR(x)                                                 \
+    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
+     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
+     (x)->addr != INVALID_PADDR)
+
+static struct rangeset *vpci_get_bar_memory(const struct domain *d,
+                                            const struct vpci_bar *map)
+{
+    const struct pci_dev *pdev;
+    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
+    int rc;
+
+    if ( !mem )
+        return ERR_PTR(-ENOMEM);
+
+    /*
+     * Create a rangeset that represents the current BAR memory region
+     * and compare it against all the currently active BAR memory regions.
+     * If an overlap is found, subtract it from the region to be
+     * mapped/unmapped.
+     *
+     * NB: the rangeset uses frames, and if start and end addresses are
+     * equal it means only one frame is used, that's why PFN_DOWN is used
+     * to calculate the end of the rangeset.
+     */
+    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
+                            PFN_DOWN(map->addr + map->size));
+    if ( rc )
+    {
+        rangeset_destroy(mem);
+        return ERR_PTR(rc);
+    }
+
+    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
+    {
+        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
+                                       PCI_SLOT(pdev->devfn),
+                                       PCI_FUNC(pdev->devfn),
+                                       PCI_COMMAND);
+        unsigned int i;
+
+        /* Check if memory decoding is enabled. */
+        if ( !(cmd & PCI_COMMAND_MEMORY) )
+            continue;
+
+        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
+        {
+            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
+
+            if ( bar == map || !MAPPABLE_BAR(bar) ||
+                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
+                                          PFN_DOWN(bar->addr + bar->size)) )
+                continue;
+
+            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
+                                       PFN_DOWN(bar->addr + bar->size));
+            if ( rc )
+            {
+                rangeset_destroy(mem);
+                return ERR_PTR(rc);
+            }
+        }
+    }
+
+    return mem;
+}
+
+struct map_data {
+    struct domain *d;
+    bool map;
+};
+
+static int vpci_map_range(unsigned long s, unsigned long e, void *data)
+{
+    const struct map_data *map = data;
+
+    return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
+}
+
+static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
+                           const bool map)
+{
+    struct rangeset *mem;
+    struct map_data data = { .d = d, .map = map };
+    int rc;
+
+    ASSERT(MAPPABLE_BAR(bar));
+
+    mem = vpci_get_bar_memory(d, bar);
+    if ( IS_ERR(mem) )
+        return -PTR_ERR(mem);
+
+    rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
+    rangeset_destroy(mem);
+    if ( rc )
+        return rc;
+
+    return 0;
+}
+
+static int vpci_modify_bars(const struct pci_dev *pdev, const bool map)
+{
+    const struct vpci_header *header = &pdev->vpci->header;
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(header->bars); i++ )
+    {
+        const struct vpci_bar *bar = &header->bars[i];
+        int rc;
+
+        if ( !MAPPABLE_BAR(bar) )
+            continue;
+
+        rc = vpci_modify_bar(pdev->domain, bar, map);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
+static void vpci_cmd_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    val->u16 = pci_conf_read16(seg, bus, slot, func, reg);
+}
+
+static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    uint16_t cmd = val.u16, current_cmd;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    int rc;
+
+    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
+
+    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
+    {
+        /*
+         * Let the guest play with all the bits directly except for the
+         * memory decoding one.
+         */
+        pci_conf_write16(seg, bus, slot, func, reg, cmd);
+        return;
+    }
+
+    /* Memory space access change. */
+    rc = vpci_modify_bars(pdev, cmd & PCI_COMMAND_MEMORY);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u:unable to %smap BARs: %d\n",
+                seg, bus, slot, func,
+                cmd & PCI_COMMAND_MEMORY ? "" : "un", rc);
+        return;
+    }
+
+    pci_conf_write16(seg, bus, slot, func, reg, cmd);
+}
+
+static void vpci_bar_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    ASSERT(bar->type == VPCI_BAR_MEM32 || bar->type == VPCI_BAR_MEM64_LO ||
+           bar->type == VPCI_BAR_MEM64_HI);
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( bar->sizing )
+        val->u32 = ~(bar->size - 1) >> (hi ? 32 : 0);
+    else
+        val->u32 = bar->addr >> (hi ? 32 : 0);
+
+    if ( !hi )
+    {
+        val->u32 |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                                : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val->u32 |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+}
+
+static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    struct vpci_bar *bar = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint32_t wdata = val.u32, size_mask;
+    bool hi = false;
+
+    switch ( bar->type )
+    {
+    case VPCI_BAR_MEM32:
+    case VPCI_BAR_MEM64_LO:
+        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
+        break;
+    case VPCI_BAR_MEM64_HI:
+        size_mask = ~0u;
+        break;
+    default:
+        ASSERT_UNREACHABLE();
+        return;
+    }
+
+    if ( (wdata & size_mask) == size_mask )
+    {
+        /* Next reads from this register are going to return the BAR size. */
+        bar->sizing = true;
+        return;
+    }
+
+    /* End previous sizing cycle if any. */
+    bar->sizing = false;
+
+    /*
+     * Ignore attempts to change the position of the BAR if memory decoding is
+     * active.
+     */
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY )
+        return;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    if ( !hi )
+        wdata &= PCI_BASE_ADDRESS_MEM_MASK;
+
+    /* Update the relevant part of the BAR address. */
+    bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));
+    bar->addr |= (uint64_t)wdata << (hi ? 32 : 0);
+
+    /* Make sure Xen writes back the same value for the BAR RO bits. */
+    if ( !hi )
+        wdata |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                                 PCI_FUNC(pdev->devfn), reg) &
+                                 ~PCI_BASE_ADDRESS_MEM_MASK;
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, wdata);
+}
+
+static void vpci_rom_read(struct pci_dev *pdev, unsigned int reg,
+                          union vpci_val *val, void *data)
+{
+    const struct vpci_bar *rom = data;
+
+    if ( rom->sizing )
+        val->u32 = ~(rom->size - 1);
+    else
+        val->u32 = rom->addr;
+
+    val->u32 |= rom->enabled ? PCI_ROM_ADDRESS_ENABLE : 0;
+}
+
+static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
+                           union vpci_val val, void *data)
+{
+    struct vpci_bar *rom = data;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    const uint32_t wdata = val.u32;
+
+    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
+    {
+        /* Next reads from this register are going to return the BAR size. */
+        rom->sizing = true;
+        return;
+    }
+
+    /* End previous sizing cycle if any. */
+    rom->sizing = false;
+
+    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
+
+    /* Check if memory decoding is enabled. */
+    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
+         PCI_COMMAND_MEMORY &&
+         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )
+    {
+        if ( vpci_modify_bar(pdev->domain, rom,
+                             wdata & PCI_ROM_ADDRESS_ENABLE) )
+            return;
+
+        rom->enabled = wdata & PCI_ROM_ADDRESS_ENABLE;
+    }
+
+    pci_conf_write32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), reg, wdata);
+}
+
+static int vpci_init_bars(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    uint8_t header_type;
+    uint16_t cmd;
+    uint32_t rom_val;
+    uint64_t addr, size;
+    unsigned int i, num_bars, rom_reg;
+    struct vpci_header *header = &pdev->vpci->header;
+    struct vpci_bar *bars = header->bars;
+    int rc;
+
+    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
+    switch ( header_type )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        num_bars = 6;
+        rom_reg = PCI_ROM_ADDRESS;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        num_bars = 2;
+        rom_reg = PCI_ROM_ADDRESS1;
+        break;
+    default:
+        return -EOPNOTSUPP;
+    }
+
+    /* Setup a handler for the command register. */
+    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
+                           2, header);
+    if ( rc )
+        return rc;
+
+    /* Disable memory decoding before sizing. */
+    if ( cmd & PCI_COMMAND_MEMORY )
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
+                         cmd & ~PCI_COMMAND_MEMORY);
+
+    for ( i = 0; i < num_bars; i++ )
+    {
+        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
+        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
+
+        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
+        {
+            bars[i].type = VPCI_BAR_MEM64_HI;
+            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                                   &bars[i]);
+            if ( rc )
+                return rc;
+
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        {
+            bars[i].type = VPCI_BAR_IO;
+            continue;
+        }
+        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+             PCI_BASE_ADDRESS_MEM_TYPE_64 )
+            bars[i].type = VPCI_BAR_MEM64_LO;
+        else
+            bars[i].type = VPCI_BAR_MEM32;
+
+        /* Size the BAR and map it. */
+        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
+                              &addr, &size);
+        if ( rc < 0 )
+            return rc;
+
+        if ( size == 0 )
+        {
+            bars[i].type = VPCI_BAR_EMPTY;
+            continue;
+        }
+
+        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
+        bars[i].size = size;
+        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
+
+        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
+                               &bars[i]);
+        if ( rc )
+            return rc;
+    }
+
+    /* Check expansion ROM. */
+    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
+    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
+        pci_conf_write32(seg, bus, slot, func, rom_reg,
+                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);
+
+    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);
+    if ( rc < 0 )
+        return rc;
+
+    if ( size )
+    {
+        struct vpci_bar *rom = &header->bars[num_bars];
+
+        rom->type = VPCI_BAR_ROM;
+        rom->size = size;
+        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
+        if ( rom->enabled )
+            rom->addr = addr;
+        else
+            rom->addr = INVALID_PADDR;
+
+        rc = vpci_add_register(pdev, vpci_rom_read, vpci_rom_write, rom_reg, 4,
+                               rom);
+        if ( rc )
+            return rc;
+
+        if ( rom->enabled )
+            pci_conf_write32(seg, bus, slot, func, rom_reg, rom_val);
+    }
+
+    if ( cmd & PCI_COMMAND_MEMORY )
+    {
+        rc = vpci_modify_bars(pdev, true);
+        if ( rc )
+            return rc;
+
+        /* Enable memory decoding. */
+        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND, cmd);
+    }
+
+    return 0;
+}
+
+REGISTER_VPCI_INIT(vpci_init_bars);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 5e1b0bb3da..452ee482e8 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -63,6 +63,29 @@ void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
 struct vpci {
     /* Root pointer for the tree of vPCI handlers. */
     struct list_head handlers;
+
+#ifdef __XEN__
+    /* Hide the rest of the vpci struct from the user-space test harness. */
+    struct vpci_header {
+        /* Information about the PCI BARs of this device. */
+        struct vpci_bar {
+            enum {
+                VPCI_BAR_EMPTY,
+                VPCI_BAR_IO,
+                VPCI_BAR_MEM32,
+                VPCI_BAR_MEM64_LO,
+                VPCI_BAR_MEM64_HI,
+                VPCI_BAR_ROM,
+            } type;
+            paddr_t addr;
+            uint64_t size;
+            bool prefetchable;
+            bool sizing;
+            bool enabled;
+        } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
+        /* FIXME: currently there's no support for SR-IOV. */
+    } header;
+#endif
 };
 
 #endif
-- 
2.11.0 (Apple Git-81)



* [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2017-06-30 15:01 ` [PATCH v4 6/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-07-18  8:56   ` Paul Durrant
  2017-08-02 13:34   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Paul Durrant, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

Add handlers for the MSI control, address, data and mask fields in
order to detect accesses to them and setup the interrupts as requested
by the guest.

Note that the pending register is not trapped, and the guest can
freely read/write to it.

Whether Xen provides this functionality (MSI emulation) to Dom0 is
controlled by the "msi" suboption of the dom0 command line option. When
this option is disabled, Xen hides the MSI capability structure from
Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v3:
 - Propagate changes from previous versions: drop xen_ prefix, drop
   return value from handlers, use the new vpci_val fields.
 - Use MASK_EXTR.
 - Remove the usage of GENMASK.
 - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
 - Add "arch" to the MSI arch specific functions.
 - Move the dumping of vPCI MSI information to dump_msi (key 'M').
 - Remove the guest_vectors field.
 - Allow the guest to change the number of active vectors without
   having to disable and enable MSI.
 - Check the number of active vectors when parsing the disable
   mask.
 - Remove the debug messages from vpci_init_msi.
 - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
 - Use trylock in the dump handler to get the vpci lock.

Changes since v2:
 - Add an arch-specific abstraction layer. Note that this is only implemented
   for x86 currently.
 - Add a wrapper to detect MSI enabling for vPCI.

NB: I've only been able to test this with devices using a single MSI interrupt
and no mask register. I will try to find hardware that supports the mask
register and more than one vector, but I cannot make any promises.

If there are doubts about the untested parts we could always force Xen to
report no per-vector masking support and only 1 available vector, but I would
rather avoid doing it.
---
 xen/arch/x86/hvm/vmsi.c      | 149 ++++++++++++++++++
 xen/arch/x86/msi.c           |   3 +
 xen/drivers/vpci/Makefile    |   2 +-
 xen/drivers/vpci/msi.c       | 348 +++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/io.h |  18 +++
 xen/include/asm-x86/msi.h    |   1 +
 xen/include/xen/hvm/irq.h    |   2 +
 xen/include/xen/vpci.h       |  26 ++++
 8 files changed, 548 insertions(+), 1 deletion(-)
 create mode 100644 xen/drivers/vpci/msi.c

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index a36692c313..5732c70b5c 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -622,3 +622,152 @@ void msix_write_completion(struct vcpu *v)
     if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
         gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
 }
+
+static unsigned int msi_vector(uint16_t data)
+{
+    return MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
+}
+
+static unsigned int msi_flags(uint16_t data, uint64_t addr)
+{
+    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK);
+    dm = MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK);
+    dest_id = MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK);
+    deliv_mode = MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
+    trig_mode = MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
+
+    return (dest_id << GFLAGS_SHIFT_DEST_ID) | (rh << GFLAGS_SHIFT_RH) |
+           (dm << GFLAGS_SHIFT_DM) | (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
+           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
+}
+
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    struct domain *d = pdev->domain;
+    const struct pirq *pinfo;
+    struct irq_desc *desc;
+    unsigned long flags;
+    int irq;
+
+    ASSERT(arch->pirq >= 0);
+    pinfo = pirq_info(d, arch->pirq + entry);
+    ASSERT(pinfo);
+
+    irq = pinfo->arch.irq;
+    ASSERT(irq < nr_irqs && irq >= 0);
+
+    desc = irq_to_desc(irq);
+    ASSERT(desc);
+
+    spin_lock_irqsave(&desc->lock, flags);
+    guest_mask_msi_irq(desc, mask);
+    spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                         uint64_t address, uint32_t data, unsigned int vectors)
+{
+    struct msi_info msi_info = {
+        .seg = pdev->seg,
+        .bus = pdev->bus,
+        .devfn = pdev->devfn,
+        .entry_nr = vectors,
+    };
+    unsigned int i;
+    int rc;
+
+    ASSERT(arch->pirq == -1);
+
+    /* Get a PIRQ. */
+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                PCI_FUNC(pdev->devfn), rc);
+        return rc;
+    }
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+            .u.msi.gvec = msi_vector(data) + i,
+            .u.msi.gflags = msi_flags(data, address),
+        };
+
+        pcidevs_lock();
+        rc = pt_irq_create_bind(pdev->domain, &bind);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
+            spin_lock(&pdev->domain->event_lock);
+            unmap_domain_pirq(pdev->domain, arch->pirq);
+            spin_unlock(&pdev->domain->event_lock);
+            pcidevs_unlock();
+            arch->pirq = -1;
+            return rc;
+        }
+        pcidevs_unlock();
+    }
+
+    return 0;
+}
+
+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                          unsigned int vectors)
+{
+    unsigned int i;
+
+    ASSERT(arch->pirq != -1);
+
+    for ( i = 0; i < vectors; i++ )
+    {
+        xen_domctl_bind_pt_irq_t bind = {
+            .machine_irq = arch->pirq + i,
+            .irq_type = PT_IRQ_TYPE_MSI,
+        };
+
+        pcidevs_lock();
+        pt_irq_destroy_bind(pdev->domain, &bind);
+        pcidevs_unlock();
+    }
+
+    pcidevs_lock();
+    spin_lock(&pdev->domain->event_lock);
+    unmap_domain_pirq(pdev->domain, arch->pirq);
+    spin_unlock(&pdev->domain->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = -1;
+
+    return 0;
+}
+
+int vpci_msi_arch_init(struct vpci_arch_msi *arch)
+{
+    arch->pirq = -1;
+    return 0;
+}
+
+void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
+                         uint64_t addr)
+{
+    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
+           MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
+           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
+           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
+           arch->pirq);
+}
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d98f400699..573378d6c3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -30,6 +30,7 @@
 #include <public/physdev.h>
 #include <xen/iommu.h>
 #include <xsm/xsm.h>
+#include <xen/vpci.h>
 
 static s8 __read_mostly use_msi = -1;
 boolean_param("msi", use_msi);
@@ -1536,6 +1537,8 @@ static void dump_msi(unsigned char key)
                attr.guest_masked ? 'G' : ' ',
                mask);
     }
+
+    vpci_dump_msi();
 }
 
 static int __init msi_setup_keyhandler(void)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 241467212f..62cec9e82b 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o
+obj-y += vpci.o header.o msi.o
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
new file mode 100644
index 0000000000..d8f3418616
--- /dev/null
+++ b/xen/drivers/vpci/msi.c
@@ -0,0 +1,348 @@
+/*
+ * Handlers for accesses to the MSI capability structure.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <asm/msi.h>
+#include <xen/keyhandler.h>
+
+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
+static void vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
+                                  union vpci_val *val, void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    /* Set multiple message capable. */
+    val->u16 = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
+
+    if ( msi->enabled ) {
+        val->u16 |= PCI_MSI_FLAGS_ENABLE;
+        val->u16 |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);
+    }
+    val->u16 |= msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0;
+    val->u16 |= msi->address64 ? PCI_MSI_FLAGS_64BIT : 0;
+}
+
+static void vpci_msi_enable(struct pci_dev *pdev, struct vpci_msi *msi,
+                            unsigned int vectors)
+{
+    int ret;
+
+    ASSERT(!msi->vectors);
+
+    ret = vpci_msi_arch_enable(&msi->arch, pdev, msi->address, msi->data,
+                               vectors);
+    if ( ret )
+        return;
+
+    /* Apply the mask bits. */
+    if ( msi->masking )
+    {
+        unsigned int i;
+        uint32_t mask = msi->mask;
+
+        for ( i = ffs(mask) - 1; mask && i < vectors; i = ffs(mask) - 1 )
+        {
+            vpci_msi_arch_mask(&msi->arch, pdev, i, true);
+            __clear_bit(i, &mask);
+        }
+    }
+
+    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), msi->pos, 1);
+
+    msi->vectors = vectors;
+    msi->enabled = true;
+}
+
+static int vpci_msi_disable(struct pci_dev *pdev, struct vpci_msi *msi)
+{
+    int ret;
+
+    ASSERT(msi->vectors);
+
+    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                     PCI_FUNC(pdev->devfn), msi->pos, 0);
+
+    ret = vpci_msi_arch_disable(&msi->arch, pdev, msi->vectors);
+    if ( ret )
+        return ret;
+
+    msi->vectors = 0;
+    msi->enabled = false;
+
+    return 0;
+}
+
+static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
+                                   union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+    unsigned int vectors = 1 << MASK_EXTR(val.u16, PCI_MSI_FLAGS_QSIZE);
+    int ret;
+
+    if ( vectors > msi->max_vectors )
+        vectors = msi->max_vectors;
+
+    if ( !!(val.u16 & PCI_MSI_FLAGS_ENABLE) == msi->enabled &&
+         (vectors == msi->vectors || !msi->enabled) )
+        return;
+
+    if ( val.u16 & PCI_MSI_FLAGS_ENABLE )
+    {
+        if ( msi->enabled )
+        {
+            /*
+             * Change to the number of enabled vectors, disable and
+             * enable MSI in order to apply it.
+             */
+            ret = vpci_msi_disable(pdev, msi);
+            if ( ret )
+                return;
+        }
+        vpci_msi_enable(pdev, msi, vectors);
+    }
+    else
+        vpci_msi_disable(pdev, msi);
+}
+
+/* Handlers for the address field (32bit or low part of a 64bit address). */
+static void vpci_msi_address_read(struct pci_dev *pdev, unsigned int reg,
+                                  union vpci_val *val, void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    val->u32 = msi->address;
+}
+
+static void vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
+                                   union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear low part. */
+    msi->address &= ~(uint64_t)0xffffffff;
+    msi->address |= val.u32;
+}
+
+/* Handlers for the high part of a 64bit address field. */
+static void vpci_msi_address_upper_read(struct pci_dev *pdev, unsigned int reg,
+                                        union vpci_val *val, void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    val->u32 = msi->address >> 32;
+}
+
+static void vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
+                                         union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    /* Clear high part. */
+    msi->address &= ~((uint64_t)0xffffffff << 32);
+    msi->address |= (uint64_t)val.u32 << 32;
+}
+
+/* Handlers for the data field. */
+static void vpci_msi_data_read(struct pci_dev *pdev, unsigned int reg,
+                               union vpci_val *val, void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    val->u16 = msi->data;
+}
+
+static void vpci_msi_data_write(struct pci_dev *pdev, unsigned int reg,
+                                union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+
+    msi->data = val.u16;
+}
+
+static void vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
+                               union vpci_val *val, void *data)
+{
+    const struct vpci_msi *msi = data;
+
+    val->u32 = msi->mask;
+}
+
+static void vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
+                                union vpci_val val, void *data)
+{
+    struct vpci_msi *msi = data;
+    uint32_t dmask;
+
+    dmask = msi->mask ^ val.u32;
+
+    if ( !dmask )
+        return;
+
+    if ( msi->enabled )
+    {
+        unsigned int i;
+
+        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
+              i = ffs(dmask) - 1 )
+        {
+            vpci_msi_arch_mask(&msi->arch, pdev, i, MASK_EXTR(val.u32, 1 << i));
+            __clear_bit(i, &dmask);
+        }
+    }
+
+    msi->mask = val.u32;
+}
+
+static int vpci_init_msi(struct pci_dev *pdev)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msi *msi;
+    unsigned int msi_offset;
+    uint16_t control;
+    int ret;
+
+    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
+    if ( !msi_offset )
+        return 0;
+
+    msi = xzalloc(struct vpci_msi);
+    if ( !msi )
+        return -ENOMEM;
+
+    msi->pos = msi_offset;
+
+    control = pci_conf_read16(seg, bus, slot, func,
+                              msi_control_reg(msi_offset));
+
+    ret = vpci_add_register(pdev, vpci_msi_control_read,
+                            vpci_msi_control_write,
+                            msi_control_reg(msi_offset), 2, msi);
+    if ( ret )
+        goto error;
+
+    /* Get the maximum number of vectors the device supports. */
+    msi->max_vectors = multi_msi_capable(control);
+    ASSERT(msi->max_vectors <= 32);
+
+    /* No PIRQ bind yet. */
+    vpci_msi_arch_init(&msi->arch);
+
+    if ( is_64bit_address(control) )
+        msi->address64 = true;
+    if ( is_mask_bit_support(control) )
+        msi->masking = true;
+
+    ret = vpci_add_register(pdev, vpci_msi_address_read,
+                            vpci_msi_address_write,
+                            msi_lower_address_reg(msi_offset), 4, msi);
+    if ( ret )
+        goto error;
+
+    ret = vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
+                            msi_data_reg(msi_offset, msi->address64), 2,
+                            msi);
+    if ( ret )
+        goto error;
+
+    if ( msi->address64 )
+    {
+        ret = vpci_add_register(pdev, vpci_msi_address_upper_read,
+                                vpci_msi_address_upper_write,
+                                msi_upper_address_reg(msi_offset), 4, msi);
+        if ( ret )
+            goto error;
+    }
+
+    if ( msi->masking )
+    {
+        ret = vpci_add_register(pdev, vpci_msi_mask_read, vpci_msi_mask_write,
+                                msi_mask_bits_reg(msi_offset,
+                                                  msi->address64), 4, msi);
+        if ( ret )
+            goto error;
+    }
+
+    pdev->vpci->msi = msi;
+
+    return 0;
+
+ error:
+    ASSERT(ret);
+    xfree(msi);
+    return ret;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msi);
+
+void vpci_dump_msi(void)
+{
+    struct domain *d;
+
+    for_each_domain ( d )
+    {
+        const struct pci_dev *pdev;
+
+        if ( !has_vpci(d) )
+            continue;
+
+        printk("vPCI MSI information for guest %u\n", d->domain_id);
+
+        if ( !vpci_trylock(d) )
+        {
+            printk("Unable to get vPCI lock, skipping\n");
+            continue;
+        }
+
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
+        {
+            uint8_t seg = pdev->seg, bus = pdev->bus;
+            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+            struct vpci_msi *msi = pdev->vpci->msi;
+
+            if ( !msi )
+                continue;
+
+            printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+
+            printk("Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
+                   msi->enabled, msi->masking, msi->address64);
+            printk("Max vectors: %u enabled vectors: %u\n",
+                   msi->max_vectors, msi->vectors);
+
+            vpci_msi_arch_print(&msi->arch, msi->data, msi->address);
+
+            if ( msi->masking )
+                printk("mask=%#032x\n", msi->mask);
+        }
+        vpci_unlock(d);
+    }
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 4fe996fe49..55ed094734 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -20,6 +20,7 @@
 #define __ASM_X86_HVM_IO_H__
 
 #include <xen/mm.h>
+#include <xen/pci.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vioapic.h>
 #include <public/hvm/ioreq.h>
@@ -126,6 +127,23 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
 void msix_write_completion(struct vcpu *);
 void msixtbl_init(struct domain *d);
 
+/* Arch-specific MSI data for vPCI. */
+struct vpci_arch_msi {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI helpers. */
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask);
+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                         uint64_t address, uint32_t data,
+                         unsigned int vectors);
+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                          unsigned int vectors);
+int vpci_msi_arch_init(struct vpci_arch_msi *arch);
+void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
+                         uint64_t addr);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index 213ee53f72..9c36c34372 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -48,6 +48,7 @@
 #define MSI_ADDR_REDIRECTION_SHIFT  3
 #define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
 #define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
+#define MSI_ADDR_REDIRECTION_MASK   0x8
 
 #define MSI_ADDR_DEST_ID_SHIFT		12
 #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 0d2c72c109..d07185a479 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -57,7 +57,9 @@ struct dev_intx_gsi_link {
 #define VMSI_DELIV_MASK   0x7000
 #define VMSI_TRIG_MODE    0x8000
 
+#define GFLAGS_SHIFT_DEST_ID        0
 #define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
 #define GFLAGS_SHIFT_DELIV_MODE     12
 #define GFLAGS_SHIFT_TRG_MODE       15
 
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 452ee482e8..2a7d7557b3 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -13,6 +13,7 @@
  * of just returning whether the lock is hold by any CPU).
  */
 #define vpci_lock(d) spin_lock_recursive(&(d)->arch.hvm_domain.vpci_lock)
+#define vpci_trylock(d) spin_trylock_recursive(&(d)->arch.hvm_domain.vpci_lock)
 #define vpci_unlock(d) spin_unlock_recursive(&(d)->arch.hvm_domain.vpci_lock)
 #define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
 
@@ -85,9 +86,34 @@ struct vpci {
         } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
         /* FIXME: currently there's no support for SR-IOV. */
     } header;
+
+    /* MSI data. */
+    struct vpci_msi {
+        /* Offset of the capability in the config space. */
+        unsigned int pos;
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_vectors;
+        /* Number of vectors configured. */
+        unsigned int vectors;
+        /* Address and data fields. */
+        uint64_t address;
+        uint16_t data;
+        /* Mask bitfield. */
+        uint32_t mask;
+        /* Enabled? */
+        bool enabled;
+        /* Supports per-vector masking? */
+        bool masking;
+        /* 64-bit address capable? */
+        bool address64;
+        /* Arch-specific data. */
+        struct vpci_arch_msi arch;
+    } *msi;
 #endif
 };
 
+void vpci_dump_msi(void);
+
 #endif
 
 /*
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2017-06-30 15:01 ` [PATCH v4 7/9] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-08-02 14:13   ` Jan Beulich
  2017-06-30 15:01 ` [PATCH v4 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
       [not found] ` <20170630150117.88489-2-roger.pau@citrix.com>
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

This is needed for MSI-X, which must be initialized before the BARs
are parsed, so that the header BAR handlers are aware of the MSI-X
related holes and can avoid mapping them, which is required for the
trap handlers to work properly.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Add a numerical suffix to the section used to store the pointer to
   each initializer function, and sort the sections at link time.
---
 xen/arch/arm/xen.lds.S    | 2 +-
 xen/arch/x86/xen.lds.S    | 2 +-
 xen/drivers/vpci/header.c | 2 +-
 xen/drivers/vpci/msi.c    | 2 +-
 xen/include/xen/vpci.h    | 9 ++++++---
 5 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
index a24d8e913a..a1fef99c76 100644
--- a/xen/arch/arm/xen.lds.S
+++ b/xen/arch/arm/xen.lds.S
@@ -42,7 +42,7 @@ SECTIONS
   . = ALIGN(PAGE_SIZE);
   .rodata : {
        __start_vpci_array = .;
-       *(.rodata.vpci)
+       *(SORT(.rodata.vpci.*))
        __end_vpci_array = .;
         _srodata = .;          /* Read-only data */
         /* Bug frames table */
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 451e7970da..93f104aaf5 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -77,7 +77,7 @@ SECTIONS
   __2M_rodata_start = .;       /* Start of 2M superpages, mapped RO. */
   .rodata : {
        __start_vpci_array = .;
-       *(.rodata.vpci)
+       *(SORT(.rodata.vpci.*))
        __end_vpci_array = .;
        _srodata = .;
        /* Bug frames table */
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 3c800c4cf7..ae5719ab1a 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -459,7 +459,7 @@ static int vpci_init_bars(struct pci_dev *pdev)
     return 0;
 }
 
-REGISTER_VPCI_INIT(vpci_init_bars);
+REGISTER_VPCI_INIT(vpci_init_bars, VPCI_PRIORITY_LOW);
 
 /*
  * Local variables:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index d8f3418616..5261cda5f4 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -290,7 +290,7 @@ static int vpci_init_msi(struct pci_dev *pdev)
     return ret;
 }
 
-REGISTER_VPCI_INIT(vpci_init_msi);
+REGISTER_VPCI_INIT(vpci_init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 2a7d7557b3..ca693f3667 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -37,9 +37,12 @@ typedef void (vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
 
 typedef int (*vpci_register_init_t)(struct pci_dev *dev);
 
-#define REGISTER_VPCI_INIT(x)                   \
-  static const vpci_register_init_t x##_entry   \
-               __used_section(".rodata.vpci") = x
+#define VPCI_PRIORITY_HIGH      "1"
+#define VPCI_PRIORITY_LOW       "9"
+
+#define REGISTER_VPCI_INIT(x, p)                        \
+  static const vpci_register_init_t x##_entry           \
+               __used_section(".rodata.vpci." p) = x
 
 /* Add vPCI handlers to device. */
 int __must_check vpci_add_handlers(struct pci_dev *dev);
-- 
2.11.0 (Apple Git-81)



* [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2017-06-30 15:01 ` [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2017-06-30 15:01 ` Roger Pau Monne
  2017-08-02 15:07   ` Jan Beulich
       [not found] ` <20170630150117.88489-2-roger.pau@citrix.com>
  8 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-06-30 15:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, julien.grall, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add handlers for accesses to the MSI-X message control field in the
PCI configuration space, and traps for accesses to the memory region
that contains the MSI-X table and PBA. These traps detect attempts by
the guest to configure MSI-X interrupts and set them up properly.

Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
BIR are not trapped by Xen at the moment.

Finally, turn the panic in the Dom0 PVH builder into a warning.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Propagate changes from previous versions: remove xen_ prefix, use
   the new fields in vpci_val and remove the return value from
   handlers.
 - Remove the usage of GENMASK.
 - Move the arch-specific parts of the dump routine to the
   x86/hvm/vmsi.c dump handler.
 - Chain the MSI-X dump handler to the 'M' debug key.
 - Fix the header BAR mappings so that the MSI-X regions inside of
   BARs are unmapped from the domain p2m in order for the handlers to
   work properly.
 - Unconditionally trap and forward accesses to the PBA MSI-X area.
 - Simplify the conditionals in vpci_msix_control_write.
 - Fix vpci_msix_accept to use a bool type.
 - Allow all supported accesses as described in the spec to the MSI-X
   table.
 - Truncate the returned address when the access is a 32b read.
 - Always return X86EMUL_OKAY from the handlers, returning ~0 in the
   read case if the access is not supported, or ignoring writes.
 - Do not check that max_entries is != 0 in the init handler.
 - Use trylock in the dump handler.

Changes since v2:
 - Split out arch-specific code.

This patch has been tested with devices using both a single MSI-X
entry and multiple ones.
---
 xen/arch/x86/hvm/dom0_build.c    |   2 +-
 xen/arch/x86/hvm/hvm.c           |   1 +
 xen/arch/x86/hvm/vmsi.c          | 128 +++++++++-
 xen/arch/x86/msi.c               |   1 +
 xen/drivers/vpci/Makefile        |   2 +-
 xen/drivers/vpci/header.c        |  85 ++++++-
 xen/drivers/vpci/msix.c          | 503 +++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/domain.h |   3 +
 xen/include/asm-x86/hvm/io.h     |  18 ++
 xen/include/xen/vpci.h           |  39 +++
 10 files changed, 774 insertions(+), 8 deletions(-)
 create mode 100644 xen/drivers/vpci/msix.c

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 6b9f76ec36..c060eb85eb 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -1091,7 +1091,7 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
-    panic("Building a PVHv2 Dom0 is not yet supported.");
+    printk("WARNING: PVH is an experimental mode with limited functionality\n");
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index f45e2bd23d..9277e84150 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -585,6 +585,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
     INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
     INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
+    INIT_LIST_HEAD(&d->arch.hvm_domain.msix_tables);
 
     rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
     if ( rc )
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 5732c70b5c..f1c72f23d9 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -643,17 +643,15 @@ static unsigned int msi_flags(uint16_t data, uint64_t addr)
            (trig_mode << GFLAGS_SHIFT_TRG_MODE);
 }
 
-void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
-                        unsigned int entry, bool mask)
+static void vpci_mask_pirq(struct domain *d, int pirq, bool mask)
 {
-    struct domain *d = pdev->domain;
     const struct pirq *pinfo;
     struct irq_desc *desc;
     unsigned long flags;
     int irq;
 
-    ASSERT(arch->pirq >= 0);
-    pinfo = pirq_info(d, arch->pirq + entry);
+    ASSERT(pirq >= 0);
+    pinfo = pirq_info(d, pirq);
     ASSERT(pinfo);
 
     irq = pinfo->arch.irq;
@@ -667,6 +665,12 @@ void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
     spin_unlock_irqrestore(&desc->lock, flags);
 }
 
+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
+                        unsigned int entry, bool mask)
+{
+    vpci_mask_pirq(pdev->domain, arch->pirq + entry, mask);
+}
+
 int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
                          uint64_t address, uint32_t data, unsigned int vectors)
 {
@@ -771,3 +775,117 @@ void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
            MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
            arch->pirq);
 }
+
+void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
+                         struct pci_dev *pdev, bool mask)
+{
+    vpci_mask_pirq(pdev->domain, arch->pirq, mask);
+}
+
+int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,
+                          struct pci_dev *pdev, uint64_t address,
+                          uint32_t data, unsigned int entry_nr,
+                          paddr_t table_base)
+{
+    struct domain *d = pdev->domain;
+    xen_domctl_bind_pt_irq_t bind = {
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .u.msi.gvec = msi_vector(data),
+        .u.msi.gflags = msi_flags(data, address),
+    };
+    int rc;
+
+    if ( arch->pirq == -1 )
+    {
+        struct msi_info msi_info = {
+            .seg = pdev->seg,
+            .bus = pdev->bus,
+            .devfn = pdev->devfn,
+            .table_base = table_base,
+            .entry_nr = entry_nr,
+        };
+
+        /* Map PIRQ. */
+        rc = allocate_and_map_msi_pirq(d, -1, &arch->pirq,
+                                       MAP_PIRQ_TYPE_MSI, &msi_info);
+        if ( rc )
+        {
+            dprintk(XENLOG_ERR,
+                    "%04x:%02x:%02x.%u: unable to map MSI-X PIRQ entry %u: %d\n",
+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                    PCI_FUNC(pdev->devfn), entry_nr, rc);
+            return rc;
+        }
+    }
+
+    bind.machine_irq = arch->pirq;
+    pcidevs_lock();
+    rc = pt_irq_create_bind(d, &bind);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: unable to create MSI-X bind %u: %d\n",
+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                PCI_FUNC(pdev->devfn), entry_nr, rc);
+        spin_lock(&d->event_lock);
+        unmap_domain_pirq(d, arch->pirq);
+        spin_unlock(&d->event_lock);
+        pcidevs_unlock();
+        arch->pirq = -1;
+        return rc;
+    }
+    pcidevs_unlock();
+
+    return 0;
+}
+
+int vpci_msix_arch_disable(struct vpci_arch_msix_entry *arch,
+                           struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    xen_domctl_bind_pt_irq_t bind = {
+        .irq_type = PT_IRQ_TYPE_MSI,
+        .machine_irq = arch->pirq,
+    };
+    int rc;
+
+    if ( arch->pirq == -1 )
+        return 0;
+
+    pcidevs_lock();
+    rc = pt_irq_destroy_bind(d, &bind);
+    if ( rc )
+    {
+        pcidevs_unlock();
+        return rc;
+    }
+
+    spin_lock(&d->event_lock);
+    unmap_domain_pirq(d, arch->pirq);
+    spin_unlock(&d->event_lock);
+    pcidevs_unlock();
+
+    arch->pirq = -1;
+
+    return 0;
+}
+
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch)
+{
+    arch->pirq = -1;
+    return 0;
+}
+
+void vpci_msix_arch_print(struct vpci_arch_msix_entry *entry, uint32_t data,
+                          uint64_t addr, bool masked, unsigned int pos)
+{
+    printk("%4u vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu mask=%u pirq: %d\n",
+           pos, MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
+           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
+           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
+           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
+           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
+           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
+           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
+           masked, entry->pirq);
+}
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 573378d6c3..ad5c27df18 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -1539,6 +1539,7 @@ static void dump_msi(unsigned char key)
     }
 
     vpci_dump_msi();
+    vpci_dump_msix();
 }
 
 static int __init msi_setup_keyhandler(void)
diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
index 62cec9e82b..55d1bdfda0 100644
--- a/xen/drivers/vpci/Makefile
+++ b/xen/drivers/vpci/Makefile
@@ -1 +1 @@
-obj-y += vpci.o header.o msi.o
+obj-y += vpci.o header.o msi.o msix.o
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ae5719ab1a..07056350d6 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -20,6 +20,7 @@
 #include <xen/sched.h>
 #include <xen/vpci.h>
 #include <xen/p2m-common.h>
+#include <asm/p2m.h>
 
 #define MAPPABLE_BAR(x)                                                 \
     (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
@@ -100,11 +101,45 @@ static int vpci_map_range(unsigned long s, unsigned long e, void *data)
     return modify_mmio(map->d, _gfn(s), _mfn(s), e - s + 1, map->map);
 }
 
+static int vpci_unmap_msix(struct domain *d, struct vpci_msix_mem *msix)
+{
+    unsigned long gfn;
+
+    for ( gfn = PFN_DOWN(msix->addr); gfn <= PFN_UP(msix->addr + msix->size);
+          gfn++ )
+    {
+        p2m_type_t t;
+        mfn_t mfn = get_gfn(d, gfn, &t);
+        int rc;
+
+        if ( mfn_eq(mfn, INVALID_MFN) )
+        {
+            /* Nothing to do, this is already a hole. */
+            put_gfn(d, gfn);
+            continue;
+        }
+
+        if ( !p2m_is_mmio(t) )
+        {
+            put_gfn(d, gfn);
+            return -EINVAL;
+        }
+
+        rc = modify_mmio(d, _gfn(gfn), mfn, 1, false);
+        put_gfn(d, gfn);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
                            const bool map)
 {
     struct rangeset *mem;
     struct map_data data = { .d = d, .map = map };
+    unsigned int i;
     int rc;
 
     ASSERT(MAPPABLE_BAR(bar));
@@ -113,6 +148,35 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
     if ( IS_ERR(mem) )
         return -PTR_ERR(mem);
 
+    /*
+     * Make sure the MSI-X regions of the BAR are not mapped into the domain
+     * p2m, or else the MSI-X handlers are useless. Only do this when mapping,
+     * since that's when the memory decoding on the device is enabled.
+     */
+    for ( i = 0; map && i < ARRAY_SIZE(bar->msix); i++ )
+    {
+        struct vpci_msix_mem *msix = bar->msix[i];
+
+        if ( !msix || msix->addr == INVALID_PADDR )
+            continue;
+
+        rc = vpci_unmap_msix(d, msix);
+        if ( rc )
+        {
+            rangeset_destroy(mem);
+            return rc;
+        }
+
+        rc = rangeset_remove_range(mem, PFN_DOWN(msix->addr),
+                                   PFN_DOWN(msix->addr + msix->size));
+        if ( rc )
+        {
+            rangeset_destroy(mem);
+            return rc;
+        }
+
+    }
+
     rc = rangeset_report_ranges(mem, 0, ~0ul, vpci_map_range, &data);
     rangeset_destroy(mem);
     if ( rc )
@@ -221,6 +285,7 @@ static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
     uint8_t seg = pdev->seg, bus = pdev->bus;
     uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
     uint32_t wdata = val.u32, size_mask;
+    unsigned int i;
     bool hi = false;
 
     switch ( bar->type )
@@ -269,6 +334,11 @@ static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
     bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));
     bar->addr |= (uint64_t)wdata << (hi ? 32 : 0);
 
+    /* Update any MSI-X areas contained in this BAR. */
+    for ( i = 0; i < ARRAY_SIZE(bar->msix); i++ )
+        if ( bar->msix[i] )
+            bar->msix[i]->addr = bar->addr + bar->msix[i]->offset;
+
     /* Make sure Xen writes back the same value for the BAR RO bits. */
     if ( !hi )
         wdata |= pci_conf_read32(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
@@ -405,7 +475,20 @@ static int vpci_init_bars(struct pci_dev *pdev)
             continue;
         }
 
-        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
+        if ( cmd & PCI_COMMAND_MEMORY )
+        {
+            unsigned int j;
+
+            bars[i].addr = addr;
+
+            for ( j = 0; j < ARRAY_SIZE(bars[i].msix); j++ )
+                if ( bars[i].msix[j] )
+                    bars[i].msix[j]->addr = bars[i].addr +
+                                            bars[i].msix[j]->offset;
+        }
+        else
+            bars[i].addr = INVALID_PADDR;
+
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
new file mode 100644
index 0000000000..dc0e7070e4
--- /dev/null
+++ b/xen/drivers/vpci/msix.c
@@ -0,0 +1,503 @@
+/*
+ * Handlers for accesses to the MSI-X capability structure and the memory
+ * region.
+ *
+ * Copyright (C) 2017 Citrix Systems R&D
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <xen/vpci.h>
+#include <asm/msi.h>
+#include <xen/p2m-common.h>
+#include <xen/keyhandler.h>
+
+#define MSIX_SIZE(num) (offsetof(struct vpci_msix, entries[num]))
+#define MSIX_ADDR_IN_RANGE(a, table)                                    \
+    ((table)->addr != INVALID_PADDR && (a) >= (table)->addr &&          \
+     (a) < (table)->addr + (table)->size)
+
+static void vpci_msix_control_read(struct pci_dev *pdev, unsigned int reg,
+                                   union vpci_val *val, void *data)
+{
+    const struct vpci_msix *msix = data;
+
+    val->u16 = (msix->max_entries - 1) & PCI_MSIX_FLAGS_QSIZE;
+    val->u16 |= msix->enabled ? PCI_MSIX_FLAGS_ENABLE : 0;
+    val->u16 |= msix->masked ? PCI_MSIX_FLAGS_MASKALL : 0;
+}
+
+static void vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
+                                    union vpci_val val, void *data)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix = data;
+    paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
+    bool new_masked, new_enabled;
+    unsigned int i;
+    int rc;
+
+    new_masked = val.u16 & PCI_MSIX_FLAGS_MASKALL;
+    new_enabled = val.u16 & PCI_MSIX_FLAGS_ENABLE;
+
+    if ( !msix->enabled && new_enabled )
+    {
+        /* MSI-X enabled. */
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            if ( msix->entries[i].masked )
+                continue;
+
+            rc = vpci_msix_arch_enable(&msix->entries[i].arch, pdev,
+                                       msix->entries[i].addr,
+                                       msix->entries[i].data,
+                                       msix->entries[i].nr, table_base);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
+                         seg, bus, slot, func, i, rc);
+                return;
+            }
+
+            vpci_msix_arch_mask(&msix->entries[i].arch, pdev, false);
+        }
+    }
+    else if ( msix->enabled && !new_enabled )
+    {
+        /* MSI-X disabled. */
+        for ( i = 0; i < msix->max_entries; i++ )
+        {
+            rc = vpci_msix_arch_disable(&msix->entries[i].arch, pdev);
+            if ( rc )
+            {
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
+                         seg, bus, slot, func, i, rc);
+                return;
+            }
+        }
+    }
+
+    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
+         pci_msi_conf_write_intercept(pdev, reg, 2, &val.u32) >= 0 )
+        pci_conf_write16(seg, bus, slot, func, reg, val.u32);
+
+    msix->masked = new_masked;
+    msix->enabled = new_enabled;
+}
+
+static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
+{
+    struct vpci_msix *msix;
+
+    ASSERT(vpci_locked(d));
+    list_for_each_entry ( msix, &d->arch.hvm_domain.msix_tables, next )
+    {
+        uint8_t seg = msix->pdev->seg, bus = msix->pdev->bus;
+        uint8_t slot = PCI_SLOT(msix->pdev->devfn);
+        uint8_t func = PCI_FUNC(msix->pdev->devfn);
+        uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
+
+        if ( (cmd & PCI_COMMAND_MEMORY) &&
+             (MSIX_ADDR_IN_RANGE(addr, &msix->table) ||
+             MSIX_ADDR_IN_RANGE(addr, &msix->pba)) )
+            return msix;
+    }
+
+    return NULL;
+}
+
+static int vpci_msix_accept(struct vcpu *v, unsigned long addr)
+{
+    bool found;
+
+    vpci_lock(v->domain);
+    found = vpci_msix_find(v->domain, addr);
+    vpci_unlock(v->domain);
+
+    return found;
+}
+
+static int vpci_msix_access_check(struct pci_dev *pdev, unsigned long addr,
+                                  unsigned int len)
+{
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+
+    /* Only allow 32/64b accesses. */
+    if ( len != 4 && len != 8 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: invalid MSI-X table access size: %u\n",
+                 seg, bus, slot, func, len);
+        return -EINVAL;
+    }
+
+    /* Only allow aligned accesses. */
+    if ( (addr & (len - 1)) != 0 )
+    {
+        gdprintk(XENLOG_ERR,
+                 "%04x:%02x:%02x.%u: MSI-X only allows aligned accesses\n",
+                 seg, bus, slot, func);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static struct vpci_msix_entry *vpci_msix_get_entry(struct vpci_msix *msix,
+                                                   unsigned long addr)
+{
+    return &msix->entries[(addr - msix->table.addr) / PCI_MSIX_ENTRY_SIZE];
+}
+
+static int vpci_msix_read(struct vcpu *v, unsigned long addr,
+                          unsigned int len, unsigned long *data)
+{
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
+    const struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_lock(d);
+
+    msix = vpci_msix_find(d, addr);
+    if ( !msix )
+    {
+        vpci_unlock(d);
+        *data = ~0ul;
+        return X86EMUL_OKAY;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_unlock(d);
+        *data = ~0ul;
+        return X86EMUL_OKAY;
+    }
+
+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
+    {
+        /* Access to PBA. */
+        switch ( len )
+        {
+        case 4:
+            *data = readl(addr);
+            break;
+        case 8:
+            *data = readq(addr);
+            break;
+        default:
+            ASSERT_UNREACHABLE();
+            break;
+        }
+
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Get the table entry and offset. */
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        /*
+         * NB: do explicit truncation to the size of the access. This shouldn't
+         * be required here, since the caller of the handler should already
+         * take the appropriate measures to truncate the value before returning
+         * to the guest, but better be safe than sorry.
+         */
+        *data = len == 8 ? entry->addr : (uint32_t)entry->addr;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        *data = entry->addr >> 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        *data = entry->data;
+        if ( len == 8 )
+            *data |=
+                (uint64_t)(entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0) << 32;
+        break;
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+        *data = entry->masked ? PCI_MSIX_VECTOR_BITMASK : 0;
+        break;
+    default:
+        BUG();
+    }
+    vpci_unlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static int vpci_msix_write(struct vcpu *v, unsigned long addr,
+                                 unsigned int len, unsigned long data)
+{
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
+    struct vpci_msix_entry *entry;
+    unsigned int offset;
+
+    vpci_lock(d);
+    msix = vpci_msix_find(d, addr);
+    if ( !msix )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
+    {
+        /* Ignore writes to the PBA; its behavior is undefined. */
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
+    {
+        vpci_unlock(d);
+        return X86EMUL_OKAY;
+    }
+
+    /* Get the table entry and offset. */
+    entry = vpci_msix_get_entry(msix, addr);
+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
+
+    switch ( offset )
+    {
+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
+        if ( len == 8 )
+        {
+            entry->addr = data;
+            break;
+        }
+        entry->addr &= ~0xffffffff;
+        entry->addr |= data;
+        break;
+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
+        entry->addr &= ~((uint64_t)0xffffffff << 32);
+        entry->addr |= data << 32;
+        break;
+    case PCI_MSIX_ENTRY_DATA_OFFSET:
+        /*
+         * 8 byte writes to the msg data and vector control fields are
+         * only allowed if the entry is masked.
+         */
+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
+        {
+            vpci_unlock(d);
+            return X86EMUL_OKAY;
+        }
+
+        entry->data = data;
+
+        if ( len == 4 )
+            break;
+
+        data >>= 32;
+        /* fallthrough */
+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
+    {
+        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
+        struct pci_dev *pdev = msix->pdev;
+        paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
+        int rc;
+
+        if ( !msix->enabled )
+        {
+            entry->masked = new_masked;
+            break;
+        }
+
+        if ( new_masked != entry->masked && !new_masked )
+        {
+            /* Unmasking an entry, update it. */
+            rc = vpci_msix_arch_enable(&entry->arch, pdev, entry->addr,
+                                       entry->data, entry->nr, table_base);
+            if ( rc )
+            {
+                vpci_unlock(d);
+                gdprintk(XENLOG_ERR,
+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
+                         pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
+                         PCI_FUNC(pdev->devfn), entry->nr, rc);
+                return X86EMUL_OKAY;
+            }
+        }
+
+        vpci_msix_arch_mask(&entry->arch, pdev, new_masked);
+        entry->masked = new_masked;
+
+        break;
+    }
+    default:
+        BUG();
+    }
+    vpci_unlock(d);
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops vpci_msix_table_ops = {
+    .check = vpci_msix_accept,
+    .read = vpci_msix_read,
+    .write = vpci_msix_write,
+};
+
+static int vpci_init_msix(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    uint8_t seg = pdev->seg, bus = pdev->bus;
+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+    struct vpci_msix *msix;
+    unsigned int msix_offset, i, max_entries;
+    struct vpci_bar *table_bar, *pba_bar;
+    uint16_t control;
+    int rc;
+
+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
+    if ( !msix_offset )
+        return 0;
+
+    control = pci_conf_read16(seg, bus, slot, func,
+                              msix_control_reg(msix_offset));
+
+    /* Get the maximum number of vectors the device supports. */
+    max_entries = msix_table_size(control);
+
+    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
+    if ( !msix )
+        return -ENOMEM;
+
+    msix->max_entries = max_entries;
+    msix->pdev = pdev;
+
+    /* Find the MSI-X table address. */
+    msix->table.offset = pci_conf_read32(seg, bus, slot, func,
+                                         msix_table_offset_reg(msix_offset));
+    msix->table.bir = msix->table.offset & PCI_MSIX_BIRMASK;
+    msix->table.offset &= ~PCI_MSIX_BIRMASK;
+    msix->table.size = msix->max_entries * PCI_MSIX_ENTRY_SIZE;
+    msix->table.addr = INVALID_PADDR;
+
+    /* Find the MSI-X pba address. */
+    msix->pba.offset = pci_conf_read32(seg, bus, slot, func,
+                                       msix_pba_offset_reg(msix_offset));
+    msix->pba.bir = msix->pba.offset & PCI_MSIX_BIRMASK;
+    msix->pba.offset &= ~PCI_MSIX_BIRMASK;
+    msix->pba.size = DIV_ROUND_UP(msix->max_entries, 8);
+    msix->pba.addr = INVALID_PADDR;
+
+    for ( i = 0; i < msix->max_entries; i++ )
+    {
+        msix->entries[i].masked = true;
+        msix->entries[i].nr = i;
+        vpci_msix_arch_init(&msix->entries[i].arch);
+    }
+
+    if ( list_empty(&d->arch.hvm_domain.msix_tables) )
+        register_mmio_handler(d, &vpci_msix_table_ops);
+
+    list_add(&msix->next, &d->arch.hvm_domain.msix_tables);
+
+    rc = vpci_add_register(pdev, vpci_msix_control_read,
+                           vpci_msix_control_write,
+                           msix_control_reg(msix_offset), 2, msix);
+    if ( rc )
+    {
+        dprintk(XENLOG_ERR,
+                "%04x:%02x:%02x.%u: failed to add handler for MSI-X control: %d\n",
+                seg, bus, slot, func, rc);
+        goto error;
+    }
+
+    table_bar = &pdev->vpci->header.bars[msix->table.bir];
+    pba_bar = &pdev->vpci->header.bars[msix->pba.bir];
+
+    /*
+     * The header handlers will take care of leaving a hole for the MSI-X
+     * related areas; that's why MSI-X needs to be initialized before the
+     * header.
+     */
+    table_bar->msix[VPCI_BAR_MSIX_TABLE] = &msix->table;
+    pba_bar->msix[VPCI_BAR_MSIX_PBA] = &msix->pba;
+    pdev->vpci->msix = msix;
+
+    return 0;
+
+ error:
+    ASSERT(rc);
+    xfree(msix);
+    return rc;
+}
+
+REGISTER_VPCI_INIT(vpci_init_msix, VPCI_PRIORITY_HIGH);
+
+void vpci_dump_msix(void)
+{
+    struct domain *d;
+    struct pci_dev *pdev;
+
+    for_each_domain ( d )
+    {
+        if ( !has_vpci(d) )
+            continue;
+
+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);
+
+        if ( !vpci_trylock(d) )
+        {
+            printk("Unable to get vPCI lock, skipping\n");
+            continue;
+        }
+
+        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
+        {
+            uint8_t seg = pdev->seg, bus = pdev->bus;
+            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
+            struct vpci_msix *msix = pdev->vpci->msix;
+            unsigned int i;
+
+            if ( !msix )
+                continue;
+
+            printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
+
+            printk("Max entries: %u maskall: %u enabled: %u\n",
+                   msix->max_entries, msix->masked, msix->enabled);
+
+            printk("Guest table entries:\n");
+            for ( i = 0; i < msix->max_entries; i++ )
+                vpci_msix_arch_print(&msix->entries[i].arch,
+                                     msix->entries[i].data,
+                                     msix->entries[i].addr,
+                                     msix->entries[i].masked, i);
+        }
+        vpci_unlock(d);
+    }
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 7028f93861..980d718327 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -190,6 +190,9 @@ struct hvm_domain {
     /* List of ECAM (MMCFG) regions trapped by Xen. */
     struct list_head mmcfg_regions;
 
+    /* List of MSI-X tables. */
+    struct list_head msix_tables;
+
     /* List of permanently write-mapped pages. */
     struct {
         spinlock_t lock;
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 55ed094734..739eefe541 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -144,6 +144,24 @@ int vpci_msi_arch_init(struct vpci_arch_msi *arch);
 void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
                          uint64_t addr);
 
+/* Arch-specific MSI-X entry data for vPCI. */
+struct vpci_arch_msix_entry {
+    int pirq;
+};
+
+/* Arch-specific vPCI MSI-X helpers. */
+void vpci_msix_arch_mask(struct vpci_arch_msix_entry *arch,
+                         struct pci_dev *pdev, bool mask);
+int vpci_msix_arch_enable(struct vpci_arch_msix_entry *arch,
+                          struct pci_dev *pdev, uint64_t address,
+                          uint32_t data, unsigned int entry_nr,
+                          paddr_t table_base);
+int vpci_msix_arch_disable(struct vpci_arch_msix_entry *arch,
+                           struct pci_dev *pdev);
+int vpci_msix_arch_init(struct vpci_arch_msix_entry *arch);
+void vpci_msix_arch_print(struct vpci_arch_msix_entry *entry, uint32_t data,
+                          uint64_t addr, bool masked, unsigned int pos);
+
 enum stdvga_cache_state {
     STDVGA_CACHE_UNINITIALIZED,
     STDVGA_CACHE_ENABLED,
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index ca693f3667..5bc6380531 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -83,6 +83,10 @@ struct vpci {
             } type;
             paddr_t addr;
             uint64_t size;
+#define VPCI_BAR_MSIX_TABLE     0
+#define VPCI_BAR_MSIX_PBA       1
+#define VPCI_BAR_MSIX_NUM       2
+            struct vpci_msix_mem *msix[VPCI_BAR_MSIX_NUM];
             bool prefetchable;
             bool sizing;
             bool enabled;
@@ -112,10 +116,45 @@ struct vpci {
         /* Arch-specific data. */
         struct vpci_arch_msi arch;
     } *msi;
+
+    /* MSI-X data. */
+    struct vpci_msix {
+        struct pci_dev *pdev;
+        /* Maximum number of vectors supported by the device. */
+        unsigned int max_entries;
+        /* MSI-X enabled? */
+        bool enabled;
+        /* Masked? */
+        bool masked;
+        /* List link. */
+        struct list_head next;
+        /* Table information. */
+        struct vpci_msix_mem {
+            /* MSI-X table offset. */
+            unsigned int offset;
+            /* MSI-X table BIR. */
+            unsigned int bir;
+            /* Table addr. */
+            paddr_t addr;
+            /* Table size. */
+            unsigned int size;
+        } table;
+        /* PBA */
+        struct vpci_msix_mem pba;
+        /* Entries. */
+        struct vpci_msix_entry {
+            unsigned int nr;
+            uint64_t addr;
+            uint32_t data;
+            bool masked;
+            struct vpci_arch_msix_entry arch;
+        } entries[];
+    } *msix;
 #endif
 };
 
 void vpci_dump_msi(void);
+void vpci_dump_msix(void);
 
 #endif
 
-- 
2.11.0 (Apple Git-81)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
       [not found] ` <20170630150117.88489-2-roger.pau@citrix.com>
@ 2017-07-10 13:27   ` Paul Durrant
  2017-07-13 14:36   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Paul Durrant @ 2017-07-10 13:27 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Wei Liu, Andrew Cooper, julien.grall@arm.com, Jan Beulich,
	Ian Jackson, boris.ostrovsky@oracle.com, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 30 June 2017 16:01
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; julien.grall@arm.com;
> konrad.wilk@oracle.com; Roger Pau Monne <roger.pau@citrix.com>; Ian
> Jackson <Ian.Jackson@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses
> to the PCI config space
> 
> This functionality is going to reside in vpci.c (and the corresponding
> vpci.h header), and should be arch-agnostic. The handlers introduced
> in this patch setup the basic functionality required in order to trap
> accesses to the PCI config space, and allow decoding the address and
> finding the corresponding handler that should handle the access
> (although no handlers are implemented).
> 
> Note that the traps to the PCI IO ports registers (0xcf8/0xcfc) are
> setup inside of a x86 HVM file, since that's not shared with other
> arches.
> 
> A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen
> whether a domain should use the newly introduced vPCI handlers, this
> is only enabled for PVH Dom0 at the moment.
> 
> A very simple user-space test is also provided, so that the basic
> functionality of the vPCI traps can be asserted. This has been proven
> quite helpful during development, since the logic to handle partial
> accesses or accesses that expand across multiple registers is not
> trivial.
> 
> The handlers for the registers are added to a linked list that's kept
> sorted at all times. Both the read and write handlers support accesses
> that expand across multiple emulated registers and contain gaps not
> emulated.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v3:
> * User-space test harness:
>  - Fix spaces in container_of macro.
>  - Implement a dummy locking functions.
>  - Remove the 'current' macro and make current a pointer to the
>    statically allocated vcpu.
>  - Remove unneeded parentheses in the pci_conf_readX macros.
>  - Fix the name of the write test macro.
>  - Remove the dummy EXPORT_SYMBOL macro (this was needed by the RB
>    code only).
>  - Import the max macro.
>  - Test all possible read/write size combinations with all possible
>    emulated register sizes.
>  - Introduce a test for register removal.
> * Hypervisor code:
>  - Use a sorted list in order to store the config space handlers.
>  - Remove some unneeded 'else' branches.
>  - Make the IO port handlers always return X86EMUL_OKAY, and set the
>    data to all 1's in case of read failure (writes are simply ignored).
>  - In hvm_select_ioreq_server reuse local variables when calling
>    XEN_DMOP_PCI_SBDF.
>  - Store the pointers to the initialization functions in the .rodata
>    section.
>  - Do not ignore the return value of xen_vpci_add_handlers in
>    setup_one_hwdom_device.
>  - Remove the vpci_init macro.
>  - Do not hide the pointers inside of the vpci_{read/write}_t
>    typedefs.
>  - Rename priv_data to private in vpci_register.
>  - Simplify checking for register overlap in vpci_register_cmp.
>  - Check that the offset and the length match before removing a
>    register in xen_vpci_remove_register.
>  - Make vpci_read_hw return a value rather than storing it in a
>    pointer passed by parameter.
>  - Handler dispatcher functions vpci_{read/write} no longer return an
>    error code; errors on reads/writes should be treated like hardware
>    (writes ignored, reads return all 1's or garbage).
>  - Make sure pcidevs is locked before calling pci_get_pdev_by_domain.
>  - Use a recursive spinlock for the vpci lock, so that spin_is_locked
>    checks that the current CPU is holding the lock.
>  - Make the code less error-chatty by removing some of the printk's.
>  - Pass the slot and the function as separate parameters to the
>    handler dispatchers (instead of passing devfn).
>  - Allow handlers to be registered with either a read or write
>    function only, the missing handler will be replaced by a dummy
>    handler (writes ignored, reads return 1's).
>  - Introduce PCI_CFG_SPACE_* defines from Linux.
>  - Simplify the handler dispatchers by removing the recursion, now the
>    dispatchers iterate over the list of sorted handlers and call them
>    in order.
>  - Remove the GENMASK_BYTES, SHIFT_RIGHT_BYTES and ADD_RESULT
> macros,
>    and instead provide a merge_result function in order to merge a
>    register output into a partial result.
>  - Rename the fields of the vpci_val union to u8/u16/u32.
>  - Remove the return values from the read/write handlers, errors
>    should be handled internally and signaled as would be done on
>    native hardware.
>  - Remove the usage of the GENMASK macro.
> 
> Changes since v2:
>  - Generalize the PCI address decoding and use it for IOREQ code also.
> 
> Changes since v1:
>  - Allow access to cross a word-boundary.
>  - Add locking.
>  - Add cleanup to xen_vpci_add_handlers in case of failure.
> ---
[snip]
> diff --git a/xen/arch/arm/xen.lds.S b/xen/arch/arm/xen.lds.S
> index 44bd3bf0ce..a24d8e913a 100644
> --- a/xen/arch/arm/xen.lds.S
> +++ b/xen/arch/arm/xen.lds.S
> @@ -41,6 +41,9 @@ SECTIONS
> 
>    . = ALIGN(PAGE_SIZE);
>    .rodata : {
> +       __start_vpci_array = .;
> +       *(.rodata.vpci)
> +       __end_vpci_array = .;
>          _srodata = .;          /* Read-only data */
>          /* Bug frames table */
>         __start_bug_frames = .;
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index f7873da323..23e7df3838 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -376,11 +376,21 @@ static bool emulation_flags_ok(const struct domain
> *d, uint32_t emflags)
>      if ( is_hvm_domain(d) )
>      {
>          if ( is_hardware_domain(d) &&
> -             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
> -            return false;
> -        if ( !is_hardware_domain(d) && emflags &&
> -             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
> +             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
> +                         XEN_X86_EMU_VPCI) )
>              return false;
> +        if ( !is_hardware_domain(d) )
> +        {
> +            switch ( emflags )
> +            {
> +            case XEN_X86_EMU_ALL & ~XEN_X86_EMU_VPCI:
> +            case XEN_X86_EMU_LAPIC:
> +            case 0:
> +                break;
> +            default:
> +                return false;
> +            }
> +        }

Can't the if and the following switch be combined?

>      }
>      else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
>      {
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 3ed6ec468d..c4176ee458 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -36,6 +36,7 @@
>  #include <xen/rangeset.h>
>  #include <xen/monitor.h>
>  #include <xen/warning.h>
> +#include <xen/vpci.h>
>  #include <asm/shadow.h>
>  #include <asm/hap.h>
>  #include <asm/current.h>
> @@ -630,6 +631,7 @@ int hvm_domain_initialise(struct domain *d, unsigned
> long domcr_flags,
>          d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
> 
>      register_g2m_portio_handler(d);
> +    register_vpci_portio_handler(d);
> 
>      hvm_ioreq_init(d);
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 214ab307c4..4e91a485cd 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -25,6 +25,7 @@
>  #include <xen/trace.h>
>  #include <xen/event.h>
>  #include <xen/hypercall.h>
> +#include <xen/vpci.h>
>  #include <asm/current.h>
>  #include <asm/cpufeature.h>
>  #include <asm/processor.h>
> @@ -256,6 +257,147 @@ void register_g2m_portio_handler(struct domain
> *d)
>      handler->ops = &g2m_portio_ops;
>  }
> 
> +/* Do some sanity checks. */
> +static int vpci_access_check(unsigned int reg, unsigned int len)
> +{
> +    /* Check access size. */
> +    if ( len != 1 && len != 2 && len != 4 )
> +        return -EINVAL;
> +
> +    /* Check if access crosses a double-word boundary. */
> +    if ( (reg & 3) + len > 4 )
> +        return -EINVAL;
> +
> +    return 0;
> +}
> +
> +/* Helper to decode a PCI address. */
> +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                         unsigned int *bus, unsigned int *slot,
> +                         unsigned int *func, unsigned int *reg)
> +{
> +    unsigned long bdf;
> +
> +    ASSERT(CF8_ENABLED(cf8));
> +
> +    bdf = CF8_BDF(cf8);
> +    *bus = PCI_BUS(bdf);
> +    *slot = PCI_SLOT(bdf);
> +    *func = PCI_FUNC(bdf);
> +    /*
> +     * NB: the lower 2 bits of the register address are fetched from the
> +     * offset into the 0xcfc register when reading/writing to it.
> +     */
> +    *reg = CF8_ADDR_LO(cf8) | (addr & 3);
> +}
> +
> +/* vPCI config space IO ports handlers (0xcf8/0xcfc). */
> +static bool vpci_portio_accept(const struct hvm_io_handler *handler,
> +                               const ioreq_t *p)
> +{
> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> +}
> +
> +static int vpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr, uint32_t size, uint64_t *data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, slot, func, reg;
> +
> +    *data = ~(uint64_t)0;
> +
> +    vpci_lock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = d->arch.hvm_domain.pci_cf8;
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    /* Decode the PCI address. */
> +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot,
> &func,
> +                        &reg);
> +
> +    if ( vpci_access_check(reg, size) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    pcidevs_lock();
> +    *data = vpci_read(0, bus, slot, func, reg, size);
> +    pcidevs_unlock();
> +    vpci_unlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int vpci_portio_write(const struct hvm_io_handler *handler,
> +                             uint64_t addr, uint32_t size, uint64_t data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, slot, func, reg;
> +
> +    vpci_lock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        d->arch.hvm_domain.pci_cf8 = data;
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    /* Decode the PCI address. */
> +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot,
> &func,
> +                        &reg);
> +
> +    if ( vpci_access_check(reg, size) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    pcidevs_lock();
> +    vpci_write(0, bus, slot, func, reg, size, data);
> +    pcidevs_unlock();
> +    vpci_unlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct hvm_io_ops vpci_portio_ops = {
> +    .accept = vpci_portio_accept,
> +    .read = vpci_portio_read,
> +    .write = vpci_portio_write,
> +};
> +
> +void register_vpci_portio_handler(struct domain *d)
> +{
> +    struct hvm_io_handler *handler;
> +
> +    if ( !has_vpci(d) )
> +        return;
> +
> +    handler = hvm_next_io_handler(d);
> +    if ( !handler )
> +        return;
> +
> +    spin_lock_init(&d->arch.hvm_domain.vpci_lock);
> +    handler->type = IOREQ_TYPE_PIO;
> +    handler->ops = &vpci_portio_ops;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
> index b2a8b0e986..726c5c0c36 100644
> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -1178,18 +1178,16 @@ struct hvm_ioreq_server
> *hvm_select_ioreq_server(struct domain *d,
>           CF8_ENABLED(cf8) )
>      {
>          uint32_t sbdf, x86_fam;
> +        unsigned int bus, slot, func, reg;
> +
> +        hvm_pci_decode_addr(cf8, p->addr, &bus, &slot, &func, &reg);
> 
>          /* PCI config data cycle */
> 
> -        sbdf = XEN_DMOP_PCI_SBDF(0,
> -                                 PCI_BUS(CF8_BDF(cf8)),
> -                                 PCI_SLOT(CF8_BDF(cf8)),
> -                                 PCI_FUNC(CF8_BDF(cf8)));
> +        sbdf = XEN_DMOP_PCI_SBDF(0, bus, slot, func);
> 
>          type = XEN_DMOP_IO_RANGE_PCI;
> -        addr = ((uint64_t)sbdf << 32) |
> -               CF8_ADDR_LO(cf8) |
> -               (p->addr & 3);
> +        addr = ((uint64_t)sbdf << 32) | reg;
>          /* AMD extended configuration space access? */
>          if ( CF8_ADDR_HI(cf8) &&
>               d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index f7b927858c..4cf919f206 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1566,7 +1566,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>          domcr_flags |= DOMCRF_hvm |
>                         ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
>                           DOMCRF_hap : 0);
> -        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
> +        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC|
> +                                 XEN_X86_EMU_VPCI;
>      }
> 
>      /* Create initial domain 0. */
> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
> index 8289a1bf09..451e7970da 100644
> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -76,6 +76,9 @@ SECTIONS
> 
>    __2M_rodata_start = .;       /* Start of 2M superpages, mapped RO. */
>    .rodata : {
> +       __start_vpci_array = .;
> +       *(.rodata.vpci)
> +       __end_vpci_array = .;
>         _srodata = .;
>         /* Bug frames table */
>         __start_bug_frames = .;
> diff --git a/xen/drivers/Makefile b/xen/drivers/Makefile
> index 19391802a8..d51c766453 100644
> --- a/xen/drivers/Makefile
> +++ b/xen/drivers/Makefile
> @@ -1,6 +1,6 @@
>  subdir-y += char
>  subdir-$(CONFIG_HAS_CPUFREQ) += cpufreq
> -subdir-$(CONFIG_HAS_PCI) += pci
> +subdir-$(CONFIG_HAS_PCI) += pci vpci
>  subdir-$(CONFIG_HAS_PASSTHROUGH) += passthrough
>  subdir-$(CONFIG_ACPI) += acpi
>  subdir-$(CONFIG_VIDEO) += video
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 6e7126b2e8..3208cd5d71 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -30,6 +30,7 @@
>  #include <xen/radix-tree.h>
>  #include <xen/softirq.h>
>  #include <xen/tasklet.h>
> +#include <xen/vpci.h>
>  #include <xsm/xsm.h>
>  #include <asm/msi.h>
>  #include "ats.h"
> @@ -1026,9 +1027,10 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>                                    struct pci_dev *pdev)
>  {
>      u8 devfn = pdev->devfn;
> +    int err;
> 
>      do {
> -        int err = ctxt->handler(devfn, pdev);
> +        err = ctxt->handler(devfn, pdev);
> 
>          if ( err )
>          {
> @@ -1041,6 +1043,11 @@ static void setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>          devfn += pdev->phantom_stride;
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
> +
> +    err = vpci_add_handlers(pdev);
> +    if ( err )
> +        printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
> +               ctxt->d->domain_id, err);
>  }
> 
>  static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg,
>                                                   void *arg)
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> new file mode 100644
> index 0000000000..840a906470
> --- /dev/null
> +++ b/xen/drivers/vpci/Makefile
> @@ -0,0 +1 @@
> +obj-y += vpci.o
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> new file mode 100644
> index 0000000000..c54de83b82
> --- /dev/null
> +++ b/xen/drivers/vpci/vpci.c
> @@ -0,0 +1,405 @@
> +/*
> + * Generic functionality for handling accesses to the PCI configuration space
> + * from guests.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> +
> +/* Internal struct to store the emulated PCI registers. */
> +struct vpci_register {
> +    vpci_read_t *read;
> +    vpci_write_t *write;
> +    unsigned int size;
> +    unsigned int offset;
> +    void *private;
> +    struct list_head node;
> +};
> +
> +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
> +{
> +    unsigned int i;
> +    int rc = 0;
> +
> +    if ( !has_vpci(pdev->domain) )
> +        return 0;
> +
> +    pdev->vpci = xzalloc(struct vpci);
> +    if ( !pdev->vpci )
> +        return -ENOMEM;
> +
> +    INIT_LIST_HEAD(&pdev->vpci->handlers);
> +
> +    for ( i = 0; i < NUM_VPCI_INIT; i++ )
> +    {
> +        rc = __start_vpci_array[i](pdev);
> +        if ( rc )
> +            break;
> +    }
> +
> +    if ( rc )
> +    {
> +        while ( !list_empty(&pdev->vpci->handlers) )
> +        {
> +            struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +                                                       struct vpci_register,
> +                                                       node);
> +
> +            list_del(&r->node);
> +            xfree(r);
> +        }
> +        xfree(pdev->vpci);
> +    }
> +
> +    return rc;
> +}
> +
> +static int vpci_register_cmp(const struct vpci_register *r1,
> +                             const struct vpci_register *r2)
> +{
> +    /* Return 0 if registers overlap. */
> +    if ( r1->offset < r2->offset + r2->size &&
> +         r2->offset < r1->offset + r1->size )
> +        return 0;
> +    if ( r1->offset < r2->offset )
> +        return -1;
> +    if ( r1->offset > r2->offset )
> +        return 1;
> +
> +    ASSERT_UNREACHABLE();
> +    return 0;
> +}
> +
> +/* Dummy hooks, writes are ignored, reads return 1's */
> +static void vpci_ignored_read(struct pci_dev *pdev, unsigned int reg,
> +                              union vpci_val *val, void *data)
> +{
> +    val->u32 = ~(uint32_t)0;
> +}
> +
> +static void vpci_ignored_write(struct pci_dev *pdev, unsigned int reg,
> +                               union vpci_val val, void *data)
> +{
> +}
> +
> +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t read_handler,
> +                      vpci_write_t write_handler, unsigned int offset,
> +                      unsigned int size, void *data)
> +{
> +    struct list_head *head;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) ||
> +         offset >= PCI_CFG_SPACE_EXP_SIZE || offset & (size - 1) ||
> +         (read_handler == NULL && write_handler == NULL) )
> +        return -EINVAL;
> +
> +    r = xmalloc(struct vpci_register);
> +    if ( !r )
> +        return -ENOMEM;
> +
> +    r->read = read_handler ?: vpci_ignored_read;
> +    r->write = write_handler ?: vpci_ignored_write;
> +    r->size = size;
> +    r->offset = offset;
> +    r->private = data;
> +
> +    vpci_lock(pdev->domain);
> +
> +    /* The list of handlers must be kept sorted at all times. */
> +    list_for_each ( head, &pdev->vpci->handlers )
> +    {
> +        const struct vpci_register *this =
> +            list_entry(head, const struct vpci_register, node);
> +        int cmp = vpci_register_cmp(r, this);
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp == 0 )
> +        {
> +            vpci_unlock(pdev->domain);
> +            xfree(r);
> +            return -EEXIST;
> +        }
> +    }
> +
> +    list_add_tail(&r->node, head);
> +    vpci_unlock(pdev->domain);
> +
> +    return 0;
> +}
> +
> +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset,
> +                         unsigned int size)
> +{
> +    const struct vpci_register r = { .offset = offset, .size = size };
> +    struct vpci_register *rm = NULL;
> +
> +    vpci_lock(pdev->domain);
> +
> +    list_for_each_entry ( rm, &pdev->vpci->handlers, node )
> +        if ( vpci_register_cmp(&r, rm) <= 0 )
> +            break;
> +
> +    if ( !rm || rm->offset != offset || rm->size != size )
> +    {
> +        vpci_unlock(pdev->domain);
> +        return -ENOENT;
> +    }
> +
> +    list_del(&rm->node);
> +    vpci_unlock(pdev->domain);
> +    xfree(rm);
> +
> +    return 0;
> +}
> +
> +/* Wrappers for performing reads/writes to the underlying hardware. */
> +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
> +                             unsigned int slot, unsigned int func,
> +                             unsigned int reg, uint32_t size)
> +{
> +    uint32_t data;
> +
> +    switch ( size )
> +    {
> +    case 4:
> +        data = pci_conf_read32(seg, bus, slot, func, reg);
> +        break;
> +    case 2:
> +        data = pci_conf_read16(seg, bus, slot, func, reg);
> +        break;
> +    case 1:
> +        data = pci_conf_read8(seg, bus, slot, func, reg);
> +        break;
> +    default:
> +        BUG();
> +    }
> +
> +    return data;
> +}
> +
> +static void vpci_write_hw(unsigned int seg, unsigned int bus,
> +                          unsigned int slot, unsigned int func,
> +                          unsigned int reg, uint32_t size, uint32_t data)
> +{
> +    switch ( size )
> +    {
> +    case 4:
> +        pci_conf_write32(seg, bus, slot, func, reg, data);
> +        break;
> +    case 3:
> +        /*
> +         * This is possible because a 4byte write can have 1byte trapped and
> +         * the rest passed-through.
> +         */
> +        if ( reg & 1 )
> +        {
> +            pci_conf_write8(seg, bus, slot, func, reg, data);
> +            pci_conf_write16(seg, bus, slot, func, reg + 1, data >> 8);
> +        }
> +        else
> +        {
> +            pci_conf_write16(seg, bus, slot, func, reg, data);
> +            pci_conf_write8(seg, bus, slot, func, reg + 2, data >> 16);
> +        }
> +        break;
> +    case 2:
> +        pci_conf_write16(seg, bus, slot, func, reg, data);
> +        break;
> +    case 1:
> +        pci_conf_write8(seg, bus, slot, func, reg, data);
> +        break;
> +    default:
> +        BUG();
> +    }
> +}
> +
> +/*
> + * Merge new data into a partial result.
> + *
> + * Zero the bytes of 'data' in [offset, offset + size), and merge in the
> + * low 'size' bytes of 'new', left shifted by 'offset' bytes.
> + */
> +uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
> +                      unsigned int offset)
> +{
> +    uint32_t mask = ((uint64_t)1 << (size * 8)) - 1;
> +
> +    return (data & ~(mask << (offset * 8))) | ((new & mask) << (offset * 8));
> +}
> +
> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, uint32_t size)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_offset = 0;
> +    uint32_t data;
> +
> +    ASSERT(pcidevs_locked());
> +    ASSERT(vpci_locked(d));
> +
> +    /*
> +     * Read the hardware value.
> +     * NB: at the moment vPCI passes everything through (i.e. permissive).
> +     */
> +    data = vpci_read_hw(seg, bus, slot, func, reg, size);
> +
> +    /* Find the PCI dev matching the address. */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
> +    if ( !pdev )
> +        return data;
> +
> +    /* Replace any values reported by the emulated registers. */
> +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> +    {
> +        const struct vpci_register emu = {
> +            .offset = reg + data_offset,
> +            .size = size - data_offset
> +        };
> +        int cmp = vpci_register_cmp(&emu, r);
> +        union vpci_val val = { .u32 = ~0 };
> +        unsigned int merge_size;
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp > 0 )
> +            continue;
> +
> +        r->read(pdev, r->offset, &val, r->private);
> +
> +        /* Check if the read is in the middle of a register. */
> +        if ( r->offset < emu.offset )
> +            val.u32 >>= (emu.offset - r->offset) * 8;
> +
> +        data_offset = max(emu.offset, r->offset) - reg;
> +        /* Find the intersection size between the two sets. */
> +        merge_size = min(emu.offset + emu.size, r->offset + r->size) -
> +                     max(emu.offset, r->offset);
> +        /* Merge the emulated data into the native read value. */
> +        data = merge_result(data, val.u32, merge_size, data_offset);
> +        data_offset += merge_size;
> +        if ( data_offset == size )
> +            break;
> +    }
> +
> +    return data;
> +}
> +
> +/*
> + * Perform a maybe partial write to a register.
> + *
> + * Note that this will only work for simple registers, if Xen needs to
> + * trap accesses to rw1c registers (like the status PCI header register)
> + * the logic in vpci_write will have to be expanded in order to correctly
> + * deal with them.
> + */
> +static void vpci_write_helper(struct pci_dev *pdev,
> +                              const struct vpci_register *r, unsigned int size,
> +                              unsigned int offset, uint32_t data)
> +{
> +    union vpci_val val = { .u32 = data };
> +
> +    ASSERT(size <= r->size);
> +    if ( size != r->size )
> +    {
> +        r->read(pdev, r->offset, &val, r->private);
> +        val.u32 = merge_result(val.u32, data, size, offset);
> +    }
> +
> +    r->write(pdev, r->offset, val, r->private);
> +}
> +
> +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> +                unsigned int func, unsigned int reg, uint32_t size,
> +                uint32_t data)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_offset = 0;
> +
> +    ASSERT(pcidevs_locked());
> +    ASSERT(vpci_locked(d));
> +
> +    /*
> +     * Find the PCI dev matching the address.
> +     * Pass through everything that's not trapped.
> +     */
> +    pdev = pci_get_pdev_by_domain(d, seg, bus, PCI_DEVFN(slot, func));
> +    if ( !pdev )
> +    {
> +        vpci_write_hw(seg, bus, slot, func, reg, size, data);
> +        return;
> +    }
> +
> +    /* Write the value to the hardware or emulated registers. */
> +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> +    {
> +        const struct vpci_register emu = {
> +            .offset = reg + data_offset,
> +            .size = size - data_offset
> +        };
> +        int cmp = vpci_register_cmp(&emu, r);
> +        unsigned int write_size;
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp > 0 )
> +            continue;
> +
> +        if ( emu.offset < r->offset )
> +        {
> +            /* Leading gap, write partial content to hardware. */
> +            vpci_write_hw(seg, bus, slot, func, emu.offset,
> +                          r->offset - emu.offset, data >> (data_offset * 8));
> +            data_offset += r->offset - emu.offset;
> +        }
> +
> +        /* Find the intersection size between the two sets. */
> +        write_size = min(emu.offset + emu.size, r->offset + r->size) -
> +                     max(emu.offset, r->offset);
> +        vpci_write_helper(pdev, r, write_size, reg + data_offset - r->offset,
> +                          data >> (data_offset * 8));
> +        data_offset += write_size;
> +        if ( data_offset == size )
> +            break;
> +    }
> +
> +    if ( data_offset < size )
> +        /* Trailing gap, write the remaining bytes. */
> +        vpci_write_hw(seg, bus, slot, func, reg + data_offset,
> +                      size - data_offset, data >> (data_offset * 8));
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> +
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index 27d80eeff4..9be09df85d 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -427,6 +427,7 @@ struct arch_domain
>  #define has_vpit(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_PIT))
>  #define has_pirq(d)        (!!((d)->arch.emulation_flags & \
>                              XEN_X86_EMU_USE_PIRQ))
> +#define has_vpci(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_VPCI))
> 
>  #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
> 
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
> index d2899c9bb2..cbf4170789 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -184,6 +184,9 @@ struct hvm_domain {
>      /* List of guest to machine IO ports mapping. */
>      struct list_head g2m_ioport_list;
> 
> +    /* Lock for the PCI emulation layer (vPCI). */
> +    spinlock_t vpci_lock;
> +
>      /* List of permanently write-mapped pages. */
>      struct {
>          spinlock_t lock;
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 2484eb1c75..0af1ed14dc 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -149,12 +149,20 @@ void stdvga_deinit(struct domain *d);
> 
>  extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
> 
> +/* Decode a PCI port IO access into a bus/devfn/reg. */
> +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                         unsigned int *bus, unsigned int *slot,
> +                         unsigned int *func, unsigned int *reg);
> +
>  /*
>   * HVM port IO handler that performs forwarding of guest IO ports into
> machine
>   * IO ports.
>   */
>  void register_g2m_portio_handler(struct domain *d);
> 
> +/* HVM port IO handler for PCI accesses. */
> +void register_vpci_portio_handler(struct domain *d);
> +
>  #endif /* __ASM_X86_HVM_IO_H__ */
> 
> 
> diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
> index f21332e897..86a1a09a8d 100644
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -295,12 +295,15 @@ struct xen_arch_domainconfig {
>  #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
>  #define _XEN_X86_EMU_USE_PIRQ       9
>  #define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
> +#define _XEN_X86_EMU_VPCI           10
> +#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
> 
>  #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
>                                       XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
>                                       XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
>                                       XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
> -                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
> +                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
> +                                     XEN_X86_EMU_VPCI)
>      uint32_t emulation_flags;
>  };
> 
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 59b6e8a81c..a9b80e330b 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -88,6 +88,9 @@ struct pci_dev {
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
>      u64 vf_rlen[6];
> +
> +    /* Data for vPCI. */
> +    struct vpci *vpci;
>  };
> 
>  #define for_each_pdev(domain, pdev) \
> diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
> index ecd6124d91..cc4ee3b83e 100644
> --- a/xen/include/xen/pci_regs.h
> +++ b/xen/include/xen/pci_regs.h
> @@ -23,6 +23,14 @@
>  #define LINUX_PCI_REGS_H
> 
>  /*
> + * Conventional PCI and PCI-X Mode 1 devices have 256 bytes of
> + * configuration space.  PCI-X Mode 2 and PCIe devices have 4096 bytes of
> + * configuration space.
> + */
> +#define PCI_CFG_SPACE_SIZE	256
> +#define PCI_CFG_SPACE_EXP_SIZE	4096
> +
> +/*
>   * Under PCI, each device has 256 bytes of configuration address space,
>   * of which the first 64 bytes are standardized as follows:
>   */
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> new file mode 100644
> index 0000000000..5e1b0bb3da
> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,79 @@
> +#ifndef _VPCI_
> +#define _VPCI_
> +
> +#include <xen/pci.h>
> +#include <xen/types.h>
> +#include <xen/list.h>
> +
> +/*
> + * Helpers for locking/unlocking.
> + *
> + * NB: the recursive variants are used so that spin_is_locked
> + * returns whether the lock is held by the current CPU (instead
> + * of just returning whether the lock is held by any CPU).
> + */
> +#define vpci_lock(d) spin_lock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_unlock(d) spin_unlock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> +
> +/* Value read or written by the handlers. */
> +union vpci_val {
> +    uint8_t u8;
> +    uint16_t u16;
> +    uint32_t u32;
> +};
> +
> +/*
> + * The vPCI handlers will never be called concurrently for the same domain; it
> + * is guaranteed that the vpci domain lock will always be held when calling
> + * any handler.
> + */
> +typedef void (vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val *val, void *data);
> +
> +typedef void (vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
> +                            union vpci_val val, void *data);
> +
> +typedef int (*vpci_register_init_t)(struct pci_dev *dev);
> +
> +#define REGISTER_VPCI_INIT(x)                   \
> +  static const vpci_register_init_t x##_entry   \
> +               __used_section(".rodata.vpci") = x
> +
> +/* Add vPCI handlers to device. */
> +int __must_check vpci_add_handlers(struct pci_dev *dev);
> +
> +/* Add/remove a register handler. */
> +int __must_check vpci_add_register(const struct pci_dev *pdev,
> +                                   vpci_read_t read_handler,
> +                                   vpci_write_t write_handler,
> +                                   unsigned int offset,
> +                                   unsigned int size, void *data);
> +int __must_check vpci_remove_register(const struct pci_dev *pdev,
> +                                      unsigned int offset,
> +                                      unsigned int size);
> +
> +/* Generic read/write handlers for the PCI config space. */
> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, uint32_t size);
> +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> +                unsigned int func, unsigned int reg, uint32_t size,
> +                uint32_t data);
> +
> +struct vpci {
> +    /* Root pointer for the tree of vPCI handlers. */
> +    struct list_head handlers;
> +};
> +
> +#endif
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> +

All the rest LGTM.

  Paul

> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
@ 2017-07-10 13:34   ` Paul Durrant
  2017-07-13 20:15   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Paul Durrant @ 2017-07-10 13:34 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, julien.grall@arm.com, Jan Beulich,
	boris.ostrovsky@oracle.com, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 30 June 2017 16:01
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; julien.grall@arm.com;
> konrad.wilk@oracle.com; Roger Pau Monne <roger.pau@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG
> areas
> 
> Introduce a set of handlers for the accesses to the MMCFG areas. Those
> areas are set up based on the contents of the hardware MMCFG tables,
> and the list of handled MMCFG areas is stored inside of the hvm_domain
> struct.
> 
> The read/writes are forwarded to the generic vpci handlers once the
> address is decoded in order to obtain the device and register the
> guest is trying to access.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v3:
>  - Propagate changes from previous patches: drop xen_ prefix for vpci
>    functions, pass slot and func instead of devfn and fix the error
>    paths of the MMCFG handlers.
>  - s/ecam/mmcfg/.
>  - Move the destroy code to a separate function, so the hvm_mmcfg
>    struct can be private to hvm/io.c.
>  - Constify the return of vpci_mmcfg_find.
>  - Use d instead of v->domain in vpci_mmcfg_accept.
>  - Allow 8byte accesses to the mmcfg.
> 
> Changes since v1:
>  - Added locking.
> ---
>  xen/arch/x86/hvm/dom0_build.c    |  27 ++++++
>  xen/arch/x86/hvm/hvm.c           |   3 +
>  xen/arch/x86/hvm/io.c            | 188 ++++++++++++++++++++++++++++++++++++++-
>  xen/include/asm-x86/hvm/domain.h |   3 +
>  xen/include/asm-x86/hvm/io.h     |   7 ++
>  5 files changed, 225 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index 0e7d06be95..57db8adc8d 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -38,6 +38,8 @@
>  #include <public/hvm/hvm_info_table.h>
>  #include <public/hvm/hvm_vcpu.h>
> 
> +#include "../x86_64/mmconfig.h"
> +
>  /*
>   * Have the TSS cover the ISA port range, which makes it
>   * - 104 bytes base structure
> @@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>      return 0;
>  }
> 
> +int __init pvh_setup_mmcfg(struct domain *d)
> +{
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> +    {
> +        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
> +                                         pci_mmcfg_config[i].start_bus_number,
> +                                         pci_mmcfg_config[i].end_bus_number,
> +                                         pci_mmcfg_config[i].pci_segment);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    return 0;
> +}
> +
>  int __init dom0_construct_pvh(struct domain *d, const module_t *image,
>                                unsigned long image_headroom,
>                                module_t *initrd,
> @@ -1090,6 +1110,13 @@ int __init dom0_construct_pvh(struct domain *d, const module_t *image,
>          return rc;
>      }
> 
> +    rc = pvh_setup_mmcfg(d);
> +    if ( rc )
> +    {
> +        printk("Failed to setup Dom0 PCI MMCFG areas: %d\n", rc);
> +        return rc;
> +    }
> +
>      panic("Building a PVHv2 Dom0 is not yet supported.");
>      return 0;
>  }
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index c4176ee458..f45e2bd23d 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -584,6 +584,7 @@ int hvm_domain_initialise(struct domain *d, unsigned long domcr_flags,
>      spin_lock_init(&d->arch.hvm_domain.write_map.lock);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.write_map.list);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.g2m_ioport_list);
> +    INIT_LIST_HEAD(&d->arch.hvm_domain.mmcfg_regions);
> 
>      rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
>      if ( rc )
> @@ -729,6 +730,8 @@ void hvm_domain_destroy(struct domain *d)
>          list_del(&ioport->list);
>          xfree(ioport);
>      }
> +
> +    destroy_vpci_mmcfg(&d->arch.hvm_domain.mmcfg_regions);
>  }
> 
>  static int hvm_save_tsc_adjust(struct domain *d, hvm_domain_context_t *h)
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 4e91a485cd..bb67f3accc 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -261,11 +261,11 @@ void register_g2m_portio_handler(struct domain *d)
>  static int vpci_access_check(unsigned int reg, unsigned int len)
>  {
>      /* Check access size. */
> -    if ( len != 1 && len != 2 && len != 4 )
> +    if ( len != 1 && len != 2 && len != 4 && len != 8 )
>          return -EINVAL;
> 
> -    /* Check if access crosses a double-word boundary. */
> -    if ( (reg & 3) + len > 4 )
> +    /* Check if access crosses a double-word boundary or it's not aligned. */
> +    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && (reg & 3) != 0) )

Maybe !!(reg & 3) in the second clause to be consistent with the previous clause's boolean usage of (reg & 3)?
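For reference, the check with that change applied would look like this (a standalone sketch compiled outside Xen, with EINVAL taken from <errno.h>; not the actual patch):

```c
#include <assert.h>
#include <errno.h>

/*
 * Standalone copy of vpci_access_check from the hunk above, with the
 * suggested !!(reg & 3) form in the 8-byte clause.
 */
static int vpci_access_check(unsigned int reg, unsigned int len)
{
    /* Check access size. */
    if ( len != 1 && len != 2 && len != 4 && len != 8 )
        return -EINVAL;

    /* Check if access crosses a double-word boundary or it's not aligned. */
    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && !!(reg & 3)) )
        return -EINVAL;

    return 0;
}
```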

>          return -EINVAL;
> 
>      return 0;
> @@ -398,6 +398,188 @@ void register_vpci_portio_handler(struct domain *d)
>      handler->ops = &vpci_portio_ops;
>  }
> 
> +struct hvm_mmcfg {
> +    paddr_t addr;
> +    size_t size;
> +    unsigned int bus;
> +    unsigned int segment;
> +    struct list_head next;
> +};
> +
> +/* Handlers to trap PCI ECAM config accesses. */
> +static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d,
> +                                               unsigned long addr)
> +{
> +    const struct hvm_mmcfg *mmcfg;
> +
> +    ASSERT(vpci_locked(d));
> +    list_for_each_entry ( mmcfg, &d->arch.hvm_domain.mmcfg_regions, next )
> +        if ( addr >= mmcfg->addr && addr < mmcfg->addr + mmcfg->size )
> +            return mmcfg;
> +
> +    return NULL;
> +}
> +
> +static void vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
> +                                   unsigned long addr, unsigned int *bus,
> +                                   unsigned int *slot, unsigned int *func,
> +                                   unsigned int *reg)
> +{
> +    addr -= mmcfg->addr;
> +    *bus = ((addr >> 20) & 0xff) + mmcfg->bus;
> +    *slot = (addr >> 15) & 0x1f;
> +    *func = (addr >> 12) & 0x7;
> +    *reg = addr & 0xfff;

Lots of magic numbers here. Perhaps define some macros analogous to the CF8 ones?
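Something along these lines, perhaps (illustrative sketch only — these macro names are made up here and are not part of the patch; the ECAM offset layout decoded above is bus[27:20], slot[19:15], func[14:12], reg[11:0]):

```c
#include <assert.h>

/*
 * Hypothetical decode macros for the ECAM/MMCFG offset layout used in
 * vpci_mmcfg_decode_addr, analogous to the CF8_* helpers.
 */
#define MMCFG_BDF_BUS(a)   (((a) >> 20) & 0xff)  /* bus[27:20] */
#define MMCFG_BDF_SLOT(a)  (((a) >> 15) & 0x1f)  /* slot[19:15] */
#define MMCFG_BDF_FUNC(a)  (((a) >> 12) & 0x7)   /* func[14:12] */
#define MMCFG_REG(a)       ((a) & 0xfff)         /* reg[11:0] */
```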

> +}
> +
> +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
> +{
> +    struct domain *d = v->domain;
> +    bool found;
> +
> +    vpci_lock(d);
> +    found = vpci_mmcfg_find(d, addr);
> +    vpci_unlock(d);
> +
> +    return found;
> +}
> +
> +static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
> +                           unsigned int len, unsigned long *data)
> +{
> +    struct domain *d = v->domain;
> +    const struct hvm_mmcfg *mmcfg;
> +    unsigned int bus, slot, func, reg;
> +
> +    *data = ~(unsigned long)0;
> +
> +    vpci_lock(d);
> +    mmcfg = vpci_mmcfg_find(d, addr);
> +    if ( !mmcfg )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
> +
> +    if ( vpci_access_check(reg, len) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    pcidevs_lock();
> +    if ( len == 8 )
> +    {
> +        /*
> +         * According to the PCIe 3.1A specification:
> +         *  - Configuration Reads and Writes must usually be DWORD or smaller
> +         *    in size.
> +         *  - Because Root Complex implementations are not required to support
> +         *    accesses to a RCRB that cross DW boundaries [...] software
> +         *    should take care not to cause the generation of such accesses
> +         *    when accessing a RCRB unless the Root Complex will support the
> +         *    access.
> +         *  Xen however supports 8byte accesses by splitting them into two
> +         *  4byte accesses.
> +         */
> +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, 4);
> +        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
> +                                     reg + 4, 4) << 32;
> +    }
> +    else
> +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, len);
> +    pcidevs_unlock();
> +    vpci_unlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int vpci_mmcfg_write(struct vcpu *v, unsigned long addr,
> +                            unsigned int len, unsigned long data)
> +{
> +    struct domain *d = v->domain;
> +    const struct hvm_mmcfg *mmcfg;
> +    unsigned int bus, slot, func, reg;
> +
> +    vpci_lock(d);
> +    mmcfg = vpci_mmcfg_find(d, addr);
> +    if ( !mmcfg )
> +        return X86EMUL_OKAY;
> +
> +    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
> +
> +    if ( vpci_access_check(reg, len) )
> +        return X86EMUL_OKAY;
> +
> +    pcidevs_lock();
> +    if ( len == 8 )
> +    {
> +        vpci_write(mmcfg->segment, bus, slot, func, reg, 4, data);
> +        vpci_write(mmcfg->segment, bus, slot, func, reg + 4, 4, data >> 32);
> +    }
> +    else
> +        vpci_write(mmcfg->segment, bus, slot, func, reg, len, data);
> +    pcidevs_unlock();
> +    vpci_unlock(d);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct hvm_mmio_ops vpci_mmcfg_ops = {
> +    .check = vpci_mmcfg_accept,
> +    .read = vpci_mmcfg_read,
> +    .write = vpci_mmcfg_write,
> +};
> +
> +int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
> +                                unsigned int start_bus, unsigned int end_bus,
> +                                unsigned int seg)
> +{
> +    struct hvm_mmcfg *mmcfg;
> +
> +    ASSERT(is_hardware_domain(d));
> +
> +    vpci_lock(d);
> +    if ( vpci_mmcfg_find(d, addr) )
> +    {
> +        vpci_unlock(d);
> +        return -EEXIST;
> +    }
> +
> +    mmcfg = xmalloc(struct hvm_mmcfg);
> +    if ( !mmcfg )
> +    {
> +        vpci_unlock(d);
> +        return -ENOMEM;
> +    }
> +
> +    if ( list_empty(&d->arch.hvm_domain.mmcfg_regions) )
> +        register_mmio_handler(d, &vpci_mmcfg_ops);
> +
> +    mmcfg->addr = addr + (start_bus << 20);
> +    mmcfg->bus = start_bus;
> +    mmcfg->segment = seg;
> +    mmcfg->size = (end_bus - start_bus + 1) << 20;
> +    list_add(&mmcfg->next, &d->arch.hvm_domain.mmcfg_regions);
> +    vpci_unlock(d);
> +
> +    return 0;
> +}
> +
> +void destroy_vpci_mmcfg(struct list_head *domain_mmcfg)
> +{
> +    while ( !list_empty(domain_mmcfg) )
> +    {
> +        struct hvm_mmcfg *mmcfg = list_first_entry(domain_mmcfg,
> +                                                   struct hvm_mmcfg, next);
> +
> +        list_del(&mmcfg->next);
> +        xfree(mmcfg);
> +    }
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
> index cbf4170789..7028f93861 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -187,6 +187,9 @@ struct hvm_domain {
>      /* Lock for the PCI emulation layer (vPCI). */
>      spinlock_t vpci_lock;
> 
> +    /* List of ECAM (MMCFG) regions trapped by Xen. */
> +    struct list_head mmcfg_regions;
> +
>      /* List of permanently write-mapped pages. */
>      struct {
>          spinlock_t lock;
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 0af1ed14dc..4fe996fe49 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -163,6 +163,13 @@ void register_g2m_portio_handler(struct domain *d);
>  /* HVM port IO handler for PCI accesses. */
>  void register_vpci_portio_handler(struct domain *d);
> 
> +/* HVM MMIO handler for PCI MMCFG accesses. */
> +int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
> +                                unsigned int start_bus, unsigned int end_bus,
> +                                unsigned int seg);
> +/* Destroy tracked MMCFG areas. */
> +void destroy_vpci_mmcfg(struct list_head *domain_mmcfg);
> +
>  #endif /* __ASM_X86_HVM_IO_H__ */
> 

Rest LGTM.

  Paul

> 
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
       [not found] ` <20170630150117.88489-2-roger.pau@citrix.com>
  2017-07-10 13:27   ` [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Paul Durrant
@ 2017-07-13 14:36   ` Jan Beulich
  2017-07-14 15:33     ` Roger Pau Monné
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-13 14:36 UTC (permalink / raw)
  To: roger.pau
  Cc: wei.liu2, andrew.cooper3, ian.jackson, julien.grall, paul.durrant,
	xen-devel, boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> --- /dev/null
> +++ b/tools/tests/vpci/Makefile
> @@ -0,0 +1,40 @@
> +
> +XEN_ROOT=$(CURDIR)/../../..
> +include $(XEN_ROOT)/tools/Rules.mk
> +
> +TARGET := test_vpci
> +
> +.PHONY: all
> +all: $(TARGET)
> +
> +.PHONY: run
> +run: $(TARGET)
> +    ./$(TARGET) > $(TARGET).out

Is this a good way to run a test? Aiui it'll result in there not being
anything visible immediately; one has to go look at the produced file.
I'd suggest to leave it to the person invoking "make run" whether to
redirect output.

> +$(TARGET): vpci.c vpci.h list.h
> +    $(HOSTCC) -g -o $@ vpci.c main.c

If you compile main.c, why is there no dependency on it? And how about
emul.h?

> +.PHONY: clean
> +clean:
> +    rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c list.h
> +
> +.PHONY: distclean
> +distclean: clean
> +
> +.PHONY: install
> +install:
> +
> +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> +    sed -e '/#include/d' <$< >$@

Couldn't you combine this and list.h's rule into a pattern one?

> --- /dev/null
> +++ b/tools/tests/vpci/emul.h
> @@ -0,0 +1,117 @@
> +/*
> + * Unit tests for the generic vPCI handler code.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef _TEST_VPCI_
> +#define _TEST_VPCI_
> +
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <errno.h>
> +#include <assert.h>
> +
> +#define container_of(ptr, type, member) ({                      \
> +        typeof(((type *)0)->member) *__mptr = (ptr);            \
> +        (type *)((char *)__mptr - offsetof(type, member));      \

I don't know what tools maintainers think about such name space
violations; in hypervisor code I'd ask you to avoid leading underscores
in macro local variables (same in min()/max() and elsewhere then).

> +/* Read a 32b register using all possible sizes. */
> +void multiread4(unsigned int reg, uint32_t val)
> +{
> +    unsigned int i;
> +
> +    /* Read using bytes. */
> +    for ( i = 0; i < 4; i++ )
> +        VPCI_READ_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
> +
> +    /* Read using 2bytes. */
> +    for ( i = 0; i < 2; i++ )
> +        VPCI_READ_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
> +
> +    VPCI_READ_CHECK(reg, 4, val);
> +}
> +
> +void multiwrite4_check(unsigned int reg, uint32_t val)

Naming question again: Why the _check suffix here, but not on the read
function above?

> +{
> +    unsigned int i;
> +
> +    /* Write using bytes. */
> +    for ( i = 0; i < 4; i++ )
> +        VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
> +    multiread4(reg, val);
> +
> +    /* Write using 2bytes. */
> +    for ( i = 0; i < 2; i++ )
> +        VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
> +    multiread4(reg, val);
> +
> +    VPCI_WRITE_CHECK(reg, 4, val);
> +    multiread4(reg, val);
> +}

Wouldn't it be better to vary the value written between the individual
sizes? Perhaps move the 32-bit write between the two loops, using ~val?
Otherwise you won't know whether what you read back is a result of the
writes you actually mean to test or earlier ones?

> +int
> +main(int argc, char **argv)
> +{
> +    /* Index storage by offset. */
> +    uint32_t r0 = 0xdeadbeef;
> +    uint8_t r5 = 0xef, r6 = 0xbe, r7 = 0xef;
> +    uint16_t r12 = 0x8696;
> +    uint16_t r20[2] = { 0 };

Just { } will suffice.

> +    uint32_t r24 = 0;
> +    uint8_t r28, r30;
> +    unsigned int i;
> +    int rc;
> +
> +    INIT_LIST_HEAD(&vpci.handlers);
> +
> +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);

Why aren't you using r0 here?

> +    VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> +    VPCI_READ_CHECK(5, 1, 0xef);
> +    VPCI_WRITE_CHECK(5, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> +    VPCI_READ_CHECK(6, 1, 0xbe);
> +    VPCI_WRITE_CHECK(6, 1, 0xba);
> +
> +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> +    VPCI_READ_CHECK(7, 1, 0xef);
> +    VPCI_WRITE_CHECK(7, 1, 0xbd);
> +
> +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> +    VPCI_READ_CHECK(12, 2, 0x8696);
> +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> +
> +    /*
> +     * At this point we have the following layout:
> +     *
> +     * 32    24    16     8     0
> +     *  +-----+-----+-----+-----+
> +     *  |          r0           | 0
> +     *  +-----+-----+-----+-----+
> +     *  | r7  |  r6 |  r5 |/////| 32
> +     *  +-----+-----+-----+-----|

This is misleading (especially for readers of the code following this
comment), as you've written different values by now.

> +     *  |///////////////////////| 64
> +     *  +-----------+-----------+
> +     *  |///////////|    r12    | 96
> +     *  +-----------+-----------+
> +     *             ...
> +     *  / = empty.
> +     */
> +
> +    /* Try to add an overlapping register handler. */
> +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> +
> +    /* Try to add a non-aligned register. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> +
> +    /* Try to add a register with wrong size. */
> +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> +
> +    /* Try to add a register with missing handlers. */
> +    VPCI_ADD_INVALID_REG(NULL, NULL, 8, 2);
> +
> +    /* Read/write of unset register. */
> +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> +    VPCI_READ_CHECK(8, 2, 0xffff);
> +    VPCI_READ_CHECK(8, 1, 0xff);
> +    VPCI_WRITE(10, 2, 0xbeef);
> +    VPCI_READ_CHECK(10, 2, 0xffff);
> +
> +    /* Read of multiple registers */
> +    VPCI_WRITE_CHECK(7, 1, 0xbd);
> +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> +
> +    /* Partial read of a register. */
> +    VPCI_WRITE_CHECK(0, 4, 0x1a1b1c1d);
> +    VPCI_READ_CHECK(2, 1, 0x1b);
> +    VPCI_READ_CHECK(6, 2, 0xbdba);
> +
> +    /* Write of multiple registers. */
> +    VPCI_WRITE_CHECK(4, 4, 0xaabbccff);
> +
> +    /* Partial write of a register. */
> +    VPCI_WRITE_CHECK(2, 1, 0xfe);
> +    VPCI_WRITE_CHECK(6, 2, 0xfebc);
> +
> +    /*
> +     * Test all possible read/write size combinations.
> +     *
> +     * Populate 128bits (16B) with 1B registers, 160bits (20B) with 2B
> +     * registers, and finally 192bits (24B) with 4B registers.

I can't see how the numbers here are in line with the code this is
meant to describe. Perhaps this is a leftover from an earlier variant
of the code?

> --- a/xen/arch/arm/xen.lds.S
> +++ b/xen/arch/arm/xen.lds.S
> @@ -41,6 +41,9 @@ SECTIONS
>  
>    . = ALIGN(PAGE_SIZE);
>    .rodata : {
> +       __start_vpci_array = .;
> +       *(.rodata.vpci)
> +       __end_vpci_array = .;

Do you really need this (unconditionally)?

> +static int vpci_access_check(unsigned int reg, unsigned int len)

The way you use it, this function wants to return bool.

> +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> +                         unsigned int *bus, unsigned int *slot,
> +                         unsigned int *func, unsigned int *reg)

Since you return nothing right now, how about avoiding one of the
indirections? The best candidate would probably be the register value.

> +{
> +    unsigned long bdf;

Why long instead of int?

> +static bool vpci_portio_accept(const struct hvm_io_handler *handler,
> +                               const ioreq_t *p)
> +{
> +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;

Maybe better ~3 instead of 0xfffc (also likely to produce slightly
better code)?

> +static int vpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr, uint32_t size, uint64_t *data)
> +{
> +    struct domain *d = current->domain;
> +    unsigned int bus, slot, func, reg;
> +
> +    *data = ~(uint64_t)0;
> +
> +    vpci_lock(d);
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = d->arch.hvm_domain.pci_cf8;
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    /* Decode the PCI address. */
> +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot, &func,
> +                        &reg);

With the function name I don't view a comment like the one here as very
useful.

> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -1178,18 +1178,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
>           CF8_ENABLED(cf8) )
>      {
>          uint32_t sbdf, x86_fam;
> +        unsigned int bus, slot, func, reg;
> +
> +        hvm_pci_decode_addr(cf8, p->addr, &bus, &slot, &func, &reg);
>  
>          /* PCI config data cycle */
>  
> -        sbdf = XEN_DMOP_PCI_SBDF(0,
> -                                 PCI_BUS(CF8_BDF(cf8)),
> -                                 PCI_SLOT(CF8_BDF(cf8)),
> -                                 PCI_FUNC(CF8_BDF(cf8)));
> +        sbdf = XEN_DMOP_PCI_SBDF(0, bus, slot, func);
>  
>          type = XEN_DMOP_IO_RANGE_PCI;
> -        addr = ((uint64_t)sbdf << 32) |
> -               CF8_ADDR_LO(cf8) |
> -               (p->addr & 3);
> +        addr = ((uint64_t)sbdf << 32) | reg;
>          /* AMD extended configuration space access? */
>          if ( CF8_ADDR_HI(cf8) &&
>               d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&

This and the introduction of hvm_pci_decode_addr() would likely better
be broken out into a prereq patch, as this one is quite large even
without this effectively unrelated change.

> --- /dev/null
> +++ b/xen/drivers/vpci/vpci.c
> @@ -0,0 +1,405 @@
> +/*
> + * Generic functionality for handling accesses to the PCI configuration space
> + * from guests.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +
> +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> +
> +/* Internal struct to store the emulated PCI registers. */
> +struct vpci_register {
> +    vpci_read_t *read;
> +    vpci_write_t *write;
> +    unsigned int size;
> +    unsigned int offset;
> +    void *private;
> +    struct list_head node;
> +};
> +
> +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)

As pointed out in reply to an earlier version, this lacks a prereq
change: setup_one_hwdom_device() needs to be marked __hwdom_init. And
then, now that you have the annotation here, the placement of the
array in the linker script should depend on whether __hwdom_init is an
alias of __init.

> +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t read_handler,
> +                      vpci_write_t write_handler, unsigned int offset,
> +                      unsigned int size, void *data)
> +{
> +    struct list_head *head;
> +    struct vpci_register *r;
> +
> +    /* Some sanity checks. */
> +    if ( (size != 1 && size != 2 && size != 4) ||
> +         offset >= PCI_CFG_SPACE_EXP_SIZE || offset & (size - 1) ||

Please add parens around the operands of &.

> +         (read_handler == NULL && write_handler == NULL) )

Please be consistent with NULL checks - as they're shorter, I'd suggest
to always use ...

> +        return -EINVAL;
> +
> +    r = xmalloc(struct vpci_register);
> +    if ( !r )

... this style.

> +        return -ENOMEM;
> +
> +    r->read = read_handler ?: vpci_ignored_read;
> +    r->write = write_handler ?: vpci_ignored_write;
> +    r->size = size;
> +    r->offset = offset;
> +    r->private = data;
> +
> +    vpci_lock(pdev->domain);
> +
> +    /* The list of handlers must be keep sorted at all times. */

kept

> +    list_for_each ( head, &pdev->vpci->handlers )

"head" is not a good name for something that doesn't always point at
the head of whatever list. How about "prev"?

> +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset,
> +                         unsigned int size)
> +{
> +    const struct vpci_register r = { .offset = offset, .size = size };
> +    struct vpci_register *rm = NULL;

Pointless initializer afaict (there's none on the equivalent variable
in the add function).

> +    vpci_lock(pdev->domain);
> +
> +    list_for_each_entry ( rm, &pdev->vpci->handlers, node )
> +        if ( vpci_register_cmp(&r, rm) <= 0 )
> +            break;
> +
> +    if ( !rm || rm->offset != offset || rm->size != size )

Obviously the !rm check here isn't needed then either, which points out
that you have a problem here: You don't properly handle the case of not
coming through the "break" path above, i.e. when rm points at the list
head (which isn't a full struct vpci_register).

> +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
> +                             unsigned int slot, unsigned int func,
> +                             unsigned int reg, uint32_t size)
> +{
> +    uint32_t data;
> +
> +    switch ( size )
> +    {
> +    case 4:
> +        data = pci_conf_read32(seg, bus, slot, func, reg);
> +        break;
> +    case 2:
> +        data = pci_conf_read16(seg, bus, slot, func, reg);
> +        break;
> +    case 1:
> +        data = pci_conf_read8(seg, bus, slot, func, reg);
> +        break;
> +    default:
> +        BUG();

As long as this is Dom0-only, BUG()s like this are probably fine, but
if this ever gets extended to DomU-s, will we really remember to
convert them?

> +/*
> + * Merge new data into a partial result.
> + *
> + * Zero the bytes of 'data' from [offset, offset + size), and
> + * merge the value found in 'new' from [0, offset) left shifted
> + * by 'offset'.
> + */
> +uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,

static?

> +                      unsigned int offset)
> +{
> +    uint32_t mask = ((uint64_t)1 << (size * 8)) - 1;

No need to use 64-bit arithmetic here: 0xffffffff >> (32 - 8 * size).

> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, uint32_t size)
> +{
> +    struct domain *d = current->domain;
> +    struct pci_dev *pdev;
> +    const struct vpci_register *r;
> +    unsigned int data_offset = 0;
> +    uint32_t data;
> +
> +    ASSERT(pcidevs_locked());
> +    ASSERT(vpci_locked(d));
> +
> +    /*
> +     * Read the hardware value.
> +     * NB: at the moment vPCI passthroughs everything (ie: permissive).

passes through

> +     */
> +    data = vpci_read_hw(seg, bus, slot, func, reg, size);

I continue to be worried about reads that have side effects here. Granted
we currently don't emulate any, but it would feel better if we didn't
do the read for no reason. I.e. do hw reads only to fill gaps between
emulated fields.

> +    /* Find the PCI dev matching the address. */
> [...]
> +    /* Replace any values reported by the emulated registers. */
> +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> +    {
> +        const struct vpci_register emu = {
> +            .offset = reg + data_offset,
> +            .size = size - data_offset
> +        };
> +        int cmp = vpci_register_cmp(&emu, r);
> +        union vpci_val val = { .u32 = ~0 };
> +        unsigned int merge_size;
> +
> +        if ( cmp < 0 )
> +            break;
> +        if ( cmp > 0 )
> +            continue;
> +
> +        r->read(pdev, r->offset, &val, r->private);
> +
> +        /* Check if the read is in the middle of a register. */
> +        if ( r->offset < emu.offset )
> +            val.u32 >>= (emu.offset - r->offset) * 8;
> +
> +        data_offset = max(emu.offset, r->offset) - reg;
> +        /* Find the intersection size between the two sets. */
> +        merge_size = min(emu.offset + emu.size, r->offset + r->size) -
> +                     max(emu.offset, r->offset);
> +        /* Merge the emulated data into the native read value. */
> +        data = merge_result(data, val.u32, merge_size, data_offset);
> +        data_offset += merge_size;
> +        if ( data_offset == size )
> +            break;

ASSERT(data_offset < size) ?

> --- /dev/null
> +++ b/xen/include/xen/vpci.h
> @@ -0,0 +1,79 @@
> +#ifndef _VPCI_
> +#define _VPCI_
> +
> +#include <xen/pci.h>
> +#include <xen/types.h>
> +#include <xen/list.h>
> +
> +/*
> + * Helpers for locking/unlocking.
> + *
> + * NB: the recursive variants are used so that spin_is_locked
> + * returns whether the lock is hold by the current CPU (instead
> + * of just returning whether the lock is hold by any CPU).
> + */
> +#define vpci_lock(d) spin_lock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_unlock(d) spin_unlock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> +
> +/* Value read or written by the handlers. */
> +union vpci_val {
> +    uint8_t u8;
> +    uint16_t u16;
> +    uint32_t u32;
> +};

I continue to be unconvinced that this union is a good way to handle
different sizes. Afaict Coverity (or similar tools) may recognize quite
a few possible uses of uninitialized data. Quite likely all of them
would be false positives, but anyway. Would it really be a big problem
to uniformly pass uint32_t values around?

> +/*
> + * The vPCI handlers will never be called concurrently for the same domain, it
> + * is guaranteed that the vpci domain lock will always be locked when calling
> + * any handler.
> + */
> +typedef void (vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val *val, void *data);
> +
> +typedef void (vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
> +                            union vpci_val val, void *data);

Stray parentheses around the type name being defined.

> +typedef int (*vpci_register_init_t)(struct pci_dev *dev);

This one is inconsistent with the other two in that it defines a
pointer type.

> +#define REGISTER_VPCI_INIT(x)                   \
> +  static const vpci_register_init_t x##_entry   \
> +               __used_section(".rodata.vpci") = x
> +
> +/* Add vPCI handlers to device. */
> +int __must_check vpci_add_handlers(struct pci_dev *dev);
> +
> +/* Add/remove a register handler. */
> +int __must_check vpci_add_register(const struct pci_dev *pdev,
> +                                   vpci_read_t read_handler,
> +                                   vpci_write_t write_handler,

I'm surprised this compiles without (at least) warnings - you appear to
be lacking *s here.

> +                                   unsigned int offset,
> +                                   unsigned int size, void *data);
> +int __must_check vpci_remove_register(const struct pci_dev *pdev,
> +                                      unsigned int offset,
> +                                      unsigned int size);
> +
> +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> +                   unsigned int func, unsigned int reg, uint32_t size);
> +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> +                unsigned int func, unsigned int reg, uint32_t size,
> +                uint32_t data);

I don't see why size needs to be of a fixed width type in both of these.

> +struct vpci {
> +    /* Root pointer for the tree of vPCI handlers. */
> +    struct list_head handlers;

The comment says "tree", but right now this really is just a list.

Jan



* Re: [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
  2017-07-10 13:34   ` Paul Durrant
@ 2017-07-13 20:15   ` Jan Beulich
  2017-07-14 16:33     ` Roger Pau Monné
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-13 20:15 UTC (permalink / raw)
  To: roger.pau
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:02 PM >>>
> @@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>      return 0;
>  }
>  
> +int __init pvh_setup_mmcfg(struct domain *d)

Didn't I point out that __init can't be correct here, and instead this
needs to be __hwdom_init? I can see that the only current caller is
__init, but that merely suggests there is a second call missing.

> +{
> +    unsigned int i;
> +    int rc;
> +
> +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> +    {
> +        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
> +                                         pci_mmcfg_config[i].start_bus_number,
> +                                         pci_mmcfg_config[i].end_bus_number,
> +                                         pci_mmcfg_config[i].pci_segment);
> +        if ( rc )
> +            return rc;

I would make this a best effort thing, i.e. issue a log message upon
failure but continue the loop. There's a good chance Dom0 will still
be able to come up.

> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -261,11 +261,11 @@ void register_g2m_portio_handler(struct domain *d)
>  static int vpci_access_check(unsigned int reg, unsigned int len)
>  {
>      /* Check access size. */
> -    if ( len != 1 && len != 2 && len != 4 )
> +    if ( len != 1 && len != 2 && len != 4 && len != 8 )
>          return -EINVAL;
>  
> -    /* Check if access crosses a double-word boundary. */
> -    if ( (reg & 3) + len > 4 )
> +    /* Check if access crosses a double-word boundary or it's not aligned. */
> +    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && (reg & 3) != 0) )
>          return -EINVAL;

For one I suppose you mean "& 7" in the 8-byte case. And then I don't
understand why you permit mis-aligned 2-byte writes, but not mis-aligned
4-byte ones as long as they fall within a quad-word. Any such asymmetry
needs at least a comment.

> @@ -398,6 +398,188 @@ void register_vpci_portio_handler(struct domain *d)
>      handler->ops = &vpci_portio_ops;
>  }
>  
> +struct hvm_mmcfg {
> +    paddr_t addr;
> +    size_t size;

paddr_t and size_t don't really fit together, most notably on 32-bit.
As I don't think any individual range can possibly be 4Gb or larger, I
think unsigned int would suffice here.

> +    unsigned int bus;
> +    unsigned int segment;

Depending on how many instances of this structure we expect, it may be
worthwhile to limit these two to 8 and 16 bits respectively.

> +/* Handlers to trap PCI ECAM config accesses. */

An "ECAM" did survive here.

> +static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d,
> +                                               unsigned long addr)

paddr_t (to match the structure field)

> +static void vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
> +                                   unsigned long addr, unsigned int *bus,

Same here (and it seems more below). Also, just like in patch 1,
perhaps return the register by value rather than via indirection.

> +                                   unsigned int *slot, unsigned int *func,
> +                                   unsigned int *reg)
> +{
> +    addr -= mmcfg->addr;
> +    *bus = ((addr >> 20) & 0xff) + mmcfg->bus;
> +    *slot = (addr >> 15) & 0x1f;
> +    *func = (addr >> 12) & 0x7;
> +    *reg = addr & 0xfff;

Iirc there already was a comment to use manifest constants or macros
here.

> +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
> +{
> +    struct domain *d = v->domain;
> +    bool found;
> +
> +    vpci_lock(d);
> +    found = vpci_mmcfg_find(d, addr);
> +    vpci_unlock(d);

Here, at the latest, I wonder whether the lock wouldn't better be an r/w one.

> +static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
> +                           unsigned int len, unsigned long *data)

uint64_t * (to be 32-bit compatible)

> +{
> +    struct domain *d = v->domain;
> +    const struct hvm_mmcfg *mmcfg;
> +    unsigned int bus, slot, func, reg;
> +    vpci_lock(d);
> +    mmcfg = vpci_mmcfg_find(d, addr);
> +    if ( !mmcfg )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
> +
> +    if ( vpci_access_check(reg, len) )
> +    {
> +        vpci_unlock(d);
> +        return X86EMUL_OKAY;
> +    }
> +
> +    pcidevs_lock();
> +    if ( len == 8 )
> +    {
> +        /*
> +         * According to the PCIe 3.1A specification:
> +         *  - Configuration Reads and Writes must usually be DWORD or smaller
> +         *    in size.
> +         *  - Because Root Complex implementations are not required to support
> +         *    accesses to a RCRB that cross DW boundaries [...] software
> +         *    should take care not to cause the generation of such accesses
> +         *    when accessing a RCRB unless the Root Complex will support the
> +         *    access.
> +         *  Xen however supports 8byte accesses by splitting them into two
> +         *  4byte accesses.
> +         */
> +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, 4);
> +        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
> +                                     reg + 4, 4) << 32;
> +    }
> +    else
> +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, len);

I think it would be preferable to avoid the else, by merging this and
the first of the other two reads.

> +    pcidevs_unlock();
> +    vpci_unlock(d);

Question on lock order (should have gone into the patch 1 reply already,
but I had thought of this only after sending): Is it really a good idea
to nest this way? The pcidevs lock is covering quite large regions at
times, so the risk of a lock order violation seems non-negligible even
if there may be none right now. Furthermore the new uses of the pcidevs
lock you introduce would seem to make it quite desirable to make that
one an r/w one too. Otoh that's a recursive one, so it'll be non-trivial
to convert ...

> +int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,

__hwdom_init

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-06-30 15:01 ` [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
@ 2017-07-14 10:32   ` Jan Beulich
  2017-07-20 10:23     ` Roger Pau Monne
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-14 10:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> So that hotplug (or MMCFG regions not present in the MCFG ACPI table)
> can be added at run time by the hardware domain.

I think the emphasis should be the other way around. I'm rather certain
hotplug of bridges doesn't really work right now anyway; at least
IO-APIC hotplug code is completely missing.

> When a new MMCFG area is added to a PVH Dom0, Xen will scan it and add
> the devices to the hardware domain.

Adding the MMIO regions is certainly necessary, but what's the point of
also scanning the bus and adding the devices? We expect Dom0 to tell us
anyway, and not doing the scan in Xen avoids complications we presently
have in the segment 0 case when Dom0 decides to re-number busses (e.g.
in order to fit in SR-IOV VFs).

> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -89,6 +89,10 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>          if ( !has_pirq(curr->domain) )
>              return -ENOSYS;
>          break;
> +    case PHYSDEVOP_pci_mmcfg_reserved:
> +        if ( !is_hardware_domain(curr->domain) )
> +            return -ENOSYS;
> +        break;

This physdevop (like most ones) is restricted to Dom0 use anyway
(properly expressed via XSM check), so I'd rather see you check
has_vpci() here, in line with e.g. the check visible in context.

> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -559,6 +559,25 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>          ret = pci_mmcfg_reserved(info.address, info.segment,
>                                   info.start_bus, info.end_bus, info.flags);
> +        if ( ret || !is_hvm_domain(currd) )
> +            break;
> +
> +        /*
> +         * For HVM (PVH) domains try to add the newly found MMCFG to the
> +         * domain.
> +         */
> +        ret = register_vpci_mmcfg_handler(currd, info.address, info.start_bus,
> +                                          info.end_bus, info.segment);
> +        if ( ret == -EEXIST )
> +        {
> +            ret = 0;
> +            break;

I don't really understand this part: Why would handlers be registered
already? If you consider double registration, wouldn't that better
either be detected by pci_mmcfg_reserved() (and the call here avoided
altogether) or the fact indeed be reported back to the caller?

> @@ -1110,6 +1110,37 @@ void __hwdom_init setup_hwdom_pci_devices(
>      pcidevs_unlock();
>  }
>  
> +static int add_device(uint8_t devfn, struct pci_dev *pdev)
> +{
> +    return iommu_add_device(pdev);
> +}

You're discarding devfn here, just for iommu_add_device() to re-do the
phantom function handling. At the very least this is wasteful. Perhaps
you minimally want to call iommu_add_device() only when
devfn == pdev->devfn (if all of this code stays in the first place)?

> +int pci_scan_and_setup_segment(uint16_t segment)
> +{
> +    struct pci_seg *pseg = get_pseg(segment);
> +    struct setup_hwdom ctxt = {
> +        .d = current->domain,
> +        .handler = add_device,
> +    };
> +    int ret;
> +
> +    if ( !pseg )
> +        return -EINVAL;
> +
> +    pcidevs_lock();
> +    ret = _scan_pci_devices(pseg, NULL);
> +    if ( ret )
> +        goto out;
> +
> +    ret = _setup_hwdom_pci_devices(pseg, &ctxt);
> +    if ( ret )
> +        goto out;
> +
> + out:

Please let's avoid such unnecessary goto-s. Even the first one could be
easily avoided without making the code anywhere near unreadable.

Jan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init
  2017-06-30 15:01 ` [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
@ 2017-07-14 10:32   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-07-14 10:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

>>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1465,6 +1465,46 @@ int prepare_ring_for_helper(
>      return 0;
>  }
>  
> +#if defined(CONFIG_X86) || defined(CONFIG_HAS_PCI)

Why both? X86 selects HAS_PCI, and such (reverse) dependencies exist
precisely to avoid such conditionals to become rather complex over
time.

> +int modify_mmio(struct domain *d, gfn_t gfn, mfn_t mfn, unsigned long nr_pages,
> +                const bool map)

Already in the original function I've been puzzled by this const - if
you wanted such, you should put it consistently on all applicable
parameters. But since we don't normally do so elsewhere, the globally
consistent approach would be to simply drop it.

> +{
> +    int rc;
> +
> +    /*
> +     * ATM this function should only be used by the hardware domain
> +     * because it doesn't support preemption/continuation, and as such
> +     * can take a non-trivial amount of time. Note that it periodically calls

non-negligible?

> +     * process_pending_softirqs in order to avoid stalling the system.
> +     */
> +    ASSERT(is_hardware_domain(d));
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> +             (d, gfn, nr_pages, mfn);
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_G_WARNING

As long as this is Dom0 only I'd suggest to drop the _G_ infix, just
like it was in the original.

Jan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device
  2017-06-30 15:01 ` [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2017-07-14 10:33   ` Jan Beulich
  2017-07-20 14:00     ` Roger Pau Monne
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-14 10:33 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, julien.grall, boris.ostrovsky

>>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> So that it can be called from outside in order to get the size of regular PCI
> BARs. This will be required in order to map the BARs from PCI devices into PVH
> Dom0 p2m.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -588,6 +588,54 @@ static void pci_enable_acs(struct pci_dev *pdev)
>      pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
>  }
>  
> +int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
> +                     unsigned int func, unsigned int pos, bool last,
> +                     uint64_t *paddr, uint64_t *psize)
> +{
> +    uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
> +    uint64_t addr, size;
> +
> +    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
> +    pci_conf_write32(seg, bus, slot, func, pos, ~0);
> +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +    {
> +        if ( last )
> +        {
> +            printk(XENLOG_WARNING
> +                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",

This message needs to tell what kind of slot is being processed (just
like the original did).

> +                    seg, bus, slot, func);
> +            return -EINVAL;
> +        }
> +        hi = pci_conf_read32(seg, bus, slot, func, pos + 4);
> +        pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
> +    }
> +    size = pci_conf_read32(seg, bus, slot, func, pos) &
> +           PCI_BASE_ADDRESS_MEM_MASK;
> +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +    {
> +        size |= (u64)pci_conf_read32(seg, bus, slot, func, pos + 4) << 32;

uint64_t

> +        pci_conf_write32(seg, bus, slot, func, pos + 4, hi);
> +    }
> +    else if ( size )
> +        size |= (u64)~0 << 32;

Again (and more below).

> +    pci_conf_write32(seg, bus, slot, func, pos, bar);
> +    size = -(size);

Stray parentheses.

> +    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
> +
> +    if ( paddr )
> +        *paddr = addr;
> +    if ( psize )
> +        *psize = size;

Is it reasonable to expect the caller to not care about the size?

> @@ -663,38 +710,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>                             seg, bus, slot, func, i);
>                      continue;
>                  }
> -                pci_conf_write32(seg, bus, slot, func, idx, ~0);
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                {
> -                    if ( i >= PCI_SRIOV_NUM_BARS )
> -                    {
> -                        printk(XENLOG_WARNING
> -                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
> -                               " vf BAR in last slot\n",
> -                               seg, bus, slot, func);
> -                        break;
> -                    }
> -                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
> -                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
> -                }
> -                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
> -                                   PCI_BASE_ADDRESS_MEM_MASK;
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                {
> -                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
> -                                                             slot, func,
> -                                                             idx + 4) << 32;
> -                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
> -                }
> -                else if ( pdev->vf_rlen[i] )
> -                    pdev->vf_rlen[i] |= (u64)~0 << 32;
> -                pci_conf_write32(seg, bus, slot, func, idx, bar);
> -                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
> -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> -                    ++i;
> +                ret = pci_size_mem_bar(seg, bus, slot, func, idx,
> +                                       i == PCI_SRIOV_NUM_BARS - 1, NULL,
> +                                       &pdev->vf_rlen[i]);
> +                if ( ret < 0 )
> +                    break;

ASSERT(ret) ?

> +                i += ret;

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-06-30 15:01 ` [PATCH v4 6/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
@ 2017-07-14 15:11   ` Jan Beulich
  2017-07-24 14:58     ` Roger Pau Monne
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-14 15:11 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

>>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> Introduce a set of handlers that trap accesses to the PCI BARs and the command
> register, in order to emulate BAR sizing and BAR relocation.

I don't think "emulate" is the right term here - you really don't mean to
change anything, you only want to snoop Dom0 writes.

> --- /dev/null
> +++ b/xen/drivers/vpci/header.c
> @@ -0,0 +1,473 @@
> +/*
> + * Generic functionality for handling accesses to the PCI header from the
> + * configuration space.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <xen/p2m-common.h>
> +
> +#define MAPPABLE_BAR(x)                                                 \
> +    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
> +     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
> +     (x)->addr != INVALID_PADDR)
> +
> +static struct rangeset *vpci_get_bar_memory(const struct domain *d,
> +                                            const struct vpci_bar *map)
> +{
> +    const struct pci_dev *pdev;
> +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> +    int rc;
> +
> +    if ( !mem )
> +        return ERR_PTR(-ENOMEM);
> +
> +    /*
> +     * Create a rangeset that represents the current BAR memory region
> +     * and compare it against all the currently active BAR memory regions.
> +     * If an overlap is found, subtract it from the region to be
> +     * mapped/unmapped.
> +     *
> +     * NB: the rangeset uses frames, and if start and end addresses are
> +     * equal it means only one frame is used, that's why PFN_DOWN is used
> +     * to calculate the end of the rangeset.
> +     */

That explanation doesn't seem to fit: Did you perhaps mean to
point out that rangeset ranges are inclusive ones?

> +    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
> +                            PFN_DOWN(map->addr + map->size));

Don't you need to subtract 1 here (and elsewhere below)?

> +    if ( rc )
> +    {
> +        rangeset_destroy(mem);
> +        return ERR_PTR(rc);
> +    }
> +
> +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
> +    {
> +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
> +                                       PCI_SLOT(pdev->devfn),
> +                                       PCI_FUNC(pdev->devfn),
> +                                       PCI_COMMAND);

This is quite a lot of overhead - a loop over all devices plus a config
space read on each one. What state the memory decode bit is in
could be recorded in the ->enabled flag, couldn't it? And devices on
different sub-branches of the topology can't possibly have
overlapping entries that we need to worry about, as the bridge
windows would suppress actual accesses.

> +        unsigned int i;
> +
> +        /* Check if memory decoding is enabled. */
> +        if ( !(cmd & PCI_COMMAND_MEMORY) )
> +            continue;
> +
> +        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> +        {
> +            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
> +
> +            if ( bar == map || !MAPPABLE_BAR(bar) ||
> +                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
> +                                          PFN_DOWN(bar->addr + bar->size)) )
> +                continue;
> +
> +            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
> +                                       PFN_DOWN(bar->addr + bar->size));

I'm struggling to convince myself of the correctness of this approach
(including other code further down which is also involved). I think you
should have taken the time to add a few words on the approach
chosen to the description. For example, it doesn't look like things will
go right if the device being dealt with has two BARs both using part
of the same page.

> +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> +                           const bool map)
> +{
> +    struct rangeset *mem;
> +    struct map_data data = { .d = d, .map = map };
> +    int rc;
> +
> +    ASSERT(MAPPABLE_BAR(bar));
> +
> +    mem = vpci_get_bar_memory(d, bar);
> +    if ( IS_ERR(mem) )
> +        return -PTR_ERR(mem);

The negation looks wrong to me.

> +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    uint16_t cmd = val.u16, current_cmd;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    int rc;
> +
> +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> +
> +    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
> +    {
> +        /*
> +         * Let the guest play with all the bits directly except for the
> +         * memory decoding one.
> +         */
> +        pci_conf_write16(seg, bus, slot, func, reg, cmd);
> +        return;

Please invert the condition and have both cases use the same write
at the end of the function.

> +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint32_t wdata = val.u32, size_mask;
> +    bool hi = false;
> +
> +    switch ( bar->type )
> +    {
> +    case VPCI_BAR_MEM32:
> +    case VPCI_BAR_MEM64_LO:
> +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> +        break;
> +    case VPCI_BAR_MEM64_HI:
> +        size_mask = ~0u;
> +        break;
> +    default:
> +        ASSERT_UNREACHABLE();
> +        return;
> +    }
> +
> +    if ( (wdata & size_mask) == size_mask )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        bar->sizing = true;
> +        return;

I think the comment needs extending to explain why the written
sizing value can't possibly be an address. This is particularly
relevant because I'm not sure that assumption would hold on e.g.
ARM (which I don't think has guaranteed ROM right below 4Gb).

> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    bar->sizing = false;
> +
> +    /*
> +     * Ignore attempts to change the position of the BAR if memory decoding is
> +     * active.
> +     */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY )
> +        return;

Especially as long as this code supports only Dom0 I think we want
a warning here.

> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    if ( !hi )
> +        wdata &= PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    /* Update the relevant part of the BAR address. */
> +    bar->addr &= ~((uint64_t)0xffffffff << (hi ? 32 : 0));

Maybe shorter "0xffffffffull << (hi ? 32 : 0)"?

> +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> +                           union vpci_val val, void *data)
> +{
> +    struct vpci_bar *rom = data;
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    const uint32_t wdata = val.u32;
> +
> +    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
> +    {
> +        /* Next reads from this register are going to return the BAR size. */
> +        rom->sizing = true;
> +        return;
> +    }
> +
> +    /* End previous sizing cycle if any. */
> +    rom->sizing = false;
> +
> +    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
> +
> +    /* Check if memory decoding is enabled. */
> +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> +         PCI_COMMAND_MEMORY &&
> +         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )

Just like you parenthesize the operands of ^, please also do so for
the ones of &. Also the ^-expression relies on the particular value
of PCI_ROM_ADDRESS_ENABLE, which I'd prefer if you avoided.

> +static int vpci_init_bars(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    uint8_t header_type;
> +    uint16_t cmd;
> +    uint32_t rom_val;
> +    uint64_t addr, size;
> +    unsigned int i, num_bars, rom_reg;
> +    struct vpci_header *header = &pdev->vpci->header;
> +    struct vpci_bar *bars = header->bars;
> +    int rc;
> +
> +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> +    switch ( header_type )

I'd prefer if you didn't introduce variables used just once.

> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        num_bars = 6;
> +        rom_reg = PCI_ROM_ADDRESS;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        num_bars = 2;
> +        rom_reg = PCI_ROM_ADDRESS1;
> +        break;
> +    default:
> +        return -EOPNOTSUPP;
> +    }
> +
> +    /* Setup a handler for the command register. */
> +    cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);

This is unrelated to what you mean to do here. Please move it ...

> +    rc = vpci_add_register(pdev, vpci_cmd_read, vpci_cmd_write, PCI_COMMAND,
> +                           2, header);
> +    if ( rc )
> +        return rc;
> +
> +    /* Disable memory decoding before sizing. */

... here.

> +    if ( cmd & PCI_COMMAND_MEMORY )
> +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> +                         cmd & ~PCI_COMMAND_MEMORY);
> +
> +    for ( i = 0; i < num_bars; i++ )
> +    {
> +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> +
> +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> +        {
> +            bars[i].type = VPCI_BAR_MEM64_HI;
> +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                                   &bars[i]);
> +            if ( rc )
> +                return rc;
> +
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> +        {
> +            bars[i].type = VPCI_BAR_IO;
> +            continue;
> +        }
> +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +            bars[i].type = VPCI_BAR_MEM64_LO;
> +        else
> +            bars[i].type = VPCI_BAR_MEM32;

Perhaps ignore the 64-bit indicator if it appears in the last BAR?

> +        /* Size the BAR and map it. */
> +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> +                              &addr, &size);
> +        if ( rc < 0 )
> +            return rc;
> +
> +        if ( size == 0 )
> +        {
> +            bars[i].type = VPCI_BAR_EMPTY;
> +            continue;
> +        }
> +
> +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;

This doesn't match up with logic further up: When the memory decode
bit gets cleared, you don't zap the addresses, so I think you'd better
store it here too. Use INVALID_PADDR only when the value read has
all address bits set (same caveat as pointed out earlier).

> +        bars[i].size = size;
> +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
> +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> +                               &bars[i]);
> +        if ( rc )
> +            return rc;
> +    }
> +
> +    /* Check expansion ROM. */
> +    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
> +    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
> +        pci_conf_write32(seg, bus, slot, func, rom_reg,
> +                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);

Do you really need to do this when you've cleared the memory
decode bit already?

> +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);

You can't use this function here without first making it capable of
dealing with ROM BARs - it expects the low bits to be different
than what we have here (see the early ASSERT() that's there).

> +    if ( rc < 0 )
> +        return rc;

Perhaps I didn't pay attention elsewhere, but here it is quite obvious
that in the error case you return with the device in a state other than
on input.

> +    if ( size )
> +    {
> +        struct vpci_bar *rom = &header->bars[num_bars];
> +
> +        rom->type = VPCI_BAR_ROM;
> +        rom->size = size;
> +        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
> +        if ( rom->enabled )
> +            rom->addr = addr;
> +        else
> +            rom->addr = INVALID_PADDR;

Same remark as further up.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-07-13 14:36   ` Jan Beulich
@ 2017-07-14 15:33     ` Roger Pau Monné
  2017-07-14 16:01       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-07-14 15:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, andrew.cooper3, ian.jackson, julien.grall, paul.durrant,
	xen-devel, boris.ostrovsky

On Thu, Jul 13, 2017 at 08:36:18AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> > --- /dev/null
> > +++ b/tools/tests/vpci/Makefile
> > @@ -0,0 +1,40 @@
> > +
> > +XEN_ROOT=$(CURDIR)/../../..
> > +include $(XEN_ROOT)/tools/Rules.mk
> > +
> > +TARGET := test_vpci
> > +
> > +.PHONY: all
> > +all: $(TARGET)
> > +
> > +.PHONY: run
> > +run: $(TARGET)
> > +    ./$(TARGET) > $(TARGET).out
> 
> Is this a good way to run a test? Aiui it'll result in there not being
> anything visible immediately; one has to go look at the produced file.
> I'd suggest to leave it to the person invoking "make run" whether to
> redirect output.

OK, this is based on the hpet testing code, which does this. I'm fine
with not redirecting the output.

> > +$(TARGET): vpci.c vpci.h list.h
> > +    $(HOSTCC) -g -o $@ vpci.c main.c
> 
> If you compile main.c, why is there no dependency on it? And how about
> emul.h?

I didn't add such dependencies because those files are not the result
of any other targets, but I agree it's better to explicitly list them.

> > +.PHONY: clean
> > +clean:
> > +    rm -rf $(TARGET) $(TARGET).out *.o *~ vpci.h vpci.c list.h
> > +
> > +.PHONY: distclean
> > +distclean: clean
> > +
> > +.PHONY: install
> > +install:
> > +
> > +vpci.h: $(XEN_ROOT)/xen/include/xen/vpci.h
> > +    sed -e '/#include/d' <$< >$@
> 
> Couldn't you combine this and list.h's rule into a pattern one?

Yes, I think so, let me try.

> > --- /dev/null
> > +++ b/tools/tests/vpci/emul.h
> > @@ -0,0 +1,117 @@
> > +/*
> > + * Unit tests for the generic vPCI handler code.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#ifndef _TEST_VPCI_
> > +#define _TEST_VPCI_
> > +
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <stddef.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <errno.h>
> > +#include <assert.h>
> > +
> > +#define container_of(ptr, type, member) ({                      \
> > +        typeof(((type *)0)->member) *__mptr = (ptr);            \
> > +        (type *)((char *)__mptr - offsetof(type, member));      \
> 
> I don't know what tools maintainers think about such name space
> violations; in hypervisor code I'd ask you to avoid leading underscores
> in macro local variables (same in min()/max() and elsewhere then).

OK. container_of, max and min are verbatim copies of the macros in
xen/include/xen/kernel.h, with the style adjusted in the container_of
case IIRC (as requested in the previous review).

> > +/* Read a 32b register using all possible sizes. */
> > +void multiread4(unsigned int reg, uint32_t val)
> > +{
> > +    unsigned int i;
> > +
> > +    /* Read using bytes. */
> > +    for ( i = 0; i < 4; i++ )
> > +        VPCI_READ_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
> > +
> > +    /* Read using 2bytes. */
> > +    for ( i = 0; i < 2; i++ )
> > +        VPCI_READ_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
> > +
> > +    VPCI_READ_CHECK(reg, 4, val);
> > +}
> > +
> > +void multiwrite4_check(unsigned int reg, uint32_t val)
> 
> Naming question again: Why the _check suffix here, but not on the read
> function above?

Right, I guess it's clearer to add the _check suffix to both. I didn't
add it to the read one because I felt it was already implicit, while
on the write one not so much.

> > +{
> > +    unsigned int i;
> > +
> > +    /* Write using bytes. */
> > +    for ( i = 0; i < 4; i++ )
> > +        VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
> > +    multiread4(reg, val);
> > +
> > +    /* Write using 2bytes. */
> > +    for ( i = 0; i < 2; i++ )
> > +        VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
> > +    multiread4(reg, val);
> > +
> > +    VPCI_WRITE_CHECK(reg, 4, val);
> > +    multiread4(reg, val);
> > +}
> 
> Wouldn't it be better to vary the value written between the individual
> sizes? Perhaps move the 32-bit write between the two loops, using ~val?
> Otherwise you won't know whether what you read back is a result of the
> writes you actually mean to test or earlier ones?

So storing a new value in val between each size test? I could even use
something randomly generated.

> > +int
> > +main(int argc, char **argv)
> > +{
> > +    /* Index storage by offset. */
> > +    uint32_t r0 = 0xdeadbeef;
> > +    uint8_t r5 = 0xef, r6 = 0xbe, r7 = 0xef;
> > +    uint16_t r12 = 0x8696;
> > +    uint16_t r20[2] = { 0 };
> 
> Just { } will suffice.
> 
> > +    uint32_t r24 = 0;
> > +    uint8_t r28, r30;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    INIT_LIST_HEAD(&vpci.handlers);
> > +
> > +    VPCI_ADD_REG(vpci_read32, vpci_write32, 0, 4, r0);
> > +    VPCI_READ_CHECK(0, 4, 0xdeadbeef);
> 
> Why aren't you using r0 here?

Yes, that would be better (and safer in case this is changed).

> > +    VPCI_WRITE_CHECK(0, 4, 0xbcbcbcbc);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 5, 1, r5);
> > +    VPCI_READ_CHECK(5, 1, 0xef);
> > +    VPCI_WRITE_CHECK(5, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 6, 1, r6);
> > +    VPCI_READ_CHECK(6, 1, 0xbe);
> > +    VPCI_WRITE_CHECK(6, 1, 0xba);
> > +
> > +    VPCI_ADD_REG(vpci_read8, vpci_write8, 7, 1, r7);
> > +    VPCI_READ_CHECK(7, 1, 0xef);
> > +    VPCI_WRITE_CHECK(7, 1, 0xbd);
> > +
> > +    VPCI_ADD_REG(vpci_read16, vpci_write16, 12, 2, r12);
> > +    VPCI_READ_CHECK(12, 2, 0x8696);
> > +    VPCI_READ_CHECK(12, 4, 0xffff8696);
> > +
> > +    /*
> > +     * At this point we have the following layout:
> > +     *
> > +     * 32    24    16     8     0
> > +     *  +-----+-----+-----+-----+
> > +     *  |          r0           | 0
> > +     *  +-----+-----+-----+-----+
> > +     *  | r7  |  r6 |  r5 |/////| 32
> > +     *  +-----+-----+-----+-----|
> 
> This is misleading (especially for readers of the code following this
> comment), as you've written different values by now.

Well, the position of the variables that hold the values of each
register are correct, it's just the value they store that has changed.

> > +     *  |///////////////////////| 64
> > +     *  +-----------+-----------+
> > +     *  |///////////|    r12    | 96
> > +     *  +-----------+-----------+
> > +     *             ...
> > +     *  / = empty.
> > +     */
> > +
> > +    /* Try to add an overlapping register handler. */
> > +    VPCI_ADD_INVALID_REG(vpci_read32, vpci_write32, 4, 4);
> > +
> > +    /* Try to add a non-aligned register. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 15, 2);
> > +
> > +    /* Try to add a register with wrong size. */
> > +    VPCI_ADD_INVALID_REG(vpci_read16, vpci_write16, 8, 3);
> > +
> > +    /* Try to add a register with missing handlers. */
> > +    VPCI_ADD_INVALID_REG(NULL, NULL, 8, 2);
> > +
> > +    /* Read/write of unset register. */
> > +    VPCI_READ_CHECK(8, 4, 0xffffffff);
> > +    VPCI_READ_CHECK(8, 2, 0xffff);
> > +    VPCI_READ_CHECK(8, 1, 0xff);
> > +    VPCI_WRITE(10, 2, 0xbeef);
> > +    VPCI_READ_CHECK(10, 2, 0xffff);
> > +
> > +    /* Read of multiple registers */
> > +    VPCI_WRITE_CHECK(7, 1, 0xbd);
> > +    VPCI_READ_CHECK(4, 4, 0xbdbabaff);
> > +
> > +    /* Partial read of a register. */
> > +    VPCI_WRITE_CHECK(0, 4, 0x1a1b1c1d);
> > +    VPCI_READ_CHECK(2, 1, 0x1b);
> > +    VPCI_READ_CHECK(6, 2, 0xbdba);
> > +
> > +    /* Write of multiple registers. */
> > +    VPCI_WRITE_CHECK(4, 4, 0xaabbccff);
> > +
> > +    /* Partial write of a register. */
> > +    VPCI_WRITE_CHECK(2, 1, 0xfe);
> > +    VPCI_WRITE_CHECK(6, 2, 0xfebc);
> > +
> > +    /*
> > +     * Test all possible read/write size combinations.
> > +     *
> > +     * Populate 128bits (16B) with 1B registers, 160bits (20B) with 2B
> > +     * registers, and finally 192bits (24B) with 4B registers.
> 
> I can't see how the numbers here are in line with the code this is
> meant to describe. Perhaps this is a leftover from an earlier variant
> of the code?

I'm not sure I understand this, the registers (or layout) described in
this comment are just added below the comment. Would you like me to
first add the registers and place the comment afterwards?

> > --- a/xen/arch/arm/xen.lds.S
> > +++ b/xen/arch/arm/xen.lds.S
> > @@ -41,6 +41,9 @@ SECTIONS
> >  
> >    . = ALIGN(PAGE_SIZE);
> >    .rodata : {
> > +       __start_vpci_array = .;
> > +       *(.rodata.vpci)
> > +       __end_vpci_array = .;
> 
> Do you really need this (unconditionally)?

Right, this should have a ifdef CONFIG_PCI.

> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> 
> The way you use it, this function want to return bool.
> 
> > +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> > +                         unsigned int *bus, unsigned int *slot,
> > +                         unsigned int *func, unsigned int *reg)
> 
> Since you return nothing right now, how about avoid one of the
> indirections? Best candidate would probably be the register value.

I don't really like functions that return some data in the return
value (if it's not an error code) and some other data in parameters.
But yes, if it has to return something I guess the register value is
the one that makes more sense.

> > +{
> > +    unsigned long bdf;
> 
> Why long instead of int?
> 
> > +static bool vpci_portio_accept(const struct hvm_io_handler *handler,
> > +                               const ioreq_t *p)
> > +{
> > +    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;
> 
> Maybe better ~3 instead of 0xfffc (also likely to produce slightly
> better code)?

Yes, it's certainly not any worse than using 0xfffc. Maybe a define
would be helpful.

> > +static int vpci_portio_read(const struct hvm_io_handler *handler,
> > +                            uint64_t addr, uint32_t size, uint64_t *data)
> > +{
> > +    struct domain *d = current->domain;
> > +    unsigned int bus, slot, func, reg;
> > +
> > +    *data = ~(uint64_t)0;
> > +
> > +    vpci_lock(d);
> > +    if ( addr == 0xcf8 )
> > +    {
> > +        ASSERT(size == 4);
> > +        *data = d->arch.hvm_domain.pci_cf8;
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +    if ( !CF8_ENABLED(d->arch.hvm_domain.pci_cf8) )
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +
> > +    /* Decode the PCI address. */
> > +    hvm_pci_decode_addr(d->arch.hvm_domain.pci_cf8, addr, &bus, &slot, &func,
> > +                        &reg);
> 
> With the function name I don't view a comment like the one here as very
> useful.
> 
> > --- a/xen/arch/x86/hvm/ioreq.c
> > +++ b/xen/arch/x86/hvm/ioreq.c
> > @@ -1178,18 +1178,16 @@ struct hvm_ioreq_server *hvm_select_ioreq_server(struct domain *d,
> >           CF8_ENABLED(cf8) )
> >      {
> >          uint32_t sbdf, x86_fam;
> > +        unsigned int bus, slot, func, reg;
> > +
> > +        hvm_pci_decode_addr(cf8, p->addr, &bus, &slot, &func, &reg);
> >  
> >          /* PCI config data cycle */
> >  
> > -        sbdf = XEN_DMOP_PCI_SBDF(0,
> > -                                 PCI_BUS(CF8_BDF(cf8)),
> > -                                 PCI_SLOT(CF8_BDF(cf8)),
> > -                                 PCI_FUNC(CF8_BDF(cf8)));
> > +        sbdf = XEN_DMOP_PCI_SBDF(0, bus, slot, func);
> >  
> >          type = XEN_DMOP_IO_RANGE_PCI;
> > -        addr = ((uint64_t)sbdf << 32) |
> > -               CF8_ADDR_LO(cf8) |
> > -               (p->addr & 3);
> > +        addr = ((uint64_t)sbdf << 32) | reg;
> >          /* AMD extended configuration space access? */
> >          if ( CF8_ADDR_HI(cf8) &&
> >               d->arch.cpuid->x86_vendor == X86_VENDOR_AMD &&
> 
> This and the introduction of hvm_pci_decode_addr() would likely better
> be broken out into a prereq patch, as this one is quite large even
> without this effectively unrelated change.

OK.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/vpci.c
> > @@ -0,0 +1,405 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI configuration space
> > + * from guests.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +
> > +extern const vpci_register_init_t __start_vpci_array[], __end_vpci_array[];
> > +#define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
> > +
> > +/* Internal struct to store the emulated PCI registers. */
> > +struct vpci_register {
> > +    vpci_read_t *read;
> > +    vpci_write_t *write;
> > +    unsigned int size;
> > +    unsigned int offset;
> > +    void *private;
> > +    struct list_head node;
> > +};
> > +
> > +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
> 
> As pointed out in reply to an earlier version, this lacks a prereq
> change: setup_one_hwdom_device() needs to be marked __hwdom_init. And
> then, now that you have the annotation here, the placement of the
> array in the linker script should depend on whether __hwdom_init is an
> alias of __init.

The __hwdom_init prefix is dropped shortly from this function (patch
#3), but I agree on sending a pre-patch to address
setup_one_hwdom_device.

I'm not sure it's worth modifying the linker script: by the end of the
series the list of handlers must reside in .rodata.

> > +int vpci_add_register(const struct pci_dev *pdev, vpci_read_t read_handler,
> > +                      vpci_write_t write_handler, unsigned int offset,
> > +                      unsigned int size, void *data)
> > +{
> > +    struct list_head *head;
> > +    struct vpci_register *r;
> > +
> > +    /* Some sanity checks. */
> > +    if ( (size != 1 && size != 2 && size != 4) ||
> > +         offset >= PCI_CFG_SPACE_EXP_SIZE || offset & (size - 1) ||
> 
> Please add parens around the operands of &.
> 
> > +         (read_handler == NULL && write_handler == NULL) )
> 
> Please be consistent with NULL checks - as they're shorter, I'd suggest
> to always use ...
> 
> > +        return -EINVAL;
> > +
> > +    r = xmalloc(struct vpci_register);
> > +    if ( !r )
> 
> ... this style.

I'm trying, but this differs from the BSD coding style, which only
allows explicit checks against NULL, so sometimes I fail, sorry.

> > +        return -ENOMEM;
> > +
> > +    r->read = read_handler ?: vpci_ignored_read;
> > +    r->write = write_handler ?: vpci_ignored_write;
> > +    r->size = size;
> > +    r->offset = offset;
> > +    r->private = data;
> > +
> > +    vpci_lock(pdev->domain);
> > +
> > +    /* The list of handlers must be keep sorted at all times. */
> 
> kept
> 
> > +    list_for_each ( head, &pdev->vpci->handlers )
> 
> "head" is not a good name for something that doesn't always point at
> the head of whatever list. How about "prev"?
> 
> > +int vpci_remove_register(const struct pci_dev *pdev, unsigned int offset,
> > +                         unsigned int size)
> > +{
> > +    const struct vpci_register r = { .offset = offset, .size = size };
> > +    struct vpci_register *rm = NULL;
> 
> Pointless initializer afaict (there's none on the equivalent variable
> in the add function).
>
> > +    vpci_lock(pdev->domain);
> > +
> > +    list_for_each_entry ( rm, &pdev->vpci->handlers, node )
> > +        if ( vpci_register_cmp(&r, rm) <= 0 )
> > +            break;
> > +
> > +    if ( !rm || rm->offset != offset || rm->size != size )
> 
> Obviously the !rm check here isn't needed then either, which points out
> that you have a problem here: You don't properly handle the case of not
> coming through the "break" path above, i.e. when rm points at the list
> head (which isn't a full struct vpci_register).

Right (also applies to the comment above), IMHO the interface of the
Linux lists seems terrible. Will fix it.

> > +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
> > +                             unsigned int slot, unsigned int func,
> > +                             unsigned int reg, uint32_t size)
> > +{
> > +    uint32_t data;
> > +
> > +    switch ( size )
> > +    {
> > +    case 4:
> > +        data = pci_conf_read32(seg, bus, slot, func, reg);
> > +        break;
> > +    case 2:
> > +        data = pci_conf_read16(seg, bus, slot, func, reg);
> > +        break;
> > +    case 1:
> > +        data = pci_conf_read8(seg, bus, slot, func, reg);
> > +        break;
> > +    default:
> > +        BUG();
> 
> As long as this is Dom0-only, BUG()s like this are probably fine, but
> if this ever gets extended to DomU-s, will we really remember to
> convert them?

ASSERT_UNREACHABLE() and set data to ~0 to be safe?

> > +/*
> > + * Merge new data into a partial result.
> > + *
> > + * Zero the bytes of 'data' from [offset, offset + size), and
> > + * merge the value found in 'new' from [0, offset) left shifted
> > + * by 'offset'.
> > + */
> > +uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
> 
> static?

Oops.

> > +                      unsigned int offset)
> > +{
> > +    uint32_t mask = ((uint64_t)1 << (size * 8)) - 1;
> 
> No need to use 64-bit arithmetic here: 0xffffffff >> (32 - 8 * size).

Shame, will fix.

> > +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> > +                   unsigned int func, unsigned int reg, uint32_t size)
> > +{
> > +    struct domain *d = current->domain;
> > +    struct pci_dev *pdev;
> > +    const struct vpci_register *r;
> > +    unsigned int data_offset = 0;
> > +    uint32_t data;
> > +
> > +    ASSERT(pcidevs_locked());
> > +    ASSERT(vpci_locked(d));
> > +
> > +    /*
> > +     * Read the hardware value.
> > +     * NB: at the moment vPCI passthroughs everything (ie: permissive).
> 
> passes through
> 
> > +     */
> > +    data = vpci_read_hw(seg, bus, slot, func, reg, size);
> 
> I continue to be worried of reads that have side effects here. Granted
> we currently don't emulate any, but it would feel better if we didn't
> do the read for no reason. I.e. do hw reads only to fill gaps between
> emulated fields.

Heh, right. I got this "idea" from pciback, but I will change it so
the logic is similar to the write one (which obviously doesn't write
everything and then checks for emulated registers).

As a side question, which kind of registers have read side effects on
PCI? Reading the spec (PCIe 3.1A) there's no type of register listed
in section 7.4 (ro, rw, rw1c and the sticky versions) that mentions
read side effects. Is that described somewhere for specific
registers?

> > +    /* Find the PCI dev matching the address. */
> > [...]
> > +    /* Replace any values reported by the emulated registers. */
> > +    list_for_each_entry ( r, &pdev->vpci->handlers, node )
> > +    {
> > +        const struct vpci_register emu = {
> > +            .offset = reg + data_offset,
> > +            .size = size - data_offset
> > +        };
> > +        int cmp = vpci_register_cmp(&emu, r);
> > +        union vpci_val val = { .u32 = ~0 };
> > +        unsigned int merge_size;
> > +
> > +        if ( cmp < 0 )
> > +            break;
> > +        if ( cmp > 0 )
> > +            continue;
> > +
> > +        r->read(pdev, r->offset, &val, r->private);
> > +
> > +        /* Check if the read is in the middle of a register. */
> > +        if ( r->offset < emu.offset )
> > +            val.u32 >>= (emu.offset - r->offset) * 8;
> > +
> > +        data_offset = max(emu.offset, r->offset) - reg;
> > +        /* Find the intersection size between the two sets. */
> > +        merge_size = min(emu.offset + emu.size, r->offset + r->size) -
> > +                     max(emu.offset, r->offset);
> > +        /* Merge the emulated data into the native read value. */
> > +        data = merge_result(data, val.u32, merge_size, data_offset);
> > +        data_offset += merge_size;
> > +        if ( data_offset == size )
> > +            break;
> 
> ASSERT(data_offset < size) ?
> 
> > --- /dev/null
> > +++ b/xen/include/xen/vpci.h
> > @@ -0,0 +1,79 @@
> > +#ifndef _VPCI_
> > +#define _VPCI_
> > +
> > +#include <xen/pci.h>
> > +#include <xen/types.h>
> > +#include <xen/list.h>
> > +
> > +/*
> > + * Helpers for locking/unlocking.
> > + *
> > + * NB: the recursive variants are used so that spin_is_locked
> > + * returns whether the lock is hold by the current CPU (instead
> > + * of just returning whether the lock is hold by any CPU).
> > + */
> > +#define vpci_lock(d) spin_lock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_unlock(d) spin_unlock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> > +#define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> > +
> > +/* Value read or written by the handlers. */
> > +union vpci_val {
> > +    uint8_t u8;
> > +    uint16_t u16;
> > +    uint32_t u32;
> > +};
> 
> I continue to be unconvinced that this union is a good way to handle
> different sizes. Afaict Coverity (or similar tools) may recognize quite
> a few possible uses of uninitialized data. Quite likely all of them
> would be false positives, but anyway. Would it really be a big problem
> to uniformly pass uint32_t values around?

Hm, no I don't think so. I would then add explicit truncation of the
values in the read/write handlers.

> > +/*
> > + * The vPCI handlers will never be called concurrently for the same domain, ii
> > + * is guaranteed that the vpci domain lock will always be locked when calling
> > + * any handler.
> > + */
> > +typedef void (vpci_read_t)(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val *val, void *data);
> > +
> > +typedef void (vpci_write_t)(struct pci_dev *pdev, unsigned int reg,
> > +                            union vpci_val val, void *data);
> 
> Stray parentheses around the type name being defined.
> 
> > +typedef int (*vpci_register_init_t)(struct pci_dev *dev);
> 
> This one is inconsistent with the other two in that it defines a
> pointer type.
> 
> > +#define REGISTER_VPCI_INIT(x)                   \
> > +  static const vpci_register_init_t x##_entry   \
> > +               __used_section(".rodata.vpci") = x
> > +
> > +/* Add vPCI handlers to device. */
> > +int __must_check vpci_add_handlers(struct pci_dev *dev);
> > +
> > +/* Add/remove a register handler. */
> > +int __must_check vpci_add_register(const struct pci_dev *pdev,
> > +                                   vpci_read_t read_handler,
> > +                                   vpci_write_t write_handler,
> 
> I'm surprised this compiles without (at least) warnings - you appear to
> be lacking *s here.

I think in the previous version the type itself had a pointer, and
then I removed it and haven't updated it here. But yes, none of the
compilers seems to complain:

https://travis-ci.org/royger/xen/builds/248811315

Is it maybe implicit that function types are pointers?

> > +                                   unsigned int offset,
> > +                                   unsigned int size, void *data);
> > +int __must_check vpci_remove_register(const struct pci_dev *pdev,
> > +                                      unsigned int offset,
> > +                                      unsigned int size);
> > +uint32_t vpci_read(unsigned int seg, unsigned int bus, unsigned int slot,
> > +                   unsigned int func, unsigned int reg, uint32_t size);
> > +void vpci_write(unsigned int seg, unsigned int bus, unsigned int slot,
> > +                unsigned int func, unsigned int reg, uint32_t size,
> > +                uint32_t data);
> 
> I don't see why size needs to be of a fixed width type in both of these.

unsigned int it is then.

> > +struct vpci {
> > +    /* Root pointer for the tree of vPCI handlers. */
> > +    struct list_head handlers;
> 
> The comment says "tree", but right now this really is just a list.

Oops, leftover from the previous RB version.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-07-14 15:33     ` Roger Pau Monné
@ 2017-07-14 16:01       ` Jan Beulich
  2017-07-14 16:41         ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-14 16:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: wei.liu2, andrew.cooper3, ian.jackson, julien.grall, paul.durrant,
	xen-devel, boris.ostrovsky

>>> On 14.07.17 at 17:33, <roger.pau@citrix.com> wrote:
> On Thu, Jul 13, 2017 at 08:36:18AM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>> > +#define container_of(ptr, type, member) ({                      \
>> > +        typeof(((type *)0)->member) *__mptr = (ptr);            \
>> > +        (type *)((char *)__mptr - offsetof(type, member));      \
>> 
>> I don't know what tools maintainers think about such name space
>> violations; in hypervisor code I'd ask you to avoid leading underscores
>> in macro local variables (same in min()/max() and elsewhere then).
> 
> OK. container_of, max and min and verbatim copies of the macros in
> xen/include/xen/kernel.h, with the style adjusted in the container_of
> case IIRC (as requested in the previous review).

Well, that's one of the frequent problems we have: People copy and
paste things without questioning them. We only make things worse if
we clone code we wouldn't permit in anymore nowadays.

>> > +{
>> > +    unsigned int i;
>> > +
>> > +    /* Write using bytes. */
>> > +    for ( i = 0; i < 4; i++ )
>> > +        VPCI_WRITE_CHECK(reg + i, 1, (val >> (i * 8)) & UINT8_MAX);
>> > +    multiread4(reg, val);
>> > +
>> > +    /* Write using 2bytes. */
>> > +    for ( i = 0; i < 2; i++ )
>> > +        VPCI_WRITE_CHECK(reg + i * 2, 2, (val >> (i * 2 * 8)) & UINT16_MAX);
>> > +    multiread4(reg, val);
>> > +
>> > +    VPCI_WRITE_CHECK(reg, 4, val);
>> > +    multiread4(reg, val);
>> > +}
>> 
>> Wouldn't it be better to vary the value written between the individual
>> sizes? Perhaps move the 32-bit write between the two loops, using ~val?
>> Otherwise you won't know whether what you read back is a result of the
>> writes you actually mean to test or earlier ones?
> 
> So storing a new value in val between each size test? I could even use
> something randomly generated.

Random data is bad for reproducibility (if e.g. you want to debug a
case where the test suddenly fails).

>> > +    /*
>> > +     * Test all possible read/write size combinations.
>> > +     *
>> > +     * Populate 128bits (16B) with 1B registers, 160bits (20B) with 2B
>> > +     * registers, and finally 192bits (24B) with 4B registers.
>> 
>> I can't see how the numbers here are in line with the code this is
>> meant to describe. Perhaps this is a leftover from an earlier variant
>> of the code?
> 
> I'm not sure I understand this, the registers (or layout) described in
> this comment are just added below the comment. Would you like me to
> first add the registers and place the comment afterwards?

No, my point is that code that follows this doesn't populate as
many bits as the comment says. From what I understand, you
use 4 byte registers, 2 word ones, and one dword one.

>> > --- a/xen/arch/arm/xen.lds.S
>> > +++ b/xen/arch/arm/xen.lds.S
>> > @@ -41,6 +41,9 @@ SECTIONS
>> >  
>> >    . = ALIGN(PAGE_SIZE);
>> >    .rodata : {
>> > +       __start_vpci_array = .;
>> > +       *(.rodata.vpci)
>> > +       __end_vpci_array = .;
>> 
>> Do you really need this (unconditionally)?
> 
> Right, this should have a ifdef CONFIG_PCI.

CONFIG_HAS_PCI for one, and then ARM doesn't select this at
all. Hence the question.

>> > +static int vpci_access_check(unsigned int reg, unsigned int len)
>> 
>> The way you use it, this function want to return bool.
>> 
>> > +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
>> > +                         unsigned int *bus, unsigned int *slot,
>> > +                         unsigned int *func, unsigned int *reg)
>> 
>> Since you return nothing right now, how about avoid one of the
>> indirections? Best candidate would probably be the register value.
> 
> I don't really like functions that return some data in the return
> value (if it's not an error code) and some other data in parameters.

Well, okay, I view it the other way around - return by indirection
is to be used if return by value is not reasonable (too much data).
Hence it's kind of an overflow to me, not a replacement.

>> > +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
>> 
>> As pointed out in reply to an earlier version, this lacks a prereq
>> change: setup_one_hwdom_device() needs to be marked __hwdom_init. And
>> then, now that you have the annotation here, the placement of the
>> array in the linker script should depend on whether __hwdom_init is an
>> alias of __init.
> 
> The __hwdom_init prefix is dropped shortly from this function (patch
> #3), but I agree on sending a pre-patch to address
> setup_one_hwdom_device.

I have one ready, btw.

> I'm not sure it's worth modifying the linker script: by the end of the
> series the list of handlers must reside in .rodata.

As per the reply to that later patch, I'm not yet convinced that
these annotations will go away. Hence I'd prefer if things were
handled fully correctly here.

>> > +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
>> > +                             unsigned int slot, unsigned int func,
>> > +                             unsigned int reg, uint32_t size)
>> > +{
>> > +    uint32_t data;
>> > +
>> > +    switch ( size )
>> > +    {
>> > +    case 4:
>> > +        data = pci_conf_read32(seg, bus, slot, func, reg);
>> > +        break;
>> > +    case 2:
>> > +        data = pci_conf_read16(seg, bus, slot, func, reg);
>> > +        break;
>> > +    case 1:
>> > +        data = pci_conf_read8(seg, bus, slot, func, reg);
>> > +        break;
>> > +    default:
>> > +        BUG();
>> 
>> As long as this is Dom0-only, BUG()s like this are probably fine, but
>> if this ever gets extended to DomU-s, will we really remember to
>> convert them?
> 
> ASSERT_UNREACHABLE() and set data to ~0 to be safe?

Yes please.

>> > +     */
>> > +    data = vpci_read_hw(seg, bus, slot, func, reg, size);
>> 
>> I continue to be worried of reads that have side effects here. Granted
>> we currently don't emulate any, but it would feel better if we didn't
>> do the read for no reason. I.e. do hw reads only to fill gaps between
>> emulated fields.
> 
> Heh, right. I got this "idea" from pciback, but I will change it so
> the logic is similar to the write one (which obviously doesn't write
> everything and then checks for emulated registers).
> 
> As a side question, which kind of registers have read side effects on
> PCI? Reading the spec (PCIe 3.1A) there's no type of register listed
> in section 7.4 (ro, rw, rw1c and the sticky versions) that mentions
> read side effects. Is that described somewhere for specific
> registers?

I don't think there are any specified, but iirc a well known side effect
of VPD reads from some cards is that it'll hang the box for certain
(normally invalid) indexes. As said, we don't emulate anything like
that, but let's be defensive wrt hardware quirks.

>> > +int __must_check vpci_add_register(const struct pci_dev *pdev,
>> > +                                   vpci_read_t read_handler,
>> > +                                   vpci_write_t write_handler,
>> 
>> I'm surprised this compiles without (at least) warnings - you appear to
>> be lacking *s here.
> 
> I think in the previous version the type itself had a pointer, and
> then I removed it and haven't updated it here. But yes, none of the
> compilers seems to complain:
> 
> https://travis-ci.org/royger/xen/builds/248811315 
> 
> Is it maybe implicit that function types are pointers?

Well, maybe I'm wrong with my assumption that this formally is
illegal.

Jan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-07-13 20:15   ` Jan Beulich
@ 2017-07-14 16:33     ` Roger Pau Monné
  2017-07-28 12:22       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-07-14 16:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

On Thu, Jul 13, 2017 at 02:15:26PM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:02 PM >>>
> > @@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
> >      return 0;
> >  }
> >  
> > +int __init pvh_setup_mmcfg(struct domain *d)
> 
> Didn't I point out that __init van't be correct here, and instead this
> needs to be __hwdom_init? I can see that the only current caller is
> __init, but that merely suggests there is a second call missing.

Most likely, and I failed to update it.

AFAIK it's not possible to build a late PVH hwdom (or I don't see
how), so I guess that missing call should be added if we ever support
that.

> > +{
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> > +    {
> > +        rc = register_vpci_mmcfg_handler(d, pci_mmcfg_config[i].address,
> > +                                         pci_mmcfg_config[i].start_bus_number,
> > +                                         pci_mmcfg_config[i].end_bus_number,
> > +                                         pci_mmcfg_config[i].pci_segment);
> > +        if ( rc )
> > +            return rc;
> 
> I would make this a best effort thing, i.e. issue a log message upon
> failure but continue the loop. There's a good chance Dom0 will still
> be able to come up.

It's worth a try certainly.

> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -261,11 +261,11 @@ void register_g2m_portio_handler(struct domain *d)
> >  static int vpci_access_check(unsigned int reg, unsigned int len)
> >  {
> >      /* Check access size. */
> > -    if ( len != 1 && len != 2 && len != 4 )
> > +    if ( len != 1 && len != 2 && len != 4 && len != 8 )
> >          return -EINVAL;
> >  
> > -    /* Check if access crosses a double-word boundary. */
> > -    if ( (reg & 3) + len > 4 )
> > +    /* Check if access crosses a double-word boundary or it's not aligned. */
> > +    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && (reg & 3) != 0) )
> >          return -EINVAL;
> 
> For one I suppose you mean "& 7" in the 8-byte case.

I cannot find anything in the PCIe 3.1A specification that says that
8B accesses should be aligned. AFAICT it only mentions that accesses
should not cross double-word (4B) boundaries, because it's not
mandatory for the root complex to support such accesses.

> And then I don't
> understand why you permit mis-aligned 2-byte writes, but not mis-aligned
> 4-byte ones as long as they fall withing a quad-word. Any such asymmetry
> needs at least a comment.

IIRC I read something like that in the Mindshare PCI book, but I
don't have it at hand. Will check on Monday. Anyway, I cannot seem to
find any specific set of restrictions in the PCI/PCIe specifications,
apart from the one that accesses should not cross a double-word
boundary.

I'm fine with only allowing accesses aligned to their respective
sizes, but I think I should add a comment somewhere regarding where
this has been picked from. Do you have any references from the
AMD/Intel SDMs maybe?

> > @@ -398,6 +398,188 @@ void register_vpci_portio_handler(struct domain *d)
> >      handler->ops = &vpci_portio_ops;
> >  }
> >  
> > +struct hvm_mmcfg {
> > +    paddr_t addr;
> > +    size_t size;
> 
> paddr_t and size_t don't really fit together, most notably on 32-bit.
> As I don't think any individual range can possibly be 4Gb or larger, I
> think unsigned int would suffice here.
> 
> > +    unsigned int bus;
> > +    unsigned int segment;
> 
> Depending on how many instances of this structure we expect, it may be
> worthwhile to limit these two to 8 and 16 bits respectively.

Hm, so far the boxes I've tested on only had 1 MCFG area, but it's
probably best to change the types and the order, so that there's no
padding.

> > +/* Handlers to trap PCI ECAM config accesses. */
> 
> An "ECAM" did survive here.

Shame, I should have grepped the patch.

> > +static const struct hvm_mmcfg *vpci_mmcfg_find(struct domain *d,
> > +                                               unsigned long addr)
> 
> paddr_t (to match the structure field)
> 
> > +static void vpci_mmcfg_decode_addr(const struct hvm_mmcfg *mmcfg,
> > +                                   unsigned long addr, unsigned int *bus,
> 
> Same here (and it seems more below). Also, just like in patch 1,
> perhaps return the register by value rather than via indirection.
> 
> > +                                   unsigned int *slot, unsigned int *func,
> > +                                   unsigned int *reg)
> > +{
> > +    addr -= mmcfg->addr;
> > +    *bus = ((addr >> 20) & 0xff) + mmcfg->bus;
> > +    *slot = (addr >> 15) & 0x1f;
> > +    *func = (addr >> 12) & 0x7;
> > +    *reg = addr & 0xfff;
> 
> Iirc there already was a comment to use manifest constants or macros
> here.

Yes, going to fix that.
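A sketch of what the decode could look like with manifest constants, and with the register returned by value as suggested earlier. The macro names are my own invention, not taken from the series.

```c
#include <stdint.h>

/* ECAM address layout: bus[27:20], slot[19:15], func[14:12], reg[11:0]. */
#define MMCFG_BUS_SHIFT   20
#define MMCFG_BUS_MASK    0xff
#define MMCFG_SLOT_SHIFT  15
#define MMCFG_SLOT_MASK   0x1f
#define MMCFG_FUNC_SHIFT  12
#define MMCFG_FUNC_MASK   0x7
#define MMCFG_REG_MASK    0xfff

/* Decode an MMCFG access address; returns the config space register. */
static unsigned int mmcfg_decode_addr(uint64_t addr, uint64_t base,
                                      unsigned int bus_base,
                                      unsigned int *bus, unsigned int *slot,
                                      unsigned int *func)
{
    addr -= base;
    *bus = ((addr >> MMCFG_BUS_SHIFT) & MMCFG_BUS_MASK) + bus_base;
    *slot = (addr >> MMCFG_SLOT_SHIFT) & MMCFG_SLOT_MASK;
    *func = (addr >> MMCFG_FUNC_SHIFT) & MMCFG_FUNC_MASK;
    return addr & MMCFG_REG_MASK;
}
```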

> > +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
> > +{
> > +    struct domain *d = v->domain;
> > +    bool found;
> > +
> > +    vpci_lock(d);
> > +    found = vpci_mmcfg_find(d, addr);
> > +    vpci_unlock(d);
> 
> The latest here I wonder whether the lock wouldn't better be an r/w one.

TBH, my first implementation used an rw lock, but then I thought it
was not worth it and switched to a spinlock. I don't mind making it an
rw lock, but then the argument passed to the read handlers should be
constified for safety IMHO.

Also note that due to the usage of the pcidevs lock whether this is rw
or a spinlock doesn't make much of a difference.

> > +static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
> > +                           unsigned int len, unsigned long *data)
> 
> uint64_t * (to be 32-bit compatible)

Will this work properly on 32-bit builds?

The hvm_mmio_{read/write}_t types expect an unsigned long, not a
uint64_t. I'm confused about how this worked before with a 32-bit
hypervisor and a 64-bit guest: how were movq accesses handled?

> > +{
> > +    struct domain *d = v->domain;
> > +    const struct hvm_mmcfg *mmcfg;
> > +    unsigned int bus, slot, func, reg;
> > +
> > +    vpci_lock(d);
> > +    mmcfg = vpci_mmcfg_find(d, addr);
> > +    if ( !mmcfg )
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +
> > +    vpci_mmcfg_decode_addr(mmcfg, addr, &bus, &slot, &func, &reg);
> > +
> > +    if ( vpci_access_check(reg, len) )
> > +    {
> > +        vpci_unlock(d);
> > +        return X86EMUL_OKAY;
> > +    }
> > +
> > +    pcidevs_lock();
> > +    if ( len == 8 )
> > +    {
> > +        /*
> > +         * According to the PCIe 3.1A specification:
> > +         *  - Configuration Reads and Writes must usually be DWORD or smaller
> > +         *    in size.
> > +         *  - Because Root Complex implementations are not required to support
> > +         *    accesses to a RCRB that cross DW boundaries [...] software
> > +         *    should take care not to cause the generation of such accesses
> > +         *    when accessing a RCRB unless the Root Complex will support the
> > +         *    access.
> > +         *  Xen however supports 8byte accesses by splitting them into two
> > +         *  4byte accesses.
> > +         */
> > +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, 4);
> > +        *data |= (uint64_t)vpci_read(mmcfg->segment, bus, slot, func,
> > +                                     reg + 4, 4) << 32;
> > +    }
> > +    else
> > +        *data = vpci_read(mmcfg->segment, bus, slot, func, reg, len);
> 
> I think it would be preferable to avoid the else, by merging this and
> the first of the other two reads.

Ack.
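A sketch of how merging the low double-word read removes the else branch. vpci_read is stubbed here so the snippet is self-contained; the real function of course also takes segment/bus/slot/func.

```c
#include <stdint.h>

/* Stub standing in for Xen's vpci_read(): returns a recognizable pattern. */
static uint32_t vpci_read_stub(unsigned int reg, unsigned int size)
{
    return 0x01010101u * (reg + size);
}

/*
 * Merged read path: the low double-word (or smaller) access is always
 * performed, and only the high double-word of an 8-byte access needs
 * special casing, so no else branch is required.
 */
static uint64_t mmcfg_read(unsigned int reg, unsigned int len)
{
    uint64_t data = vpci_read_stub(reg, len == 8 ? 4 : len);

    if ( len == 8 )
        data |= (uint64_t)vpci_read_stub(reg + 4, 4) << 32;

    return data;
}
```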

> > +    pcidevs_unlock();
> > +    vpci_unlock(d);
> 
> Question on lock order (should have gone into the patch 1 reply already,
> but I had thought of this only after sending): Is it really a good idea
> to nest this way?

I saw no other way to make sure the pdev is not removed while poking
at it.

> The pcidevs lock is covering quite large regions at
> times, so the risk of a lock order violation seems non-negligible even
> if there may be none right now. Furthermore the new uses of the pcidevs
> lock you introduce would seem to make it quite desirable to make that
> one an r/w one too. Otoh that's a recursive one, so it'll be non-trivial
> to convert ...

I can try, but as you say doesn't seem trivial at all.

> > +int register_vpci_mmcfg_handler(struct domain *d, paddr_t addr,
> 
> __hwdom_init

Thanks, Roger.


* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-07-14 16:01       ` Jan Beulich
@ 2017-07-14 16:41         ` Roger Pau Monné
  2017-07-28 12:25           ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-07-14 16:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, andrew.cooper3, ian.jackson, julien.grall, paul.durrant,
	xen-devel, boris.ostrovsky

On Fri, Jul 14, 2017 at 10:01:54AM -0600, Jan Beulich wrote:
> >>> On 14.07.17 at 17:33, <roger.pau@citrix.com> wrote:
> > On Thu, Jul 13, 2017 at 08:36:18AM -0600, Jan Beulich wrote:
> >> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> >> > +#define container_of(ptr, type, member) ({                      \
> >> > +        typeof(((type *)0)->member) *__mptr = (ptr);            \
> >> > +        (type *)((char *)__mptr - offsetof(type, member));      \
> >> 
> >> I don't know what tools maintainers think about such name space
> >> violations; in hypervisor code I'd ask you to avoid leading underscores
> >> in macro local variables (same in min()/max() and elsewhere then).
> > 
> > OK. container_of, max and min are verbatim copies of the macros in
> > xen/include/xen/kernel.h, with the style adjusted in the container_of
> > case IIRC (as requested in the previous review).
> 
> Well, that's one of the frequent problems we have: People copy and
> paste things without questioning them. We only make things worse if
> we clone code we wouldn't permit in anymore nowadays.

Sorry, that comment sounded like a justification, which was not my
intention. I was just explaining how that code ended up there the way
it is.

> >> > +    /*
> >> > +     * Test all possible read/write size combinations.
> >> > +     *
> >> > +     * Populate 128bits (16B) with 1B registers, 160bits (20B) with 2B
> >> > +     * registers, and finally 192bits (24B) with 4B registers.
> >> 
> >> I can't see how the numbers here are in line with the code this is
> >> meant to describe. Perhaps this is a leftover from an earlier variant
> >> of the code?
> > 
> > I'm not sure I understand this, the registers (or layout) described in
> > this comment are just added below the comment. Would you like me to
> > first add the registers and place the comment afterwards?
> 
> No, my point is that code that follows this doesn't populate as
> many bits as the comment says. From what I understand, you
> use 4 byte registers, 2 word ones, and one dword one.

OK, I think I see what you mean. The comment makes it look like I'm
populating 128 bits, while what I intended to say is:

[...]
 * Place 4 1B registers at 128bits (16B), 2 2B registers at 160bits (20B)
 * and finally 1 4B register at 192bits (24B).

> >> > --- a/xen/arch/arm/xen.lds.S
> >> > +++ b/xen/arch/arm/xen.lds.S
> >> > @@ -41,6 +41,9 @@ SECTIONS
> >> >  
> >> >    . = ALIGN(PAGE_SIZE);
> >> >    .rodata : {
> >> > +       __start_vpci_array = .;
> >> > +       *(.rodata.vpci)
> >> > +       __end_vpci_array = .;
> >> 
> >> Do you really need this (unconditionally)?
> > 
> > Right, this should have a ifdef CONFIG_PCI.
> 
> CONFIG_HAS_PCI for one, and then ARM doesn't select this at
> all. Hence the question.

I think it would be better to just add it now. The code is not really
x86-specific (although it's only used by x86 ATM). IMHO, guarding it
with CONFIG_HAS_PCI in both linker scripts is the best solution.

> >> > +static int vpci_access_check(unsigned int reg, unsigned int len)
> >> 
> >> The way you use it, this function wants to return bool.
> >> 
> >> > +void hvm_pci_decode_addr(unsigned int cf8, unsigned int addr,
> >> > +                         unsigned int *bus, unsigned int *slot,
> >> > +                         unsigned int *func, unsigned int *reg)
> >> 
> >> Since you return nothing right now, how about avoid one of the
> >> indirections? Best candidate would probably be the register value.
> > 
> > I don't really like functions that return some data in the return
> > value (if it's not an error code) and some other data in parameters.
> 
> Well, okay, I view it the other way around - return by indirection
> is to be used if return by value is not reasonable (too much data).
> Hence it's kind of an overflow to me, not a replacement.
> 
> >> > +int __hwdom_init vpci_add_handlers(struct pci_dev *pdev)
> >> 
> >> As pointed out in reply to an earlier version, this lacks a prereq
> >> change: setup_one_hwdom_device() needs to be marked __hwdom_init. And
> >> then, now that you have the annotation here, the placement of the
> >> array in the linker script should depend on whether __hwdom_init is an
> >> alias of __init.
> > 
> > The __hwdom_init prefix is dropped shortly from this function (patch
> > #3), but I agree on sending a pre-patch to address
> > setup_one_hwdom_device.
> 
> I have one ready, btw.
> 
> > The linker script I'm not sure it's worth modifying, by the end of the
> > series the list of handlers must reside in .rodata.
> 
> As per the reply to that later patch, I'm not yet convinced that
> these annotations will go away. Hence I'd prefer if things were
> handled fully correctly here.
> 
> >> > +static uint32_t vpci_read_hw(unsigned int seg, unsigned int bus,
> >> > +                             unsigned int slot, unsigned int func,
> >> > +                             unsigned int reg, uint32_t size)
> >> > +{
> >> > +    uint32_t data;
> >> > +
> >> > +    switch ( size )
> >> > +    {
> >> > +    case 4:
> >> > +        data = pci_conf_read32(seg, bus, slot, func, reg);
> >> > +        break;
> >> > +    case 2:
> >> > +        data = pci_conf_read16(seg, bus, slot, func, reg);
> >> > +        break;
> >> > +    case 1:
> >> > +        data = pci_conf_read8(seg, bus, slot, func, reg);
> >> > +        break;
> >> > +    default:
> >> > +        BUG();
> >> 
> >> As long as this is Dom0-only, BUG()s like this are probably fine, but
> >> if this ever gets extended to DomU-s, will we really remember to
> >> convert them?
> > 
> > ASSERT_UNREACHABLE() and set data to ~0 to be safe?
> 
> Yes please.
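A sketch of the agreed replacement: ASSERT_UNREACHABLE() plus the all-ones "no device" pattern instead of BUG(). Both the macro and the config read are stubbed here so the snippet stands alone; Xen's real ASSERT_UNREACHABLE is a no-op in release builds.

```c
#include <stdint.h>
#include <assert.h>

/* Minimal stand-ins for Xen's helpers, for illustration only. */
#define ASSERT_UNREACHABLE() assert(!"unreachable")

static uint32_t pci_conf_read(unsigned int reg, unsigned int size)
{
    return 0xdeadbeefu >> (32 - 8 * size); /* stubbed hardware read */
}

/*
 * Sketch: an unexpected size asserts in debug builds and returns
 * all-ones (the PCI "no device" pattern) instead of bringing the
 * host down with BUG().
 */
static uint32_t vpci_read_hw(unsigned int reg, unsigned int size)
{
    switch ( size )
    {
    case 4:
    case 2:
    case 1:
        return pci_conf_read(reg, size);
    default:
        ASSERT_UNREACHABLE();
        return ~(uint32_t)0;
    }
}
```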
> 
> >> > +     */
> >> > +    data = vpci_read_hw(seg, bus, slot, func, reg, size);
> >> 
> >> I continue to be worried of reads that have side effects here. Granted
> >> we currently don't emulate any, but it would feel better if we didn't
> >> do the read for no reason. I.e. do hw reads only to fill gaps between
> >> emulated fields.
> > 
> > Heh, right. I got this "idea" from pciback, but I will change it so
> > the logic is similar to the write one (which obviously doesn't write
> > everything and then checks for emulated registers).
> > 
> > As a side question, which kind of registers have read side effects on
> > PCI? Reading the spec (PCIe 3.1A) there's no type of register listed
> > in section 7.4 (ro, rw, rw1c and the sticky versions) that mentions
> > read side effects. Is that described somewhere for specific
> > registers?
> 
> I don't think there are any specified, but iirc a well known side effect
> of VPD reads from some cards is that it'll hang the box for certain
> (normally invalid) indexes. As said, we don't emulate anything like
> that, but let's be defensive wrt hardware quirks.

Thanks for the clarification.

Roger.


* Re: [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-06-30 15:01 ` [PATCH v4 7/9] vpci/msi: add MSI handlers Roger Pau Monne
@ 2017-07-18  8:56   ` Paul Durrant
  2017-08-02 13:34   ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Paul Durrant @ 2017-07-18  8:56 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, julien.grall@arm.com, Jan Beulich,
	boris.ostrovsky@oracle.com, Roger Pau Monne

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 30 June 2017 16:01
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; julien.grall@arm.com;
> konrad.wilk@oracle.com; Roger Pau Monne <roger.pau@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v4 7/9] vpci/msi: add MSI handlers
> 
> Add handlers for the MSI control, address, data and mask fields in
> order to detect accesses to them and setup the interrupts as requested
> by the guest.
> 
> Note that the pending register is not trapped, and the guest can
> freely read/write to it.
> 
> Whether Xen is going to provide this functionality to Dom0 (MSI
> emulation) is controlled by the "msi" option in the dom0 field. When
> disabling this option Xen will hide the MSI capability structure from
> Dom0.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
> Changes since v3:
>  - Propagate changes from previous versions: drop xen_ prefix, drop
>    return value from handlers, use the new vpci_val fields.
>  - Use MASK_EXTR.
>  - Remove the usage of GENMASK.
>  - Add GFLAGS_SHIFT_DEST_ID and use it in msi_flags.
>  - Add "arch" to the MSI arch specific functions.
>  - Move the dumping of vPCI MSI information to dump_msi (key 'M').
>  - Remove the guest_vectors field.
>  - Allow the guest to change the number of active vectors without
>    having to disable and enable MSI.
>  - Check the number of active vectors when parsing the disable
>    mask.
>  - Remove the debug messages from vpci_init_msi.
>  - Move the arch-specific part of the dump handler to x86/hvm/vmsi.c.
>  - Use trylock in the dump handler to get the vpci lock.
> 
> Changes since v2:
>  - Add an arch-specific abstraction layer. Note that this is only implemented
>    for x86 currently.
>  - Add a wrapper to detect MSI enabling for vPCI.
> 
> NB: I've only been able to test this with devices using a single MSI interrupt
> and no mask register. I will try to find hardware that supports the mask
> register and more than one vector, but I cannot make any promises.
> 
> If there are doubts about the untested parts we could always force Xen to
> report no per-vector masking support and only 1 available vector, but I would
> rather avoid doing it.
> ---
>  xen/arch/x86/hvm/vmsi.c      | 149 ++++++++++++++++++
>  xen/arch/x86/msi.c           |   3 +
>  xen/drivers/vpci/Makefile    |   2 +-
>  xen/drivers/vpci/msi.c       | 348 +++++++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/io.h |  18 +++
>  xen/include/asm-x86/msi.h    |   1 +
>  xen/include/xen/hvm/irq.h    |   2 +
>  xen/include/xen/vpci.h       |  26 ++++
>  8 files changed, 548 insertions(+), 1 deletion(-)
>  create mode 100644 xen/drivers/vpci/msi.c
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index a36692c313..5732c70b5c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -622,3 +622,152 @@ void msix_write_completion(struct vcpu *v)
>      if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
>          gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
>  }
> +
> +static unsigned int msi_vector(uint16_t data)
> +{
> +    return MASK_EXTR(data, MSI_DATA_VECTOR_MASK);
> +}
> +
> +static unsigned int msi_flags(uint16_t data, uint64_t addr)
> +{
> +    unsigned int rh, dm, dest_id, deliv_mode, trig_mode;
> +
> +    rh = MASK_EXTR(addr, MSI_ADDR_REDIRECTION_MASK);
> +    dm = MASK_EXTR(addr, MSI_ADDR_DESTMODE_MASK);
> +    dest_id = MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK);
> +    deliv_mode = MASK_EXTR(data, MSI_DATA_DELIVERY_MODE_MASK);
> +    trig_mode = MASK_EXTR(data, MSI_DATA_TRIGGER_MASK);
> +
> +    return (dest_id << GFLAGS_SHIFT_DEST_ID) | (rh << GFLAGS_SHIFT_RH) |
> +           (dm << GFLAGS_SHIFT_DM) | (deliv_mode << GFLAGS_SHIFT_DELIV_MODE) |
> +           (trig_mode << GFLAGS_SHIFT_TRG_MODE);
> +}
> +
> +void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                        unsigned int entry, bool mask)
> +{
> +    struct domain *d = pdev->domain;
> +    const struct pirq *pinfo;
> +    struct irq_desc *desc;
> +    unsigned long flags;
> +    int irq;
> +
> +    ASSERT(arch->pirq >= 0);
> +    pinfo = pirq_info(d, arch->pirq + entry);
> +    ASSERT(pinfo);
> +
> +    irq = pinfo->arch.irq;
> +    ASSERT(irq < nr_irqs && irq >= 0);
> +
> +    desc = irq_to_desc(irq);
> +    ASSERT(desc);
> +
> +    spin_lock_irqsave(&desc->lock, flags);
> +    guest_mask_msi_irq(desc, mask);
> +    spin_unlock_irqrestore(&desc->lock, flags);
> +}
> +
> +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                         uint64_t address, uint32_t data, unsigned int vectors)
> +{
> +    struct msi_info msi_info = {
> +        .seg = pdev->seg,
> +        .bus = pdev->bus,
> +        .devfn = pdev->devfn,
> +        .entry_nr = vectors,
> +    };
> +    unsigned int i;
> +    int rc;
> +
> +    ASSERT(arch->pirq == -1);
> +
> +    /* Get a PIRQ. */
> +    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> +                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> +    if ( rc )
> +    {
> +        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> +                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                PCI_FUNC(pdev->devfn), rc);
> +        return rc;
> +    }
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +            .u.msi.gvec = msi_vector(data) + i,
> +            .u.msi.gflags = msi_flags(data, address),
> +        };
> +
> +        pcidevs_lock();
> +        rc = pt_irq_create_bind(pdev->domain, &bind);
> +        if ( rc )
> +        {
> +            dprintk(XENLOG_ERR,
> +                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> +                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> +            spin_lock(&pdev->domain->event_lock);
> +            unmap_domain_pirq(pdev->domain, arch->pirq);
> +            spin_unlock(&pdev->domain->event_lock);
> +            pcidevs_unlock();
> +            arch->pirq = -1;
> +            return rc;
> +        }
> +        pcidevs_unlock();
> +    }
> +
> +    return 0;
> +}
> +
> +int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                          unsigned int vectors)
> +{
> +    unsigned int i;
> +
> +    ASSERT(arch->pirq != -1);
> +
> +    for ( i = 0; i < vectors; i++ )
> +    {
> +        xen_domctl_bind_pt_irq_t bind = {
> +            .machine_irq = arch->pirq + i,
> +            .irq_type = PT_IRQ_TYPE_MSI,
> +        };
> +
> +        pcidevs_lock();
> +        pt_irq_destroy_bind(pdev->domain, &bind);
> +        pcidevs_unlock();
> +    }
> +
> +    pcidevs_lock();
> +    spin_lock(&pdev->domain->event_lock);
> +    unmap_domain_pirq(pdev->domain, arch->pirq);
> +    spin_unlock(&pdev->domain->event_lock);
> +    pcidevs_unlock();
> +
> +    arch->pirq = -1;
> +
> +    return 0;
> +}
> +
> +int vpci_msi_arch_init(struct vpci_arch_msi *arch)
> +{
> +    arch->pirq = -1;
> +    return 0;
> +}
> +
> +void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
> +                         uint64_t addr)
> +{
> +    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
> +           MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
> +           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
> +           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
> +           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
> +           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
> +           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",
> +           MASK_EXTR(addr, MSI_ADDR_DEST_ID_MASK),
> +           arch->pirq);
> +}
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index d98f400699..573378d6c3 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -30,6 +30,7 @@
>  #include <public/physdev.h>
>  #include <xen/iommu.h>
>  #include <xsm/xsm.h>
> +#include <xen/vpci.h>
> 
>  static s8 __read_mostly use_msi = -1;
>  boolean_param("msi", use_msi);
> @@ -1536,6 +1537,8 @@ static void dump_msi(unsigned char key)
>                 attr.guest_masked ? 'G' : ' ',
>                 mask);
>      }
> +
> +    vpci_dump_msi();
>  }
> 
>  static int __init msi_setup_keyhandler(void)
> diff --git a/xen/drivers/vpci/Makefile b/xen/drivers/vpci/Makefile
> index 241467212f..62cec9e82b 100644
> --- a/xen/drivers/vpci/Makefile
> +++ b/xen/drivers/vpci/Makefile
> @@ -1 +1 @@
> -obj-y += vpci.o header.o
> +obj-y += vpci.o header.o msi.o
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> new file mode 100644
> index 0000000000..d8f3418616
> --- /dev/null
> +++ b/xen/drivers/vpci/msi.c
> @@ -0,0 +1,348 @@
> +/*
> + * Handlers for accesses to the MSI capability structure.
> + *
> + * Copyright (C) 2017 Citrix Systems R&D
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms and conditions of the GNU General Public
> + * License, version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/sched.h>
> +#include <xen/vpci.h>
> +#include <asm/msi.h>
> +#include <xen/keyhandler.h>
> +
> +/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
> +static void vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
> +                                  union vpci_val *val, void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    /* Set multiple message capable. */
> +    val->u16 = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
> +
> +    if ( msi->enabled ) {
> +        val->u16 |= PCI_MSI_FLAGS_ENABLE;
> +        val->u16 |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);
> +    }
> +    val->u16 |= msi->masking ? PCI_MSI_FLAGS_MASKBIT : 0;
> +    val->u16 |= msi->address64 ? PCI_MSI_FLAGS_64BIT : 0;
> +}
> +
> +static void vpci_msi_enable(struct pci_dev *pdev, struct vpci_msi *msi,
> +                            unsigned int vectors)
> +{
> +    int ret;
> +
> +    ASSERT(!msi->vectors);
> +
> +    ret = vpci_msi_arch_enable(&msi->arch, pdev, msi->address, msi->data,
> +                               vectors);
> +    if ( ret )
> +        return;
> +
> +    /* Apply the mask bits. */
> +    if ( msi->masking )
> +    {
> +        unsigned int i;
> +        uint32_t mask = msi->mask;
> +
> +        for ( i = ffs(mask) - 1; mask && i < vectors; i = ffs(mask) - 1 )
> +        {
> +            vpci_msi_arch_mask(&msi->arch, pdev, i, true);
> +            __clear_bit(i, &mask);
> +        }
> +    }
> +
> +    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->pos, 1);
> +
> +    msi->vectors = vectors;
> +    msi->enabled = true;
> +}
> +
> +static int vpci_msi_disable(struct pci_dev *pdev, struct vpci_msi *msi)
> +{
> +    int ret;
> +
> +    ASSERT(msi->vectors);
> +
> +    __msi_set_enable(pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> +                     PCI_FUNC(pdev->devfn), msi->pos, 0);
> +
> +    ret = vpci_msi_arch_disable(&msi->arch, pdev, msi->vectors);
> +    if ( ret )
> +        return ret;
> +
> +    msi->vectors = 0;
> +    msi->enabled = false;
> +
> +    return 0;
> +}
> +
> +static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
> +                                   union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    unsigned int vectors = 1 << MASK_EXTR(val.u16, PCI_MSI_FLAGS_QSIZE);
> +    int ret;
> +
> +    if ( vectors > msi->max_vectors )
> +        vectors = msi->max_vectors;
> +
> +    if ( !!(val.u16 & PCI_MSI_FLAGS_ENABLE) == msi->enabled &&
> +         (vectors == msi->vectors || !msi->enabled) )
> +        return;

Personally I find the above logic a little tricky to follow. Would it be clearer to fold it into the logic below? (I only understood the reason for the logic above after reading the logic below).
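One way to fold the early return into the branches, as a hedged sketch. The MSI enable/disable paths are replaced with instrumented stubs so the snippet is self-contained; the real handlers also clamp the vector count and propagate errors.

```c
#include <stdbool.h>

/* Minimal model of the MSI state the handler manipulates. */
struct msi_state {
    bool enabled;
    unsigned int vectors;
    unsigned int enables;  /* instrumentation: enable calls issued */
    unsigned int disables; /* instrumentation: disable calls issued */
};

static void msi_enable(struct msi_state *msi, unsigned int vectors)
{
    msi->vectors = vectors;
    msi->enabled = true;
    msi->enables++;
}

static void msi_disable(struct msi_state *msi)
{
    msi->vectors = 0;
    msi->enabled = false;
    msi->disables++;
}

/*
 * Control-write logic with the early return folded into the branches:
 * each case states directly when nothing needs doing.
 */
static void control_write(struct msi_state *msi, bool enable,
                          unsigned int vectors)
{
    if ( enable )
    {
        if ( msi->enabled )
        {
            if ( vectors == msi->vectors )
                return; /* No change in vector count: nothing to do. */
            /* Vector count changed: cycle MSI to apply it. */
            msi_disable(msi);
        }
        msi_enable(msi, vectors);
    }
    else if ( msi->enabled )
        msi_disable(msi);
}
```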

> +
> +    if ( val.u16 & PCI_MSI_FLAGS_ENABLE )
> +    {
> +        if ( msi->enabled )
> +        {
> +            /*
> +             * Change to the number of enabled vectors, disable and
> +             * enable MSI in order to apply it.
> +             */
> +            ret = vpci_msi_disable(pdev, msi);
> +            if ( ret )
> +                return;
> +        }
> +        vpci_msi_enable(pdev, msi, vectors);
> +    }
> +    else
> +        vpci_msi_disable(pdev, msi);
> +}
> +
> +/* Handlers for the address field (32bit or low part of a 64bit address). */
> +static void vpci_msi_address_read(struct pci_dev *pdev, unsigned int reg,
> +                                  union vpci_val *val, void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    val->u32 = msi->address;
> +}
> +
> +static void vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
> +                                   union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear low part. */
> +    msi->address &= ~(uint64_t)0xffffffff;
> +    msi->address |= val.u32;
> +}
> +
> +/* Handlers for the high part of a 64bit address field. */
> +static void vpci_msi_address_upper_read(struct pci_dev *pdev, unsigned int reg,
> +                                        union vpci_val *val, void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    val->u32 = msi->address >> 32;
> +}
> +
> +static void vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
> +                                         union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    /* Clear high part. */
> +    msi->address &= ~((uint64_t)0xffffffff << 32);
> +    msi->address |= (uint64_t)val.u32 << 32;
> +}
> +
> +/* Handlers for the data field. */
> +static void vpci_msi_data_read(struct pci_dev *pdev, unsigned int reg,
> +                               union vpci_val *val, void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    val->u16 = msi->data;
> +}
> +
> +static void vpci_msi_data_write(struct pci_dev *pdev, unsigned int reg,
> +                                union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +
> +    msi->data = val.u16;
> +}
> +
> +static void vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,
> +                               union vpci_val *val, void *data)
> +{
> +    const struct vpci_msi *msi = data;
> +
> +    val->u32 = msi->mask;
> +}
> +
> +static void vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
> +                                union vpci_val val, void *data)
> +{
> +    struct vpci_msi *msi = data;
> +    uint32_t dmask;
> +
> +    dmask = msi->mask ^ val.u32;
> +
> +    if ( !dmask )
> +        return;
> +
> +    if ( msi->enabled )
> +    {
> +        unsigned int i;
> +
> +        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
> +              i = ffs(dmask) - 1 )
> +        {
> +            vpci_msi_arch_mask(&msi->arch, pdev, i, MASK_EXTR(val.u32, 1 << i));
> +            __clear_bit(i, &dmask);
> +        }
> +    }
> +
> +    msi->mask = val.u32;
> +}
> +
> +static int vpci_init_msi(struct pci_dev *pdev)
> +{
> +    uint8_t seg = pdev->seg, bus = pdev->bus;
> +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +    struct vpci_msi *msi;
> +    unsigned int msi_offset;
> +    uint16_t control;
> +    int ret;
> +
> +    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> +    if ( !msi_offset )
> +        return 0;
> +
> +    msi = xzalloc(struct vpci_msi);
> +    if ( !msi )
> +        return -ENOMEM;
> +
> +    msi->pos = msi_offset;
> +
> +    control = pci_conf_read16(seg, bus, slot, func,
> +                              msi_control_reg(msi_offset));
> +
> +    ret = vpci_add_register(pdev, vpci_msi_control_read,
> +                            vpci_msi_control_write,
> +                            msi_control_reg(msi_offset), 2, msi);
> +    if ( ret )
> +        goto error;
> +
> +    /* Get the maximum number of vectors the device supports. */
> +    msi->max_vectors = multi_msi_capable(control);
> +    ASSERT(msi->max_vectors <= 32);
> +
> +    /* No PIRQ bind yet. */
> +    vpci_msi_arch_init(&msi->arch);
> +
> +    if ( is_64bit_address(control) )
> +        msi->address64 = true;
> +    if ( is_mask_bit_support(control) )
> +        msi->masking = true;
> +
> +    ret = vpci_add_register(pdev, vpci_msi_address_read,
> +                            vpci_msi_address_write,
> +                            msi_lower_address_reg(msi_offset), 4, msi);
> +    if ( ret )
> +        goto error;
> +
> +    ret = vpci_add_register(pdev, vpci_msi_data_read, vpci_msi_data_write,
> +                            msi_data_reg(msi_offset, msi->address64), 2,
> +                            msi);
> +    if ( ret )
> +        goto error;
> +
> +    if ( msi->address64 )
> +    {
> +        ret = vpci_add_register(pdev, vpci_msi_address_upper_read,
> +                                vpci_msi_address_upper_write,
> +                                msi_upper_address_reg(msi_offset), 4, msi);
> +        if ( ret )
> +            goto error;
> +    }
> +
> +    if ( msi->masking )
> +    {
> +        ret = vpci_add_register(pdev, vpci_msi_mask_read, vpci_msi_mask_write,
> +                                msi_mask_bits_reg(msi_offset,
> +                                                  msi->address64), 4, msi);
> +        if ( ret )
> +            goto error;
> +    }
> +
> +    pdev->vpci->msi = msi;
> +
> +    return 0;
> +
> + error:

Do you not need to clean up any added register handlers here? They have been given a context value which you're about to xfree().

  Paul
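To illustrate the point, here is a self-contained sketch of the register-and-rollback pattern: each registration records its context, and the error path removes every handler tied to that context before freeing it. Everything in this snippet (the table, both function bodies, and the fail parameter) is invented for illustration; the series' real vpci_add_register has a different signature.

```c
#include <stddef.h>
#include <stdbool.h>

#define MAX_HANDLERS 8

/* Toy handler table keyed by registration context. */
struct handler_table {
    const void *ctx[MAX_HANDLERS];
    unsigned int count;
};

/* Record a handler; 'fail' simulates a registration error. */
static int vpci_add_register(struct handler_table *t, const void *ctx,
                             bool fail)
{
    if ( fail || t->count == MAX_HANDLERS )
        return -1;
    t->ctx[t->count++] = ctx;
    return 0;
}

/* Remove every handler whose context matches, as the error path must. */
static void vpci_remove_registers(struct handler_table *t, const void *ctx)
{
    unsigned int i = 0;

    while ( i < t->count )
    {
        if ( t->ctx[i] == ctx )
            t->ctx[i] = t->ctx[--t->count];
        else
            i++;
    }
}
```

With this shape, the error label can call vpci_remove_registers() before xfree()ing the context, so no handler is left pointing at freed memory.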

> +    ASSERT(ret);
> +    xfree(msi);
> +    return ret;
> +}
> +
> +REGISTER_VPCI_INIT(vpci_init_msi);
> +
> +void vpci_dump_msi(void)
> +{
> +    struct domain *d;
> +
> +    for_each_domain ( d )
> +    {
> +        const struct pci_dev *pdev;
> +
> +        if ( !has_vpci(d) )
> +            continue;
> +
> +        printk("vPCI MSI information for guest %u\n", d->domain_id);
> +
> +        if ( !vpci_trylock(d) )
> +        {
> +            printk("Unable to get vPCI lock, skipping\n");
> +            continue;
> +        }
> +
> +        list_for_each_entry ( pdev, &d->arch.pdev_list, domain_list )
> +        {
> +            uint8_t seg = pdev->seg, bus = pdev->bus;
> +            uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> +            struct vpci_msi *msi = pdev->vpci->msi;
> +
> +            if ( !msi )
> +                continue;
> +
> +            printk("Device %04x:%02x:%02x.%u\n", seg, bus, slot, func);
> +
> +            printk("Enabled: %u Supports masking: %u 64-bit addresses: %u\n",
> +                   msi->enabled, msi->masking, msi->address64);
> +            printk("Max vectors: %u enabled vectors: %u\n",
> +                   msi->max_vectors, msi->vectors);
> +
> +            vpci_msi_arch_print(&msi->arch, msi->data, msi->address);
> +
> +            if ( msi->masking )
> +                printk("mask=%#032x\n", msi->mask);
> +        }
> +        vpci_unlock(d);
> +    }
> +}
> +
> +/*
> + * Local variables:
> + * mode: C
> + * c-file-style: "BSD"
> + * c-basic-offset: 4
> + * tab-width: 4
> + * indent-tabs-mode: nil
> + * End:
> + */
> +
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 4fe996fe49..55ed094734 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -20,6 +20,7 @@
>  #define __ASM_X86_HVM_IO_H__
> 
>  #include <xen/mm.h>
> +#include <xen/pci.h>
>  #include <asm/hvm/vpic.h>
>  #include <asm/hvm/vioapic.h>
>  #include <public/hvm/ioreq.h>
> @@ -126,6 +127,23 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
>  void msix_write_completion(struct vcpu *);
>  void msixtbl_init(struct domain *d);
> 
> +/* Arch-specific MSI data for vPCI. */
> +struct vpci_arch_msi {
> +    int pirq;
> +};
> +
> +/* Arch-specific vPCI MSI helpers. */
> +void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                        unsigned int entry, bool mask);
> +int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                         uint64_t address, uint32_t data,
> +                         unsigned int vectors);
> +int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> +                          unsigned int vectors);
> +int vpci_msi_arch_init(struct vpci_arch_msi *arch);
> +void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,
> +                         uint64_t addr);
> +
>  enum stdvga_cache_state {
>      STDVGA_CACHE_UNINITIALIZED,
>      STDVGA_CACHE_ENABLED,
> diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
> index 213ee53f72..9c36c34372 100644
> --- a/xen/include/asm-x86/msi.h
> +++ b/xen/include/asm-x86/msi.h
> @@ -48,6 +48,7 @@
>  #define MSI_ADDR_REDIRECTION_SHIFT  3
>  #define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
>  #define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
> +#define MSI_ADDR_REDIRECTION_MASK   0x8
> 
>  #define MSI_ADDR_DEST_ID_SHIFT		12
>  #define	 MSI_ADDR_DEST_ID_MASK		0x00ff000
> diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
> index 0d2c72c109..d07185a479 100644
> --- a/xen/include/xen/hvm/irq.h
> +++ b/xen/include/xen/hvm/irq.h
> @@ -57,7 +57,9 @@ struct dev_intx_gsi_link {
>  #define VMSI_DELIV_MASK   0x7000
>  #define VMSI_TRIG_MODE    0x8000
> 
> +#define GFLAGS_SHIFT_DEST_ID        0
>  #define GFLAGS_SHIFT_RH             8
> +#define GFLAGS_SHIFT_DM             9
>  #define GFLAGS_SHIFT_DELIV_MODE     12
>  #define GFLAGS_SHIFT_TRG_MODE       15
> 
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index 452ee482e8..2a7d7557b3 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -13,6 +13,7 @@
>   * of just returning whether the lock is held by any CPU).
>   */
>  #define vpci_lock(d) spin_lock_recursive(&(d)->arch.hvm_domain.vpci_lock)
> +#define vpci_trylock(d) spin_trylock_recursive(&(d)->arch.hvm_domain.vpci_lock)
>  #define vpci_unlock(d) spin_unlock_recursive(&(d)->arch.hvm_domain.vpci_lock)
>  #define vpci_locked(d) spin_is_locked(&(d)->arch.hvm_domain.vpci_lock)
> 
> @@ -85,9 +86,34 @@ struct vpci {
>          } bars[7]; /* At most 6 BARS + 1 expansion ROM BAR. */
>          /* FIXME: currently there's no support for SR-IOV. */
>      } header;
> +
> +    /* MSI data. */
> +    struct vpci_msi {
> +        /* Offset of the capability in the config space. */
> +        unsigned int pos;
> +        /* Maximum number of vectors supported by the device. */
> +        unsigned int max_vectors;
> +        /* Number of vectors configured. */
> +        unsigned int vectors;
> +        /* Address and data fields. */
> +        uint64_t address;
> +        uint16_t data;
> +        /* Mask bitfield. */
> +        uint32_t mask;
> +        /* Enabled? */
> +        bool enabled;
> +        /* Supports per-vector masking? */
> +        bool masking;
> +        /* 64-bit address capable? */
> +        bool address64;
> +        /* Arch-specific data. */
> +        struct vpci_arch_msi arch;
> +    } *msi;
>  #endif
>  };
> 
> +void vpci_dump_msi(void);
> +
>  #endif
> 
>  /*
> --
> 2.11.0 (Apple Git-81)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-07-14 10:32   ` Jan Beulich
@ 2017-07-20 10:23     ` Roger Pau Monne
  2017-07-28 12:31       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-07-20 10:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, julien.grall, boris.ostrovsky, xen-devel

On Fri, Jul 14, 2017 at 04:32:19AM -0600, Jan Beulich wrote:
> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> > So that hotplug (or MMCFG regions not present in the MCFG ACPI table)
> > can be added at run time by the hardware domain.
> 
> I think the emphasis should be the other way around. I'm rather certain
> hotplug of bridges doesn't really work right now anyway; at least
> IO-APIC hotplug code is completely missing.

IO-APICs can also be hot-plugged? Didn't even know about that...

> > When a new MMCFG area is added to a PVH Dom0, Xen will scan it and add
> > the devices to the hardware domain.
> 
> Adding the MMIO regions is certainly necessary, but what's the point of
> also scanning the bus and adding the devices?

It's not strictly necessary; the same can be accomplished by Dom0
calling PHYSDEVOP_manage_pci_add on each device.

Just thought it wouldn't hurt to do it here, but given your comment
below I'm not sure. I will wait for your reply before deciding what to
do.

> We expect Dom0 to tell us
> anyway, and not doing the scan in Xen avoids complications we presently
> have in the segment 0 case when Dom0 decides to re-number busses (e.g.
> in order to fit in SR-IOV VFs).

Is this renumbering performed by changing the
Primary/Secondary/Subordinate bus number registers in the bridge?

If so we could detect such accesses (by adding traps to type 01h
headers) and react accordingly.

What if Dom0 re-numbers the bus after having already registered the
devices with Xen?

> > --- a/xen/arch/x86/hvm/hypercall.c
> > +++ b/xen/arch/x86/hvm/hypercall.c
> > @@ -89,6 +89,10 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >          if ( !has_pirq(curr->domain) )
> >              return -ENOSYS;
> >          break;
> > +    case PHYSDEVOP_pci_mmcfg_reserved:
> > +        if ( !is_hardware_domain(curr->domain) )
> > +            return -ENOSYS;
> > +        break;
> 
> This physdevop (like most ones) is restricted to Dom0 use anyway
> (properly expressed via XSM check), so I'd rather see you check
> has_vpci() here, in line with e.g. the check visible in context.

Ack.

> > --- a/xen/arch/x86/physdev.c
> > +++ b/xen/arch/x86/physdev.c
> > @@ -559,6 +559,25 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >  
> >          ret = pci_mmcfg_reserved(info.address, info.segment,
> >                                   info.start_bus, info.end_bus, info.flags);
> > +        if ( ret || !is_hvm_domain(currd) )
> > +            break;
> > +
> > +        /*
> > +         * For HVM (PVH) domains try to add the newly found MMCFG to the
> > +         * domain.
> > +         */
> > +        ret = register_vpci_mmcfg_handler(currd, info.address, info.start_bus,
> > +                                          info.end_bus, info.segment);
> > +        if ( ret == -EEXIST )
> > +        {
> > +            ret = 0;
> > +            break;
> 
> I don't really understand this part: Why would handlers be registered
> already? If you consider double registration, wouldn't that better
> either be detected by pci_mmcfg_reserved() (and the call here avoided
> altogether) or the fact indeed be reported back to the caller?

Yes, this can be done in pci_mmcfg_reserved; it's just that so far
pci_mmcfg_reserved doesn't return -EEXIST for duplicated bridges.

> > @@ -1110,6 +1110,37 @@ void __hwdom_init setup_hwdom_pci_devices(
> >      pcidevs_unlock();
> >  }
> >  
> > +static int add_device(uint8_t devfn, struct pci_dev *pdev)
> > +{
> > +    return iommu_add_device(pdev);
> > +}
> 
> You're discarding devfn here, just for iommu_add_device() to re-do the
> phantom function handling. At the very least this is wasteful. Perhaps
> you minimally want to call iommu_add_device() only when
> devfn == pdev->devfn (if all of this code stays in the first place)?

Doesn't the IOMMU also need to know about the phantom functions in
order to add translations for them too?

I assume phantom_stride already takes care of this, so yes, if this
has to stay here a pdev->devfn == devfn check should be added.

> > +int pci_scan_and_setup_segment(uint16_t segment)
> > +{
> > +    struct pci_seg *pseg = get_pseg(segment);
> > +    struct setup_hwdom ctxt = {
> > +        .d = current->domain,
> > +        .handler = add_device,
> > +    };
> > +    int ret;
> > +
> > +    if ( !pseg )
> > +        return -EINVAL;
> > +
> > +    pcidevs_lock();
> > +    ret = _scan_pci_devices(pseg, NULL);
> > +    if ( ret )
> > +        goto out;
> > +
> > +    ret = _setup_hwdom_pci_devices(pseg, &ctxt);
> > +    if ( ret )
> > +        goto out;
> > +
> > + out:
> 
> Please let's avoid such unnecessary goto-s. Even the first one could be
> easily avoided without making the code anywhere near unreadable.

Right, that's not a problem.

Thanks, Roger.


* Re: [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device
  2017-07-14 10:33   ` Jan Beulich
@ 2017-07-20 14:00     ` Roger Pau Monne
  2017-07-20 14:05       ` Roger Pau Monne
  2017-07-29 16:32       ` Jan Beulich
  0 siblings, 2 replies; 44+ messages in thread
From: Roger Pau Monne @ 2017-07-20 14:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, julien.grall, boris.ostrovsky

On Fri, Jul 14, 2017 at 04:33:20AM -0600, Jan Beulich wrote:
> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> > So that it can be called from outside in order to get the size of regular PCI
> > BARs. This will be required in order to map the BARs from PCI devices into PVH
> > Dom0 p2m.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > --- a/xen/drivers/passthrough/pci.c
> > +++ b/xen/drivers/passthrough/pci.c
> > @@ -588,6 +588,54 @@ static void pci_enable_acs(struct pci_dev *pdev)
> >      pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
> >  }
> >  
> > +int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
> > +                     unsigned int func, unsigned int pos, bool last,
> > +                     uint64_t *paddr, uint64_t *psize)
> > +{
> > +    uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
> > +    uint64_t addr, size;
> > +
> > +    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
> > +    pci_conf_write32(seg, bus, slot, func, pos, ~0);
> > +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +    {
> > +        if ( last )
> > +        {
> > +            printk(XENLOG_WARNING
> > +                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
> 
> This message needs to tell what kind of slot is being processed (just
> like the original did).

The original message is:

"SR-IOV device %04x:%02x:%02x.%u with 64-bit vf BAR in last slot"

I guess you would like to have the "vf" again, in which case I will
add a bool vf parameter to the function that's only going to be used
here. IMHO I'm not really sure it's worth it because I don't find it
that informative. I thought that just knowing the device sbdf was
enough.

> > +                    seg, bus, slot, func);
> > +            return -EINVAL;
> > +        }
> > +        hi = pci_conf_read32(seg, bus, slot, func, pos + 4);
> > +        pci_conf_write32(seg, bus, slot, func, pos + 4, ~0);
> > +    }
> > +    size = pci_conf_read32(seg, bus, slot, func, pos) &
> > +           PCI_BASE_ADDRESS_MEM_MASK;
> > +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +    {
> > +        size |= (u64)pci_conf_read32(seg, bus, slot, func, pos + 4) << 32;
> 
> uint64_t
> 
> > +        pci_conf_write32(seg, bus, slot, func, pos + 4, hi);
> > +    }
> > +    else if ( size )
> > +        size |= (u64)~0 << 32;
> 
> Again (and more below).

Yes, I think I've fixed all of them.

> > +    pci_conf_write32(seg, bus, slot, func, pos, bar);
> > +    size = -(size);
> 
> Stray parentheses.
> 
> > +    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
> > +
> > +    if ( paddr )
> > +        *paddr = addr;
> > +    if ( psize )
> > +        *psize = size;
> 
> Is it reasonable to expect the caller to not care about the size?

Not at the moment, so I guess ASSERT(psize) would be better.

> > @@ -663,38 +710,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
> >                             seg, bus, slot, func, i);
> >                      continue;
> >                  }
> > -                pci_conf_write32(seg, bus, slot, func, idx, ~0);
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                {
> > -                    if ( i >= PCI_SRIOV_NUM_BARS )
> > -                    {
> > -                        printk(XENLOG_WARNING
> > -                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
> > -                               " vf BAR in last slot\n",
> > -                               seg, bus, slot, func);
> > -                        break;
> > -                    }
> > -                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
> > -                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
> > -                }
> > -                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
> > -                                   PCI_BASE_ADDRESS_MEM_MASK;
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                {
> > -                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
> > -                                                             slot, func,
> > -                                                             idx + 4) << 32;
> > -                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
> > -                }
> > -                else if ( pdev->vf_rlen[i] )
> > -                    pdev->vf_rlen[i] |= (u64)~0 << 32;
> > -                pci_conf_write32(seg, bus, slot, func, idx, bar);
> > -                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
> > -                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > -                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > -                    ++i;
> > +                ret = pci_size_mem_bar(seg, bus, slot, func, idx,
> > +                                       i == PCI_SRIOV_NUM_BARS - 1, NULL,
> > +                                       &pdev->vf_rlen[i]);
> > +                if ( ret < 0 )
> > +                    break;
> 
> ASSERT(ret) ?

Really? This is different from the previous behavior, which would just
break out of the loop in this situation. And on non-debug builds we
would end up decreasing i, which is not good.

Thanks for the review, Roger.


* Re: [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device
  2017-07-20 14:00     ` Roger Pau Monne
@ 2017-07-20 14:05       ` Roger Pau Monne
  2017-07-29 16:32       ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Roger Pau Monne @ 2017-07-20 14:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel, julien.grall, boris.ostrovsky

On Thu, Jul 20, 2017 at 03:00:40PM +0100, Roger Pau Monne wrote:
> On Fri, Jul 14, 2017 at 04:33:20AM -0600, Jan Beulich wrote:
> > >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> > > +                if ( ret < 0 )
> > > +                    break;
> > 
> > ASSERT(ret) ?
> 
> Really? This is different from the previous behavior, that would just
> break out of the loop in this situation. And on non-debug builds we
> would end up decreasing i, which is not good.

Figured that out, you wanted me to just add the ASSERT to make sure
ret != 0.

Thanks, Roger.


* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-07-14 15:11   ` Jan Beulich
@ 2017-07-24 14:58     ` Roger Pau Monne
  2017-07-29 16:44       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monne @ 2017-07-24 14:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, julien.grall, xen-devel, boris.ostrovsky

On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> > Introduce a set of handlers that trap accesses to the PCI BARs and the command
> > register, in order to emulate BAR sizing and BAR relocation.
> 
> I don't think "emulate" is the right term here - you really don't mean to
> change anything, you only want to snoop Dom0 writes.

Right, changed emulate to snoop.

> > --- /dev/null
> > +++ b/xen/drivers/vpci/header.c
> > @@ -0,0 +1,473 @@
> > +/*
> > + * Generic functionality for handling accesses to the PCI header from the
> > + * configuration space.
> > + *
> > + * Copyright (C) 2017 Citrix Systems R&D
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms and conditions of the GNU General Public
> > + * License, version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public
> > + * License along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + */
> > +
> > +#include <xen/sched.h>
> > +#include <xen/vpci.h>
> > +#include <xen/p2m-common.h>
> > +
> > +#define MAPPABLE_BAR(x)                                                 \
> > +    (((x)->type == VPCI_BAR_MEM32 || (x)->type == VPCI_BAR_MEM64_LO ||  \
> > +     ((x)->type == VPCI_BAR_ROM && (x)->enabled)) &&                    \
> > +     (x)->addr != INVALID_PADDR)
> > +
> > +static struct rangeset *vpci_get_bar_memory(const struct domain *d,
> > +                                            const struct vpci_bar *map)
> > +{
> > +    const struct pci_dev *pdev;
> > +    struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> > +    int rc;
> > +
> > +    if ( !mem )
> > +        return ERR_PTR(-ENOMEM);
> > +
> > +    /*
> > +     * Create a rangeset that represents the current BAR memory region
> > +     * and compare it against all the currently active BAR memory regions.
> > +     * If an overlap is found, subtract it from the region to be
> > +     * mapped/unmapped.
> > +     *
> > +     * NB: the rangeset uses frames, and if start and end addresses are
> > +     * equal it means only one frame is used, that's why PFN_DOWN is used
> > +     * to calculate the end of the rangeset.
> > +     */
> 
> That explanation doesn't seem to fit: Did you perhaps mean to
> point out that rangeset ranges are inclusive ones?

Yes, that's probably better.

> > +    rc = rangeset_add_range(mem, PFN_DOWN(map->addr),
> > +                            PFN_DOWN(map->addr + map->size));
> 
> Don't you need to subtract 1 here (and elsewhere below)?

Indeed.

> > +    if ( rc )
> > +    {
> > +        rangeset_destroy(mem);
> > +        return ERR_PTR(rc);
> > +    }
> > +
> > +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
> > +    {
> > +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
> > +                                       PCI_SLOT(pdev->devfn),
> > +                                       PCI_FUNC(pdev->devfn),
> > +                                       PCI_COMMAND);
> 
> This is quite a lot of overhead - a loop over all devices plus a config
> space read on each one. What state the memory decode bit is in
> could be recorded in the ->enabled flag, couldn't it? And devices on
> different sub-branches of the topology can't possibly have
> overlapping entries that we need to worry about, as the bridge
> windows would suppress actual accesses.

Oh, so Xen only needs to care about devices that share the same
bridge, because that is the only case where the same page can be
shared by multiple devices?

In any case, Dom0 is free to wrongly position the BARs anywhere it
wants, thus possibly placing them outside of the bridge windows, in
which case I think we had better check all assigned devices.

> > +        unsigned int i;
> > +
> > +        /* Check if memory decoding is enabled. */
> > +        if ( !(cmd & PCI_COMMAND_MEMORY) )
> > +            continue;
> > +
> > +        for ( i = 0; i < ARRAY_SIZE(pdev->vpci->header.bars); i++ )
> > +        {
> > +            const struct vpci_bar *bar = &pdev->vpci->header.bars[i];
> > +
> > +            if ( bar == map || !MAPPABLE_BAR(bar) ||
> > +                 !rangeset_overlaps_range(mem, PFN_DOWN(bar->addr),
> > +                                          PFN_DOWN(bar->addr + bar->size)) )
> > +                continue;
> > +
> > +            rc = rangeset_remove_range(mem, PFN_DOWN(bar->addr),
> > +                                       PFN_DOWN(bar->addr + bar->size));
> 
> I'm struggling to convince myself of the correctness of this approach
> (including other code further down which is also involved). I think you
> should have taken the time to add a few words on the approach
> chosen to the description.

Will do.

> For example, it doesn't look like things will
> go right if the device being dealt with has two BARs both using part
> of the same page.

Right, because the BAR won't reflect its actual state (due to the
memory decoding being global per-device). AFAICT this will be solved
by your suggestion above of using ->enabled and keeping it updated for
BARs also.

> > +static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> > +                           const bool map)
> > +{
> > +    struct rangeset *mem;
> > +    struct map_data data = { .d = d, .map = map };
> > +    int rc;
> > +
> > +    ASSERT(MAPPABLE_BAR(bar));
> > +
> > +    mem = vpci_get_bar_memory(d, bar);
> > +    if ( IS_ERR(mem) )
> > +        return -PTR_ERR(mem);
> 
> The negation looks wrong to me.

OK, this is already returning -<ERROR>, so the negation is not needed.

> > +static void vpci_cmd_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    uint16_t cmd = val.u16, current_cmd;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    int rc;
> > +
> > +    current_cmd = pci_conf_read16(seg, bus, slot, func, reg);
> > +
> > +    if ( !((cmd ^ current_cmd) & PCI_COMMAND_MEMORY) )
> > +    {
> > +        /*
> > +         * Let the guest play with all the bits directly except for the
> > +         * memory decoding one.
> > +         */
> > +        pci_conf_write16(seg, bus, slot, func, reg, cmd);
> > +        return;
> 
> Please invert the condition and have both cases use the same write
> at the end of the function.

Done.

> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *bar = data;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint32_t wdata = val.u32, size_mask;
> > +    bool hi = false;
> > +
> > +    switch ( bar->type )
> > +    {
> > +    case VPCI_BAR_MEM32:
> > +    case VPCI_BAR_MEM64_LO:
> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> > +        break;
> > +    case VPCI_BAR_MEM64_HI:
> > +        size_mask = ~0u;
> > +        break;
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +        return;
> > +    }
> > +
> > +    if ( (wdata & size_mask) == size_mask )
> > +    {
> > +        /* Next reads from this register are going to return the BAR size. */
> > +        bar->sizing = true;
> > +        return;
> 
> I think the comment needs extending to explain why the written
> sizing value can't possibly be an address. This is particularly
> relevant because I'm not sure that assumption would hold on e.g.
> ARM (which I don't think has guaranteed ROM right below 4Gb).

Hm, right. Maybe it would be best to detect sizing by checking, when
performing a read, that the address is ~0 in the high bits and
~0 & PCI_BASE_ADDRESS_MEM_MASK in the lower ones, instead of the kind
of partial guessing done here; the current approach is certainly not
very robust.

> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    bar->sizing = false;
> > +
> > +    /*
> > +     * Ignore attempts to change the position of the BAR if memory decoding is
> > +     * active.
> > +     */
> > +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> > +         PCI_COMMAND_MEMORY )
> > +        return;
> 
> Especially as long as this code supports only Dom0 I think we want
> a warning here.

Done, I've added:

%04x:%02x:%02x.%u: ignored BAR write with memory decoding enabled

> > +static void vpci_rom_write(struct pci_dev *pdev, unsigned int reg,
> > +                           union vpci_val val, void *data)
> > +{
> > +    struct vpci_bar *rom = data;
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    const uint32_t wdata = val.u32;
> > +
> > +    if ( (wdata & PCI_ROM_ADDRESS_MASK) == PCI_ROM_ADDRESS_MASK )
> > +    {
> > +        /* Next reads from this register are going to return the BAR size. */
> > +        rom->sizing = true;
> > +        return;
> > +    }
> > +
> > +    /* End previous sizing cycle if any. */
> > +    rom->sizing = false;
> > +
> > +    rom->addr = wdata & PCI_ROM_ADDRESS_MASK;
> > +
> > +    /* Check if memory decoding is enabled. */
> > +    if ( pci_conf_read16(seg, bus, slot, func, PCI_COMMAND) &
> > +         PCI_COMMAND_MEMORY &&
> > +         (rom->enabled ^ (wdata & PCI_ROM_ADDRESS_ENABLE)) )
> 
> Just like you parenthesize the operands of ^, please also do so for
> the ones of &. Also the ^-expression relies on the particular value
> of PCI_ROM_ADDRESS_ENABLE, which I'd prefer if you avoided.

Changed it to: rom->enabled != !!(wdata & PCI_ROM_ADDRESS_ENABLE)

> > +static int vpci_init_bars(struct pci_dev *pdev)
> > +{
> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> > +    uint8_t header_type;
> > +    uint16_t cmd;
> > +    uint32_t rom_val;
> > +    uint64_t addr, size;
> > +    unsigned int i, num_bars, rom_reg;
> > +    struct vpci_header *header = &pdev->vpci->header;
> > +    struct vpci_bar *bars = header->bars;
> > +    int rc;
> > +
> > +    header_type = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
> > +    switch ( header_type )
> 
> I'd prefer if you didn't introduce variables used just once.

OK, I find it cumbersome to place it as the switch expression, but it
fits in a single line so it's not that bad.

> > +    if ( cmd & PCI_COMMAND_MEMORY )
> > +        pci_conf_write16(seg, bus, slot, func, PCI_COMMAND,
> > +                         cmd & ~PCI_COMMAND_MEMORY);
> > +
> > +    for ( i = 0; i < num_bars; i++ )
> > +    {
> > +        uint8_t reg = PCI_BASE_ADDRESS_0 + i * 4;
> > +        uint32_t val = pci_conf_read32(seg, bus, slot, func, reg);
> > +
> > +        if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> > +        {
> > +            bars[i].type = VPCI_BAR_MEM64_HI;
> > +            rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> > +                                   &bars[i]);
> > +            if ( rc )
> > +                return rc;
> > +
> > +            continue;
> > +        }
> > +        if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> > +        {
> > +            bars[i].type = VPCI_BAR_IO;
> > +            continue;
> > +        }
> > +        if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > +             PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +            bars[i].type = VPCI_BAR_MEM64_LO;
> > +        else
> > +            bars[i].type = VPCI_BAR_MEM32;
> 
> Perhaps ignore the 64-bit indicator if it appears in the last BAR?

Hm, pci_size_mem_bar is going to complain anyway and Xen won't be able
to size the BAR.

> > +        /* Size the BAR and map it. */
> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> > +                              &addr, &size);
> > +        if ( rc < 0 )
> > +            return rc;
> > +
> > +        if ( size == 0 )
> > +        {
> > +            bars[i].type = VPCI_BAR_EMPTY;
> > +            continue;
> > +        }
> > +
> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> 
> This doesn't match up with logic further up: When the memory decode
> bit gets cleared, you don't zap the addresses, so I think you'd better
> store it here too. Use INVALID_PADDR only when the value read has
> all address bits set (same caveat as pointed out earlier).

OK. Note that .addr can only possibly be INVALID_PADDR at
initialization time; once the user has written something to the BAR,
.addr will be different from INVALID_PADDR.

> > +        bars[i].size = size;
> > +        bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> > +
> > +        rc = vpci_add_register(pdev, vpci_bar_read, vpci_bar_write, reg, 4,
> > +                               &bars[i]);
> > +        if ( rc )
> > +            return rc;
> > +    }
> > +
> > +    /* Check expansion ROM. */
> > +    rom_val = pci_conf_read32(seg, bus, slot, func, rom_reg);
> > +    if ( rom_val & PCI_ROM_ADDRESS_ENABLE )
> > +        pci_conf_write32(seg, bus, slot, func, rom_reg,
> > +                         rom_val & ~PCI_ROM_ADDRESS_ENABLE);
> 
> Do you really need to do this when you've cleared the memory
> decode bit already?

Oh right, this is not needed. Both bits need to be enabled for the ROM
to be mapped.

> > +    rc = pci_size_mem_bar(seg, bus, slot, func, rom_reg, true, &addr, &size);
> 
> You can't use this function here without first making it capable of
> dealing with ROM BARs - it expects the low bits to be different
> than what we have here (see the early ASSERT() that's there).
> 
> > +    if ( rc < 0 )
> > +        return rc;
> 
> Perhaps I didn't pay attention elsewhere, but here it is quite obvious
> that in the error case you return with the device in a state other than
> on input.

Yes, there are several error paths here that will return with memory
decoding disabled. I can fix that by writing back the original command
value to the register.

> > +    if ( size )
> > +    {
> > +        struct vpci_bar *rom = &header->bars[num_bars];
> > +
> > +        rom->type = VPCI_BAR_ROM;
> > +        rom->size = size;
> > +        rom->enabled = rom_val & PCI_ROM_ADDRESS_ENABLE;
> > +        if ( rom->enabled )
> > +            rom->addr = addr;
> > +        else
> > +            rom->addr = INVALID_PADDR;
> 
> Same remark as further up.

Fixed.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas
  2017-07-14 16:33     ` Roger Pau Monné
@ 2017-07-28 12:22       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-07-28 12:22 UTC (permalink / raw)
  To: roger.pau
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

>>> Roger Pau Monné <roger.pau@citrix.com> 07/14/17 6:34 PM >>>
>On Thu, Jul 13, 2017 at 02:15:26PM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:02 PM >>>
>> > @@ -1041,6 +1043,24 @@ static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
>> >      return 0;
>> >  }
>> >  
>> > +int __init pvh_setup_mmcfg(struct domain *d)
>> 
>> Didn't I point out that __init can't be correct here, and instead this
>> needs to be __hwdom_init? I can see that the only current caller is
>> __init, but that merely suggests there is a second call missing.
>
>Most likely, and I failed to update it.
>
>AFAIK it's not possible to build a late PVH hwdom (or I don't see
>how), so I guess that missing call should be added if we ever support
>that.

Why would a late hwdom not be able to be PVH? All depends on whether
the tool (stack) to build it (in domain 0) is capable of that.

>> > --- a/xen/arch/x86/hvm/io.c
>> > +++ b/xen/arch/x86/hvm/io.c
>> > @@ -261,11 +261,11 @@ void register_g2m_portio_handler(struct domain *d)
>> >  static int vpci_access_check(unsigned int reg, unsigned int len)
>> >  {
>> >      /* Check access size. */
>> > -    if ( len != 1 && len != 2 && len != 4 )
>> > +    if ( len != 1 && len != 2 && len != 4 && len != 8 )
>> >          return -EINVAL;
>> >  
>> > -    /* Check if access crosses a double-word boundary. */
>> > -    if ( (reg & 3) + len > 4 )
>> > +    /* Check if access crosses a double-word boundary or it's not aligned. */
>> > +    if ( (len <= 4 && (reg & 3) + len > 4) || (len == 8 && (reg & 3) != 0) )
>> >          return -EINVAL;
>> 
>> For one I suppose you mean "& 7" in the 8-byte case.
>
>I cannot find anything in the PCIe 3.1A specification that says that
>8B accesses should be aligned. AFAICT it only mentions that accesses
>should not cross double-word (4B) boundaries, because it's not
>mandatory for the root complex to support such accesses.

Hmm, ugly. I'd be particularly concerned about an 8-byte access
crossing the standard/extended config space boundary, or one crossing
the boundary between two devices (or worse between a device and a
hole). I'd suggest being conservative for now and requiring full alignment.
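
The conservative check suggested here could look like the sketch below (illustrative only; the name mirrors the quoted vpci_access_check(), but this is not the code from the series):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Allow 1-, 2-, 4- and 8-byte accesses, each aligned to its own size.
 * A naturally aligned access can never cross a boundary of its own
 * width, so the separate double-word crossing test becomes unnecessary.
 */
bool vpci_access_aligned(unsigned int reg, unsigned int len)
{
    if ( len != 1 && len != 2 && len != 4 && len != 8 )
        return false;

    /* Require the offset to be a multiple of the access size. */
    return (reg & (len - 1)) == 0;
}
```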

>> And then I don't
>> understand why you permit mis-aligned 2-byte writes, but not mis-aligned
>> 4-byte ones as long as they fall withing a quad-word. Any such asymmetry
>> needs at least a comment.
>
>IIRC I read something like that in the Mindshare PCI book, but I
>don't have it at hand. Will check on Monday. Anyway, I cannot seem to
>find any specific set of restrictions in the PCI/PCIe specifications,
>apart from the one that accesses should not cross a double-word
>boundary.
>
>I'm fine with only allowing accesses aligned to their respective
>sizes, but I think I should add a comment somewhere regarding where
>this has been picked from. Do you have any references from the
>AMD/Intel SDMs maybe?

No, I'm sorry.

>> > +static int vpci_mmcfg_accept(struct vcpu *v, unsigned long addr)
>> > +{
>> > +    struct domain *d = v->domain;
>> > +    bool found;
>> > +
>> > +    vpci_lock(d);
>> > +    found = vpci_mmcfg_find(d, addr);
>> > +    vpci_unlock(d);
>> 
>> The latest here I wonder whether the lock wouldn't better be an r/w one.
>
>TBH, my first implementation was using a rw lock, but then I thought it
>was not worth it and switched to a spinlock. I don't mind making it a
>rw lock, but then the argument passed to the read handlers should be
>constified for safety IMHO.

Which of the arguments?

>Also note that due to the usage of the pcidevs lock whether this is rw
>or a spinlock doesn't make much of a difference.

True.

>> > +static int vpci_mmcfg_read(struct vcpu *v, unsigned long addr,
>> > +                           unsigned int len, unsigned long *data)
>> 
>> uint64_t * (to be 32-bit compatible)
>
>Will this work properly on 32bit builds?

32-bit builds of what? For 32-bit ARM this is only a future (if ever)
consideration.

>hvm_mmio_{read/write}_t types expect an unsigned long, not a
>uint64_t. I'm confused about how this worked before with a 32bit
>hypervisor and a 64bit guest, how were movq accesses handled?

I think all this abstraction postdates removal of x86-32 builds. As to
MOVQ - if you think about the MMX/SSE variants of it, 32-bit would
have split the access just like 64-bit splits e.g. MOVDQA.

>> > +    pcidevs_unlock();
>> > +    vpci_unlock(d);
>> 
>> Question on lock order (should have gone into the patch 1 reply already,
>> but I had thought of this only after sending): Is it really a good idea
>> to nest this way?
>
>I saw no other way to make sure the pdev is not removed while poking
>at it.

As long as all of this is Dom0-only, I don't think that's a major concern. As
said elsewhere, we don't consistently lock against device removal anyway,
and we should rather use refcounting to deal with this.

>> The pcidevs lock is covering quite large regions at
>> times, so the risk of a lock order violation seems non-negligible even
>> if there may be none right now. Furthermore the new uses of the pcidevs
>> lock you introduce would seem to make it quite desirable to make that
>> one an r/w one too. Otoh that's a recursive one, so it'll be non-trivial
>> to convert ...
>
>I can try, but as you say doesn't seem trivial at all.

So perhaps better to continue assuming a well behaved Dom0 here for
now?

Jan



* Re: [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space
  2017-07-14 16:41         ` Roger Pau Monné
@ 2017-07-28 12:25           ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-07-28 12:25 UTC (permalink / raw)
  To: roger.pau
  Cc: wei.liu2, andrew.cooper3, ian.jackson, julien.grall, paul.durrant,
	xen-devel, boris.ostrovsky

>>> Roger Pau Monné <roger.pau@citrix.com> 07/14/17 6:42 PM >>>
>On Fri, Jul 14, 2017 at 10:01:54AM -0600, Jan Beulich wrote:
>> >>> On 14.07.17 at 17:33, <roger.pau@citrix.com> wrote:
>> > On Thu, Jul 13, 2017 at 08:36:18AM -0600, Jan Beulich wrote:
>> >> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>> >> > --- a/xen/arch/arm/xen.lds.S
>> >> > +++ b/xen/arch/arm/xen.lds.S
>> >> > @@ -41,6 +41,9 @@ SECTIONS
>> >> >  
>> >> >    . = ALIGN(PAGE_SIZE);
>> >> >    .rodata : {
>> >> > +       __start_vpci_array = .;
>> >> > +       *(.rodata.vpci)
>> >> > +       __end_vpci_array = .;
>> >> 
>> >> Do you really need this (unconditionally)?
>> > 
>> > Right, this should have a ifdef CONFIG_PCI.
>> 
>> CONFIG_HAS_PCI for one, and then ARM doesn't select this at
>> all. Hence the question.
>
>I think it would be better to just add it now? The code is not really
>x86 specific (although it's only used by x86 ATM). IMHO adding a
>CONFIG_HAS_PCI to both linker scripts is the best solution.

Yeah, adding it with the conditional would be fine to me. Eventually
we'll want to break out common pieces like this anyway.

Jan



* Re: [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0
  2017-07-20 10:23     ` Roger Pau Monne
@ 2017-07-28 12:31       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-07-28 12:31 UTC (permalink / raw)
  To: roger.pau; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> Roger Pau Monne <roger.pau@citrix.com> 07/20/17 12:24 PM >>>
>On Fri, Jul 14, 2017 at 04:32:19AM -0600, Jan Beulich wrote:
>> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> > So that hotplug (or MMCFG regions not present in the MCFG ACPI table)
>> > can be added at run time by the hardware domain.
>> 
>> I think the emphasis should be the other way around. I'm rather certain
>> hotplug of bridges doesn't really work right now anyway; at least
>> IO-APIC hotplug code is completely missing.
>
>IO-APICs can also be hot-plugged? Didn't even know about that...

Think of hot adding an entire node.

>> > When a new MMCFG area is added to a PVH Dom0, Xen will scan it and add
>> > the devices to the hardware domain.
>> 
>> Adding the MMIO regions is certainly necessary, but what's the point of
>> also scanning the bus and adding the devices?
>
>It's not strictly necessary, the same can be accomplished by Dom0
>calling PHYSDEVOP_manage_pci_add on each device.
>
>Just thought it wouldn't hurt to do it here, but given your comment
>below I'm not sure. I will wait for your reply before deciding what to
>do.
>
>> We expect Dom0 to tell us
>> anyway, and not doing the scan in Xen avoids complications we presently
>> have in the segment 0 case when Dom0 decides to re-number busses (e.g.
>> in order to fit in SR-IOV VFs).
>
>Is this renumbering performed by changing the
>Primary/Secondary/Subordinate bus number registers in the bridge?

Yes.

>If so we could detect such accesses (by adding traps to type 01h
>headers) and react accordingly.

Yes.

>What if Dom0 re-numbers the bus after having already registered the
>devices with Xen?

The expectation would be for Dom0 to first unregister all devices in the
sub-topology, to the re-numbering, and then re-add them. That doesn't
happen in Linux though, afair.

>> > @@ -1110,6 +1110,37 @@ void __hwdom_init setup_hwdom_pci_devices(
>> >      pcidevs_unlock();
>> >  }
>> >  
>> > +static int add_device(uint8_t devfn, struct pci_dev *pdev)
>> > +{
>> > +    return iommu_add_device(pdev);
>> > +}
>> 
>> You're discarding devfn here, just for iommu_add_device() to re-do the
>> phantom function handling. At the very least this is wasteful. Perhaps
>> you minimally want to call iommu_add_device() only when
>> devfn == pdev->devfn (if all of this code stays in the first place)?
>
>Doesn't the IOMMU also need to know about the phantom functions in
>order to add translations for them too?

Yes, that's why iommu_add_device() and others have a respective loop.

Jan



* Re: [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device
  2017-07-20 14:00     ` Roger Pau Monne
  2017-07-20 14:05       ` Roger Pau Monne
@ 2017-07-29 16:32       ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-07-29 16:32 UTC (permalink / raw)
  To: roger.pau; +Cc: xen-devel, julien.grall, boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 07/20/17 4:00 PM >>>
>On Fri, Jul 14, 2017 at 04:33:20AM -0600, Jan Beulich wrote:
> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> > --- a/xen/drivers/passthrough/pci.c
>> > +++ b/xen/drivers/passthrough/pci.c
>> > @@ -588,6 +588,54 @@ static void pci_enable_acs(struct pci_dev *pdev)
>> >      pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
>> >  }
>> >  
>> > +int pci_size_mem_bar(unsigned int seg, unsigned int bus, unsigned int slot,
>> > +                     unsigned int func, unsigned int pos, bool last,
>> > +                     uint64_t *paddr, uint64_t *psize)
>> > +{
>> > +    uint32_t hi = 0, bar = pci_conf_read32(seg, bus, slot, func, pos);
>> > +    uint64_t addr, size;
>> > +
>> > +    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
>> > +    pci_conf_write32(seg, bus, slot, func, pos, ~0);
>> > +    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
>> > +         PCI_BASE_ADDRESS_MEM_TYPE_64 )
>> > +    {
>> > +        if ( last )
>> > +        {
>> > +            printk(XENLOG_WARNING
>> > +                    "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
>> 
>> This message needs to tell what kind of slot is being processed (just
>> like the original did).
>
>The original message is:
>
>"SR-IOV device %04x:%02x:%02x.%u with 64-bit vf BAR in last slot"
>
>I guess you would like to have the "vf" again, in which case I will
>add a bool vf parameter to the function that's only going to be used
>here.

Note also the "SR-IOV" at the beginning. But either part would be sufficient.

> IMHO I'm not really sure it's worth it because I don't find it
>that informative. I thought that just knowing the device sbdf is
>enough.

It allows deducing the situation in which this function is being called.

>> > +    addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
>> > +
>> > +    if ( paddr )
>> > +        *paddr = addr;
>> > +    if ( psize )
>> > +        *psize = size;
>> 
>> Is it reasonable to expect the caller to not care about the size?
>
>Not at the moment, so I guess ASSERT(psize) would be better.

I don't even see a need for such an ASSERT().

Jan



* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-07-24 14:58     ` Roger Pau Monne
@ 2017-07-29 16:44       ` Jan Beulich
  2017-08-08 12:35         ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-07-29 16:44 UTC (permalink / raw)
  To: roger.pau
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	tim, julien.grall, xen-devel, boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
>On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
>> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> > +    list_for_each_entry(pdev, &d->arch.pdev_list, domain_list)
>> > +    {
>> > +        uint16_t cmd = pci_conf_read16(pdev->seg, pdev->bus,
>> > +                                       PCI_SLOT(pdev->devfn),
>> > +                                       PCI_FUNC(pdev->devfn),
>> > +                                       PCI_COMMAND);
>> 
>> This is quite a lot of overhead - a loop over all devices plus a config
>> space read on each one. What state the memory decode bit is in
>> could be recorded in the ->enabled flag, couldn't it? And devices on
>> different sub-branches of the topology can't possibly have
>> overlapping entries that we need to worry about, as the bridge
>> windows would suppress actual accesses.
>
>Oh, so Xen only needs to care about devices that share the same
>bridge, because that is the only case where the same page can be
>shared by multiple devices?

Yes, that's my understanding (unless bridge windows overlap, in which
case I don't know what the resulting behavior would be).

>In any case, the Dom0 is free to wrongly position the BARs anywhere it
>wants, thus possibly placing them outside of the bridge windows, in
>which case I think we should better check all assigned devices.

As an initial solution this _may_ be good enough, but beware of systems
with very many devices.

>> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
>> > +                           union vpci_val val, void *data)
>> > +{
>> > +    struct vpci_bar *bar = data;
>> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
>> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>> > +    uint32_t wdata = val.u32, size_mask;
>> > +    bool hi = false;
>> > +
>> > +    switch ( bar->type )
>> > +    {
>> > +    case VPCI_BAR_MEM32:
>> > +    case VPCI_BAR_MEM64_LO:
>> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
>> > +        break;
>> > +    case VPCI_BAR_MEM64_HI:
>> > +        size_mask = ~0u;
>> > +        break;
>> > +    default:
>> > +        ASSERT_UNREACHABLE();
>> > +        return;
>> > +    }
>> > +
>> > +    if ( (wdata & size_mask) == size_mask )
>> > +    {
>> > +        /* Next reads from this register are going to return the BAR size. */
>> > +        bar->sizing = true;
>> > +        return;
>> 
>> I think the comment needs extending to explain why the written
>> sizing value can't possibly be an address. This is particularly
>> relevant because I'm not sure that assumption would hold on e.g.
>> ARM (which I don't think has guaranteed ROM right below 4Gb).
>
>Hm, right. Maybe it would be best to detect sizing by checking, when
>performing a read, whether the stored address is ~0 on the high bits and
>~0 & PCI_BASE_ADDRESS_MEM_MASK on the lower ones, instead of doing this
>kind of partial guessing as done here; it's certainly not very robust.

I don't understand, particularly because you say "when performing a read".
Or do you mean to do away with the "sizing" flag altogether?
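
For illustration, the all-ones test Roger describes might look like the sketch below (the name and the exact mask handling are assumptions, not taken from the series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BASE_ADDRESS_MEM_MASK  (~0x0fU)  /* address bits of a memory BAR */

/*
 * Treat the stored value as a sizing command when all address bits are
 * set: the full 32 bits for the high half of a 64-bit BAR, and the
 * bits above the low four for a 32-bit (or 64-bit low half) memory BAR.
 */
bool bar_sizing(uint32_t stored, bool hi)
{
    uint32_t mask = hi ? ~0u : (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;

    return (stored & mask) == mask;
}
```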

>> > +        /* Size the BAR and map it. */
>> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
>> > +                              &addr, &size);
>> > +        if ( rc < 0 )
>> > +            return rc;
>> > +
>> > +        if ( size == 0 )
>> > +        {
>> > +            bars[i].type = VPCI_BAR_EMPTY;
>> > +            continue;
>> > +        }
>> > +
>> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
>> 
>> This doesn't match up with logic further up: When the memory decode
>> bit gets cleared, you don't zap the addresses, so I think you'd better
>> store it here too. Use INVALID_PADDR only when the value read has
>> all address bits set (same caveat as pointed out earlier).
>
>OK, note that .addr can only possibly be INVALID_PADDR at
>initialization time, once the user has written something to the BAR
>.addr will be different than INVALID_PADDR.

Which is part of what worries me - it would be better if the field wouldn't
ever hold a special init-time-only value.

Jan



* Re: [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-06-30 15:01 ` [PATCH v4 7/9] vpci/msi: add MSI handlers Roger Pau Monne
  2017-07-18  8:56   ` Paul Durrant
@ 2017-08-02 13:34   ` Jan Beulich
  2017-08-08 15:44     ` Roger Pau Monné
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-08-02 13:34 UTC (permalink / raw)
  To: roger.pau
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

>>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>Add handlers for the MSI control, address, data and mask fields in
>order to detect accesses to them and setup the interrupts as requested
>by the guest.
>
>Note that the pending register is not trapped, and the guest can
>freely read/write to it.
>
>Whether Xen is going to provide this functionality to Dom0 (MSI
>emulation) is controlled by the "msi" option in the dom0 field. When
>disabling this option Xen will hide the MSI capability structure from
>Dom0.

Isn't this last paragraph stale now?

>+void vpci_msi_arch_mask(struct vpci_arch_msi *arch, struct pci_dev *pdev,
>+                        unsigned int entry, bool mask)
>+{
>+    struct domain *d = pdev->domain;
>+    const struct pirq *pinfo;
>+    struct irq_desc *desc;
>+    unsigned long flags;
>+    int irq;
>+
>+    ASSERT(arch->pirq >= 0);
>+    pinfo = pirq_info(d, arch->pirq + entry);
>+    ASSERT(pinfo);
>+
>+    irq = pinfo->arch.irq;
>+    ASSERT(irq < nr_irqs && irq >= 0);
>+
>+    desc = irq_to_desc(irq);
>+    ASSERT(desc);

I know the goal is Dom0 support only at this point, but nevertheless I think
we shouldn't have ASSERT()s in place which could trigger if Dom0
misbehaves (and which would all need to be audited if we were to extend
support to DomU): I'm not convinced all of the ones above could really only
trigger depending on Xen (mis)behavior.

>+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
>+                         uint64_t address, uint32_t data, unsigned int vectors)
>+{
>+    struct msi_info msi_info = {
>+        .seg = pdev->seg,
>+        .bus = pdev->bus,
>+        .devfn = pdev->devfn,
>+        .entry_nr = vectors,
>+    };
>+    unsigned int i;
>+    int rc;
>+
>+    ASSERT(arch->pirq == -1);

Please introduce a #define for the -1 here, to allow easily matching up
producer and consumer side(s).
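
A minimal sketch of the named sentinel being requested (the name INVALID_PIRQ is illustrative here, not from the patch):

```c
#include <assert.h>

#define INVALID_PIRQ (-1)

struct vpci_arch_msi {
    int pirq;
};

/* Producer side: initialize the PIRQ field to the sentinel. */
void vpci_msi_arch_init(struct vpci_arch_msi *arch)
{
    arch->pirq = INVALID_PIRQ;
}

/* Helper so the producer/consumer pairing is easy to check. */
int arch_pirq_after_init(void)
{
    struct vpci_arch_msi a;

    vpci_msi_arch_init(&a);
    return a.pirq;
}
```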

>+    /* Get a PIRQ. */
>+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
>+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
>+    if ( rc )
>+    {
>+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
>+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>+                PCI_FUNC(pdev->devfn), rc);
>+        return rc;
>+    }
>+
>+    for ( i = 0; i < vectors; i++ )
>+    {
>+        xen_domctl_bind_pt_irq_t bind = {
>+            .machine_irq = arch->pirq + i,
>+            .irq_type = PT_IRQ_TYPE_MSI,
>+            .u.msi.gvec = msi_vector(data) + i,
>+            .u.msi.gflags = msi_flags(data, address),
>+        };
>+
>+        pcidevs_lock();
>+        rc = pt_irq_create_bind(pdev->domain, &bind);
>+        if ( rc )
>+        {
>+            dprintk(XENLOG_ERR,
>+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
>+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
>+            spin_lock(&pdev->domain->event_lock);
>+            unmap_domain_pirq(pdev->domain, arch->pirq);

Don't you also need to undo the pt_irq_create_bind() calls here for all prior
successful iterations?
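
The unwind pattern being asked for can be sketched in isolation as below (bind()/unbind() are stand-ins for pt_irq_create_bind()/pt_irq_destroy_bind(); everything here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

static int bound[8];

static int bind(unsigned int v, bool fail)
{
    if ( fail )
        return -1;
    bound[v] = 1;
    return 0;
}

static void unbind(unsigned int v)
{
    bound[v] = 0;
}

/* Bind vectors 0..vectors-1; on failure undo every prior success. */
int bind_vectors(unsigned int vectors, unsigned int fail_at)
{
    unsigned int i;
    int rc = 0;

    for ( i = 0; i < vectors; i++ )
    {
        rc = bind(i, i == fail_at);
        if ( rc )
        {
            /* Unwind all previously successful iterations. */
            while ( i-- )
                unbind(i);
            break;
        }
    }

    return rc;
}
```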

>+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
>+                          unsigned int vectors)
>+{
>+    unsigned int i;
>+
>+    ASSERT(arch->pirq != -1);
>+
>+    for ( i = 0; i < vectors; i++ )
>+    {
>+        xen_domctl_bind_pt_irq_t bind = {
>+            .machine_irq = arch->pirq + i,
>+            .irq_type = PT_IRQ_TYPE_MSI,
>+        };
>+
>+        pcidevs_lock();
>+        pt_irq_destroy_bind(pdev->domain, &bind);

While I agree that the loop should continue if this fails, I'm not convinced
you should entirely ignore the return value here.

>+        pcidevs_unlock();
>+    }
>+
>+    pcidevs_lock();

What good does it do to acquire the lock for most of the loop body as well
as for most of the epilogue, instead of just acquiring it once ahead of the
loop?

>+int vpci_msi_arch_init(struct vpci_arch_msi *arch)
>+{
>+    arch->pirq = -1;
>+    return 0;
>+}

At this point I think the function would better return void.

>+void vpci_msi_arch_print(struct vpci_arch_msi *arch, uint16_t data,

const

>+                         uint64_t addr)
>+{
>+    printk("vec=%#02x%7s%6s%3sassert%5s%7s dest_id=%lu pirq: %d\n",
>+           MASK_EXTR(data, MSI_DATA_VECTOR_MASK),
>+           data & MSI_DATA_DELIVERY_LOWPRI ? "lowest" : "fixed",
>+           data & MSI_DATA_TRIGGER_LEVEL ? "level" : "edge",
>+           data & MSI_DATA_LEVEL_ASSERT ? "" : "de",
>+           addr & MSI_ADDR_DESTMODE_LOGIC ? "log" : "phys",
>+           addr & MSI_ADDR_REDIRECTION_LOWPRI ? "lowest" : "cpu",

Why "cpu"? Elsewhere we call this mode "fixed".

>--- /dev/null
>+++ b/xen/drivers/vpci/msi.c
>@@ -0,0 +1,348 @@
>+/*
>+ * Handlers for accesses to the MSI capability structure.
>+ *
>+ * Copyright (C) 2017 Citrix Systems R&D
>+ *
>+ * This program is free software; you can redistribute it and/or
>+ * modify it under the terms and conditions of the GNU General Public
>+ * License, version 2, as published by the Free Software Foundation.
>+ *
>+ * This program is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+ * General Public License for more details.
>+ *
>+ * You should have received a copy of the GNU General Public
>+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
>+ */
>+
>+#include <xen/sched.h>
>+#include <xen/vpci.h>
>+#include <asm/msi.h>
>+#include <xen/keyhandler.h>

Please don't mix xen/ and asm/ includes, and where possible please also
alphabetically sort each group.

>+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
>+static void vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
>+                                  union vpci_val *val, void *data)
>+{
>+    const struct vpci_msi *msi = data;
>+
>+    /* Set multiple message capable. */
>+    val->u16 = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);

The comment is somewhat misleading - whether the device is multi-message
capable depends on msi->max_vectors.

>+    if ( msi->enabled ) {

Style.

>+        val->u16 |= PCI_MSI_FLAGS_ENABLE;
>+        val->u16 |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);

Why is reading back the proper value here dependent upon MSI being
enabled?

>+static void vpci_msi_control_write(struct pci_dev *pdev, unsigned int reg,
>+                                   union vpci_val val, void *data)
>+{
>+    struct vpci_msi *msi = data;
>+    unsigned int vectors = 1 << MASK_EXTR(val.u16, PCI_MSI_FLAGS_QSIZE);
>+    int ret;
>+
>+    if ( vectors > msi->max_vectors )
>+        vectors = msi->max_vectors;
>+
>+    if ( !!(val.u16 & PCI_MSI_FLAGS_ENABLE) == msi->enabled &&
>+         (vectors == msi->vectors || !msi->enabled) )
>+        return;
>+
>+    if ( val.u16 & PCI_MSI_FLAGS_ENABLE )
>+    {
>+        if ( msi->enabled )
>+        {
>+            /*
>+             * Change to the number of enabled vectors, disable and
>+             * enable MSI in order to apply it.
>+             */

At least the first part of the comment would appear to belong outside the
inner if().

>+            ret = vpci_msi_disable(pdev, msi);
>+            if ( ret )
>+                return;

Returning here without doing anything is at least strange, and hence
would call for a comment to be attached to explain the intentions.

>+static void vpci_msi_address_write(struct pci_dev *pdev, unsigned int reg,
>+                                   union vpci_val val, void *data)
>+{
>+    struct vpci_msi *msi = data;
>+
>+    /* Clear low part. */
>+    msi->address &= ~(uint64_t)0xffffffff;

~0xffffffffull?

>+static void vpci_msi_address_upper_write(struct pci_dev *pdev, unsigned int reg,
>+                                         union vpci_val val, void *data)
>+{
>+    struct vpci_msi *msi = data;
>+
>+    /* Clear high part. */
>+    msi->address &= ~((uint64_t)0xffffffff << 32);

Simply 0xffffffff?
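
Both suggested spellings rely on the usual arithmetic conversions; as an illustrative check (the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Replace the low 32 bits of a 64-bit MSI address. */
uint64_t set_low32(uint64_t address, uint32_t lo)
{
    address &= ~0xffffffffull;      /* clear the low 32 bits */
    return address | lo;
}

/* Replace the high 32 bits of a 64-bit MSI address. */
uint64_t set_high32(uint64_t address, uint32_t hi)
{
    address &= 0xffffffff;          /* clear the high 32 bits */
    return address | ((uint64_t)hi << 32);
}
```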

>+static void vpci_msi_mask_read(struct pci_dev *pdev, unsigned int reg,

Earlier read/write pairs had comments ahead of them - for consistency
one would then belong here too.

>+static void vpci_msi_mask_write(struct pci_dev *pdev, unsigned int reg,
>+                                union vpci_val val, void *data)
>+{
>+    struct vpci_msi *msi = data;
>+    uint32_t dmask;
>+
>+    dmask = msi->mask ^ val.u32;
>+
>+    if ( !dmask )
>+        return;
>+
>+    if ( msi->enabled )
>+    {
>+        unsigned int i;
>+
>+        for ( i = ffs(dmask) - 1; dmask && i < msi->vectors;
>+              i = ffs(dmask) - 1 )
>+        {
>+            vpci_msi_arch_mask(&msi->arch, pdev, i, MASK_EXTR(val.u32, 1 << i));

I don't think using MASK_EXTR() here is really advisable? Could be as simple
as (val.u32 >> i) & 1.
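
The suggested simplification, in an isolated and testable form (masked[] and the helpers stand in for vpci_msi_arch_mask(); illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <strings.h>  /* ffs() */

static uint8_t masked[32];

/* Walk only the changed bits, extracting each new value with a shift. */
static void update_mask(uint32_t old, uint32_t val, unsigned int vectors)
{
    uint32_t dmask = old ^ val;
    unsigned int i;

    for ( i = ffs(dmask) - 1; dmask && i < vectors; i = ffs(dmask) - 1 )
    {
        masked[i] = (val >> i) & 1;   /* simpler than MASK_EXTR(val, 1 << i) */
        dmask &= ~(1u << i);
    }
}

/* Helper: apply an update, then report the state of one vector. */
static uint8_t mask_after(uint32_t old, uint32_t val, unsigned int vectors,
                          unsigned int idx)
{
    update_mask(old, val, vectors);
    return masked[idx];
}
```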

>+static int vpci_init_msi(struct pci_dev *pdev)
>+{
>+    uint8_t seg = pdev->seg, bus = pdev->bus;
>+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>+    struct vpci_msi *msi;
>+    unsigned int msi_offset;

Elsewhere we call such variables just "pos".

>+    uint16_t control;
>+    int ret;
>+
>+    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>+    if ( !msi_offset )
>+        return 0;
>+
>+    msi = xzalloc(struct vpci_msi);
>+    if ( !msi )
>+        return -ENOMEM;
>+
>+    msi->pos = msi_offset;
>+
>+    control = pci_conf_read16(seg, bus, slot, func,
>+                              msi_control_reg(msi_offset));

You don't use the value until ...

>+    ret = vpci_add_register(pdev, vpci_msi_control_read,
>+                            vpci_msi_control_write,
>+                            msi_control_reg(msi_offset), 2, msi);
>+    if ( ret )
>+        goto error;
>+
>+    /* Get the maximum number of vectors the device supports. */
>+    msi->max_vectors = multi_msi_capable(control);

... here. Please move the read down.

>+    ASSERT(msi->max_vectors <= 32);
>+
>+    /* No PIRQ bind yet. */
>+    vpci_msi_arch_init(&msi->arch);

s/bind/bound/ in the comment?

>...
>+ error:
>+    ASSERT(ret);
>+    xfree(msi);
>+    return ret;
>+}

Don't you also need to unregister address handlers you've registered?

>+void vpci_dump_msi(void)
>+{
>+    struct domain *d;
>+
>+    for_each_domain ( d )
>+    {
>+        const struct pci_dev *pdev;
>+
>+        if ( !has_vpci(d) )
>+            continue;
>+
>+        printk("vPCI MSI information for guest %u\n", d->domain_id);

"... for Dom%d" or "... for d%d" please.

>...
>+            if ( msi->masking )
>+                printk("mask=%#032x\n", msi->mask);

Why 30 hex digits? And generally # should be used only when not blank or
zero padding the value (as field width includes the 0x prefix).
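
For reference, '#' folds the "0x" prefix into the field width and emits no prefix at all for a zero value, which is why an explicit "0x%08x" tends to be clearer; a quick demonstration (the helper name is made up):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Nonzero iff snprintf(fmt, v) produces exactly the expected string. */
int fmt_is(const char *fmt, unsigned int v, const char *expected)
{
    char buf[40];

    snprintf(buf, sizeof(buf), fmt, v);
    return strcmp(buf, expected) == 0;
}
```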

>--- a/xen/include/asm-x86/msi.h
>+++ b/xen/include/asm-x86/msi.h
>@@ -48,6 +48,7 @@
>#define MSI_ADDR_REDIRECTION_SHIFT  3
>#define MSI_ADDR_REDIRECTION_CPU    (0 << MSI_ADDR_REDIRECTION_SHIFT)
>#define MSI_ADDR_REDIRECTION_LOWPRI (1 << MSI_ADDR_REDIRECTION_SHIFT)
>+#define MSI_ADDR_REDIRECTION_MASK   0x8

(1 << MSI_ADDR_REDIRECTION_SHIFT) please.
 
>+    struct vpci_msi {
>+        /* Offset of the capability in the config space. */
>+        unsigned int pos;
>+        /* Maximum number of vectors supported by the device. */
>+        unsigned int max_vectors;
>+        /* Number of vectors configured. */
>+        unsigned int vectors;
>+        /* Address and data fields. */
>+        uint64_t address;
>+        uint16_t data;
>+        /* Mask bitfield. */
>+        uint32_t mask;
>+        /* Enabled? */
>+        bool enabled;
>+        /* Supports per-vector masking? */
>+        bool masking;
>+        /* 64-bit address capable? */
>+        bool address64;
>+        /* Arch-specific data. */
>+        struct vpci_arch_msi arch;
>+    } *msi;

Please try to reduce the number/size of holes.
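
One possible ordering that groups the narrow members, sketched with the fields quoted above (illustrative; not necessarily the layout eventually committed, and vpci_arch_msi reduced to a plain pirq for self-containment):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vpci_arch_msi {
    int pirq;
};

struct vpci_msi {
    uint64_t address;            /* widest member first */
    unsigned int pos;
    unsigned int max_vectors;
    unsigned int vectors;
    uint32_t mask;
    uint16_t data;
    bool enabled;
    bool masking;
    bool address64;              /* the three bools now share one word */
    struct vpci_arch_msi arch;
};
```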

Jan



* Re: [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer
  2017-06-30 15:01 ` [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
@ 2017-08-02 14:13   ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-08-02 14:13 UTC (permalink / raw)
  To: roger.pau; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:02 PM >>>
>--- a/xen/drivers/vpci/header.c
>+++ b/xen/drivers/vpci/header.c
>@@ -459,7 +459,7 @@ static int vpci_init_bars(struct pci_dev *pdev)
>return 0;
>}
 >
>-REGISTER_VPCI_INIT(vpci_init_bars);
>+REGISTER_VPCI_INIT(vpci_init_bars, VPCI_PRIORITY_LOW);
 
Why "LOW"? I'd expect the BARs to possibly have further dependents, so
their init should be somewhere in the middle.

>--- a/xen/drivers/vpci/msi.c
>+++ b/xen/drivers/vpci/msi.c
>@@ -290,7 +290,7 @@ static int vpci_init_msi(struct pci_dev *pdev)
>return ret;
>}
 >
>-REGISTER_VPCI_INIT(vpci_init_msi);
>+REGISTER_VPCI_INIT(vpci_init_msi, VPCI_PRIORITY_LOW);
 
Whereas these indeed are unlikely to have further dependents.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-06-30 15:01 ` [PATCH v4 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
@ 2017-08-02 15:07   ` Jan Beulich
  2017-08-10 17:04     ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-08-02 15:07 UTC (permalink / raw)
  To: roger.pau; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
>BIR are not trapped by Xen at the moment.

They're mandated r/o by the spec anyway.

>@@ -113,6 +148,35 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
>if ( IS_ERR(mem) )
>return -PTR_ERR(mem);
 >
>+    /*
>+     * Make sure the MSI-X regions of the BAR are not mapped into the domain
>+     * p2m, or else the MSI-X handlers are useless. Only do this when mapping,
>+     * since that's when the memory decoding on the device is enabled.
>+     */
>+    for ( i = 0; map && i < ARRAY_SIZE(bar->msix); i++ )
>+    {
>+        struct vpci_msix_mem *msix = bar->msix[i];
>+
>+        if ( !msix || msix->addr == INVALID_PADDR )
>+            continue;
>+
>+        rc = vpci_unmap_msix(d, msix);

Why do you need this, instead of being able to simply rely on the rangeset
based (un)mapping?

>@@ -405,7 +475,20 @@ static int vpci_init_bars(struct pci_dev *pdev)
>continue;
>}
 >
>-        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
>+        if ( cmd & PCI_COMMAND_MEMORY )
>+        {
>+            unsigned int j;
>+
>+            bars[i].addr = addr;
>+
>+            for ( j = 0; j < ARRAY_SIZE(bars[i].msix); j++ )
>+                if ( bars[i].msix[j] )
>+                    bars[i].msix[j]->addr = bars[i].addr +
>+                                            bars[i].msix[j]->offset;
>+        }
>+        else
>+            bars[i].addr = INVALID_PADDR;

As (I think) mentioned elsewhere already, this init-time special case looks
dangerous (and unnecessary) to me (or else I'd expect you to also zap
the field when the memory decode bit is being cleared).

>--- /dev/null
>+++ b/xen/drivers/vpci/msix.c
>@@ -0,0 +1,503 @@
>+/*
>+ * Handlers for accesses to the MSI-X capability structure and the memory
>+ * region.
>+ *
>+ * Copyright (C) 2017 Citrix Systems R&D
>+ *
>+ * This program is free software; you can redistribute it and/or
>+ * modify it under the terms and conditions of the GNU General Public
>+ * License, version 2, as published by the Free Software Foundation.
>+ *
>+ * This program is distributed in the hope that it will be useful,
>+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
>+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>+ * General Public License for more details.
>+ *
>+ * You should have received a copy of the GNU General Public
>+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
>+ */
>+
>+#include <xen/sched.h>
>+#include <xen/vpci.h>
>+#include <asm/msi.h>
>+#include <xen/p2m-common.h>
>+#include <xen/keyhandler.h>
>+
>+#define MSIX_SIZE(num) (offsetof(struct vpci_msix, entries[num]))

The outermost parens are pointless here.

>+static void vpci_msix_control_write(struct pci_dev *pdev, unsigned int reg,
>+                                    union vpci_val val, void *data)
>+{
>+    uint8_t seg = pdev->seg, bus = pdev->bus;
>+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>+    struct vpci_msix *msix = data;
>+    paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
>+    bool new_masked, new_enabled;
>+    unsigned int i;
>+    int rc;
>+
>+    new_masked = val.u16 & PCI_MSIX_FLAGS_MASKALL;
>+    new_enabled = val.u16 & PCI_MSIX_FLAGS_ENABLE;
>+
>+    if ( !msix->enabled && new_enabled )
>+    {
>+        /* MSI-X enabled. */

Insert "being"?

>+        for ( i = 0; i < msix->max_entries; i++ )
>+        {
>+            if ( msix->entries[i].masked )
>+                continue;

Why is the mask bit relevant here, but not the mask-all one?

>+            rc = vpci_msix_arch_enable(&msix->entries[i].arch, pdev,
>+                                       msix->entries[i].addr,
>+                                       msix->entries[i].data,
>+                                       msix->entries[i].nr, table_base);
>+            if ( rc )
>+            {
>+                gdprintk(XENLOG_ERR,
>+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
>+                         seg, bus, slot, func, i, rc);
>+                return;
>+            }
>+
>+            vpci_msix_arch_mask(&msix->entries[i].arch, pdev, false);

Same question here.

>+        }
>+    }
>+    else if ( msix->enabled && !new_enabled )
>+    {
>+        /* MSI-X disabled. */
>+        for ( i = 0; i < msix->max_entries; i++ )
>+        {
>+            rc = vpci_msix_arch_disable(&msix->entries[i].arch, pdev);
>+            if ( rc )
>+            {
>+                gdprintk(XENLOG_ERR,
>+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
>+                         seg, bus, slot, func, i, rc);
>+                return;
>+            }
>+        }
>+    }
>+
>+    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
>+         pci_msi_conf_write_intercept(pdev, reg, 2, &val.u32) >= 0 )
>+        pci_conf_write16(seg, bus, slot, func, reg, val.u32);

DYM val.u16 here?

>+static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
>+{
>+    struct vpci_msix *msix;
>+
>+    ASSERT(vpci_locked(d));
>+    list_for_each_entry ( msix,  &d->arch.hvm_domain.msix_tables, next )
>+    {
>+        uint8_t seg = msix->pdev->seg, bus = msix->pdev->bus;
>+        uint8_t slot = PCI_SLOT(msix->pdev->devfn);
>+        uint8_t func = PCI_FUNC(msix->pdev->devfn);
>+        uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);

Perhaps better to keep a cached copy of the command register value?

>+static int vpci_msix_read(struct vcpu *v, unsigned long addr,
>+                          unsigned int len, unsigned long *data)
>+{
>+    struct domain *d = v->domain;
>+    struct vpci_msix *msix;
>+    const struct vpci_msix_entry *entry;
>+    unsigned int offset;
>+
>+    vpci_lock(d);
>+
>+    msix = vpci_msix_find(d, addr);
>+    if ( !msix )
>+    {
>+        vpci_unlock(d);
>+        *data = ~0ul;
>+        return X86EMUL_OKAY;
>+    }
>+
>+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
>+    {
>+        vpci_unlock(d);
>+        *data = ~0ul;
>+        return X86EMUL_OKAY;
>+    }
>+
>+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
>+    {
>+        /* Access to PBA. */
>+        switch ( len )
>+        {
>+        case 4:
>+            *data = readl(addr);
>+            break;
>+        case 8:
>+            *data = readq(addr);
>+            break;
>+        default:
>+            ASSERT_UNREACHABLE();

*data = ~0ul;

>+static int vpci_msix_write(struct vcpu *v, unsigned long addr,
>+                                 unsigned int len, unsigned long data)
>+{
>+    struct domain *d = v->domain;
>+    struct vpci_msix *msix;
>+    struct vpci_msix_entry *entry;
>+    unsigned int offset;
>+
>+    vpci_lock(d);
>+    msix = vpci_msix_find(d, addr);
>+    if ( !msix )
>+    {
>+        vpci_unlock(d);
>+        return X86EMUL_OKAY;
>+    }
>+
>+    if ( MSIX_ADDR_IN_RANGE(addr, &msix->pba) )
>+    {
>+        /* Ignore writes to PBA, it's behavior is undefined. */
>+        vpci_unlock(d);
>+        return X86EMUL_OKAY;
>+    }
>+
>+    if ( vpci_msix_access_check(msix->pdev, addr, len) )
>+    {
>+        vpci_unlock(d);
>+        return X86EMUL_OKAY;
>+    }
>+
>+    /* Get the table entry and offset. */
>+    entry = vpci_msix_get_entry(msix, addr);
>+    offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
>+
>+    switch ( offset )
>+    {
>+    case PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET:
>+        if ( len == 8 )
>+        {
>+            entry->addr = data;
>+            break;
>+        }
>+        entry->addr &= ~0xffffffff;

With this, why not ...

>+        entry->addr |= data;
>+        break;
>+    case PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET:
>+        entry->addr &= ~((uint64_t)0xffffffff << 32);

... simply 0xffffffff here?

>+        entry->addr |= data << 32;

Iirc we've already talked about this being undefined for 32-bit arches (e.g.
ARM32), and the resulting need to make "data" uint64_t.

>+        break;
>+    case PCI_MSIX_ENTRY_DATA_OFFSET:
>+        /*
>+         * 8 byte writes to the msg data and vector control fields are
>+         * only allowed if the entry is masked.
>+         */
>+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
>+        {
>+            vpci_unlock(d);
>+            return X86EMUL_OKAY;
>+        }

I don't think this is correct - iirc such writes simply don't take effect immediately
(but I then seem to recall this to apply to the address field and 32-bit writes to
the data field as well). They'd instead take effect the next time the entry is being
unmasked (or some such). A while ago I did fix the qemu code to behave in this
way.

>+        entry->data = data;
>+
>+        if ( len == 4 )
>+            break;
>+
>+        data >>= 32;
>+        /* fallthrough */
>+    case PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET:
>+    {
>+        bool new_masked = data & PCI_MSIX_VECTOR_BITMASK;
>+        struct pci_dev *pdev = msix->pdev;
>+        paddr_t table_base = pdev->vpci->header.bars[msix->table.bir].addr;
>+        int rc;
>+
>+        if ( !msix->enabled )
>+        {
>+            entry->masked = new_masked;
>+            break;
>+        }
>+
>+        if ( new_masked != entry->masked && !new_masked )

if ( !new_masked && entry->masked )

(or the other way around)

>+static int vpci_init_msix(struct pci_dev *pdev)
>+{
>+    struct domain *d = pdev->domain;
>+    uint8_t seg = pdev->seg, bus = pdev->bus;
>+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
>+    struct vpci_msix *msix;
>+    unsigned int msix_offset, i, max_entries;
>+    struct vpci_bar *table_bar, *pba_bar;
>+    uint16_t control;
>+    int rc;
>+
>+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
>+    if ( !msix_offset )
>+        return 0;
>+
>+    control = pci_conf_read16(seg, bus, slot, func,
>+                              msix_control_reg(msix_offset));
>+
>+    /* Get the maximum number of vectors the device supports. */
>+    max_entries = msix_table_size(control);
>+
>+    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
>+    if ( !msix )
>+        return -ENOMEM;
>+
>+    msix->max_entries = max_entries;
>+    msix->pdev = pdev;
>+
>+    /* Find the MSI-X table address. */
>+    msix->table.offset = pci_conf_read32(seg, bus, slot, func,
>+                                         msix_table_offset_reg(msix_offset));
>+    msix->table.bir = msix->table.offset & PCI_MSIX_BIRMASK;
>+    msix->table.offset &= ~PCI_MSIX_BIRMASK;
>+    msix->table.size = msix->max_entries * PCI_MSIX_ENTRY_SIZE;
>+    msix->table.addr = INVALID_PADDR;
>+
>+    /* Find the MSI-X pba address. */
>+    msix->pba.offset = pci_conf_read32(seg, bus, slot, func,
>+                                       msix_pba_offset_reg(msix_offset));
>+    msix->pba.bir = msix->pba.offset & PCI_MSIX_BIRMASK;
>+    msix->pba.offset &= ~PCI_MSIX_BIRMASK;
>+    msix->pba.size = DIV_ROUND_UP(msix->max_entries, 8);

I think you want to round up to at least the next 32-bit boundary; the
spec talking about bits 63..00 even suggests a 64-bit boundary. The
table addresses being required to be qword aligned also supports this.

>+void vpci_dump_msix(void)
>+{
>+    struct domain *d;
>+    struct pci_dev *pdev;

const for all pointers in dump handlers, as far as possible.

>+    for_each_domain ( d )
>+    {
>+        if ( !has_vpci(d) )
>+            continue;
>+
>+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);

Wouldn't it be better (more useful) to dump the MSI and MSI-X data for a
domain next to each other?

Apart from the comments here the ones give for the MSI patch apply
respectively.

Jan



* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-07-29 16:44       ` Jan Beulich
@ 2017-08-08 12:35         ` Roger Pau Monné
  2017-08-09  8:17           ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-08 12:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	tim, julien.grall, xen-devel, boris.ostrovsky

On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> >> > +static void vpci_bar_write(struct pci_dev *pdev, unsigned int reg,
> >> > +                           union vpci_val val, void *data)
> >> > +{
> >> > +    struct vpci_bar *bar = data;
> >> > +    uint8_t seg = pdev->seg, bus = pdev->bus;
> >> > +    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> >> > +    uint32_t wdata = val.u32, size_mask;
> >> > +    bool hi = false;
> >> > +
> >> > +    switch ( bar->type )
> >> > +    {
> >> > +    case VPCI_BAR_MEM32:
> >> > +    case VPCI_BAR_MEM64_LO:
> >> > +        size_mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
> >> > +        break;
> >> > +    case VPCI_BAR_MEM64_HI:
> >> > +        size_mask = ~0u;
> >> > +        break;
> >> > +    default:
> >> > +        ASSERT_UNREACHABLE();
> >> > +        return;
> >> > +    }
> >> > +
> >> > +    if ( (wdata & size_mask) == size_mask )
> >> > +    {
> >> > +        /* Next reads from this register are going to return the BAR size. */
> >> > +        bar->sizing = true;
> >> > +        return;
> >> 
> >> I think the comment needs extending to explain why the written
> >> sizing value can't possibly be an address. This is particularly
> >> relevant because I'm not sure that assumption would hold on e.g.
> >> ARM (which I don't think has guaranteed ROM right below 4Gb).
> >
> >Hm, right. Maybe it would be best to detect sizing by checking that
> >the address when performing a read is ~0 on the high bits and ~0 &
> >PCI_BASE_ADDRESS_MEM_MASK on the lower ones, instead of doing this
> >kind of partial guessing as done here, it's certainly not very robust.
> 
> I don't understand, particularly because you say "when performing a read).
> Or do you mean to do away with the "sizing" flag altogether?

Yes, I've got rid of the "sizing" flag, and now attempts by the guest
to size the BARs are detected during read of the BAR itself, by
checking whether the address matches ~0 in the high part, or
PCI_BASE_ADDRESS_MEM_MASK in the lower part.

> >> > +        /* Size the BAR and map it. */
> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> >> > +                              &addr, &size);
> >> > +        if ( rc < 0 )
> >> > +            return rc;
> >> > +
> >> > +        if ( size == 0 )
> >> > +        {
> >> > +            bars[i].type = VPCI_BAR_EMPTY;
> >> > +            continue;
> >> > +        }
> >> > +
> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> >> 
> >> This doesn't match up with logic further up: When the memory decode
> >> bit gets cleared, you don't zap the addresses, so I think you'd better
> >> store it here too. Use INVALID_PADDR only when the value read has
> >> all address bits set (same caveat as pointed out earlier).
> >
> >OK, note that .addr can only possibly be INVALID_PADDR at
> >initialization time, once the user has written something to the BAR
> >.addr will be different than INVALID_PADDR.
> 
> Which is part of what worries me - it would be better if the field wouldn't
> ever hold a special init-time-only value.

Right, but that matches the behavior of the hardware itself. On boot
the address of the BAR is not valid, but there's no way AFAIK to
restore the BAR to this state once an address has been written (except
by doing a reset of the device itself).

Roger.


* Re: [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-08-02 13:34   ` Jan Beulich
@ 2017-08-08 15:44     ` Roger Pau Monné
  2017-08-09  8:21       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-08 15:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

On Wed, Aug 02, 2017 at 07:34:28AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> >+int vpci_msi_arch_enable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> >+                         uint64_t address, uint32_t data, unsigned int vectors)
> >+{
> >+    struct msi_info msi_info = {
> >+        .seg = pdev->seg,
> >+        .bus = pdev->bus,
> >+        .devfn = pdev->devfn,
> >+        .entry_nr = vectors,
> >+    };
> >+    unsigned int i;
> >+    int rc;
> >+
> >+    ASSERT(arch->pirq == -1);
> 
> Please introduce a #define for the -1 here, to allow easily matching up
> producer and consumer side(s).

I've added a define for INVALID_PIRQ to xen/irq.h.

> >+    /* Get a PIRQ. */
> >+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> >+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> >+    if ( rc )
> >+    {
> >+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> >+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >+                PCI_FUNC(pdev->devfn), rc);
> >+        return rc;
> >+    }
> >+
> >+    for ( i = 0; i < vectors; i++ )
> >+    {
> >+        xen_domctl_bind_pt_irq_t bind = {
> >+            .machine_irq = arch->pirq + i,
> >+            .irq_type = PT_IRQ_TYPE_MSI,
> >+            .u.msi.gvec = msi_vector(data) + i,
> >+            .u.msi.gflags = msi_flags(data, address),
> >+        };
> >+
> >+        pcidevs_lock();
> >+        rc = pt_irq_create_bind(pdev->domain, &bind);
> >+        if ( rc )
> >+        {
> >+            dprintk(XENLOG_ERR,
> >+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> >+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> >+            spin_lock(&pdev->domain->event_lock);
> >+            unmap_domain_pirq(pdev->domain, arch->pirq);
> 
> Don't you also need to undo the pt_irq_create_bind() calls here for all prior
> successful iterations?

Yes, unmap_domain_pirq calls pirq_guest_force_unbind but better not
resort to that.

> >+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> >+                          unsigned int vectors)
> >+{
> >+    unsigned int i;
> >+
> >+    ASSERT(arch->pirq != -1);
> >+
> >+    for ( i = 0; i < vectors; i++ )
> >+    {
> >+        xen_domctl_bind_pt_irq_t bind = {
> >+            .machine_irq = arch->pirq + i,
> >+            .irq_type = PT_IRQ_TYPE_MSI,
> >+        };
> >+
> >+        pcidevs_lock();
> >+        pt_irq_destroy_bind(pdev->domain, &bind);
> 
> While I agree that the loop should continue if this fails, I'm not convinced
> you should entirely ignore the return value here.

I've added a printk in order to aid debug.

> >+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
> >+static void vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
> >+                                  union vpci_val *val, void *data)
> >+{
> >+    const struct vpci_msi *msi = data;
> >+
> >+    /* Set multiple message capable. */
> >+    val->u16 = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
> 
> The comment is somewhat misleading - whether the device is multi-message
> capable depends on msi->max_vectors.

Better "Set the number of supported messages"?

> >+    if ( msi->enabled ) {
> 
> Style.
> 
> >+        val->u16 |= PCI_MSI_FLAGS_ENABLE;
> >+        val->u16 |= MASK_INSR(fls(msi->vectors) - 1, PCI_MSI_FLAGS_QSIZE);
> 
> Why is reading back the proper value here dependent upon MSI being
> enabled?

Right, I've now slightly changed this to always store the number of
enabled vectors, regardless of whether the MSI enable bit is set or
not.

> >...
> >+ error:
> >+    ASSERT(ret);
> >+    xfree(msi);
> >+    return ret;
> >+}
> 
> Don't you also need to unregister address handlers you've registered?

vpci_add_handlers already takes care of cleaning up the register
handlers on failure.

> >+void vpci_dump_msi(void)
> >+{
> >+    struct domain *d;
> >+
> >+    for_each_domain ( d )
> >+    {
> >+        const struct pci_dev *pdev;
> >+
> >+        if ( !has_vpci(d) )
> >+            continue;
> >+
> >+        printk("vPCI MSI information for guest %u\n", d->domain_id);
> 
> "... for Dom%d" or "... for d%d" please.
> 
> >...
> >+            if ( msi->masking )
> >+                printk("mask=%#032x\n", msi->mask);
> 
> Why 30 hex digits? And generally # should be used only when not blank or
> zero padding the value (as field width includes the 0x prefix).

Ouch, that should be 8, not 32.

Thanks, Roger.


* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-08-08 12:35         ` Roger Pau Monné
@ 2017-08-09  8:17           ` Jan Beulich
  2017-08-09  8:22             ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-08-09  8:17 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	tim, julien.grall, xen-devel, boris.ostrovsky

>>> On 08.08.17 at 14:35, <roger.pau@citrix.com> wrote:
> On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
>> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
>> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
>> >> > +        /* Size the BAR and map it. */
>> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
>> >> > +                              &addr, &size);
>> >> > +        if ( rc < 0 )
>> >> > +            return rc;
>> >> > +
>> >> > +        if ( size == 0 )
>> >> > +        {
>> >> > +            bars[i].type = VPCI_BAR_EMPTY;
>> >> > +            continue;
>> >> > +        }
>> >> > +
>> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
>> >> 
>> >> This doesn't match up with logic further up: When the memory decode
>> >> bit gets cleared, you don't zap the addresses, so I think you'd better
>> >> store it here too. Use INVALID_PADDR only when the value read has
>> >> all address bits set (same caveat as pointed out earlier).
>> >
>> >OK, note that .addr can only possibly be INVALID_PADDR at
>> >initialization time, once the user has written something to the BAR
>> >.addr will be different than INVALID_PADDR.
>> 
>> Which is part of what worries me - it would be better if the field wouldn't
>> ever hold a special init-time-only value.
> 
> Right, but that matches the behavior of the hardware itself. On boot
> the address of the BAR is not valid, but there's no way AFAIK to
> restore the BAR to this state once an address has been written (except
> by doing a reset of the device itself).

True, but the BARs still hold _some_ value. And hence they can
equally well be made hold a value consistent with normal runtime.

Jan



* Re: [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-08-08 15:44     ` Roger Pau Monné
@ 2017-08-09  8:21       ` Jan Beulich
  2017-08-09  8:39         ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-08-09  8:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

>>> On 08.08.17 at 17:44, <roger.pau@citrix.com> wrote:
> On Wed, Aug 02, 2017 at 07:34:28AM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>> >+    /* Get a PIRQ. */
>> >+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
>> >+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
>> >+    if ( rc )
>> >+    {
>> >+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
>> >+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> >+                PCI_FUNC(pdev->devfn), rc);
>> >+        return rc;
>> >+    }
>> >+
>> >+    for ( i = 0; i < vectors; i++ )
>> >+    {
>> >+        xen_domctl_bind_pt_irq_t bind = {
>> >+            .machine_irq = arch->pirq + i,
>> >+            .irq_type = PT_IRQ_TYPE_MSI,
>> >+            .u.msi.gvec = msi_vector(data) + i,
>> >+            .u.msi.gflags = msi_flags(data, address),
>> >+        };
>> >+
>> >+        pcidevs_lock();
>> >+        rc = pt_irq_create_bind(pdev->domain, &bind);
>> >+        if ( rc )
>> >+        {
>> >+            dprintk(XENLOG_ERR,
>> >+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
>> >+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
>> >+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
>> >+            spin_lock(&pdev->domain->event_lock);
>> >+            unmap_domain_pirq(pdev->domain, arch->pirq);
>> 
>> Don't you also need to undo the pt_irq_create_bind() calls here for all prior
>> successful iterations?
> 
> Yes, unmap_domain_pirq calls pirq_guest_force_unbind but better not
> resort to that.

I don't understand.

>> >+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
>> >+                          unsigned int vectors)
>> >+{
>> >+    unsigned int i;
>> >+
>> >+    ASSERT(arch->pirq != -1);
>> >+
>> >+    for ( i = 0; i < vectors; i++ )
>> >+    {
>> >+        xen_domctl_bind_pt_irq_t bind = {
>> >+            .machine_irq = arch->pirq + i,
>> >+            .irq_type = PT_IRQ_TYPE_MSI,
>> >+        };
>> >+
>> >+        pcidevs_lock();
>> >+        pt_irq_destroy_bind(pdev->domain, &bind);
>> 
>> While I agree that the loop should continue if this fails, I'm not convinced
>> you should entirely ignore the return value here.
> 
> I've added a printk in order to aid debug.

I've actually tried to hint at you wanting to run the loop to
completion while returning to the caller the first error you've
encountered.

>> >+/* Handlers for the MSI control field (PCI_MSI_FLAGS). */
>> >+static void vpci_msi_control_read(struct pci_dev *pdev, unsigned int reg,
>> >+                                  union vpci_val *val, void *data)
>> >+{
>> >+    const struct vpci_msi *msi = data;
>> >+
>> >+    /* Set multiple message capable. */
>> >+    val->u16 = MASK_INSR(fls(msi->max_vectors) - 1, PCI_MSI_FLAGS_QMASK);
>> 
>> The comment is somewhat misleading - whether the device is multi-message
>> capable depends on msi->max_vectors.
> 
> Better "Set the number of supported messages"?

Yes.

Jan



* Re: [PATCH v4 6/9] xen/vpci: add handlers to map the BARs
  2017-08-09  8:17           ` Jan Beulich
@ 2017-08-09  8:22             ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-09  8:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: sstabellini, wei.liu2, George.Dunlap, andrew.cooper3, ian.jackson,
	tim, julien.grall, xen-devel, boris.ostrovsky

On Wed, Aug 09, 2017 at 02:17:57AM -0600, Jan Beulich wrote:
> >>> On 08.08.17 at 14:35, <roger.pau@citrix.com> wrote:
> > On Sat, Jul 29, 2017 at 10:44:02AM -0600, Jan Beulich wrote:
> >> >>> Roger Pau Monne <roger.pau@citrix.com> 07/24/17 4:58 PM >>>
> >> >On Fri, Jul 14, 2017 at 09:11:29AM -0600, Jan Beulich wrote:
> >> >> >>> On 30.06.17 at 17:01, <roger.pau@citrix.com> wrote:
> >> >> > +        /* Size the BAR and map it. */
> >> >> > +        rc = pci_size_mem_bar(seg, bus, slot, func, reg, i == num_bars - 1,
> >> >> > +                              &addr, &size);
> >> >> > +        if ( rc < 0 )
> >> >> > +            return rc;
> >> >> > +
> >> >> > +        if ( size == 0 )
> >> >> > +        {
> >> >> > +            bars[i].type = VPCI_BAR_EMPTY;
> >> >> > +            continue;
> >> >> > +        }
> >> >> > +
> >> >> > +        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> >> >> 
> >> >> This doesn't match up with logic further up: When the memory decode
> >> >> bit gets cleared, you don't zap the addresses, so I think you'd better
> >> >> store it here too. Use INVALID_PADDR only when the value read has
> >> >> all address bits set (same caveat as pointed out earlier).
> >> >
> >> >OK, note that .addr can only possibly be INVALID_PADDR at
> >> >initialization time, once the user has written something to the BAR
> >> >.addr will be different than INVALID_PADDR.
> >> 
> >> Which is part of what worries me - it would be better if the field wouldn't
> >> ever hold a special init-time-only value.
> > 
> > Right, but that matches the behavior of the hardware itself. On boot
> > the address of the BAR is not valid, but there's no way AFAIK to
> > restore the BAR to this state once an address has been written (except
> > by doing a reset of the device itself).
> 
> True, but the BARs still hold _some_ value. And hence they can
> equally well be made hold a value consistent with normal runtime.

I've changed it to remove the usage of INVALID_PADDR and instead made
the BAR hold the value that Xen finds in the underlying hardware,
without Xen trying to figure out if it's initialized or not.

Roger.


* Re: [PATCH v4 7/9] vpci/msi: add MSI handlers
  2017-08-09  8:21       ` Jan Beulich
@ 2017-08-09  8:39         ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-09  8:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: andrew.cooper3, julien.grall, paul.durrant, xen-devel,
	boris.ostrovsky

On Wed, Aug 09, 2017 at 02:21:33AM -0600, Jan Beulich wrote:
> >>> On 08.08.17 at 17:44, <roger.pau@citrix.com> wrote:
> > On Wed, Aug 02, 2017 at 07:34:28AM -0600, Jan Beulich wrote:
> >> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> >> >+    /* Get a PIRQ. */
> >> >+    rc = allocate_and_map_msi_pirq(pdev->domain, -1, &arch->pirq,
> >> >+                                   MAP_PIRQ_TYPE_MULTI_MSI, &msi_info);
> >> >+    if ( rc )
> >> >+    {
> >> >+        dprintk(XENLOG_ERR, "%04x:%02x:%02x.%u: failed to map PIRQ: %d\n",
> >> >+                pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >> >+                PCI_FUNC(pdev->devfn), rc);
> >> >+        return rc;
> >> >+    }
> >> >+
> >> >+    for ( i = 0; i < vectors; i++ )
> >> >+    {
> >> >+        xen_domctl_bind_pt_irq_t bind = {
> >> >+            .machine_irq = arch->pirq + i,
> >> >+            .irq_type = PT_IRQ_TYPE_MSI,
> >> >+            .u.msi.gvec = msi_vector(data) + i,
> >> >+            .u.msi.gflags = msi_flags(data, address),
> >> >+        };
> >> >+
> >> >+        pcidevs_lock();
> >> >+        rc = pt_irq_create_bind(pdev->domain, &bind);
> >> >+        if ( rc )
> >> >+        {
> >> >+            dprintk(XENLOG_ERR,
> >> >+                    "%04x:%02x:%02x.%u: failed to bind PIRQ %u: %d\n",
> >> >+                    pdev->seg, pdev->bus, PCI_SLOT(pdev->devfn),
> >> >+                    PCI_FUNC(pdev->devfn), arch->pirq + i, rc);
> >> >+            spin_lock(&pdev->domain->event_lock);
> >> >+            unmap_domain_pirq(pdev->domain, arch->pirq);
> >> 
> >> Don't you also need to undo the pt_irq_create_bind() calls here for all prior
> >> successful iterations?
> > 
> > Yes, unmap_domain_pirq calls pirq_guest_force_unbind but better not
> > resort to that.
> 
> I don't understand.

I've added calls to pt_irq_destroy_bind before calling
unmap_domain_pirq.

> >> >+int vpci_msi_arch_disable(struct vpci_arch_msi *arch, struct pci_dev *pdev,
> >> >+                          unsigned int vectors)
> >> >+{
> >> >+    unsigned int i;
> >> >+
> >> >+    ASSERT(arch->pirq != -1);
> >> >+
> >> >+    for ( i = 0; i < vectors; i++ )
> >> >+    {
> >> >+        xen_domctl_bind_pt_irq_t bind = {
> >> >+            .machine_irq = arch->pirq + i,
> >> >+            .irq_type = PT_IRQ_TYPE_MSI,
> >> >+        };
> >> >+
> >> >+        pcidevs_lock();
> >> >+        pt_irq_destroy_bind(pdev->domain, &bind);
> >> 
> >> While I agree that the loop should continue of this fails, I'm not convinced
> >> you should entirely ignore the return value here.
> > 
> > I've added a printk in order to aid debug.
> 
> I've actually tried to hint at you wanting to run the loop to
> completion while returning to the caller the first error you've
> encountered.

Hm, I'm not sure of the best way to proceed here.

If vpci_msi_arch_disable returns as soon as one of the
pt_irq_destroy_bind calls fails, further calls to
vpci_msi_arch_disable are also likely to fail if the previous call
managed to destroy some of the bindings but not all of them.

But then trying to call unmap_domain_pirq without having destroyed all
of the bindings seems likely to fail anyway...

Thanks, Roger.


* Re: [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-08-02 15:07   ` Jan Beulich
@ 2017-08-10 17:04     ` Roger Pau Monné
  2017-08-11 10:01       ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-10 17:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

On Wed, Aug 02, 2017 at 09:07:54AM -0600, Jan Beulich wrote:
> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> >Note that accesses to the Table Offset, Table BIR, PBA Offset and PBA
> >BIR are not trapped by Xen at the moment.
> 
> They're mandated r/o by the spec anyway.

> 
> >@@ -113,6 +148,35 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> >if ( IS_ERR(mem) )
> >return -PTR_ERR(mem);
>  >
> >+    /*
> >+     * Make sure the MSI-X regions of the BAR are not mapped into the domain
> >+     * p2m, or else the MSI-X handlers are useless. Only do this when mapping,
> >+     * since that's when the memory decoding on the device is enabled.
> >+     */
> >+    for ( i = 0; map && i < ARRAY_SIZE(bar->msix); i++ )
> >+    {
> >+        struct vpci_msix_mem *msix = bar->msix[i];
> >+
> >+        if ( !msix || msix->addr == INVALID_PADDR )
> >+            continue;
> >+
> >+        rc = vpci_unmap_msix(d, msix);
> 
> Why do you need this, instead of being able to simply rely on the rangeset
> based (un)mapping?

This is because the series that I've sent, called "x86/pvh: implement
iommu_inclusive_mapping for PVH Dom0", will map the MSI-X memory areas
into the guest, and thus we need to make sure they are not mapped
here for the emulation path to work.

https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02849.html

> >@@ -405,7 +475,20 @@ static int vpci_init_bars(struct pci_dev *pdev)
> >continue;
> >}
>  >
> >-        bars[i].addr = (cmd & PCI_COMMAND_MEMORY) ? addr : INVALID_PADDR;
> >+        if ( cmd & PCI_COMMAND_MEMORY )
> >+        {
> >+            unsigned int j;
> >+
> >+            bars[i].addr = addr;
> >+
> >+            for ( j = 0; j < ARRAY_SIZE(bars[i].msix); j++ )
> >+                if ( bars[i].msix[j] )
> >+                    bars[i].msix[j]->addr = bars[i].addr +
> >+                                            bars[i].msix[j]->offset;
> >+        }
> >+        else
> >+            bars[i].addr = INVALID_PADDR;
> 
> As (I think) mentioned elsewhere already, this init-time special case looks
> dangerous (and unnecessary) to me (or else I'd expect you to also zap
> the field when the memory decode bit is being cleared).

OK, so I'm simply going to set this to addr + offset, regardless of
whether the BAR has memory decoding enabled or not. If the BAR is not
yet positioned, Dom0 will have to position it anyway before enabling
memory decoding.

> >+        for ( i = 0; i < msix->max_entries; i++ )
> >+        {
> >+            if ( msix->entries[i].masked )
> >+                continue;
> 
> Why is the mask bit relevant here, but not the mask-all one?

Not taking the mask-all into account here is wrong, since setting
mask-all from 1 to 0 should force a recalculation of all the entries'
address and data fields. I will fix this in the next version.

> >+            rc = vpci_msix_arch_enable(&msix->entries[i].arch, pdev,
> >+                                       msix->entries[i].addr,
> >+                                       msix->entries[i].data,
> >+                                       msix->entries[i].nr, table_base);
> >+            if ( rc )
> >+            {
> >+                gdprintk(XENLOG_ERR,
> >+                         "%04x:%02x:%02x.%u: unable to update entry %u: %d\n",
> >+                         seg, bus, slot, func, i, rc);
> >+                return;
> >+            }
> >+
> >+            vpci_msix_arch_mask(&msix->entries[i].arch, pdev, false);
> 
> Same question here.

This is needed because after a vpci_msix_arch_enable the pirq is still
masked, and hence needs to be unmasked to match the guest's view.

> >+        }
> >+    }
> >+    else if ( msix->enabled && !new_enabled )
> >+    {
> >+        /* MSI-X disabled. */
> >+        for ( i = 0; i < msix->max_entries; i++ )
> >+        {
> >+            rc = vpci_msix_arch_disable(&msix->entries[i].arch, pdev);
> >+            if ( rc )
> >+            {
> >+                gdprintk(XENLOG_ERR,
> >+                         "%04x:%02x:%02x.%u: unable to disable entry %u: %d\n",
> >+                         seg, bus, slot, func, i, rc);
> >+                return;
> >+            }
> >+        }
> >+    }
> >+
> >+    if ( (new_enabled != msix->enabled || new_masked != msix->masked) &&
> >+         pci_msi_conf_write_intercept(pdev, reg, 2, &val.u32) >= 0 )
> >+        pci_conf_write16(seg, bus, slot, func, reg, val.u32);
> 
> DYM val.u16 here?

Now this is simply val, since the union has been removed.

> >+static struct vpci_msix *vpci_msix_find(struct domain *d, unsigned long addr)
> >+{
> >+    struct vpci_msix *msix;
> >+
> >+    ASSERT(vpci_locked(d));
> >+    list_for_each_entry ( msix,  &d->arch.hvm_domain.msix_tables, next )
> >+    {
> >+        uint8_t seg = msix->pdev->seg, bus = msix->pdev->bus;
> >+        uint8_t slot = PCI_SLOT(msix->pdev->devfn);
> >+        uint8_t func = PCI_FUNC(msix->pdev->devfn);
> >+        uint16_t cmd = pci_conf_read16(seg, bus, slot, func, PCI_COMMAND);
> 
> Perhaps better to keep a cached copy of the command register value?

I'm now using the enabled field of the vpci_bar struct instead of
checking the command register.

> >+        break;
> >+    case PCI_MSIX_ENTRY_DATA_OFFSET:
> >+        /*
> >+         * 8 byte writes to the msg data and vector control fields are
> >+         * only allowed if the entry is masked.
> >+         */
> >+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
> >+        {
> >+            vpci_unlock(d);
> >+            return X86EMUL_OKAY;
> >+        }
> 
> I don't think this is correct - iirc such writes simply don't take effect immediately
> (but I then seem to recall this to apply to the address field and 32-bit writes to
> the data field as well). They'd instead take effect the next time the entry is being
> unmasked (or some such). A while ago I did fix the qemu code to behave in this
> way.

There's an Implementation Note called "Special Considerations for QWORD
Accesses" in the MSI-X section of the PCI 3.0 spec that states:

If a given entry is currently masked (via its Mask bit or the Function
Mask bit), software is permitted to fill in the Message Data and
Vector Control fields with a single QWORD write, taking advantage of
the fact the Message Data field is guaranteed to become visible to
hardware no later than the Vector Control field.

So I think the above chunk is correct. The specification also states
that:

Software must not modify the Address or Data fields of an entry while
it is unmasked. Refer to Section 6.8.3.5 for details.

AFAICT this is not enforced by QEMU, and you can write to the
address/data fields while the message is not masked. The update will
only take effect once the message is masked and unmasked.

> >+static int vpci_init_msix(struct pci_dev *pdev)
> >+{
> >+    struct domain *d = pdev->domain;
> >+    uint8_t seg = pdev->seg, bus = pdev->bus;
> >+    uint8_t slot = PCI_SLOT(pdev->devfn), func = PCI_FUNC(pdev->devfn);
> >+    struct vpci_msix *msix;
> >+    unsigned int msix_offset, i, max_entries;
> >+    struct vpci_bar *table_bar, *pba_bar;
> >+    uint16_t control;
> >+    int rc;
> >+
> >+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
> >+    if ( !msix_offset )
> >+        return 0;
> >+
> >+    control = pci_conf_read16(seg, bus, slot, func,
> >+                              msix_control_reg(msix_offset));
> >+
> >+    /* Get the maximum number of vectors the device supports. */
> >+    max_entries = msix_table_size(control);
> >+
> >+    msix = xzalloc_bytes(MSIX_SIZE(max_entries));
> >+    if ( !msix )
> >+        return -ENOMEM;
> >+
> >+    msix->max_entries = max_entries;
> >+    msix->pdev = pdev;
> >+
> >+    /* Find the MSI-X table address. */
> >+    msix->table.offset = pci_conf_read32(seg, bus, slot, func,
> >+                                         msix_table_offset_reg(msix_offset));
> >+    msix->table.bir = msix->table.offset & PCI_MSIX_BIRMASK;
> >+    msix->table.offset &= ~PCI_MSIX_BIRMASK;
> >+    msix->table.size = msix->max_entries * PCI_MSIX_ENTRY_SIZE;
> >+    msix->table.addr = INVALID_PADDR;
> >+
> >+    /* Find the MSI-X pba address. */
> >+    msix->pba.offset = pci_conf_read32(seg, bus, slot, func,
> >+                                       msix_pba_offset_reg(msix_offset));
> >+    msix->pba.bir = msix->pba.offset & PCI_MSIX_BIRMASK;
> >+    msix->pba.offset &= ~PCI_MSIX_BIRMASK;
> >+    msix->pba.size = DIV_ROUND_UP(msix->max_entries, 8);
> 
> I think you want to round up to at least the next 32-bit boundary; the
> spec talking about bits 63..00 even suggests a 64-bit boundary. The
> table addresses being required to be qword aligned also supports this.

The spec mentions that the last QWORD of the PBA doesn't need to be
fully populated, so yes, I assume this needs to be rounded up to a
64-bit boundary.

> >+void vpci_dump_msix(void)
> >+{
> >+    struct domain *d;
> >+    struct pci_dev *pdev;
> 
> const for all pointers in dump handlers, as far as possible.
> 
> >+    for_each_domain ( d )
> >+    {
> >+        if ( !has_vpci(d) )
> >+            continue;
> >+
> >+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);
> 
> Wouldn't it be better (more useful) to dump the MSI and MSI-X data for a
> domain next to each other?

Possibly yes, and printing the MSI and MSI-X data of each device
together would be even better IMHO.

> Apart from the comments here the ones give for the MSI patch apply
> respectively.

I've added the MSI-X dumping to vpci_dump_msi instead.

Roger.


* Re: [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-08-10 17:04     ` Roger Pau Monné
@ 2017-08-11 10:01       ` Jan Beulich
  2017-08-11 10:11         ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2017-08-11 10:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> On 10.08.17 at 19:04, <roger.pau@citrix.com> wrote:
> On Wed, Aug 02, 2017 at 09:07:54AM -0600, Jan Beulich wrote:
>> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>> >@@ -113,6 +148,35 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
>> >if ( IS_ERR(mem) )
>> >return -PTR_ERR(mem);
>>  >
>> >+    /*
>> >+     * Make sure the MSI-X regions of the BAR are not mapped into the domain
>> >+     * p2m, or else the MSI-X handlers are useless. Only do this when mapping,
>> >+     * since that's when the memory decoding on the device is enabled.
>> >+     */
>> >+    for ( i = 0; map && i < ARRAY_SIZE(bar->msix); i++ )
>> >+    {
>> >+        struct vpci_msix_mem *msix = bar->msix[i];
>> >+
>> >+        if ( !msix || msix->addr == INVALID_PADDR )
>> >+            continue;
>> >+
>> >+        rc = vpci_unmap_msix(d, msix);
>> 
>> Why do you need this, instead of being able to simply rely on the rangeset
>> based (un)mapping?
> 
> This is because the series that I've sent called: "x86/pvh: implement
> iommu_inclusive_mapping for PVH Dom0" will map the MSI-X memory areas
> into the guest, and thus we need to make sure they are not mapped
> here for the emulation path to work.
> 
> https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02849.html 

Oh, okay. The patch description doesn't mention any such
dependency though.

>> >+        break;
>> >+    case PCI_MSIX_ENTRY_DATA_OFFSET:
>> >+        /*
>> >+         * 8 byte writes to the msg data and vector control fields are
>> >+         * only allowed if the entry is masked.
>> >+         */
>> >+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
>> >+        {
>> >+            vpci_unlock(d);
>> >+            return X86EMUL_OKAY;
>> >+        }
>> 
>> I don't think this is correct - iirc such writes simply don't take effect immediately
>> (but I then seem to recall this to apply to the address field and 32-bit writes to
>> the data field as well). They'd instead take effect the next time the entry is being
>> unmasked (or some such). A while ago I did fix the qemu code to behave in this
>> way.
> 
> There's an Implementation Note called "Special Considerations for QWORD
> Accesses" in the MSI-X section of the PCI 3.0 spec that states:
> 
> If a given entry is currently masked (via its Mask bit or the Function
> Mask bit), software is permitted to fill in the Message Data and
> Vector Control fields with a single QWORD write, taking advantage of
> the fact the Message Data field is guaranteed to become visible to
> hardware no later than the Vector Control field.
> 
> So I think the above chunk is correct. The specification also states
> that:
> 
> Software must not modify the Address or Data fields of an entry while
> it is unmasked. Refer to Section 6.8.3.5 for details.
> 
> AFAICT this is not enforced by QEMU, and you can write to the
> address/data fields while the message is not masked. The update will
> only take effect once the message is masked and unmasked.

The spec also says:

"For MSI-X, a function is permitted to cache Address and Data values
 from unmasked MSIX Table entries. However, anytime software
 unmasks a currently masked MSI-X Table entry either by clearing its
 Mask bit or by clearing the Function Mask bit, the function must
 update any Address or Data values that it cached from that entry. If
 software changes the Address or Data value of an entry while the
 entry is unmasked, the result is undefined."

>> >+    for_each_domain ( d )
>> >+    {
>> >+        if ( !has_vpci(d) )
>> >+            continue;
>> >+
>> >+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);
>> 
>> Wouldn't it be better (more useful) to dump the MSI and MSI-X data for a
>> domain next to each other?
> 
> Possibly yes, and printing the MSI and MSI-X data of each device
> together would be even better IMHO.

Not sure about that last aspect - devices aren't permitted to enable
both at the same time, so only one of them can really be relevant.

Jan


* Re: [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-08-11 10:01       ` Jan Beulich
@ 2017-08-11 10:11         ` Roger Pau Monné
  2017-08-11 10:20           ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2017-08-11 10:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

On Fri, Aug 11, 2017 at 04:01:05AM -0600, Jan Beulich wrote:
> >>> On 10.08.17 at 19:04, <roger.pau@citrix.com> wrote:
> > On Wed, Aug 02, 2017 at 09:07:54AM -0600, Jan Beulich wrote:
> >> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
> >> >@@ -113,6 +148,35 @@ static int vpci_modify_bar(struct domain *d, const struct vpci_bar *bar,
> >> >if ( IS_ERR(mem) )
> >> >return -PTR_ERR(mem);
> >>  >
> >> >+    /*
> >> >+     * Make sure the MSI-X regions of the BAR are not mapped into the domain
> >> >+     * p2m, or else the MSI-X handlers are useless. Only do this when mapping,
> >> >+     * since that's when the memory decoding on the device is enabled.
> >> >+     */
> >> >+    for ( i = 0; map && i < ARRAY_SIZE(bar->msix); i++ )
> >> >+    {
> >> >+        struct vpci_msix_mem *msix = bar->msix[i];
> >> >+
> >> >+        if ( !msix || msix->addr == INVALID_PADDR )
> >> >+            continue;
> >> >+
> >> >+        rc = vpci_unmap_msix(d, msix);
> >> 
> >> Why do you need this, instead of being able to simply rely on the rangeset
> >> based (un)mapping?
> > 
> > This is because the series that I've sent called: "x86/pvh: implement
> > iommu_inclusive_mapping for PVH Dom0" will map the MSI-X memory areas
> > into the guest, and thus we need to make sure they are not mapped
> > here for the emulation path to work.
> > 
> > https://lists.xenproject.org/archives/html/xen-devel/2017-04/msg02849.html 
> 
> Oh, okay. The patch description doesn't mention any such
> dependency though.

Will make that clearer in the next version; in fact I'm going to send
this series rebased on top of the iommu_inclusive_mapping one. AFAICT
that one is closer to being committed, and in any case changing the
order is trivial, there are no conflicts.

> >> >+        break;
> >> >+    case PCI_MSIX_ENTRY_DATA_OFFSET:
> >> >+        /*
> >> >+         * 8 byte writes to the msg data and vector control fields are
> >> >+         * only allowed if the entry is masked.
> >> >+         */
> >> >+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
> >> >+        {
> >> >+            vpci_unlock(d);
> >> >+            return X86EMUL_OKAY;
> >> >+        }
> >> 
> >> I don't think this is correct - iirc such writes simply don't take effect immediately
> >> (but I then seem to recall this to apply to the address field and 32-bit writes to
> >> the data field as well). They'd instead take effect the next time the entry is being
> >> unmasked (or some such). A while ago I did fix the qemu code to behave in this
> >> way.
> > 
> > There's an Implementation Note called "Special Considerations for QWORD
> > Accesses" in the MSI-X section of the PCI 3.0 spec that states:
> > 
> > If a given entry is currently masked (via its Mask bit or the Function
> > Mask bit), software is permitted to fill in the Message Data and
> > Vector Control fields with a single QWORD write, taking advantage of
> > the fact the Message Data field is guaranteed to become visible to
> > hardware no later than the Vector Control field.
> > 
> > So I think the above chunk is correct. The specification also states
> > that:
> > 
> > Software must not modify the Address or Data fields of an entry while
> > it is unmasked. Refer to Section 6.8.3.5 for details.
> > 
> > AFAICT this is not enforced by QEMU, and you can write to the
> > address/data fields while the message is not masked. The update will
> > only take effect once the message is masked and unmasked.
> 
> The spec also says:
> 
> "For MSI-X, a function is permitted to cache Address and Data values
>  from unmasked MSIX Table entries. However, anytime software
>  unmasks a currently masked MSI-X Table entry either by clearing its
>  Mask bit or by clearing the Function Mask bit, the function must
>  update any Address or Data values that it cached from that entry. If
>  software changes the Address or Data value of an entry while the
>  entry is unmasked, the result is undefined."

I'm not sure it's clear to me what to do if the guest writes to an
entry while unmasked. On one hand it says that software must not
modify it, but OTOH it says the result is undefined when doing so.

Would you be fine with ignoring writes to the address and data fields
if the entry is unmasked?

> >> >+    for_each_domain ( d )
> >> >+    {
> >> >+        if ( !has_vpci(d) )
> >> >+            continue;
> >> >+
> >> >+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);
> >> 
> >> Wouldn't it be better (more useful) to dump the MSI and MSI-X data for a
> >> domain next to each other?
> > 
> > Possibly yes, and printing the MSI and MSI-X data of each device
> > together would be even better IMHO.
> 
> Not sure about that last aspect - devices aren't permitted to enable
> both at the same time, so only one of them can really be relevant.

I think (for debugging purposes) it's useful to print both together
in order to spot if the guest is doing something wrong.

Roger.


* Re: [PATCH v4 9/9] vpci/msix: add MSI-X handlers
  2017-08-11 10:11         ` Roger Pau Monné
@ 2017-08-11 10:20           ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2017-08-11 10:20 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: andrew.cooper3, julien.grall, boris.ostrovsky, xen-devel

>>> On 11.08.17 at 12:11, <roger.pau@citrix.com> wrote:
> On Fri, Aug 11, 2017 at 04:01:05AM -0600, Jan Beulich wrote:
>> >>> On 10.08.17 at 19:04, <roger.pau@citrix.com> wrote:
>> > On Wed, Aug 02, 2017 at 09:07:54AM -0600, Jan Beulich wrote:
>> >> >>> Roger Pau Monne <roger.pau@citrix.com> 06/30/17 5:01 PM >>>
>> >> >+    case PCI_MSIX_ENTRY_DATA_OFFSET:
>> >> >+        /*
>> >> >+         * 8 byte writes to the msg data and vector control fields are
>> >> >+         * only allowed if the entry is masked.
>> >> >+         */
>> >> >+        if ( len == 8 && !entry->masked && !msix->masked && msix->enabled )
>> >> >+        {
>> >> >+            vpci_unlock(d);
>> >> >+            return X86EMUL_OKAY;
>> >> >+        }
>> >> 
>> >> I don't think this is correct - iirc such writes simply don't take effect immediately
>> >> (but I then seem to recall this to apply to the address field and 32-bit writes to
>> >> the data field as well). They'd instead take effect the next time the entry is being
>> >> unmasked (or some such). A while ago I did fix the qemu code to behave in this
>> >> way.
>> > 
>> > There's an Implementation Note called "Special Considerations for QWORD
>> > Accesses" in the MSI-X section of the PCI 3.0 spec that states:
>> > 
>> > If a given entry is currently masked (via its Mask bit or the Function
>> > Mask bit), software is permitted to fill in the Message Data and
>> > Vector Control fields with a single QWORD write, taking advantage of
>> > the fact the Message Data field is guaranteed to become visible to
>> > hardware no later than the Vector Control field.
>> > 
>> > So I think the above chunk is correct. The specification also states
>> > that:
>> > 
>> > Software must not modify the Address or Data fields of an entry while
>> > it is unmasked. Refer to Section 6.8.3.5 for details.
>> > 
>> > AFAICT this is not enforced by QEMU, and you can write to the
>> > address/data fields while the message is not masked. The update will
>> > only take effect once the message is masked and unmasked.
>> 
>> The spec also says:
>> 
>> "For MSI-X, a function is permitted to cache Address and Data values
>>  from unmasked MSIX Table entries. However, anytime software
>>  unmasks a currently masked MSI-X Table entry either by clearing its
>>  Mask bit or by clearing the Function Mask bit, the function must
>>  update any Address or Data values that it cached from that entry. If
>>  software changes the Address or Data value of an entry while the
>>  entry is unmasked, the result is undefined."
> 
> I'm not sure it's clear to me what to do if the guest writes to an
> entry while unmasked. For once it says that it must not modify it, but
> OTHO it says result is undefined when doing so.
> 
> Would you be fine with ignoring writes to the address and data fields
> if the entry is unmasked?

No, not really. I've intentionally pointed you to the qemu code,
as there I've implemented the caching behavior described by the
quote above. I'd expect vPCI to behave similarly.

>> >> >+    for_each_domain ( d )
>> >> >+    {
>> >> >+        if ( !has_vpci(d) )
>> >> >+            continue;
>> >> >+
>> >> >+        printk("vPCI MSI-X information for guest %u\n", d->domain_id);
>> >> 
>> >> Wouldn't it be better (more useful) to dump the MSI and MSI-X data for a
>> >> domain next to each other?
>> > 
>> > Possibly yes, and printing the MSI and MSI-X data of each device
>> > together would be even better IMHO.
>> 
>> Not sure about that last aspect - devices aren't permitted to enable
>> both at the same time, so only one of them can really be relevant.
> 
> I think (for debugging purposes) it's useful to print both together
> in order to spot if the guest is doing something wrong.

For Dom0 maybe. For DomU we'd have to refuse guest attempts
to do anything possibly resulting in undefined behavior.

Jan



end of thread, other threads:[~2017-08-11 10:20 UTC | newest]

Thread overview: 44+ messages
2017-06-30 15:01 [PATCH v4 0/9] vpci: PCI config space emulation Roger Pau Monne
2017-06-30 15:01 ` [PATCH v4 2/9] x86/mmcfg: add handlers for the PVH Dom0 MMCFG areas Roger Pau Monne
2017-07-10 13:34   ` Paul Durrant
2017-07-13 20:15   ` Jan Beulich
2017-07-14 16:33     ` Roger Pau Monné
2017-07-28 12:22       ` Jan Beulich
2017-06-30 15:01 ` [PATCH v4 3/9] x86/physdev: enable PHYSDEVOP_pci_mmcfg_reserved for PVH Dom0 Roger Pau Monne
2017-07-14 10:32   ` Jan Beulich
2017-07-20 10:23     ` Roger Pau Monne
2017-07-28 12:31       ` Jan Beulich
2017-06-30 15:01 ` [PATCH v4 4/9] xen/mm: move modify_identity_mmio to global file and drop __init Roger Pau Monne
2017-07-14 10:32   ` Jan Beulich
2017-06-30 15:01 ` [PATCH v4 5/9] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
2017-07-14 10:33   ` Jan Beulich
2017-07-20 14:00     ` Roger Pau Monne
2017-07-20 14:05       ` Roger Pau Monne
2017-07-29 16:32       ` Jan Beulich
2017-06-30 15:01 ` [PATCH v4 6/9] xen/vpci: add handlers to map the BARs Roger Pau Monne
2017-07-14 15:11   ` Jan Beulich
2017-07-24 14:58     ` Roger Pau Monne
2017-07-29 16:44       ` Jan Beulich
2017-08-08 12:35         ` Roger Pau Monné
2017-08-09  8:17           ` Jan Beulich
2017-08-09  8:22             ` Roger Pau Monné
2017-06-30 15:01 ` [PATCH v4 7/9] vpci/msi: add MSI handlers Roger Pau Monne
2017-07-18  8:56   ` Paul Durrant
2017-08-02 13:34   ` Jan Beulich
2017-08-08 15:44     ` Roger Pau Monné
2017-08-09  8:21       ` Jan Beulich
2017-08-09  8:39         ` Roger Pau Monné
2017-06-30 15:01 ` [PATCH v4 8/9] vpci: add a priority parameter to the vPCI register initializer Roger Pau Monne
2017-08-02 14:13   ` Jan Beulich
2017-06-30 15:01 ` [PATCH v4 9/9] vpci/msix: add MSI-X handlers Roger Pau Monne
2017-08-02 15:07   ` Jan Beulich
2017-08-10 17:04     ` Roger Pau Monné
2017-08-11 10:01       ` Jan Beulich
2017-08-11 10:11         ` Roger Pau Monné
2017-08-11 10:20           ` Jan Beulich
     [not found] ` <20170630150117.88489-2-roger.pau@citrix.com>
2017-07-10 13:27   ` [PATCH v4 1/9] xen/vpci: introduce basic handlers to trap accesses to the PCI config space Paul Durrant
2017-07-13 14:36   ` Jan Beulich
2017-07-14 15:33     ` Roger Pau Monné
2017-07-14 16:01       ` Jan Beulich
2017-07-14 16:41         ` Roger Pau Monné
2017-07-28 12:25           ` Jan Beulich
