* [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
@ 2017-12-15  6:29 Alexey Kardashevskiy
  2017-12-19  4:52 ` David Gibson
  2018-01-05  9:48 ` Auger Eric
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-15  6:29 UTC
To: qemu-devel; +Cc: Alexey Kardashevskiy, qemu-ppc, David Gibson, Alex Williamson

This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
which tells that a region with MSIX data can be mapped entirely, i.e.
the VFIO PCI driver won't prevent MSIX vectors area from being mapped.

With this change, all BARs are mapped in a single chunk and MSIX vectors
are emulated on top unless the machine requests not to by defining and
enabling a new "vfio-no-msix-emulation" property. At the moment only
sPAPR machine does so - it prohibits MSIX emulation and does not allow
enabling it as it does not define the "set" callback for the new property;
the new property also does not appear in "-machine pseries,help".

If the new capability is present, this puts MSIX IO memory region under
mapped memory region. If the capability is not there, it falls back to
the old behaviour with the sparse capability.

If the MSIX vectors section is not aligned to the page size, the KVM memory
listener does not register it with the KVM as a memory slot and MSIX is
emulated by QEMU as before.

This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" -
for the new capability: https://www.spinics.net/lists/kvm/msg160282.html

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is mtree and flatview BEFORE this patch:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
      000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

"info mtree -f":
FlatView #0
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]



This is AFTER this patch applied:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
        000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled]
        000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


"info mtree -f":
FlatView #2
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]



This is AFTER this patch applied AND spapr_get_msix_emulation() patched
to enable emulation:

"info mtree":
memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
      0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
        000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
        000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
      0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

"info mtree -f":
FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
---
 include/hw/vfio/vfio-common.h |  1 +
 linux-headers/linux/vfio.h    |  5 +++++
 hw/ppc/spapr.c                |  8 ++++++++
 hw/vfio/common.c              | 15 +++++++++++++++
 hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
 5 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index f3a2ac9..927d600 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
                          struct vfio_region_info **info);
 int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
                              uint32_t subtype, struct vfio_region_info **info);
+bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
 #endif
 extern const MemoryListener vfio_prereg_listener;

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 4312e96..b45182e 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG  (2)
 #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG   (3)

+/*
+ * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
+ */
+#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE      3
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  *                                  struct vfio_irq_info)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 9de63f0..693394a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
     spapr->use_hotplug_event_source = value;
 }

+static bool spapr_get_msix_emulation(Object *obj, Error **errp)
+{
+    return true;
+}
+
 static char *spapr_get_resize_hpt(Object *obj, Error **errp)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
@@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
     object_property_set_description(obj, "vsmt",
                                     "Virtual SMT: KVM behaves as if this were"
                                     " the host's SMT mode", &error_abort);
+    object_property_add_bool(obj, "vfio-no-msix-emulation",
+                             spapr_get_msix_emulation, NULL, NULL);
 }

 static void spapr_machine_finalizefn(Object *obj)
@@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
 /*
  * pseries-2.11
  */
+
 static void spapr_machine_2_11_instance_options(MachineState *machine)
 {
 }
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 1fb8a8e..da5f182 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
     return -ENODEV;
 }

+bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
+{
+    struct vfio_region_info *info = NULL;
+    bool ret = false;
+
+    if (!vfio_get_region_info(vbasedev, region, &info)) {
+        if (vfio_get_region_info_cap(info, cap_type)) {
+            ret = true;
+        }
+        g_free(info);
+    }
+
+    return ret;
+}
+
 /*
  * Interfaces for IBM EEH (Enhanced Error Handling)
  */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee3..27a3706 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
     off_t start, end;
     VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;

+    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
+                            vdev->msix->table_bar)) {
+        return;
+    }
+
     /*
      * We expect to find a single mmap covering the whole BAR, anything else
      * means it's either unsupported or already setup.
@@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
     vfio_pci_fixup_msix_region(vdev);
 }

+static MemoryRegion *vfio_msix_parent(VFIORegion *region)
+{
+    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
+        return &region->mmaps[0].mem;
+    }
+
+    return region->mem;
+}
+
 static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
 {
     int ret;
@@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
     vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
                                     sizeof(unsigned long));
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
-                    vdev->bars[vdev->msix->table_bar].region.mem,
+                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
                     vdev->msix->table_bar, vdev->msix->table_offset,
-                    vdev->bars[vdev->msix->pba_bar].region.mem,
+                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
                     vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
                     &err);
     if (ret < 0) {
@@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
      */
     memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);

+    if (object_property_get_bool(OBJECT(qdev_get_machine()),
+                                 "vfio-no-msix-emulation", NULL)) {
+        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
+    }
+
     return 0;
 }
-- 
2.11.0
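
The vfio_is_cap_present() helper in the patch relies on vfio_get_region_info_cap() to look up a capability ID in the chain that follows the region info. The block below is only a minimal, self-contained sketch of that kind of chain walk, not the QEMU implementation. It assumes the usual VFIO layout in which each capability header carries an id, a version and a "next" offset measured from the start of the info buffer, with 0 ending the chain. The names struct cap_header, cap_present(), CAP_MSIX_MAPPABLE and CAP_MAX_HOPS are hypothetical and exist only for this illustration; the value 3 is the one proposed in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in with the same field layout as struct vfio_info_cap_header. */
struct cap_header {
    uint16_t id;      /* capability identifier */
    uint16_t version; /* capability version */
    uint32_t next;    /* offset of the next header from the buffer start, 0 = end */
};

/* Capability ID proposed by the patch. */
#define CAP_MSIX_MAPPABLE 3

/* Hypothetical bound so that a malformed, looping chain cannot spin forever. */
#define CAP_MAX_HOPS 64

/*
 * Return true if a capability with the given id is found in the chain
 * that starts at cap_offset inside the info buffer.
 */
bool cap_present(const uint8_t *info, size_t info_size,
                 uint32_t cap_offset, uint16_t id)
{
    uint32_t off = cap_offset;
    int hops;

    if (info_size < sizeof(struct cap_header)) {
        return false;
    }

    for (hops = 0; off != 0 && hops < CAP_MAX_HOPS; hops++) {
        struct cap_header hdr;

        if (off > info_size - sizeof(hdr)) {
            return false; /* header would run past the end of the buffer */
        }
        memcpy(&hdr, info + off, sizeof(hdr));
        if (hdr.id == id) {
            return true;
        }
        off = hdr.next;
    }

    return false;
}
```

A caller would pass the buffer returned for the region together with its cap_offset; a cap_offset of 0 means no capabilities are present and the function returns false.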
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2017-12-15  6:29 [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR Alexey Kardashevskiy
@ 2017-12-19  4:52 ` David Gibson
  2017-12-19 17:26   ` Alex Williamson
  2018-01-05  9:48 ` Auger Eric
  1 sibling, 1 reply; 19+ messages in thread
From: David Gibson @ 2017-12-19  4:52 UTC
To: Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, Alex Williamson

On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> which tells that a region with MSIX data can be mapped entirely, i.e.
> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>
> With this change, all BARs are mapped in a single chunk and MSIX vectors
> are emulated on top unless the machine requests not to by defining and
> enabling a new "vfio-no-msix-emulation" property. At the moment only
> sPAPR machine does so - it prohibits MSIX emulation and does not allow
> enabling it as it does not define the "set" callback for the new property;
> the new property also does not appear in "-machine pseries,help".
>
> If the new capability is present, this puts MSIX IO memory region under
> mapped memory region. If the capability is not there, it falls back to
> the old behaviour with the sparse capability.
>
> If the MSIX vectors section is not aligned to the page size, the KVM memory
> listener does not register it with the KVM as a memory slot and MSIX is
> emulated by QEMU as before.
> > This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > > This is mtree and flatview BEFORE this patch: > > "info mtree": > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > "info mtree -f": > FlatView #0 > AS "memory", root: system > AS "cpu-memory", root: system > Root memory region: system > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > This is AFTER this patch applied: > > "info mtree": > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > "info mtree -f": > FlatView #2 > AS "memory", root: system > AS "cpu-memory", root: system > Root memory region: system > 
0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > This is AFTER this patch applied AND spapr_get_msix_emulation() patched > to enable emulation: > > "info mtree": > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > "info mtree -f": > FlatView #1 > AS "memory", root: system > AS "cpu-memory", root: system > Root memory region: system > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > --- > include/hw/vfio/vfio-common.h | 1 + > linux-headers/linux/vfio.h | 5 +++++ > hw/ppc/spapr.c | 8 ++++++++ > hw/vfio/common.c | 15 +++++++++++++++ > hw/vfio/pci.c | 23 +++++++++++++++++++++-- > 5 files changed, 50 insertions(+), 2 deletions(-) > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > index f3a2ac9..927d600 100644 > --- a/include/hw/vfio/vfio-common.h > +++ b/include/hw/vfio/vfio-common.h > @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > struct vfio_region_info **info); > int vfio_get_dev_region_info(VFIODevice 
*vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 4312e96..b45182e 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG  (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG   (3)
>
> +/*
> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> + */
> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE      3
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *                                  struct vfio_irq_info)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 9de63f0..693394a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>      spapr->use_hotplug_event_source = value;
>  }
>
> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> +{
> +    return true;
> +}
> +
>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>      object_property_set_description(obj, "vsmt",
>                                      "Virtual SMT: KVM behaves as if this were"
>                                      " the host's SMT mode", &error_abort);
> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> +                             spapr_get_msix_emulation, NULL, NULL);

I prefer the approach where the property is in the PCI device, set by
the machine defaults mechanism, rather than the property being in the
machine and the device having to look it up.

If Alex prefers the latter though, I'm ok with that way around.

>  }
>
>  static void spapr_machine_finalizefn(Object *obj)
> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>  /*
>   * pseries-2.11
>   */
> +
>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>  {
>  }
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 1fb8a8e..da5f182 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>      return -ENODEV;
>  }
>
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> +{
> +    struct vfio_region_info *info = NULL;
> +    bool ret = false;
> +
> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> +        if (vfio_get_region_info_cap(info, cap_type)) {
> +            ret = true;
> +        }
> +        g_free(info);
> +    }
> +
> +    return ret;
> +}
> +
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c977ee3..27a3706 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      off_t start, end;
>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>
> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> +                            vdev->msix->table_bar)) {
> +        return;
> +    }
> +
>      /*
>       * We expect to find a single mmap covering the whole BAR, anything else
>       * means it's either unsupported or already setup.
> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vfio_pci_fixup_msix_region(vdev);
>  }
>
> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> +{
> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> +        return &region->mmaps[0].mem;
> +    }
> +
> +    return region->mem;
> +}
> +
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>  {
>      int ret;
> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>                                      sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>                      &err);
>      if (ret < 0) {
> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>       */
>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
>
> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> +                                 "vfio-no-msix-emulation", NULL)) {
> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> +    }
> +

As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
really VFIO specific.  It would make more sense to have the test
directly in msix_init(), and have it skip adding the msix region to
the BAR if the machine has said it's not necessary.

>      return 0;
>  }

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
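
The quoted commit message notes that a section which is not aligned to the page size is not registered with KVM as a memory slot. A small sketch of such an alignment test, applied to the BAR 1 ranges from the "info mtree -f" output in the original patch, may make the consequence concrete. This is an illustration under stated assumptions, not the KVM memory listener's actual code: the function name section_page_aligned() is hypothetical, the rule "both start and size are multiples of the page size" is an assumed reading of "aligned to the page size", and the 64K and 4K page sizes are example values that the thread does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when both the start address and the size of a
 * section are multiples of page_size. page_size must be a power of two.
 */
bool section_page_aligned(uint64_t start, uint64_t size, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* Any low bit set in either value means one of them is misaligned. */
    return ((start | size) & (page_size - 1)) == 0;
}
```

With emulation enabled, BAR 1 is split into 0x210000000000 (size 0xe000), the msix-table, and 0x21000000e600 (size 0x1a00). Under the assumptions above, neither mmaps[0] piece passes the test for a 64K page, while the unsplit 64K BAR (size 0x10000) does, which is consistent with the motivation for mapping the whole BAR in one chunk.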
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2017-12-19  4:52 ` David Gibson
@ 2017-12-19 17:26   ` Alex Williamson
  2017-12-20  1:13     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2017-12-19 17:26 UTC
To: David Gibson; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc

On Tue, 19 Dec 2017 15:52:02 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
> > This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
> > which tells that a region with MSIX data can be mapped entirely, i.e.
> > the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
> >
> > With this change, all BARs are mapped in a single chunk and MSIX vectors
> > are emulated on top unless the machine requests not to by defining and
> > enabling a new "vfio-no-msix-emulation" property. At the moment only
> > sPAPR machine does so - it prohibits MSIX emulation and does not allow
> > enabling it as it does not define the "set" callback for the new property;
> > the new property also does not appear in "-machine pseries,help".
> >
> > If the new capability is present, this puts MSIX IO memory region under
> > mapped memory region. If the capability is not there, it falls back to
> > the old behaviour with the sparse capability.
> >
> > If the MSIX vectors section is not aligned to the page size, the KVM memory
> > listener does not register it with the KVM as a memory slot and MSIX is
> > emulated by QEMU as before.
> > > > This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > > for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > --- > > > > This is mtree and flatview BEFORE this patch: > > > > "info mtree": > > memory-region: pci@800000020000000.mmio > > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > "info mtree -f": > > FlatView #0 > > AS "memory", root: system > > AS "cpu-memory", root: system > > Root memory region: system > > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > > 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > > 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > > > > > This is AFTER this patch applied: > > > > "info mtree": > > memory-region: pci@800000020000000.mmio > > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > > > "info mtree -f": > > FlatView #2 > > AS 
"memory", root: system > > AS "cpu-memory", root: system > > Root memory region: system > > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > > > > > This is AFTER this patch applied AND spapr_get_msix_emulation() patched > > to enable emulation: > > > > "info mtree": > > memory-region: pci@800000020000000.mmio > > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > "info mtree -f": > > FlatView #1 > > AS "memory", root: system > > AS "cpu-memory", root: system > > Root memory region: system > > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > > 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > > 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > --- > > include/hw/vfio/vfio-common.h | 1 + > > linux-headers/linux/vfio.h | 5 +++++ > > hw/ppc/spapr.c | 8 ++++++++ > > hw/vfio/common.c | 15 +++++++++++++++ > > hw/vfio/pci.c | 23 +++++++++++++++++++++-- > > 5 files changed, 50 insertions(+), 2 deletions(-) > > > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > > index f3a2ac9..927d600 100644 > > --- a/include/hw/vfio/vfio-common.h > > +++ 
b/include/hw/vfio/vfio-common.h > > @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > > struct vfio_region_info **info); > > int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > > uint32_t subtype, struct vfio_region_info **info); > > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); > > #endif > > extern const MemoryListener vfio_prereg_listener; > > > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > > index 4312e96..b45182e 100644 > > --- a/linux-headers/linux/vfio.h > > +++ b/linux-headers/linux/vfio.h > > @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { > > #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) > > #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) > > > > +/* > > + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. > > + */ > > +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > > + > > /** > > * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > > * struct vfio_irq_info) > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > > index 9de63f0..693394a 100644 > > --- a/hw/ppc/spapr.c > > +++ b/hw/ppc/spapr.c > > @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, > > spapr->use_hotplug_event_source = value; > > } > > > > +static bool spapr_get_msix_emulation(Object *obj, Error **errp) > > +{ > > + return true; > > +} > > + > > static char *spapr_get_resize_hpt(Object *obj, Error **errp) > > { > > sPAPRMachineState *spapr = SPAPR_MACHINE(obj); > > @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) > > object_property_set_description(obj, "vsmt", > > "Virtual SMT: KVM behaves as if this were" > > " the host's SMT mode", &error_abort); > > + object_property_add_bool(obj, "vfio-no-msix-emulation", > > + spapr_get_msix_emulation, NULL, NULL); > > I prefer the approach where the property is in the PCI device, set by > the machine defaults mechanism, rather than the 
property being in the
> machine and the device having to look it up.
>
> If Alex prefers the latter though, I'm ok with that way around.

I prefer anything that doesn't result in a user visible option.

> >  }
> >
> >  static void spapr_machine_finalizefn(Object *obj)
> > @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
> >  /*
> >   * pseries-2.11
> >   */
> > +
> >  static void spapr_machine_2_11_instance_options(MachineState *machine)
> >  {
> >  }
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 1fb8a8e..da5f182 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
> >      return -ENODEV;
> >  }
> >
> > +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> > +{
> > +    struct vfio_region_info *info = NULL;
> > +    bool ret = false;
> > +
> > +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> > +        if (vfio_get_region_info_cap(info, cap_type)) {
> > +            ret = true;
> > +        }
> > +        g_free(info);
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> >  /*
> >   * Interfaces for IBM EEH (Enhanced Error Handling)
> >   */
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c977ee3..27a3706 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >      off_t start, end;
> >      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
> >
> > +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> > +                            vdev->msix->table_bar)) {
> > +        return;
> > +    }
> > +
> >      /*
> >       * We expect to find a single mmap covering the whole BAR, anything else
> >       * means it's either unsupported or already setup.
> > @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >      vfio_pci_fixup_msix_region(vdev);
> >  }
> >
> > +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> > +{
> > +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> > +        return &region->mmaps[0].mem;
> > +    }
> > +
> > +    return region->mem;
> > +}
> > +
> >  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >  {
> >      int ret;
> > @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
> >                                      sizeof(unsigned long));
> >      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> > -                    vdev->bars[vdev->msix->table_bar].region.mem,
> > +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
> >                      vdev->msix->table_bar, vdev->msix->table_offset,
> > -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> > +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
> >                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
> >                      &err);
> >      if (ret < 0) {
> > @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> >       */
> >      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
> >
> > +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> > +                                 "vfio-no-msix-emulation", NULL)) {
> > +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> > +    }
> > +
>
> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't
> really VFIO specific.  It would make more sense to have the test
> directly in msix_init(), and have it skip adding the msix region to
> the BAR if the machine has said it's not necessary.

How much of the MSI-X capability do SPAPR guests make use of?  Asking
msix_init() to manage an MSI-X capability that references MMIO regions
on a BAR that aren't implemented is a bit incongruous.  Is there some
way that the MSI-X capability can make it clear that MSI-X MMIO isn't
implemented?  Offsets past the end of the BAR?  Reserved BIR values?
You're basically asking msix_init() to create either an incomplete or
invalid configuration.  Thanks,

Alex
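
For readers less familiar with the register Alex refers to: in the MSI-X capability, the Table Offset/BIR register keeps the BAR Indicator Register (BIR) in its low three bits and an 8-byte-aligned offset in the remaining bits; BIR values 6 and 7 are reserved, and each table entry is 16 bytes. The sketch below decodes such a register value and checks whether the table fits inside its BAR, which is one way to picture the "offset past the end of the BAR" and "reserved BIR" ideas. It is only an illustration: struct msix_loc, msix_decode() and msix_table_within_bar() are hypothetical names, not QEMU functions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSIX_BIR_MASK       0x7u /* low three bits select the BAR */
#define MSIX_BIR_MAX_VALID  5    /* BIR values 6 and 7 are reserved */
#define MSIX_ENTRY_BYTES    16   /* size of one MSI-X table entry */

/* Hypothetical decoded form of the Table Offset/BIR register. */
struct msix_loc {
    uint8_t  bir;    /* which BAR holds the table */
    uint32_t offset; /* byte offset of the table inside that BAR */
};

/* Split a raw Table Offset/BIR register value into BIR and offset. */
struct msix_loc msix_decode(uint32_t reg)
{
    struct msix_loc loc;

    loc.bir = (uint8_t)(reg & MSIX_BIR_MASK);
    loc.offset = reg & ~MSIX_BIR_MASK;
    return loc;
}

/*
 * True when the BIR is a non-reserved value and the whole table
 * (entries * 16 bytes) lies inside a BAR of bar_size bytes.
 */
bool msix_table_within_bar(struct msix_loc loc, uint16_t entries,
                           uint64_t bar_size)
{
    uint64_t table_bytes = (uint64_t)entries * MSIX_ENTRY_BYTES;

    if (loc.bir > MSIX_BIR_MAX_VALID) {
        return false;
    }
    if (loc.offset > bar_size) {
        return false;
    }
    return table_bytes <= bar_size - loc.offset;
}
```

As a usage example, the device in the mtree output has a 0x600-byte msix-table at offset 0xe000 of the 64K BAR 1, which corresponds to 96 entries and a register value of 0xe001; that decodes to BIR 1, offset 0xe000 and passes the check, while an offset of 0x20000 or a BIR of 6 would fail it.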
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2017-12-19 17:26   ` Alex Williamson
@ 2017-12-20  1:13     ` Alexey Kardashevskiy
  2017-12-20  3:24       ` David Gibson
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-20  1:13 UTC
To: Alex Williamson, David Gibson; +Cc: qemu-devel, qemu-ppc

On 20/12/17 04:26, Alex Williamson wrote:
> On Tue, 19 Dec 2017 15:52:02 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
>> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote:
>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability
>>> which tells that a region with MSIX data can be mapped entirely, i.e.
>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped.
>>>
>>> With this change, all BARs are mapped in a single chunk and MSIX vectors
>>> are emulated on top unless the machine requests not to by defining and
>>> enabling a new "vfio-no-msix-emulation" property. At the moment only
>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow
>>> enabling it as it does not define the "set" callback for the new property;
>>> the new property also does not appear in "-machine pseries,help".
>>>
>>> If the new capability is present, this puts MSIX IO memory region under
>>> mapped memory region. If the capability is not there, it falls back to
>>> the old behaviour with the sparse capability.
>>>
>>> If the MSIX vectors section is not aligned to the page size, the KVM memory
>>> listener does not register it with the KVM as a memory slot and MSIX is
>>> emulated by QEMU as before.
>>> >>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>> --- >>> >>> This is mtree and flatview BEFORE this patch: >>> >>> "info mtree": >>> memory-region: pci@800000020000000.mmio >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> >>> "info mtree -f": >>> FlatView #0 >>> AS "memory", root: system >>> AS "cpu-memory", root: system >>> Root memory region: system >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> >>> >>> >>> This is AFTER this patch applied: >>> >>> "info mtree": >>> memory-region: pci@800000020000000.mmio >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> >>> >>> "info mtree -f": >>> FlatView #2 >>> AS 
"memory", root: system >>> AS "cpu-memory", root: system >>> Root memory region: system >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> >>> >>> >>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>> to enable emulation: >>> >>> "info mtree": >>> memory-region: pci@800000020000000.mmio >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> >>> "info mtree -f": >>> FlatView #1 >>> AS "memory", root: system >>> AS "cpu-memory", root: system >>> Root memory region: system >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>> --- >>> include/hw/vfio/vfio-common.h | 1 + >>> linux-headers/linux/vfio.h | 5 +++++ >>> hw/ppc/spapr.c | 8 ++++++++ >>> hw/vfio/common.c | 15 +++++++++++++++ >>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>> 5 files changed, 50 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>> index f3a2ac9..927d600 100644 >>> --- a/include/hw/vfio/vfio-common.h >>> +++ 
b/include/hw/vfio/vfio-common.h >>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>> struct vfio_region_info **info); >>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>> uint32_t subtype, struct vfio_region_info **info); >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>> #endif >>> extern const MemoryListener vfio_prereg_listener; >>> >>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>> index 4312e96..b45182e 100644 >>> --- a/linux-headers/linux/vfio.h >>> +++ b/linux-headers/linux/vfio.h >>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>> >>> +/* >>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. >>> + */ >>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>> + >>> /** >>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>> * struct vfio_irq_info) >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>> index 9de63f0..693394a 100644 >>> --- a/hw/ppc/spapr.c >>> +++ b/hw/ppc/spapr.c >>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>> spapr->use_hotplug_event_source = value; >>> } >>> >>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>> +{ >>> + return true; >>> +} >>> + >>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>> { >>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>> object_property_set_description(obj, "vsmt", >>> "Virtual SMT: KVM behaves as if this were" >>> " the host's SMT mode", &error_abort); >>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>> + spapr_get_msix_emulation, NULL, NULL); >> >> I prefer the approach where the property is in the PCI device, set by >> the machine defaults mechanism, rather than 
the property being in the >> machine and the device having to look it up. >> >> If Alex prefers the latter though, I'm ok with that way around. > > I prefer anything that doesn't result in a user visible option. The only way to have this is an interface implemented by a machine. I can do that, call it "TYPE_MSIX_MMIO" or "TYPE_CONFIG_VFIO". >>> } >>> >>> static void spapr_machine_finalizefn(Object *obj) >>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>> /* >>> * pseries-2.11 >>> */ >>> + >>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>> { >>> } >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>> index 1fb8a8e..da5f182 100644 >>> --- a/hw/vfio/common.c >>> +++ b/hw/vfio/common.c >>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>> return -ENODEV; >>> } >>> >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>> +{ >>> + struct vfio_region_info *info = NULL; >>> + bool ret = false; >>> + >>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>> + if (vfio_get_region_info_cap(info, cap_type)) { >>> + ret = true; >>> + } >>> + g_free(info); >>> + } >>> + >>> + return ret; >>> +} >>> + >>> /* >>> * Interfaces for IBM EEH (Enhanced Error Handling) >>> */ >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>> index c977ee3..27a3706 100644 >>> --- a/hw/vfio/pci.c >>> +++ b/hw/vfio/pci.c >>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>> off_t start, end; >>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>> >>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>> + vdev->msix->table_bar)) { >>> + return; >>> + } >>> + >>> /* >>> * We expect to find a single mmap covering the whole BAR, anything else >>> * means it's either unsupported or already setup. 
>>> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) >>> vfio_pci_fixup_msix_region(vdev); >>> } >>> >>> +static MemoryRegion *vfio_msix_parent(VFIORegion *region) >>> +{ >>> + if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) { >>> + return &region->mmaps[0].mem; >>> + } >>> + >>> + return region->mem; >>> +} >>> + >>> static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) >>> { >>> int ret; >>> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) >>> vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) * >>> sizeof(unsigned long)); >>> ret = msix_init(&vdev->pdev, vdev->msix->entries, >>> - vdev->bars[vdev->msix->table_bar].region.mem, >>> + vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region), >>> vdev->msix->table_bar, vdev->msix->table_offset, >>> - vdev->bars[vdev->msix->pba_bar].region.mem, >>> + vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region), >>> vdev->msix->pba_bar, vdev->msix->pba_offset, pos, >>> &err); >>> if (ret < 0) { >>> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) >>> */ >>> memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false); >>> >>> + if (object_property_get_bool(OBJECT(qdev_get_machine()), >>> + "vfio-no-msix-emulation", NULL)) { >>> + memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false); >>> + } >>> + >> >> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't >> really VFIO specific. It would make more sense to have the test >> directly in msix_init(), and have it skip adding the msix region to >> the BAR if the machine has said it's not necessary. > > How much of the MSI-X capability do SPAPR guests make use of? Presence of the capability seems to be just enough.
I disabled the sanity test in msix_init(), hardcoded:
vdev->msix->table_offset = 0xff0000;
vdev->msix->pba_offset = 0xfff000;
in vfio_msix_setup() and got it up and running:

00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
PCI-Express Fusion-MPT SAS-3 (rev 02)

Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at 2000c0000000 [disabled] [size=1M]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
Vector table: BAR=1 offset=00ff0000
PBA: BAR=1 offset=00fff000


> Asking
> msix_init() to manage an MSI-X capability that references MMIO regions
> on a BAR that aren't implemented is a bit incongruous. Is there some
> way that the MSI-X capability can make it clear that MSI-X MMIO isn't
> implemented? Offsets past the end of the BAR?

This works.

> Reserved BIR values?

This does not - the only reserved value is 7 and it crashes as the device
does not implement BAR7 (ROM).


> You're basically asking msix_init() to create either an incomplete or
> invalid configuration. Thanks,

I still like "vfio-no-msix-emulation" better.


--
Alexey
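
For readers following the "offsets past the end of the BAR" experiment above, here is a minimal sketch of how such an offset can be recognised from the MSI-X capability registers alone. It assumes only the standard MSI-X capability layout (table size encoded as N-1 in Message Control bits 10:0, BIR in bits 2:0 of the Table Offset/BIR register, 16 bytes per table entry). The names MsixTableLoc, msix_decode_table and msix_table_fits_bar are made up for illustration; they are not QEMU or kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout of the standard PCI MSI-X capability. */
#define MSIX_CTRL_TABLE_SIZE_MASK 0x07ffu /* Message Control bits 10:0, N-1 encoded */
#define MSIX_BIR_MASK             0x7u    /* Table Offset/BIR bits 2:0 */
#define MSIX_ENTRY_SIZE           16u     /* bytes per vector table entry */

/* Hypothetical illustration type, not part of QEMU. */
typedef struct {
    unsigned bir;    /* index of the BAR that holds the vector table */
    uint32_t offset; /* byte offset of the table inside that BAR */
    uint32_t bytes;  /* size of the vector table in bytes */
} MsixTableLoc;

/* Decode Message Control and the Table Offset/BIR register. */
static MsixTableLoc msix_decode_table(uint16_t msg_ctrl, uint32_t table_reg)
{
    MsixTableLoc loc;

    loc.bir = table_reg & MSIX_BIR_MASK;
    loc.offset = table_reg & ~MSIX_BIR_MASK;
    loc.bytes = ((msg_ctrl & MSIX_CTRL_TABLE_SIZE_MASK) + 1u) * MSIX_ENTRY_SIZE;
    return loc;
}

/* True when the whole vector table lies inside a BAR of bar_size bytes. */
static bool msix_table_fits_bar(MsixTableLoc loc, uint64_t bar_size)
{
    return (uint64_t)loc.offset + loc.bytes <= bar_size;
}
```

With Count=96 the table is 96 * 16 = 0x600 bytes, which matches the e000-e5ff msix-table range in the mtree dumps. At offset 0xe000 it fits the 64K BAR 1; at the hardcoded offset 0xff0000 it does not, which is the "not implemented" signal being discussed.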
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2017-12-20 1:13 ` Alexey Kardashevskiy @ 2017-12-20 3:24 ` David Gibson 0 siblings, 0 replies; 19+ messages in thread From: David Gibson @ 2017-12-20 3:24 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Alex Williamson, qemu-devel, qemu-ppc [-- Attachment #1: Type: text/plain, Size: 14538 bytes --] On Wed, Dec 20, 2017 at 12:13:04PM +1100, Alexey Kardashevskiy wrote: > On 20/12/17 04:26, Alex Williamson wrote: > > On Tue, 19 Dec 2017 15:52:02 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > >> On Fri, Dec 15, 2017 at 05:29:14PM +1100, Alexey Kardashevskiy wrote: > >>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > >>> which tells that a region with MSIX data can be mapped entirely, i.e. > >>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > >>> > >>> With this change, all BARs are mapped in a single chunk and MSIX vectors > >>> are emulated on top unless the machine requests not to by defining and > >>> enabling a new "vfio-no-msix-emulation" property. At the moment only > >>> sPAPR machine does so - it prohibits MSIX emulation and does not allow > >>> enabling it as it does not define the "set" callback for the new property; > >>> the new property also does not appear in "-machine pseries,help". > >>> > >>> If the new capability is present, this puts MSIX IO memory region under > >>> mapped memory region. If the capability is not there, it falls back to > >>> the old behaviour with the sparse capability. > >>> > >>> In MSIX vectors section is not aligned to the page size, the KVM memory > >>> listener does not register it with the KVM as a memory slot and MSIX is > >>> emulated by QEMU as before. 
> >>> > >>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > >>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > >>> > >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>> --- > >>> > >>> This is mtree and flatview BEFORE this patch: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> "info mtree -f": > >>> FlatView #0 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> > >>> This is AFTER this patch applied: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, 
ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> "info mtree -f": > >>> FlatView #2 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> > >>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched > >>> to enable emulation: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> "info mtree -f": > >>> FlatView #1 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> --- > >>> include/hw/vfio/vfio-common.h | 1 + > >>> linux-headers/linux/vfio.h | 5 +++++ > >>> hw/ppc/spapr.c | 8 ++++++++ > >>> hw/vfio/common.c | 15 +++++++++++++++ > >>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- > >>> 5 files changed, 50 insertions(+), 2 deletions(-) > >>> > 
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > >>> index f3a2ac9..927d600 100644 > >>> --- a/include/hw/vfio/vfio-common.h > >>> +++ b/include/hw/vfio/vfio-common.h > >>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > >>> struct vfio_region_info **info); > >>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>> uint32_t subtype, struct vfio_region_info **info); > >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); > >>> #endif > >>> extern const MemoryListener vfio_prereg_listener; > >>> > >>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >>> index 4312e96..b45182e 100644 > >>> --- a/linux-headers/linux/vfio.h > >>> +++ b/linux-headers/linux/vfio.h > >>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { > >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) > >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) > >>> > >>> +/* > >>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
> >>> + */ > >>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > >>> + > >>> /** > >>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > >>> * struct vfio_irq_info) > >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>> index 9de63f0..693394a 100644 > >>> --- a/hw/ppc/spapr.c > >>> +++ b/hw/ppc/spapr.c > >>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, > >>> spapr->use_hotplug_event_source = value; > >>> } > >>> > >>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) > >>> +{ > >>> + return true; > >>> +} > >>> + > >>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) > >>> { > >>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); > >>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) > >>> object_property_set_description(obj, "vsmt", > >>> "Virtual SMT: KVM behaves as if this were" > >>> " the host's SMT mode", &error_abort); > >>> + object_property_add_bool(obj, "vfio-no-msix-emulation", > >>> + spapr_get_msix_emulation, NULL, NULL); > >> > >> I prefer the approach where the property is in the PCI device, set by > >> the machine defaults mechanism, rather than the property being in the > >> machine and the device having to look it up. > >> > >> If Alex prefers the latter though, I'm ok with that way around. > > > > I prefer anything that doesn't result in a user visible option. > > > The only way to have this is an interface implemented by a machine. I can > do that, call it "TYPE_MSIX_MMIO" or "TYPE_CONFIG_VFIO". A new interface seems overkill. You could just add a "needs_msix_emulation" method to MachineClass. It will return true by default, but pseries will override it.
(or invert the sense, I don't care) > >>> } > >>> > >>> static void spapr_machine_finalizefn(Object *obj) > >>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { > >>> /* > >>> * pseries-2.11 > >>> */ > >>> + > >>> static void spapr_machine_2_11_instance_options(MachineState *machine) > >>> { > >>> } > >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c > >>> index 1fb8a8e..da5f182 100644 > >>> --- a/hw/vfio/common.c > >>> +++ b/hw/vfio/common.c > >>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>> return -ENODEV; > >>> } > >>> > >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) > >>> +{ > >>> + struct vfio_region_info *info = NULL; > >>> + bool ret = false; > >>> + > >>> + if (!vfio_get_region_info(vbasedev, region, &info)) { > >>> + if (vfio_get_region_info_cap(info, cap_type)) { > >>> + ret = true; > >>> + } > >>> + g_free(info); > >>> + } > >>> + > >>> + return ret; > >>> +} > >>> + > >>> /* > >>> * Interfaces for IBM EEH (Enhanced Error Handling) > >>> */ > >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > >>> index c977ee3..27a3706 100644 > >>> --- a/hw/vfio/pci.c > >>> +++ b/hw/vfio/pci.c > >>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > >>> off_t start, end; > >>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; > >>> > >>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, > >>> + vdev->msix->table_bar)) { > >>> + return; > >>> + } > >>> + > >>> /* > >>> * We expect to find a single mmap covering the whole BAR, anything else > >>> * means it's either unsupported or already setup. 
> >>> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) > >>> vfio_pci_fixup_msix_region(vdev); > >>> } > >>> > >>> +static MemoryRegion *vfio_msix_parent(VFIORegion *region) > >>> +{ > >>> + if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) { > >>> + return &region->mmaps[0].mem; > >>> + } > >>> + > >>> + return region->mem; > >>> +} > >>> + > >>> static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) > >>> { > >>> int ret; > >>> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) > >>> vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) * > >>> sizeof(unsigned long)); > >>> ret = msix_init(&vdev->pdev, vdev->msix->entries, > >>> - vdev->bars[vdev->msix->table_bar].region.mem, > >>> + vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region), > >>> vdev->msix->table_bar, vdev->msix->table_offset, > >>> - vdev->bars[vdev->msix->pba_bar].region.mem, > >>> + vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region), > >>> vdev->msix->pba_bar, vdev->msix->pba_offset, pos, > >>> &err); > >>> if (ret < 0) { > >>> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp) > >>> */ > >>> memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false); > >>> > >>> + if (object_property_get_bool(OBJECT(qdev_get_machine()), > >>> + "vfio-no-msix-emulation", NULL)) { > >>> + memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false); > >>> + } > >>> + > >> > >> As Alex pointed out elsewhere, the need for the MSI-X emulation isn't > >> really VFIO specific. It would make more sense to have the test > >> directly in msix_init(), and have it skip adding the msix region to > >> the BAR if the machine has said it's not necessary. > > > > How much of the MSI-X capability do SPAPR guests make use of? > > Presence of the capability seems to be just enough.
>
> I disabled the sanity test in msix_init(), hardcoded:
> vdev->msix->table_offset = 0xff0000;
> vdev->msix->pba_offset = 0xfff000;
> in vfio_msix_setup() and got it up and running:
>
> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> PCI-Express Fusion-MPT SAS-3 (rev 02)
>
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> Expansion ROM at 2000c0000000 [disabled] [size=1M]
>
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
> Vector table: BAR=1 offset=00ff0000
> PBA: BAR=1 offset=00fff000
>
>
> > Asking
> > msix_init() to manage an MSI-X capability that references MMIO regions
> > on a BAR that aren't implemented is a bit incongruous. Is there some
> > way that the MSI-X capability can make it clear that MSI-X MMIO isn't
> > implemented? Offsets past the end of the BAR?
>
> This works.

Sounds ok.

> > Reserved BIR values?
>
> This does not - the only reserved value is 7 and it crashes as the device
> does not implement BAR7 (ROM).
>
>
> > You're basically asking msix_init() to create either an incomplete or
> > invalid configuration. Thanks,

I guess. I mean this is a platform specific performance hack, I feel
as though hackiness is to be expected.

> I still like "vfio-no-msix-emulation" better.
>
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
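
The "needs_msix_emulation" method suggested earlier in this message could take roughly the following shape. This is only a self-contained sketch of the default-plus-override pattern: ToyMachineClass, toy_generic, toy_pseries and machine_needs_msix_emulation are hypothetical stand-ins, not QEMU's MachineClass or any existing QEMU hook.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a machine class; this is NOT QEMU's MachineClass. */
typedef struct ToyMachineClass {
    const char *name;
    /* Optional hook; NULL means "use the default", i.e. emulate MSI-X. */
    bool (*needs_msix_emulation)(const struct ToyMachineClass *mc);
} ToyMachineClass;

/* Override for a pseries-like machine: no MSI-X table emulation wanted. */
static bool pseries_needs_msix_emulation(const ToyMachineClass *mc)
{
    (void)mc;
    return false;
}

/* A machine that keeps the default and one that overrides it. */
static const ToyMachineClass toy_generic = { "generic", NULL };
static const ToyMachineClass toy_pseries = { "pseries",
                                             pseries_needs_msix_emulation };

/* Query helper: returns true by default, defers to the hook when set. */
static bool machine_needs_msix_emulation(const ToyMachineClass *mc)
{
    return mc->needs_msix_emulation ? mc->needs_msix_emulation(mc) : true;
}
```

A caller such as the MSI-X setup path would then ask machine_needs_msix_emulation() once, instead of looking up a user-visible machine property by name.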
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2017-12-15 6:29 [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR Alexey Kardashevskiy 2017-12-19 4:52 ` David Gibson @ 2018-01-05 9:48 ` Auger Eric 2018-01-05 15:29 ` Alex Williamson 1 sibling, 1 reply; 19+ messages in thread From: Auger Eric @ 2018-01-05 9:48 UTC (permalink / raw) To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, David Gibson Hi Alexey, On 15/12/17 07:29, Alexey Kardashevskiy wrote: > This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > which tells that a region with MSIX data can be mapped entirely, i.e. > the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > > With this change, all BARs are mapped in a single chunk and MSIX vectors > are emulated on top unless the machine requests not to by defining and > enabling a new "vfio-no-msix-emulation" property. At the moment only > sPAPR machine does so - it prohibits MSIX emulation and does not allow > enabling it as it does not define the "set" callback for the new property; > the new property also does not appear in "-machine pseries,help". > > If the new capability is present, this puts MSIX IO memory region under > mapped memory region. If the capability is not there, it falls back to > the old behaviour with the sparse capability. > > In MSIX vectors section is not aligned to the page size, the KVM memory > listener does not register it with the KVM as a memory slot and MSIX is > emulated by QEMU as before. 
> > This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > > This is mtree and flatview BEFORE this patch: > > "info mtree": > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > "info mtree -f": > FlatView #0 > AS "memory", root: system > AS "cpu-memory", root: system > Root memory region: system > 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > > This is AFTER this patch applied: > > "info mtree": > memory-region: pci@800000020000000.mmio > 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > > > "info mtree -f": > FlatView #2 > AS "memory", root: system > AS "cpu-memory", root: system > Root memory region: system > 
0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
>
>
> This is AFTER this patch applied AND spapr_get_msix_emulation() patched
> to enable emulation:
>
> "info mtree":
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1
>       0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3
>       0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>
> "info mtree -f":
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0]
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> ---
>  include/hw/vfio/vfio-common.h |  1 +
>  linux-headers/linux/vfio.h    |  5 +++++
>  hw/ppc/spapr.c                |  8 ++++++++
>  hw/vfio/common.c              | 15 +++++++++++++++
>  hw/vfio/pci.c                 | 23 +++++++++++++++++++++--
>  5 files changed, 50 insertions(+), 2 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index f3a2ac9..927d600 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
>                           struct vfio_region_info **info);
>  int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>                               uint32_t subtype, struct vfio_region_info **info);
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
>
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 4312e96..b45182e 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2)
>  #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3)
>
> +/*
> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped.
> + */
> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   *                                  struct vfio_irq_info)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 9de63f0..693394a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
>      spapr->use_hotplug_event_source = value;
>  }
>
> +static bool spapr_get_msix_emulation(Object *obj, Error **errp)
> +{
> +    return true;
> +}
> +
>  static char *spapr_get_resize_hpt(Object *obj, Error **errp)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj)
>      object_property_set_description(obj, "vsmt",
>                                      "Virtual SMT: KVM behaves as if this were"
>                                      " the host's SMT mode", &error_abort);
> +    object_property_add_bool(obj, "vfio-no-msix-emulation",
> +                             spapr_get_msix_emulation, NULL, NULL);
>  }
>
>  static void spapr_machine_finalizefn(Object *obj)
> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = {
>  /*
>   * pseries-2.11
>   */
> +
>  static void spapr_machine_2_11_instance_options(MachineState *machine)
>  {
>  }
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 1fb8a8e..da5f182 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
>      return -ENODEV;
>  }
>
> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region)
> +{
> +    struct vfio_region_info *info = NULL;
> +    bool ret = false;
> +
> +    if (!vfio_get_region_info(vbasedev, region, &info)) {
> +        if (vfio_get_region_info_cap(info, cap_type)) {
> +            ret = true;
> +        }
> +        g_free(info);
> +    }
> +
> +    return ret;
> +}
> +
>  /*
>   * Interfaces for IBM EEH (Enhanced Error Handling)
>   */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c977ee3..27a3706 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      off_t start, end;
>      VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
>
> +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> +                            vdev->msix->table_bar)) {
> +        return;
> +    }

I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
msix table at 0x2000, using 64kB pages.

0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
  0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 mmaps[0]
  0000000000002000-00000000000023ff (prio 0, i/o): msix-table
  0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]

My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
vfio_listener_region_add_ram gets called on the first 8kB which cannot
be mapped with 64kB page, leading to a qemu abort.

29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
0x8000000000 - 0x8000001fff [0xffffa8820000]
qemu-system-aarch64: VFIO_MAP_DMA: -22
qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
0xffffa8820000) = -22 (Invalid argument)
qemu: hardware error: vfio: DMA mapping failed, unable to continue

In my case don't we still need the fixup logic?
Thanks

Eric
> +
>      /*
>       * We expect to find a single mmap covering the whole BAR, anything else
>       * means it's either unsupported or already setup.
> @@ -1432,6 +1437,15 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vfio_pci_fixup_msix_region(vdev);
>  }
>
> +static MemoryRegion *vfio_msix_parent(VFIORegion *region)
> +{
> +    if (region->nr_mmaps == 1 && region->mmaps[0].size == region->size) {
> +        return &region->mmaps[0].mem;
> +    }
> +
> +    return region->mem;
> +}
> +
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>  {
>      int ret;
> @@ -1440,9 +1454,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>                                      sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->table_bar].region),
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> +                    vfio_msix_parent(&vdev->bars[vdev->msix->pba_bar].region),
>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>                      &err);
>      if (ret < 0) {
> @@ -1473,6 +1487,11 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>       */
>      memory_region_set_enabled(&vdev->pdev.msix_pba_mmio, false);
>
> +    if (object_property_get_bool(OBJECT(qdev_get_machine()),
> +                                 "vfio-no-msix-emulation", NULL)) {
> +        memory_region_set_enabled(&vdev->pdev.msix_table_mmio, false);
> +    }
> +
>      return 0;
>  }
>
>
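
[Editor's illustration, not part of the posted patch.] The vfio_is_cap_present() helper in the quoted diff relies on vfio_get_region_info_cap() to decide whether a capability is attached to a region. As a rough sketch of how such a lookup can work, the code below walks a chain of capability headers stored in a byte buffer. The names cap_header and find_cap are hypothetical stand-ins, not the definitions from linux/vfio.h, and the layout assumed here (16-bit id, 16-bit version, 32-bit offset of the next header measured from the start of the buffer, with 0 ending the chain) is an assumption about the capability-chain format rather than something stated in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a capability header (assumed layout). */
struct cap_header {
    uint16_t id;       /* capability identifier */
    uint16_t version;  /* capability version */
    uint32_t next;     /* offset of the next header from buffer start, 0 = end */
};

/*
 * Walk the capability chain in buf, starting at offset first.
 * Returns 1 if a header with the given id is found, 0 otherwise.
 * Offsets that would read past the buffer end the walk, and the
 * step counter bounds the loop if the chain is cyclic.
 */
static int find_cap(const uint8_t *buf, size_t len, uint32_t first, uint16_t id)
{
    uint32_t off = first;
    size_t steps = 0;

    while (off != 0 && steps++ < len) {
        struct cap_header hdr;

        if (off > len || len - off < sizeof(hdr)) {
            return 0;   /* malformed chain: header does not fit */
        }
        /* memcpy avoids unaligned reads from the byte buffer */
        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.id == id) {
            return 1;
        }
        off = hdr.next;
    }
    return 0;
}
```

With a chain holding ids 1 and 3, find_cap(buf, len, first, 3) reports the capability as present, which is the yes/no answer the patch uses to decide whether to skip the fixup.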
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2018-01-05 9:48 ` Auger Eric
@ 2018-01-05 15:29 ` Alex Williamson
  2018-01-16 5:17 ` Alexey Kardashevskiy
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Williamson @ 2018-01-05 15:29 UTC (permalink / raw)
  To: Auger Eric; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, David Gibson

On Fri, 5 Jan 2018 10:48:07 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alexey,
>
> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
> > [...]
> > +    if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE,
> > +                            vdev->msix->table_bar)) {
> > +        return;
> > +    }
>
> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 +
> msix table at 0x2000, using 64kB pages.
>
> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0
>   0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 mmaps[0]
>   0000000000002000-00000000000023ff (prio 0, i/o): msix-table
>   0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled]
>
>
> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
> vfio_listener_region_add_ram gets called on the first 8kB which cannot
> be mapped with 64kB page, leading to a qemu abort.
>
> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
> 0x8000000000 - 0x8000001fff [0xffffa8820000]
> qemu-system-aarch64: VFIO_MAP_DMA: -22
> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
> 0xffffa8820000) = -22 (Invalid argument)
> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>
> In my case don't we still need the fixup logic?

Ah, I didn't realize you had Alexey's QEMU patch when you asked about
this. I'd say no, the fixup shouldn't be needed, but apparently we
can't remove it until we figure out this bug and who should be filtering
out sub-page size listener events. Thanks,

Alex
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2018-01-05 15:29 ` Alex Williamson
@ 2018-01-16 5:17 ` Alexey Kardashevskiy
  2018-01-17 8:17 ` Auger Eric
  2018-01-18 21:59 ` Alex Williamson
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Kardashevskiy @ 2018-01-16 5:17 UTC (permalink / raw)
  To: Alex Williamson, Auger Eric; +Cc: qemu-devel, qemu-ppc, David Gibson

On 06/01/18 02:29, Alex Williamson wrote:
> On Fri, 5 Jan 2018 10:48:07 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
>
>> Hi Alexey,
>>
>> On 15/12/17 07:29, Alexey Kardashevskiy wrote:
>>> [...]
>>
>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass
>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the
>> vfio_listener_region_add_ram gets called on the first 8kB which cannot
>> be mapped with 64kB page, leading to a qemu abort.
>>
>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram]
>> 0x8000000000 - 0x8000001fff [0xffffa8820000]
>> qemu-system-aarch64: VFIO_MAP_DMA: -22
>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000,
>> 0xffffa8820000) = -22 (Invalid argument)
>> qemu: hardware error: vfio: DMA mapping failed, unable to continue
>>
>> In my case don't we still need the fixup logic?
>
> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> this. I'd say no, the fixup shouldn't be needed, but apparently we
> can't remove it until we figure out this bug and who should be filtering
> out sub-page size listener events. Thanks,


This should do it:

@@ -303,11 +303,19 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
             * Sizing an enabled 64-bit BAR can cause spurious mappings to
             * addresses in the upper part of the 64-bit address space. These
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           /*
+            * Non-system-page aligned RAM regions have no chance to be
+            * DMA mapped so skip them too.
+            */
+           (memory_region_is_ram(section->mr) &&
+            ((section->offset_within_region & qemu_real_host_page_mask) ||
+             (int128_getlo(section->size) & qemu_real_host_page_mask))
+           );
 }


Do we VFIO DMA map on every RAM MR just because it is not necessary to have
more complicated logic to detect whether it is MMIO? Or there is something
bigger?

As for now, it is possible to have non-aligned RAM MR, the KVM's listener
just does not register it with the KVM if unaligned.


--
Alexey
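
[Editor's illustration, not part of the thread.] The check proposed in the hunk above is meant to skip RAM sections whose offset or size is not a multiple of the host page size. A minimal standalone sketch of that idea follows. The function name section_is_unaligned and its parameters are hypothetical, and the page size is passed in explicitly rather than read from QEMU globals. The sketch tests the low bits with page_size - 1; how that relates to the polarity of qemu_real_host_page_mask in the hunk is not settled at this point in the thread, so treat the mask handling here as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a section of `size` bytes starting at `offset`
 * is not page aligned, i.e. either value is not a multiple of
 * page_size. page_size is assumed to be a power of two.
 */
static bool section_is_unaligned(uint64_t offset, uint64_t size,
                                 uint64_t page_size)
{
    uint64_t low_bits = page_size - 1;  /* 0xffff for 64kB pages */

    return (offset & low_bits) != 0 || (size & low_bits) != 0;
}
```

Plugging in the numbers from Eric's trace (address 0x8000000000, size 0x2000) with 64kB pages gives true, because 0x2000 is smaller than one page; with 4kB pages the same section is aligned and the function returns false.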
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR
  2018-01-16 5:17 ` Alexey Kardashevskiy
@ 2018-01-17 8:17 ` Auger Eric
  2018-01-18 21:59 ` Alex Williamson
  1 sibling, 0 replies; 19+ messages in thread
From: Auger Eric @ 2018-01-17 8:17 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Alex Williamson; +Cc: qemu-ppc, qemu-devel, David Gibson

Hi Alexey,

On 16/01/18 06:17, Alexey Kardashevskiy wrote:
> On 06/01/18 02:29, Alex Williamson wrote:
>> On Fri, 5 Jan 2018 10:48:07 +0100
>> Auger Eric <eric.auger@redhat.com> wrote:
>>
>>> [...]
>>>
>>> In my case don't we still need the fixup logic?
>>
>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>> this. I'd say no, the fixup shouldn't be needed, but apparently we
>> can't remove it until we figure out this bug and who should be filtering
>> out sub-page size listener events. Thanks,
>
>
> This should do it:
>
> @@ -303,11 +303,19 @@ static bool
> vfio_listener_skipped_section(MemoryRegionSection *section)
>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>              * addresses in the upper part of the 64-bit address space. These
>              * are never accessed by the CPU and beyond the address width of
>              * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
>              */
> -           section->offset_within_address_space & (1ULL << 63);
> +           section->offset_within_address_space & (1ULL << 63) ||
> +           /*
> +            * Non-system-page aligned RAM regions have no chance to be
> +            * DMA mapped so skip them too.
> +            */
> +           (memory_region_is_ram(section->mr) &&
> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
> +           );
>  }

This should indeed avoid the vfio_dma_map error. I can't test it at the
moment since I don't have access to this HW anymore but as soon as I
recover it, I will let you know.

Thanks

Eric
>
>
> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> more complicated logic to detect whether it is MMIO?
Or there is something > bigger? > > As for now, it is possible to have non-aligned RAM MR, the KVM's listener > just does not register it with the KVM if unaligned.
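For reference, a minimal sketch of the alignment test that the vfio_listener_skipped_section() hunk quoted above appears to be aiming for. It assumes the page mask is defined as ~(page_size - 1), which is how qemu_real_host_page_mask is understood to be defined; under that assumption the in-page offset bits have to be tested with the inverted mask. The helper name and the standalone types are hypothetical and are not QEMU code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of QEMU: returns true when both the start
 * and the size of a section are multiples of page_size.
 * page_size is assumed to be a power of two.
 */
static bool section_is_page_aligned(uint64_t start, uint64_t size,
                                    uint64_t page_size)
{
    /* A "page mask" in the usual sense keeps the page-number bits. */
    uint64_t page_mask = ~(page_size - 1);

    /* The in-page offset bits are therefore selected by the inverted mask. */
    return ((start & ~page_mask) == 0) && ((size & ~page_mask) == 0);
}
```

With the numbers from the trace above (start 0x8000000000, size 0x2000), the check fails for a 64kB page and passes for a 4kB page, which matches the reported VFIO_MAP_DMA failure on the 64kB host.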
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-16 5:17 ` Alexey Kardashevskiy 2018-01-17 8:17 ` Auger Eric @ 2018-01-18 21:59 ` Alex Williamson 2018-01-19 2:41 ` Alexey Kardashevskiy 2018-01-19 8:55 ` Alexey Kardashevskiy 1 sibling, 2 replies; 19+ messages in thread From: Alex Williamson @ 2018-01-18 21:59 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On Tue, 16 Jan 2018 16:17:58 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 06/01/18 02:29, Alex Williamson wrote: > > On Fri, 5 Jan 2018 10:48:07 +0100 > > Auger Eric <eric.auger@redhat.com> wrote: > > > >> Hi Alexey, > >> > >> On 15/12/17 07:29, Alexey Kardashevskiy wrote: > >>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > >>> which tells that a region with MSIX data can be mapped entirely, i.e. > >>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > >>> > >>> With this change, all BARs are mapped in a single chunk and MSIX vectors > >>> are emulated on top unless the machine requests not to by defining and > >>> enabling a new "vfio-no-msix-emulation" property. At the moment only > >>> sPAPR machine does so - it prohibits MSIX emulation and does not allow > >>> enabling it as it does not define the "set" callback for the new property; > >>> the new property also does not appear in "-machine pseries,help". > >>> > >>> If the new capability is present, this puts MSIX IO memory region under > >>> mapped memory region. If the capability is not there, it falls back to > >>> the old behaviour with the sparse capability. > >>> > >>> In MSIX vectors section is not aligned to the page size, the KVM memory > >>> listener does not register it with the KVM as a memory slot and MSIX is > >>> emulated by QEMU as before. 
> >>> > >>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > >>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > >>> > >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>> --- > >>> > >>> This is mtree and flatview BEFORE this patch: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> "info mtree -f": > >>> FlatView #0 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> > >>> This is AFTER this patch applied: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, 
ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> "info mtree -f": > >>> FlatView #2 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> > >>> > >>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched > >>> to enable emulation: > >>> > >>> "info mtree": > >>> memory-region: pci@800000020000000.mmio > >>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> > >>> "info mtree -f": > >>> FlatView #1 > >>> AS "memory", root: system > >>> AS "cpu-memory", root: system > >>> Root memory region: system > >>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > >>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>> --- > >>> include/hw/vfio/vfio-common.h | 1 + > >>> linux-headers/linux/vfio.h | 5 +++++ > >>> hw/ppc/spapr.c | 8 ++++++++ > >>> hw/vfio/common.c | 15 +++++++++++++++ > >>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- > >>> 5 files changed, 50 insertions(+), 2 deletions(-) > >>> > 
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > >>> index f3a2ac9..927d600 100644 > >>> --- a/include/hw/vfio/vfio-common.h > >>> +++ b/include/hw/vfio/vfio-common.h > >>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > >>> struct vfio_region_info **info); > >>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>> uint32_t subtype, struct vfio_region_info **info); > >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); > >>> #endif > >>> extern const MemoryListener vfio_prereg_listener; > >>> > >>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >>> index 4312e96..b45182e 100644 > >>> --- a/linux-headers/linux/vfio.h > >>> +++ b/linux-headers/linux/vfio.h > >>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { > >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) > >>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) > >>> > >>> +/* > >>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
> >>> + */ > >>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > >>> + > >>> /** > >>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > >>> * struct vfio_irq_info) > >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>> index 9de63f0..693394a 100644 > >>> --- a/hw/ppc/spapr.c > >>> +++ b/hw/ppc/spapr.c > >>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, > >>> spapr->use_hotplug_event_source = value; > >>> } > >>> > >>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) > >>> +{ > >>> + return true; > >>> +} > >>> + > >>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) > >>> { > >>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); > >>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) > >>> object_property_set_description(obj, "vsmt", > >>> "Virtual SMT: KVM behaves as if this were" > >>> " the host's SMT mode", &error_abort); > >>> + object_property_add_bool(obj, "vfio-no-msix-emulation", > >>> + spapr_get_msix_emulation, NULL, NULL); > >>> } > >>> > >>> static void spapr_machine_finalizefn(Object *obj) > >>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { > >>> /* > >>> * pseries-2.11 > >>> */ > >>> + > >>> static void spapr_machine_2_11_instance_options(MachineState *machine) > >>> { > >>> } > >>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c > >>> index 1fb8a8e..da5f182 100644 > >>> --- a/hw/vfio/common.c > >>> +++ b/hw/vfio/common.c > >>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>> return -ENODEV; > >>> } > >>> > >>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) > >>> +{ > >>> + struct vfio_region_info *info = NULL; > >>> + bool ret = false; > >>> + > >>> + if (!vfio_get_region_info(vbasedev, region, &info)) { > >>> + if (vfio_get_region_info_cap(info, cap_type)) { > >>> + ret = true; > >>> + } > >>> + g_free(info); > >>> + } > >>> + > >>> + return ret; 
> >>> +} > >>> + > >>> /* > >>> * Interfaces for IBM EEH (Enhanced Error Handling) > >>> */ > >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > >>> index c977ee3..27a3706 100644 > >>> --- a/hw/vfio/pci.c > >>> +++ b/hw/vfio/pci.c > >>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > >>> off_t start, end; > >>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; > >>> > >>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, > >>> + vdev->msix->table_bar)) { > >>> + return; > >>> + } > >> > >> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + > >> msix table at 0x2000, using 64kB pages. > >> > >> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 > >> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 > >> mmaps[0] > >> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table > >> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] > >> > >> > >> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass > >> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the > >> vfio_listener_region_add_ram gets called on the first 8kB which cannot > >> be mapped with 64kB page, leading to a qemu abort. > >> > >> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] > >> 0x8000000000 - 0x8000001fff [0xffffa8820000] > >> qemu-system-aarch64: VFIO_MAP_DMA: -22 > >> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, > >> 0xffffa8820000) = -22 (Invalid argument) > >> qemu: hardware error: vfio: DMA mapping failed, unable to continue > >> > >> In my case don't we still need the fixup logic? > > > > Ah, I didn't realize you had Alexey's QEMU patch when you asked about > > this. I'd say no, the fixup shouldn't be needed, but apparently we > > can't remove it until we figure out this bug and who should be filtering > > out sub-page size listener events. 
Thanks, > > > This should do it: > > @@ -303,11 +303,19 @@ static bool > vfio_listener_skipped_section(MemoryRegionSection *section) > * Sizing an enabled 64-bit BAR can cause spurious mappings to > * addresses in the upper part of the 64-bit address space. These > * are never accessed by the CPU and beyond the address width of > * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. > */ > - section->offset_within_address_space & (1ULL << 63); > + section->offset_within_address_space & (1ULL << 63) || > + /* > + * Non-system-page aligned RAM regions have no chance to be > + * DMA mapped so skip them too. > + */ > + (memory_region_is_ram(section->mr) && > + ((section->offset_within_region & qemu_real_host_page_mask) || > + (int128_getlo(section->size) & qemu_real_host_page_mask)) > + ); > } > > > Do we VFIO DMA map on every RAM MR just because it is not necessary to have > more complicated logic to detect whether it is MMIO? Or there is something > bigger? Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the silly address hack above to try to exclude PCI BARs as they're being resized, and b) there is some value to enabling peer-to-peer mappings between devices. Is there a better, more generic solution to a) now? If we know a MR is MMIO, then I think we can handle any associated mapping failure as non-fatal. It's nice to enable p2p, but it's barely used and not clear that it works reliably (sometimes even on bare metal). Nit, we don't yet have the IOVA range of the IOMMU exposed, thus the silly address hack, but we do have supported page sizes, so we should probably be referencing that rather than host page size. > As for now, it is possible to have non-aligned RAM MR, the KVM's listener > just does not register it with the KVM if unaligned. For a sanity check, we don't support running a guest with a smaller page size than the host, right? 
I assume that would be the only case where we would have a reasonable chance of the guest using a host sub-page as a DMA target, but that seems like just one of an infinite number of problems with such a setup. Thanks, Alex
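A minimal sketch of the "reference the supported IOMMU page sizes rather than the host page size" idea from the message above. It assumes the supported sizes are available as a bitmap in which bit n set means a 2^n-byte page is supported (the layout believed to be used by the iova_pgsizes field of the type1 IOMMU info). The function names are hypothetical; nothing here is taken from QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest supported IOMMU page size, taken as the
 * lowest set bit of the page-size bitmap. Returns 0 for an empty bitmap.
 */
static uint64_t iommu_min_page_size(uint64_t pgsizes)
{
    /* x & -x isolates the lowest set bit; written without unary minus. */
    return pgsizes & (~pgsizes + 1);
}

/*
 * Hypothetical helper: true when [iova, iova + size) starts and ends on a
 * boundary of the smallest supported IOMMU page.
 */
static bool iommu_can_map(uint64_t iova, uint64_t size, uint64_t pgsizes)
{
    uint64_t min_page = iommu_min_page_size(pgsizes);

    if (min_page == 0) {
        return false;           /* no page size advertised at all */
    }
    return ((iova | size) & (min_page - 1)) == 0;
}
```

Under these assumptions the 8kB section from Eric's report would be accepted when the IOMMU advertises 4kB pages and rejected when it only advertises 64kB pages, independent of the host page size.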
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-18 21:59 ` Alex Williamson @ 2018-01-19 2:41 ` Alexey Kardashevskiy 2018-01-19 3:25 ` Alex Williamson 2018-01-19 8:55 ` Alexey Kardashevskiy 1 sibling, 1 reply; 19+ messages in thread From: Alexey Kardashevskiy @ 2018-01-19 2:41 UTC (permalink / raw) To: Alex Williamson; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On 19/01/18 08:59, Alex Williamson wrote: > On Tue, 16 Jan 2018 16:17:58 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 06/01/18 02:29, Alex Williamson wrote: >>> On Fri, 5 Jan 2018 10:48:07 +0100 >>> Auger Eric <eric.auger@redhat.com> wrote: >>> >>>> Hi Alexey, >>>> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>> >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>> are emulated on top unless the machine requests not to by defining and >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>> enabling it as it does not define the "set" callback for the new property; >>>>> the new property also does not appear in "-machine pseries,help". >>>>> >>>>> If the new capability is present, this puts MSIX IO memory region under >>>>> mapped memory region. If the capability is not there, it falls back to >>>>> the old behaviour with the sparse capability. >>>>> >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>> emulated by QEMU as before. 
>>>>> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>> --- >>>>> >>>>> This is mtree and flatview BEFORE this patch: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> "info mtree -f": >>>>> FlatView #0 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> >>>>> This is AFTER this patch applied: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, 
ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> "info mtree -f": >>>>> FlatView #2 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>> to enable emulation: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> "info mtree -f": >>>>> FlatView #1 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> --- >>>>> include/hw/vfio/vfio-common.h | 1 + >>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>> hw/ppc/spapr.c | 8 ++++++++ >>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>> 
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>> index f3a2ac9..927d600 100644 >>>>> --- a/include/hw/vfio/vfio-common.h >>>>> +++ b/include/hw/vfio/vfio-common.h >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>> struct vfio_region_info **info); >>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>> uint32_t subtype, struct vfio_region_info **info); >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>> #endif >>>>> extern const MemoryListener vfio_prereg_listener; >>>>> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>> index 4312e96..b45182e 100644 >>>>> --- a/linux-headers/linux/vfio.h >>>>> +++ b/linux-headers/linux/vfio.h >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>> >>>>> +/* >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>> + */ >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>> + >>>>> /** >>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>> * struct vfio_irq_info) >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>> index 9de63f0..693394a 100644 >>>>> --- a/hw/ppc/spapr.c >>>>> +++ b/hw/ppc/spapr.c >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>> spapr->use_hotplug_event_source = value; >>>>> } >>>>> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>> +{ >>>>> + return true; >>>>> +} >>>>> + >>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>> { >>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>> object_property_set_description(obj, "vsmt", >>>>> "Virtual SMT: KVM behaves as if this were" >>>>> " the host's SMT mode", &error_abort); >>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>> } >>>>> >>>>> static void spapr_machine_finalizefn(Object *obj) >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>> /* >>>>> * pseries-2.11 >>>>> */ >>>>> + >>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>> { >>>>> } >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>> index 1fb8a8e..da5f182 100644 >>>>> --- a/hw/vfio/common.c >>>>> +++ b/hw/vfio/common.c >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>> return -ENODEV; >>>>> } >>>>> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>> +{ >>>>> + struct vfio_region_info *info = NULL; >>>>> + bool ret = false; >>>>> + >>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>> + if (vfio_get_region_info_cap(info, cap_type)) { >>>>> + ret = true; >>>>> + } >>>>> + g_free(info); >>>>> + } >>>>> + >>>>> + return ret; 
>>>>> +} >>>>> + >>>>> /* >>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>> */ >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>> index c977ee3..27a3706 100644 >>>>> --- a/hw/vfio/pci.c >>>>> +++ b/hw/vfio/pci.c >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>> off_t start, end; >>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>> >>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>> + vdev->msix->table_bar)) { >>>>> + return; >>>>> + } >>>> >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>> msix table at 0x2000, using 64kB pages. >>>> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>> mmaps[0] >>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>> >>>> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>> be mapped with 64kB page, leading to a qemu abort. >>>> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>> 0xffffa8820000) = -22 (Invalid argument) >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>> >>>> In my case don't we still need the fixup logic? >>> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about >>> this. I'd say no, the fixup shouldn't be needed, but apparently we >>> can't remove it until we figure out this bug and who should be filtering >>> out sub-page size listener events. 
Thanks, >> >> >> This should do it: >> >> @@ -303,11 +303,19 @@ static bool >> vfio_listener_skipped_section(MemoryRegionSection *section) >> * Sizing an enabled 64-bit BAR can cause spurious mappings to >> * addresses in the upper part of the 64-bit address space. These >> * are never accessed by the CPU and beyond the address width of >> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. >> */ >> - section->offset_within_address_space & (1ULL << 63); >> + section->offset_within_address_space & (1ULL << 63) || >> + /* >> + * Non-system-page aligned RAM regions have no chance to be >> + * DMA mapped so skip them too. >> + */ >> + (memory_region_is_ram(section->mr) && >> + ((section->offset_within_region & qemu_real_host_page_mask) || >> + (int128_getlo(section->size) & qemu_real_host_page_mask)) >> + ); >> } >> >> >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have >> more complicated logic to detect whether it is MMIO? Or there is something >> bigger? > > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the > silly address hack above to try to exclude PCI BARs as they're being > resized, and b) there is some value to enabling peer-to-peer mappings > between devices. Is there a better, more generic solution to a) now? "now?" what? :) > If we know a MR is MMIO, then I think we can handle any associated > mapping failure as non-fatal. mr->ram_device sounds like an appropriate flag to check. > It's nice to enable p2p, but it's barely > used and not clear that it works reliably (sometimes even on bare > metal). Nit, we don't yet have the IOVA range of the IOMMU exposed, > thus the silly address hack, but we do have supported page sizes, so we > should probably be referencing that rather than host page size.
Ah, correct, s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/ > >> As for now, it is possible to have non-aligned RAM MR, the KVM's listener >> just does not register it with the KVM if unaligned. > > For a sanity check, we don't support running a guest with a smaller > page size than the host, right? We do, at least on POWER. > I assume that would be the only case > where we would have a reasonable chance of the guest using a host > sub-page as a DMA target, but that seems like just one of an infinite > number of problems with such a setup. Thanks, Like what? I am missing the point here. The most popular config on POWER today is 64k system page size in both host and guest and 4K IOMMU page size for 32bit DMA, seems alright, I also tried 4K guests, no problem there too. -- Alexey
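A minimal sketch of the policy discussed in this exchange: treat a section that cannot be mapped at IOMMU page granularity as non-fatal when it is known to be device MMIO (the mr->ram_device flag suggested above), and keep the existing hard failure for ordinary guest RAM. The enum, the function name and the way the flag is passed in are all hypothetical; this is one possible reading of the discussion, not the code that was merged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of examining one memory section. */
enum map_action {
    MAP_SECTION,    /* aligned: attempt the DMA mapping */
    SKIP_SECTION,   /* unaligned device MMIO: skip, non-fatal */
    FAIL_HARD       /* unaligned guest RAM: keep treating as fatal */
};

/*
 * Hypothetical classifier. iommu_page_size is assumed to be the smallest
 * supported IOMMU page size and a power of two.
 */
static enum map_action classify_section(bool is_ram_device,
                                        uint64_t iova, uint64_t size,
                                        uint64_t iommu_page_size)
{
    bool aligned = ((iova | size) & (iommu_page_size - 1)) == 0;

    if (aligned) {
        return MAP_SECTION;
    }
    /* Only peer-to-peer access is lost when device MMIO is skipped. */
    return is_ram_device ? SKIP_SECTION : FAIL_HARD;
}
```

For the POWER configuration described above (64k system pages, 4K IOMMU pages), the 0xe000-byte BAR fragment in front of the MSIX table would be classified as mappable; with a 64k IOMMU page it would be skipped rather than aborting QEMU.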
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 2:41 ` Alexey Kardashevskiy @ 2018-01-19 3:25 ` Alex Williamson 2018-01-19 4:31 ` Alexey Kardashevskiy 2018-01-19 13:15 ` Auger Eric 0 siblings, 2 replies; 19+ messages in thread From: Alex Williamson @ 2018-01-19 3:25 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On Fri, 19 Jan 2018 13:41:41 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 19/01/18 08:59, Alex Williamson wrote: > > On Tue, 16 Jan 2018 16:17:58 +1100 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > >> On 06/01/18 02:29, Alex Williamson wrote: > >>> On Fri, 5 Jan 2018 10:48:07 +0100 > >>> Auger Eric <eric.auger@redhat.com> wrote: > >>> > >>>> Hi Alexey, > >>>> > >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: > >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > >>>>> which tells that a region with MSIX data can be mapped entirely, i.e. > >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > >>>>> > >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors > >>>>> are emulated on top unless the machine requests not to by defining and > >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only > >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow > >>>>> enabling it as it does not define the "set" callback for the new property; > >>>>> the new property also does not appear in "-machine pseries,help". > >>>>> > >>>>> If the new capability is present, this puts MSIX IO memory region under > >>>>> mapped memory region. If the capability is not there, it falls back to > >>>>> the old behaviour with the sparse capability. > >>>>> > >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory > >>>>> listener does not register it with the KVM as a memory slot and MSIX is > >>>>> emulated by QEMU as before. 
> >>>>> > >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > >>>>> > >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>>>> --- > >>>>> > >>>>> This is mtree and flatview BEFORE this patch: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #0 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> > >>>>> This is AFTER this patch applied: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 
1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #2 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> > >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched > >>>>> to enable emulation: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #1 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> --- > >>>>> include/hw/vfio/vfio-common.h | 1 + > >>>>> linux-headers/linux/vfio.h | 5 +++++ > >>>>> hw/ppc/spapr.c | 8 ++++++++ 
> >>>>> hw/vfio/common.c | 15 +++++++++++++++ > >>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- > >>>>> 5 files changed, 50 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > >>>>> index f3a2ac9..927d600 100644 > >>>>> --- a/include/hw/vfio/vfio-common.h > >>>>> +++ b/include/hw/vfio/vfio-common.h > >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > >>>>> struct vfio_region_info **info); > >>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>>>> uint32_t subtype, struct vfio_region_info **info); > >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); > >>>>> #endif > >>>>> extern const MemoryListener vfio_prereg_listener; > >>>>> > >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >>>>> index 4312e96..b45182e 100644 > >>>>> --- a/linux-headers/linux/vfio.h > >>>>> +++ b/linux-headers/linux/vfio.h > >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { > >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) > >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) > >>>>> > >>>>> +/* > >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
> >>>>> + */ > >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > >>>>> + > >>>>> /** > >>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > >>>>> * struct vfio_irq_info) > >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>>>> index 9de63f0..693394a 100644 > >>>>> --- a/hw/ppc/spapr.c > >>>>> +++ b/hw/ppc/spapr.c > >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, > >>>>> spapr->use_hotplug_event_source = value; > >>>>> } > >>>>> > >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) > >>>>> +{ > >>>>> + return true; > >>>>> +} > >>>>> + > >>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) > >>>>> { > >>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); > >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) > >>>>> object_property_set_description(obj, "vsmt", > >>>>> "Virtual SMT: KVM behaves as if this were" > >>>>> " the host's SMT mode", &error_abort); > >>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", > >>>>> + spapr_get_msix_emulation, NULL, NULL); > >>>>> } > >>>>> > >>>>> static void spapr_machine_finalizefn(Object *obj) > >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { > >>>>> /* > >>>>> * pseries-2.11 > >>>>> */ > >>>>> + > >>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) > >>>>> { > >>>>> } > >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c > >>>>> index 1fb8a8e..da5f182 100644 > >>>>> --- a/hw/vfio/common.c > >>>>> +++ b/hw/vfio/common.c > >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>>>> return -ENODEV; > >>>>> } > >>>>> > >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) > >>>>> +{ > >>>>> + struct vfio_region_info *info = NULL; > >>>>> + bool ret = false; > >>>>> + > >>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { > >>>>> + if 
(vfio_get_region_info_cap(info, cap_type)) { > >>>>> + ret = true; > >>>>> + } > >>>>> + g_free(info); > >>>>> + } > >>>>> + > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> /* > >>>>> * Interfaces for IBM EEH (Enhanced Error Handling) > >>>>> */ > >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > >>>>> index c977ee3..27a3706 100644 > >>>>> --- a/hw/vfio/pci.c > >>>>> +++ b/hw/vfio/pci.c > >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > >>>>> off_t start, end; > >>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; > >>>>> > >>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, > >>>>> + vdev->msix->table_bar)) { > >>>>> + return; > >>>>> + } > >>>> > >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + > >>>> msix table at 0x2000, using 64kB pages. > >>>> > >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 > >>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 > >>>> mmaps[0] > >>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table > >>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] > >>>> > >>>> > >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass > >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the > >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot > >>>> be mapped with 64kB page, leading to a qemu abort. > >>>> > >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] > >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] > >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 > >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, > >>>> 0xffffa8820000) = -22 (Invalid argument) > >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue > >>>> > >>>> In my case don't we still need the fixup logic? 
> >>> > >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about > >>> this. I'd say no, the fixup shouldn't be needed, but apparently we > >>> can't remove it until we figure out this bug and who should be filtering > >>> out sub-page size listener events. Thanks, > >> > >> > >> This should do it: > >> > >> @@ -303,11 +303,19 @@ static bool > >> vfio_listener_skipped_section(MemoryRegionSection *section) > >> * Sizing an enabled 64-bit BAR can cause spurious mappings to > >> * addresses in the upper part of the 64-bit address space. These > >> * are never accessed by the CPU and beyond the address width of > >> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. > >> */ > >> - section->offset_within_address_space & (1ULL << 63); > >> + section->offset_within_address_space & (1ULL << 63) || > >> + /* > >> + * Non-system-page aligned RAM regions have no chance to be > >> + * DMA mapped so skip them too. > >> + */ > >> + (memory_region_is_ram(section->mr) && > >> + ((section->offset_within_region & qemu_real_host_page_mask) || > >> + (int128_getlo(section->size) & qemu_real_host_page_mask)) > >> + ); > >> } > >> > >> > >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have > >> more complicated logic to detect whether it is MMIO? Or there is something > >> bigger? > > > > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the > > silly address hack above to try to exclude PCI BARs as they're being > > resized, and b) there is some value to enabling peer-to-peer mappings > > between devices. Is there a better, more generic solution to a) now? > > "now?" what? :) Now - adverb, at the present time or moment. For example, vs in the past. > > If we know a MR is MMIO, then I think we can handle any associated > > mapping failure as non-fatal. > > > mr->ram_device sound like an appropriate flag to check. Yeah, I suppose we could, and that didn't exist when we added the bit 63 hack. 
> > It's nice to enable p2p, but it's barely
> > used and not clear that it works reliably (sometimes even on bare
> > metal). Nit, we don't yet have the IOVA range of the IOMMU exposed,
> > thus the silly address hack, but we do have supported page sizes, so we
> > should probably be referencing that rather than host page size.
>
> Ah, correct,
> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/

No, that would be the emulated iommu page size, which may or may not
(hopefully does) match the physical iommu page size. The vfio iommu
page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.

> >> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
> >> just does not register it with the KVM if unaligned.
> >
> > For a sanity check, we don't support running a guest with a smaller
> > page size than the host, right?
>
> We do, at least on POWER.
>
> > I assume that would be the only case
> > where we would have a reasonable chance of the guest using a host
> > sub-page as a DMA target, but that seems like just one of an infinite
> > number of problems with such a setup. Thanks,
>
> Like what? I am missing the point here. The most popular config on POWER
> today is 64k system page size in both host and guest and 4K IOMMU page size
> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.

The IOMMU supporting a smaller page size than the processor is the
thing I think I was missing here, but it makes sense that they could,
they're not necessarily linked. Eric, will SMMU support 4K even while
the CPU runs at 64K?

The type1 backend has a number of PAGE_SIZE assumptions in it and will
not expose IOMMU page sizes less than PAGE_SIZE. Perhaps this is really
the root of the problem, but it is advertising
vfio_iommu_type1_info.iova_pgsizes, which userspace should honor. So I
think running (target page size) < (host page size) is probably broken
for type1, for instance the guest could have a 4K RAM page as a DMA
target that the 64K host couldn't map due to alignment, therefore DMA
with that page would fault. Thanks,

Alex
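The point about honoring vfio_iommu_type1_info.iova_pgsizes can be sketched as a small alignment check. This is only an illustration under stated assumptions, not QEMU code: it assumes the bitmap encoding in which bit N set means a 2^N-byte page size is supported, and the helper names min_iommu_page_size() and section_is_dma_mappable() are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Smallest page size in an IOMMU page-size bitmap, where bit N set means
 * a page size of 2^N bytes is supported (the encoding assumed here for
 * vfio_iommu_type1_info.iova_pgsizes). Returns 0 for an empty bitmap.
 */
static uint64_t min_iommu_page_size(uint64_t iova_pgsizes)
{
    /* x & -x isolates the lowest set bit; written with ~x + 1 for unsigned. */
    return iova_pgsizes & (~iova_pgsizes + 1);
}

/*
 * Hypothetical helper: true if [iova, iova + size) starts and ends on a
 * boundary of the smallest supported IOMMU page, so a DMA map request
 * for it at least has a chance to succeed.
 */
static bool section_is_dma_mappable(uint64_t iova, uint64_t size,
                                    uint64_t iova_pgsizes)
{
    uint64_t pgsize = min_iommu_page_size(iova_pgsizes);
    uint64_t pgmask;

    if (pgsize == 0 || size == 0) {
        return false;
    }

    pgmask = pgsize - 1; /* low (offset) bits, which must all be zero */
    return ((iova | size) & pgmask) == 0;
}
```

With only 64K pages advertised (0x10000), the 0x2000-byte section at 0x8000000000 from Eric's trace fails this check, while a bitmap that also includes 4K (0x1000) would accept it. Note that pgmask here selects the low offset bits; a mask of the ~(page_size - 1) form would have to be inverted before being used in the same test.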
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 3:25 ` Alex Williamson @ 2018-01-19 4:31 ` Alexey Kardashevskiy 2018-01-19 13:15 ` Auger Eric 1 sibling, 0 replies; 19+ messages in thread From: Alexey Kardashevskiy @ 2018-01-19 4:31 UTC (permalink / raw) To: Alex Williamson; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On 19/01/18 14:25, Alex Williamson wrote: > On Fri, 19 Jan 2018 13:41:41 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 19/01/18 08:59, Alex Williamson wrote: >>> On Tue, 16 Jan 2018 16:17:58 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> On 06/01/18 02:29, Alex Williamson wrote: >>>>> On Fri, 5 Jan 2018 10:48:07 +0100 >>>>> Auger Eric <eric.auger@redhat.com> wrote: >>>>> >>>>>> Hi Alexey, >>>>>> >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>>>> >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>>>> are emulated on top unless the machine requests not to by defining and >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>>>> enabling it as it does not define the "set" callback for the new property; >>>>>>> the new property also does not appear in "-machine pseries,help". >>>>>>> >>>>>>> If the new capability is present, this puts MSIX IO memory region under >>>>>>> mapped memory region. If the capability is not there, it falls back to >>>>>>> the old behaviour with the sparse capability. >>>>>>> >>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>>>> emulated by QEMU as before. 
>>>>>>> >>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>>>> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>>>> --- >>>>>>> >>>>>>> This is mtree and flatview BEFORE this patch: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #0 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 
1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #2 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>>>> to enable emulation: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #1 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> --- >>>>>>> include/hw/vfio/vfio-common.h | 1 + >>>>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>>>> hw/ppc/spapr.c | 8 ++++++++ 
>>>>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>>>> index f3a2ac9..927d600 100644 >>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>>>> struct vfio_region_info **info); >>>>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> uint32_t subtype, struct vfio_region_info **info); >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>>>> #endif >>>>>>> extern const MemoryListener vfio_prereg_listener; >>>>>>> >>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>>>> index 4312e96..b45182e 100644 >>>>>>> --- a/linux-headers/linux/vfio.h >>>>>>> +++ b/linux-headers/linux/vfio.h >>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>>>> >>>>>>> +/* >>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>>>> + */ >>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>>>> + >>>>>>> /** >>>>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>>>> * struct vfio_irq_info) >>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>>>> index 9de63f0..693394a 100644 >>>>>>> --- a/hw/ppc/spapr.c >>>>>>> +++ b/hw/ppc/spapr.c >>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>>>> spapr->use_hotplug_event_source = value; >>>>>>> } >>>>>>> >>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>>>> +{ >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>>>> { >>>>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>>>> object_property_set_description(obj, "vsmt", >>>>>>> "Virtual SMT: KVM behaves as if this were" >>>>>>> " the host's SMT mode", &error_abort); >>>>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>>>> } >>>>>>> >>>>>>> static void spapr_machine_finalizefn(Object *obj) >>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>>>> /* >>>>>>> * pseries-2.11 >>>>>>> */ >>>>>>> + >>>>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>>>> { >>>>>>> } >>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>> index 1fb8a8e..da5f182 100644 >>>>>>> --- a/hw/vfio/common.c >>>>>>> +++ b/hw/vfio/common.c >>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> return -ENODEV; >>>>>>> } >>>>>>> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>>>> +{ >>>>>>> + struct vfio_region_info *info = NULL; >>>>>>> + bool ret = false; >>>>>>> + >>>>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>>>> + if 
(vfio_get_region_info_cap(info, cap_type)) { >>>>>>> + ret = true; >>>>>>> + } >>>>>>> + g_free(info); >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> /* >>>>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>>>> */ >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>> index c977ee3..27a3706 100644 >>>>>>> --- a/hw/vfio/pci.c >>>>>>> +++ b/hw/vfio/pci.c >>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>>>> off_t start, end; >>>>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>>>> >>>>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>>>> + vdev->msix->table_bar)) { >>>>>>> + return; >>>>>>> + } >>>>>> >>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>>>> msix table at 0x2000, using 64kB pages. >>>>>> >>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>>>> mmaps[0] >>>>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>>>> >>>>>> >>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>>>> be mapped with 64kB page, leading to a qemu abort. >>>>>> >>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>>>> 0xffffa8820000) = -22 (Invalid argument) >>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>>>> >>>>>> In my case don't we still need the fixup logic? 
>>>>> >>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about >>>>> this. I'd say no, the fixup shouldn't be needed, but apparently we >>>>> can't remove it until we figure out this bug and who should be filtering >>>>> out sub-page size listener events. Thanks, >>>> >>>> >>>> This should do it: >>>> >>>> @@ -303,11 +303,19 @@ static bool >>>> vfio_listener_skipped_section(MemoryRegionSection *section) >>>> * Sizing an enabled 64-bit BAR can cause spurious mappings to >>>> * addresses in the upper part of the 64-bit address space. These >>>> * are never accessed by the CPU and beyond the address width of >>>> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. >>>> */ >>>> - section->offset_within_address_space & (1ULL << 63); >>>> + section->offset_within_address_space & (1ULL << 63) || >>>> + /* >>>> + * Non-system-page aligned RAM regions have no chance to be >>>> + * DMA mapped so skip them too. >>>> + */ >>>> + (memory_region_is_ram(section->mr) && >>>> + ((section->offset_within_region & qemu_real_host_page_mask) || >>>> + (int128_getlo(section->size) & qemu_real_host_page_mask)) >>>> + ); >>>> } >>>> >>>> >>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have >>>> more complicated logic to detect whether it is MMIO? Or there is something >>>> bigger? >>> >>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the >>> silly address hack above to try to exclude PCI BARs as they're being >>> resized, and b) there is some value to enabling peer-to-peer mappings >>> between devices. Is there a better, more generic solution to a) now? >> >> "now?" what? :) > > Now - adverb, at the present time or moment. For example, vs in the > past. Ah, read incorrectly, I thought the sentence should have a) and b) and b) was missing by a mistake. A heatwave here today. > >>> If we know a MR is MMIO, then I think we can handle any associated >>> mapping failure as non-fatal. 
>>
>>
>> mr->ram_device sounds like an appropriate flag to check.
>
> Yeah, I suppose we could, and that didn't exist when we added the
> bit 63 hack.

Ok, I'll make a patch for this then.

>>> It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal). Nit, we don't yet have the IOVA range of the IOMMU exposed,
>>> thus the silly address hack, but we do have supported page sizes, so we
>>> should probably be referencing that rather than host page size.
>>
>> Ah, correct,
>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/
>
> No, that would be the emulated iommu page size, which may or may not
> (hopefully does) match the physical iommu page size. The vfio iommu
> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.

Correct.

-- 
Alexey
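The idea agreed on above, treating a failed DMA mapping as non-fatal when the region is known to be device MMIO, could look roughly like the following. This is a standalone sketch with made-up types: struct region_model, enum map_outcome and handle_dma_map_result() are hypothetical stand-ins for illustration and are not part of QEMU; only the ram_device flag mirrors the field mentioned in the thread.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the parts of a MemoryRegion that matter here. */
struct region_model {
    const char *name;
    bool ram_device; /* true for mmap'ed device MMIO (shown as "ramd") */
};

enum map_outcome { MAP_OK, MAP_SKIPPED, MAP_FATAL };

/*
 * Decide how to react to the result of a DMA map attempt, where ret is 0
 * on success or a negative errno value on failure. A failure on device
 * MMIO only costs peer-to-peer DMA, so it is reported and skipped; a
 * failure on ordinary guest RAM remains fatal.
 */
static enum map_outcome handle_dma_map_result(const struct region_model *mr,
                                              int ret)
{
    if (ret == 0) {
        return MAP_OK;
    }

    if (mr->ram_device) {
        fprintf(stderr,
                "warning: DMA mapping of %s failed (%d), skipping region\n",
                mr->name, ret);
        return MAP_SKIPPED;
    }

    return MAP_FATAL;
}
```

Under this policy the -22 (EINVAL) result from Eric's BAR 0 mapping would produce a warning instead of the "hardware error" abort, while the same error on guest RAM would still stop the VM.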
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 3:25 ` Alex Williamson 2018-01-19 4:31 ` Alexey Kardashevskiy @ 2018-01-19 13:15 ` Auger Eric 2018-01-23 19:56 ` Alex Williamson 1 sibling, 1 reply; 19+ messages in thread From: Auger Eric @ 2018-01-19 13:15 UTC (permalink / raw) To: Alex Williamson, Alexey Kardashevskiy; +Cc: qemu-devel, qemu-ppc, David Gibson Hi, On 19/01/18 04:25, Alex Williamson wrote: > On Fri, 19 Jan 2018 13:41:41 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 19/01/18 08:59, Alex Williamson wrote: >>> On Tue, 16 Jan 2018 16:17:58 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> On 06/01/18 02:29, Alex Williamson wrote: >>>>> On Fri, 5 Jan 2018 10:48:07 +0100 >>>>> Auger Eric <eric.auger@redhat.com> wrote: >>>>> >>>>>> Hi Alexey, >>>>>> >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>>>> >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>>>> are emulated on top unless the machine requests not to by defining and >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>>>> enabling it as it does not define the "set" callback for the new property; >>>>>>> the new property also does not appear in "-machine pseries,help". >>>>>>> >>>>>>> If the new capability is present, this puts MSIX IO memory region under >>>>>>> mapped memory region. If the capability is not there, it falls back to >>>>>>> the old behaviour with the sparse capability. 
>>>>>>> >>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>>>> emulated by QEMU as before. >>>>>>> >>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>>>> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>>>> --- >>>>>>> >>>>>>> This is mtree and flatview BEFORE this patch: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #0 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] 
>>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #2 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>>>> to enable emulation: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #1 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>>>> 
0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> --- >>>>>>> include/hw/vfio/vfio-common.h | 1 + >>>>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>>>> hw/ppc/spapr.c | 8 ++++++++ >>>>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>>>> index f3a2ac9..927d600 100644 >>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>>>> struct vfio_region_info **info); >>>>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> uint32_t subtype, struct vfio_region_info **info); >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>>>> #endif >>>>>>> extern const MemoryListener vfio_prereg_listener; >>>>>>> >>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>>>> index 4312e96..b45182e 100644 >>>>>>> --- a/linux-headers/linux/vfio.h >>>>>>> +++ b/linux-headers/linux/vfio.h >>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>>>> >>>>>>> +/* >>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>>>> + */ >>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>>>> + >>>>>>> /** >>>>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>>>> * struct vfio_irq_info) >>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>>>> index 9de63f0..693394a 100644 >>>>>>> --- a/hw/ppc/spapr.c >>>>>>> +++ b/hw/ppc/spapr.c >>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>>>> spapr->use_hotplug_event_source = value; >>>>>>> } >>>>>>> >>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>>>> +{ >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>>>> { >>>>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>>>> object_property_set_description(obj, "vsmt", >>>>>>> "Virtual SMT: KVM behaves as if this were" >>>>>>> " the host's SMT mode", &error_abort); >>>>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>>>> } >>>>>>> >>>>>>> static void spapr_machine_finalizefn(Object *obj) >>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>>>> /* >>>>>>> * pseries-2.11 >>>>>>> */ >>>>>>> + >>>>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>>>> { >>>>>>> } >>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>> index 1fb8a8e..da5f182 100644 >>>>>>> --- a/hw/vfio/common.c >>>>>>> +++ b/hw/vfio/common.c >>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> return -ENODEV; >>>>>>> } >>>>>>> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>>>> +{ >>>>>>> + struct vfio_region_info *info = NULL; >>>>>>> + bool ret = false; >>>>>>> + >>>>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>>>> + if 
(vfio_get_region_info_cap(info, cap_type)) { >>>>>>> + ret = true; >>>>>>> + } >>>>>>> + g_free(info); >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> /* >>>>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>>>> */ >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>> index c977ee3..27a3706 100644 >>>>>>> --- a/hw/vfio/pci.c >>>>>>> +++ b/hw/vfio/pci.c >>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>>>> off_t start, end; >>>>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>>>> >>>>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>>>> + vdev->msix->table_bar)) { >>>>>>> + return; >>>>>>> + } >>>>>> >>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>>>> msix table at 0x2000, using 64kB pages. >>>>>> >>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>>>> mmaps[0] >>>>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>>>> >>>>>> >>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>>>> be mapped with 64kB page, leading to a qemu abort. >>>>>> >>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>>>> 0xffffa8820000) = -22 (Invalid argument) >>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>>>> >>>>>> In my case don't we still need the fixup logic? 
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this. I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events. Thanks,
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool
>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>              * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>              * addresses in the upper part of the 64-bit address space. These
>>>>              * are never accessed by the CPU and beyond the address width of
>>>>              * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
>>>>              */
>>>> -           section->offset_within_address_space & (1ULL << 63);
>>>> +           section->offset_within_address_space & (1ULL << 63) ||
>>>> +           /*
>>>> +            * Non-system-page aligned RAM regions have no chance to be
>>>> +            * DMA mapped so skip them too.
>>>> +            */
>>>> +           (memory_region_is_ram(section->mr) &&
>>>> +            ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +             (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> +           );
>>>>  }
>>>>
>>>>
>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>> bigger?
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices. Is there a better, more generic solution to a) now?
>>
>> "now?" what? :)
>
> Now - adverb, at the present time or moment. For example, vs in the
> past.
>
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal.
>>
>>
>> mr->ram_device sounds like an appropriate flag to check.
>
> Yeah, I suppose we could, and that didn't exist when we added the
> bit 63 hack.
>
>>> It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal). Nit, we don't yet have the IOVA range of the IOMMU exposed,
>>> thus the silly address hack, but we do have supported page sizes, so we
>>> should probably be referencing that rather than host page size.
>>
>> Ah, correct,
>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/
>
> No, that would be the emulated iommu page size, which may or may not
> (hopefully does) match the physical iommu page size. The vfio iommu
> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1.
>
>>>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>>>> just does not register it with the KVM if unaligned.
>>>
>>> For a sanity check, we don't support running a guest with a smaller
>>> page size than the host, right?
>>
>> We do, at least on POWER.
>>
>>> I assume that would be the only case
>>> where we would have a reasonable chance of the guest using a host
>>> sub-page as a DMA target, but that seems like just one of an infinite
>>> number of problems with such a setup. Thanks,
>>
>> Like what? I am missing the point here. The most popular config on POWER
>> today is 64k system page size in both host and guest and 4K IOMMU page size
>> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.
>
> The IOMMU supporting a smaller page size than the processor is the
> thing I think I was missing here, but it makes sense that they could,
> they're not necessarily linked. Eric, will SMMU support 4K even while
> the CPU runs at 64K?

Yes that's my understanding. Also we introduced:

4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

Thanks

Eric

> The type1 backend has a number of PAGE_SIZE
> assumptions in it and will not expose IOMMU page sizes less than
> PAGE_SIZE. Perhaps this is really the root of the problem, but it is
> advertising vfio_iommu_type1_info.iova_pgsizes, which userspace should
> honor. So I think running (target page size) < (host page size) is
> probably broken for type1, for instance the guest could have a 4K RAM
> page as a DMA target that the 64K host couldn't map due to alignment,
> therefore DMA with that page would fault. Thanks,
>
> Alex
>

^ permalink raw reply [flat|nested] 19+ messages in thread
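For illustration only, here is a minimal standalone sketch of what "honoring iova_pgsizes" could look like when deciding whether a section can be DMA mapped at all. It is not QEMU code: `min_pgsize()` and `section_is_mappable()` are hypothetical helpers, and the bitmap is simply passed in as a parameter on the assumption that it was read from `vfio_iommu_type1_info.iova_pgsizes` beforehand. The check tests the low (page-offset) bits of the address and size; a mask that has those bits cleared would have to be complemented before being used this way.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest page size in a bitmap where bit N set
 * means that pages of 2^N bytes are supported. Returns 0 for an empty
 * bitmap.
 */
static uint64_t min_pgsize(uint64_t pgsize_bitmap)
{
    /* x & -x isolates the lowest set bit */
    return pgsize_bitmap & (~pgsize_bitmap + 1);
}

/*
 * Hypothetical helper: true if [iova, iova + size) starts and ends on a
 * boundary of the smallest supported page size, i.e. a DMA map request
 * for it has a chance to succeed.
 */
static bool section_is_mappable(uint64_t iova, uint64_t size,
                                uint64_t pgsize_bitmap)
{
    uint64_t pgsize = min_pgsize(pgsize_bitmap);

    if (pgsize == 0 || size == 0) {
        return false;
    }
    /* Any low bit set in either value means a sub-page start or length */
    return ((iova | size) & (pgsize - 1)) == 0;
}
```

With the numbers from the trace earlier in the thread (an 8kB section at 0x8000000000 and a 64kB minimum page size), `section_is_mappable(0x8000000000, 0x2000, 0x10000)` is false, so a listener using such a check would skip the section instead of aborting on the failed VFIO_MAP_DMA.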
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 13:15 ` Auger Eric @ 2018-01-23 19:56 ` Alex Williamson 2018-01-24 10:35 ` Auger Eric 0 siblings, 1 reply; 19+ messages in thread From: Alex Williamson @ 2018-01-23 19:56 UTC (permalink / raw) To: Auger Eric; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, David Gibson On Fri, 19 Jan 2018 14:15:43 +0100 Auger Eric <eric.auger@redhat.com> wrote: > Hi, > > On 19/01/18 04:25, Alex Williamson wrote: > > On Fri, 19 Jan 2018 13:41:41 +1100 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > >> On 19/01/18 08:59, Alex Williamson wrote: > >>> On Tue, 16 Jan 2018 16:17:58 +1100 > >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >>> > >>>> On 06/01/18 02:29, Alex Williamson wrote: > >>>>> On Fri, 5 Jan 2018 10:48:07 +0100 > >>>>> Auger Eric <eric.auger@redhat.com> wrote: > >>>>> > >>>>>> Hi Alexey, > >>>>>> > >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: > >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. > >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > >>>>>>> > >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors > >>>>>>> are emulated on top unless the machine requests not to by defining and > >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only > >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow > >>>>>>> enabling it as it does not define the "set" callback for the new property; > >>>>>>> the new property also does not appear in "-machine pseries,help". > >>>>>>> > >>>>>>> If the new capability is present, this puts MSIX IO memory region under > >>>>>>> mapped memory region. If the capability is not there, it falls back to > >>>>>>> the old behaviour with the sparse capability. 
> [...]
> >>
> >> Like what? I am missing the point here.
The most popular config on POWER
> >> today is 64k system page size in both host and guest and 4K IOMMU page size
> >> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.
> >
> > The IOMMU supporting a smaller page size than the processor is the
> > thing I think I was missing here, but it makes sense that they could,
> > they're not necessarily linked. Eric, will SMMU support 4K even while
> > the CPU runs at 64K?
>
> Yes that's my understanding. Also we introduced:
>
> 4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

This is a bit of a different problem, for instance if PAGE_SIZE is 64K
but the IOMMU only supports 4K, type1 would previously report an empty
set in the pagesize bitmap. With this change, we recognize that we can
compose a 64K mapping using 4K pages and therefore report 64K as a
valid page size. It doesn't however report that we can create 4K
mappings. VT-d sort of works the same way, but at the IOMMU API level
intel-iommu reports PAGE_MASK even though it only has native support
for a few of those discrete sizes (of course PAGE_SIZE is always 4K, so
the similarity ends there). Thanks,

Alex

^ permalink raw reply [flat|nested] 19+ messages in thread
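The bitmap behaviour described above can be sketched in a few lines. This is a model written from that description, not a copy of the kernel code: `reported_pgsizes()` is a hypothetical name, and both the IOMMU's native page-size bitmap and the CPU PAGE_SIZE are plain parameters here.

```c
#include <stdint.h>

/*
 * Hypothetical model of the reported page-size bitmap: if the IOMMU
 * supports any size smaller than PAGE_SIZE, hide those sizes and report
 * PAGE_SIZE instead, since a PAGE_SIZE mapping can be composed from the
 * smaller pages. Sizes of PAGE_SIZE and above pass through unchanged.
 */
static uint64_t reported_pgsizes(uint64_t iommu_pgsizes, uint64_t page_size)
{
    uint64_t sub_page = page_size - 1;  /* bits for sizes below PAGE_SIZE */

    if (iommu_pgsizes & sub_page) {
        iommu_pgsizes &= ~sub_page;     /* do not advertise sub-page sizes */
        iommu_pgsizes |= page_size;     /* advertise PAGE_SIZE instead */
    }
    return iommu_pgsizes;
}
```

For the case in the text, a 4K-only IOMMU under a 64K PAGE_SIZE, `reported_pgsizes(0x1000, 0x10000)` yields a bitmap with only the 64K bit set, so userspace sees 64K as valid but never sees 4K.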
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-23 19:56 ` Alex Williamson @ 2018-01-24 10:35 ` Auger Eric 0 siblings, 0 replies; 19+ messages in thread From: Auger Eric @ 2018-01-24 10:35 UTC (permalink / raw) To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-devel, qemu-ppc, David Gibson Hi Alex, On 23/01/18 20:56, Alex Williamson wrote: > On Fri, 19 Jan 2018 14:15:43 +0100 > Auger Eric <eric.auger@redhat.com> wrote: > >> Hi, >> >> On 19/01/18 04:25, Alex Williamson wrote: >>> On Fri, 19 Jan 2018 13:41:41 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> On 19/01/18 08:59, Alex Williamson wrote: >>>>> On Tue, 16 Jan 2018 16:17:58 +1100 >>>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>>>> >>>>>> On 06/01/18 02:29, Alex Williamson wrote: >>>>>>> On Fri, 5 Jan 2018 10:48:07 +0100 >>>>>>> Auger Eric <eric.auger@redhat.com> wrote: >>>>>>> >>>>>>>> Hi Alexey, >>>>>>>> >>>>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>>>>>> >>>>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>>>>>> are emulated on top unless the machine requests not to by defining and >>>>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>>>>>> enabling it as it does not define the "set" callback for the new property; >>>>>>>>> the new property also does not appear in "-machine pseries,help". >>>>>>>>> >>>>>>>>> If the new capability is present, this puts MSIX IO memory region under >>>>>>>>> mapped memory region. If the capability is not there, it falls back to >>>>>>>>> the old behaviour with the sparse capability. 
>>>>>>>>> >>>>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>>>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>>>>>> emulated by QEMU as before. >>>>>>>>> >>>>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>>>>>> >>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>>>>>> --- >>>>>>>>> >>>>>>>>> This is mtree and flatview BEFORE this patch: >>>>>>>>> >>>>>>>>> "info mtree": >>>>>>>>> memory-region: pci@800000020000000.mmio >>>>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> >>>>>>>>> "info mtree -f": >>>>>>>>> FlatView #0 >>>>>>>>> AS "memory", root: system >>>>>>>>> AS "cpu-memory", root: system >>>>>>>>> Root memory region: system >>>>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> This is AFTER this patch applied: >>>>>>>>> >>>>>>>>> "info mtree": >>>>>>>>> memory-region: pci@800000020000000.mmio >>>>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 
>>>>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> >>>>>>>>> >>>>>>>>> "info mtree -f": >>>>>>>>> FlatView #2 >>>>>>>>> AS "memory", root: system >>>>>>>>> AS "cpu-memory", root: system >>>>>>>>> Root memory region: system >>>>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>>>>>> to enable emulation: >>>>>>>>> >>>>>>>>> "info mtree": >>>>>>>>> memory-region: pci@800000020000000.mmio >>>>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> >>>>>>>>> "info mtree -f": >>>>>>>>> FlatView #1 >>>>>>>>> AS "memory", root: system >>>>>>>>> AS "cpu-memory", root: system >>>>>>>>> Root memory region: system >>>>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>>>> 
000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>>>> --- >>>>>>>>> include/hw/vfio/vfio-common.h | 1 + >>>>>>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>>>>>> hw/ppc/spapr.c | 8 ++++++++ >>>>>>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>>>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>>>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>>>>>> index f3a2ac9..927d600 100644 >>>>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>>>>>> struct vfio_region_info **info); >>>>>>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>>>> uint32_t subtype, struct vfio_region_info **info); >>>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>>>>>> #endif >>>>>>>>> extern const MemoryListener vfio_prereg_listener; >>>>>>>>> >>>>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>>>>>> index 4312e96..b45182e 100644 >>>>>>>>> --- a/linux-headers/linux/vfio.h >>>>>>>>> +++ b/linux-headers/linux/vfio.h >>>>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>>>>>> >>>>>>>>> +/* >>>>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>>>>>> + */ >>>>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>>>>>> + >>>>>>>>> /** >>>>>>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>>>>>> * struct vfio_irq_info) >>>>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>>>>>> index 9de63f0..693394a 100644 >>>>>>>>> --- a/hw/ppc/spapr.c >>>>>>>>> +++ b/hw/ppc/spapr.c >>>>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>>>>>> spapr->use_hotplug_event_source = value; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>>>>>> +{ >>>>>>>>> + return true; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>>>>>> { >>>>>>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>>>>>> object_property_set_description(obj, "vsmt", >>>>>>>>> "Virtual SMT: KVM behaves as if this were" >>>>>>>>> " the host's SMT mode", &error_abort); >>>>>>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>>>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>>>>>> } >>>>>>>>> >>>>>>>>> static void spapr_machine_finalizefn(Object *obj) >>>>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>>>>>> /* >>>>>>>>> * pseries-2.11 >>>>>>>>> */ >>>>>>>>> + >>>>>>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>>>>>> { >>>>>>>>> } >>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>>>> index 1fb8a8e..da5f182 100644 >>>>>>>>> --- a/hw/vfio/common.c >>>>>>>>> +++ b/hw/vfio/common.c >>>>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>>>> return -ENODEV; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>>>>>> +{ >>>>>>>>> + struct vfio_region_info *info = NULL; >>>>>>>>> + bool ret = false; >>>>>>>>> + 
>>>>>>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>>>>>> + if (vfio_get_region_info_cap(info, cap_type)) { >>>>>>>>> + ret = true; >>>>>>>>> + } >>>>>>>>> + g_free(info); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + return ret; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> /* >>>>>>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>>>>>> */ >>>>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>>>> index c977ee3..27a3706 100644 >>>>>>>>> --- a/hw/vfio/pci.c >>>>>>>>> +++ b/hw/vfio/pci.c >>>>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>>>>>> off_t start, end; >>>>>>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>>>>>> >>>>>>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>>>>>> + vdev->msix->table_bar)) { >>>>>>>>> + return; >>>>>>>>> + } >>>>>>>> >>>>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>>>>>> msix table at 0x2000, using 64kB pages. >>>>>>>> >>>>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>>>>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>>>>>> mmaps[0] >>>>>>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>>>>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>>>>>> >>>>>>>> >>>>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>>>>>> be mapped with 64kB page, leading to a qemu abort. 
>>>>>>>> >>>>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>>>>>> 0xffffa8820000) = -22 (Invalid argument) >>>>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>>>>>> >>>>>>>> In my case don't we still need the fixup logic? >>>>>>> >>>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about >>>>>>> this. I'd say no, the fixup shouldn't be needed, but apparently we >>>>>>> can't remove it until we figure out this bug and who should be filtering >>>>>>> out sub-page size listener events. Thanks, >>>>>> >>>>>> >>>>>> This should do it: >>>>>> >>>>>> @@ -303,11 +303,19 @@ static bool >>>>>> vfio_listener_skipped_section(MemoryRegionSection *section) >>>>>> * Sizing an enabled 64-bit BAR can cause spurious mappings to >>>>>> * addresses in the upper part of the 64-bit address space. These >>>>>> * are never accessed by the CPU and beyond the address width of >>>>>> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. >>>>>> */ >>>>>> - section->offset_within_address_space & (1ULL << 63); >>>>>> + section->offset_within_address_space & (1ULL << 63) || >>>>>> + /* >>>>>> + * Non-system-page aligned RAM regions have no chance to be >>>>>> + * DMA mapped so skip them too. >>>>>> + */ >>>>>> + (memory_region_is_ram(section->mr) && >>>>>> + ((section->offset_within_region & qemu_real_host_page_mask) || >>>>>> + (int128_getlo(section->size) & qemu_real_host_page_mask)) >>>>>> + ); >>>>>> } >>>>>> >>>>>> >>>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have >>>>>> more complicated logic to detect whether it is MMIO? Or there is something >>>>>> bigger? 
>>>>> >>>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the >>>>> silly address hack above to try to exclude PCI BARs as they're being >>>>> resized, and b) there is some value to enabling peer-to-peer mappings >>>>> between devices. Is there a better, more generic solution to a) now? >>>> >>>> "now?" what? :) >>> >>> Now - adverb, at the present time or moment. For example, vs in the >>> past. >>> >>>>> If we know a MR is MMIO, then I think we can handle any associated >>>>> mapping failure as non-fatal. >>>> >>>> >>>> mr->ram_device sound like an appropriate flag to check. >>> >>> Yeah, I suppose we could, and that didn't exist when we added the >>> bit 63 hack. >>> >>>>> It's nice to enable p2p, but it's barely >>>>> used and not clear that it works reliably (sometimes even on bare >>>>> metal).Nit, we don't yet have the IOVA range of the IOMMU exposed, >>>>> thus the silly address hack, but we do have supported page sizes, so we >>>>> should probably be referencing that rather than host page size. >>>> >>>> Ah, correct, >>>> s/qemu_real_host_page_mask/memory_region_iommu_get_min_page_size()/ >>> >>> No, that would be the emulated iommu page size, which may or may not >>> (hopefully does) match the physical iommu page size. The vfio iommu >>> page sizes are exposed via VFIO_IOMMU_GET_INFO, at least for type1. >>> >>>>>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener >>>>>> just does not register it with the KVM if unaligned. >>>>> >>>>> For a sanity check, we don't support running a guest with a smaller >>>>> page size than the host, right? >>>> >>>> We do, at least on POWER. >>>> >>>>> I assume that would be the only case >>>>> where we would have a reasonable chance of the guest using a host >>>>> sub-page as a DMA target, but that seems like just one of an infinite >>>>> number of problems with such a setup. Thanks, >>>> >>>> Like what? I am missing the point here. 
The most popular config on POWER
>>>> today is 64k system page size in both host and guest and 4K IOMMU page size
>>>> for 32bit DMA, seems alright, I also tried 4K guests, no problem there too.
>>>
>>> The IOMMU supporting a smaller page size than the processor is the
>>> thing I think I was missing here, but it makes sense that they could,
>>> they're not necessarily linked.  Eric, will SMMU support 4K even while
>>> the CPU runs at 64K?
>>
>> Yes that's my understanding.

Hmm, actually I was wrong. In the arm-smmu-v3 driver, the pgsize bitmap is
first initialized at probe time by probing the HW capabilities (looking at
the ID5 register). But at domain finalization, when the page table allocator
is allocated (alloc_io_pgtable_ops), this bitmap (cfg->pgsize_bitmap in the
code below) is overridden: if PAGE_SIZE is supported, as in our case where
granule=64K, only the 64K and 512M page sizes remain set. So the domain
pgsize bitmap does not expose 4K.

See arm_lpae_restrict_pgsizes() in drivers/iommu/io-pgtable-arm.c:

        /*
         * We need to restrict the supported page sizes to match the
         * translation regime for a particular granule. Aim to match
         * the CPU page size if possible, otherwise prefer smaller sizes.
         * While we're at it, restrict the block sizes to match the
         * chosen granule.
         */
        if (cfg->pgsize_bitmap & PAGE_SIZE)
                granule = PAGE_SIZE;
        else if (cfg->pgsize_bitmap & ~PAGE_MASK)
                granule = 1UL << __fls(cfg->pgsize_bitmap & ~PAGE_MASK);
        else if (cfg->pgsize_bitmap & PAGE_MASK)
                granule = 1UL << __ffs(cfg->pgsize_bitmap & PAGE_MASK);
        else
                granule = 0;

        switch (granule) {
        case SZ_4K:
                cfg->pgsize_bitmap &= (SZ_4K | SZ_2M | SZ_1G);
                break;
        case SZ_16K:
                cfg->pgsize_bitmap &= (SZ_16K | SZ_32M);
                break;
        case SZ_64K:
                cfg->pgsize_bitmap &= (SZ_64K | SZ_512M);
                break;
        default:
                cfg->pgsize_bitmap = 0;
        }

>> Also we introduced:
>>
>> 4644321 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
>
> This is a bit of a different problem, for instance if PAGE_SIZE is 64K
> but the IOMMU only supports 4K, type1 would previously report an empty
> set in the pagesize bitmap.  With this change, we recognize that we can
> compose a 64K mapping using 4K pages and therefore report 64K as a
> valid page size.  It doesn't however report that we can create 4K
> mappings.

I agree. With the above code, if PAGE_SIZE is not supported by the SMMU, the
domain bitmap exposes the 4K page size and vfio iommu type1 then exposes 64K.

Thanks

Eric

> VT-d sort of works the same way, but at the IOMMU API level
> intel-iommu reports PAGE_MASK even though it only has native support
> for a few of those discrete sizes (of course PAGE_SIZE is always 4K, so
> the similarity ends there).  Thanks,
>
> Alex
>

^ permalink raw reply [flat|nested] 19+ messages in thread
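The granule selection quoted from arm_lpae_restrict_pgsizes() can be modelled outside the kernel to see which page sizes survive for a given host page size. What follows is only a minimal user-space sketch under stated assumptions, not the kernel code: restrict_pgsizes(), highest_bit() and lowest_bit() are hypothetical names, the SZ_* values are written out by hand, and the host page size is passed in as a parameter instead of coming from PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Page and block sizes, written out by hand for this sketch. */
#define SZ_4K   0x00001000ULL
#define SZ_16K  0x00004000ULL
#define SZ_64K  0x00010000ULL
#define SZ_2M   0x00200000ULL
#define SZ_32M  0x02000000ULL
#define SZ_512M 0x20000000ULL
#define SZ_1G   0x40000000ULL

/* Highest set bit of a non-zero value, returned as a mask. */
static uint64_t highest_bit(uint64_t x)
{
    uint64_t bit = 1ULL << 63;

    while (!(x & bit)) {
        bit >>= 1;
    }
    return bit;
}

/* Lowest set bit of a non-zero value, returned as a mask. */
static uint64_t lowest_bit(uint64_t x)
{
    return x & (~x + 1);
}

/*
 * Hypothetical model of the quoted logic: pick a granule from the
 * hardware bitmap, preferring the CPU page size, then keep only the
 * page and block sizes that belong to that granule.
 */
static uint64_t restrict_pgsizes(uint64_t hw_bitmap, uint64_t page_size)
{
    uint64_t page_mask = ~(page_size - 1);
    uint64_t granule;

    if (hw_bitmap & page_size) {
        granule = page_size;                          /* exact match */
    } else if (hw_bitmap & ~page_mask) {
        granule = highest_bit(hw_bitmap & ~page_mask); /* largest smaller size */
    } else if (hw_bitmap & page_mask) {
        granule = lowest_bit(hw_bitmap & page_mask);   /* smallest larger size */
    } else {
        granule = 0;
    }

    switch (granule) {
    case SZ_4K:
        return hw_bitmap & (SZ_4K | SZ_2M | SZ_1G);
    case SZ_16K:
        return hw_bitmap & (SZ_16K | SZ_32M);
    case SZ_64K:
        return hw_bitmap & (SZ_64K | SZ_512M);
    default:
        return 0;
    }
}
```

With a 64K host page size and hardware that advertises both 4K and 64K granules, the sketch keeps only 64K and 512M, which matches the observation that the domain bitmap does not expose 4K. If the hardware does not advertise 64K at all, the 4K granule is chosen and the 4K, 2M and 1G sizes remain.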
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-18 21:59 ` Alex Williamson 2018-01-19 2:41 ` Alexey Kardashevskiy @ 2018-01-19 8:55 ` Alexey Kardashevskiy 2018-01-19 15:57 ` Alex Williamson 1 sibling, 1 reply; 19+ messages in thread From: Alexey Kardashevskiy @ 2018-01-19 8:55 UTC (permalink / raw) To: Alex Williamson; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On 19/01/18 08:59, Alex Williamson wrote: > On Tue, 16 Jan 2018 16:17:58 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 06/01/18 02:29, Alex Williamson wrote: >>> On Fri, 5 Jan 2018 10:48:07 +0100 >>> Auger Eric <eric.auger@redhat.com> wrote: >>> >>>> Hi Alexey, >>>> >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>> >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>> are emulated on top unless the machine requests not to by defining and >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>> enabling it as it does not define the "set" callback for the new property; >>>>> the new property also does not appear in "-machine pseries,help". >>>>> >>>>> If the new capability is present, this puts MSIX IO memory region under >>>>> mapped memory region. If the capability is not there, it falls back to >>>>> the old behaviour with the sparse capability. >>>>> >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>> emulated by QEMU as before. 
>>>>> >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>> >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>> --- >>>>> >>>>> This is mtree and flatview BEFORE this patch: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> "info mtree -f": >>>>> FlatView #0 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> >>>>> This is AFTER this patch applied: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, 
ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> "info mtree -f": >>>>> FlatView #2 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> >>>>> >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>> to enable emulation: >>>>> >>>>> "info mtree": >>>>> memory-region: pci@800000020000000.mmio >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> >>>>> "info mtree -f": >>>>> FlatView #1 >>>>> AS "memory", root: system >>>>> AS "cpu-memory", root: system >>>>> Root memory region: system >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>> --- >>>>> include/hw/vfio/vfio-common.h | 1 + >>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>> hw/ppc/spapr.c | 8 ++++++++ >>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>> 
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>> index f3a2ac9..927d600 100644 >>>>> --- a/include/hw/vfio/vfio-common.h >>>>> +++ b/include/hw/vfio/vfio-common.h >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>> struct vfio_region_info **info); >>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>> uint32_t subtype, struct vfio_region_info **info); >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>> #endif >>>>> extern const MemoryListener vfio_prereg_listener; >>>>> >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>> index 4312e96..b45182e 100644 >>>>> --- a/linux-headers/linux/vfio.h >>>>> +++ b/linux-headers/linux/vfio.h >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>> >>>>> +/* >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>> + */ >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>> + >>>>> /** >>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>> * struct vfio_irq_info) >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>> index 9de63f0..693394a 100644 >>>>> --- a/hw/ppc/spapr.c >>>>> +++ b/hw/ppc/spapr.c >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>> spapr->use_hotplug_event_source = value; >>>>> } >>>>> >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>> +{ >>>>> + return true; >>>>> +} >>>>> + >>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>> { >>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>> object_property_set_description(obj, "vsmt", >>>>> "Virtual SMT: KVM behaves as if this were" >>>>> " the host's SMT mode", &error_abort); >>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>> } >>>>> >>>>> static void spapr_machine_finalizefn(Object *obj) >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>> /* >>>>> * pseries-2.11 >>>>> */ >>>>> + >>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>> { >>>>> } >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>> index 1fb8a8e..da5f182 100644 >>>>> --- a/hw/vfio/common.c >>>>> +++ b/hw/vfio/common.c >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>> return -ENODEV; >>>>> } >>>>> >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>> +{ >>>>> + struct vfio_region_info *info = NULL; >>>>> + bool ret = false; >>>>> + >>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>> + if (vfio_get_region_info_cap(info, cap_type)) { >>>>> + ret = true; >>>>> + } >>>>> + g_free(info); >>>>> + } >>>>> + >>>>> + return ret; 
>>>>> +} >>>>> + >>>>> /* >>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>> */ >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>> index c977ee3..27a3706 100644 >>>>> --- a/hw/vfio/pci.c >>>>> +++ b/hw/vfio/pci.c >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>> off_t start, end; >>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>> >>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>> + vdev->msix->table_bar)) { >>>>> + return; >>>>> + } >>>> >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>> msix table at 0x2000, using 64kB pages. >>>> >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>> mmaps[0] >>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>> >>>> >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>> be mapped with 64kB page, leading to a qemu abort. >>>> >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>> 0xffffa8820000) = -22 (Invalid argument) >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>> >>>> In my case don't we still need the fixup logic? >>> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about >>> this. I'd say no, the fixup shouldn't be needed, but apparently we >>> can't remove it until we figure out this bug and who should be filtering >>> out sub-page size listener events. 
Thanks, >> >> >> This should do it: >> >> @@ -303,11 +303,19 @@ static bool >> vfio_listener_skipped_section(MemoryRegionSection *section) >> * Sizing an enabled 64-bit BAR can cause spurious mappings to >> * addresses in the upper part of the 64-bit address space. These >> * are never accessed by the CPU and beyond the address width of >> * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. >> */ >> - section->offset_within_address_space & (1ULL << 63); >> + section->offset_within_address_space & (1ULL << 63) || >> + /* >> + * Non-system-page aligned RAM regions have no chance to be >> + * DMA mapped so skip them too. >> + */ >> + (memory_region_is_ram(section->mr) && >> + ((section->offset_within_region & qemu_real_host_page_mask) || >> + (int128_getlo(section->size) & qemu_real_host_page_mask)) >> + ); >> } >> >> >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have >> more complicated logic to detect whether it is MMIO? Or there is something >> bigger? > > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the > silly address hack above to try to exclude PCI BARs as they're being > resized, and b) there is some value to enabling peer-to-peer mappings > between devices. Is there a better, more generic solution to a) now? > If we know a MR is MMIO, then I think we can handle any associated > mapping failure as non-fatal. It's nice to enable p2p, but it's barely > used and not clear that it works reliably (sometimes even on bare > metal). When P2P works in a guest, what does it look like exactly? Do the devices talk via the IOMMU or directly? I'd assume the latter but then the guest device driver needs to know physical MMIO addresses and it does not or guest must program BARs exactly as the host did... Has anyone actually reported this working? 
I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
windows to PHB address space") and ended up with vfio_listener_region_add()
being called with RAM regions (which are mapped MMIO). sPAPR cannot handle
these other than by creating a window per memory section, and since we can
have just two windows, this fails. Adding memory_region_is_ram_device() to
vfio_listener_skipped_section() solves my particular problem, but I wonder if
I should rather enable this P2P on sPAPR?

> Nit, we don't yet have the IOVA range of the IOMMU exposed,
> thus the silly address hack, but we do have supported page sizes, so we
> should probably be referencing that rather than host page size.
>
>> As for now, it is possible to have non-aligned RAM MR, the KVM's listener
>> just does not register it with the KVM if unaligned.
>
> For a sanity check, we don't support running a guest with a smaller
> page size than the host, right?  I assume that would be the only case
> where we would have a reasonable chance of the guest using a host
> sub-page as a DMA target, but that seems like just one of an infinite
> number of problems with such a setup.  Thanks,
>
> Alex
>

--
Alexey

^ permalink raw reply [flat|nested] 19+ messages in thread
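The sub-page filtering proposed earlier in this thread boils down to an alignment test on the start and size of a memory section. A minimal sketch of that test, assuming a power-of-two page size, might look like the following; section_is_sub_page() is a hypothetical helper, not a QEMU function. It tests the low bits (page_size - 1), so a mask that has the low bits cleared, which is how page masks are commonly defined, would have to be inverted before being used this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when either the start or the size
 * of a section is not a multiple of page_size, i.e. the section cannot
 * be DMA mapped with pages of that size. page_size must be a power of two.
 */
static bool section_is_sub_page(uint64_t start, uint64_t size,
                                uint64_t page_size)
{
    uint64_t low_bits = page_size - 1;

    return (start & low_bits) != 0 || (size & low_bits) != 0;
}
```

Applied to the failing case reported above (start 0x8000000000, size 0x2000, 64K pages), the helper flags the section as sub-page, so a listener using such a check could skip it instead of aborting on the failed VFIO_MAP_DMA. The same section would pass with 4K pages.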
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 8:55 ` Alexey Kardashevskiy @ 2018-01-19 15:57 ` Alex Williamson 2018-01-20 0:04 ` Alexey Kardashevskiy 0 siblings, 1 reply; 19+ messages in thread From: Alex Williamson @ 2018-01-19 15:57 UTC (permalink / raw) To: Alexey Kardashevskiy; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On Fri, 19 Jan 2018 19:55:49 +1100 Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > On 19/01/18 08:59, Alex Williamson wrote: > > On Tue, 16 Jan 2018 16:17:58 +1100 > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > > > >> On 06/01/18 02:29, Alex Williamson wrote: > >>> On Fri, 5 Jan 2018 10:48:07 +0100 > >>> Auger Eric <eric.auger@redhat.com> wrote: > >>> > >>>> Hi Alexey, > >>>> > >>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: > >>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability > >>>>> which tells that a region with MSIX data can be mapped entirely, i.e. > >>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. > >>>>> > >>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors > >>>>> are emulated on top unless the machine requests not to by defining and > >>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only > >>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow > >>>>> enabling it as it does not define the "set" callback for the new property; > >>>>> the new property also does not appear in "-machine pseries,help". > >>>>> > >>>>> If the new capability is present, this puts MSIX IO memory region under > >>>>> mapped memory region. If the capability is not there, it falls back to > >>>>> the old behaviour with the sparse capability. > >>>>> > >>>>> In MSIX vectors section is not aligned to the page size, the KVM memory > >>>>> listener does not register it with the KVM as a memory slot and MSIX is > >>>>> emulated by QEMU as before. 
> >>>>> > >>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - > >>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html > >>>>> > >>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > >>>>> --- > >>>>> > >>>>> This is mtree and flatview BEFORE this patch: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #0 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> > >>>>> This is AFTER this patch applied: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 
1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #2 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> > >>>>> > >>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched > >>>>> to enable emulation: > >>>>> > >>>>> "info mtree": > >>>>> memory-region: pci@800000020000000.mmio > >>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio > >>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 > >>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] > >>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> > >>>>> "info mtree -f": > >>>>> FlatView #1 > >>>>> AS "memory", root: system > >>>>> AS "cpu-memory", root: system > >>>>> Root memory region: system > >>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram > >>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] > >>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table > >>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 > >>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] > >>>>> --- > >>>>> include/hw/vfio/vfio-common.h | 1 + > >>>>> linux-headers/linux/vfio.h | 5 +++++ > >>>>> hw/ppc/spapr.c | 8 ++++++++ 
> >>>>> hw/vfio/common.c | 15 +++++++++++++++ > >>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- > >>>>> 5 files changed, 50 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > >>>>> index f3a2ac9..927d600 100644 > >>>>> --- a/include/hw/vfio/vfio-common.h > >>>>> +++ b/include/hw/vfio/vfio-common.h > >>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, > >>>>> struct vfio_region_info **info); > >>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>>>> uint32_t subtype, struct vfio_region_info **info); > >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); > >>>>> #endif > >>>>> extern const MemoryListener vfio_prereg_listener; > >>>>> > >>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >>>>> index 4312e96..b45182e 100644 > >>>>> --- a/linux-headers/linux/vfio.h > >>>>> +++ b/linux-headers/linux/vfio.h > >>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { > >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) > >>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) > >>>>> > >>>>> +/* > >>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
> >>>>> + */ > >>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > >>>>> + > >>>>> /** > >>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > >>>>> * struct vfio_irq_info) > >>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c > >>>>> index 9de63f0..693394a 100644 > >>>>> --- a/hw/ppc/spapr.c > >>>>> +++ b/hw/ppc/spapr.c > >>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, > >>>>> spapr->use_hotplug_event_source = value; > >>>>> } > >>>>> > >>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) > >>>>> +{ > >>>>> + return true; > >>>>> +} > >>>>> + > >>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) > >>>>> { > >>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); > >>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) > >>>>> object_property_set_description(obj, "vsmt", > >>>>> "Virtual SMT: KVM behaves as if this were" > >>>>> " the host's SMT mode", &error_abort); > >>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", > >>>>> + spapr_get_msix_emulation, NULL, NULL); > >>>>> } > >>>>> > >>>>> static void spapr_machine_finalizefn(Object *obj) > >>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { > >>>>> /* > >>>>> * pseries-2.11 > >>>>> */ > >>>>> + > >>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) > >>>>> { > >>>>> } > >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c > >>>>> index 1fb8a8e..da5f182 100644 > >>>>> --- a/hw/vfio/common.c > >>>>> +++ b/hw/vfio/common.c > >>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, > >>>>> return -ENODEV; > >>>>> } > >>>>> > >>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) > >>>>> +{ > >>>>> + struct vfio_region_info *info = NULL; > >>>>> + bool ret = false; > >>>>> + > >>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { > >>>>> + if 
(vfio_get_region_info_cap(info, cap_type)) { > >>>>> + ret = true; > >>>>> + } > >>>>> + g_free(info); > >>>>> + } > >>>>> + > >>>>> + return ret; > >>>>> +} > >>>>> + > >>>>> /* > >>>>> * Interfaces for IBM EEH (Enhanced Error Handling) > >>>>> */ > >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > >>>>> index c977ee3..27a3706 100644 > >>>>> --- a/hw/vfio/pci.c > >>>>> +++ b/hw/vfio/pci.c > >>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) > >>>>> off_t start, end; > >>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; > >>>>> > >>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, > >>>>> + vdev->msix->table_bar)) { > >>>>> + return; > >>>>> + } > >>>> > >>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + > >>>> msix table at 0x2000, using 64kB pages. > >>>> > >>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 > >>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 > >>>> mmaps[0] > >>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table > >>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] > >>>> > >>>> > >>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass > >>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the > >>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot > >>>> be mapped with 64kB page, leading to a qemu abort. > >>>> > >>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] > >>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] > >>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 > >>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, > >>>> 0xffffa8820000) = -22 (Invalid argument) > >>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue > >>>> > >>>> In my case don't we still need the fixup logic? 
> >>>
> >>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
> >>> this. I'd say no, the fixup shouldn't be needed, but apparently we
> >>> can't remove it until we figure out this bug and who should be filtering
> >>> out sub-page size listener events. Thanks,
> >>
> >>
> >> This should do it:
> >>
> >> @@ -303,11 +303,19 @@ static bool
> >> vfio_listener_skipped_section(MemoryRegionSection *section)
> >>   * Sizing an enabled 64-bit BAR can cause spurious mappings to
> >>   * addresses in the upper part of the 64-bit address space. These
> >>   * are never accessed by the CPU and beyond the address width of
> >>   * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
> >>   */
> >> - section->offset_within_address_space & (1ULL << 63);
> >> + section->offset_within_address_space & (1ULL << 63) ||
> >> + /*
> >> +  * Non-system-page aligned RAM regions have no chance to be
> >> +  * DMA mapped so skip them too.
> >> +  */
> >> + (memory_region_is_ram(section->mr) &&
> >> +  ((section->offset_within_region & qemu_real_host_page_mask) ||
> >> +   (int128_getlo(section->size) & qemu_real_host_page_mask))
> >> + );
> >> }
> >>
> >>
> >> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
> >> more complicated logic to detect whether it is MMIO? Or there is something
> >> bigger?
> >
> > Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
> > silly address hack above to try to exclude PCI BARs as they're being
> > resized, and b) there is some value to enabling peer-to-peer mappings
> > between devices. Is there a better, more generic solution to a) now?
> > If we know a MR is MMIO, then I think we can handle any associated
> > mapping failure as non-fatal. It's nice to enable p2p, but it's barely
> > used and not clear that it works reliably (sometimes even on bare
> > metal).
>
> When P2P works in a guest, what does it look like exactly? Do the devices
> talk via the IOMMU or directly? I'd assume the latter but then the guest
> device driver needs to know physical MMIO addresses and it does not or
> guest must program BARs exactly as the host did... Has anyone actually
> reported this working?

I've seen it work, NVIDIA has a peer to peer test in their cuda tests,
see previous patches enabling GPUDirect. The route is generally
through the IOMMU, guests don't know host physical addresses. I say
generally though because ACS does have a Directed Translated P2P
feature that would allow a more efficient path, but I've never seen
anything that can take advantage of this.

> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
> windows to PHB address space") and ended up having
> vfio_listener_region_add() called with RAM regions (which are mapped MMIO)
> and sPAPR cannot handle it otherwise than creating a window per a memory
> section and since we can have just two, this fails. Adding
> memory_region_is_ram_device() to vfio_listener_skipped_section() solves my
> particular problem but I wonder if I should rather enable this P2P on sPAPR?

Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO)
mappings non-fatal. I think there's probably enough skepticism in
drivers as to whether p2p works that this isn't as critical. Of course
making it work is even better. Thanks,

Alex
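[Editorial sketch, not part of the thread] The sub-page filter discussed in the quoted patch comes down to a plain alignment test, which can be shown outside QEMU. The helper names below are hypothetical and the page size is passed in explicitly rather than read from qemu_real_host_page_mask. As far as I can tell that mask keeps the high (page number) bits, so a test written against it would likely need the complement; the sketch sidesteps the question by deriving the low-bit mask from the page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU code: true when both the offset and the
 * size of a section are multiples of page_size, which is what a DMA
 * mapping at host page granularity needs.
 */
static bool section_is_page_aligned(uint64_t offset, uint64_t size,
                                    uint64_t page_size)
{
    /* The mask arithmetic below only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* Low bits that must all be zero in an aligned value. */
    uint64_t low_bits = page_size - 1;

    return size != 0 &&
           (offset & low_bits) == 0 &&
           (size & low_bits) == 0;
}

/*
 * Hypothetical skip predicate modelled on the proposal in the thread:
 * RAM-backed sections that cannot be mapped are skipped instead of being
 * handed to the DMA map call, where they would fail.
 */
static bool skip_unmappable_ram_section(bool is_ram, uint64_t offset,
                                        uint64_t size, uint64_t page_size)
{
    return is_ram && !section_is_page_aligned(offset, size, page_size);
}
```

With the numbers from the ARM report (an 8kB section, 0x2000 bytes, on a host with 64kB pages) the predicate says "skip", while the same section on a 4kB-page host would still be mapped.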
* Re: [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR 2018-01-19 15:57 ` Alex Williamson @ 2018-01-20 0:04 ` Alexey Kardashevskiy 0 siblings, 0 replies; 19+ messages in thread From: Alexey Kardashevskiy @ 2018-01-20 0:04 UTC (permalink / raw) To: Alex Williamson; +Cc: Auger Eric, qemu-devel, qemu-ppc, David Gibson On 20/01/18 02:57, Alex Williamson wrote: > On Fri, 19 Jan 2018 19:55:49 +1100 > Alexey Kardashevskiy <aik@ozlabs.ru> wrote: > >> On 19/01/18 08:59, Alex Williamson wrote: >>> On Tue, 16 Jan 2018 16:17:58 +1100 >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote: >>> >>>> On 06/01/18 02:29, Alex Williamson wrote: >>>>> On Fri, 5 Jan 2018 10:48:07 +0100 >>>>> Auger Eric <eric.auger@redhat.com> wrote: >>>>> >>>>>> Hi Alexey, >>>>>> >>>>>> On 15/12/17 07:29, Alexey Kardashevskiy wrote: >>>>>>> This makes use of a new VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability >>>>>>> which tells that a region with MSIX data can be mapped entirely, i.e. >>>>>>> the VFIO PCI driver won't prevent MSIX vectors area from being mapped. >>>>>>> >>>>>>> With this change, all BARs are mapped in a single chunk and MSIX vectors >>>>>>> are emulated on top unless the machine requests not to by defining and >>>>>>> enabling a new "vfio-no-msix-emulation" property. At the moment only >>>>>>> sPAPR machine does so - it prohibits MSIX emulation and does not allow >>>>>>> enabling it as it does not define the "set" callback for the new property; >>>>>>> the new property also does not appear in "-machine pseries,help". >>>>>>> >>>>>>> If the new capability is present, this puts MSIX IO memory region under >>>>>>> mapped memory region. If the capability is not there, it falls back to >>>>>>> the old behaviour with the sparse capability. >>>>>>> >>>>>>> In MSIX vectors section is not aligned to the page size, the KVM memory >>>>>>> listener does not register it with the KVM as a memory slot and MSIX is >>>>>>> emulated by QEMU as before. 
>>>>>>> >>>>>>> This requires the kernel change - "vfio-pci: Allow mapping MSIX BAR" - >>>>>>> for the new capability: https://www.spinics.net/lists/kvm/msg160282.html >>>>>>> >>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>>>>>> --- >>>>>>> >>>>>>> This is mtree and flatview BEFORE this patch: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #0 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table [disabled] >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 
1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #2 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> >>>>>>> >>>>>>> This is AFTER this patch applied AND spapr_get_msix_emulation() patched >>>>>>> to enable emulation: >>>>>>> >>>>>>> "info mtree": >>>>>>> memory-region: pci@800000020000000.mmio >>>>>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio >>>>>>> 0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 BAR 1 >>>>>>> 0000210000000000-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled] >>>>>>> 0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 BAR 3 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> >>>>>>> "info mtree -f": >>>>>>> FlatView #1 >>>>>>> AS "memory", root: system >>>>>>> AS "cpu-memory", root: system >>>>>>> Root memory region: system >>>>>>> 0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram >>>>>>> 0000210000000000-000021000000dfff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] >>>>>>> 000021000000e000-000021000000e5ff (prio 0, i/o): msix-table >>>>>>> 000021000000e600-000021000000ffff (prio 0, ramd): 0001:03:00.0 BAR 1 mmaps[0] @000000000000e600 >>>>>>> 0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0] >>>>>>> --- >>>>>>> include/hw/vfio/vfio-common.h | 1 + >>>>>>> linux-headers/linux/vfio.h | 5 +++++ >>>>>>> hw/ppc/spapr.c | 8 ++++++++ 
>>>>>>> hw/vfio/common.c | 15 +++++++++++++++ >>>>>>> hw/vfio/pci.c | 23 +++++++++++++++++++++-- >>>>>>> 5 files changed, 50 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h >>>>>>> index f3a2ac9..927d600 100644 >>>>>>> --- a/include/hw/vfio/vfio-common.h >>>>>>> +++ b/include/hw/vfio/vfio-common.h >>>>>>> @@ -171,6 +171,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index, >>>>>>> struct vfio_region_info **info); >>>>>>> int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> uint32_t subtype, struct vfio_region_info **info); >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region); >>>>>>> #endif >>>>>>> extern const MemoryListener vfio_prereg_listener; >>>>>>> >>>>>>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h >>>>>>> index 4312e96..b45182e 100644 >>>>>>> --- a/linux-headers/linux/vfio.h >>>>>>> +++ b/linux-headers/linux/vfio.h >>>>>>> @@ -301,6 +301,11 @@ struct vfio_region_info_cap_type { >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG (2) >>>>>>> #define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG (3) >>>>>>> >>>>>>> +/* >>>>>>> + * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped. 
>>>>>>> + */ >>>>>>> +#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 >>>>>>> + >>>>>>> /** >>>>>>> * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, >>>>>>> * struct vfio_irq_info) >>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>>>>>> index 9de63f0..693394a 100644 >>>>>>> --- a/hw/ppc/spapr.c >>>>>>> +++ b/hw/ppc/spapr.c >>>>>>> @@ -2772,6 +2772,11 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value, >>>>>>> spapr->use_hotplug_event_source = value; >>>>>>> } >>>>>>> >>>>>>> +static bool spapr_get_msix_emulation(Object *obj, Error **errp) >>>>>>> +{ >>>>>>> + return true; >>>>>>> +} >>>>>>> + >>>>>>> static char *spapr_get_resize_hpt(Object *obj, Error **errp) >>>>>>> { >>>>>>> sPAPRMachineState *spapr = SPAPR_MACHINE(obj); >>>>>>> @@ -2853,6 +2858,8 @@ static void spapr_machine_initfn(Object *obj) >>>>>>> object_property_set_description(obj, "vsmt", >>>>>>> "Virtual SMT: KVM behaves as if this were" >>>>>>> " the host's SMT mode", &error_abort); >>>>>>> + object_property_add_bool(obj, "vfio-no-msix-emulation", >>>>>>> + spapr_get_msix_emulation, NULL, NULL); >>>>>>> } >>>>>>> >>>>>>> static void spapr_machine_finalizefn(Object *obj) >>>>>>> @@ -3742,6 +3749,7 @@ static const TypeInfo spapr_machine_info = { >>>>>>> /* >>>>>>> * pseries-2.11 >>>>>>> */ >>>>>>> + >>>>>>> static void spapr_machine_2_11_instance_options(MachineState *machine) >>>>>>> { >>>>>>> } >>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>> index 1fb8a8e..da5f182 100644 >>>>>>> --- a/hw/vfio/common.c >>>>>>> +++ b/hw/vfio/common.c >>>>>>> @@ -1411,6 +1411,21 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type, >>>>>>> return -ENODEV; >>>>>>> } >>>>>>> >>>>>>> +bool vfio_is_cap_present(VFIODevice *vbasedev, uint16_t cap_type, int region) >>>>>>> +{ >>>>>>> + struct vfio_region_info *info = NULL; >>>>>>> + bool ret = false; >>>>>>> + >>>>>>> + if (!vfio_get_region_info(vbasedev, region, &info)) { >>>>>>> + if 
(vfio_get_region_info_cap(info, cap_type)) { >>>>>>> + ret = true; >>>>>>> + } >>>>>>> + g_free(info); >>>>>>> + } >>>>>>> + >>>>>>> + return ret; >>>>>>> +} >>>>>>> + >>>>>>> /* >>>>>>> * Interfaces for IBM EEH (Enhanced Error Handling) >>>>>>> */ >>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>>>> index c977ee3..27a3706 100644 >>>>>>> --- a/hw/vfio/pci.c >>>>>>> +++ b/hw/vfio/pci.c >>>>>>> @@ -1289,6 +1289,11 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev) >>>>>>> off_t start, end; >>>>>>> VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region; >>>>>>> >>>>>>> + if (vfio_is_cap_present(&vdev->vbasedev, VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, >>>>>>> + vdev->msix->table_bar)) { >>>>>>> + return; >>>>>>> + } >>>>>> >>>>>> I am testing this patch on ARM with a Mellanox CX-4 card. Single bar0 + >>>>>> msix table at 0x2000, using 64kB pages. >>>>>> >>>>>> 0000000000000000-0000000001ffffff (prio 1, i/o): 0000:01:00.0 BAR 0 >>>>>> 0000000000000000-0000000001ffffff (prio 0, ramd): 0000:01:00.0 BAR 0 >>>>>> mmaps[0] >>>>>> 0000000000002000-00000000000023ff (prio 0, i/o): msix-table >>>>>> 0000000000003000-0000000000003007 (prio 0, i/o): msix-pba [disabled] >>>>>> >>>>>> >>>>>> My kernel includes your "vfio-pci: Allow mapping MSIX BAR" so we bypass >>>>>> vfio_pci_fixup_msix_region. But as the mmaps[0] is not fixed up the >>>>>> vfio_listener_region_add_ram gets called on the first 8kB which cannot >>>>>> be mapped with 64kB page, leading to a qemu abort. >>>>>> >>>>>> 29520@1515145231.232161:vfio_listener_region_add_ram region_add [ram] >>>>>> 0x8000000000 - 0x8000001fff [0xffffa8820000] >>>>>> qemu-system-aarch64: VFIO_MAP_DMA: -22 >>>>>> qemu-system-aarch64: vfio_dma_map(0x1e265f60, 0x8000000000, 0x2000, >>>>>> 0xffffa8820000) = -22 (Invalid argument) >>>>>> qemu: hardware error: vfio: DMA mapping failed, unable to continue >>>>>> >>>>>> In my case don't we still need the fixup logic? 
>>>>>
>>>>> Ah, I didn't realize you had Alexey's QEMU patch when you asked about
>>>>> this. I'd say no, the fixup shouldn't be needed, but apparently we
>>>>> can't remove it until we figure out this bug and who should be filtering
>>>>> out sub-page size listener events. Thanks,
>>>>
>>>>
>>>> This should do it:
>>>>
>>>> @@ -303,11 +303,19 @@ static bool
>>>> vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>   * Sizing an enabled 64-bit BAR can cause spurious mappings to
>>>>   * addresses in the upper part of the 64-bit address space. These
>>>>   * are never accessed by the CPU and beyond the address width of
>>>>   * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
>>>>   */
>>>> - section->offset_within_address_space & (1ULL << 63);
>>>> + section->offset_within_address_space & (1ULL << 63) ||
>>>> + /*
>>>> +  * Non-system-page aligned RAM regions have no chance to be
>>>> +  * DMA mapped so skip them too.
>>>> +  */
>>>> + (memory_region_is_ram(section->mr) &&
>>>> +  ((section->offset_within_region & qemu_real_host_page_mask) ||
>>>> +   (int128_getlo(section->size) & qemu_real_host_page_mask))
>>>> + );
>>>> }
>>>>
>>>>
>>>> Do we VFIO DMA map on every RAM MR just because it is not necessary to have
>>>> more complicated logic to detect whether it is MMIO? Or there is something
>>>> bigger?
>>>
>>> Both, a) it's not clear how we determine a MR is MMIO vs RAM, thus the
>>> silly address hack above to try to exclude PCI BARs as they're being
>>> resized, and b) there is some value to enabling peer-to-peer mappings
>>> between devices. Is there a better, more generic solution to a) now?
>>> If we know a MR is MMIO, then I think we can handle any associated
>>> mapping failure as non-fatal. It's nice to enable p2p, but it's barely
>>> used and not clear that it works reliably (sometimes even on bare
>>> metal).
>>
>> When P2P works in a guest, what does it look like exactly? Do the devices
>> talk via the IOMMU or directly? I'd assume the latter but then the guest
>> device driver needs to know physical MMIO addresses and it does not or
>> guest must program BARs exactly as the host did... Has anyone actually
>> reported this working?
>
> I've seen it work, NVIDIA has a peer to peer test in their cuda tests,
> see previous patches enabling GPUDirect. The route is generally
> through the IOMMU, guests don't know host physical addresses. I say
> generally though because ACS does have a Directed Translated P2P
> feature that would allow a more efficient path, but I've never seen
> anything that can take advantage of this.

Ah, I see. I guess it still makes sense with IOMMU as it eliminates an
extra memory copy.

>
>> I just tried exposing MMIO on a PCI AS ("[PATCH qemu] spapr_pci: Alias MMIO
>> windows to PHB address space") and ended up having
>> vfio_listener_region_add() called with RAM regions (which are mapped MMIO)
>> and sPAPR cannot handle it otherwise than creating a window per a memory
>> section and since we can have just two, this fails. Adding
>> memory_region_is_ram_device() to vfio_listener_skipped_section() solves my
>> particular problem but I wonder if I should rather enable this P2P on sPAPR?
>
> Yeah, as mentioned it'd be nice to at least make ram_device (ie. MMIO)
> mappings non-fatal. I think there's probably enough skepticism in
> drivers as to whether p2p works that this isn't as critical. Of course
> making it work is even better. Thanks,

Well, to make P2P-via-IOMMU work on sPAPR (assuming that our IOMMU can
actually redirect back to the bus, which I am not sure about), these MMIOs
need to be mapped via the huge DMA window by the guest and not by QEMU-VFIO,
so sPAPR will need another machine option ("no-mmio-dma-mmap"? implemented
like "vfio-no-msix-emulation" proposed by this patch) to tell
vfio_listener_region_add() not to vfio_dma_map on MMIO MRs;
vfio_listener_region_del() will need adjustment too.

Would that be considered as an option or is there a better interface?


--
Alexey
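[Editorial sketch, not part of the thread] The two ideas on the table, making ram_device mapping failures non-fatal and adding a machine option that stops the listener from DMA mapping MMIO at all, can be combined into one small decision function. Everything below is hypothetical: the enum names, the function, and the no_mmio_dma_map flag (standing in for the tentatively named "no-mmio-dma-mmap" option) are illustrations, not QEMU types or an agreed interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of a memory section; not a QEMU type. */
enum section_kind {
    SECTION_RAM,        /* ordinary guest RAM */
    SECTION_RAM_DEVICE, /* mmap'ed device MMIO ("ramd" in the mtree output) */
    SECTION_OTHER       /* anything the listener never maps */
};

/* Hypothetical outcome of the listener's decision. */
enum map_policy {
    MAP_SKIP,           /* do not attempt a DMA mapping */
    MAP_REQUIRED,       /* map; a failure is fatal */
    MAP_BEST_EFFORT     /* map; a failure is only reported */
};

/*
 * Decide what to do with a section. no_mmio_dma_map models a machine that
 * wants the guest, not QEMU, to map MMIO for peer-to-peer DMA.
 */
static enum map_policy dma_map_policy(enum section_kind kind,
                                      bool no_mmio_dma_map)
{
    /* Catch callers passing a value outside the enum. */
    assert(kind == SECTION_RAM || kind == SECTION_RAM_DEVICE ||
           kind == SECTION_OTHER);

    switch (kind) {
    case SECTION_RAM:
        return MAP_REQUIRED;
    case SECTION_RAM_DEVICE:
        return no_mmio_dma_map ? MAP_SKIP : MAP_BEST_EFFORT;
    default:
        return MAP_SKIP;
    }
}
```

A region_del counterpart would have to consult the same policy, so that sections which were never mapped are not unmapped either.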
end of thread, other threads:[~2018-01-24 10:36 UTC | newest]

Thread overview: 19+ messages
2017-12-15  6:29 [Qemu-devel] [PATCH qemu v2] RFC: vfio-pci: Allow mmap of MSIX BAR Alexey Kardashevskiy
2017-12-19  4:52 ` David Gibson
2017-12-19 17:26 ` Alex Williamson
2017-12-20  1:13 ` Alexey Kardashevskiy
2017-12-20  3:24 ` David Gibson
2018-01-05  9:48 ` Auger Eric
2018-01-05 15:29 ` Alex Williamson
2018-01-16  5:17 ` Alexey Kardashevskiy
2018-01-17  8:17 ` Auger Eric
2018-01-18 21:59 ` Alex Williamson
2018-01-19  2:41 ` Alexey Kardashevskiy
2018-01-19  3:25 ` Alex Williamson
2018-01-19  4:31 ` Alexey Kardashevskiy
2018-01-19 13:15 ` Auger Eric
2018-01-23 19:56 ` Alex Williamson
2018-01-24 10:35 ` Auger Eric
2018-01-19  8:55 ` Alexey Kardashevskiy
2018-01-19 15:57 ` Alex Williamson
2018-01-20  0:04 ` Alexey Kardashevskiy