From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Michael Roth <mdroth@linux.vnet.ibm.com>,
Alex Williamson <alex.williamson@redhat.com>,
qemu-ppc@nongnu.org, qemu-devel@nongnu.org,
Gavin Shan <gwshan@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
Date: Tue, 7 Jul 2015 19:46:51 +1000
Message-ID: <559BA00B.5070904@ozlabs.ru>
In-Reply-To: <20150706110638.GF17857@voom.redhat.com>
On 07/06/2015 09:06 PM, David Gibson wrote:
> On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote:
>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>> the SPAPR specification, which allows having additional DMA window(s).
>>
>> This implements DDW for emulated and VFIO devices. As all TCE root regions
>> are mapped at 0 and 64bit long (and actual tables are child regions),
>> this replaces memory_region_add_subregion() with _overlap() to make
>> QEMU memory API happy.
>>
>> This reserves RTAS token numbers for DDW calls.
>>
>> This implements helpers to interact with VFIO kernel interface.
>>
>> This changes the TCE table migration descriptor to support dynamic
>> tables as from now on, PHB will create as many stub TCE table objects
>> as PHB can possibly support but not all of them might be initialized at
>> the time of migration because DDW might or might not be requested by
>> the guest.
>>
>> The "ddw" property is enabled by default on a PHB but for compatibility
>> the pseries-2.3 machine and older disable it.
>>
>> This implements DDW for VFIO. The host kernel support is required.
>> This adds a "levels" property to PHB to control the number of levels
>> in the actual TCE table allocated by the host kernel, 0 is the default
>> value to tell QEMU to calculate the correct value. Current hardware
>> supports up to 5 levels.
>>
>> Existing Linux guests try to create one additional huge DMA window
>> with 64K or 16MB pages and map the entire guest RAM to it. If this succeeds,
>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>> property which is a bus address for the 64bit window and by default
>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>> uses and this allows having emulated and VFIO devices on the same bus.
>>
>> This adds 4 RTAS handlers:
>> * ibm,query-pe-dma-window
>> * ibm,create-pe-dma-window
>> * ibm,remove-pe-dma-window
>> * ibm,reset-pe-dma-window
>> These are registered from type_init() callback.
>>
>> These RTAS handlers are implemented in a separate file to avoid polluting
>> spapr_iommu.c with PCI.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v10:
>> * added dma64_win_addr property to PHB
>> * removed redundant check for "!migtable" in spapr_tce_table_post_load()
>>
>> v9:
>> * fixed default 64bit window start (from mdroth)
>> * fixed type cast in dma window update code (from mdroth)
>> * spapr_phb_dma_update() now can fail and cause hotplug failure if
>> hardware TCE table cannot be mapped to the same bus address as the emulated one
>>
>> v7:
>> * fixed uninitialized variables
>>
>> v6:
>> * rework as there is no more special device for VFIO PHB
>>
>> v5:
>> * total rework
>> * enabled for machines >2.3
>> * fixed migration
>> * merged rtas handlers here
>>
>> v4:
>> * reset handler is back in generalized form
>>
>> v3:
>> * removed reset
>> * windows_num is now 1 or bigger rather than 0-based value and it is only
>> changed in PHB code, not in RTAS
>> * added page mask check in create()
>> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already
>> created
>>
>> v2:
>> * tested on hacked emulated E1000
>> * implemented DDW reset on the PHB reset
>> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO
>> ---
>> hw/ppc/Makefile.objs | 3 +
>> hw/ppc/spapr.c | 5 +
>> hw/ppc/spapr_iommu.c | 32 ++++-
>> hw/ppc/spapr_pci.c | 110 ++++++++++++++--
>> hw/ppc/spapr_pci_vfio.c | 88 +++++++++++++
>> hw/ppc/spapr_rtas_ddw.c | 300 ++++++++++++++++++++++++++++++++++++++++++++
>> hw/vfio/common.c | 2 +
>> include/hw/pci-host/spapr.h | 21 +++-
>> include/hw/ppc/spapr.h | 17 ++-
>> trace-events | 6 +
>> 10 files changed, 568 insertions(+), 16 deletions(-)
>> create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index c8ab06e..0b2ff6d 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>> ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>> obj-y += spapr_pci_vfio.o
>> endif
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy)
>> +obj-y += spapr_rtas_ddw.o
>> +endif
>> # PowerPC 4xx boards
>> obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>> obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 5ca817c..d50d50b 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = {
>> .driver = "spapr-pci-host-bridge",\
>> .property = "dynamic-reconfiguration",\
>> .value = "off",\
>> + },\
>> + {\
>> + .driver = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>> + .property = "ddw",\
>> + .value = stringify(off),\
>> },
>>
>> #define SPAPR_COMPAT_2_2 \
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 2d99c3b..b54c3d8 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>> return ret;
>> }
>>
>> +static void spapr_tce_table_pre_save(void *opaque)
>> +{
>> + sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> +
>> + tcet->migtable = tcet->table;
>> +}
>> +
>> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel);
>> +
>> static int spapr_tce_table_post_load(void *opaque, int version_id)
>> {
>> sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
>> @@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id)
>> spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>> }
>>
>> + if (tcet->enabled) {
>> + if (!tcet->table) {
>> + tcet->enabled = false;
>> + /* VFIO does not migrate so pass vfio_accel == false */
>> + spapr_tce_table_do_enable(tcet, false);
>> + }
>> + memcpy(tcet->table, tcet->migtable,
>> + tcet->nb_table * sizeof(tcet->table[0]));
>> + free(tcet->migtable);
>> + tcet->migtable = NULL;
>> + }
>> +
>> return 0;
>> }
>>
>> static const VMStateDescription vmstate_spapr_tce_table = {
>> .name = "spapr_iommu",
>> - .version_id = 2,
>> + .version_id = 3,
>> .minimum_version_id = 2,
>> + .pre_save = spapr_tce_table_pre_save,
>> .post_load = spapr_tce_table_post_load,
>> .fields = (VMStateField []) {
>> /* Sanity check */
>> VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>> - VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>>
>> /* IOMMU state */
>> + VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
>> + VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
>> + VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>> + VMSTATE_UINT32(nb_table, sPAPRTCETable),
>> VMSTATE_BOOL(bypass, sPAPRTCETable),
>> - VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t),
>> + VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0,
>> + vmstate_info_uint64, uint64_t),
>>
>> VMSTATE_END_OF_LIST()
>> },
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index d1fa157..b7113b5 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>>
>> sphb->dma32_window_start = 0;
>> sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE;
>> + sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS;
>> + sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24);
>> + sphb->dma64_window_size = pow2ceil(ram_size);
>>
>> ret = spapr_phb_vfio_dma_capabilities_update(sphb);
>> sphb->has_vfio = (ret == 0);
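For readers following the capability values set in the hunk above, here is a minimal sketch of how they can be interpreted. It assumes that bit N of page_size_mask means "pages of 2^N bytes are supported" and that the 64bit window size is RAM size rounded up to a power of two. The ddw_* helper names are hypothetical and are not part of the patch; ddw_pow2ceil() is only a stand-in for QEMU's pow2ceil().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as page_size_mask in the hunk above: 4K, 64K and 16M pages.
 * Assumption: bit N set means IOMMU pages of 2^N bytes are supported. */
#define DDW_PAGE_SIZE_MASK ((1ULL << 12) | (1ULL << 16) | (1ULL << 24))

/* Hypothetical helper: is a page of 2^page_shift bytes allowed by the mask? */
static bool ddw_page_shift_supported(uint64_t mask, uint32_t page_shift)
{
    return page_shift < 64 && (mask & (1ULL << page_shift)) != 0;
}

/* Hypothetical stand-in for pow2ceil(): smallest power of two >= value.
 * Returns 0 for value == 0 or when the result does not fit in 64 bits. */
static uint64_t ddw_pow2ceil(uint64_t value)
{
    uint64_t p = 1;

    if (value == 0) {
        return 0;
    }
    while (p < value) {
        if (p > UINT64_MAX / 2) {
            return 0;
        }
        p <<= 1;
    }
    return p;
}

/* Number of TCE entries needed for a window: window_size >> page_shift,
 * as passed to spapr_tce_table_enable() in the patch. */
static uint64_t ddw_nb_table(uint64_t window_size, uint32_t page_shift)
{
    assert(page_shift < 64);
    return window_size >> page_shift;
}
```

With these assumptions, a guest with 3GB of RAM would get a 4GB 64bit window, and with 64K pages that window needs 65536 TCE entries.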
>> @@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb)
>> return 0;
>> }
>>
>> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> - uint32_t liobn, uint32_t page_shift,
>> - uint64_t window_size)
>> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> + uint32_t liobn, uint32_t page_shift,
>> + uint64_t window_size)
>> {
>> uint64_t bus_offset = sphb->dma32_window_start;
>> sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>> + int ret;
>> +
>> + if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>> + return -1;
>> + }
>> +
>> + if (sphb->ddw_enabled) {
>> + if (sphb->has_vfio) {
>> + ret = spapr_phb_vfio_dma_init_window(sphb,
>> + page_shift, window_size,
>> + &bus_offset);
>> + if (ret) {
>> + return ret;
>> + }
>> + } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) {
>> + /*
>> + * There is no VFIO so we choose a huge window address.
>> + * If VFIO is added later, spapr_phb_dma_update() will fail
>> + * and cause hotplug failure.
>> + */
>> + bus_offset = sphb->dma64_window_start;
>> + }
>> + }
>>
>> spapr_tce_table_enable(tcet, bus_offset, page_shift,
>> window_size >> page_shift,
>> @@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb,
>> int spapr_phb_dma_remove_window(sPAPRPHBState *sphb,
>> sPAPRTCETable *tcet)
>> {
>> + int ret = 0;
>> +
>> + if (sphb->has_vfio && sphb->ddw_enabled) {
>> + ret = spapr_phb_vfio_dma_remove_window(sphb, tcet);
>> + }
>> spapr_tce_table_disable(tcet);
>>
>> - return 0;
>> + return ret;
>> }
>>
>> int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>> @@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>> int ret = 0, i;
>> bool had_vfio = sphb->has_vfio;
>> sPAPRTCETable *tcet;
>> + uint64_t bus_offset = 0;
>>
>> spapr_phb_dma_capabilities_update(sphb);
>>
>> + /*
>> + * PHB got first VFIO device or lost last VFIO device;
>> + * If it is the last VFIO device, we do not need windows anymore so
>> + * remove them.
>> + * If it is the first VFIO device, we have to remove them as
>> + * we cannot request a specific window from the host kernel so we
>> + * remove all windows and recreate them later if necessary.
>
> Am I right in thinking that there never should be (VFIO enabled)
> windows when the first VFIO device is added though?
Actually, there should already be a 32bit window created in the container.
And the PHB may have no 32bit window at the moment of hotplug: it may have
been removed and a 64bit window created instead (this does not happen with
modern guests and is not supported by old guests anyway, but it may still be
the case for other OSes).
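To illustrate the case I mean, here is a rough sketch of the first-VFIO-device path. It is not the QEMU code: the container is a fake model that starts with a default window and picks bus addresses itself, and all names (fake_container, dma_window, sync_windows_on_first_vfio) are hypothetical. The point is only that the existing container windows get dropped, the guest-visible windows get recreated, and a mismatch in bus address fails the hotplug.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WINDOWS 2 /* hypothetical; stands in for SPAPR_PCI_DMA_MAX_WINDOWS */

/* Guest-visible window state, loosely modelled on sPAPRTCETable fields. */
struct dma_window {
    bool enabled;
    uint64_t bus_offset;
    uint32_t page_shift;
    uint32_t nb_table;
};

/* Fake host container: it chooses the bus address, the caller cannot. */
struct fake_container {
    uint64_t next_offset; /* address the "host" will hand out next */
    unsigned windows;     /* number of windows currently in the container */
};

static void fake_remove_all_windows(struct fake_container *c)
{
    c->windows = 0;
}

static int fake_create_window(struct fake_container *c, uint32_t page_shift,
                              uint64_t size, uint64_t *bus_offset)
{
    (void)page_shift;
    (void)size;
    if (c->windows >= MAX_WINDOWS) {
        return -ENOSPC;
    }
    c->windows++;
    *bus_offset = c->next_offset;
    return 0;
}

/* Drop whatever the container has (e.g. its default 32bit window), then
 * recreate every enabled guest window and check it landed at the same
 * bus address the guest already uses. */
static int sync_windows_on_first_vfio(struct fake_container *c,
                                      const struct dma_window *w, size_t n)
{
    size_t i;

    fake_remove_all_windows(c);
    for (i = 0; i < n; ++i) {
        uint64_t bus_offset = 0;
        int ret;

        if (!w[i].enabled) {
            continue;
        }
        ret = fake_create_window(c, w[i].page_shift,
                                 (uint64_t)w[i].nb_table << w[i].page_shift,
                                 &bus_offset);
        if (ret) {
            return ret;
        }
        if (bus_offset != w[i].bus_offset) {
            return -EFAULT; /* hotplug has to fail in this case */
        }
    }
    return 0;
}
```

So a PHB which only has a 64bit window enabled still ends up with a consistent container, as long as the host places that window where the guest expects it.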
> If you're removing the windows when VFIO devices are removed, and any
> windows created while !has_vfio shouldn't result in the kernel being
Did you mean "shouldn't result in the window being"?
> requested from the kernel..?
Either way, putting (i.e. releasing) a container should do the right thing.
>
>> + */
>> + if (had_vfio != sphb->has_vfio) {
>> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> + tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> + if (!tcet) {
>> + continue;
>> + }
>> + spapr_phb_vfio_dma_remove_window(sphb, tcet);
>> + }
>> + }
>> +
>> if (!had_vfio && sphb->has_vfio) {
>> for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> if (!tcet || !tcet->enabled) {
>> continue;
>> }
>> + ret = spapr_phb_vfio_dma_init_window(sphb,
>> + tcet->page_shift,
>> + (uint64_t)tcet->nb_table <<
>> + tcet->page_shift,
>> + &bus_offset);
>> + if (ret) {
>> + break;
>> + }
>> + if (bus_offset != tcet->bus_offset) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> if (tcet->fd >= 0) {
>> /*
>> * We got first vfio-pci device on accelerated table.
>> @@ -1143,7 +1205,10 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>> error_setg(errp, "Failed to create pci child device tree node");
>> goto out;
>> }
>> - spapr_phb_hotplug_dma_sync(phb);
>> + if (spapr_phb_hotplug_dma_sync(phb)) {
>> + error_setg(errp, "Failed to create DMA window(s)");
>> + goto out;
>> + }
>> }
>>
--
Alexey