From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:33568) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZC4Z1-0007Dl-U7 for qemu-devel@nongnu.org; Mon, 06 Jul 2015 07:27:57 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZC4Yx-0003n5-Qq for qemu-devel@nongnu.org; Mon, 06 Jul 2015 07:27:55 -0400 Received: from mail-pd0-f172.google.com ([209.85.192.172]:36081) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZC4Yx-0003mi-Ho for qemu-devel@nongnu.org; Mon, 06 Jul 2015 07:27:51 -0400 Received: by pddu5 with SMTP id u5so17157773pdd.3 for ; Mon, 06 Jul 2015 04:27:49 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-15-git-send-email-aik@ozlabs.ru> <20150706110638.GF17857@voom.redhat.com> From: Alexey Kardashevskiy Message-ID: <559A662E.5000807@ozlabs.ru> Date: Mon, 6 Jul 2015 21:27:42 +1000 MIME-Version: 1.0 In-Reply-To: <20150706110638.GF17857@voom.redhat.com> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: Michael Roth , Alex Williamson , qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan On 07/06/2015 09:06 PM, David Gibson wrote: > On Mon, Jul 06, 2015 at 12:11:10PM +1000, Alexey Kardashevskiy wrote: >> This adds support for Dynamic DMA Windows (DDW) option defined by >> the SPAPR specification which allows to have additional DMA window(s) >> >> This implements DDW for emulated and VFIO devices. As all TCE root regions >> are mapped at 0 and 64bit long (and actual tables are child regions), >> this replaces memory_region_add_subregion() with _overlap() to make >> QEMU memory API happy. >> >> This reserves RTAS token numbers for DDW calls. >> >> This implements helpers to interact with VFIO kernel interface. >> >> This changes the TCE table migration descriptor to support dynamic >> tables as from now on, PHB will create as many stub TCE table objects >> as PHB can possibly support but not all of them might be initialized at >> the time of migration because DDW might or might not be requested by >> the guest. >> >> The "ddw" property is enabled by default on a PHB but for compatibility >> the pseries-2.3 machine and older disable it. >> >> This implements DDW for VFIO. The host kernel support is required. >> This adds a "levels" property to PHB to control the number of levels >> in the actual TCE table allocated by the host kernel, 0 is the default >> value to tell QEMU to calculate the correct value. Current hardware >> supports up to 5 levels. >> >> The existing linux guests try creating one additional huge DMA window >> with 64K or 16MB pages and map the entire guest RAM to. If succeeded, >> the guest switches to dma_direct_ops and never calls TCE hypercalls >> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM >> and not waste time on map/unmap later. This adds a "dma64_win_addr" >> property which is a bus address for the 64bit window and by default >> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware >> uses and this allows having emulated and VFIO devices on the same bus. >> >> This adds 4 RTAS handlers: >> * ibm,query-pe-dma-window >> * ibm,create-pe-dma-window >> * ibm,remove-pe-dma-window >> * ibm,reset-pe-dma-window >> These are registered from type_init() callback. >> >> These RTAS handlers are implemented in a separate file to avoid polluting >> spapr_iommu.c with PCI. >> >> Signed-off-by: Alexey Kardashevskiy >> --- >> Changes: >> v10: >> * added dma64_win_addr property to PHB >> * removed redundand check for "!migtable" in spapr_tce_table_post_load() >> >> v9: >> * fixed default 64bit window start (from mdroth) >> * fixed type cast in dma window update code (from mdroth) >> * spapr_phb_dma_update() now can fail and cause hotplug failure if >> hardware TCE table cannot be mapped to the same bus address as the emulated one >> >> v7: >> * fixed uninitialized variables >> >> v6: >> * rework as there is no more special device for VFIO PHB >> >> v5: >> * total rework >> * enabled for machines >2.3 >> * fixed migration >> * merged rtas handlers here >> >> v4: >> * reset handler is back in generalized form >> >> v3: >> * removed reset >> * windows_num is now 1 or bigger rather than 0-based value and it is only >> changed in PHB code, not in RTAS >> * added page mask check in create() >> * added SPAPR_PCI_DDW_MAX_WINDOWS to track how many windows are already >> created >> >> v2: >> * tested on hacked emulated E1000 >> * implemented DDW reset on the PHB reset >> * spapr_pci_ddw_remove/spapr_pci_ddw_reset are public for reuse by VFIO >> --- >> hw/ppc/Makefile.objs | 3 + >> hw/ppc/spapr.c | 5 + >> hw/ppc/spapr_iommu.c | 32 ++++- >> hw/ppc/spapr_pci.c | 110 ++++++++++++++-- >> hw/ppc/spapr_pci_vfio.c | 88 +++++++++++++ >> hw/ppc/spapr_rtas_ddw.c | 300 ++++++++++++++++++++++++++++++++++++++++++++ >> hw/vfio/common.c | 2 + >> include/hw/pci-host/spapr.h | 21 +++- >> include/hw/ppc/spapr.h | 17 ++- >> trace-events | 6 + >> 10 files changed, 568 insertions(+), 16 deletions(-) >> create mode 100644 hw/ppc/spapr_rtas_ddw.c >> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs >> index c8ab06e..0b2ff6d 100644 >> --- a/hw/ppc/Makefile.objs >> +++ b/hw/ppc/Makefile.objs >> @@ -7,6 +7,9 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o >> ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy) >> obj-y += spapr_pci_vfio.o >> endif >> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES), yy) >> +obj-y += spapr_rtas_ddw.o >> +endif >> # PowerPC 4xx boards >> obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o >> obj-y += ppc4xx_pci.o >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >> index 5ca817c..d50d50b 100644 >> --- a/hw/ppc/spapr.c >> +++ b/hw/ppc/spapr.c >> @@ -1860,6 +1860,11 @@ static const TypeInfo spapr_machine_info = { >> .driver = "spapr-pci-host-bridge",\ >> .property = "dynamic-reconfiguration",\ >> .value = "off",\ >> + },\ >> + {\ >> + .driver = TYPE_SPAPR_PCI_HOST_BRIDGE,\ >> + .property = "ddw",\ >> + .value = stringify(off),\ >> }, >> >> #define SPAPR_COMPAT_2_2 \ >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c >> index 2d99c3b..b54c3d8 100644 >> --- a/hw/ppc/spapr_iommu.c >> +++ b/hw/ppc/spapr_iommu.c >> @@ -136,6 +136,15 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr, >> return ret; >> } >> >> +static void spapr_tce_table_pre_save(void *opaque) >> +{ >> + sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque); >> + >> + tcet->migtable = tcet->table; >> +} >> + >> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel); >> + >> static int spapr_tce_table_post_load(void *opaque, int version_id) >> { >> sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque); >> @@ -144,22 +153,39 @@ static int spapr_tce_table_post_load(void *opaque, int version_id) >> spapr_vio_set_bypass(tcet->vdev, tcet->bypass); >> } >> >> + if (tcet->enabled) { >> + if (!tcet->table) { >> + tcet->enabled = false; >> + /* VFIO does not migrate so pass vfio_accel == false */ >> + spapr_tce_table_do_enable(tcet, false); >> + } >> + memcpy(tcet->table, tcet->migtable, >> + tcet->nb_table * sizeof(tcet->table[0])); >> + free(tcet->migtable); >> + tcet->migtable = NULL; >> + } >> + >> return 0; >> } >> >> static const VMStateDescription vmstate_spapr_tce_table = { >> .name = "spapr_iommu", >> - .version_id = 2, >> + .version_id = 3, >> .minimum_version_id = 2, >> + .pre_save = spapr_tce_table_pre_save, >> .post_load = spapr_tce_table_post_load, >> .fields = (VMStateField []) { >> /* Sanity check */ >> VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable), >> - VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable), >> >> /* IOMMU state */ >> + VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3), >> + VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3), >> + VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3), >> + VMSTATE_UINT32(nb_table, sPAPRTCETable), >> VMSTATE_BOOL(bypass, sPAPRTCETable), >> - VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, vmstate_info_uint64, uint64_t), >> + VMSTATE_VARRAY_UINT32_ALLOC(migtable, sPAPRTCETable, nb_table, 0, >> + vmstate_info_uint64, uint64_t), >> >> VMSTATE_END_OF_LIST() >> }, >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c >> index d1fa157..b7113b5 100644 >> --- a/hw/ppc/spapr_pci.c >> +++ b/hw/ppc/spapr_pci.c >> @@ -778,6 +778,9 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb) >> >> sphb->dma32_window_start = 0; >> sphb->dma32_window_size = SPAPR_PCI_DMA32_SIZE; >> + sphb->windows_supported = SPAPR_PCI_DMA_MAX_WINDOWS; >> + sphb->page_size_mask = (1ULL << 12) | (1ULL << 16) | (1ULL << 24); >> + sphb->dma64_window_size = pow2ceil(ram_size); >> >> ret = spapr_phb_vfio_dma_capabilities_update(sphb); >> sphb->has_vfio = (ret == 0); >> @@ -785,12 +788,35 @@ static int spapr_phb_dma_capabilities_update(sPAPRPHBState *sphb) >> return 0; >> } >> >> -static int spapr_phb_dma_init_window(sPAPRPHBState *sphb, >> - uint32_t liobn, uint32_t page_shift, >> - uint64_t window_size) >> +int spapr_phb_dma_init_window(sPAPRPHBState *sphb, >> + uint32_t liobn, uint32_t page_shift, >> + uint64_t window_size) >> { >> uint64_t bus_offset = sphb->dma32_window_start; >> sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); >> + int ret; >> + >> + if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) { >> + return -1; >> + } >> + >> + if (sphb->ddw_enabled) { >> + if (sphb->has_vfio) { >> + ret = spapr_phb_vfio_dma_init_window(sphb, >> + page_shift, window_size, >> + &bus_offset); >> + if (ret) { >> + return ret; >> + } >> + } else if (SPAPR_PCI_DMA_WINDOW_NUM(liobn)) { >> + /* >> + * There is no VFIO so we choose a huge window address. >> + * If VFIO is added later, spapr_phb_dma_update() will fail >> + * and cause hotplug failure. >> + */ >> + bus_offset = sphb->dma64_window_start; >> + } >> + } >> >> spapr_tce_table_enable(tcet, bus_offset, page_shift, >> window_size >> page_shift, >> @@ -802,9 +828,14 @@ static int spapr_phb_dma_init_window(sPAPRPHBState *sphb, >> int spapr_phb_dma_remove_window(sPAPRPHBState *sphb, >> sPAPRTCETable *tcet) >> { >> + int ret = 0; >> + >> + if (sphb->has_vfio && sphb->ddw_enabled) { >> + ret = spapr_phb_vfio_dma_remove_window(sphb, tcet); >> + } >> spapr_tce_table_disable(tcet); >> >> - return 0; >> + return ret; >> } >> >> int spapr_phb_dma_reset(sPAPRPHBState *sphb) >> @@ -832,15 +863,46 @@ static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb) >> int ret = 0, i; >> bool had_vfio = sphb->has_vfio; >> sPAPRTCETable *tcet; >> + uint64_t bus_offset = 0; >> >> spapr_phb_dma_capabilities_update(sphb); >> >> + /* >> + * PHB got first VFIO device or lost last VFIO device; >> + * If it is the last VFIO device, we do not need windows anymore so >> + * remove them. >> + * If it is the first VFIO device, we have to remove them as >> + * we cannot request a specific window from the host kernel so we >> + * remove all windows and recreate them later if necessary. > > Am I right in thinking that there never should be (VFIO enabled) > windows when the first VFIO device is added though? > > If you're removing the windows when VFIO devices are removed, and any > windows created while !has_vfio shouldn't result in the kernel being > requested from the kernel..? Yes, I do not need this chunk. I realized that when I posted the patchset :( -- Alexey