From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>,
qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
Date: Tue, 29 Mar 2016 17:23:11 +1100 [thread overview]
Message-ID: <56FA1F4F.5080903@ozlabs.ru> (raw)
In-Reply-To: <20160329052236.GF15156@voom.fritz.box>
On 03/29/2016 04:22 PM, David Gibson wrote:
> On Thu, Mar 24, 2016 at 01:32:48PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 05:11 PM, David Gibson wrote:
>>> On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/23/2016 01:13 PM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>>>>>> This adds support for Dynamic DMA Windows (DDW) option defined by
>>>>>> the SPAPR specification which allows to have additional DMA window(s)
>>>>>>
>>>>>> This implements DDW for emulated and VFIO devices.
>>>>>> This reserves RTAS token numbers for DDW calls.
>>>>>>
>>>>>> This changes the TCE table migration descriptor to support dynamic
>>>>>> tables as from now on, PHB will create as many stub TCE table objects
>>>>>> as PHB can possibly support but not all of them might be initialized at
>>>>>> the time of migration because DDW might or might not be requested by
>>>>>> the guest.
>>>>>>
>>>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>>>> the pseries-2.5 machine and older disable it.
>>>>>>
>>>>>> This implements DDW for VFIO. The host kernel support is required.
>>>>>> This adds a "levels" property to PHB to control the number of levels
>>>>>> in the actual TCE table allocated by the host kernel, 0 is the default
>>>>>> value to tell QEMU to calculate the correct value. Current hardware
>>>>>> supports up to 5 levels.
>>>>>>
>>>>>> The existing linux guests try creating one additional huge DMA window
>>>>>> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
>>>>>> the guest switches to dma_direct_ops and never calls TCE hypercalls
>>>>>> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
>>>>>> and not waste time on map/unmap later. This adds a "dma64_win_addr"
>>>>>> property which is a bus address for the 64bit window and by default
>>>>>> set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware
>>>>>> uses and this allows having emulated and VFIO devices on the same bus.
>>>>>>
>>>>>> This adds 4 RTAS handlers:
>>>>>> * ibm,query-pe-dma-window
>>>>>> * ibm,create-pe-dma-window
>>>>>> * ibm,remove-pe-dma-window
>>>>>> * ibm,reset-pe-dma-window
>>>>>> These are registered from type_init() callback.
>>>>>>
>>>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>>>> spapr_iommu.c with PCI.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> hw/ppc/Makefile.objs | 1 +
>>>>>> hw/ppc/spapr.c | 7 +-
>>>>>> hw/ppc/spapr_pci.c | 73 ++++++++---
>>>>>> hw/ppc/spapr_rtas_ddw.c | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>> hw/vfio/common.c | 5 -
>>>>>> include/hw/pci-host/spapr.h | 13 ++
>>>>>> include/hw/ppc/spapr.h | 16 ++-
>>>>>> trace-events | 4 +
>>>>>> 8 files changed, 395 insertions(+), 24 deletions(-)
>>>>>> create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>>>
>>>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>>>> index c1ffc77..986b36f 100644
>>>>>> --- a/hw/ppc/Makefile.objs
>>>>>> +++ b/hw/ppc/Makefile.objs
>>>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>>> ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>>> obj-y += spapr_pci_vfio.o
>>>>>> endif
>>>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>>> # PowerPC 4xx boards
>>>>>> obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>>> obj-y += ppc4xx_pci.o
>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>> index d0bb423..ef4c637 100644
>>>>>> --- a/hw/ppc/spapr.c
>>>>>> +++ b/hw/ppc/spapr.c
>>>>>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>>> * pseries-2.5
>>>>>> */
>>>>>> #define SPAPR_COMPAT_2_5 \
>>>>>> - HW_COMPAT_2_5
>>>>>> + HW_COMPAT_2_5 \
>>>>>> + {\
>>>>>> + .driver = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>>>> + .property = "ddw",\
>>>>>> + .value = stringify(off),\
>>>>>> + },
>>>>>>
>>>>>> static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>>>> {
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index af99a36..3bb294a 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>>>>> return buf;
>>>>>> }
>>>>>>
>>>>>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> - uint32_t liobn,
>>>>>> - uint32_t page_shift,
>>>>>> - uint64_t window_addr,
>>>>>> - uint64_t window_size,
>>>>>> - Error **errp)
>>>>>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> + uint32_t liobn,
>>>>>> + uint32_t page_shift,
>>>>>> + uint64_t window_addr,
>>>>>> + uint64_t window_size,
>>>>>> + Error **errp)
>>>>>> {
>>>>>> sPAPRTCETable *tcet;
>>>>>> uint32_t nb_table = window_size >> page_shift;
>>>>>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> return;
>>>>>> }
>>>>>>
>>>>>> + if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>>>>>> + error_setg(errp,
>>>>>> + "Attempt to use second window when DDW is disabled on PHB");
>>>>>> + return;
>>>>>> + }
>>>>>
>>>>> This should never happen unless something is wrong with the tests in
>>>>> the RTAS functions, yes? In which case it should probably be an
>>>>> assert().
>>>>
>>>> This should not. But this is called from the RTAS caller so I'd really like
>>>> to have a message rather than assert() if that condition happens, here or in
>>>> rtas_ibm_create_pe_dma_window().
>>>
>>> It should only be called from RTAS if ddw is enabled though, yes?
>>
>>
>> From RTAS and from the PHB reset handler. Well. I will get rid of
>> spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite
>> useless when I look at them now.
>
> Ok.
>
>>>>>> spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>>>>> }
>>>>>>
>>>>>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>> {
>>>>>> sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>>
>>>>>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>> }
>>>>>>
>>>>>> /* DMA setup */
>>>>>> - tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>>>>>> - if (!tcet) {
>>>>>> - error_report("No default TCE table for %s", sphb->dtbusname);
>>>>>> - return;
>>>>>> - }
>>>>>>
>>>>>> - memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> - spapr_tce_get_iommu(tcet), 0);
>>>>>> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> + tcet = spapr_tce_new_table(DEVICE(sphb),
>>>>>> + SPAPR_PCI_LIOBN(sphb->index, i));
>>>>>> + if (!tcet) {
>>>>>> + error_setg(errp, "Creating window#%d failed for %s",
>>>>>> + i, sphb->dtbusname);
>>>>>> + return;
>>>>>> + }
>>>>>> + memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> + spapr_tce_get_iommu(tcet), 0);
>>>>>> + }
>>>>>>
>>>>>> sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>>>>> }
>>>>>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>>>>>
>>>>>> void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>>> {
>>>>>> - sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>>>>> Error *local_err = NULL;
>>>>>> + int i;
>>>>>>
>>>>>> - if (tcet && tcet->enabled) {
>>>>>> - spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>>>>>> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> + uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>>>>>> + sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>> +
>>>>>> + if (tcet && tcet->enabled) {
>>>>>> + spapr_phb_dma_window_disable(sphb, liobn);
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> /* Register default 32bit DMA window */
>>>>>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>>>>> /* Default DMA window is 0..1GB */
>>>>>> DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>>> DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>>>> + DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>>>>>> + 0x800000000000000ULL),
>>>>>> + DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>>>> + DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>>>>>> + SPAPR_PCI_DMA_MAX_WINDOWS),
>>>>>
>>>>> What will happen if the user tries to set 'windows' larger than
>>>>> SPAPR_PCI_DMA_MAX_WINDOWS?
>>>>
>>>>
>>>> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
>>>> everywhere, missed that. Besides that, there will be support for more
>>>> windows, that's it. The host VFIO IOMMU driver will fail creating more
>>>> windows but this is expected. For emulated windows, there will be more
>>>> windows with no other consequences.
>>>
>>> Hmm.. is there actually a reason to have the windows property? Would
>>> you be better off just using the compile time constant for now.
>>
>>
>> I am afraid it is going to be 2 DMA windows forever as the other DMA tlb-ish
>> facility coming does not use windows at all :)
>
> That sounds like a reason not to have the property and leave it as a
> compile time constant.
>
>>>>>> + DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>>>> + (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>> };
>>>>>>
>>>>>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>> uint32_t interrupt_map_mask[] = {
>>>>>> cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>>> uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>>>> + uint32_t ddw_applicable[] = {
>>>>>> + cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>>>> + cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>>>> + cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>>>> + };
>>>>>> + uint32_t ddw_extensions[] = {
>>>>>> + cpu_to_be32(1),
>>>>>> + cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>>>> + };
>>>>>> sPAPRTCETable *tcet;
>>>>>> PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>>> sPAPRFDT s_fdt;
>>>>>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>> _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>>> _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>>>
>>>>>> + /* Dynamic DMA window */
>>>>>> + if (phb->ddw_enabled) {
>>>>>> + _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>>>> + sizeof(ddw_applicable)));
>>>>>> + _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>>>> + &ddw_extensions, sizeof(ddw_extensions)));
>>>>>> + }
>>>>>> +
>>>>>> /* Build the interrupt-map, this must matches what is done
>>>>>> * in pci_spapr_map_irq
>>>>>> */
>>>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>>>> new file mode 100644
>>>>>> index 0000000..37f805f
>>>>>> --- /dev/null
>>>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>>>> @@ -0,0 +1,300 @@
>>>>>> +/*
>>>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>>>> + *
>>>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>>>> + *
>>>>>> + * This program is free software; you can redistribute it and/or modify
>>>>>> + * it under the terms of the GNU General Public License as published by
>>>>>> + * the Free Software Foundation; either version 2 of the License,
>>>>>> + * or (at your option) any later version.
>>>>>> + *
>>>>>> + * This program is distributed in the hope that it will be useful,
>>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>>>>> + * GNU General Public License for more details.
>>>>>> + *
>>>>>> + * You should have received a copy of the GNU General Public License
>>>>>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qemu/error-report.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>> +#include "hw/pci-host/spapr.h"
>>>>>> +#include "trace.h"
>>>>>> +
>>>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> + sPAPRTCETable *tcet;
>>>>>> +
>>>>>> + tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> + if (tcet && tcet->enabled) {
>>>>>> + ++*(unsigned *)opaque;
>>>>>> + }
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> + unsigned ret = 0;
>>>>>> +
>>>>>> + object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>>>> +
>>>>>> + return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> + sPAPRTCETable *tcet;
>>>>>> +
>>>>>> + tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> + if (tcet && !tcet->enabled) {
>>>>>> + *(uint32_t *)opaque = tcet->liobn;
>>>>>> + return 1;
>>>>>> + }
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> + uint32_t liobn = 0;
>>>>>> +
>>>>>> + object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>>>> +
>>>>>> + return liobn;
>>>>>> +}
>>>>>> +
>>>>>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>>>>>> + uint64_t page_mask)
>>>>>> +{
>>>>>> + int i, j;
>>>>>> + uint32_t mask = 0;
>>>>>> + const struct { int shift; uint32_t mask; } masks[] = {
>>>>>> + { 12, RTAS_DDW_PGSIZE_4K },
>>>>>> + { 16, RTAS_DDW_PGSIZE_64K },
>>>>>> + { 24, RTAS_DDW_PGSIZE_16M },
>>>>>> + { 25, RTAS_DDW_PGSIZE_32M },
>>>>>> + { 26, RTAS_DDW_PGSIZE_64M },
>>>>>> + { 27, RTAS_DDW_PGSIZE_128M },
>>>>>> + { 28, RTAS_DDW_PGSIZE_256M },
>>>>>> + { 34, RTAS_DDW_PGSIZE_16G },
>>>>>> + };
>>>>>> +
>>>>>> + for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>>>>>> + for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>>>>>> + if ((sps[i].page_shift == masks[j].shift) &&
>>>>>> + (page_mask & (1ULL << masks[j].shift))) {
>>>>>> + mask |= masks[j].mask;
>>>>>> + }
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> + return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>>>> + sPAPRMachineState *spapr,
>>>>>> + uint32_t token, uint32_t nargs,
>>>>>> + target_ulong args,
>>>>>> + uint32_t nret, target_ulong rets)
>>>>>> +{
>>>>>> + CPUPPCState *env = &cpu->env;
>>>>>> + sPAPRPHBState *sphb;
>>>>>> + uint64_t buid, max_window_size;
>>>>>> + uint32_t avail, addr, pgmask = 0;
>>>>>> +
>>>>>> + if ((nargs != 3) || (nret != 5)) {
>>>>>> + goto param_error_exit;
>>>>>> + }
>>>>>> +
>>>>>> + buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>>>> + addr = rtas_ld(args, 0);
>>>>>> + sphb = spapr_pci_find_phb(spapr, buid);
>>>>>> + if (!sphb || !sphb->ddw_enabled) {
>>>>>> + goto param_error_exit;
>>>>>> + }
>>>>>> +
>>>>>> + /* Work out supported page masks */
>>>>>> + pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>>>>>
>>>>> There are a few potential problems here. First you're just
>>>>> arbitrarily picking the first entry in the sps array to filter
>>>>
>>>> Why first? spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
>>>
>>> env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
>>> sizes, then again for actual page sizes. You're only examing the
>>> first "row" of that table. It kinda works because the 4k base page
>>> size is first, which lists all the actual page size options.
>>
>> Ah. Right. So I need to walk through all of them, ok.
>
> Well.. that would be closer to right, but even then, I don't think
> that's really right. What you want to advertise here is all guest
> capabilities that are possible on the host.
>
> So, the first question is what's the actual size of the chunks that
> the host will need to map. Let's call that the "effective IOMMU page
> size". That's going to be whichever is smaller of the guest IOMMU
> page size, and the (host) RAM page size.
>
> e.g. For a guest using 4k IOMMU mappings on a host with 64k page size,
> the effective page size is 4kiB.
>
> For a guest using 16MiB IOMMU mappings on a host with 64kiB page size,
> the effective page size is 64kiB (because a 16MiB guest "page" could
> be broken up into different real pages on the host).
>
> The next question is what can you actually map in the host IOMMU. It
> should be able to handle any effective page size that's equal to, or
> smaller than the smallest page size supported on the host. That might
> involve the host inserting several host TCEs for a single effective
> page, but the VFIO DMA_MAP interface should already cover that because
> it takes a length parameter.
>
>>>>> against, which doesn't seem right (except by accident). It's a little
>>>>> bit odd filtering against guest page sizes at all, although I get what
>>>>> you're really trying to do is filter against allowed host page sizes.
>>>>>
>>>>> The other problem is that silently filtering capabilities based on the
>>>>> host can be a pain for migration - I've made the mistake and had it
>>>>> bite me in the past. I think it would be safer to just check the
>>>>> pagesizes requested in the property against what's possible and
>>>>> outright fail if they don't match. For convenience you could also set
>>>>> according to host capabilities if the user doesn't specify anything,
>>>>> but that would require explicit handling of the "default" case.
>>
>>
>> What are the host capabilities here?
>>
>> There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many
>> other sizes, this is supported always by IODA2.
>> And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K and
>> 16M (with -mempath).
>>
>> And there is a "ddw-query" RTAS call which tells the guest if it can use 16M
>> or not. How do you suggest I advertise 16M to the guest? If I always
>> advertise 16M and there is no -mempath, the guest won't try smaller page
>> size.
>
> So, here's what I think you need to do:
>
> 1. Start with the full list of IOMMU page sizes the virtual (guest)
> hardware supports (so, 4KiB, 64KiB & 16MiB, possibly modified by
> properties)
> 2. For each entry in the list work out what the effective page size
> will be, by clamping to the RAM page size.
> 3. Test if each effective page size is possible on the host (i.e. is
> <= at least one of the pagesizes advertised by the host IOMMU
> info ioctl).
>
> For a first cut it's probably reasonable to only check for == (not <=)
> the supported host pagesizes. Ideally the host (inside the kernel)
> would automatically create duplicate host TCEs if effective page size
> < host IOMMU page size. Obviously it would still need to use the
> supplied guest page size for the guest view of the table.
>
>
>> So - if the user wants 16M IOMMU pages, he has to use -mempath and in
>> addition to that explicitely say -global spapr-pci-host-bridge.pgsz=16M, and
>> by default enable only 4K and 64K (or just 4K?)? I am fine with this, it
>> just means more work for libvirt folks.
>
> I think that's the safest option to start with. Require that the user
> explicitly list what pagesizes the guest IOMMU will support, and just
> fail if the host can't do that.
>
> We could potentially auto-filter (as we already do for CPU/MMU
> pagesizes) once we've thought a bit more carefully about the possible
> migration implications.
For now I enable 4K and 64K for a guest and then rely on the kernel's
ability to create what the guest requested.
In v15, I'll add 3 patches to auto-add 16MB to the allowed mask list, may
be it won't look too ugly.
--
Alexey
next prev parent reply other threads:[~2016-03-29 6:23 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-03-21 7:46 [Qemu-devel] [PATCH qemu v14 00/18] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address Alexey Kardashevskiy
2016-03-22 0:49 ` David Gibson
2016-03-22 3:12 ` Alexey Kardashevskiy
2016-03-22 3:26 ` David Gibson
2016-03-22 4:28 ` Alexey Kardashevskiy
2016-03-22 4:59 ` David Gibson
2016-03-22 7:19 ` Alexey Kardashevskiy
2016-03-22 23:07 ` David Gibson
2016-03-23 10:58 ` Paolo Bonzini
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 02/18] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper Alexey Kardashevskiy
2016-03-22 1:02 ` David Gibson
2016-03-22 3:17 ` Alexey Kardashevskiy
2016-03-22 3:28 ` David Gibson
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 04/18] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2016-03-22 1:11 ` David Gibson
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio Alexey Kardashevskiy
2016-03-22 1:18 ` David Gibson
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration Alexey Kardashevskiy
2016-03-22 1:23 ` David Gibson
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state Alexey Kardashevskiy
2016-03-22 1:31 ` David Gibson
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 09/18] spapr_iommu: Add root memory region Alexey Kardashevskiy
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 10/18] spapr_pci: Reset DMA config on PHB reset Alexey Kardashevskiy
2016-03-21 7:46 ` [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes Alexey Kardashevskiy
2016-03-22 3:02 ` David Gibson
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space Alexey Kardashevskiy
2016-03-22 3:05 ` David Gibson
2016-03-22 15:47 ` Alex Williamson
2016-03-23 0:43 ` David Gibson
2016-03-23 0:44 ` Alexey Kardashevskiy
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2016-03-22 4:04 ` David Gibson
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 14/18] spapr_pci: Add and export DMA resetting helper Alexey Kardashevskiy
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities Alexey Kardashevskiy
2016-03-22 4:20 ` David Gibson
2016-03-22 6:47 ` Alexey Kardashevskiy
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO Alexey Kardashevskiy
2016-03-22 4:45 ` David Gibson
2016-03-22 6:24 ` Alexey Kardashevskiy
2016-03-22 10:22 ` David Gibson
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU Alexey Kardashevskiy
2016-03-22 5:14 ` David Gibson
2016-03-22 5:54 ` Alexey Kardashevskiy
2016-03-23 1:08 ` David Gibson
2016-03-23 2:12 ` Alexey Kardashevskiy
2016-03-23 2:53 ` David Gibson
2016-03-23 3:06 ` Alexey Kardashevskiy
2016-03-23 6:03 ` David Gibson
2016-03-24 0:03 ` Alexey Kardashevskiy
2016-03-24 9:10 ` Alexey Kardashevskiy
2016-03-29 5:30 ` David Gibson
2016-03-29 5:44 ` Alexey Kardashevskiy
2016-03-29 6:44 ` David Gibson
2016-03-21 7:47 ` [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2016-03-23 2:13 ` David Gibson
2016-03-23 3:28 ` Alexey Kardashevskiy
2016-03-23 6:11 ` David Gibson
2016-03-24 2:32 ` Alexey Kardashevskiy
2016-03-29 5:22 ` David Gibson
2016-03-29 6:23 ` Alexey Kardashevskiy [this message]
2016-03-31 3:19 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=56FA1F4F.5080903@ozlabs.ru \
--to=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).