From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)
Date: Tue, 29 Mar 2016 17:23:11 +1100
Message-ID: <56FA1F4F.5080903@ozlabs.ru>
In-Reply-To: <20160329052236.GF15156@voom.fritz.box>

On 03/29/2016 04:22 PM, David Gibson wrote:
> On Thu, Mar 24, 2016 at 01:32:48PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 05:11 PM, David Gibson wrote:
>>> On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/23/2016 01:13 PM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
>>>>>> This adds support for the Dynamic DMA Windows (DDW) option defined by
>>>>>> the SPAPR specification, which allows the guest to have additional DMA window(s).
>>>>>>
>>>>>> This implements DDW for emulated and VFIO devices.
>>>>>> This reserves RTAS token numbers for DDW calls.
>>>>>>
>>>>>> This changes the TCE table migration descriptor to support dynamic
>>>>>> tables: from now on, the PHB creates as many stub TCE table objects
>>>>>> as it can possibly support, but not all of them may be initialized at
>>>>>> the time of migration because DDW may or may not have been requested by
>>>>>> the guest.
>>>>>>
>>>>>> The "ddw" property is enabled by default on a PHB but for compatibility
>>>>>> the pseries-2.5 machine and older disable it.
>>>>>>
>>>>>> This implements DDW for VFIO; host kernel support is required.
>>>>>> This adds a "levels" property to the PHB to control the number of levels
>>>>>> in the actual TCE table allocated by the host kernel; the default value
>>>>>> of 0 tells QEMU to calculate the correct value. Current hardware
>>>>>> supports up to 5 levels.
>>>>>>
>>>>>> Existing Linux guests try to create one additional huge DMA window
>>>>>> with 64K or 16MB pages and map the entire guest RAM into it. If that
>>>>>> succeeds, the guest switches to dma_direct_ops and never calls TCE
>>>>>> hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the
>>>>>> entire RAM and not waste time on map/unmap later. This adds a
>>>>>> "dma64_win_addr" property which is the bus address of the 64bit window;
>>>>>> by default it is set to 0x800.0000.0000.0000 as this is what modern
>>>>>> POWER8 hardware uses, and it allows having emulated and VFIO devices on
>>>>>> the same bus.
>>>>>>
>>>>>> This adds 4 RTAS handlers:
>>>>>> * ibm,query-pe-dma-window
>>>>>> * ibm,create-pe-dma-window
>>>>>> * ibm,remove-pe-dma-window
>>>>>> * ibm,reset-pe-dma-window
>>>>>> These are registered from type_init() callback.
>>>>>>
>>>>>> These RTAS handlers are implemented in a separate file to avoid polluting
>>>>>> spapr_iommu.c with PCI.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>   hw/ppc/Makefile.objs        |   1 +
>>>>>>   hw/ppc/spapr.c              |   7 +-
>>>>>>   hw/ppc/spapr_pci.c          |  73 ++++++++---
>>>>>>   hw/ppc/spapr_rtas_ddw.c     | 300 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>   hw/vfio/common.c            |   5 -
>>>>>>   include/hw/pci-host/spapr.h |  13 ++
>>>>>>   include/hw/ppc/spapr.h      |  16 ++-
>>>>>>   trace-events                |   4 +
>>>>>>   8 files changed, 395 insertions(+), 24 deletions(-)
>>>>>>   create mode 100644 hw/ppc/spapr_rtas_ddw.c
>>>>>>
>>>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>>>> index c1ffc77..986b36f 100644
>>>>>> --- a/hw/ppc/Makefile.objs
>>>>>> +++ b/hw/ppc/Makefile.objs
>>>>>> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o spapr_rng.o
>>>>>>   ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>>>>>>   obj-y += spapr_pci_vfio.o
>>>>>>   endif
>>>>>> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>>>>>>   # PowerPC 4xx boards
>>>>>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>>>>>   obj-y += ppc4xx_pci.o
>>>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>>>> index d0bb423..ef4c637 100644
>>>>>> --- a/hw/ppc/spapr.c
>>>>>> +++ b/hw/ppc/spapr.c
>>>>>> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>>>>>>    * pseries-2.5
>>>>>>    */
>>>>>>   #define SPAPR_COMPAT_2_5 \
>>>>>> -        HW_COMPAT_2_5
>>>>>> +        HW_COMPAT_2_5 \
>>>>>> +        {\
>>>>>> +            .driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
>>>>>> +            .property = "ddw",\
>>>>>> +            .value    = stringify(off),\
>>>>>> +        },
>>>>>>
>>>>>>   static void spapr_machine_2_5_instance_options(MachineState *machine)
>>>>>>   {
>>>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>>>> index af99a36..3bb294a 100644
>>>>>> --- a/hw/ppc/spapr_pci.c
>>>>>> +++ b/hw/ppc/spapr_pci.c
>>>>>> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, PCIDevice *pdev)
>>>>>>       return buf;
>>>>>>   }
>>>>>>
>>>>>> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> -                                       uint32_t liobn,
>>>>>> -                                       uint32_t page_shift,
>>>>>> -                                       uint64_t window_addr,
>>>>>> -                                       uint64_t window_size,
>>>>>> -                                       Error **errp)
>>>>>> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>> +                                 uint32_t liobn,
>>>>>> +                                 uint32_t page_shift,
>>>>>> +                                 uint64_t window_addr,
>>>>>> +                                 uint64_t window_size,
>>>>>> +                                 Error **errp)
>>>>>>   {
>>>>>>       sPAPRTCETable *tcet;
>>>>>>       uint32_t nb_table = window_size >> page_shift;
>>>>>> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
>>>>>>           return;
>>>>>>       }
>>>>>>
>>>>>> +    if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
>>>>>> +        error_setg(errp,
>>>>>> +                   "Attempt to use second window when DDW is disabled on PHB");
>>>>>> +        return;
>>>>>> +    }
>>>>>
>>>>> This should never happen unless something is wrong with the tests in
>>>>> the RTAS functions, yes?  In which case it should probably be an
>>>>> assert().
>>>>
>>>> This should not. But this is called from the RTAS caller so I'd really like
>>>> to have a message rather than assert() if that condition happens, here or in
>>>> rtas_ibm_create_pe_dma_window().
>>>
>>> It should only be called from RTAS if ddw is enabled though, yes?
>>
>>
>> From RTAS and from the PHB reset handler. Well. I will get rid of
>> spapr_phb_dma_window_enable/spapr_phb_dma_window_disable, they are quite
>> useless when I look at them now.
>
> Ok.
>
>>>>>>       spapr_tce_table_enable(tcet, page_shift, window_addr, nb_table);
>>>>>>   }
>>>>>>
>>>>>> -static int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>> +int spapr_phb_dma_window_disable(sPAPRPHBState *sphb, uint32_t liobn)
>>>>>>   {
>>>>>>       sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>>
>>>>>> @@ -1492,14 +1498,18 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>>>>>>       }
>>>>>>
>>>>>>       /* DMA setup */
>>>>>> -    tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn);
>>>>>> -    if (!tcet) {
>>>>>> -        error_report("No default TCE table for %s", sphb->dtbusname);
>>>>>> -        return;
>>>>>> -    }
>>>>>>
>>>>>> -    memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> -                                        spapr_tce_get_iommu(tcet), 0);
>>>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> +        tcet = spapr_tce_new_table(DEVICE(sphb),
>>>>>> +                                   SPAPR_PCI_LIOBN(sphb->index, i));
>>>>>> +        if (!tcet) {
>>>>>> +            error_setg(errp, "Creating window#%d failed for %s",
>>>>>> +                       i, sphb->dtbusname);
>>>>>> +            return;
>>>>>> +        }
>>>>>> +        memory_region_add_subregion_overlap(&sphb->iommu_root, 0,
>>>>>> +                                            spapr_tce_get_iommu(tcet), 0);
>>>>>> +    }
>>>>>>
>>>>>>       sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, g_free);
>>>>>>   }
>>>>>> @@ -1517,11 +1527,16 @@ static int spapr_phb_children_reset(Object *child, void *opaque)
>>>>>>
>>>>>>   void spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>>>>   {
>>>>>> -    sPAPRTCETable *tcet = spapr_tce_find_by_liobn(sphb->dma_liobn);
>>>>>>       Error *local_err = NULL;
>>>>>> +    int i;
>>>>>>
>>>>>> -    if (tcet && tcet->enabled) {
>>>>>> -        spapr_phb_dma_window_disable(sphb, sphb->dma_liobn);
>>>>>> +    for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>>>> +        uint32_t liobn = SPAPR_PCI_LIOBN(sphb->index, i);
>>>>>> +        sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
>>>>>> +
>>>>>> +        if (tcet && tcet->enabled) {
>>>>>> +            spapr_phb_dma_window_disable(sphb, liobn);
>>>>>> +        }
>>>>>>       }
>>>>>>
>>>>>>       /* Register default 32bit DMA window */
>>>>>> @@ -1562,6 +1577,13 @@ static Property spapr_phb_properties[] = {
>>>>>>       /* Default DMA window is 0..1GB */
>>>>>>       DEFINE_PROP_UINT64("dma_win_addr", sPAPRPHBState, dma_win_addr, 0),
>>>>>>       DEFINE_PROP_UINT64("dma_win_size", sPAPRPHBState, dma_win_size, 0x40000000),
>>>>>> +    DEFINE_PROP_UINT64("dma64_win_addr", sPAPRPHBState, dma64_window_addr,
>>>>>> +                       0x800000000000000ULL),
>>>>>> +    DEFINE_PROP_BOOL("ddw", sPAPRPHBState, ddw_enabled, true),
>>>>>> +    DEFINE_PROP_UINT32("windows", sPAPRPHBState, windows_supported,
>>>>>> +                       SPAPR_PCI_DMA_MAX_WINDOWS),
>>>>>
>>>>> What will happen if the user tries to set 'windows' larger than
>>>>> SPAPR_PCI_DMA_MAX_WINDOWS?
>>>>
>>>>
>>>> Oh. I need to replace SPAPR_PCI_DMA_MAX_WINDOWS with windows_supported
>>>> everywhere, missed that. Besides that, there will simply be support for
>>>> more windows. The host VFIO IOMMU driver will fail to create the extra
>>>> windows, but this is expected. For emulated devices, there will be more
>>>> windows with no other consequences.
>>>
>>> Hmm.. is there actually a reason to have the windows property?  Would
>>> you be better off just using the compile time constant for now.
>>
>>
>> I am afraid it is going to be 2 DMA windows forever, as the other, TLB-ish
>> DMA facility that is coming does not use windows at all :)
>
> That sounds like a reason not to have the property and leave it as a
> compile time constant.
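
(As an aside, and purely as an illustration: if the "windows" property did
stay, spapr_phb_realize() would want a bound check roughly like the sketch
below. Rejecting out-of-range values this way is an assumption; the posted
patch does not contain it.)

    /* Hypothetical guard, not part of this series: keep "windows" within
     * the compile-time maximum the PHB code is written for. */
    if (sphb->windows_supported < 1 ||
        sphb->windows_supported > SPAPR_PCI_DMA_MAX_WINDOWS) {
        error_setg(errp, "\"windows\" must be between 1 and %d",
                   SPAPR_PCI_DMA_MAX_WINDOWS);
        return;
    }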
>
>>>>>> +    DEFINE_PROP_UINT64("pgsz", sPAPRPHBState, page_size_mask,
>>>>>> +                       (1ULL << 12) | (1ULL << 16) | (1ULL << 24)),
>>>>>>       DEFINE_PROP_END_OF_LIST(),
>>>>>>   };
>>>>>>
>>>>>> @@ -1815,6 +1837,15 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>>       uint32_t interrupt_map_mask[] = {
>>>>>>           cpu_to_be32(b_ddddd(-1)|b_fff(0)), 0x0, 0x0, cpu_to_be32(-1)};
>>>>>>       uint32_t interrupt_map[PCI_SLOT_MAX * PCI_NUM_PINS][7];
>>>>>> +    uint32_t ddw_applicable[] = {
>>>>>> +        cpu_to_be32(RTAS_IBM_QUERY_PE_DMA_WINDOW),
>>>>>> +        cpu_to_be32(RTAS_IBM_CREATE_PE_DMA_WINDOW),
>>>>>> +        cpu_to_be32(RTAS_IBM_REMOVE_PE_DMA_WINDOW)
>>>>>> +    };
>>>>>> +    uint32_t ddw_extensions[] = {
>>>>>> +        cpu_to_be32(1),
>>>>>> +        cpu_to_be32(RTAS_IBM_RESET_PE_DMA_WINDOW)
>>>>>> +    };
>>>>>>       sPAPRTCETable *tcet;
>>>>>>       PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>>>>>>       sPAPRFDT s_fdt;
>>>>>> @@ -1839,6 +1870,14 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>>>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pci-config-space-type", 0x1));
>>>>>>       _FDT(fdt_setprop_cell(fdt, bus_off, "ibm,pe-total-#msi", XICS_IRQS));
>>>>>>
>>>>>> +    /* Dynamic DMA window */
>>>>>> +    if (phb->ddw_enabled) {
>>>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-applicable", &ddw_applicable,
>>>>>> +                         sizeof(ddw_applicable)));
>>>>>> +        _FDT(fdt_setprop(fdt, bus_off, "ibm,ddw-extensions",
>>>>>> +                         &ddw_extensions, sizeof(ddw_extensions)));
>>>>>> +    }
>>>>>> +
>>>>>>       /* Build the interrupt-map, this must matches what is done
>>>>>>        * in pci_spapr_map_irq
>>>>>>        */
>>>>>> diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
>>>>>> new file mode 100644
>>>>>> index 0000000..37f805f
>>>>>> --- /dev/null
>>>>>> +++ b/hw/ppc/spapr_rtas_ddw.c
>>>>>> @@ -0,0 +1,300 @@
>>>>>> +/*
>>>>>> + * QEMU sPAPR Dynamic DMA windows support
>>>>>> + *
>>>>>> + * Copyright (c) 2015 Alexey Kardashevskiy, IBM Corporation.
>>>>>> + *
>>>>>> + *  This program is free software; you can redistribute it and/or modify
>>>>>> + *  it under the terms of the GNU General Public License as published by
>>>>>> + *  the Free Software Foundation; either version 2 of the License,
>>>>>> + *  or (at your option) any later version.
>>>>>> + *
>>>>>> + *  This program is distributed in the hope that it will be useful,
>>>>>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>>> + *  GNU General Public License for more details.
>>>>>> + *
>>>>>> + *  You should have received a copy of the GNU General Public License
>>>>>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qemu/error-report.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>> +#include "hw/pci-host/spapr.h"
>>>>>> +#include "trace.h"
>>>>>> +
>>>>>> +static int spapr_phb_get_active_win_num_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> +    sPAPRTCETable *tcet;
>>>>>> +
>>>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> +    if (tcet && tcet->enabled) {
>>>>>> +        ++*(unsigned *)opaque;
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_active_win_num(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    unsigned ret = 0;
>>>>>> +
>>>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_active_win_num_cb, &ret);
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int spapr_phb_get_free_liobn_cb(Object *child, void *opaque)
>>>>>> +{
>>>>>> +    sPAPRTCETable *tcet;
>>>>>> +
>>>>>> +    tcet = (sPAPRTCETable *) object_dynamic_cast(child, TYPE_SPAPR_TCE_TABLE);
>>>>>> +    if (tcet && !tcet->enabled) {
>>>>>> +        *(uint32_t *)opaque = tcet->liobn;
>>>>>> +        return 1;
>>>>>> +    }
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned spapr_phb_get_free_liobn(sPAPRPHBState *sphb)
>>>>>> +{
>>>>>> +    uint32_t liobn = 0;
>>>>>> +
>>>>>> +    object_child_foreach(OBJECT(sphb), spapr_phb_get_free_liobn_cb, &liobn);
>>>>>> +
>>>>>> +    return liobn;
>>>>>> +}
>>>>>> +
>>>>>> +static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
>>>>>> +                                 uint64_t page_mask)
>>>>>> +{
>>>>>> +    int i, j;
>>>>>> +    uint32_t mask = 0;
>>>>>> +    const struct { int shift; uint32_t mask; } masks[] = {
>>>>>> +        { 12, RTAS_DDW_PGSIZE_4K },
>>>>>> +        { 16, RTAS_DDW_PGSIZE_64K },
>>>>>> +        { 24, RTAS_DDW_PGSIZE_16M },
>>>>>> +        { 25, RTAS_DDW_PGSIZE_32M },
>>>>>> +        { 26, RTAS_DDW_PGSIZE_64M },
>>>>>> +        { 27, RTAS_DDW_PGSIZE_128M },
>>>>>> +        { 28, RTAS_DDW_PGSIZE_256M },
>>>>>> +        { 34, RTAS_DDW_PGSIZE_16G },
>>>>>> +    };
>>>>>> +
>>>>>> +    for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
>>>>>> +        for (j = 0; j < ARRAY_SIZE(masks); ++j) {
>>>>>> +            if ((sps[i].page_shift == masks[j].shift) &&
>>>>>> +                    (page_mask & (1ULL << masks[j].shift))) {
>>>>>> +                mask |= masks[j].mask;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    return mask;
>>>>>> +}
>>>>>> +
>>>>>> +static void rtas_ibm_query_pe_dma_window(PowerPCCPU *cpu,
>>>>>> +                                         sPAPRMachineState *spapr,
>>>>>> +                                         uint32_t token, uint32_t nargs,
>>>>>> +                                         target_ulong args,
>>>>>> +                                         uint32_t nret, target_ulong rets)
>>>>>> +{
>>>>>> +    CPUPPCState *env = &cpu->env;
>>>>>> +    sPAPRPHBState *sphb;
>>>>>> +    uint64_t buid, max_window_size;
>>>>>> +    uint32_t avail, addr, pgmask = 0;
>>>>>> +
>>>>>> +    if ((nargs != 3) || (nret != 5)) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>>>>> +    addr = rtas_ld(args, 0);
>>>>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>>>>> +    if (!sphb || !sphb->ddw_enabled) {
>>>>>> +        goto param_error_exit;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Work out supported page masks */
>>>>>> +    pgmask = spapr_query_mask(env->sps.sps, sphb->page_size_mask);
>>>>>
>>>>> There are a few potential problems here.  First you're just
>>>>> arbitrarily picking the first entry in the sps array to filter
>>>>
>>>> Why first?  spapr_query_mask() has a loop 0..PPC_PAGE_SIZES_MAX_SZ.
>>>
>>> env->sps is a nested array, 0..PPC_PAGE_SIZES_MAX_SZ-1 for base page
>>> sizes, then again for actual page sizes.  You're only examining the
>>> first "row" of that table.  It kinda works because the 4k base page
>>> size is first, which lists all the actual page size options.
>>
>> Ah. Right. So I need to walk through all of them, ok.
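
(Illustration only: one way "walk through all of them" could look, assuming
the ppc_one_seg_page_size layout where each base page size carries an enc[]
sub-array of actual page sizes; this sketch is not part of the series.)

    static uint32_t spapr_query_mask(struct ppc_one_seg_page_size *sps,
                                     uint64_t page_mask)
    {
        int i, j, k;
        uint32_t mask = 0;
        const struct { int shift; uint32_t mask; } masks[] = {
            { 12, RTAS_DDW_PGSIZE_4K },
            { 16, RTAS_DDW_PGSIZE_64K },
            { 24, RTAS_DDW_PGSIZE_16M },
            { 25, RTAS_DDW_PGSIZE_32M },
            { 26, RTAS_DDW_PGSIZE_64M },
            { 27, RTAS_DDW_PGSIZE_128M },
            { 28, RTAS_DDW_PGSIZE_256M },
            { 34, RTAS_DDW_PGSIZE_16G },
        };

        for (i = 0; i < PPC_PAGE_SIZES_MAX_SZ; i++) {
            /* Check the actual page sizes nested under every base page
             * size, not just the base page_shift of each row. */
            for (j = 0; j < PPC_PAGE_SIZES_MAX_SZ; j++) {
                for (k = 0; k < ARRAY_SIZE(masks); ++k) {
                    if ((sps[i].enc[j].page_shift == masks[k].shift) &&
                        (page_mask & (1ULL << masks[k].shift))) {
                        mask |= masks[k].mask;
                    }
                }
            }
        }

        return mask;
    }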
>
> Well.. that would be closer to right, but even then, I don't think
> that's really right.  What you want to advertise here is all guest
> capabilities that are possible on the host.
>
> So, the first question is what's the actual size of the chunks that
> the host will need to map.  Let's call that the "effective IOMMU page
> size".  That's going to be whichever is smaller of the guest IOMMU
> page size, and the (host) RAM page size.
>
> e.g. For a guest using 4k IOMMU mappings on a host with 64k page size,
> the effective page size is 4kiB.
>
> For a guest using 16MiB IOMMU mappings on a host with 64kiB page size,
> the effective page size is 64kiB (because a 16MiB guest "page" could
> be broken up into different real pages on the host).
>
> The next question is what can you actually map in the host IOMMU.  It
> should be able to handle any effective page size that's equal to, or
> smaller than the smallest page size supported on the host.  That might
> involve the host inserting several host TCEs for a single effective
> page, but the VFIO DMA_MAP interface should already cover that because
> it takes a length parameter.
>
>>>>> against, which doesn't seem right (except by accident).  It's a little
>>>>> bit odd filtering against guest page sizes at all, although I get what
>>>>> you're really trying to do is filter against allowed host page sizes.
>>>>>
>>>>> The other problem is that silently filtering capabilities based on the
>>>>> host can be a pain for migration - I've made the mistake and had it
>>>>> bite me in the past.  I think it would be safer to just check the
>>>>> pagesizes requested in the property against what's possible and
>>>>> outright fail if they don't match.  For convenience you could also set
>>>>> according to host capabilities if the user doesn't specify anything,
>>>>> but that would require explicit handling of the "default" case.
>>
>>
>> What are the host capabilities here?
>>
>> There is a page mask from the host IOMMU/PE which is 4K|64K|16M and many
>> other sizes; this is always supported by IODA2.
>> And there is PAGE_SIZE and huge pages (but only with -mempath) - so, 64K and
>> 16M (with -mempath).
>>
>> And there is a "ddw-query" RTAS call which tells the guest if it can use 16M
>> or not. How do you suggest I advertise 16M to the guest? If I always
>> advertise 16M and there is no -mempath, the guest won't try a smaller page
>> size.
>
> So, here's what I think you need to do:
>
>    1. Start with the full list of IOMMU page sizes the virtual (guest)
>       hardware supports (so, 4KiB, 64KiB & 16MiB, possibly modified by
>       properties)
>    2. For each entry in the list work out what the effective page size
>       will be, by clamping to the RAM page size.
>    3. Test if each effective page size is possible on the host (i.e. is
>       <= at least one of the pagesizes advertised by the host IOMMU
>       info ioctl).
>
> For a first cut it's probably reasonable to only check for == (not <=)
> the supported host pagesizes.  Ideally the host (inside the kernel)
> would automatically create duplicate host TCEs if effective page size
> < host IOMMU page size.  Obviously it would still need to use the
> supplied guest page size for the guest view of the table.
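
(A minimal sketch of the three steps above. guest_pgsizes is the mask of
IOMMU page sizes the virtual PHB offers (e.g. the "pgsz" property),
ram_pgsize is the page size of the memory backing guest RAM, and
host_pgsizes is whatever the host IOMMU reports, e.g. via
VFIO_IOMMU_SPAPR_TCE_GET_INFO; all three names are placeholders, not code
from this series.)

    static uint64_t filter_pgsizes(uint64_t guest_pgsizes, uint64_t ram_pgsize,
                                   uint64_t host_pgsizes)
    {
        uint64_t allowed = 0;
        int shift;

        for (shift = 12; shift < 64; shift++) {
            uint64_t guest_sz = 1ULL << shift;
            uint64_t effective;

            if (!(guest_pgsizes & guest_sz)) {
                continue;                           /* 1. guest candidates */
            }
            effective = MIN(guest_sz, ram_pgsize);  /* 2. clamp to RAM page */
            if (host_pgsizes & effective) {         /* 3. "==" check, first cut */
                allowed |= guest_sz;
            }
        }
        return allowed;
    }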
>
>
>> So - if the user wants 16M IOMMU pages, he has to use -mempath and in
>> addition to that explicitly say -global spapr-pci-host-bridge.pgsz=16M, and
>> by default enable only 4K and 64K (or just 4K?)? I am fine with this, it
>> just means more work for libvirt folks.
>
> I think that's the safest option to start with.  Require that the user
> explicitly list what pagesizes the guest IOMMU will support, and just
> fail if the host can't do that.
>
> We could potentially auto-filter (as we already do for CPU/MMU
> pagesizes) once we've thought a bit more carefully about the possible
> migration implications.


For now I enable 4K and 64K for the guest and then rely on the kernel's
ability to create what the guest requested.

In v15, I'll add 3 patches to auto-add 16MB to the allowed mask list; maybe
it won't look too ugly.
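
(Just to sketch the idea, not necessarily what those patches will look like.
host_ram_page_shift() below is a hypothetical helper returning the page
shift of the memory backing guest RAM; it does not exist in QEMU.)

    /* Illustrative only: add 16MB to the advertised DDW page sizes when
     * guest RAM is backed by 16MB hugepages (e.g. via -mem-path). */
    static void spapr_phb_auto_add_16m(sPAPRPHBState *sphb)
    {
        if (host_ram_page_shift() >= 24) {   /* hypothetical helper */
            sphb->page_size_mask |= 1ULL << 24;
        }
    }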




-- 
Alexey
