From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:45341)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1aiZ8B-0000YP-Fd
	for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:48 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1aiZ86-0005me-M4
	for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:47 -0400
Received: from mail-pf0-x243.google.com ([2607:f8b0:400e:c00::243]:34232)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1aiZ86-0005mX-9W
	for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:42 -0400
Received: by mail-pf0-x243.google.com with SMTP id n5so529861pfn.1
	for <qemu-devel@nongnu.org>; Tue, 22 Mar 2016 20:06:42 -0700 (PDT)
References: <1458546426-26222-1-git-send-email-aik@ozlabs.ru>
	<1458546426-26222-18-git-send-email-aik@ozlabs.ru>
	<20160322051449.GG23586@voom.redhat.com> <56F0DDFF.7050308@ozlabs.ru>
	<20160323010844.GO23586@voom.redhat.com> <56F1FBAB.6090308@ozlabs.ru>
	<20160323025315.GS23586@voom.redhat.com>
From: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-ID: <56F2083C.9050200@ozlabs.ru>
Date: Wed, 23 Mar 2016 14:06:36 +1100
MIME-Version: 1.0
In-Reply-To: <20160323025315.GS23586@voom.redhat.com>
Content-Type: text/plain; charset=koi8-r; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use
	VFIO_SPAPR_TCE_v2_IOMMU
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Alex Williamson <alex.williamson@redhat.com>, qemu-ppc@nongnu.org, qemu-devel@nongnu.org

On 03/23/2016 01:53 PM, David Gibson wrote:
> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
>>>>>> This adds ability to VFIO common code to dynamically allocate/remove
>>>>>> DMA windows in the host kernel when new VFIO container is added/removed.
>>>>>>
>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>>>>>> and adds just created IOMMU into the host IOMMU list; the opposite
>>>>>> action is taken in vfio_listener_region_del.
>>>>>>
>>>>>> When creating a new window, this uses euristic to decide on the TCE table
>>>>>> levels number.
>>>>>>
>>>>>> This should cause no guest visible change in behavior.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> Changes:
>>>>>> v14:
>>>>>> * new to the series
>>>>>>
>>>>>> ---
>>>>>> TODO:
>>>>>> * export levels to PHB
>>>>>> ---
>>>>>>   hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>   trace-events     |   2 ++
>>>>>>   2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>> index 4e873b7..421d6eb 100644
>>>>>> --- a/hw/vfio/common.c
>>>>>> +++ b/hw/vfio/common.c
>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>>>>>       return 0;
>>>>>>   }
>>>>>>
>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>>>>>> +{
>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>>>>>
>>>>> The hard-coded 0x1000 looks dubious..
>>>>
>>>> Well, that's the minimal page size...
>>>
>>> Really?  Some BookE CPUs support 1KiB page size..
>>
>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>
> Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> it's been done for CPU MMU I wouldn't count on it not being done for
> IOMMU.
>
> 1 is a safer choice.
>
>>
>>
>>>
>>>>>> +    g_assert(hiommu);
>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>> +}
>>>>>> +
>>>>>>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>>>   {
>>>>>>       return (!memory_region_is_ram(section->mr) &&
>>>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>       }
>>>>>>       end = int128_get64(llend);
>>>>>>
>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>
>>>>> I think this would be clearer split out into a helper function,
>>>>> vfio_create_host_window() or something.
>>>>
>>>>
>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>> separate file but this usually triggers more discussion and never ends well.
>>>>
>>>>
>>>>
>>>>>> +        unsigned entries, pages;
>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>>>> +
>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>
>>>>> I don't think you need these asserts.  AFAICT the same logic should
>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>> would map that block of RAM statically into the new window.
>>>>
>>>> In what configuration/machine can we do that on SPAPR?
>>>
>>> spapr guests won't ever do that.  But you can run an x86 guest on a
>>> powernv host and this situation could come up.
>>
>>
>> I am pretty sure VFIO won't work in this case anyway.
>
> I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.

This is not about TCG (pseries TCG guest works with VFIO on powernv host), 
this is about things like VFIO_IOMMU_GET_INFO vs. 
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.

Should I add such support in this patchset?


>
>>> In any case there's no point asserting if the code is correct anyway.
>>
>> Assert here says (at least) "not tested" or "not expected to
>> happen".
>
> Hmmm..
>
>>
>>
>>>
>>>>>> +        trace_vfio_listener_region_add_iommu(iova, end - 1);
>>>>>> +        /*
>>>>>> +         * FIXME: For VFIO iommu types which have KVM acceleration to
>>>>>> +         * avoid bouncing all map/unmaps through qemu this way, this
>>>>>> +         * would be the right place to wire that up (tell the KVM
>>>>>> +         * device emulation the VFIO iommu handles to use).
>>>>>> +         */
>>>>>> +        create.window_size = memory_region_size(section->mr);
>>>>>> +        create.page_shift =
>>>>>> +                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
>>>>>
>>>>> Ah.. except that I guess you'd need to fall back to host page size
>>>>> here to handle a RAM MR.
>>>>
>>>> Can you give an example of such RAM MR being added to PCI AS on
>>>> SPAPR?
>>>
>>> On spapr, no.  But you can run other machine types as guests (at least
>>> with TCG) on a host with the spapr IOMMU.
>>>
>>>>>> +        /*
>>>>>> +         * SPAPR host supports multilevel TCE tables, there is some
>>>>>> +         * euristic to decide how many levels we want for our table:
>>>>>> +         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
>>>>>> +         */
>>>>>> +        entries = create.window_size >> create.page_shift;
>>>>>> +        pages = (entries * sizeof(uint64_t)) / getpagesize();
>>>>>> +        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
>>>>>> +
>>>>>> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>> +        if (ret) {
>>>>>> +            error_report("Failed to create a window, ret = %d (%m)", ret);
>>>>>> +            goto fail;
>>>>>> +        }
>>>>>> +
>>>>>> +        if (create.start_addr != section->offset_within_address_space ||
>>>>>> +            vfio_host_iommu_lookup(container, create.start_addr,
>>>>>> +                                   create.start_addr + create.window_size - 1)) {
>>>>>
>>>>> Under what circumstances can this trigger?  Is the kernel ioctl
>>>>> allowed to return a different window start address than the one
>>>>> requested?
>>>>
>>>> You already asked this some time ago :) The userspace cannot request
>>>> address, the host kernel returns one.
>>>
>>> Ok.  For generality it would be nice if you could succeed here as long
>>> as the new host window covers the requested guest window, even if it
>>> doesn't match exactly.  And for that matter to not request the new
>>> window if the host already has a window covering the guest region.
>>
>>
>> That would be dead code - when would it possibly work? I mean I could
>> instrument an artificial test but the actual user which might appear later
>> will likely be soooo different so this won't help anyway.
>
> Hmm, I suppose.  It actually shouldn't be that hard to trigger a case
> like this, if you just bumped the bridge's dma64 base address property
> up a little bit - above the host kernel's base address, but small
> enough that you can still easily fit the guest memory in.


I can test it today for sure but once committed, we will have to support 
it. Which I am trying to avoid until we get clear picture what we are 
supporting here.


-- 
Alexey