From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>,
	Alexander Graf <agraf@suse.de>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH 09/10] spapr_pci_vfio: Enable DDW
Date: Fri, 15 Aug 2014 13:22:39 +1000
Message-ID: <53ED7CFF.10005@ozlabs.ru>
In-Reply-To: <20140815000916.GL7628@voom.redhat.com>

On 08/15/2014 10:09 AM, David Gibson wrote:
> On Thu, Aug 14, 2014 at 03:38:45PM +0200, Alexander Graf wrote:
>>
>> On 13.08.14 02:18, Alexey Kardashevskiy wrote:
>>> On 08/13/2014 01:28 AM, Alexander Graf wrote:
>>>> On 12.08.14 17:10, Alexey Kardashevskiy wrote:
>>>>> On 08/12/2014 07:37 PM, Alexander Graf wrote:
>>>>>> On 12.08.14 02:03, Alexey Kardashevskiy wrote:
>>>>>>> On 08/12/2014 03:30 AM, Alexander Graf wrote:
>>>>>>>> On 11.08.14 17:01, Alexey Kardashevskiy wrote:
>>>>>>>>> On 08/11/2014 10:02 PM, Alexander Graf wrote:
>>>>>>>>>> On 31.07.14 11:34, Alexey Kardashevskiy wrote:
>>>>>>>>>>> This implements DDW for VFIO. Host kernel support is required for
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>      hw/ppc/spapr_pci_vfio.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>      1 file changed, 75 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> index d3bddf2..dc443e2 100644
>>>>>>>>>>> --- a/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> +++ b/hw/ppc/spapr_pci_vfio.c
>>>>>>>>>>> @@ -69,6 +69,77 @@ static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
>>>>>>>>>>>          /* Register default 32bit DMA window */
>>>>>>>>>>>          memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>>>>>>>>>>>                                      spapr_tce_get_iommu(tcet));
>>>>>>>>>>> +
>>>>>>>>>>> +    sphb->ddw_supported = !!(info.flags & VFIO_IOMMU_SPAPR_TCE_FLAG_DDW);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_query(sPAPRPHBState *sphb,
>>>>>>>>>>> +                                    uint32_t *windows_available,
>>>>>>>>>>> +                                    uint32_t *page_size_mask)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_query query = { .argsz = sizeof(query) };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_QUERY, &query);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *windows_available = query.windows_available;
>>>>>>>>>>> +    *page_size_mask = query.page_size_mask;
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int spapr_pci_vfio_ddw_create(sPAPRPHBState *sphb, uint32_t page_shift,
>>>>>>>>>>> +                                     uint32_t window_shift, uint32_t liobn,
>>>>>>>>>>> +                                     sPAPRTCETable **ptcet)
>>>>>>>>>>> +{
>>>>>>>>>>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>>>>>>>>>>> +    struct vfio_iommu_spapr_tce_create create = {
>>>>>>>>>>> +        .argsz = sizeof(create),
>>>>>>>>>>> +        .page_shift = page_shift,
>>>>>>>>>>> +        .window_shift = window_shift,
>>>>>>>>>>> +        .start_addr = 0
>>>>>>>>>>> +    };
>>>>>>>>>>> +    int ret;
>>>>>>>>>>> +
>>>>>>>>>>> +    ret = vfio_container_ioctl(&sphb->iommu_as, svphb->iommugroupid,
>>>>>>>>>>> +                               VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>>>>>>>>>>> +    if (ret) {
>>>>>>>>>>> +        return ret;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    *ptcet = spapr_tce_new_table(DEVICE(sphb), liobn, create.start_addr,
>>>>>>>>>>> +                                 page_shift, 1 << (window_shift - page_shift),
>>>>>>>>>> I spot a 1 without ULL again - this time it might work out ok, but
>>>>>>>>>> please just always use ULL when you pass around addresses.
>>>>>>>>> My bad. I keep forgetting this, I'll adjust my own checkpatch.py :)
>>>>>>>>>
>>>>>>>>>
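
(For the archives: a tiny standalone example of why the ULL suffix matters for
shifts like window_shift - page_shift. The shift values below are made up for
illustration and are not taken from the patch.)

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t window_shift = 46;   /* hypothetical 64TB DMA window */
    uint32_t page_shift = 12;     /* 4K TCE pages */

    /* "1" is a 32bit int, so shifting it by 34 bits is undefined behaviour */
    uint64_t bad  = 1 << (window_shift - page_shift);
    /* "1ULL" keeps the shift in 64 bits and gives the intended value */
    uint64_t good = 1ULL << (window_shift - page_shift);

    printf("bad=0x%" PRIx64 " good=0x%" PRIx64 "\n", bad, good);
    return 0;
}
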
>>>>>>>>>> Please walk me through the abstraction levels on what honoring each
>>>>>>>>>> page size means. If I use THP, what page size granularity can I use
>>>>>>>>>> for TCE entries?
>>>>>>>>> [RFC PATCH 06/10] spapr_rtas: Add Dynamic DMA windows (DDW) RTAS calls
>>>>>>>>> support
>>>>>>>>>
>>>>>>>>> +        const struct { int shift; uint32_t mask; } masks[] = {
>>>>>>>>> +            { 12, DDW_PGSIZE_4K },
>>>>>>>>> +            { 16, DDW_PGSIZE_64K },
>>>>>>>>> +            { 24, DDW_PGSIZE_16M },
>>>>>>>>> +            { 25, DDW_PGSIZE_32M },
>>>>>>>>> +            { 26, DDW_PGSIZE_64M },
>>>>>>>>> +            { 27, DDW_PGSIZE_128M },
>>>>>>>>> +            { 28, DDW_PGSIZE_256M },
>>>>>>>>> +            { 34, DDW_PGSIZE_16G },
>>>>>>>>> +        };
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Supported page sizes are returned by the host kernel via "query". For
>>>>>>>>> 16MB pages, the returned page size mask will be
>>>>>>>>> DDW_PGSIZE_4K|DDW_PGSIZE_64K|DDW_PGSIZE_16M.
>>>>>>>>> Or I did not understand the question...
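
(Side note, to make the negotiation concrete: a minimal sketch - not the actual
patch code - of how the host mask returned by "query" could be intersected with
the guest-requested sizes to pick the largest usable TCE page shift. It assumes
QEMU's ARRAY_SIZE macro and the DDW_PGSIZE_* constants from this series.)

/* Sketch only: pick the biggest TCE page shift allowed by both the host
 * IOMMU ("query" result) and the guest request. Returns -1 if there is
 * no common page size. */
static int ddw_best_page_shift(uint32_t host_pgsizes, uint32_t guest_pgsizes)
{
    static const struct { int shift; uint32_t mask; } masks[] = {
        { 12, DDW_PGSIZE_4K },
        { 16, DDW_PGSIZE_64K },
        { 24, DDW_PGSIZE_16M },
    };
    int i, best = -1;

    for (i = 0; i < ARRAY_SIZE(masks); i++) {
        if ((host_pgsizes & masks[i].mask) && (guest_pgsizes & masks[i].mask)) {
            best = masks[i].shift;   /* masks[] is sorted, keep the largest */
        }
    }
    return best;
}
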
>>>>>>>> Why do we care about the sizes? Anything bigger than what we support
>>>>>>>> should
>>>>>>>> always work, no? What happens if the guest creates a 16MB map but my
>>>>>>>> pages
>>>>>>>> are 4kb mapped? Wouldn't the same logic be able to deal with 16G pages?
>>>>>>> It is DMA memory: if I split a "virtual" 16M page into a bunch of real 4K
>>>>>>> pages, I have to make sure these 16M are contiguous - there will be one
>>>>>>> TCE entry for it and no further translation besides the IOMMU. What am I
>>>>>>> missing?
>>>>>> Who does the shadow translation where? Does it exist at all?
>>>>> IOMMU? I am not sure I am following you... This IOMMU will look like direct
>>>>> DMA to the guest, but the real IOMMU table is sparse and is populated
>>>>> via a bunch of H_PUT_TCE calls, just like the default small window.
>>>>>
>>>>> There is a direct mapping in the host called "bypass window" but it is not
>>>>> used here as sPAPR does not define that for paravirtualization.
>>>> Ok, imagine I have 16MB of guest physical memory that is in reality backed
>>>> by 256 64k pages on the host. The guest wants to create a 16M TCE entry for
>>>> this (from its point of view contiguous) chunk of memory.
>>>>
>>>> Do we allow this?
>>> No, we do not. We tell the guest what it can use.
>>>
>>>> Or do we force the guest to create 64k TCE entries?
>>> 16MB TCE pages are only allowed if qemu is running with hugepages.
>>
>> That's unfortunate ;) but as long as we have to pin TCEd memory anyway, I
>> guess it doesn't hurt as badly.
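
(To illustrate the restriction above: a rough sketch of how the advertised page
sizes could be clamped to the host backing page size, so DDW_PGSIZE_16M is only
offered when qemu runs with hugepages. The host_page_shift parameter is
hypothetical - it stands in for however qemu learns the RAM backing page size.)

/* Sketch only, not existing QEMU code: drop TCE page sizes that are larger
 * than the host backing page, since such a TCE needs that much physically
 * contiguous host memory behind it. */
static uint32_t ddw_clamp_pgsizes(uint32_t pgsizes, int host_page_shift)
{
    if (host_page_shift < 24) {
        pgsizes &= ~DDW_PGSIZE_16M;   /* no hugepages -> no 16MB TCEs */
    }
    if (host_page_shift < 34) {
        pgsizes &= ~DDW_PGSIZE_16G;
    }
    return pgsizes;
}
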
>>
>>>
>>>
>>>> If we allow it, why would we ever put any restriction on the upper end of
>>>> TCE entry sizes? If we already implement enough logic to map things lazily,
>>>> we might as well have the guest create a 256M TCE entry and just
>>>> split it into 64k TCE entries in the host view.
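
(What Alexander suggests here would look roughly like the sketch below: one
guest TCE covering a large IOMMU page gets expanded into many small host TCEs.
put_one_small_tce() is a hypothetical helper, not an existing QEMU or kernel
interface, standing in for whatever would translate and program one hardware
TCE.)

static int put_guest_tce_split(uint64_t ioba, uint64_t gpa,
                               uint32_t guest_shift, uint32_t host_shift)
{
    uint64_t i, n = 1ULL << (guest_shift - host_shift);  /* e.g. 256M/64K = 4096 */

    for (i = 0; i < n; i++) {
        uint64_t off = i << host_shift;
        int ret = put_one_small_tce(ioba + off, gpa + off);

        if (ret) {
            return ret;
        }
    }
    return 0;
}
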
>>> Oh, thiiiiiis is what you meant...
>>>
>>> Well, we could, but for now current Linux guests support only 4K/64K/16M
>>> and they choose depending on what the hypervisor supports - look at
>>> enable_ddw() in the guest. What you suggest seems to be unnecessary code
>>> duplication for the 16MB page case. As for bigger page sizes - for example,
>>> for a 64GB guest, a TCE table with 16MB TCEs will be 32KB, which is already
>>> awesome enough, no?
>>
>> In "normal" environments guests won't be backed by 16M pages, but by 64k
>> pages with the occasional THP huge page merge that you can't rely on.
>>
>> That's why I figured it'd be smart to support 16MB TCEs even when the
>> underlying memory is only backed by 64k pages.
> 
> That could work for emulated PCI devices, but not for VFIO.  With VFIO
> the TCEs get passed through to the hardware, and so the pages mapped
> must be physically contiguous, which can only happen if the guest is
> backed by hugepages.
> 
> Well.. I guess you *could* fake it for VFIO, by making each guest
> H_PUT_TCE result in many real TCEs being created.  But I think it's a
> bad idea, because it would trigger the guest to map all RAM when not
> hugepage backed, and that would mean the translated (host) TCE table
> would be inordinately large.


Inordinately? :) 64GB of RAM with 64K pages = 1 million TCEs, 8 bytes each, so
the whole table would be 8MB (which is about 0.01% of 64GB). Well, allocating 8MB
in one contiguous chunk might be a problem - but P8 allows splitting tables
into up to 4-level trees. Or it could just be a single 16MB page.
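
(A quick standalone check of those numbers, for anyone who wants to poke at
them:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t ram = 64ULL << 30;   /* 64GB of guest RAM */
    uint64_t tce_size = 8;        /* bytes per TCE entry */

    /* 64K TCE pages: 1M entries * 8 bytes = 8MB table */
    printf("64K TCEs: %llu KB\n",
           (unsigned long long)(ram / (64 << 10) * tce_size >> 10));
    /* 16M TCE pages: 4096 entries * 8 bytes = 32KB table */
    printf("16M TCEs: %llu KB\n",
           (unsigned long long)(ram / (16 << 20) * tce_size >> 10));
    return 0;
}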



-- 
Alexey

