qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: "Greg Kurz" <groug@kaod.org>,
	qemu-devel@nongnu.org,
	"Jose Ricardo Ziviani" <joserz@linux.ibm.com>,
	"Daniel Henrique Barboza" <danielhb413@gmail.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Sam Bobroff" <sbobroff@linux.ibm.com>,
	"Piotr Jaroszynski" <pjaroszynski@nvidia.com>,
	qemu-ppc@nongnu.org,
	"Leonardo Augusto Guimarães Garcia" <lagarcia@br.ibm.com>
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window
Date: Tue, 5 Mar 2019 14:28:46 +1100	[thread overview]
Message-ID: <20190305032846.GE7877@umbus.fritz.box> (raw)
In-Reply-To: <3644d6d1-b971-aa51-09c9-f90416147fc7@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 7607 bytes --]

On Thu, Feb 28, 2019 at 04:37:25PM +1100, Alexey Kardashevskiy wrote:
> On 28/02/2019 14:49, David Gibson wrote:
> > On Thu, Feb 28, 2019 at 10:59:56AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 28/02/2019 01:33, Greg Kurz wrote:
> >>> On Wed, 27 Feb 2019 19:51:47 +1100
> >>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >>>
> >>>> On sPAPR vfio_listener_region_add() is called in 2 situations:
> >>>> 1. a new listener is registered from vfio_connect_container();
> >>>> 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().
> >>>>
> >>>> In both cases vfio_listener_region_add() calls
> >>>> memory_region_iommu_replay() to notify newly registered IOMMU notifiers
> >>>> about existing mappings which is totally desirable for case 1.
> >>>>
> >>>> However for case 2 it is nothing but noop as the window has just been
> >>>> created and has no valid mappings so replaying those does not do anything.
> >>>> It is barely noticeable with usual guests but if the window happens to be
> >>>> really big, such no-op replay might take minutes and trigger RCU stall
> >>>> warnings in the guest.
> >>>>
> >>>> For example, a upcoming GPU RAM memory region mapped at 64TiB (right
> >>>> after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB
> >>>> which is (128<<40)/0x10000=2.147.483.648 TCEs to replay.
> >>>>
> >>>> This mitigates the problem by adding an "skipping_replay" flag to
> >>>> sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does
> >>>> exactly the same thing as the generic one except it returns early if
> >>>> @skipping_replay==true.
> >>>>
> >>>> When "ibm,create-pe-dma-window" is complete, the guest will map only
> >>>> required regions of the huge DMA window.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  include/hw/ppc/spapr.h  |  1 +
> >>>>  hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
> >>>>  hw/ppc/spapr_rtas_ddw.c |  7 +++++++
> >>>>  3 files changed, 39 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 86b0488..358bb38 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -727,6 +727,7 @@ struct sPAPRTCETable {
> >>>>      uint64_t *mig_table;
> >>>>      bool bypass;
> >>>>      bool need_vfio;
> >>>> +    bool skipping_replay;
> >>>>      int fd;
> >>>>      MemoryRegion root;
> >>>>      IOMMUMemoryRegion iommu;
> >>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>>> index 37e98f9..8f23179 100644
> >>>> --- a/hw/ppc/spapr_iommu.c
> >>>> +++ b/hw/ppc/spapr_iommu.c
> >>>> @@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
> >>>>      return ret;
> >>>>  }
> >>>>  
> >>>> +static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >>>> +{
> >>>> +    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
> >>>> +    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> >>>> +    hwaddr addr, granularity;
> >>>> +    IOMMUTLBEntry iotlb;
> >>>> +    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
> >>>> +
> >>>> +    if (tcet->skipping_replay) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
> >>>> +
> >>>> +    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> >>>> +        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
> >>>> +        if (iotlb.perm != IOMMU_NONE) {
> >>>> +            n->notify(n, &iotlb);
> >>>> +        }
> >>>> +
> >>>> +        /*
> >>>> +         * if (2^64 - MR size) < granularity, it's possible to get an
> >>>> +         * infinite loop here.  This should catch such a wraparound.
> >>>> +         */
> >>>> +        if ((addr + granularity) < addr) {
> >>>> +            break;
> >>>> +        }
> >>>> +    }
> >>>> +}
> >>>
> >>> It is a bit unfortunate to duplicate all that code. What about making
> >>> a memory_region_iommu_replay_generic() helper out of it and call it
> >>> from spapr_tce_replay() and memory_region_iommu_replay() ?
> >>
> >>
> >> I really do not want to mess with generic code to solve our local sPAPR
> >> problem, especially when there is a way not to do so.
> > 
> > Well, the thing is, I think we're actually the only user of the
> > current generic replay - everything else has more efficient structure
> > aware replay hooks AFAIK.  Which makes this hack even hackier.
> 
> If that so, then we are better off removing that loop from
> memory_region_iommu_replay() at all rather than keeping it generic.

Well.. maybe.  In theory this logic should work, albeit slowly, for
any IOMMU implementation, it's just that everyone else has more
efficient implementations at the moment.

> >> And as a next step, I was thinking of removing (i.e. making this replay
> >> a no-op) from QEMU later and do replay in KVM instead when an IOMMU
> >> group is attaching to KVM as this is the only case when we need replay
> >> and KVM has a lot better idea what TCEs are actually valid and can skip
> >> most of them. This is a bit bigger thing as it requires a KVM capability
> >> "KVM replays mappings" but when we get it, spapr_tce_replay() will
> >> become no-op.
> > 
> > That's a good idea longer term.
> > 
> >>> Apart from that, LGTM.
> >>
> >> Well. It is a hack, I just do not have taste to tell how nasty it is
> >> :)
> > 
> > As an interim step until the kernel change, I think we can do a bit
> > better than this.  First, as Greg suggests we should have the
> > "generic" replay be a helper and have the spapr one call that with a
> > little in the way of extra checking.
> > 
> > Second, rather than having an explicit "skip_replay" flag, what we
> > really want here is to have the replay be a fast no-op if there are no
> > existing mappings rather than a slow no-op.  So instead I think we
> > should have a flag which records if any mappings have been made in the
> > region yet, initialized to false.
> > The new replay would do nothing if
> > it's still false.
> 
> If QEMU controlled the mappings - sure. But it does not - KVM does it
> instead via that fast path. So QEMU does not know if there are mappings
> until it reads all TCEs from mmap'ed KVM TCE table which will fault in
> all these pages.

Oops, I forgot that; good point.

> We could implement some tricks such are allow reading (or ioctl) from
> that KVM TCE fd and it could tell what is mapped and what is not in a
> very condensed format (for example a bit per every 256MB of the guest
> address space)  ooooor  implement different behavior for mapping with RW
> or readonly - the latter would fail if there is no backing page
> allocated yet - and then QEMU could skip these regions when replaying.

Urgh, yeah, this is trickier than I initially thought.  About the
cleanest approach I can see is to delay allocation of the IOMMU table
until the first H_PUT_TCE, and only then actually bind the table to
KVM and allow for H_PUT_TCE acceleration.

Seems pretty awkward, though.  Ok, your hack seems the best way
forward in the short to medium term, just make sure there are some
comments there explaining why the hack is valuable.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2019-03-05  3:29 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-27  8:51 [Qemu-devel] [PATCH qemu v3 0/6] spapr_pci, vfio: NVIDIA V100 + POWER9 passthrough Alexey Kardashevskiy
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 1/6] pci: Move NVIDIA vendor id to the rest of ids Alexey Kardashevskiy
2019-02-28  0:56   ` David Gibson
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 2/6] vfio/spapr: Fix indirect levels calculation Alexey Kardashevskiy
2019-02-28  2:24   ` David Gibson
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 3/6] vfio/spapr: Rename local systempagesize variable Alexey Kardashevskiy
2019-02-28  2:26   ` David Gibson
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings from just created DMA window Alexey Kardashevskiy
2019-02-27 14:33   ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
2019-02-27 23:59     ` Alexey Kardashevskiy
2019-02-28  3:49       ` David Gibson
2019-02-28  5:37         ` Alexey Kardashevskiy
2019-03-05  3:28           ` David Gibson [this message]
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 5/6] vfio: Make vfio_get_region_info_cap public Alexey Kardashevskiy
2019-02-27  8:51 ` [Qemu-devel] [PATCH qemu v3 6/6] spapr: Support NVIDIA V100 GPU with NVLink2 Alexey Kardashevskiy
2019-02-28  3:31   ` David Gibson
2019-02-28  6:11     ` Alexey Kardashevskiy
2019-03-05  1:47       ` David Gibson
2019-03-07  2:40         ` Alexey Kardashevskiy
2019-03-07  3:57           ` David Gibson
2019-03-07  4:32             ` Alexey Kardashevskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190305032846.GE7877@umbus.fritz.box \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=danielhb413@gmail.com \
    --cc=groug@kaod.org \
    --cc=joserz@linux.ibm.com \
    --cc=lagarcia@br.ibm.com \
    --cc=pjaroszynski@nvidia.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    --cc=sbobroff@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).