From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org, kvm-ppc@vger.kernel.org,
kvm@vger.kernel.org, Alex Williamson <alex.williamson@redhat.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: Re: [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand
Date: Thu, 14 Jun 2018 22:35:25 +1000 [thread overview]
Message-ID: <20180614123525.GB19339@umbus.fritz.box> (raw)
In-Reply-To: <360896bd-f3a9-530c-8144-c10502dc965b@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 4449 bytes --]
On Thu, Jun 14, 2018 at 04:35:18PM +1000, Alexey Kardashevskiy wrote:
> On 12/6/18 2:17 pm, David Gibson wrote:
> > On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote:
> >> At the moment we allocate the entire TCE table, twice (hardware part and
> >> userspace translation cache). This normally works as we normally have
> >> contigous memory and the guest will map entire RAM for 64bit DMA.
> >>
> >> However if we have sparse RAM (one example is a memory device), then
> >> we will allocate TCEs which will never be used as the guest only maps
> >> actual memory for DMA. If it is a single level TCE table, there is nothing
> >> we can really do but if it a multilevel table, we can skip allocating
> >> TCEs we know we won't need.
> >>
> >> This adds ability to allocate only first level, saving memory.
> >>
> >> This changes iommu_table::free() to avoid allocating of an extra level;
> >> iommu_table::set() will do this when needed.
> >>
> >> This adds @alloc parameter to iommu_table::exchange() to tell the callback
> >> if it can allocate an extra level; the flag is set to "false" for
> >> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
> >> H_TOO_HARD.
> >>
> >> This still requires the entire table to be counted in mm::locked_vm.
> >>
> >> To be conservative, this only does on-demand allocation when
> >> the usespace cache table is requested which is the case of VFIO.
> >>
> >> The example math for a system replicating a powernv setup with NVLink2
> >> in a guest:
> >> 16GB RAM mapped at 0x0
> >> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
> >>
> >> the table to cover that all with 64K pages takes:
> >> (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
> >>
> >> If we allocate only necessary TCE levels, we will only need:
> >> (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
> >> levels).
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> arch/powerpc/include/asm/iommu.h | 7 ++-
> >> arch/powerpc/platforms/powernv/pci.h | 6 ++-
> >> arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +-
> >> arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++-------
> >> arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++--
> >> drivers/vfio/vfio_iommu_spapr_tce.c | 2 +-
> >> 6 files changed, 69 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> >> index 4bdcf22..daa3ee5 100644
> >> --- a/arch/powerpc/include/asm/iommu.h
> >> +++ b/arch/powerpc/include/asm/iommu.h
> >> @@ -70,7 +70,7 @@ struct iommu_table_ops {
> >> unsigned long *hpa,
> >> enum dma_data_direction *direction);
> >>
> >> - __be64 *(*useraddrptr)(struct iommu_table *tbl, long index);
> >> + __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc);
> >> #endif
> >> void (*clear)(struct iommu_table *tbl,
> >> long index, long npages);
> >> @@ -122,10 +122,13 @@ struct iommu_table {
> >> __be64 *it_userspace; /* userspace view of the table */
> >> struct iommu_table_ops *it_ops;
> >> struct kref it_kref;
> >> + int it_nid;
> >> };
> >>
> >> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
> >> + ((tbl)->it_ops->useraddrptr((tbl), (entry), false))
> >
> > Is real mode really the only case where you want to inhibit new
> > allocations? I would have thought some paths would be read-only and
> > you wouldn't want to allocate, even in virtual mode.
>
> There are paths when I do not want allocation but I can figure out that
> from dma direction flag, for example, I am cleaning up the table and I do
> not want any extra allocation to happen there and they do happen because I
> made a mistake so I'll repost. Other than that, this @alloc flag is for
> real mode only.
Hm, ok.
Something just occurred to me: IIRC, going from the vmalloc() to the
explicit table structure was to avoid allocating pages for memory
regions that aren't there due to sparseness. But.. won't you get that
with vmalloc() anyway, if the pages for the gaps never get
instantiated?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2018-06-14 12:35 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-06-08 5:46 [PATCH kernel 0/6] powerpc/powernv/iommu: Optimize memory use Alexey Kardashevskiy
2018-06-08 5:46 ` [PATCH kernel 1/6] powerpc/powernv: Remove useless wrapper Alexey Kardashevskiy
2018-06-09 2:04 ` David Gibson
2018-06-08 5:46 ` [PATCH kernel 2/6] powerpc/powernv: Move TCE manupulation code to its own file Alexey Kardashevskiy
2018-06-09 2:21 ` David Gibson
2018-06-08 5:46 ` [PATCH kernel 3/6] KVM: PPC: Make iommu_table::it_userspace big endian Alexey Kardashevskiy
2018-06-09 8:53 ` David Gibson
2018-06-08 5:46 ` [PATCH kernel 4/6] powerpc/powernv: Add indirect levels to it_userspace Alexey Kardashevskiy
2018-06-12 2:26 ` David Gibson
2018-06-08 5:46 ` [PATCH kernel 5/6] powerpc/powernv: Rework TCE level allocation Alexey Kardashevskiy
2018-06-12 2:40 ` David Gibson
2018-06-08 5:46 ` [PATCH kernel 6/6] powerpc/powernv/ioda: Allocate indirect TCE levels on demand Alexey Kardashevskiy
2018-06-12 4:17 ` David Gibson
2018-06-14 6:35 ` Alexey Kardashevskiy
2018-06-14 12:35 ` David Gibson [this message]
2018-06-15 7:15 ` Alexey Kardashevskiy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180614123525.GB19339@umbus.fritz.box \
--to=david@gibson.dropbear.id.au \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=benh@kernel.crashing.org \
--cc=kvm-ppc@vger.kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).