From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>,
linux-kernel@vger.kernel.org,
Alex Williamson <alex.williamson@redhat.com>,
Paul Mackerras <paulus@samba.org>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
Date: Mon, 11 May 2015 12:24:39 +1000 [thread overview]
Message-ID: <555012E7.4020106@ozlabs.ru> (raw)
In-Reply-To: <20150505115813.GL14090@voom.redhat.com>
On 05/05/2015 09:58 PM, David Gibson wrote:
> On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:
>> On 05/01/2015 03:12 PM, David Gibson wrote:
>>> On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
>>>> On 04/29/2015 04:40 PM, David Gibson wrote:
>>>>> On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:
>>>>>> This adds a way for the IOMMU user to know how much a new table will
>>>>>> use so it can be accounted in the locked_vm limit before allocation
>>>>>> happens.
>>>>>>
>>>>>> This stores the allocated table size in pnv_pci_create_table()
>>>>>> so the locked_vm counter can be updated correctly when a table is
>>>>>> being disposed.
>>>>>>
>>>>>> This defines an iommu_table_group_ops callback to let VFIO know
>>>>>> how much memory will be locked if a table is created.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> Changes:
>>>>>> v9:
>>>>>> * reimplemented the whole patch
>>>>>> ---
>>>>>> arch/powerpc/include/asm/iommu.h | 5 +++++
>>>>>> arch/powerpc/platforms/powernv/pci-ioda.c | 14 ++++++++++++
>>>>>> arch/powerpc/platforms/powernv/pci.c | 36 +++++++++++++++++++++++++++++++
>>>>>> arch/powerpc/platforms/powernv/pci.h | 2 ++
>>>>>> 4 files changed, 57 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
>>>>>> index 1472de3..9844c106 100644
>>>>>> --- a/arch/powerpc/include/asm/iommu.h
>>>>>> +++ b/arch/powerpc/include/asm/iommu.h
>>>>>> @@ -99,6 +99,7 @@ struct iommu_table {
>>>>>> unsigned long it_size; /* Size of iommu table in entries */
>>>>>> unsigned long it_indirect_levels;
>>>>>> unsigned long it_level_size;
>>>>>> + unsigned long it_allocated_size;
>>>>>> unsigned long it_offset; /* Offset into global table */
>>>>>> unsigned long it_base; /* mapped address of tce table */
>>>>>> unsigned long it_index; /* which iommu table this is */
>>>>>> @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>>>>>> struct iommu_table_group;
>>>>>>
>>>>>> struct iommu_table_group_ops {
>>>>>> + unsigned long (*get_table_size)(
>>>>>> + __u32 page_shift,
>>>>>> + __u64 window_size,
>>>>>> + __u32 levels);
>>>>>> long (*create_table)(struct iommu_table_group *table_group,
>>>>>> int num,
>>>>>> __u32 page_shift,
>>>>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> index e0be556..7f548b4 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>>>>> @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
>>>>>> }
>>>>>>
>>>>>> #ifdef CONFIG_IOMMU_API
>>>>>> +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>>>>>> + __u64 window_size, __u32 levels)
>>>>>> +{
>>>>>> + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
>>>>>> +
>>>>>> + if (!ret)
>>>>>> + return ret;
>>>>>> +
>>>>>> + /* Add size of it_userspace */
>>>>>> + return ret + (window_size >> page_shift) * sizeof(unsigned long);
>>>>>
>>>>> This doesn't make much sense. The userspace view can't possibly be a
>>>>> property of the specific low-level IOMMU model.
>>>>
>>>>
>>>> This it_userspace thing is all about memory preregistration.
>>>>
>>>> I need some way to track how many actual mappings the
>>>> mm_iommu_table_group_mem_t has in order to decide whether to allow
>>>> unregistering or not.
>>>>
>>>> When I clear TCE, I can read the old value which is host physical address
>>>> which I cannot use to find the preregistered region and adjust the mappings
>>>> counter; I can only use userspace addresses for this (not even guest
>>>> physical addresses as it is VFIO and probably no KVM).
>>>>
>>>> So I have to keep userspace addresses somewhere, one per IOMMU page, and the
>>>> iommu_table seems a natural place for this.
>>>
>>> Well.. sort of. But as noted elsewhere this pulls VFIO specific
>>> constraints into a platform code structure. And whether you get this
>>> table depends on the platform IOMMU type rather than on what VFIO
>>> wants to do with it, which doesn't make sense.
>>>
>>> What might make more sense is an opaque pointer io iommu_table for use
>>> by the table "owner" (in the take_ownership sense). The pointer would
>>> be stored in iommu_table, but VFIO is responsible for populating and
>>> managing its contents.
>>>
>>> Or you could just put the userspace mappings in the container.
>>> Although you might want a different data structure in that case.
>>
>> Nope. I need this table in in-kernel acceleration to update the mappings
>> counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only
>> have IOMMU tables, not containers or groups. QEMU creates a guest view of
>> the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE
>> tables to it via set of ioctls (one per IOMMU group) to VFIO KVM device.
>>
>> So if I call it it_opaque (instead of it_userspace), I will still need a
>> common place (visible to VFIO and PowerKVM) for this to put:
>> #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)
>
> I think it should be in a VFIO header. If I'm understanding right
> this part of the PowerKVM code is explicitly VFIO aware - that's kind
> of the point.
Well. 2 points against it are:
1. arch/powerpc/kvm/book3s_64_vio_hv.c and arch/powerpc/kvm/book3s_64_vio.c
are not including any of vfio headers now (all required liobn<->iommu
hooking bits are in virt/kvm/vfio.c), and I kind of like it.
2. for me it seems like a good idea to keep an accessor close to what it
provides the access to. Since I cannot move it_userspace somewhere else, I
should not move IOMMU_TABLE_USERSPACE_ENTRY() either.
>> So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table
>> struct.
>>
>>
>>> The other thing to bear in mind is that registered regions are likely
>>> to be large contiguous blocks in user addresses, though obviously not
>>> contiguous in physical addr. So you might be able to compaticfy this
>>> information by storing it as a list of variable length blocks in
>>> userspace address space, rather than a per-page address..
>>
>> It is 8 bytes per system page - 8/65536 = 0.00012 (or 26MB for 200GB guest)
>> - very little overhead.
>>
>>
>>> But.. isn't there a bigger problem here. As Paulus was pointing out,
>>> there's nothing guaranteeing the page tables continue to contain the
>>> same page as was there at gup() time.
>>
>> This can happen if the userspace remaps memory which it registered/mapped
>> for DMA via VFIO, no? If so, then the userspace just should not do this, it
>> is DMA, it cannot be moved like this. What am I missing here?
>>
>>
>>> What's going to happen if you REGISTER a memory region, then mremap()
>>> over it?
>>
>> The registered pages will remain pinned and PUT_TCE will use that region for
>> translation (and this will fail as the userspace addresses changed).
>>
>> I do not see how it is different from the situation when the userspace
>> mapped a page and mremap()ed it while it is DMA-mapped.
>
> True, it's basically the same. Hrm, so what guarantees a dma_map,
> mremap() dma_unmap will unreference the correct pages.
The original page will remain pinned, wrong one will be unpinned, access to
it will produce EEH, the process exit will do cleanup. What is the problem
here? Inability for the userspace to dma_map() and then remap()? I do not
think x86 (or anything else) can and should cope with this well, it is DMA,
you just cannot do certain things when you work with DMA...
>>> Then attempt to PUT_TCE a page in the region? Or what if you
>>> mremap() it to someplace else then try to PUT_TCE a page there?
>>
>> This will fail - a new userspace address has to be preregistered.
>>
>>> Or REGISTER it again in its new location?
>>
>> It will be pinned twice + some memory overhead to store the same host
>> physical address(es) twice.
>>
>>
>>
>
--
Alexey
next prev parent reply other threads:[~2015-05-11 2:24 UTC|newest]
Thread overview: 110+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-25 12:14 [PATCH kernel v9 00/32] powerpc/iommu/vfio: Enable Dynamic DMA windows Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 01/32] powerpc/iommu: Split iommu_free_table into 2 helpers Alexey Kardashevskiy
2015-04-29 2:03 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 02/32] Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically" Alexey Kardashevskiy
2015-04-27 21:05 ` Alex Williamson
2015-04-29 2:05 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 03/32] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 04/32] vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 05/32] vfio: powerpc/spapr: Use it_page_size Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 06/32] vfio: powerpc/spapr: Move locked_vm accounting to helpers Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 07/32] vfio: powerpc/spapr: Disable DMA mappings on disabled container Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 08/32] vfio: powerpc/spapr: Moving pinning/unpinning to helpers Alexey Kardashevskiy
2015-04-29 2:14 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching Alexey Kardashevskiy
2015-04-29 2:16 ` David Gibson
2015-04-30 2:29 ` Alexey Kardashevskiy
2015-04-30 4:05 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 10/32] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 11/32] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group Alexey Kardashevskiy
2015-04-29 2:49 ` David Gibson
2015-04-30 2:30 ` Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control Alexey Kardashevskiy
2015-04-29 3:02 ` David Gibson
2015-04-29 9:19 ` Alexey Kardashevskiy
2015-04-30 4:08 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 14/32] powerpc/iommu: Fix IOMMU ownership control functions Alexey Kardashevskiy
2015-04-29 3:08 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free() Alexey Kardashevskiy
2015-04-29 3:18 ` David Gibson
2015-04-30 2:58 ` Alexey Kardashevskiy
2015-04-30 4:16 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE Alexey Kardashevskiy
2015-04-27 21:05 ` Alex Williamson
2015-04-29 3:25 ` David Gibson
2015-04-29 9:00 ` Alexey Kardashevskiy
2015-04-30 4:18 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry Alexey Kardashevskiy
2015-04-29 4:04 ` David Gibson
2015-04-29 9:02 ` Alexey Kardashevskiy
2015-04-30 0:13 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE Alexey Kardashevskiy
2015-04-29 4:18 ` David Gibson
2015-04-29 9:51 ` Alexey Kardashevskiy
2015-04-30 4:21 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 19/32] powerpc/powernv/ioda2: Rework iommu_table creation Alexey Kardashevskiy
2015-04-29 4:27 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table Alexey Kardashevskiy
2015-04-29 4:39 ` David Gibson
2015-04-29 9:12 ` Alexey Kardashevskiy
2015-04-30 4:24 ` David Gibson
2015-05-01 10:13 ` Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window Alexey Kardashevskiy
2015-04-29 4:45 ` David Gibson
2015-04-29 9:26 ` Alexey Kardashevskiy
2015-04-30 4:32 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables Alexey Kardashevskiy
2015-04-29 5:04 ` David Gibson
2015-05-01 9:48 ` Alexey Kardashevskiy
2015-05-05 12:05 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks Alexey Kardashevskiy
2015-04-29 5:30 ` David Gibson
2015-04-29 9:44 ` Alexey Kardashevskiy
2015-04-30 4:37 ` David Gibson
2015-04-30 9:56 ` Alexey Kardashevskiy
2015-05-01 3:36 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 24/32] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 25/32] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership Alexey Kardashevskiy
2015-04-29 5:39 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table Alexey Kardashevskiy
2015-04-29 6:31 ` David Gibson
2015-05-01 4:01 ` Alexey Kardashevskiy
2015-05-01 4:23 ` David Gibson
2015-05-01 7:12 ` Alexey Kardashevskiy
2015-05-05 12:02 ` David Gibson
2015-05-11 2:11 ` Alexey Kardashevskiy
2015-05-11 4:52 ` Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table Alexey Kardashevskiy
2015-04-29 6:40 ` David Gibson
2015-05-01 4:10 ` Alexey Kardashevskiy
2015-05-01 5:12 ` David Gibson
2015-05-01 6:53 ` Alexey Kardashevskiy
2015-05-05 11:58 ` David Gibson
2015-05-11 2:24 ` Alexey Kardashevskiy [this message]
2015-04-25 12:14 ` [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache Alexey Kardashevskiy
2015-04-29 7:01 ` David Gibson
2015-05-01 11:26 ` Alexey Kardashevskiy
2015-05-05 12:12 ` David Gibson
2015-04-30 6:34 ` David Gibson
2015-04-30 8:25 ` Paul Mackerras
2015-05-01 3:39 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2 Alexey Kardashevskiy
2015-04-30 6:55 ` David Gibson
2015-05-01 4:35 ` Alexey Kardashevskiy
2015-05-01 5:23 ` David Gibson
2015-05-01 6:27 ` Alexey Kardashevskiy
2015-05-05 11:53 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group Alexey Kardashevskiy
2015-04-27 22:18 ` Alex Williamson
2015-04-30 6:58 ` David Gibson
2015-04-25 12:14 ` [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible Alexey Kardashevskiy
2015-04-30 7:22 ` David Gibson
2015-04-30 9:33 ` Alexey Kardashevskiy
2015-05-01 0:46 ` Benjamin Herrenschmidt
2015-05-01 4:44 ` David Gibson
2015-05-01 4:33 ` David Gibson
2015-05-01 6:05 ` Alexey Kardashevskiy
2015-05-05 11:50 ` David Gibson
2015-05-11 2:26 ` Alexey Kardashevskiy
2015-04-25 12:14 ` [PATCH kernel v9 32/32] vfio: powerpc/spapr: Support Dynamic DMA windows Alexey Kardashevskiy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=555012E7.4020106@ozlabs.ru \
--to=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=gwshan@linux.vnet.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).