LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH RESEND] net: fec_mpc52xx: Read MAC address from device-tree
From: David Miller @ 2013-02-11 18:50 UTC (permalink / raw)
  To: sr; +Cc: netdev, agust, linuxppc-dev
In-Reply-To: <1360403352-24237-1-git-send-email-sr@denx.de>

From: Stefan Roese <sr@denx.de>
Date: Sat,  9 Feb 2013 10:49:12 +0100

> Until now, the MPC5200 FEC ethernet driver relied upon the bootloader
> (U-Boot) to write the MAC address into the ethernet controller
> registers. The Linux driver should not rely on such a thing. So
> let's read the MAC address from the DT, as it should be done here.
> 
> This fixes a problem with an MPC5200 board that uses the SPL U-Boot
> version without FEC initialization before Linux booting for
> boot speedup.
> 
> Additionally a status line will now be printed upon successful
> driver probing, also displaying this MAC address.
> 
> Signed-off-by: Stefan Roese <sr@denx.de>

I don't think this is a conservative enough change.

You have to keep the MAC register reading code around, as a backup
code path in case the OF device node lacks a MAC address. Also:

> +	if (!is_zero_ether_addr(mpc52xx_fec_mac_addr)) {

I really wish I would have caught this terrible module parameter
when the driver was initially submitted.

I would just get rid of this, and have a priority list of cases:

1) First, try the OF node MAC address. If it is not present or invalid, then:

2) Read from the MAC address registers. If that is invalid, then:

3) Log a warning message, and choose a random MAC address.

That way no matter what happens, the user will at least have a
functioning networking device.
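
The priority list above can be sketched in plain C roughly as follows.
This is a minimal userspace sketch, not the driver code: pick_mac(),
mac_is_valid() and mac_random() are hypothetical names, and the validity
test is assumed to follow the kernel's is_valid_ether_addr() (non-zero
and not multicast).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/* Which source ended up providing the address (hypothetical enum). */
enum mac_source { MAC_FROM_OF, MAC_FROM_REGS, MAC_RANDOM };

/* Valid means: present, not all-zero, and not multicast/broadcast. */
static bool mac_is_valid(const uint8_t *mac)
{
	static const uint8_t zero[ETH_ALEN] = {0};

	return mac && memcmp(mac, zero, ETH_ALEN) != 0 && !(mac[0] & 0x01);
}

/* Random, locally administered unicast address. */
static void mac_random(uint8_t *mac)
{
	for (int i = 0; i < ETH_ALEN; i++)
		mac[i] = (uint8_t)rand();
	mac[0] &= 0xfe;	/* clear the multicast bit */
	mac[0] |= 0x02;	/* set the locally administered bit */
}

/*
 * of_mac:  address from the device tree, or NULL if the property is absent
 * reg_mac: address read back from the controller registers
 * out:     receives the address that should be used
 */
static enum mac_source pick_mac(const uint8_t *of_mac, const uint8_t *reg_mac,
				uint8_t *out)
{
	if (mac_is_valid(of_mac)) {		/* 1) OF node */
		memcpy(out, of_mac, ETH_ALEN);
		return MAC_FROM_OF;
	}
	if (mac_is_valid(reg_mac)) {		/* 2) MAC address registers */
		memcpy(out, reg_mac, ETH_ALEN);
		return MAC_FROM_REGS;
	}
	/* 3) warn and fall back to a random address */
	fprintf(stderr, "no valid MAC address found, using a random one\n");
	mac_random(out);
	return MAC_RANDOM;
}
```

In the driver itself, step 3 would presumably use an existing kernel
helper such as eth_random_addr() rather than an open-coded generator.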


* Re[5]: PS3 platform is broken on Linux 3.7.0
From: Phileas Fogg @ 2013-02-11 16:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Geoff Levand, linuxppc-dev
In-Reply-To: <87zjzb4051.fsf@linux.vnet.ibm.com>



>"Aneesh Kumar K.V" < aneesh.kumar@linux.vnet.ibm.com > writes:
>
>> Phileas Fogg < phileas-fogg@mail.ru > writes:
>>
>>>  And another note.
>>> I took a look at the MMU chapter in the Cell Architecture handbook and indeed the first 15 bits in VA are treated as 0 by the hardware.
>>>
>>> Quote:
>>>
>>> 1. High-order bits above 65 bits in the 80-bit virtual address (VA[0:14]) are not implemented. The hardware always
>>>    treats these bits as `0'. Software must not set these bits to any other value than `0' or the results are undefined in
>>>    the PPE.
>>>
>>>
>>
>> True, we missed the below part of ISA doc:
>>
>> ISA doc says
>>
>> "On implementations that support a virtual address size
>> of only n bits, n < 78, bits 0:77-n of the AVA field must be
>> zeros. "
>>
>> The Cell document I found at 
>>
>>  https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED/$file/CellBE_Handbook_v1.12_3Apr09_pub.pdf
>>
>> gives 
>>
>> Virtual Address (VA) Size -> 65 bits
>>
>> So as per ISA, bits 0:12 should be zero, which should make 0:14 of PTE
>> fields zero for Cell.
>>
>> I will try to do a patch. 
>>
>
>Can you try this patch ?
>
>diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
>index 2fdb47a..f01fd9a 100644
>--- a/arch/powerpc/include/asm/mmu-hash64.h
>+++ b/arch/powerpc/include/asm/mmu-hash64.h
>@@ -381,21 +381,37 @@ extern void slb_set_size(u16 size);
>  * hash collisions.
>  */
> 
>+/* This should go in Kconfig */
>+/*
>+ * Be careful with this value. This determines the VSID_MODULUS_*  and that
>+ * need to be co-prime with VSID_MULTIPLIER*
>+ */
>+#if 1
>+#define MAX_VIRTUAL_ADDR_BITS	65
>+#else
>+#define MAX_VIRTUAL_ADDR_BITS	66
>+#endif
>+/*
>+ * One bit is taken by the kernel, only the rest of space is available for the
>+ * user space.
>+ */
>+#define CONTEXT_BITS		(MAX_VIRTUAL_ADDR_BITS - \
>+				 (USER_ESID_BITS + SID_SHIFT + 1))
>+#define USER_ESID_BITS		18
>+#define USER_ESID_BITS_1T	6
>+
> /*
>  * This should be computed such that protovosid * vsid_mulitplier
>  * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus
>  */
> #define VSID_MULTIPLIER_256M	ASM_CONST(12538073)	/* 24-bit prime */
>-#define VSID_BITS_256M		38
>+#define VSID_BITS_256M		(CONTEXT_BITS + USER_ESID_BITS + 1)
> #define VSID_MODULUS_256M	((1UL<<VSID_BITS_256M)-1)
> 
> #define VSID_MULTIPLIER_1T	ASM_CONST(12538073)	/* 24-bit prime */
>-#define VSID_BITS_1T		26
>+#define VSID_BITS_1T		(CONTEXT_BITS + USER_ESID_BITS_1T + 1)
> #define VSID_MODULUS_1T		((1UL<<VSID_BITS_1T)-1)
> 
>-#define CONTEXT_BITS		19
>-#define USER_ESID_BITS		18
>-#define USER_ESID_BITS_1T	6
> 
> #define USER_VSID_RANGE	(1UL << (USER_ESID_BITS + SID_SHIFT))
> 
>

Tested with Linux 3.8.0-rc7; it looks good so far under heavy hard disk usage.
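
The bit arithmetic behind the quoted patch can be checked with a small
C sketch. It assumes SID_SHIFT is 28 (256MB segments), as in the kernel
headers of that time; the helper function names are hypothetical and
only mirror the macros in the patch.

```c
#define SID_SHIFT		28	/* assumed: 256MB segment shift */
#define USER_ESID_BITS		18
#define USER_ESID_BITS_1T	6

/* One bit is taken by the kernel, the rest is left for user contexts. */
static int context_bits(int max_va_bits)
{
	return max_va_bits - (USER_ESID_BITS + SID_SHIFT + 1);
}

static int vsid_bits_256m(int max_va_bits)
{
	return context_bits(max_va_bits) + USER_ESID_BITS + 1;
}

static int vsid_bits_1t(int max_va_bits)
{
	return context_bits(max_va_bits) + USER_ESID_BITS_1T + 1;
}

/*
 * ISA rule quoted above: with an n-bit virtual address, n < 78,
 * bits 0:77-n of the AVA field must be zero. This returns 77-n.
 */
static int ava_last_zero_bit(int va_bits)
{
	return 77 - va_bits;
}

/*
 * max_va_bits  context  VSID 256M  VSID 1T
 *     65         18        37        25     (Cell, with the patch)
 *     66         19        38        26     (previous hard-coded values)
 */
```

With 65 bits, VSID_BITS_256M + SID_SHIFT = 37 + 28 = 65, so the
generated virtual addresses stay within what the Cell hardware
implements, and ava_last_zero_bit(65) gives 12, matching the "bits 0:12"
statement in the quoted mail.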




* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: Srivatsa S. Bhat @ 2013-02-11 12:56 UTC (permalink / raw)
  To: David Howells
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev, oleg, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <30708.1360586491@warthog.procyon.org.uk>

On 02/11/2013 06:11 PM, David Howells wrote:
> Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> 
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>          CPU 0                                CPU 1
>>          ------                               ------
>>
>> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
>>
>>
>> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);
> 
> The lock order on CPU 0 is unsafe if CPU2 can do:
> 
> 	write_lock(&my_rwlock);
> 	spin_lock(&random_lock);
> 
> and on CPU 1 if CPU2 can do:
> 
> 	spin_lock(&random_lock);
> 	write_lock(&my_rwlock);
> 

Right..

> I presume you were specifically excluding these situations?
>

Yes. Those cases are simple to find and fix (by changing the
lock ordering). My main problem was with CPU 0 and CPU 1 as shown above,
and using a global rwlock helps ease that part out.
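
The cases discussed above can be modelled with a small two-task
lock-order check in C. This is only a sketch under a stated assumption:
readers never wait behind a pending writer (a reader-preferring rwlock),
so two read_lock() holders never block each other. All type and function
names are hypothetical.

```c
#include <stdbool.h>

enum lock_id { RANDOM_LOCK, MY_RWLOCK };
enum lock_mode { SHARED, EXCLUSIVE };

struct acquire {
	enum lock_id lock;
	enum lock_mode mode;
};

/* A task takes 'first', then takes 'second' while still holding 'first'. */
struct task {
	struct acquire first, second;
};

/* Two acquisitions of the same lock block each other unless both are shared. */
static bool conflicts(enum lock_mode a, enum lock_mode b)
{
	return a == EXCLUSIVE || b == EXCLUSIVE;
}

/*
 * ABBA check: a deadlock is possible if the tasks take the same two locks
 * in opposite order, and each task's second acquisition conflicts with
 * what the other task already holds.
 */
static bool abba_deadlock(const struct task *a, const struct task *b)
{
	if (a->first.lock != b->second.lock || a->second.lock != b->first.lock)
		return false;

	return conflicts(a->second.mode, b->first.mode) &&
	       conflicts(b->second.mode, a->first.mode);
}

/* spin_lock(&random_lock); read_lock(&my_rwlock); */
static const struct task cpu0 = {
	{ RANDOM_LOCK, EXCLUSIVE }, { MY_RWLOCK, SHARED }
};
/* read_lock(&my_rwlock); spin_lock(&random_lock); */
static const struct task cpu1 = {
	{ MY_RWLOCK, SHARED }, { RANDOM_LOCK, EXCLUSIVE }
};
/* write_lock(&my_rwlock); spin_lock(&random_lock); */
static const struct task cpu2_write_first = {
	{ MY_RWLOCK, EXCLUSIVE }, { RANDOM_LOCK, EXCLUSIVE }
};
/* spin_lock(&random_lock); write_lock(&my_rwlock); */
static const struct task cpu2_spin_first = {
	{ RANDOM_LOCK, EXCLUSIVE }, { MY_RWLOCK, EXCLUSIVE }
};
```

Under this model the CPU 0 / CPU 1 reader pair is safe, while pairing
CPU 0 with the write-then-spin writer, or CPU 1 with the spin-then-write
writer, reports a possible deadlock. With a fair rwlock, where a queued
writer blocks new readers, the assumption no longer holds and the reader
pair would need a closer look.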

Regards,
Srivatsa S. Bhat


* Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend
From: David Howells @ 2013-02-11 12:41 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, dhowells, mingo,
	linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
	netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <20130122073315.13822.27093.stgit@srivatsabhat.in.ibm.com>

Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> We can use global rwlocks as shown below safely, without fear of deadlocks:
> 
> Readers:
> 
>          CPU 0                                CPU 1
>          ------                               ------
> 
> 1.    spin_lock(&random_lock);             read_lock(&my_rwlock);
> 
> 
> 2.    read_lock(&my_rwlock);               spin_lock(&random_lock);

The lock order on CPU 0 is unsafe if CPU2 can do:

	write_lock(&my_rwlock);
	spin_lock(&random_lock);

and on CPU 1 if CPU2 can do:

	spin_lock(&random_lock);
	write_lock(&my_rwlock);

I presume you were specifically excluding these situations?

David


* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2013-02-11 12:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
	linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
	paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <CAKfTPtCe+cD7LLkb+D6pqG1SwG2V08ws=4XO3ttHEHV2qgmqPg@mail.gmail.com>

On 02/11/2013 05:28 PM, Vincent Guittot wrote:
> On 8 February 2013 19:09, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 02/08/2013 10:14 PM, Srivatsa S. Bhat wrote:
>>> On 02/08/2013 09:11 PM, Russell King - ARM Linux wrote:
>>>> On Thu, Feb 07, 2013 at 11:41:34AM +0530, Srivatsa S. Bhat wrote:
>>>>> On 02/07/2013 09:44 AM, Rusty Russell wrote:
>>>>>> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> writes:
>>>>>>> On 01/22/2013 01:03 PM, Srivatsa S. Bhat wrote:
>>>>>>>                  Avg. latency of 1 CPU offline (ms) [stop-cpu/stop-m/c latency]
>>>>>>>
>>>>>>> # online CPUs    Mainline (with stop-m/c)       This patchset (no stop-m/c)
>>>>>>>
>>>>>>>       8                 17.04                          7.73
>>>>>>>
>>>>>>>      16                 18.05                          6.44
>>>>>>>
>>>>>>>      32                 17.31                          7.39
>>>>>>>
>>>>>>>      64                 32.40                          9.28
>>>>>>>
>>>>>>>     128                 98.23                          7.35
>>>>>>
>>>>>> Nice!
>>>>>
>>>>> Thank you :-)
>>>>>
>>>>>>  I wonder how the ARM guys feel with their quad-cpu systems...
>>>>>>
>>>>>
>>>>> That would be definitely interesting to know :-)
>>>>
>>>> That depends what exactly you'd like tested (and how) and whether you'd
>>>> like it to be a test-chip based quad core, or an OMAP dual-core SoC.
>>>>
>>>
>>> The effect of stop_machine() doesn't really depend on the CPU architecture
>>> used underneath or the platform. It depends only on the _number_ of
>>> _logical_ CPUs used.
>>>
>>> And stop_machine() has 2 noticeable drawbacks:
>>> 1. It makes the hotplug operation itself slow
>>> 2. and it causes disruptions to the workloads running on the other
>>> CPUs by hijacking the entire machine for significant amounts of time.
>>>
>>> In my experiments (mentioned above), I tried to measure how my patchset
>>> improves (reduces) the duration of hotplug (CPU offline) itself. Which is
>>> also slightly indicative of the impact it has on the rest of the system.
>>>
>>> But what would be nice to test, is a setup where the workloads running on
>>> the rest of the system are latency-sensitive, and measure the impact of
>>> CPU offline on them, with this patchset applied. That would tell us how
>>> far is this useful in making CPU hotplug less disruptive on the system.
>>>
>>> Of course, it would be nice to also see whether we observe any reduction
>>> in hotplug duration itself (point 1 above) on ARM platforms with a lot
>>> of CPUs. [This could potentially speed up suspend/resume, which is used
>>> rather heavily on ARM platforms].
>>>
>>> The benefits from this patchset over mainline (both in terms of points
>>> 1 and 2 above) are expected to increase with an increasing number of
>>> CPUs in the system.
>>>
>>
>> Adding Vincent to CC, who had previously evaluated the performance and
>> latency implications of CPU hotplug on ARM platforms, IIRC.
>>
>
> Hi Srivatsa,
> 
> I can try to run some of our stress tests on your patches.

Great!

> Have you
> got a git tree that I can pull?
> 

Unfortunately, no, none at the moment..  :-(

Regards,
Srivatsa S. Bhat


* [PATCH 1/2] vfio powerpc: enabled on powernv platform
From: Alexey Kardashevskiy @ 2013-02-11 11:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360583672-21924-1-git-send-email-aik@ozlabs.ru>

This patch initializes IOMMU groups based on the IOMMU
configuration discovered during the PCI scan on the POWERNV
(POWER non-virtualized) platform. The IOMMU groups are
to be used later by the VFIO driver (PCI pass-through).

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

iommu_put_tce_user_mode() maps only a single page; an API
for adding many mappings at once is going to be added
later.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables. As the h_put_tce hypercall is received by the host
kernel and processed by QEMU (which involves calling
the host kernel again), performance is not the best:
circa 220MB/s on a 10Gb ethernet network.

To enable VFIO on POWER, enable the SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h            |   15 ++
 arch/powerpc/kernel/iommu.c                 |  343 +++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c   |    1 +
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |    5 +-
 arch/powerpc/platforms/powernv/pci.c        |    3 +
 drivers/iommu/Kconfig                       |    8 +
 6 files changed, 374 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..900294b 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
 	struct iommu_pool large_pool;
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+	struct iommu_group *it_group;
+#endif
 };
 
 struct scatterlist;
@@ -98,6 +101,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+extern void iommu_register_group(struct iommu_table * tbl,
+				 int domain_number, unsigned long pe_num);
 
 extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 			struct scatterlist *sglist, int nelems,
@@ -147,5 +152,15 @@ static inline void iommu_restore(void)
 }
 #endif
 
+/* The API to support IOMMU operations for VFIO */
+extern long iommu_clear_tce_user_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages);
+extern long iommu_put_tce_user_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce);
+
+extern void iommu_flush_tce(struct iommu_table *tbl);
+extern long iommu_lock_table(struct iommu_table *tbl, bool lock);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7c309fe..b4fdabc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -37,6 +37,7 @@
 #include <linux/fault-inject.h>
 #include <linux/pci.h>
 #include <linux/kvm_host.h>
+#include <linux/iommu.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -45,6 +46,7 @@
 #include <asm/kdump.h>
 #include <asm/fadump.h>
 #include <asm/vio.h>
+#include <asm/tce.h>
 
 #define DBG(...)
 
@@ -707,11 +709,39 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
+static void group_release(void *iommu_data)
+{
+	struct iommu_table *tbl = iommu_data;
+	tbl->it_group = NULL;
+}
+
+void iommu_register_group(struct iommu_table * tbl,
+		int domain_number, unsigned long pe_num)
+{
+	struct iommu_group *grp;
+
+	grp = iommu_group_alloc();
+	if (IS_ERR(grp)) {
+		pr_info("powerpc iommu api: cannot create new group, err=%ld\n",
+				PTR_ERR(grp));
+		return;
+	}
+	tbl->it_group = grp;
+	iommu_group_set_iommudata(grp, tbl, group_release);
+	iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
+			domain_number, pe_num));
+}
+
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
 	unsigned long bitmap_sz;
 	unsigned int order;
 
+	if (tbl && tbl->it_group) {
+		iommu_group_put(tbl->it_group);
+		BUG_ON(tbl->it_group);
+	}
+
 	if (!tbl || !tbl->it_map) {
 		printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
 				node_name);
@@ -876,4 +906,317 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
 
+static enum dma_data_direction tce_direction(unsigned long tce)
+{
+	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
+		return DMA_BIDIRECTIONAL;
+	else if (tce & TCE_PCI_READ)
+		return DMA_TO_DEVICE;
+	else if (tce & TCE_PCI_WRITE)
+		return DMA_FROM_DEVICE;
+	else
+		return DMA_NONE;
+}
+
+void iommu_flush_tce(struct iommu_table *tbl)
+{
+	/* Flush/invalidate TLB caches if necessary */
+	if (ppc_md.tce_flush)
+		ppc_md.tce_flush(tbl);
+
+	/* Make sure updates are seen by hardware */
+	mb();
+}
+EXPORT_SYMBOL_GPL(iommu_flush_tce);
+
+static long tce_clear_param_check(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages)
+{
+	unsigned long size = npages << IOMMU_PAGE_SHIFT;
+
+	/* ppc_md.tce_free() does not support any value but 0 */
+	if (tce_value)
+		return -EINVAL;
+
+	if (ioba & ~IOMMU_PAGE_MASK)
+		return -EINVAL;
+
+	if ((ioba + size) > ((tbl->it_offset + tbl->it_size)
+			<< IOMMU_PAGE_SHIFT))
+		return -EINVAL;
+
+	if (ioba < (tbl->it_offset << IOMMU_PAGE_SHIFT))
+		return -EINVAL;
+
+	return 0;
+}
+
+static long tce_put_param_check(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce)
+{
+	if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		return -EINVAL;
+
+	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+		return -EINVAL;
+
+	if (ioba & ~IOMMU_PAGE_MASK)
+		return -EINVAL;
+
+	if ((ioba + IOMMU_PAGE_SIZE) > ((tbl->it_offset + tbl->it_size)
+			<< IOMMU_PAGE_SHIFT))
+		return -EINVAL;
+
+	if (ioba < (tbl->it_offset << IOMMU_PAGE_SHIFT))
+		return -EINVAL;
+
+	return 0;
+}
+
+static long clear_tce(struct iommu_table *tbl,
+		unsigned long entry, unsigned long pages)
+{
+	unsigned long oldtce;
+	struct page *page;
+	struct iommu_pool *pool;
+
+	for ( ; pages; --pages, ++entry) {
+		pool = get_pool(tbl, entry);
+		spin_lock(&(pool->lock));
+
+		oldtce = ppc_md.tce_get(tbl, entry);
+		if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
+			ppc_md.tce_free(tbl, entry, 1);
+
+			page = pfn_to_page(oldtce >> PAGE_SHIFT);
+			WARN_ON(!page);
+			if (page) {
+				if (oldtce & TCE_PCI_WRITE)
+					SetPageDirty(page);
+				put_page(page);
+			}
+		}
+		spin_unlock(&(pool->lock));
+	}
+
+	return 0;
+}
+
+long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (!ret)
+		ret = clear_tce(tbl, entry, npages);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tce_user_mode);
+
+/* hwaddr is a virtual address here, tce_build converts it to physical */
+static long do_tce_build(struct iommu_table *tbl, unsigned long entry,
+		unsigned long hwaddr, enum dma_data_direction direction)
+{
+	long ret = -EBUSY;
+	unsigned long oldtce;
+	struct iommu_pool *pool = get_pool(tbl, entry);
+
+	spin_lock(&(pool->lock));
+
+	oldtce = ppc_md.tce_get(tbl, entry);
+	/* Add new entry if it is not busy */
+	if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+		ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+
+	spin_unlock(&(pool->lock));
+
+	if (unlikely(ret))
+		pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%ld\n",
+				__func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
+				hwaddr, ret);
+
+	return ret;
+}
+
+static int put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
+		unsigned long tce)
+{
+	int ret;
+	struct page *page = NULL;
+	unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+			direction != DMA_TO_DEVICE, &page);
+	if (unlikely(ret != 1)) {
+		pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
+				tce, entry << IOMMU_PAGE_SHIFT, ret);
+		return -EFAULT;
+	}
+	hwaddr = (unsigned long) page_address(page) + offset;
+
+	ret = do_tce_build(tbl, entry, hwaddr, direction);
+	if (ret)
+		put_page(page);
+
+	return ret;
+}
+
+long iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_user_mode(tbl, entry, tce);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
+
+/*
+ * Helpers to do locked pages accounting.
+ * Called from ioctl so down_write_trylock is not necessary.
+ */
+static void lock_acct(long npage)
+{
+	if (!current->mm)
+		return; /* process exited */
+
+	down_write(&current->mm->mmap_sem);
+	current->mm->locked_vm += npage;
+	up_write(&current->mm->mmap_sem);
+}
+
+/*
+ * iommu_lock_table - Start/stop using the table by VFIO
+ * @tbl: Pointer to the IOMMU table
+ * @lock: true when VFIO starts using the table
+ */
+long iommu_lock_table(struct iommu_table *tbl, bool lock)
+{
+	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long locked, lock_limit;
+
+	if (lock) {
+		/*
+		 * Account for locked pages
+		 * The worst case is when every IOMMU page
+		 * is mapped to separate system page
+		 */
+		locked = current->mm->locked_vm + tbl->it_size;
+		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+		if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+			pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+					rlimit(RLIMIT_MEMLOCK));
+			return -ENOMEM;
+		}
+
+		if (tbl->it_offset == 0)
+			clear_bit(0, tbl->it_map);
+
+		if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+			pr_err("iommu_tce: it_map is not empty\n");
+			return -EBUSY;
+		}
+
+		lock_acct(tbl->it_size);
+		memset(tbl->it_map, 0xff, sz);
+	}
+
+	/* Clear TCE table */
+	clear_tce(tbl, tbl->it_offset, tbl->it_size);
+
+	if (!lock) {
+		lock_acct(-tbl->it_size);
+		memset(tbl->it_map, 0, sz);
+
+		/* Restore bit#0 set by iommu_init_table() */
+		if (tbl->it_offset == 0)
+			set_bit(0, tbl->it_map);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_lock_table);
+
+int iommu_add_device(struct device *dev)
+{
+	struct iommu_table *tbl;
+	int ret = 0;
+
+	if (WARN_ON(dev->iommu_group)) {
+		pr_warn("iommu_tce: device %s is already in iommu group %d, skipping\n",
+				dev_name(dev),
+				iommu_group_id(dev->iommu_group));
+		return -EBUSY;
+	}
+
+	tbl = get_iommu_table_base(dev);
+	if (!tbl) {
+		pr_debug("iommu_tce: skipping device %s with no tbl\n",
+				dev_name(dev));
+		return 0;
+	}
+
+	pr_debug("iommu_tce: adding %s to iommu group %d\n",
+			dev_name(dev), iommu_group_id(tbl->it_group));
+
+	ret = iommu_group_add_device(tbl->it_group, dev);
+	if (ret < 0)
+		pr_err("iommu_tce: %s has not been added, ret=%d\n",
+				dev_name(dev), ret);
+
+	return ret;
+}
+
+void iommu_del_device(struct device *dev)
+{
+	iommu_group_remove_device(dev);
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+			      unsigned long action, void *data)
+{
+	struct device *dev = data;
+
+	switch (action) {
+	case BUS_NOTIFY_ADD_DEVICE:
+		return iommu_add_device(dev);
+	case BUS_NOTIFY_DEL_DEVICE:
+		iommu_del_device(dev);
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+	.notifier_call = iommu_bus_notifier,
+};
+
+static int __init tce_iommu_init(void)
+{
+	BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
+
+	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+	return 0;
+}
+
+arch_initcall(tce_iommu_init);
+
 #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8e90e89..04dbc49 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -522,6 +522,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 			| TCE_PCI_SWINV_PAIR;
 	}
 	iommu_init_table(tbl, phb->hose->node);
+	iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);
 
 	if (pe->pdev)
 		set_iommu_table_base(&pe->pdev->dev, tbl);
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index abe6780..7ce75b0 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -87,8 +87,11 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
 static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
 					 struct pci_dev *pdev)
 {
-	if (phb->p5ioc2.iommu_table.it_map == NULL)
+	if (phb->p5ioc2.iommu_table.it_map == NULL) {
 		iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
+		iommu_register_group(&phb->p5ioc2.iommu_table,
+				pci_domain_nr(phb->hose->bus), phb->opal_id);
+	}
 
 	set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
 }
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index f60a188..d112701 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -20,6 +20,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/iommu.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -503,6 +504,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
 	pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
 				  be32_to_cpup(sizep), 0);
 	iommu_init_table(tbl, hose->node);
+	iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
 
 	/* Deal with SW invalidated TCEs when needed (BML way) */
 	swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
@@ -631,3 +633,4 @@ void __init pnv_pci_init(void)
 	ppc_md.teardown_msi_irqs = pnv_teardown_msi_irqs;
 #endif
 }
+
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e39f9db..ce6b186 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -187,4 +187,12 @@ config EXYNOS_IOMMU_DEBUG
 
 	  Say N unless you need kernel log message for IOMMU debugging
 
+config SPAPR_TCE_IOMMU
+	bool "sPAPR TCE IOMMU Support"
+	depends on PPC_POWERNV
+	select IOMMU_API
+	help
+	  Enables bits of IOMMU API required by VFIO. The iommu_ops is
+	  still not implemented.
+
 endif # IOMMU_SUPPORT
-- 
1.7.10.4
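
The checks done by tce_put_param_check() and tce_direction() in the
patch above can be tried out in userspace with the sketch below. The
constants are assumptions taken from the powerpc headers of that time
(4K IOMMU pages, TCE_PCI_READ = 0x1, TCE_PCI_WRITE = 0x2); struct
tce_window and the function names are hypothetical stand-ins for
struct iommu_table and the kernel functions.

```c
#include <errno.h>

#define IOMMU_PAGE_SHIFT	12			/* assumed: 4K IOMMU pages */
#define IOMMU_PAGE_SIZE		(1UL << IOMMU_PAGE_SHIFT)
#define IOMMU_PAGE_MASK		(~(IOMMU_PAGE_SIZE - 1))
#define TCE_PCI_READ		0x1UL			/* assumed value */
#define TCE_PCI_WRITE		0x2UL			/* assumed value */

/* DMA window, expressed in IOMMU pages like it_offset/it_size. */
struct tce_window {
	unsigned long it_offset;
	unsigned long it_size;
};

enum dir { DIR_NONE, DIR_TO_DEVICE, DIR_FROM_DEVICE, DIR_BIDIRECTIONAL };

/* Permission bits of a TCE select the DMA direction. */
static enum dir tce_dir(unsigned long tce)
{
	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
		return DIR_BIDIRECTIONAL;
	if (tce & TCE_PCI_READ)
		return DIR_TO_DEVICE;
	if (tce & TCE_PCI_WRITE)
		return DIR_FROM_DEVICE;
	return DIR_NONE;
}

/* Returns 0 if (ioba, tce) may be put into the window, -EINVAL otherwise. */
static long put_param_check(const struct tce_window *w,
			    unsigned long ioba, unsigned long tce)
{
	unsigned long start = w->it_offset << IOMMU_PAGE_SHIFT;
	unsigned long end = (w->it_offset + w->it_size) << IOMMU_PAGE_SHIFT;

	/* At least one permission bit must be set. */
	if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
		return -EINVAL;

	/* No stray bits below the page boundary apart from permissions. */
	if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
		return -EINVAL;

	/* The bus address must be IOMMU-page aligned. */
	if (ioba & ~IOMMU_PAGE_MASK)
		return -EINVAL;

	/* The whole page must lie inside the window. */
	if (ioba < start || ioba + IOMMU_PAGE_SIZE > end)
		return -EINVAL;

	return 0;
}
```

For a window with it_offset = 0 and it_size = 16, valid bus addresses
are the aligned values from 0x0000 to 0xf000; 0x10000 is already one
page past the end of the window.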


* [PATCH 0/4] powerpc iommu: extending real mode support
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The first 2 patches in this set add a multi-tce support feature
(adding/deleting several TCE records at once) in real and virtual mode.

The last 2 patches enable real mode acceleration for VFIO and
extend the multi-tce feature to be available for VFIO devices.

A QEMU change is required in order to support this functionality
(an additional ioctl to add a new LIOBN and hook it up to a specific IOMMU).


Alexey Kardashevskiy (4):
  powerpc: lookup_linux_pte has been made public
  powerpc kvm: added multiple TCEs requests support
  powerpc: preparing to support real mode optimization
  vfio powerpc: added real mode support

 arch/powerpc/include/asm/iommu.h         |   10 +
 arch/powerpc/include/asm/kvm_host.h      |    2 +
 arch/powerpc/include/asm/kvm_ppc.h       |   17 ++
 arch/powerpc/include/asm/pgtable-ppc64.h |    6 +
 arch/powerpc/include/uapi/asm/kvm.h      |    8 +
 arch/powerpc/kernel/iommu.c              |  253 ++++++++++++++++++-
 arch/powerpc/kvm/book3s_64_vio.c         |   55 ++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c      |  397 ++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv.c             |   23 ++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c        |   37 ++-
 arch/powerpc/kvm/powerpc.c               |   14 ++
 arch/powerpc/mm/init_64.c                |   56 ++++-
 include/uapi/linux/kvm.h                 |    2 +
 15 files changed, 852 insertions(+), 38 deletions(-)

-- 
1.7.10.4


* [PATCH 4/4] vfio powerpc: added real mode support
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360584763-21988-1-git-send-email-a>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The patch allows the host kernel to handle the H_PUT_TCE request
without involving QEMU, which should save time on switching
from the kernel to QEMU and back.

The patch adds an IOMMU ID parameter to the KVM_CAP_SPAPR_TCE ioctl;
QEMU needs to be fixed to support that.

At the moment H_PUT_TCE is processed in virtual mode, as the page
to be mapped may not be present in RAM, so paging may be involved,
and that can be done from virtual mode only.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/iommu.h    |   10 ++
 arch/powerpc/include/asm/kvm_host.h |    2 +
 arch/powerpc/include/asm/kvm_ppc.h  |    2 +
 arch/powerpc/include/uapi/asm/kvm.h |    8 ++
 arch/powerpc/kernel/iommu.c         |  253 +++++++++++++++++++++++++++++++++--
 arch/powerpc/kvm/book3s_64_vio.c    |   55 +++++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c |  186 +++++++++++++++++++++++--
 arch/powerpc/kvm/powerpc.c          |   11 ++
 include/uapi/linux/kvm.h            |    1 +
 9 files changed, 503 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 900294b..4a479e6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
+	struct list_head it_hugepages;
 #endif
 };
 
@@ -158,6 +159,15 @@ extern long iommu_clear_tce_user_mode(struct iommu_table *tbl,
 		unsigned long npages);
 extern long iommu_put_tce_user_mode(struct iommu_table *tbl,
 		unsigned long ioba, unsigned long tce);
+extern long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
+extern long iommu_clear_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce_value,
+		unsigned long npages);
+extern long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern long iommu_lock_table(struct iommu_table *tbl, bool lock);
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ca9bf45..6fb22f8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -178,6 +178,8 @@ struct kvmppc_spapr_tce_table {
 	struct kvm *kvm;
 	u64 liobn;
 	u32 window_size;
+	bool virtmode_only;
+	struct iommu_table *tbl;
 	struct page *pages[0];
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 76d133b..45c2a6c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -134,6 +134,8 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+				struct kvm_create_spapr_tce_iommu *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			     unsigned long ioba, unsigned long tce);
 extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 2fba8a6..9578696 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -291,6 +291,14 @@ struct kvm_create_spapr_tce {
 	__u32 window_size;
 };
 
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+	__u64 liobn;
+	__u32 iommu_id;
+#define SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY	1 /* for debug purposes */
+	__u32 flags;
+};
+
 /* for KVM_ALLOCATE_RMA */
 struct kvm_allocate_rma {
 	__u64 rma_size;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b4fdabc..acb9cdc 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,8 @@
 #include <asm/fadump.h>
 #include <asm/vio.h>
 #include <asm/tce.h>
+#include <asm/kvm_book3s_64.h>
+#include <asm/page.h>
 
 #define DBG(...)
 
@@ -727,6 +729,7 @@ void iommu_register_group(struct iommu_table * tbl,
 		return;
 	}
 	tbl->it_group = grp;
+	INIT_LIST_HEAD(&tbl->it_hugepages);
 	iommu_group_set_iommudata(grp, tbl, group_release);
 	iommu_group_set_name(grp, kasprintf(GFP_KERNEL, "domain%d-pe%lx",
 			domain_number, pe_num));
@@ -906,6 +909,83 @@ void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
 }
 
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct iommu_kvmppc_hugepages {
+	struct list_head list;
+	pte_t pte;		/* Huge page PTE */
+	unsigned long pa;	/* Base phys address used as a real TCE */
+	struct page *page;	/* page struct of the very first subpage */
+	unsigned long size;	/* Huge page size (always 16MB at the moment) */
+	bool dirty;		/* Dirty bit */
+};
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pte(struct iommu_table *tbl,
+		pte_t pte)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if (hp->pte == pte)
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *find_hp_by_pa(struct iommu_table *tbl,
+		unsigned long pa)
+{
+	struct iommu_kvmppc_hugepages *hp;
+
+	list_for_each_entry(hp, &tbl->it_hugepages, list) {
+		if ((hp->pa <= pa) && (pa < hp->pa + hp->size))
+			return hp;
+	}
+
+	return NULL;
+}
+
+static struct iommu_kvmppc_hugepages *add_hp(struct iommu_table *tbl,
+		pte_t pte, unsigned long va, unsigned long pg_size)
+{
+	int ret;
+	struct iommu_kvmppc_hugepages *hp;
+
+	hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+	if (!hp)
+		return NULL;
+
+	hp->pte = pte;
+	va = va & ~(pg_size - 1);
+	ret = get_user_pages_fast(va, 1, true/*write*/, &hp->page);
+	if ((ret != 1) || !hp->page) {
+		kfree(hp);
+		return NULL;
+	}
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+	hp->pa = __pa((unsigned long) page_address(hp->page));
+
+	hp->size = pg_size;
+
+	list_add(&hp->list, &tbl->it_hugepages);
+
+	return hp;
+}
+
 static enum dma_data_direction tce_direction(unsigned long tce)
 {
 	if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
@@ -974,14 +1054,16 @@ static long tce_put_param_check(struct iommu_table *tbl,
 	return 0;
 }
 
-static long clear_tce(struct iommu_table *tbl,
+static long clear_tce(struct iommu_table *tbl, bool realmode,
 		unsigned long entry, unsigned long pages)
 {
+	long ret = 0;
 	unsigned long oldtce;
 	struct page *page;
 	struct iommu_pool *pool;
+	struct iommu_kvmppc_hugepages *hp;
 
-	for ( ; pages; --pages, ++entry) {
+	for ( ; pages && !ret; --pages, ++entry) {
 		pool = get_pool(tbl, entry);
 		spin_lock(&(pool->lock));
 
@@ -989,12 +1071,32 @@ static long clear_tce(struct iommu_table *tbl,
 		if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) {
 			ppc_md.tce_free(tbl, entry, 1);
 
-			page = pfn_to_page(oldtce >> PAGE_SHIFT);
-			WARN_ON(!page);
-			if (page) {
+			/* Release of huge pages is postponed till KVM's exit */
+			hp = find_hp_by_pa(tbl, oldtce);
+			if (hp) {
 				if (oldtce & TCE_PCI_WRITE)
-					SetPageDirty(page);
-				put_page(page);
+					hp->dirty = true;
+			} else if (realmode) {
+				/* Release a small page in real mode */
+				page = vmemmap_pfn_to_page(
+						oldtce >> PAGE_SHIFT);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					ret = vmemmap_put_page(page);
+				} else {
+					/* Retry in virtual mode */
+					ret = -EAGAIN;
+				}
+			} else {
+				/* Release a small page in virtual mode */
+				page = pfn_to_page(oldtce >> PAGE_SHIFT);
+				WARN_ON(!page);
+				if (page) {
+					if (oldtce & TCE_PCI_WRITE)
+						SetPageDirty(page);
+					put_page(page);
+				}
 			}
 		}
 		spin_unlock(&(pool->lock));
@@ -1011,7 +1113,7 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 
 	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
 	if (!ret)
-		ret = clear_tce(tbl, entry, npages);
+		ret = clear_tce(tbl, false, entry, npages);
 
 	if (ret < 0)
 		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
@@ -1021,6 +1123,24 @@ long iommu_clear_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce_user_mode);
 
+long iommu_clear_tce_real_mode(struct iommu_table *tbl, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_clear_param_check(tbl, ioba, tce_value, npages);
+	if (!ret)
+		ret = clear_tce(tbl, true, entry, npages);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%ld\n",
+				__func__, ioba, tce_value, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tce_real_mode);
+
 /* hwaddr is a virtual address here, tce_build converts it to physical */
 static long do_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction)
@@ -1088,6 +1208,112 @@ long iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long ioba,
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
+static long put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	struct iommu_kvmppc_hugepages *hp;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* Small page size case, easy to handle... */
+	if (pg_size <= PAGE_SIZE)
+		return put_tce_user_mode(tbl, entry, tce);
+
+	/*
+	 * Hugepages case - manage the hugepage list.
+	 * find_hp_by_pte() may find a huge page if called
+	 * from h_put_tce_indirect call.
+	 */
+	hp = find_hp_by_pte(tbl, pte);
+	if (!hp) {
+		/* This is the first time usage of this huge page */
+		hp = add_hp(tbl, pte, tce, pg_size);
+		if (!hp)
+			return -EFAULT;
+	}
+
+	tce = (unsigned long) __va(hp->pa) + (tce & (pg_size - 1));
+
+	return do_tce_build(tbl, entry, tce, direction);
+}
+
+long iommu_put_tce_virt_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_virt_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_virt_mode);
+
+static long put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long entry, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	struct page *page = NULL;
+	struct iommu_kvmppc_hugepages *hp = NULL;
+	enum dma_data_direction direction = tce_direction(tce);
+
+	/* This is a huge page. We continue only if it is already in the list */
+	if (pg_size > PAGE_SIZE) {
+		hp = find_hp_by_pte(tbl, pte);
+
+		/* Go to virt mode to add a hugepage to the list if not found */
+		if (!hp)
+			return -EAGAIN;
+
+		/* tce_build accepts virtual addresses */
+		return do_tce_build(tbl, entry, (unsigned long) __va(tce),
+				direction);
+	}
+
+	/* Small page case, find page struct to increment a counter */
+	page = vmemmap_pfn_to_page(tce >> PAGE_SHIFT);
+	if (!page)
+		return -EAGAIN;
+
+	ret = vmemmap_get_page(page);
+	if (ret)
+		return ret;
+
+	/* tce_build accepts virtual addresses */
+	ret = do_tce_build(tbl, entry, (unsigned long) __va(tce), direction);
+	if (ret)
+		vmemmap_put_page(page);
+
+	return ret;
+}
+
+long iommu_put_tce_real_mode(struct iommu_table *tbl,
+		unsigned long ioba, unsigned long tce,
+		pte_t pte, unsigned long pg_size)
+{
+	long ret;
+	unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+	ret = tce_put_param_check(tbl, ioba, tce);
+	if (!ret)
+		ret = put_tce_real_mode(tbl, entry, tce, pte, pg_size);
+
+	if (ret < 0)
+		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n",
+				__func__, ioba, tce, ret);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_real_mode);
+
 /*
  * Helpers to do locked pages accounting.
  * Called from ioctl so down_write_trylock is not necessary.
@@ -1111,6 +1337,7 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 {
 	unsigned long sz = (tbl->it_size + 7) >> 3;
 	unsigned long locked, lock_limit;
+	struct iommu_kvmppc_hugepages *hp, *tmp;
 
 	if (lock) {
 		/*
@@ -1139,9 +1366,17 @@ long iommu_lock_table(struct iommu_table *tbl, bool lock)
 	}
 
 	/* Clear TCE table */
-	clear_tce(tbl, tbl->it_offset, tbl->it_size);
+	clear_tce(tbl, false, tbl->it_offset, tbl->it_size);
 
 	if (!lock) {
+		list_for_each_entry_safe(hp, tmp, &tbl->it_hugepages, list) {
+			list_del(&hp->list);
+			if (hp->dirty)
+				SetPageDirty(hp->page);
+			put_page(hp->page);
+			kfree(hp);
+		}
+
 		lock_acct(-tbl->it_size);
 		memset(tbl->it_map, 0, sz);
 
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 72ffc89..c3c29a0 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -26,6 +26,8 @@
 #include <linux/hugetlb.h>
 #include <linux/list.h>
 #include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -36,6 +38,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
@@ -52,8 +55,10 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
 
 	mutex_lock(&kvm->lock);
 	list_del(&stt->list);
-	for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
-		__free_page(stt->pages[i]);
+	if (!stt->tbl) {
+		for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+			__free_page(stt->pages[i]);
+	}
 	kfree(stt);
 	mutex_unlock(&kvm->lock);
 
@@ -148,3 +153,49 @@ fail:
 	}
 	return ret;
 }
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+		struct kvm_create_spapr_tce_iommu *args)
+{
+	struct kvmppc_spapr_tce_table *stt = NULL;
+	struct pci_dev *pdev = NULL;
+
+	/* Check this LIOBN hasn't been previously allocated */
+	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == args->liobn)
+			return -EBUSY;
+	}
+
+	stt = kzalloc(sizeof(*stt), GFP_KERNEL);
+	if (!stt)
+		return -ENOMEM;
+
+	stt->liobn = args->liobn;
+	stt->kvm = kvm;
+	stt->virtmode_only = !!(args->flags & SPAPR_TCE_PUT_TCE_VIRTMODE_ONLY);
+
+	/* Find an IOMMU table for the given ID */
+	for_each_pci_dev(pdev) {
+		struct iommu_table *tbl;
+
+		tbl = get_iommu_table_base(&pdev->dev);
+		if (!tbl)
+			continue;
+		if (iommu_group_id(tbl->it_group) != args->iommu_id)
+			continue;
+
+		stt->tbl = tbl;
+		pr_info("LIOBN=%llX hooked to IOMMU %d, virtmode_only=%u\n",
+				stt->liobn, args->iommu_id, stt->virtmode_only);
+		break;
+	}
+
+	kvm_get_kvm(kvm);
+
+	mutex_lock(&kvm->lock);
+	list_add(&stt->list, &kvm->arch.spapr_tce_tables);
+
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c38edcd..b2aa957 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -171,6 +171,7 @@ static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
+	long ret;
 	struct kvmppc_spapr_tce_table *stt;
 
 	stt = find_tce_table(vcpu, liobn);
@@ -178,8 +179,37 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	return emulated_h_put_tce(stt, ioba, tce);
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_real_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -195,15 +225,43 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
 	if (!tces)
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte = 0;
+
+		hpa = get_real_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_real_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
@@ -218,11 +276,28 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
 	if (!stt)
 		return H_TOO_HARD;
 
+	if (stt->virtmode_only)
+		return H_TOO_HARD;
+
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tce_value);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_real_mode(stt->tbl, ioba, tce_value, npages);
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 /*
@@ -232,8 +307,42 @@ extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	long ret;
+	struct kvmppc_spapr_tce_table *stt;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl)
+		return emulated_h_put_tce(stt, ioba, tce);
+
+	/* VFIO IOMMU */
+	if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tce, tce & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl, ioba, hpa,
+				pte, pg_size);
+	} else {
+		ret = iommu_clear_tce_user_mode(stt->tbl, ioba, 0, 1);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
@@ -254,16 +363,65 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 		return H_TOO_HARD;
 
 	/* Emulated IO */
-	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
-		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE) {
+		unsigned long hpa, pg_size = 0;
+		pte_t pte;
+
+		hpa = get_virt_address(vcpu, tces[i], tces[i] & TCE_PCI_WRITE,
+				&pte, &pg_size);
+		if (!hpa)
+			return H_TOO_HARD;
+
+		ret = iommu_put_tce_virt_mode(stt->tbl,
+				ioba, hpa, pte, pg_size);
+	}
+	iommu_flush_tce(stt->tbl);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
 
-	return ret;
+	return H_SUCCESS;
 }
 
 extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
 		unsigned long liobn, unsigned long ioba,
 		unsigned long tce_value, unsigned long npages)
 {
-	/* At the moment emulated IO is handled the same way */
-	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
+	struct kvmppc_spapr_tce_table *stt;
+	long ret = 0, i;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	if (!stt->tbl) {
+		for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+			ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+		return ret;
+	}
+
+	/* VFIO IOMMU */
+	ret = iommu_clear_tce_user_mode(stt->tbl, ioba, tce_value, npages);
+
+	if (ret == -EAGAIN)
+		return H_TOO_HARD;
+
+	if (ret < 0)
+		return H_PARAMETER;
+
+	return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95614c7..beceb90 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -934,6 +934,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
 		goto out;
 	}
+	case KVM_CREATE_SPAPR_TCE_IOMMU: {
+		struct kvm_create_spapr_tce_iommu create_tce_iommu;
+		struct kvm *kvm = filp->private_data;
+
+		r = -EFAULT;
+		if (copy_from_user(&create_tce_iommu, argp,
+				sizeof(create_tce_iommu)))
+			goto out;
+		r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+		goto out;
+	}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 26e2b271..3727ea6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -863,6 +863,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_PPC_HTAB_FD */
 #define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, struct kvm_create_spapr_tce_iommu)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/4] powerpc: preparing to support real mode optimization
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360584763-21988-1-git-send-email-a>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is too slow for fast hardware, so the
mapping is better handled in real mode.

The patch adds an API to increment/decrement the page reference
counter, as the get_user_pages API used for user mode mapping does
not work in real mode.
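
A minimal sketch of the idea, not the kernel code itself: the
real-mode helpers only touch the counter of a plain page and return
-EAGAIN for anything they cannot handle safely, so the caller can
retry in virtual mode. The struct mock_page type and the mock_*
function names are hypothetical stand-ins for illustration, and the
single "compound" flag is a simplification of the page flag checks
a real implementation would make.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct page; illustration only. */
struct mock_page {
	int refcount;	/* models the page reference counter */
	bool compound;	/* true if the page belongs to a huge page */
};

/* Take a reference in "real mode"; defer huge pages to virtual mode. */
static long mock_realmode_get_page(struct mock_page *page)
{
	if (page->compound)
		return -EAGAIN;	/* caller should retry in virtual mode */

	page->refcount++;
	return 0;
}

/* Drop a reference in "real mode"; same restriction as above. */
static long mock_realmode_put_page(struct mock_page *page)
{
	if (page->compound)
		return -EAGAIN;	/* caller should retry in virtual mode */

	page->refcount--;
	return 0;
}
```

A caller would treat -EAGAIN as "too hard for real mode" and fall
back to the slower virtual mode path rather than as a hard failure.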

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    3 ++
 arch/powerpc/mm/init_64.c                |   56 +++++++++++++++++++++++++++++-
 2 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ddcc898..b7a1fb2 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn);
+long vmemmap_get_page(struct page *page);
+long vmemmap_put_page(struct page *page);
 pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 		int writing, unsigned long *pte_sizep);
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..068e9e9 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,59 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct vmemmap_backing *vmem_back;
+	struct page *page;
+	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+	unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+	for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+		if (pg_va < vmem_back->virt_addr)
+			continue;
+
+		/* Check that page struct is not split between real pages */
+		if ((pg_va + sizeof(struct page)) >
+				(vmem_back->virt_addr + page_size))
+			return NULL;
+
+		page = (struct page *) (vmem_back->phys + pg_va -
+				vmem_back->virt_addr);
+		return page;
+	}
+
+	return NULL;
+}
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *vmemmap_pfn_to_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	return page;
+}
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+long vmemmap_get_page(struct page *page)
+{
+	if (PageTail(page))
+		return -EAGAIN;
+
+	get_page(page);
+
+	return 0;
+}
+
+long vmemmap_put_page(struct page *page)
+{
+	if (PageCompound(page))
+		return -EAGAIN;
+
+	put_page(page);
+
+	return 0;
+}
+#endif
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360584763-21988-1-git-send-email-a>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The patch adds real mode handlers for H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.
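
The batching can be sketched roughly as follows. This is an
illustration under stated assumptions, not the handler code: the
mock_* names, the table size, the 4K IOMMU page shift and the
return code values are all placeholders chosen for the example.
One call stores several entries and stops at the first failure.

```c
#include <stdint.h>

#define MOCK_TCE_SHIFT		12	/* assumed 4K IOMMU page */
#define MOCK_TABLE_SIZE		512	/* entries in the example table */
#define MOCK_H_SUCCESS		0	/* placeholder return codes */
#define MOCK_H_PARAMETER	(-4)

static uint64_t mock_table[MOCK_TABLE_SIZE];

/* Store one entry; reject bus addresses outside the window. */
static long mock_put_tce(unsigned long ioba, uint64_t tce)
{
	unsigned long idx = ioba >> MOCK_TCE_SHIFT;

	if (idx >= MOCK_TABLE_SIZE)
		return MOCK_H_PARAMETER;

	mock_table[idx] = tce;
	return MOCK_H_SUCCESS;
}

/* Indirect variant: a list of entries, one per IOMMU page. */
static long mock_put_tce_indirect(unsigned long ioba,
		const uint64_t *tces, unsigned long npages)
{
	long ret = MOCK_H_SUCCESS;
	unsigned long i;

	for (i = 0; (i < npages) && !ret; ++i, ioba += 1UL << MOCK_TCE_SHIFT)
		ret = mock_put_tce(ioba, tces[i]);

	return ret;
}

/* Stuff variant: the same value written to a range of entries. */
static long mock_stuff_tce(unsigned long ioba,
		uint64_t tce_value, unsigned long npages)
{
	long ret = MOCK_H_SUCCESS;
	unsigned long i;

	for (i = 0; (i < npages) && !ret; ++i, ioba += 1UL << MOCK_TCE_SHIFT)
		ret = mock_put_tce(ioba, tce_value);

	return ret;
}
```

Note that entries stored before a failure stay in the table; the
sketch makes no attempt to roll them back.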

The patch adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.
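
As a rough illustration of the last step of that conversion,
assuming the host page frame number and page size have already been
looked up from the host page table: the host real address is the
page base plus the offset of the guest physical address within a
page of that size. The mock_gpa_to_hpa name and MOCK_PAGE_SHIFT
value are hypothetical, and the page frame number is assumed to be
aligned to the page size.

```c
#define MOCK_PAGE_SHIFT	12	/* assumed 4K system page */

/*
 * pfn:     host page frame number taken from the PTE
 * gpa:     guest physical address being translated
 * pg_size: size of the backing page in bytes (power of two)
 */
static unsigned long mock_gpa_to_hpa(unsigned long pfn,
		unsigned long gpa, unsigned long pg_size)
{
	/* Offset of the address within the (possibly huge) page */
	unsigned long offset = gpa & (pg_size - 1);

	return (pfn << MOCK_PAGE_SHIFT) + offset;
}
```

For a 16MB page the offset keeps the low 24 bits of the guest
address, so a single huge page translation covers many 4K TCEs.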

The patch also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property if the capability is present.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/include/asm/kvm_ppc.h      |   15 ++
 arch/powerpc/kvm/book3s_64_vio_hv.c     |  241 ++++++++++++++++++++++++++++---
 arch/powerpc/kvm/book3s_hv.c            |   23 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |    6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |   37 ++++-
 arch/powerpc/kvm/powerpc.c              |    3 +
 include/uapi/linux/kvm.h                |    1 +
 7 files changed, 301 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 572aa75..76d133b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -136,6 +136,21 @@ extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce *args);
 extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 			     unsigned long ioba, unsigned long tce);
+extern long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
 				struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c38edcd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
  *
  * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
  * Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
  */
 
 #include <linux/types.h>
@@ -25,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
 #include <linux/list.h>
+#include <linux/kvm_host.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -35,42 +37,233 @@
 #include <asm/ppc-opcode.h>
 #include <asm/kvm_host.h>
 #include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
 
 #define TCES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
 
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- *          mode on PR KVM
+static struct kvmppc_spapr_tce_table *find_tce_table(struct kvm_vcpu *vcpu,
+		unsigned long liobn)
+{
+	struct kvmppc_spapr_tce_table *stt;
+
+	list_for_each_entry(stt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+		if (stt->liobn == liobn)
+			return stt;
+	}
+
+	return NULL;
+}
+
+/*
+ * Converts guest physical address into host virtual
+ * which is to be used later in get_user_pages_fast().
+ */
+static unsigned long get_virt_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot;
+
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return 0;
+
+	/*
+	 * Convert gfn to hva preserving flags and an offset
+	 * within a system page
+	 */
+	hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+	/* Find out the page pte and size if requested */
+	if (ptep && pg_sizep) {
+		pte_t pte;
+		unsigned long pg_size = 0;
+
+		pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+				writing, &pg_size);
+		if (!pte_present(pte))
+			return 0;
+
+		*pg_sizep = pg_size;
+		*ptep = pte;
+	}
+
+	return hva;
+}
+
+/*
+ * Converts guest physical address into host real address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long get_real_address(struct kvm_vcpu *vcpu,
+		unsigned long gpa, bool writing,
+		pte_t *ptep, unsigned long *pg_sizep)
+{
+	struct kvm_memory_slot *memslot;
+	pte_t pte;
+	unsigned long hva, pg_size = 0, hwaddr, offset;
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+
+	/* Find a KVM memslot */
+	memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+	if (!memslot)
+		return 0;
+
+	/* Convert guest physical address to host virtual */
+	hva = __gfn_to_hva_memslot(memslot, gfn);
+
+	/* Find a PTE and determine the size */
+	pte = lookup_linux_pte(vcpu->arch.pgdir, hva,
+			writing, &pg_size);
+	if (!pte_present(pte))
+		return 0;
+
+	/* Calculate host phys address keeping flags and offset in the page */
+	offset = gpa & (pg_size - 1);
+
+	/* pte_pfn(pte) should return an address aligned to pg_size */
+	hwaddr = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+	/* Copy outer values if required */
+	if (pg_sizep)
+		*pg_sizep = pg_size;
+	if (ptep)
+		*ptep = pte;
+
+	return hwaddr;
+}
+
+/*
+ * emulated_h_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Works in both real and virtual modes.
+ */
+static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+		unsigned long ioba, unsigned long tce)
+{
+	unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+	struct page *page;
+	u64 *tbl;
+
+	/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
+	/* 	    liobn, stt, stt->window_size); */
+	if (ioba >= stt->window_size) {
+		pr_err("%s failed on ioba=%lx\n", __func__, ioba);
+		return H_PARAMETER;
+	}
+
+	page = stt->pages[idx / TCES_PER_PAGE];
+	tbl = (u64 *)page_address(page);
+
+	/* FIXME: Need to validate the TCE itself */
+	/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+	tbl[idx % TCES_PER_PAGE] = tce;
+
+	return H_SUCCESS;
+}
+
+/*
+ * Real mode handlers
  */
 long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 		      unsigned long ioba, unsigned long tce)
 {
-	struct kvm *kvm = vcpu->kvm;
 	struct kvmppc_spapr_tce_table *stt;
 
-	/* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
-	/* 	    liobn, ioba, tce); */
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	return emulated_h_put_tce(stt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list,	unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	long i, ret = 0;
+	unsigned long *tces;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
 
-	list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
-		if (stt->liobn == liobn) {
-			unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
-			struct page *page;
-			u64 *tbl;
+	tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
+	if (!tces)
+		return H_TOO_HARD;
 
-			/* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p  window_size=0x%x\n", */
-			/* 	    liobn, stt, stt->window_size); */
-			if (ioba >= stt->window_size)
-				return H_PARAMETER;
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tces[i]);
 
-			page = stt->pages[idx / TCES_PER_PAGE];
-			tbl = (u64 *)page_address(page);
+	return ret;
+}
 
-			/* FIXME: Need to validate the TCE itself */
-			/* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
-			tbl[idx % TCES_PER_PAGE] = tce;
-			return H_SUCCESS;
-		}
-	}
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	long i, ret = 0;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
 
-	/* Didn't find the liobn, punt it to userspace */
-	return H_TOO_HARD;
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tce_value);
+
+	return ret;
+}
+
+/*
+ * Virtual mode handlers
+ */
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce)
+{
+	/* At the moment emulated IO is handled the same way */
+	return kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+}
+
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_list, unsigned long npages)
+{
+	struct kvmppc_spapr_tce_table *stt;
+	unsigned long *tces;
+	long ret = 0, i;
+
+	stt = find_tce_table(vcpu, liobn);
+	/* Didn't find the liobn, punt it to userspace */
+	if (!stt)
+		return H_TOO_HARD;
+
+	tces = (void *) get_virt_address(vcpu, tce_list, false, NULL, NULL);
+	if (!tces)
+		return H_TOO_HARD;
+
+	/* Emulated IO */
+	for (i = 0; (i < npages) && !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+		ret = emulated_h_put_tce(stt, ioba, tces[i]);
+
+	return ret;
+}
+
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+		unsigned long liobn, unsigned long ioba,
+		unsigned long tce_value, unsigned long npages)
+{
+	/* At the moment emulated IO is handled the same way */
+	return kvmppc_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 71d0c90..13c8436 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -515,6 +515,29 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 					kvmppc_get_gpr(vcpu, 5),
 					kvmppc_get_gpr(vcpu, 6));
 		break;
+	case H_PUT_TCE:
+		ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_PUT_TCE_INDIRECT:
+		ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
+	case H_STUFF_TCE:
+		ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+					        kvmppc_get_gpr(vcpu, 5),
+					        kvmppc_get_gpr(vcpu, 6),
+						kvmppc_get_gpr(vcpu, 7));
+		if (ret == H_TOO_HARD)
+			return RESUME_HOST;
+		break;
 	default:
 		return RESUME_HOST;
 	}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 10b6c35..0826e8b 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1390,6 +1390,12 @@ hcall_real_table:
 	.long	0		/* 0x11c */
 	.long	0		/* 0x120 */
 	.long	.kvmppc_h_bulk_remove - hcall_real_table
+	.long	0		/* 0x128 */
+	.long	0		/* 0x12c */
+	.long	0		/* 0x130 */
+	.long	0		/* 0x134 */
+	.long	.kvmppc_h_stuff_tce - hcall_real_table
+	.long	.kvmppc_h_put_tce_indirect - hcall_real_table
 hcall_real_table_end:
 
 ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index ee02b30..270e88e 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
 	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
 	long rc;
 
-	rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+	rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+			tce, npages);
+	if (rc == H_TOO_HARD)
+		return EMULATE_FAIL;
+	kvmppc_set_gpr(vcpu, 3, rc);
+	return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+	unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+	unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+	unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+	unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+	long rc;
+
+	rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
 	if (rc == H_TOO_HARD)
 		return EMULATE_FAIL;
 	kvmppc_set_gpr(vcpu, 3, rc);
@@ -240,6 +271,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 		return kvmppc_h_pr_bulk_remove(vcpu);
 	case H_PUT_TCE:
 		return kvmppc_h_pr_put_tce(vcpu);
+	case H_PUT_TCE_INDIRECT:
+		return kvmppc_h_pr_put_tce_indirect(vcpu);
+	case H_STUFF_TCE:
+		return kvmppc_h_pr_stuff_tce(vcpu);
 	case H_CEDE:
 		vcpu->arch.shared->msr |= MSR_EE;
 		kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 70739a0..95614c7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -383,6 +383,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_PPC_MULTITCE:
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6e5d4b..26e2b271 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -635,6 +635,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQFD_RESAMPLE 82
 #define KVM_CAP_PPC_BOOKE_WATCHDOG 83
 #define KVM_CAP_PPC_HTAB_FD 84
+#define KVM_CAP_PPC_MULTITCE 87
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.7.10.4
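
The multi-TCE handlers in this patch all share one loop shape: write one
entry per IOMMU page and stop at the first failure. Below is a minimal
userspace sketch of that shape. Every sketch_* name, the 4K page size and
the return codes are hypothetical stand-ins chosen for illustration; they
are not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative stand-ins only; none of these are kernel definitions. */
#define SKETCH_PAGE_SHIFT 12UL                      /* 4K IOMMU page assumed */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_NENTRIES   8UL
#define SKETCH_OK         0L
#define SKETCH_BAD_PARAM  (-1L)

struct sketch_tce_table {
	unsigned long window_size;          /* bytes of bus address space */
	uint64_t entries[SKETCH_NENTRIES];  /* one entry per IOMMU page */
};

/* Store a single entry, rejecting addresses outside the window. */
static long sketch_put_one(struct sketch_tce_table *t,
			   unsigned long ioba, uint64_t tce)
{
	if (ioba >= t->window_size)
		return SKETCH_BAD_PARAM;
	t->entries[ioba >> SKETCH_PAGE_SHIFT] = tce;
	return SKETCH_OK;
}

/*
 * Store npages entries taken from a list, advancing the bus address by
 * one page per entry. The loop ends on the first error, so entries
 * written before the failure stay in place.
 */
static long sketch_put_many(struct sketch_tce_table *t, unsigned long ioba,
			    const uint64_t *tces, unsigned long npages)
{
	long ret = SKETCH_OK;
	unsigned long i;

	for (i = 0; i < npages && !ret; ++i, ioba += SKETCH_PAGE_SIZE)
		ret = sketch_put_one(t, ioba, tces[i]);

	return ret;
}
```

One consequence worth noting: a request that runs off the end of the
window fails part-way, and the caller sees an error even though the
leading entries were already updated.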

^ permalink raw reply related

* [PATCH 1/4] powerpc: lookup_linux_pte has been made public
From: aik @ 2013-02-11 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360584763-21988-1-git-send-email-a>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

The lookup_linux_pte() function returns a Linux PTE, which
is required to convert a KVM guest physical address into a host real
address in real mode.

This conversion will be used by the upcoming support for H_PUT_TCE_INDIRECT,
as the TCE list address comes directly from the guest and is therefore a
guest physical address.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |    4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0182c20..ddcc898 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -377,6 +377,9 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 }
 #endif /* !CONFIG_HUGETLB_PAGE */
 
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		int writing, unsigned long *pte_sizep);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 19c93ba..6a042d0 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -145,8 +145,8 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
-			      int writing, unsigned long *pte_sizep)
+pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
+		       int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/2] vfio powerpc: implemented IOMMU driver for VFIO
From: Alexey Kardashevskiy @ 2013-02-11 11:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
	Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1360583672-21924-1-git-send-email-aik@ozlabs.ru>

VFIO implements the platform-independent parts, such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform-dependent part includes IOMMU initialization
and handling. This patch implements an IOMMU driver for VFIO
which maps and unmaps pages for guest IO and
provides information about the DMA window (required by a POWERPC
guest).

A counterpart in QEMU is required to support this functionality.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 drivers/vfio/Kconfig                |    6 +
 drivers/vfio/Makefile               |    1 +
 drivers/vfio/vfio_iommu_spapr_tce.c |  269 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h           |   31 ++++
 4 files changed, 307 insertions(+)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
 	depends on VFIO
 	default n
 
+config VFIO_IOMMU_SPAPR_TCE
+	tristate
+	depends on VFIO && SPAPR_TCE_IOMMU
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
+	select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 0000000..9b3fa88
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,269 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2012 IBM Corp.  All rights reserved.
+ *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "aik@ozlabs.ru"
+#define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ *
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the API as the container is not supplied with the IOMMU group
+ * at the moment of initialization.
+ */
+struct tce_container {
+	struct mutex lock;
+	struct iommu_table *tbl;
+};
+
+static void *tce_iommu_open(unsigned long arg)
+{
+	struct tce_container *container;
+
+	if (arg != VFIO_SPAPR_TCE_IOMMU) {
+		pr_err("tce_vfio: Wrong IOMMU type\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	if (!container)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&container->lock);
+
+	return container;
+}
+
+static void tce_iommu_release(void *iommu_data)
+{
+	struct tce_container *container = iommu_data;
+
+	WARN_ON(container->tbl && !container->tbl->it_group);
+	if (container->tbl && container->tbl->it_group)
+		tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+
+	mutex_destroy(&container->lock);
+
+	kfree(container);
+}
+
+static long tce_iommu_ioctl(void *iommu_data,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct tce_container *container = iommu_data;
+	unsigned long minsz;
+	long ret;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+		return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
+
+	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
+		struct vfio_iommu_spapr_tce_info info;
+		struct iommu_table *tbl = container->tbl;
+
+		if (WARN_ON(!tbl))
+			return -ENXIO;
+
+		minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+				dma32_window_size);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
+		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+		info.flags = 0;
+
+		if (copy_to_user((void __user *)arg, &info, minsz))
+			return -EFAULT;
+
+		return 0;
+	}
+	case VFIO_IOMMU_MAP_DMA: {
+		vfio_iommu_spapr_tce_dma_map param;
+		struct iommu_table *tbl = container->tbl;
+		unsigned long tce;
+
+		if (WARN_ON(!tbl))
+			return -ENXIO;
+
+		minsz = offsetofend(vfio_iommu_spapr_tce_dma_map, size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		if (param.flags & ~(VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE))
+			return -EINVAL;
+
+		if ((param.size & ~IOMMU_PAGE_MASK) ||
+				(param.vaddr & ~IOMMU_PAGE_MASK))
+			return -EINVAL;
+
+		/* TODO: support multiple TCEs */
+		if (param.size != IOMMU_PAGE_SIZE) {
+			pr_err("VFIO map on POWER supports only %lu page size\n",
+					IOMMU_PAGE_SIZE);
+			return -EINVAL;
+		}
+
+		/* iova is checked by the IOMMU API */
+		tce = param.vaddr;
+		if (param.flags & VFIO_DMA_MAP_FLAG_READ)
+			tce |= TCE_PCI_READ;
+		if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+			tce |= TCE_PCI_WRITE;
+
+		ret = iommu_put_tce_user_mode(tbl, param.iova, tce);
+		iommu_flush_tce(tbl);
+
+		return ret;
+	}
+	case VFIO_IOMMU_UNMAP_DMA: {
+		vfio_iommu_spapr_tce_dma_unmap param;
+		struct iommu_table *tbl = container->tbl;
+
+		if (WARN_ON(!tbl))
+			return -ENXIO;
+
+		minsz = offsetofend(vfio_iommu_spapr_tce_dma_unmap, size);
+
+		if (copy_from_user(&param, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (param.argsz < minsz)
+			return -EINVAL;
+
+		/* No flag is supported now */
+		if (param.flags)
+			return -EINVAL;
+
+		if (param.size & ~IOMMU_PAGE_MASK)
+			return -EINVAL;
+
+		/* iova is checked by the IOMMU API */
+		ret = iommu_clear_tce_user_mode(tbl, param.iova, 0,
+				param.size >> IOMMU_PAGE_SHIFT);
+		iommu_flush_tce(tbl);
+
+		return ret;
+	}
+	}
+
+	return -ENOTTY;
+}
+
+static int tce_iommu_attach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	int ret;
+	struct tce_container *container = iommu_data;
+	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+	BUG_ON(!tbl);
+	mutex_lock(&container->lock);
+	pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
+			iommu_group_id(iommu_group), iommu_group);
+	if (container->tbl) {
+		pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
+				iommu_group_id(container->tbl->it_group),
+				iommu_group_id(iommu_group));
+		mutex_unlock(&container->lock);
+		return -EBUSY;
+	}
+
+	container->tbl = tbl;
+	ret = iommu_lock_table(tbl, true);
+	mutex_unlock(&container->lock);
+
+	return ret;
+}
+
+static void tce_iommu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct tce_container *container = iommu_data;
+	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+	BUG_ON(!tbl);
+	mutex_lock(&container->lock);
+	if (tbl != container->tbl) {
+		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
+				iommu_group_id(iommu_group),
+				iommu_group_id(tbl->it_group));
+	} else {
+
+		pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+				iommu_group_id(iommu_group), iommu_group);
+
+		container->tbl = NULL;
+		iommu_lock_table(tbl, false);
+	}
+	mutex_unlock(&container->lock);
+}
+
+const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
+	.name		= "iommu-vfio-powerpc",
+	.owner		= THIS_MODULE,
+	.open		= tce_iommu_open,
+	.release	= tce_iommu_release,
+	.ioctl		= tce_iommu_ioctl,
+	.attach_group	= tce_iommu_attach_group,
+	.detach_group	= tce_iommu_detach_group,
+};
+
+static int __init tce_iommu_init(void)
+{
+	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
+}
+
+static void __exit tce_iommu_cleanup(void)
+{
+	vfio_unregister_iommu_driver(&tce_iommu_driver_ops);
+}
+
+module_init(tce_iommu_init);
+module_exit(tce_iommu_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4758d1b..ea9a9a7 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -22,6 +22,7 @@
 /* Extensions */
 
 #define VFIO_TYPE1_IOMMU		1
+#define VFIO_SPAPR_TCE_IOMMU		2
 
 /*
  * The IOCTL interface is designed for extensibility by embedding the
@@ -365,4 +366,34 @@ struct vfio_iommu_type1_dma_unmap {
 
 #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
 
+/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
+
+/*
+ * The SPAPR TCE info struct provides the information about the PCI bus
+ * address ranges available for DMA, these values are programmed into
+ * the hardware so the guest has to know that information.
+ *
+ * The DMA 32 bit window start is an absolute PCI bus address.
+ * The IOVA address passed via map/unmap ioctls are absolute PCI bus
+ * addresses too so the window works as a filter rather than an offset
+ * for IOVA addresses.
+ *
+ * A flag will need to be added if other page sizes are supported,
+ * so as defined here, it is always 4k.
+ */
+struct vfio_iommu_spapr_tce_info {
+	__u32 argsz;
+	__u32 flags;			/* reserved for future use */
+	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
+	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+};
+
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/* Reuse type1 map/unmap structs as they are the same at the moment */
+typedef struct vfio_iommu_type1_dma_map vfio_iommu_spapr_tce_dma_map;
+typedef struct vfio_iommu_type1_dma_unmap vfio_iommu_spapr_tce_dma_unmap;
+
+/* ***************************************************************** */
+
 #endif /* _UAPIVFIO_H */
-- 
1.7.10.4
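
The map ioctl in this patch validates its argument in a fixed order:
copy the fixed-size prefix, check argsz against it, reject unknown flags,
then check alignment and the single-page limit. Below is a minimal
userspace sketch of that order. The sketch_* names are hypothetical,
memcpy() stands in for copy_from_user(), the 4K page size is assumed, and
the struct layout is only assumed to resemble the type1 map struct that
the patch reuses.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the first byte after MEMBER; a local stand-in for offsetofend(). */
#define sketch_offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

#define SKETCH_PAGE_SIZE  4096UL        /* 4K IOMMU page assumed */
#define SKETCH_FLAG_READ  (1u << 0)
#define SKETCH_FLAG_WRITE (1u << 1)

/* Hypothetical request layout used only by this sketch. */
struct sketch_dma_map {
	uint32_t argsz;   /* size the caller claims to have passed */
	uint32_t flags;
	uint64_t vaddr;
	uint64_t iova;
	uint64_t size;
};

/* Returns 0 if the request is acceptable, -EINVAL otherwise. */
static int sketch_check_map(const void *user_buf, struct sketch_dma_map *out)
{
	size_t minsz = sketch_offsetofend(struct sketch_dma_map, size);

	/* Only the fields this code understands are copied. */
	memcpy(out, user_buf, minsz);

	/* The caller must have supplied at least those fields. */
	if (out->argsz < minsz)
		return -EINVAL;

	/* Unknown flag bits are rejected rather than ignored. */
	if (out->flags & ~(SKETCH_FLAG_READ | SKETCH_FLAG_WRITE))
		return -EINVAL;

	/* Both the size and the user address must be page aligned. */
	if ((out->size & (SKETCH_PAGE_SIZE - 1)) ||
	    (out->vaddr & (SKETCH_PAGE_SIZE - 1)))
		return -EINVAL;

	/* Mirrors the "one page per call" limit noted as a TODO in the patch. */
	if (out->size != SKETCH_PAGE_SIZE)
		return -EINVAL;

	return 0;
}
```

Checking argsz against the end of the last known field, rather than
against sizeof() of the whole struct, is what would let fields be appended
later without breaking callers built against the older layout.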

^ permalink raw reply related

* [PATCH 0/2] vfio on power
From: Alexey Kardashevskiy @ 2013-02-11 11:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
	Paul Mackerras, linuxppc-dev, David Gibson

This series introduces VFIO support on POWER.
QEMU support is required; the real mode acceleration patches will come later.

Alexey Kardashevskiy (2):
  vfio powerpc: enabled on powernv platform
  vfio powerpc: implemented IOMMU driver for VFIO

 arch/powerpc/include/asm/iommu.h            |   15 ++
 arch/powerpc/kernel/iommu.c                 |  343 +++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda.c   |    1 +
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |    5 +-
 arch/powerpc/platforms/powernv/pci.c        |    3 +
 drivers/iommu/Kconfig                       |    8 +
 drivers/vfio/Kconfig                        |    6 +
 drivers/vfio/Makefile                       |    1 +
 drivers/vfio/vfio_iommu_spapr_tce.c         |  269 +++++++++++++++++++++
 include/uapi/linux/vfio.h                   |   31 +++
 10 files changed, 681 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c

-- 
1.7.10.4

^ permalink raw reply

* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Vincent Guittot @ 2013-02-11 11:58 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
	linux-arch, Russell King - ARM Linux, xiaoguangrong, wangyun,
	paulmck, nikunj, linux-pm, Rusty Russell, rostedt, rjw, namhyung,
	tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <51153F72.1060005@linux.vnet.ibm.com>

Hi Srivatsa,

I can try to run some of our stress tests on your patches. Have you
got a git tree that I can pull?

Regards,
Vincent

On 8 February 2013 19:09, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 02/08/2013 10:14 PM, Srivatsa S. Bhat wrote:
>> On 02/08/2013 09:11 PM, Russell King - ARM Linux wrote:
>>> On Thu, Feb 07, 2013 at 11:41:34AM +0530, Srivatsa S. Bhat wrote:
>>>> On 02/07/2013 09:44 AM, Rusty Russell wrote:
>>>>> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> writes:
>>>>>> On 01/22/2013 01:03 PM, Srivatsa S. Bhat wrote:
>>>>>>                  Avg. latency of 1 CPU offline (ms) [stop-cpu/stop-m/c latency]
>>>>>>
>>>>>> # online CPUs    Mainline (with stop-m/c)       This patchset (no stop-m/c)
>>>>>>
>>>>>>       8                 17.04                          7.73
>>>>>>
>>>>>>      16                 18.05                          6.44
>>>>>>
>>>>>>      32                 17.31                          7.39
>>>>>>
>>>>>>      64                 32.40                          9.28
>>>>>>
>>>>>>     128                 98.23                          7.35
>>>>>
>>>>> Nice!
>>>>
>>>> Thank you :-)
>>>>
>>>>>  I wonder how the ARM guys feel with their quad-cpu systems...
>>>>>
>>>>
>>>> That would be definitely interesting to know :-)
>>>
>>> That depends what exactly you'd like tested (and how) and whether you'd
>>> like it to be a test-chip based quad core, or an OMAP dual-core SoC.
>>>
>>
>> The effect of stop_machine() doesn't really depend on the CPU architecture
>> used underneath or the platform. It depends only on the _number_ of
>> _logical_ CPUs used.
>>
>> And stop_machine() has 2 noticeable drawbacks:
>> 1. It makes the hotplug operation itself slow
>> 2. and it causes disruptions to the workloads running on the other
>> CPUs by hijacking the entire machine for significant amounts of time.
>>
>> In my experiments (mentioned above), I tried to measure how my patchset
>> improves (reduces) the duration of hotplug (CPU offline) itself. Which is
>> also slightly indicative of the impact it has on the rest of the system.
>>
>> But what would be nice to test, is a setup where the workloads running on
>> the rest of the system are latency-sensitive, and measure the impact of
>> CPU offline on them, with this patchset applied. That would tell us how
>> far is this useful in making CPU hotplug less disruptive on the system.
>>
>> Of course, it would be nice to also see whether we observe any reduction
>> in hotplug duration itself (point 1 above) on ARM platforms with lot
>> of CPUs. [This could potentially speed up suspend/resume, which is used
>> rather heavily on ARM platforms].
>>
>> The benefits from this patchset over mainline (both in terms of points
>> 1 and 2 above) is expected to increase, with increasing number of CPUs in
>> the system.
>>
>
> Adding Vincent to CC, who had previously evaluated the performance and
> latency implications of CPU hotplug on ARM platforms, IIRC.
>
> Regards,
> Srivatsa S. Bhat
>

^ permalink raw reply

* Re: [PATCH] Centralise CONFIG_ARCH_NO_VIRT_TO_BUS
From: James Hogan @ 2013-02-11 11:57 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: linux-arch, linux-sh, Vineet Gupta, linux-kernel, David S. Miller,
	sparclinux, Paul Mundt, Paul Mackerras, Bjorn Helgaas,
	Andrew Morton, linuxppc-dev, H Hartley Sweeten
In-Reply-To: <20121113082615.2f482eb8835daf46e1f27947@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 364 bytes --]

Hi Stephen,

On 12/11/12 21:26, Stephen Rothwell wrote:
> Make it easier for more architectures to select it and thus disable
> drivers that use virt_to_bus().
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

I was just wondering what the status of this patch is? It was in -next
for a while but seems to have disappeared.

Cheers
James


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: Re[3]: PS3 platform is broken on Linux 3.7.0
From: Aneesh Kumar K.V @ 2013-02-11 10:26 UTC (permalink / raw)
  To: Phileas Fogg, Geoff Levand, linuxppc-dev
In-Reply-To: <87k3qg4092.fsf@linux.vnet.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> Phileas Fogg <phileas-fogg@mail.ru> writes:
>
>>  And another note.
>> I took a look at the MMU chapter in the Cell Architecture handbook and indeed the first 15 bits in VA are treated as 0 by the hardware.
>>
>> Quote:
>>
>> 1. High-order bits above 65 bits in the 80-bit virtual address (VA[0:14]) are not implemented. The hardware always
>>    treats these bits as `0'. Software must not set these bits to any other value than `0' or the results are undefined in
>>    the PPE.
>>
>>
>
> True, we missed the below part of ISA doc:
>
> ISA doc says
>
> "On implementations that support a virtual address size
> of only n bits, n < 78, bits 0:77-n of the AVA field must be
> zeros. "
>
> The Cell document I found at
>
> https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED/$file/CellBE_Handbook_v1.12_3Apr09_pub.pdf
>
> gives
>
> Virtual Address (VA) Size -> 65 bits
>
> So as per ISA, bits 0:12 should be zero, which should make 0:14 of PTE
> fields zero for Cell.
>
> I will try to do a patch.
>

Can you try this patch ?

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 2fdb47a..f01fd9a 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -381,21 +381,37 @@ extern void slb_set_size(u16 size);
  * hash collisions.
  */
 
+/* This should go in Kconfig */
+/*
+ * Be careful with this value. This determines the VSID_MODULUS_*  and that
+ * need to be co-prime with VSID_MULTIPLIER*
+ */
+#if 1
+#define MAX_VIRTUAL_ADDR_BITS	65
+#else
+#define MAX_VIRTUAL_ADDR_BITS	66
+#endif
+/*
+ * One bit is taken by the kernel, only the rest of space is available for the
+ * user space.
+ */
+#define CONTEXT_BITS		(MAX_VIRTUAL_ADDR_BITS - \
+				 (USER_ESID_BITS + SID_SHIFT + 1))
+#define USER_ESID_BITS		18
+#define USER_ESID_BITS_1T	6
+
 /*
  * This should be computed such that protovosid * vsid_mulitplier
  * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus
  */
 #define VSID_MULTIPLIER_256M	ASM_CONST(12538073)	/* 24-bit prime */
-#define VSID_BITS_256M		38
+#define VSID_BITS_256M		(CONTEXT_BITS + USER_ESID_BITS + 1)
 #define VSID_MODULUS_256M	((1UL<<VSID_BITS_256M)-1)
 
 #define VSID_MULTIPLIER_1T	ASM_CONST(12538073)	/* 24-bit prime */
-#define VSID_BITS_1T		26
+#define VSID_BITS_1T		(CONTEXT_BITS + USER_ESID_BITS_1T + 1)
 #define VSID_MODULUS_1T		((1UL<<VSID_BITS_1T)-1)
 
-#define CONTEXT_BITS		19
-#define USER_ESID_BITS		18
-#define USER_ESID_BITS_1T	6
 
 #define USER_VSID_RANGE	(1UL << (USER_ESID_BITS + SID_SHIFT))
 
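
The bit budget that the patch above encodes can be worked through with a
few lines of arithmetic. This is a minimal sketch, not kernel code: the
sketch_* names are hypothetical, and SID_SHIFT is assumed to be 28
(256MB segments), a value that is not stated in this message.

```c
/* Assumed segment shift for 256MB segments; not stated in the message. */
#define SKETCH_SID_SHIFT         28
/* These two values are taken from the patch. */
#define SKETCH_USER_ESID_BITS    18
#define SKETCH_USER_ESID_BITS_1T 6

/* Context bits left over once ESID, segment offset and kernel bit are taken. */
static int sketch_context_bits(int va_bits)
{
	return va_bits - (SKETCH_USER_ESID_BITS + SKETCH_SID_SHIFT + 1);
}

/* VSID width for 256MB segments, following the patch's formula. */
static int sketch_vsid_bits_256m(int va_bits)
{
	return sketch_context_bits(va_bits) + SKETCH_USER_ESID_BITS + 1;
}

/* VSID width for 1TB segments, following the patch's formula. */
static int sketch_vsid_bits_1t(int va_bits)
{
	return sketch_context_bits(va_bits) + SKETCH_USER_ESID_BITS_1T + 1;
}

/* Highest AVA bit that must be zero for an n-bit VA (the quoted 0:77-n rule). */
static int sketch_ava_last_zero_bit(int va_bits)
{
	return 77 - va_bits;
}
```

With 66 bits the three formulas give 19, 38 and 26, which are the
constants the patch removes. With 65 bits they give 18, 37 and 25, and the
ISA rule gives bits 0:12 of the AVA field as zero, matching the text
above.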

^ permalink raw reply related

* Re: [BUG] irq_dispose_mapping after irq request failure
From: Baruch Siach @ 2013-02-11  6:44 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20130211061949.GA5561@concordia>

Hi Michael,

On Mon, Feb 11, 2013 at 05:19:49PM +1100, Michael Ellerman wrote:
> On Mon, Feb 11, 2013 at 07:31:00AM +0200, Baruch Siach wrote:

[...]

> > mpc85xx_pci_err_probe: Unable to requiest irq 16 for MPC85xx PCI err
> While you're there, can you fix the typo :)

The patch fixing it is already queued at 
http://git.kernel.org/?p=linux/kernel/git/bp/bp.git;a=commitdiff;h=e7d2c215e56dc9fa0a01e26f2acfc3d73c889ba3.

Thanks for your detailed explanation. I'll now try to figure out what's wrong 
with my device tree.

baruch

-- 
     http://baruch.siach.name/blog/                  ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Re: [BUG] irq_dispose_mapping after irq request failure
From: Michael Ellerman @ 2013-02-11  6:19 UTC (permalink / raw)
  To: Baruch Siach; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20130211053100.GB18462@sapphire.tkos.co.il>

On Mon, Feb 11, 2013 at 07:31:00AM +0200, Baruch Siach wrote:
> Hi lkml,

Hi Baruch,

> The drivers/edac/mpc85xx_edac.c driver contains the following (abbreviated)
> code snippet in its .probe:

You dropped an important detail which is the preceding line:

	pdata->irq = irq_of_parse_and_map(op->dev.of_node, 0);

> 		res = devm_request_irq(&op->dev, pdata->irq,
> 				       mpc85xx_pci_isr, IRQF_DISABLED,
> 				       "[EDAC] PCI err", pci);
> 		if (res < 0) {
> 			irq_dispose_mapping(pdata->irq);
> 			goto err2;
> 		}
> 
> Now, since the requested irq is already in use, and IRQF_SHARED is not set,
> devm_request_irq() errors out, which is OK. Less OK is the
> irq_dispose_mapping() call, which gives me this:
> 
> EDAC PCI1: Giving out device to module 'MPC85xx_edac' controller 'mpc85xx_pci_err': DEV 'ffe0a000.pcie' (INTERRUPT)
> genirq: Flags mismatch irq 16. 00000020 ([EDAC] PCI err) vs. 00000020 ([EDAC] PCI err)

The hint here is to notice which other irq you're clashing with          ^^
ie. yourself. Which is odd, that is the root of the problem.

The badness you're getting from irq_dispose_mapping() is caused because you're
disposing of that mapping which is currently still in use, by the same interrupt.

That is caused by a "feature" in the irq mapping code, where if you ask to map an
already mapped hwirq, it will give you back the same virq. So in your case when
you called irq_of_parse_and_map() it noticed that someone had already mapped
that hwirq, and gave you back an existing (in use) virq.
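
The sharing behaviour described above can be modelled in a few lines.
This is a toy userspace sketch with hypothetical sketch_* names and a
made-up table size; it is not the irq domain implementation, and only
illustrates why a second caller's dispose removes the first caller's
mapping.

```c
#define SKETCH_MAX_VIRQ 32

/* Toy mapping table: hwirq_of[virq] is the mapped hwirq, or -1 if free. */
struct sketch_domain {
	int hwirq_of[SKETCH_MAX_VIRQ];
};

static void sketch_domain_init(struct sketch_domain *d)
{
	int virq;

	for (virq = 0; virq < SKETCH_MAX_VIRQ; virq++)
		d->hwirq_of[virq] = -1;
}

/*
 * Map a hwirq. If it is already mapped, the existing virq is handed
 * back instead of a new one. Returns 0 (no interrupt) when full.
 */
static int sketch_create_mapping(struct sketch_domain *d, int hwirq)
{
	int virq, free_virq = 0;

	for (virq = 1; virq < SKETCH_MAX_VIRQ; virq++) {
		if (d->hwirq_of[virq] == hwirq)
			return virq;            /* reuse the existing mapping */
		if (d->hwirq_of[virq] < 0 && free_virq == 0)
			free_virq = virq;       /* remember the first free slot */
	}
	if (free_virq)
		d->hwirq_of[free_virq] = hwirq;
	return free_virq;
}

/* Drop a mapping without checking whether anyone else still uses it. */
static void sketch_dispose_mapping(struct sketch_domain *d, int virq)
{
	if (virq > 0 && virq < SKETCH_MAX_VIRQ)
		d->hwirq_of[virq] = -1;
}

static int sketch_is_mapped(const struct sketch_domain *d, int virq)
{
	return virq > 0 && virq < SKETCH_MAX_VIRQ && d->hwirq_of[virq] >= 0;
}
```

Mapping the same hwirq twice in this model returns the same virq both
times, so when the second user disposes of "its" mapping on the error
path, the first user's mapping is gone as well.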

> mpc85xx_pci_err_probe: Unable to requiest irq 16 for MPC85xx PCI err
                                       ^
While you're there, can you fix the typo :)

> So, is irq_dispose_mapping() the right thing to do when irq request fails?

It's the right thing to do to undo the effect of irq_create_mapping(), or in your case irq_of_parse_and_map().

It just falls down in this case, because you're inadvertently disposing of something that's still in use.

> A simple grep shows that irq_dispose_mapping() calls are mostly limited to
> powerpc code. Is there a reason for that?

That's because the irq domain code began life as powerpc specific code. It's now become generic and will start to appear in more places.

cheers

^ permalink raw reply

* [BUG] irq_dispose_mapping after irq request failure
From: Baruch Siach @ 2013-02-11  5:31 UTC (permalink / raw)
  To: linux-kernel; +Cc: linuxppc-dev

Hi lkml,

The drivers/edac/mpc85xx_edac.c driver contains the following (abbreviated)
code snippet in its .probe:

		res = devm_request_irq(&op->dev, pdata->irq,
				       mpc85xx_pci_isr, IRQF_DISABLED,
				       "[EDAC] PCI err", pci);
		if (res < 0) {
			irq_dispose_mapping(pdata->irq);
			goto err2;
		}

Now, since the requested irq is already in use, and IRQF_SHARED is not set,
devm_request_irq() errors out, which is OK. Less OK is the
irq_dispose_mapping() call, which gives me this:

EDAC PCI1: Giving out device to module 'MPC85xx_edac' controller 'mpc85xx_pci_err': DEV 'ffe0a000.pcie' (INTERRUPT)
genirq: Flags mismatch irq 16. 00000020 ([EDAC] PCI err) vs. 00000020 ([EDAC] PCI err)
mpc85xx_pci_err_probe: Unable to requiest irq 16 for MPC85xx PCI err
remove_proc_entry: removing non-empty directory 'irq/16', leaking at least '[EDAC] PCI err'
------------[ cut here ]------------
WARNING: at fs/proc/generic.c:842
NIP: c00cd00c LR: c00cd00c CTR: c000c5e4
REGS: cf039b80 TRAP: 0700   Not tainted  (3.8.0-rc7-00002-g37ddebf)
MSR: 00029000 <CE,EE,ME>  CR: 42042422  XER: 00000000
TASK = cf034000[1] 'swapper' THREAD: cf038000
GPR00: c00cd00c cf039c30 cf034000 0000005b 0000005c 0000005c c04b7dde 435d2050 
GPR08: 43492065 c04a9a44 00000000 cf039bf0 22042424 00000000 c00025d0 00000000 
GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 c042fe78 
GPR24: 00000000 00000000 c04c3f90 cf05294c 00100100 00200200 cf039c78 cf052900 
NIP [c00cd00c] remove_proc_entry+0x190/0x1bc
LR [c00cd00c] remove_proc_entry+0x190/0x1bc
Call Trace:
[cf039c30] [c00cd00c] remove_proc_entry+0x190/0x1bc (unreliable)
[cf039c70] [c0058c64] unregister_irq_proc+0x6c/0x74
[cf039c90] [c0054530] free_desc+0x34/0x68
[cf039cb0] [c00545f0] irq_free_descs+0x44/0x88
[cf039cd0] [c00585c8] irq_dispose_mapping+0x68/0x70
[cf039ce0] [c0222650] mpc85xx_pci_err_probe+0x2a8/0x308
[cf039d20] [c0014f8c] fsl_pci_probe+0x74/0x80
[cf039d30] [c01a9c48] platform_drv_probe+0x20/0x30
[cf039d40] [c01a88c4] driver_probe_device+0xcc/0x1f4
[cf039d60] [c01a7288] bus_for_each_drv+0x60/0x9c
[cf039d90] [c01a85ac] device_attach+0x78/0x90
[cf039db0] [c01a7430] bus_probe_device+0x34/0x9c
[cf039dd0] [c01a55c4] device_add+0x410/0x580
[cf039e10] [c022eef4] of_device_add+0x40/0x50
[cf039e20] [c022f550] of_platform_device_create_pdata+0x6c/0x8c
[cf039e40] [c022f658] of_platform_bus_create+0xe8/0x178
[cf039e90] [c022f7a0] of_platform_bus_probe+0xac/0xdc
[cf039eb0] [c0415488] mpc85xx_common_publish_devices+0x20/0x30
[cf039ec0] [c0415578] __machine_initcall_p1020_rdb_mpc85xx_common_publish_devices+0x2c/0x3c
[cf039ed0] [c040e83c] do_one_initcall+0xdc/0x1b4
[cf039f00] [c040ea24] kernel_init_freeable+0x110/0x1a8
[cf039f30] [c00025e8] kernel_init+0x18/0xf8
[cf039f40] [c000b868] ret_from_kernel_thread+0x64/0x6c
Instruction dump:
2f870000 41be0030 80bf002c 3c80c033 3c60c03e 38846be0 38840260 38a50055 
38df0055 38e70055 38631770 48260331 <0fe00000> 7fe3fb78 4bfffb41 48000018 
---[ end trace 9af370ce0e147530 ]---

So, is irq_dispose_mapping() the right thing to do when irq request fails?

A simple grep shows that irq_dispose_mapping() calls are mostly limited to
powerpc code. Is there a reason for that?

baruch

-- 
     http://baruch.siach.name/blog/                  ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* [PATCH] iommu: adding missing kvm_iommu_map_pages/kvm_iommu_unmap_pages
From: Alexey Kardashevskiy @ 2013-02-11  5:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, linux-kernel, kvm-ppc, Paul Mackerras,
	linuxppc-dev, David Gibson

The IOMMU API implements group creation/deletion, device binding
and IOMMU map/unmap operations.

The POWERPC implementation uses most of the API except the map/unmap
operations, which are implemented on POWERPC using hypercalls.

However, in order to link a kernel with CONFIG_IOMMU_API enabled,
empty kvm_iommu_map_pages/kvm_iommu_unmap_pages stubs have to be
defined, which is what this patch does.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/kernel/iommu.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 31c4fdc..7c309fe 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -36,6 +36,7 @@
 #include <linux/hash.h>
 #include <linux/fault-inject.h>
 #include <linux/pci.h>
+#include <linux/kvm_host.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -860,3 +861,19 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 		free_pages((unsigned long)vaddr, get_order(size));
 	}
 }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+/* POWERPC does not use IOMMU API for mapping/unmapping */
+int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	return 0;
+}
+void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+}
+
+#endif /* CONFIG_IOMMU_API */
-- 
1.7.10.4

^ permalink raw reply related

* Re: PS3 platform is broken on Linux 3.7.0
From: Michael Ellerman @ 2013-02-11  3:39 UTC (permalink / raw)
  To: Phileas Fogg; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1360518697.255951128@f337.mail.ru>

On Sun, Feb 10, 2013 at 09:51:37PM +0400, Phileas Fogg wrote:
> 
> >Phileas Fogg < phileas-fogg@mail.ru > writes:
> >
> 
> Patch:
> 
> --- arch/powerpc/kernel/setup_64.c.old    2013-02-10 19:34:53.787366191 +0100
> +++ arch/powerpc/kernel/setup_64.c    2013-02-10 19:35:38.834035478 +0100
> @@ -186,6 +186,9 @@
>      initialise_paca(&boot_paca, 0);
>      setup_paca(&boot_paca);
>  
> +    /* Allow percpu accesses to "work" until we setup percpu data */
> +    boot_paca.data_offset = 0;
> +

This is correct.

>      /* Initialize lockdep early or else spinlocks will blow */
>      lockdep_init();
>  
> @@ -208,8 +211,6 @@
>  
>      /* Fix up paca fields required for the boot cpu */
>      get_paca()->cpu_start = 1;
> -    /* Allow percpu accesses to "work" until we setup percpu data */
> -    get_paca()->data_offset = 0;

But this is not.

As you said, they are different pacas, so we need to make sure both
boot_paca, and "the paca of the boot cpu" are initialised with
data_offset = 0.

I'll send a patch to sort it.
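
For anyone following along, the "two pacas" point can be sketched in user space as below. This is a toy model under stated assumptions, not setup_64.c: the toy_* names are hypothetical, and the non-zero placeholder value written by toy_init_paca() is an assumption chosen only to make an uninitialised offset visible.

```c
/* Toy model of "boot_paca" vs "the paca of the boot cpu".
 * All toy_* names and TOY_POISON are hypothetical, for illustration only. */
#include <assert.h>

#define TOY_NR_CPUS 4
#define TOY_POISON 0xfee1deadUL	/* assumed placeholder for "not set up yet" */

struct toy_paca {
	unsigned long data_offset;	/* percpu offset for this cpu */
	int cpu_start;
};

struct toy_paca toy_boot_paca;		/* used very early in boot */
struct toy_paca toy_pacas[TOY_NR_CPUS];	/* the real per-cpu pacas */
struct toy_paca *toy_current_paca;	/* stands in for get_paca() */

void toy_init_paca(struct toy_paca *p)
{
	p->data_offset = TOY_POISON;
	p->cpu_start = 0;
}

void toy_early_setup(int boot_cpuid)
{
	int cpu;

	assert(boot_cpuid >= 0 && boot_cpuid < TOY_NR_CPUS);

	/* Phase 1: run on the temporary boot paca. */
	toy_init_paca(&toy_boot_paca);
	toy_current_paca = &toy_boot_paca;
	toy_boot_paca.data_offset = 0;	/* early percpu accesses "work" */

	/* Phase 2: switch to the real paca of the boot cpu. */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_init_paca(&toy_pacas[cpu]);
	toy_current_paca = &toy_pacas[boot_cpuid];
	toy_current_paca->cpu_start = 1;
	/* A different object from toy_boot_paca, so it needs its own reset. */
	toy_current_paca->data_offset = 0;
}
```

Dropping either of the two data_offset assignments leaves one of the two objects with the placeholder value during the window in which it is the current paca.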

cheers

^ permalink raw reply

* Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Paul E. McKenney @ 2013-02-10 22:13 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, nikunj, linux-pm, rusty, rostedt,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, Oleg Nesterov, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <5117FE74.4020000@linux.vnet.ibm.com>

On Mon, Feb 11, 2013 at 01:39:24AM +0530, Srivatsa S. Bhat wrote:
> On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> > On 02/11, Srivatsa S. Bhat wrote:
> >>
> >> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
> >>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >>>>> +{
> >>>>> +   unsigned int cpu;
> >>>>> +
> >>>>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >>>>
> >>>> Why do we drop ourselves twice?  More to the point, why is it important to
> >>>> drop ourselves first?
> >>>
> >>> And don't we need mb() _before_ we clear ->writer_signal ?
> >>>
> >>
> >> Oh, right! Or, how about moving announce_writer_inactive() to _after_
> >> write_unlock()?
> > 
> > Not sure this will help... but, either way it seems we have another
> > problem...
> > 
> > percpu_rwlock tries to be "generic". This means we should "ignore" its
> > usage in hotplug, and _write_lock should not race with _write_unlock.
> > 
> 
> Yes, good point!
> 
> > IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> > that this can't race with another write which wants to set this flag.
> > Perhaps it should be counter as well, and it should be protected by
> > the same ->global_rwlock, but _write_lock() should drop it before
> > sync_all_readers() and then take it again?
> 
> Hmm, or we could just add an extra mb() like you suggested, and keep it
> simple...
> 
> > 
> >>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
> >>>>> +			       unsigned int cpu)
> >>>>> +{
> >>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
> >>>>
> >>>> As I understand it, the purpose of this memory barrier is to ensure
> >>>> that the stores in drop_writer_signal() happen before the reads from
> >>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
> >>>> race between a new reader attempting to use the fastpath and this writer
> >>>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
> >>>> than smp_rmb().
> >>>
> >>> And note that before sync_reader() we call announce_writer_active() which
> >>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
> >>> unneeded.
> >>>
> >>
> >> My intention was to help the writer see the ->reader_refcnt drop to zero
> >> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
> > 
> > Hmm, interesting... Not sure, but can't really comment. However I can
> > answer your next question:
> 
> Paul told me in another mail that I was expecting too much out of memory
> barriers, like increasing the speed of electrons and what not ;-)
> [ It would have been cool though, if it had such magical powers :P ]

"But because you have used the special mb_tachyonic instruction, the
speed of light is 600,000 km/s for the next five clock cycles"...  ;-)

							Thanx, Paul

> >> Please correct me if my understanding of memory barriers is wrong here..
> > 
> > Who? Me??? No we have paulmck for that ;)
> >
> 
> Haha ;-)
> 
> Regards,
> Srivatsa S. Bhat
> 

^ permalink raw reply

* Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-10 20:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-doc, peterz, fweisbec, mingo, linux-arch, linux,
	xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty, rostedt,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, linux-kernel, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <20130210201312.GB6236@redhat.com>

On 02/11/2013 01:43 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>> +{
>>>> +	unsigned int cpu;
>>>> +
>>>> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>
>>> Why do we drop ourselves twice?  More to the point, why is it important to
>>> drop ourselves first?
>>>
>>
>> I don't see where we are dropping ourselves twice. Note that we are no longer
>> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
>> to manually drop ourselves. It doesn't matter whether we drop ourselves first
>> or later.
> 
> Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
> _write_lock.
> 

Ah, right. I guess the code still has remnants from the older version in which
this locking scheme wasn't generic and was tied to cpu-hotplug alone..

> Perhaps _write_lock/unlock should use for_each_possible_cpu() instead?
> 

Hmm, that wouldn't be too bad.

> Hmm... I think this makes sense anyway. Otherwise, in theory,
> percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?
> 

Yeah, makes sense. Will change it to for_each_possible_cpu().
And I had previously fixed such races with lglocks with a complicated scheme (to
avoid the costly for_each_possible_cpu() loop), which was finally rewritten to use
for_each_possible_cpu() for the sake of simplicity.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Oleg Nesterov @ 2013-02-10 20:13 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, mingo, linux-arch, linux,
	xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty, rostedt,
	rjw, namhyung, tglx, linux-arm-kernel, netdev, linux-kernel, sbw,
	tj, akpm, linuxppc-dev
In-Reply-To: <5117F0C0.2030605@linux.vnet.ibm.com>

On 02/11, Srivatsa S. Bhat wrote:
>
> On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> >> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
> >> +{
> >> +	unsigned int cpu;
> >> +
> >> +	drop_writer_signal(pcpu_rwlock, smp_processor_id());
> >
> > Why do we drop ourselves twice?  More to the point, why is it important to
> > drop ourselves first?
> >
>
> I don't see where we are dropping ourselves twice. Note that we are no longer
> in the cpu_online_mask, so the 'for' loop below won't include us. So we need
> to manually drop ourselves. It doesn't matter whether we drop ourselves first
> or later.

Yes, but this just reflects its usage in cpu-hotplug. cpu goes away under
_write_lock.

Perhaps _write_lock/unlock should use for_each_possible_cpu() instead?

Hmm... I think this makes sense anyway. Otherwise, in theory,
percpu_write_lock(random_non_hotplug_lock) can race with cpu_up?
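
A rough user-space sketch of the online-vs-possible difference follows. It is a toy model only: the toy_* names, the fixed CPU count, and the plain bool arrays are assumptions for illustration, and there is no real concurrency or locking in it.

```c
/* Toy model: per-cpu writer signals set and cleared by iterating either
 * over online cpus or over all possible cpus. Hypothetical names throughout. */
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

bool toy_cpu_online[TOY_NR_CPUS];
bool toy_writer_signal[TOY_NR_CPUS];

void toy_reset(void)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		toy_cpu_online[cpu] = true;
		toy_writer_signal[cpu] = false;
	}
}

/* Set or clear the signal on every cpu the chosen iterator visits. */
void toy_set_signals(bool value, bool only_online)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (only_online && !toy_cpu_online[cpu])
			continue;
		toy_writer_signal[cpu] = value;
	}
}

int toy_stale_signals(void)
{
	int cpu, stale = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_writer_signal[cpu])
			stale++;
	return stale;
}

/* One write-side section during which dying_cpu goes offline.
 * Returns how many signals are still set after the "unlock". */
int toy_write_section(int dying_cpu, bool only_online)
{
	assert(dying_cpu >= 0 && dying_cpu < TOY_NR_CPUS);
	toy_reset();
	toy_set_signals(true, only_online);	/* write lock side */
	toy_cpu_online[dying_cpu] = false;	/* cpu goes away under the lock */
	toy_set_signals(false, only_online);	/* write unlock side */
	return toy_stale_signals();
}
```

With the online iterator the departed cpu keeps its signal set unless it is dropped by hand, which is what the explicit drop_writer_signal() call in the patch compensates for. With the possible-cpu iterator nothing is left behind and no special case is needed.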

Oleg.

^ permalink raw reply

* Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Srivatsa S. Bhat @ 2013-02-10 20:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-doc, peterz, fweisbec, mingo, linux-arch, linux,
	xiaoguangrong, wangyun, Paul E. McKenney, nikunj, linux-pm, rusty,
	rostedt, rjw, namhyung, tglx, linux-arm-kernel, netdev,
	linux-kernel, sbw, tj, akpm, linuxppc-dev
In-Reply-To: <20130210195042.GA6236@redhat.com>

On 02/11/2013 01:20 AM, Oleg Nesterov wrote:
> On 02/11, Srivatsa S. Bhat wrote:
>>
>> On 02/10/2013 11:36 PM, Oleg Nesterov wrote:
>>>>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>>>>> +{
>>>>> +   unsigned int cpu;
>>>>> +
>>>>> +   drop_writer_signal(pcpu_rwlock, smp_processor_id());
>>>>
>>>> Why do we drop ourselves twice?  More to the point, why is it important to
>>>> drop ourselves first?
>>>
>>> And don't we need mb() _before_ we clear ->writer_signal ?
>>>
>>
>> Oh, right! Or, how about moving announce_writer_inactive() to _after_
>> write_unlock()?
> 
> Not sure this will help... but, either way it seems we have another
> problem...
> 
> percpu_rwlock tries to be "generic". This means we should "ignore" its
> usage in hotplug, and _write_lock should not race with _write_unlock.
> 

Yes, good point!

> IOW. Suppose that _write_unlock clears ->writer_signal. We need to ensure
> that this can't race with another write which wants to set this flag.
> Perhaps it should be counter as well, and it should be protected by
> the same ->global_rwlock, but _write_lock() should drop it before
> sync_all_readers() and then take it again?

Hmm, or we could just add an extra mb() like you suggested, and keep it
simple...

> 
>>>>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>>>>> +			       unsigned int cpu)
>>>>> +{
>>>>> +	smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>>>>
>>>> As I understand it, the purpose of this memory barrier is to ensure
>>>> that the stores in drop_writer_signal() happen before the reads from
>>>> ->reader_refcnt in reader_uses_percpu_refcnt(), thus preventing the
>>>> race between a new reader attempting to use the fastpath and this writer
>>>> acquiring the lock.  Unless I am confused, this must be smp_mb() rather
>>>> than smp_rmb().
>>>
>>> And note that before sync_reader() we call announce_writer_active() which
>>> already adds mb() before sync_all_readers/sync_reader, so this rmb() looks
>>> unneeded.
>>>
>>
>> My intention was to help the writer see the ->reader_refcnt drop to zero
>> ASAP; hence I used smp_wmb() at reader and smp_rmb() here at the writer.
> 
> Hmm, interesting... Not sure, but can't really comment. However I can
> answer your next question:
>

Paul told me in another mail that I was expecting too much out of memory
barriers, like increasing the speed of electrons and what not ;-)
[ It would have been cool though, if it had such magical powers :P ]
 
>> Please correct me if my understanding of memory barriers is wrong here..
> 
> Who? Me??? No we have paulmck for that ;)
>

Haha ;-)

Regards,
Srivatsa S. Bhat

^ permalink raw reply
