LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: kernel panic during kernel module load (powerpc specific part)
From: Steffen Rumler @ 2012-06-04  7:43 UTC (permalink / raw)
  To: ext Wrobel Heinz-R39252; +Cc: Michael Ellerman, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <192298D25D96A042975E372855100DB70FF0E2@039-SN2MPN1-011.039d.mgd.msft.net>


>>> I believe that the basic premise is that you should provide a directly
>>> reachable copy of the save/rstore functions, even if this means that
>> you need several copies of the functions.
>>
>> I just fixed a very similar problem with grub2 in fact. It was using r0
>> and trashing the saved LR that way.
>>
>> The real fix is indeed to statically link those gcc "helpers", we
>> shouldn't generate things like cross-module calls inside function prologs
>> and epilogues, when stackframes aren't even guaranteed to be reliable.
>>
>> However, in the grub2 case, it was easier to just use r12 :-)
> For not just the module loading case, I believe r12 is the only real solution now. I checked one debugger capable of doing ELF download. It also uses r12 for trampoline code. I am guessing for the reason that prompted this discussion.
>
> Without r12 we'd have to change standard libraries to automagically link in gcc helpers for any conceivable non-.text section, which I am not sure is feasible. How would you write section independent helper functions which link to any section needing them?!
> Asking users to create their own section specific copy of helper functions is definitely not portable if the module or other code is not architecture dependent.
> It is a normal gcc feature that you can assign specific code to non-.text sections and it is not documented that it may crash depending on the OS arch the ELF is built for, so asking for a Power Architecture specific change on tool libs to make Power Architecture Linux happy seems a bit much to ask.
>
> Using r12 in any Linux related trampoline code seems a reachable goal, and it would eliminate the conflict to the ABI.
>
> Regards,
>
> Heinz
Hi,

I've tried the fix using register r12 for the trampoline code, instead of register r11.
I'd to adapt entry_matches() too:

--- arch/powerpc/kernel/module_32.c    (revision 1898)
+++ arch/powerpc/kernel/module_32.c    (working copy)
@@ -187,8 +187,8 @@

  static inline int entry_matches(struct ppc_plt_entry *entry, Elf32_Addr val)
  {
-    if (entry->jump[0] == 0x3d600000 + ((val + 0x8000) >> 16)
- && entry->jump[1] == 0x396b0000 + (val & 0xffff))
+    if (entry->jump[0] == 0x3d800000 + ((val + 0x8000) >> 16)
+ && entry->jump[1] == 0x398c0000 + (val & 0xffff))
          return 1;
      return 0;
  }
@@ -215,10 +215,9 @@
          entry++;
      }

-    /* Stolen from Paul Mackerras as well... */
-    entry->jump[0] = 0x3d600000+((val+0x8000)>>16);    /* lis r11,sym@ha */
-    entry->jump[1] = 0x396b0000 + (val&0xffff);    /* addi r11,r11,sym@l*/
-    entry->jump[2] = 0x7d6903a6;            /* mtctr r11 */
+    entry->jump[0] = 0x3d800000+((val+0x8000)>>16); /* lis r12,sym@ha */
+    entry->jump[1] = 0x398c0000 + (val&0xffff);     /* addi r12,r12,sym@l*/
+    entry->jump[2] = 0x7d8903a6;                    /* mtctr r12 */
      entry->jump[3] = 0x4e800420;            /* bctr */

      DEBUGP("Initialized plt for 0x%x at %p\n", val, entry);


It looks working, also in the case the trampoline code is used.
Please see also the debug session below.

Let me summarize the situation:

    + According to Freescale, EABI assigns a dedicated function to register r11.
    + The ELF sections .text and .init.text may trigger the usage of the trampoline code.
    + The trampoline code must not user register r11, too.
    + We must not depend on addresses returned by vmalloc during kernel module loading.

I think the bug must be fixed !!!

Regards
Steffen

--


(gdb) bt
#0  0xd54990f0 in ?? ()
#1  0xd5499088 in ?? ()
#2  0xc0001d9c in do_one_initcall (fn=0xd76e26ac) at init/main.c:716
#3  0xc0059630 in sys_init_module (umod=0x4802f008, len=<value optimized out>, uargs=0x10012008 "") at kernel/module.c:2470
#4  0xc000dff8 in syscall_dotrace_cont () at arch/powerpc/kernel/entry_32.S:331
Backtrace stopped: frame did not save the PC

<-- we are returning from the driver init entry point

(gdb) info reg r11
r11            0xcbb57f00    3417669376
(gdb) x/2x 0xcbb57f00
0xcbb57f00:    0xcbb57f20    0xc0001d9c

<-- register r11 is fine here

0xd54990f0:     lis     r12,-10386
0xd54990f4:     addi    r12,r12,-20268
0xd54990f8:     mtctr   r12

<-- this is the trampoline code using now r12, instead of r11

(gdb) info reg r12
r12            0xd76db0d4    3614290132
(gdb) info reg ctr
ctr            0xd76db0d4    3614290132
0xd54990fc:     bctr

<-- we are jumping to the epilogue

0xd76db0d4:     lwz     r29,-12(r11)
0xd76db0d8:     lwz     r30,-8(r11)
0xd76db0dc:     lwz     r0,4(r11)
0xd76db0e0:     lwz     r31,-4(r11)
0xd76db0e4:     mtlr    r0
0xd76db0e8:     mr      r1,r11

<- the epilogue

(gdb) info reg lr
lr             0xc0001d9c    0xc0001d9c <do_one_initcall+100>
0xc0001d9c <do_one_initcall+100>:       lis     r9,-16330
0xc0001da0 <do_one_initcall+104>:       lwz     r0,12296(r9)
0xc0001da4 <do_one_initcall+108>:       lis     r9,-16330
0xd76db0ec:     blr

<-- the jump back to do_one_initcall()
<-- no crash

^ permalink raw reply

* Re: [PATCH] powerpc: pseries: Round up MSI-X requests
From: Benjamin Herrenschmidt @ 2012-06-04  6:54 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, paulus, Anton Blanchard
In-Reply-To: <1338792215.15716.5.camel@concordia>

On Mon, 2012-06-04 at 16:43 +1000, Michael Ellerman wrote:
> There is some chance this will result in breakage because the driver
> asks for N - and assumes that is what was allocated - and the device is
> configured for > N.

We can fix that. We can whack the configuration back with N, just know
that we have "allocated"  > N.

> But that's a hypothetical, and we know the current approach sucks
> because it will result in many drivers falling back to a single
> interrupt.
> 
> I think this is the least-worst approach in light of the FW limitations,
> and we can always add quirks in here if we really have to.
> 
> Paul is the pseries maintainer so he gets to ACK or NAK it, but from an
> MSI point of view it gets my +1.
> 

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc: pseries: Round up MSI-X requests
From: Michael Ellerman @ 2012-06-04  6:43 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: paulus, linuxppc-dev
In-Reply-To: <20120604091545.03ce738b@kryten>

On Mon, 2012-06-04 at 09:15 +1000, Anton Blanchard wrote:
> The pseries firmware currently refuses any non power of two MSI-X
> request. Unfortunately most network drivers end up asking for that
> because they want a power of two for RX queues and one or two extra
> for everything else.
> 
> This patch rounds up the firmware request to the next power of two
> if the quota allows it.

There is some chance this will result in breakage because the driver
asks for N - and assumes that is what was allocated - and the device is
configured for > N.

But that's a hypothetical, and we know the current approach sucks
because it will result in many drivers falling back to a single
interrupt.

I think this is the least-worst approach in light of the FW limitations,
and we can always add quirks in here if we really have to.

Paul is the pseries maintainer so he gets to ACK or NAK it, but from an
MSI point of view it gets my +1.

cheers

^ permalink raw reply

* [PATCH 5/5] powerpc: iommu: Implement IOMMU pools to improve multiqueue adapter performance
From: Anton Blanchard @ 2012-06-04  5:45 UTC (permalink / raw)
  To: benh, paulus, olof, michael; +Cc: linuxppc-dev
In-Reply-To: <20120604154213.173feecb@kryten>


At the moment all queues in a multiqueue adapter will serialise
against the IOMMU table lock. This is proving to be a big issue,
especially with 10Gbit ethernet.

This patch creates 4 pools and tries to spread the load across
them. If the table is under 1GB in size we revert back to the
original behaviour of 1 pool and 1 largealloc pool.

We create a hash to map CPUs to pools. Since we prefer interrupts to
be affinitised to primary CPUs, without some form of hashing we are
very likely to end up using the same pool. As an example, POWER7
has 4 way SMT and with 4 pools all primary threads will map to the
same pool

The largealloc pool is reduced from 1/2 to 1/4 of the space to
partially offset the overhead of breaking the table up into pools.

Some performance numbers were obtained with a Chelsio T3 adapter on
two POWER7 boxes, running a 100 session TCP round robin test.

Performance improved 69% with this patch applied.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

All patches combined improve performance by 178%

Index: linux-build/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-build.orig/arch/powerpc/kernel/iommu.c	2012-06-04 15:36:46.786955282 +1000
+++ linux-build/arch/powerpc/kernel/iommu.c	2012-06-04 15:37:21.243503140 +1000
@@ -33,6 +33,7 @@
 #include <linux/bitmap.h>
 #include <linux/iommu-helper.h>
 #include <linux/crash_dump.h>
+#include <linux/hash.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/iommu.h>
@@ -58,6 +59,26 @@ static int __init setup_iommu(char *str)
 
 __setup("iommu=", setup_iommu);
 
+static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
+/*
+ * We precalculate the hash to avoid doing it on every allocation.
+ *
+ * The hash is important to spread CPUs across all the pools. For example,
+ * on a POWER7 with 4 way SMT we want interrupts on the primary threads and
+ * with 4 pools all primary threads would map to the same pool.
+ */
+static int __init setup_iommu_pool_hash(void)
+{
+	unsigned int i;
+
+	for_each_possible_cpu(i)
+		per_cpu(iommu_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS);
+
+	return 0;
+}
+subsys_initcall(setup_iommu_pool_hash);
+
 static unsigned long iommu_range_alloc(struct device *dev,
 				       struct iommu_table *tbl,
                                        unsigned long npages,
@@ -72,6 +93,8 @@ static unsigned long iommu_range_alloc(s
 	unsigned long align_mask;
 	unsigned long boundary_size;
 	unsigned long flags;
+	unsigned int pool_nr;
+	struct iommu_pool *pool;
 
 	align_mask = 0xffffffffffffffffl >> (64 - align_order);
 
@@ -84,38 +107,46 @@ static unsigned long iommu_range_alloc(s
 		return DMA_ERROR_CODE;
 	}
 
-	spin_lock_irqsave(&(tbl->it_lock), flags);
+	/*
+	 * We don't need to disable preemption here because any CPU can
+	 * safely use any IOMMU pool.
+	 */
+	pool_nr = __raw_get_cpu_var(iommu_pool_hash) & (tbl->nr_pools - 1);
 
-	if (handle && *handle)
-		start = *handle;
+	if (largealloc)
+		pool = &(tbl->large_pool);
 	else
-		start = largealloc ? tbl->it_largehint : tbl->it_hint;
+		pool = &(tbl->pools[pool_nr]);
+
+	spin_lock_irqsave(&(pool->lock), flags);
 
-	/* Use only half of the table for small allocs (15 pages or less) */
-	limit = largealloc ? tbl->it_size : tbl->it_halfpoint;
+again:
+	if ((pass == 0) && handle && *handle)
+		start = *handle;
+	else
+		start = pool->hint;
 
-	if (largealloc && start < tbl->it_halfpoint)
-		start = tbl->it_halfpoint;
+	limit = pool->end;
 
 	/* The case below can happen if we have a small segment appended
 	 * to a large, or when the previous alloc was at the very end of
 	 * the available space. If so, go back to the initial start.
 	 */
 	if (start >= limit)
-		start = largealloc ? tbl->it_largehint : tbl->it_hint;
-
- again:
+		start = pool->start;
 
 	if (limit + tbl->it_offset > mask) {
 		limit = mask - tbl->it_offset + 1;
 		/* If we're constrained on address range, first try
 		 * at the masked hint to avoid O(n) search complexity,
-		 * but on second pass, start at 0.
+		 * but on second pass, start at 0 in pool 0.
 		 */
-		if ((start & mask) >= limit || pass > 0)
-			start = 0;
-		else
+		if ((start & mask) >= limit || pass > 0) {
+			pool = &(tbl->pools[0]);
+			start = pool->start;
+		} else {
 			start &= mask;
+		}
 	}
 
 	if (dev)
@@ -129,17 +160,25 @@ static unsigned long iommu_range_alloc(s
 			     tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT,
 			     align_mask);
 	if (n == -1) {
-		if (likely(pass < 2)) {
-			/* First failure, just rescan the half of the table.
-			 * Second failure, rescan the other half of the table.
-			 */
-			start = (largealloc ^ pass) ? tbl->it_halfpoint : 0;
-			limit = pass ? tbl->it_size : limit;
+		if (likely(pass == 0)) {
+			/* First try the pool from the start */
+			pool->hint = pool->start;
 			pass++;
 			goto again;
+
+		} else if (pass <= tbl->nr_pools) {
+			/* Now try scanning all the other pools */
+			spin_unlock(&(pool->lock));
+			pool_nr = (pool_nr + 1) & (tbl->nr_pools - 1);
+			pool = &tbl->pools[pool_nr];
+			spin_lock(&(pool->lock));
+			pool->hint = pool->start;
+			pass++;
+			goto again;
+
 		} else {
-			/* Third failure, give up */
-			spin_unlock_irqrestore(&(tbl->it_lock), flags);
+			/* Give up */
+			spin_unlock_irqrestore(&(pool->lock), flags);
 			return DMA_ERROR_CODE;
 		}
 	}
@@ -149,10 +188,10 @@ static unsigned long iommu_range_alloc(s
 	/* Bump the hint to a new block for small allocs. */
 	if (largealloc) {
 		/* Don't bump to new block to avoid fragmentation */
-		tbl->it_largehint = end;
+		pool->hint = end;
 	} else {
 		/* Overflow will be taken care of at the next allocation */
-		tbl->it_hint = (end + tbl->it_blocksize - 1) &
+		pool->hint = (end + tbl->it_blocksize - 1) &
 		                ~(tbl->it_blocksize - 1);
 	}
 
@@ -160,7 +199,8 @@ static unsigned long iommu_range_alloc(s
 	if (handle)
 		*handle = end;
 
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
+	spin_unlock_irqrestore(&(pool->lock), flags);
+
 	return n;
 }
 
@@ -235,23 +275,45 @@ static bool iommu_free_check(struct iomm
 	return true;
 }
 
+static struct iommu_pool *get_pool(struct iommu_table *tbl,
+				   unsigned long entry)
+{
+	struct iommu_pool *p;
+	unsigned long largepool_start = tbl->large_pool.start;
+
+	/* The large pool is the last pool at the top of the table */
+	if (entry >= largepool_start) {
+		p = &tbl->large_pool;
+	} else {
+		unsigned int pool_nr = entry / tbl->poolsize;
+
+		BUG_ON(pool_nr > tbl->nr_pools);
+		p = &tbl->pools[pool_nr];
+	}
+
+	return p;
+}
+
 static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 			 unsigned int npages)
 {
 	unsigned long entry, free_entry;
 	unsigned long flags;
+	struct iommu_pool *pool;
 
 	entry = dma_addr >> IOMMU_PAGE_SHIFT;
 	free_entry = entry - tbl->it_offset;
 
+	pool = get_pool(tbl, free_entry);
+
 	if (!iommu_free_check(tbl, dma_addr, npages))
 		return;
 
 	ppc_md.tce_free(tbl, entry, npages);
 
-	spin_lock_irqsave(&(tbl->it_lock), flags);
+	spin_lock_irqsave(&(pool->lock), flags);
 	bitmap_clear(tbl->it_map, free_entry, npages);
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
+	spin_unlock_irqrestore(&(pool->lock), flags);
 }
 
 static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
@@ -493,9 +555,8 @@ struct iommu_table *iommu_init_table(str
 	unsigned long sz;
 	static int welcomed = 0;
 	struct page *page;
-
-	/* Set aside 1/4 of the table for large allocations. */
-	tbl->it_halfpoint = tbl->it_size * 3 / 4;
+	unsigned int i;
+	struct iommu_pool *p;
 
 	/* number of bytes needed for the bitmap */
 	sz = (tbl->it_size + 7) >> 3;
@@ -514,9 +575,28 @@ struct iommu_table *iommu_init_table(str
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
 
-	tbl->it_hint = 0;
-	tbl->it_largehint = tbl->it_halfpoint;
-	spin_lock_init(&tbl->it_lock);
+	/* We only split the IOMMU table if we have 1GB or more of space */
+	if ((tbl->it_size << IOMMU_PAGE_SHIFT) >= (1UL * 1024 * 1024 * 1024))
+		tbl->nr_pools = IOMMU_NR_POOLS;
+	else
+		tbl->nr_pools = 1;
+
+	/* We reserve the top 1/4 of the table for large allocations */
+	tbl->poolsize = (tbl->it_size * 3 / 4) / IOMMU_NR_POOLS;
+
+	for (i = 0; i < IOMMU_NR_POOLS; i++) {
+		p = &tbl->pools[i];
+		spin_lock_init(&(p->lock));
+		p->start = tbl->poolsize * i;
+		p->hint = p->start;
+		p->end = p->start + tbl->poolsize;
+	}
+
+	p = &tbl->large_pool;
+	spin_lock_init(&(p->lock));
+	p->start = tbl->poolsize * i;
+	p->hint = p->start;
+	p->end = tbl->it_size;
 
 	iommu_table_clear(tbl);
 
Index: linux-build/arch/powerpc/include/asm/iommu.h
===================================================================
--- linux-build.orig/arch/powerpc/include/asm/iommu.h	2012-06-04 15:31:54.069855719 +1000
+++ linux-build/arch/powerpc/include/asm/iommu.h	2012-06-04 15:36:47.262963621 +1000
@@ -53,6 +53,16 @@ static __inline__ __attribute_const__ in
  */
 #define IOMAP_MAX_ORDER		13
 
+#define IOMMU_POOL_HASHBITS	2
+#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long start;
+	unsigned long end;
+	unsigned long hint;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 struct iommu_table {
 	unsigned long  it_busno;     /* Bus number this table belongs to */
 	unsigned long  it_size;      /* Size of iommu table in entries */
@@ -61,10 +71,10 @@ struct iommu_table {
 	unsigned long  it_index;     /* which iommu table this is */
 	unsigned long  it_type;      /* type: PCI or Virtual Bus */
 	unsigned long  it_blocksize; /* Entries in each block (cacheline) */
-	unsigned long  it_hint;      /* Hint for next alloc */
-	unsigned long  it_largehint; /* Hint for large allocs */
-	unsigned long  it_halfpoint; /* Breaking point for small/large allocs */
-	spinlock_t     it_lock;      /* Protects it_map */
+	unsigned long  poolsize;
+	unsigned long  nr_pools;
+	struct iommu_pool large_pool;
+	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map;       /* A simple allocation bitmap for now */
 };
 

^ permalink raw reply

* [PATCH 4/5] powerpc: iommu: Push spinlock into iommu_range_alloc and __iommu_free
From: Anton Blanchard @ 2012-06-04  5:44 UTC (permalink / raw)
  To: benh, paulus, olof, michael; +Cc: linuxppc-dev
In-Reply-To: <20120604154213.173feecb@kryten>


In preparation for IOMMU pools, push the spinlock into
iommu_range_alloc and __iommu_free.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-build/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-build.orig/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:43.813334034 +1000
+++ linux-build/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:45.305365604 +1000
@@ -71,6 +71,7 @@ static unsigned long iommu_range_alloc(s
 	int pass = 0;
 	unsigned long align_mask;
 	unsigned long boundary_size;
+	unsigned long flags;
 
 	align_mask = 0xffffffffffffffffl >> (64 - align_order);
 
@@ -83,6 +84,8 @@ static unsigned long iommu_range_alloc(s
 		return DMA_ERROR_CODE;
 	}
 
+	spin_lock_irqsave(&(tbl->it_lock), flags);
+
 	if (handle && *handle)
 		start = *handle;
 	else
@@ -136,6 +139,7 @@ static unsigned long iommu_range_alloc(s
 			goto again;
 		} else {
 			/* Third failure, give up */
+			spin_unlock_irqrestore(&(tbl->it_lock), flags);
 			return DMA_ERROR_CODE;
 		}
 	}
@@ -156,6 +160,7 @@ static unsigned long iommu_range_alloc(s
 	if (handle)
 		*handle = end;
 
+	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 	return n;
 }
 
@@ -165,13 +170,11 @@ static dma_addr_t iommu_alloc(struct dev
 			      unsigned long mask, unsigned int align_order,
 			      struct dma_attrs *attrs)
 {
-	unsigned long entry, flags;
+	unsigned long entry;
 	dma_addr_t ret = DMA_ERROR_CODE;
 	int build_fail;
 
-	spin_lock_irqsave(&(tbl->it_lock), flags);
 	entry = iommu_range_alloc(dev, tbl, npages, NULL, mask, align_order);
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 
 	if (unlikely(entry == DMA_ERROR_CODE))
 		return DMA_ERROR_CODE;
@@ -232,23 +235,6 @@ static bool iommu_free_check(struct iomm
 	return true;
 }
 
-static void __iommu_free_locked(struct iommu_table *tbl, dma_addr_t dma_addr,
-			 unsigned int npages)
-{
-	unsigned long entry, free_entry;
-
-	BUG_ON(!spin_is_locked(&tbl->it_lock));
-
-	entry = dma_addr >> IOMMU_PAGE_SHIFT;
-	free_entry = entry - tbl->it_offset;
-
-	if (!iommu_free_check(tbl, dma_addr, npages))
-		return;
-
-	ppc_md.tce_free(tbl, entry, npages);
-	bitmap_clear(tbl->it_map, free_entry, npages);
-}
-
 static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
 			 unsigned int npages)
 {
@@ -287,7 +273,6 @@ int iommu_map_sg(struct device *dev, str
 		 struct dma_attrs *attrs)
 {
 	dma_addr_t dma_next = 0, dma_addr;
-	unsigned long flags;
 	struct scatterlist *s, *outs, *segstart;
 	int outcount, incount, i, build_fail = 0;
 	unsigned int align;
@@ -309,8 +294,6 @@ int iommu_map_sg(struct device *dev, str
 
 	DBG("sg mapping %d elements:\n", nelems);
 
-	spin_lock_irqsave(&(tbl->it_lock), flags);
-
 	max_seg_size = dma_get_max_seg_size(dev);
 	for_each_sg(sglist, s, nelems, i) {
 		unsigned long vaddr, npages, entry, slen;
@@ -393,8 +376,6 @@ int iommu_map_sg(struct device *dev, str
 	if (ppc_md.tce_flush)
 		ppc_md.tce_flush(tbl);
 
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
-
 	DBG("mapped %d elements:\n", outcount);
 
 	/* For the sake of iommu_unmap_sg, we clear out the length in the
@@ -419,14 +400,13 @@ int iommu_map_sg(struct device *dev, str
 			vaddr = s->dma_address & IOMMU_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IOMMU_PAGE_SIZE);
-			__iommu_free_locked(tbl, vaddr, npages);
+			__iommu_free(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
 		}
 		if (s == outs)
 			break;
 	}
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 	return 0;
 }
 
@@ -436,15 +416,12 @@ void iommu_unmap_sg(struct iommu_table *
 		struct dma_attrs *attrs)
 {
 	struct scatterlist *sg;
-	unsigned long flags;
 
 	BUG_ON(direction == DMA_NONE);
 
 	if (!tbl)
 		return;
 
-	spin_lock_irqsave(&(tbl->it_lock), flags);
-
 	sg = sglist;
 	while (nelems--) {
 		unsigned int npages;
@@ -454,7 +431,7 @@ void iommu_unmap_sg(struct iommu_table *
 			break;
 		npages = iommu_num_pages(dma_handle, sg->dma_length,
 					 IOMMU_PAGE_SIZE);
-		__iommu_free_locked(tbl, dma_handle, npages);
+		__iommu_free(tbl, dma_handle, npages);
 		sg = sg_next(sg);
 	}
 
@@ -464,8 +441,6 @@ void iommu_unmap_sg(struct iommu_table *
 	 */
 	if (ppc_md.tce_flush)
 		ppc_md.tce_flush(tbl);
-
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 }
 
 static void iommu_table_clear(struct iommu_table *tbl)

^ permalink raw reply

* [PATCH 3/5] powerpc: iommu: Reduce spinlock coverage in iommu_free
From: Anton Blanchard @ 2012-06-04  5:43 UTC (permalink / raw)
  To: benh, paulus, olof, michael; +Cc: linuxppc-dev
In-Reply-To: <20120604154213.173feecb@kryten>


This patch moves tce_free outside of the lock in iommu_free.

Some performance numbers were obtained with a Chelsio T3 adapter on
two POWER7 boxes, running a 100 session TCP round robin test.

Performance improved 25% with this patch applied.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-build/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-build.orig/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:41.461284266 +1000
+++ linux-build/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:43.813334034 +1000
@@ -190,10 +190,7 @@ static dma_addr_t iommu_alloc(struct dev
 	 * not altered.
 	 */
 	if (unlikely(build_fail)) {
-		spin_lock_irqsave(&(tbl->it_lock), flags);
 		__iommu_free(tbl, ret, npages);
-		spin_unlock_irqrestore(&(tbl->it_lock), flags);
-
 		return DMA_ERROR_CODE;
 	}
 
@@ -207,8 +204,8 @@ static dma_addr_t iommu_alloc(struct dev
 	return ret;
 }
 
-static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr, 
-			 unsigned int npages)
+static bool iommu_free_check(struct iommu_table *tbl, dma_addr_t dma_addr,
+			     unsigned int npages)
 {
 	unsigned long entry, free_entry;
 
@@ -228,21 +225,53 @@ static void __iommu_free(struct iommu_ta
 			printk(KERN_INFO "\tindex     = 0x%llx\n", (u64)tbl->it_index);
 			WARN_ON(1);
 		}
-		return;
+
+		return false;
 	}
 
+	return true;
+}
+
+static void __iommu_free_locked(struct iommu_table *tbl, dma_addr_t dma_addr,
+			 unsigned int npages)
+{
+	unsigned long entry, free_entry;
+
+	BUG_ON(!spin_is_locked(&tbl->it_lock));
+
+	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	free_entry = entry - tbl->it_offset;
+
+	if (!iommu_free_check(tbl, dma_addr, npages))
+		return;
+
 	ppc_md.tce_free(tbl, entry, npages);
 	bitmap_clear(tbl->it_map, free_entry, npages);
 }
 
-static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
-		unsigned int npages)
+static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
+			 unsigned int npages)
 {
+	unsigned long entry, free_entry;
 	unsigned long flags;
 
+	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	free_entry = entry - tbl->it_offset;
+
+	if (!iommu_free_check(tbl, dma_addr, npages))
+		return;
+
+	ppc_md.tce_free(tbl, entry, npages);
+
 	spin_lock_irqsave(&(tbl->it_lock), flags);
-	__iommu_free(tbl, dma_addr, npages);
+	bitmap_clear(tbl->it_map, free_entry, npages);
 	spin_unlock_irqrestore(&(tbl->it_lock), flags);
+}
+
+static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
+		unsigned int npages)
+{
+	__iommu_free(tbl, dma_addr, npages);
 
 	/* Make sure TLB cache is flushed if the HW needs it. We do
 	 * not do an mb() here on purpose, it is not needed on any of
@@ -390,7 +419,7 @@ int iommu_map_sg(struct device *dev, str
 			vaddr = s->dma_address & IOMMU_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length,
 						 IOMMU_PAGE_SIZE);
-			__iommu_free(tbl, vaddr, npages);
+			__iommu_free_locked(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
 		}
@@ -425,7 +454,7 @@ void iommu_unmap_sg(struct iommu_table *
 			break;
 		npages = iommu_num_pages(dma_handle, sg->dma_length,
 					 IOMMU_PAGE_SIZE);
-		__iommu_free(tbl, dma_handle, npages);
+		__iommu_free_locked(tbl, dma_handle, npages);
 		sg = sg_next(sg);
 	}
 

^ permalink raw reply

* [PATCH 2/5] powerpc: iommu: Reduce spinlock coverage in iommu_alloc and iommu_free
From: Anton Blanchard @ 2012-06-04  5:43 UTC (permalink / raw)
  To: benh, paulus, olof, michael; +Cc: linuxppc-dev
In-Reply-To: <20120604154213.173feecb@kryten>


We currently hold the IOMMU spinlock around tce_build and tce_flush.
This causes our spinlock hold times to be much higher than required
and can impact multiqueue adapters.

This patch moves tce_build and tce_flush outside of the lock in
iommu_alloc, and tce_flush outside of the lock in iommu_free.

Some performance numbers were obtained with a Chelsio T3 adapter on
two POWER7 boxes, running a 100 session TCP round robin test.

Performance improved 32% with this patch applied.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-build/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-build.orig/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:38.045211977 +1000
+++ linux-build/arch/powerpc/kernel/iommu.c	2012-06-04 10:38:41.461284266 +1000
@@ -170,13 +170,11 @@ static dma_addr_t iommu_alloc(struct dev
 	int build_fail;
 
 	spin_lock_irqsave(&(tbl->it_lock), flags);
-
 	entry = iommu_range_alloc(dev, tbl, npages, NULL, mask, align_order);
+	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 
-	if (unlikely(entry == DMA_ERROR_CODE)) {
-		spin_unlock_irqrestore(&(tbl->it_lock), flags);
+	if (unlikely(entry == DMA_ERROR_CODE))
 		return DMA_ERROR_CODE;
-	}
 
 	entry += tbl->it_offset;	/* Offset into real TCE table */
 	ret = entry << IOMMU_PAGE_SHIFT;	/* Set the return dma address */
@@ -192,9 +190,10 @@ static dma_addr_t iommu_alloc(struct dev
 	 * not altered.
 	 */
 	if (unlikely(build_fail)) {
+		spin_lock_irqsave(&(tbl->it_lock), flags);
 		__iommu_free(tbl, ret, npages);
-
 		spin_unlock_irqrestore(&(tbl->it_lock), flags);
+
 		return DMA_ERROR_CODE;
 	}
 
@@ -202,8 +201,6 @@ static dma_addr_t iommu_alloc(struct dev
 	if (ppc_md.tce_flush)
 		ppc_md.tce_flush(tbl);
 
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
-
 	/* Make sure updates are seen by hardware */
 	mb();
 
@@ -244,8 +241,8 @@ static void iommu_free(struct iommu_tabl
 	unsigned long flags;
 
 	spin_lock_irqsave(&(tbl->it_lock), flags);
-
 	__iommu_free(tbl, dma_addr, npages);
+	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 
 	/* Make sure TLB cache is flushed if the HW needs it. We do
 	 * not do an mb() here on purpose, it is not needed on any of
@@ -253,8 +250,6 @@ static void iommu_free(struct iommu_tabl
 	 */
 	if (ppc_md.tce_flush)
 		ppc_md.tce_flush(tbl);
-
-	spin_unlock_irqrestore(&(tbl->it_lock), flags);
 }
 
 int iommu_map_sg(struct device *dev, struct iommu_table *tbl,

^ permalink raw reply

* [PATCH 1/5] powerpc/pseries: Disable interrupts around IOMMU percpu data accesses
From: Anton Blanchard @ 2012-06-04  5:42 UTC (permalink / raw)
  To: benh, paulus, olof, michael; +Cc: linuxppc-dev


tce_buildmulti_pSeriesLP uses a per cpu page to communicate with the
hypervisor. We currently rely on the IOMMU table spinlock but
subsequent patches will be removing that so disable interrupts
around all accesses of tce_page.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-build/arch/powerpc/platforms/pseries/iommu.c
===================================================================
--- linux-build.orig/arch/powerpc/platforms/pseries/iommu.c	2012-06-04 10:25:34.420492862 +1000
+++ linux-build/arch/powerpc/platforms/pseries/iommu.c	2012-06-04 10:25:39.300597880 +1000
@@ -192,12 +192,15 @@ static int tce_buildmulti_pSeriesLP(stru
 	long l, limit;
 	long tcenum_start = tcenum, npages_start = npages;
 	int ret = 0;
+	unsigned long flags;
 
 	if (npages == 1) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 		                           direction, attrs);
 	}
 
+	local_irq_save(flags);	/* to protect tcep and the page behind it */
+
 	tcep = __get_cpu_var(tce_page);
 
 	/* This is safe to do since interrupts are off when we're called
@@ -207,6 +210,7 @@ static int tce_buildmulti_pSeriesLP(stru
 		tcep = (u64 *)__get_free_page(GFP_ATOMIC);
 		/* If allocation fails, fall back to the loop implementation */
 		if (!tcep) {
+			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 					    direction, attrs);
 		}
@@ -240,6 +244,8 @@ static int tce_buildmulti_pSeriesLP(stru
 		tcenum += limit;
 	} while (npages > 0 && !rc);
 
+	local_irq_restore(flags);
+
 	if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
 		ret = (int)rc;
 		tce_freemulti_pSeriesLP(tbl, tcenum_start,

^ permalink raw reply

* Re: [PATCH] powerpc/time: Sanity check of decrementer expiration is necessary
From: Anton Blanchard @ 2012-06-04  2:31 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev
In-Reply-To: <20120601100102.GA11714@pale.ozlabs.ibm.com>

Hi Paul,

> This reverts 68568add2c ("powerpc/time: Remove unnecessary sanity
> check of decrementer expiration").  We do need to check whether we
> have reached the expiration time of the next event, because we
> sometimes get an early decrementer interrupt, most notably when we
> set the decrementer to 1 in arch_irq_work_raise().  The effect of not
> having the sanity check is that if timer_interrupt() gets called
> early, we leave the decrementer set to its maximum value, which means
> we then don't get any more decrementer interrupts for about 4 seconds
> (or longer, depending on timebase frequency).  I saw these pauses as
> a consequence of getting a stray hypervisor decrementer interrupt
> left over from exiting a KVM guest.

Urgh, sorry for that mess.

Acked-by: Anton Blanchard <anton@samba.org>

Anton

> This isn't quite a straight revert because of changes to the
> surrounding code, but it restores the same algorithm as was
> previously used.
> 
> Cc: stable@kernel.org
> Cc: Anton Blanchard <anton@samba.org>
> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> If there are no objections, I'll send this to Linus shortly.  This
> regression is present in 3.3 and 3.4 as well as current upstream.
> 
>  arch/powerpc/kernel/time.c |   14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 99a995c..be171ee 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -475,6 +475,7 @@ void timer_interrupt(struct pt_regs * regs)
>  	struct pt_regs *old_regs;
>  	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
>  	struct clock_event_device *evt =
> &__get_cpu_var(decrementers);
> +	u64 now;
>  
>  	/* Ensure a positive value is written to the decrementer, or
> else
>  	 * some CPUs will continue to take decrementer exceptions.
> @@ -509,9 +510,16 @@ void timer_interrupt(struct pt_regs * regs)
>  		irq_work_run();
>  	}
>  
> -	*next_tb = ~(u64)0;
> -	if (evt->event_handler)
> -		evt->event_handler(evt);
> +	now = get_tb_or_rtc();
> +	if (now >= *next_tb) {
> +		*next_tb = ~(u64)0;
> +		if (evt->event_handler)
> +			evt->event_handler(evt);
> +	} else {
> +		now = *next_tb - now;
> +		if (now <= DECREMENTER_MAX)
> +			set_dec((int)now);
> +	}
>  
>  #ifdef CONFIG_PPC64
>  	/* collect purr register values often, for accurate
> calculations */

^ permalink raw reply

* [PATCH] powerpc: pseries: Round up MSI-X requests
From: Anton Blanchard @ 2012-06-03 23:15 UTC (permalink / raw)
  To: benh, paulus, michael; +Cc: linuxppc-dev


The pseries firmware currently refuses any non power of two MSI-X
request. Unfortunately most network drivers end up asking for that
because they want a power of two for RX queues and one or two extra
for everything else.

This patch rounds up the firmware request to the next power of two
if the quota allows it.

Signed-off-by: Anton Blanchard <anton@samba.org>
---        

Index: linux-build/arch/powerpc/platforms/pseries/msi.c
===================================================================
--- linux-build.orig/arch/powerpc/platforms/pseries/msi.c	2012-06-03 20:49:29.082280031 +1000
+++ linux-build/arch/powerpc/platforms/pseries/msi.c	2012-06-04 09:06:55.909732276 +1000
@@ -402,6 +402,18 @@ static int rtas_setup_msi_irqs(struct pc
 		return -EINVAL;
 
 	/*
+	 * Firmware currently refuse any non power of two allocation
+	 * so we round up if the quota will allow it.
+	 */
+	if (type == PCI_CAP_ID_MSIX) {
+		int m = roundup_pow_of_two(nvec);
+		int quota = msi_quota_for_device(pdev, m);
+
+		if (quota >= m)
+			nvec = m;
+	}
+
+	/*
 	 * Try the new more explicit firmware interface, if that fails fall
 	 * back to the old interface. The old interface is known to never
 	 * return MSI-Xs.

^ permalink raw reply

* Re: [RFC] [PATCH] powerpc: Add MSR_DE to MSR_KERNEL
From: Joakim Tjernlund @ 2012-06-03  9:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Scott Wood, linuxppc-dev, Bob Cochran, support
In-Reply-To: <1338672076.7150.2.camel@pasglop>

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 2012/06/02 23:21:16:
>
> On Sat, 2012-06-02 at 20:29 +0200, Joakim Tjernlund wrote:
> >
> > hmm, where does this go w.r.t the patch? Got the feeling that the
> > best thing is to just turn MSR:DE on and be done with it?
>
> Not unconditionally, we need to have a close look, that might be ok
> specifically for BookE 32-bit, it's certainly not ok for BookE 64-bit at
> this point.
>
> For now, I'm ok with a debug CONFIG_* option.

OK, I will wrap this with the existing CONFIG_BDI_SWITCH and only for booke

>
> Also do we know if MSR:DE has any performance impact on any CPU ? I know
> having DACs enabled has a major impact on some for example.

No idea, this is something for Freescale to dwell on.

^ permalink raw reply

* [PATCH 3/3 v2] powerpc/mpic: FSL MPIC error interrupt support.
From: Varun Sethi @ 2012-06-03  7:45 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Bogdan Hamciuc, Varun Sethi

All SOC device error interrupts are muxed and delivered to the core as a single
MPIC error interrupt. Currently all the device drivers requiring access to device
errors have to register for the MPIC error interrupt as a shared interrupt.

With this patch we add interrupt demuxing capability in the mpic driver, allowing
device drivers to register for their individual error interrupts. This is achieved
by handling error interrupts in a cascaded fashion.

MPIC error interrupt is handled by the "error_int_handler", which subsequently demuxes
it using the EISR and delivers it to the respective drivers. 

The error interrupt capability is dependent on the MPIC EIMR register, which was
introduced in FSL MPIC version 4.1 (P4080 rev2). So, error interrupt demuxing capability
is dependent on the MPIC version and can be used for versions >= 4.1.

Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
Signed-off-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
[In the initial version of the patch we were using handle_simple_irq
 as the handler for cascaded error interrupts, this resulted
 in issues in case of threaded isrs (with RT kernel). This issue was
 debugged by Bogdan and decision was taken to use the handle_level_irq
 handler]
---
 arch/powerpc/include/asm/mpic.h    |   17 ++++
 arch/powerpc/sysdev/Makefile       |    2 +-
 arch/powerpc/sysdev/fsl_mpic_err.c |  157 ++++++++++++++++++++++++++++++++++++
 arch/powerpc/sysdev/mpic.c         |   35 ++++++++-
 arch/powerpc/sysdev/mpic.h         |   22 +++++
 5 files changed, 231 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/sysdev/fsl_mpic_err.c

diff --git a/arch/powerpc/include/asm/mpic.h b/arch/powerpc/include/asm/mpic.h
index e14d35d..71b42b9 100644
--- a/arch/powerpc/include/asm/mpic.h
+++ b/arch/powerpc/include/asm/mpic.h
@@ -114,10 +114,17 @@
 #define MPIC_FSL_BRR1			0x00000
 #define 	MPIC_FSL_BRR1_VER			0x0000ffff
 
+/*
+ * Error interrupt registers
+ */
+
+
 #define MPIC_MAX_IRQ_SOURCES	2048
 #define MPIC_MAX_CPUS		32
 #define MPIC_MAX_ISU		32
 
+#define MPIC_MAX_ERR      32
+
 /*
  * Tsi108 implementation of MPIC has many differences from the original one
  */
@@ -270,6 +277,7 @@ struct mpic
 	struct irq_chip		hc_ipi;
 #endif
 	struct irq_chip		hc_tm;
+	struct irq_chip		hc_err;
 	const char		*name;
 	/* Flags */
 	unsigned int		flags;
@@ -283,6 +291,8 @@ struct mpic
 	/* vector numbers used for internal sources (ipi/timers) */
 	unsigned int		ipi_vecs[4];
 	unsigned int		timer_vecs[8];
+	/* vector numbers used for FSL MPIC error interrupts */
+	unsigned int		err_int_vecs[MPIC_MAX_ERR];
 
 	/* Spurious vector to program into unused sources */
 	unsigned int		spurious_vec;
@@ -306,6 +316,11 @@ struct mpic
 	struct mpic_reg_bank	cpuregs[MPIC_MAX_CPUS];
 	struct mpic_reg_bank	isus[MPIC_MAX_ISU];
 
+	/* ioremap'ed base for error interrupt registers */
+	u32 __iomem	*err_regs;
+	/* error interrupt config */
+	u32			err_int_config_done;
+
 	/* Protected sources */
 	unsigned long		*protected;
 
@@ -370,6 +385,8 @@ struct mpic
 #define MPIC_NO_RESET			0x00004000
 /* Freescale MPIC (compatible includes "fsl,mpic") */
 #define MPIC_FSL			0x00008000
+/* Freescale MPIC supports EIMR (error interrupt mask register)*/
+#define MPIC_FSL_HAS_EIMR		0x00010000
 
 /* MPIC HW modification ID */
 #define MPIC_REGSET_MASK		0xf0000000
diff --git a/arch/powerpc/sysdev/Makefile b/arch/powerpc/sysdev/Makefile
index 1bd7ecb..a57600b 100644
--- a/arch/powerpc/sysdev/Makefile
+++ b/arch/powerpc/sysdev/Makefile
@@ -15,7 +15,7 @@ obj-$(CONFIG_PPC_DCR_NATIVE)	+= dcr-low.o
 obj-$(CONFIG_PPC_PMI)		+= pmi.o
 obj-$(CONFIG_U3_DART)		+= dart_iommu.o
 obj-$(CONFIG_MMIO_NVRAM)	+= mmio_nvram.o
-obj-$(CONFIG_FSL_SOC)		+= fsl_soc.o
+obj-$(CONFIG_FSL_SOC)		+= fsl_soc.o fsl_mpic_err.o
 obj-$(CONFIG_FSL_PCI)		+= fsl_pci.o $(fsl-msi-obj-y)
 obj-$(CONFIG_FSL_PMC)		+= fsl_pmc.o
 obj-$(CONFIG_FSL_LBC)		+= fsl_lbc.o
diff --git a/arch/powerpc/sysdev/fsl_mpic_err.c b/arch/powerpc/sysdev/fsl_mpic_err.c
new file mode 100644
index 0000000..f2d28f2
--- /dev/null
+++ b/arch/powerpc/sysdev/fsl_mpic_err.c
@@ -0,0 +1,157 @@
+/*
+ * Copyright (C) 2012 Freescale Semiconductor, Inc.
+ *
+ * Author: Varun Sethi <varun.sethi@freescale.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/irq.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/mpic.h>
+
+#include "mpic.h"
+
+#define MPIC_ERR_INT_BASE	0x3900
+#define MPIC_ERR_INT_EISR	0x0000
+#define MPIC_ERR_INT_EIMR	0x0010
+
+static inline u32 fsl_mpic_err_read(u32 __iomem *base, unsigned int err_reg)
+{
+	return in_be32(base + (err_reg >> 2));
+}
+
+static inline void fsl_mpic_err_write(u32 __iomem *base, unsigned int err_reg,
+				   u32 value)
+{
+	out_be32(base + (err_reg >> 2), value);
+}
+
+static void fsl_mpic_mask_err(struct irq_data *d)
+{
+	u32 eimr;
+	struct mpic *mpic = irq_data_get_irq_chip_data(d);
+	unsigned int src = virq_to_hw(d->irq) - mpic->err_int_vecs[0];
+	unsigned int err_reg_offset = MPIC_ERR_INT_EIMR;
+
+	eimr = fsl_mpic_err_read(mpic->err_regs, err_reg_offset);
+	eimr |= (1 << (31 - src));
+	fsl_mpic_err_write(mpic->err_regs, err_reg_offset, eimr);
+}
+
+static void fsl_mpic_unmask_err(struct irq_data *d)
+{
+	u32 eimr;
+	struct mpic *mpic = irq_data_get_irq_chip_data(d);
+	unsigned int src = virq_to_hw(d->irq) - mpic->err_int_vecs[0];
+	unsigned int err_reg_offset = MPIC_ERR_INT_EIMR;
+
+	eimr = fsl_mpic_err_read(mpic->err_regs, err_reg_offset);
+	eimr &= ~(1 << (31 - src));
+	fsl_mpic_err_write(mpic->err_regs, err_reg_offset, eimr);
+}
+
+static struct irq_chip fsl_mpic_err_chip = {
+	.irq_disable	= fsl_mpic_mask_err,
+	.irq_mask	= fsl_mpic_mask_err,
+	.irq_unmask	= fsl_mpic_unmask_err,
+};
+
+void mpic_setup_error_int(struct mpic *mpic, int intvec)
+{
+	int i;
+
+	mpic->err_regs = ioremap(mpic->paddr + MPIC_ERR_INT_BASE, 0x1000);
+	if (!mpic->err_regs) {
+		pr_err("could not map mpic error registers\n");
+		return;
+	}
+	mpic->hc_err = fsl_mpic_err_chip;
+	mpic->hc_err.name = mpic->name;
+	mpic->flags |= MPIC_FSL_HAS_EIMR;
+	/* allocate interrupt vectors for error interrupts */
+	for (i = MPIC_MAX_ERR - 1; i >= 0; i--)
+		mpic->err_int_vecs[i] = --intvec;
+
+}
+
+int mpic_map_error_int(struct mpic *mpic, unsigned int virq, irq_hw_number_t  hw)
+{
+	if ((mpic->flags & MPIC_FSL_HAS_EIMR) &&
+	    (hw >= mpic->err_int_vecs[0] &&
+	     hw <= mpic->err_int_vecs[MPIC_MAX_ERR - 1])) {
+		WARN_ON(mpic->flags & MPIC_SECONDARY);
+
+		pr_debug("mpic: mapping as Error Interrupt\n");
+		irq_set_chip_data(virq, mpic);
+		irq_set_chip_and_handler(virq, &mpic->hc_err,
+					 handle_level_irq);
+		return 1;
+	}
+
+	return 0;
+}
+
+static irqreturn_t fsl_error_int_handler(int irq, void *data)
+{
+	struct mpic *mpic = (struct mpic *) data;
+	unsigned int eisr_offset = MPIC_ERR_INT_EISR;
+	unsigned int eimr_offset = MPIC_ERR_INT_EIMR;
+	u32 eisr, eimr;
+	int errint;
+	unsigned int cascade_irq;
+
+	eisr = fsl_mpic_err_read(mpic->err_regs, eisr_offset);
+	eimr = fsl_mpic_err_read(mpic->err_regs, eimr_offset);
+
+	if (!(eisr & ~eimr))
+		return IRQ_NONE;
+
+	while (eisr) {
+		errint = __builtin_clz(eisr);
+		cascade_irq = irq_linear_revmap(mpic->irqhost,
+				 mpic->err_int_vecs[errint]);
+		WARN_ON(cascade_irq == NO_IRQ);
+		if (cascade_irq != NO_IRQ) {
+			generic_handle_irq(cascade_irq);
+		} else {
+			eimr |=  1 << (31 - errint);
+			fsl_mpic_err_write(mpic->err_regs, eimr_offset, eimr);
+		}
+		eisr &= ~(1 << (31 - errint));
+	}
+
+	return IRQ_HANDLED;
+}
+
+int mpic_err_int_init(struct mpic *mpic, irq_hw_number_t irqnum)
+{
+	unsigned int virq;
+	unsigned int offset = MPIC_ERR_INT_EIMR;
+	int ret;
+
+	virq = irq_create_mapping(mpic->irqhost, irqnum);
+	if (virq == NO_IRQ) {
+		pr_err("Error interrupt setup failed\n");
+		return -ENOSPC;
+	}
+
+	fsl_mpic_err_write(mpic->err_regs, offset, ~0);
+
+	ret = request_irq(virq, fsl_error_int_handler, IRQF_NO_THREAD,
+		    "mpic-error-int", mpic);
+	if (ret) {
+		pr_err("Failed to register error interrupt handler\n");
+		return ret;
+	}
+
+	return 0;
+}
diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index 61c7225..7002ef3 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -1026,6 +1026,9 @@ static int mpic_host_map(struct irq_domain *h, unsigned int virq,
 		return 0;
 	}
 
+	if (mpic_map_error_int(mpic, virq, hw))
+		return 0;
+
 	if (hw >= mpic->num_sources)
 		return -EINVAL;
 
@@ -1085,7 +1088,24 @@ static int mpic_host_xlate(struct irq_domain *h, struct device_node *ct,
 		 */
 		switch (intspec[2]) {
 		case 0:
-		case 1: /* no EISR/EIMR support for now, treat as shared IRQ */
+			break;
+		case 1:
+			if (!(mpic->flags & MPIC_FSL_HAS_EIMR))
+				break;
+
+			if (intspec[3] >= ARRAY_SIZE(mpic->err_int_vecs))
+				return -EINVAL;
+
+			if (!mpic->err_int_config_done) {
+				int ret;
+				ret = mpic_err_int_init(mpic, intspec[0]);
+				if (ret)
+					return ret;
+				mpic->err_int_config_done = 1;
+			}
+
+			*out_hwirq = mpic->err_int_vecs[intspec[3]];
+
 			break;
 		case 2:
 			if (intspec[0] >= ARRAY_SIZE(mpic->ipi_vecs))
@@ -1302,6 +1322,8 @@ struct mpic * __init mpic_alloc(struct device_node *node,
 	mpic_map(mpic, mpic->paddr, &mpic->tmregs, MPIC_INFO(TIMER_BASE), 0x1000);
 
 	if (mpic->flags & MPIC_FSL) {
+		u32 brr1, version;
+
 		/*
 		 * Yes, Freescale really did put global registers in the
 		 * magic per-cpu area -- and they don't even show up in the
@@ -1309,6 +1331,17 @@ struct mpic * __init mpic_alloc(struct device_node *node,
 		 */
 		mpic_map(mpic, mpic->paddr, &mpic->thiscpuregs,
 			 MPIC_CPU_THISBASE, 0x1000);
+
+		brr1 = _mpic_read(mpic->reg_type, &mpic->thiscpuregs,
+				MPIC_FSL_BRR1);
+		version = brr1 & MPIC_FSL_BRR1_VER;
+
+		/* Error interrupt mask register (EIMR) is required for
+		 * handling individual device error interrupts. EIMR
+		 * was added in MPIC version 4.1.
+		 */
+		if (version >= 0x401)
+			mpic_setup_error_int(mpic, intvec_top - 12);
 	}
 
 	/* Reset */
diff --git a/arch/powerpc/sysdev/mpic.h b/arch/powerpc/sysdev/mpic.h
index 13f3e89..1a6995a 100644
--- a/arch/powerpc/sysdev/mpic.h
+++ b/arch/powerpc/sysdev/mpic.h
@@ -40,4 +40,26 @@ extern int mpic_set_affinity(struct irq_data *d,
 			     const struct cpumask *cpumask, bool force);
 extern void mpic_reset_core(int cpu);
 
+#ifdef CONFIG_FSL_SOC
+extern int mpic_map_error_int(struct mpic *mpic, unsigned int virq, irq_hw_number_t  hw);
+extern int mpic_err_int_init(struct mpic *mpic, irq_hw_number_t irqnum);
+extern void mpic_setup_error_int(struct mpic *mpic, int intvec);
+#else
+static inline int mpic_map_error_int(struct mpic *mpic, unsigned int virq, irq_hw_number_t  hw)
+{
+	return 0;
+}
+
+
+static inline int mpic_err_int_init(struct mpic *mpic, irq_hw_number_t irqnum)
+{
+	return -1;
+}
+
+static inline void mpic_setup_error_int(struct mpic *mpic, int intvec)
+{
+	return;
+}
+#endif
+
 #endif /* _POWERPC_SYSDEV_MPIC_H */
-- 
1.7.2.2

^ permalink raw reply related

* [PATCH 2/3 v2] powerpc/mpic: Use the MPIC_LARGE_VECTORS flag for FSL MPIC.
From: Varun Sethi @ 2012-06-03  7:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Varun Sethi

We should use the MPIC_LARG_VECTORS flag while intializing the MPIC. 
This prevents us from eating in to hardware vector number space (MSIs)
while setting up internal sources.

Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
---
 arch/powerpc/sysdev/mpic.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index a98eb77..61c7225 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -1211,7 +1211,7 @@ struct mpic * __init mpic_alloc(struct device_node *node,
 	if (of_get_property(node, "single-cpu-affinity", NULL))
 		flags |= MPIC_SINGLE_DEST_CPU;
 	if (of_device_is_compatible(node, "fsl,mpic"))
-		flags |= MPIC_FSL;
+		flags |= MPIC_FSL | MPIC_LARGE_VECTORS;
 
 	mpic = kzalloc(sizeof(struct mpic), GFP_KERNEL);
 	if (mpic == NULL)
-- 
1.7.2.2

^ permalink raw reply related

* [PATCH 1/3] powerpc/mpic: finish supporting timer group B on Freescale chips
From: Varun Sethi @ 2012-06-03  7:41 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, Varun Sethi

Previously, these interrupts would be mapped, but the offset
calculation was broken, and only the first group was initialized.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
 arch/powerpc/include/asm/mpic.h |    5 +++
 arch/powerpc/sysdev/mpic.c      |   58 ++++++++++++++++++++++++++++-----------
 2 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/mpic.h b/arch/powerpc/include/asm/mpic.h
index c9f698a..e14d35d 100644
--- a/arch/powerpc/include/asm/mpic.h
+++ b/arch/powerpc/include/asm/mpic.h
@@ -63,6 +63,7 @@
  */
 #define MPIC_TIMER_BASE			0x01100
 #define MPIC_TIMER_STRIDE		0x40
+#define MPIC_TIMER_GROUP_STRIDE		0x1000
 
 #define MPIC_TIMER_CURRENT_CNT		0x00000
 #define MPIC_TIMER_BASE_CNT		0x00010
@@ -110,6 +111,9 @@
 #define 	MPIC_VECPRI_SENSE_MASK			0x00400000
 #define MPIC_IRQ_DESTINATION		0x00010
 
+#define MPIC_FSL_BRR1			0x00000
+#define 	MPIC_FSL_BRR1_VER			0x0000ffff
+
 #define MPIC_MAX_IRQ_SOURCES	2048
 #define MPIC_MAX_CPUS		32
 #define MPIC_MAX_ISU		32
@@ -296,6 +300,7 @@ struct mpic
 	phys_addr_t paddr;
 
 	/* The various ioremap'ed bases */
+	struct mpic_reg_bank	thiscpuregs;
 	struct mpic_reg_bank	gregs;
 	struct mpic_reg_bank	tmregs;
 	struct mpic_reg_bank	cpuregs[MPIC_MAX_CPUS];
diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index 395af13..a98eb77 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -6,7 +6,7 @@
  *  with various broken implementations of this HW.
  *
  *  Copyright (C) 2004 Benjamin Herrenschmidt, IBM Corp.
- *  Copyright 2010-2011 Freescale Semiconductor, Inc.
+ *  Copyright 2010-2012 Freescale Semiconductor, Inc.
  *
  *  This file is subject to the terms and conditions of the GNU General Public
  *  License.  See the file COPYING in the main directory of this archive
@@ -221,24 +221,24 @@ static inline void _mpic_ipi_write(struct mpic *mpic, unsigned int ipi, u32 valu
 	_mpic_write(mpic->reg_type, &mpic->gregs, offset, value);
 }
 
-static inline u32 _mpic_tm_read(struct mpic *mpic, unsigned int tm)
+static inline unsigned int mpic_tm_offset(struct mpic *mpic, unsigned int tm)
 {
-	unsigned int offset = MPIC_INFO(TIMER_VECTOR_PRI) +
-			      ((tm & 3) * MPIC_INFO(TIMER_STRIDE));
+	return (tm >> 2) * MPIC_TIMER_GROUP_STRIDE +
+	       (tm & 3) * MPIC_INFO(TIMER_STRIDE);
+}
 
-	if (tm >= 4)
-		offset += 0x1000 / 4;
+static inline u32 _mpic_tm_read(struct mpic *mpic, unsigned int tm)
+{
+	unsigned int offset = mpic_tm_offset(mpic, tm) +
+			      MPIC_INFO(TIMER_VECTOR_PRI);
 
 	return _mpic_read(mpic->reg_type, &mpic->tmregs, offset);
 }
 
 static inline void _mpic_tm_write(struct mpic *mpic, unsigned int tm, u32 value)
 {
-	unsigned int offset = MPIC_INFO(TIMER_VECTOR_PRI) +
-			      ((tm & 3) * MPIC_INFO(TIMER_STRIDE));
-
-	if (tm >= 4)
-		offset += 0x1000 / 4;
+	unsigned int offset = mpic_tm_offset(mpic, tm) +
+			      MPIC_INFO(TIMER_VECTOR_PRI);
 
 	_mpic_write(mpic->reg_type, &mpic->tmregs, offset, value);
 }
@@ -1301,6 +1301,16 @@ struct mpic * __init mpic_alloc(struct device_node *node,
 	mpic_map(mpic, mpic->paddr, &mpic->gregs, MPIC_INFO(GREG_BASE), 0x1000);
 	mpic_map(mpic, mpic->paddr, &mpic->tmregs, MPIC_INFO(TIMER_BASE), 0x1000);
 
+	if (mpic->flags & MPIC_FSL) {
+		/*
+		 * Yes, Freescale really did put global registers in the
+		 * magic per-cpu area -- and they don't even show up in the
+		 * non-magic per-cpu copies that this driver normally uses.
+		 */
+		mpic_map(mpic, mpic->paddr, &mpic->thiscpuregs,
+			 MPIC_CPU_THISBASE, 0x1000);
+	}
+
 	/* Reset */
 
 	/* When using a device-node, reset requests are only honored if the MPIC
@@ -1440,6 +1450,7 @@ void __init mpic_assign_isu(struct mpic *mpic, unsigned int isu_num,
 void __init mpic_init(struct mpic *mpic)
 {
 	int i, cpu;
+	int num_timers = 4;
 
 	BUG_ON(mpic->num_sources == 0);
 
@@ -1448,15 +1459,30 @@ void __init mpic_init(struct mpic *mpic)
 	/* Set current processor priority to max */
 	mpic_cpu_write(MPIC_INFO(CPU_CURRENT_TASK_PRI), 0xf);
 
+	if (mpic->flags & MPIC_FSL) {
+		u32 brr1 = _mpic_read(mpic->reg_type, &mpic->thiscpuregs,
+				      MPIC_FSL_BRR1);
+		u32 version = brr1 & MPIC_FSL_BRR1_VER;
+
+		/*
+		 * Timer group B is present at the latest in MPIC 3.1 (e.g.
+		 * mpc8536).  It is not present in MPIC 2.0 (e.g. mpc8544).
+		 * I don't know about the status of intermediate versions (or
+		 * whether they even exist).
+		 */
+		if (version >= 0x0301)
+			num_timers = 8;
+	}
+
 	/* Initialize timers to our reserved vectors and mask them for now */
-	for (i = 0; i < 4; i++) {
+	for (i = 0; i < num_timers; i++) {
+		unsigned int offset = mpic_tm_offset(mpic, i);
+
 		mpic_write(mpic->tmregs,
-			   i * MPIC_INFO(TIMER_STRIDE) +
-			   MPIC_INFO(TIMER_DESTINATION),
+			   offset + MPIC_INFO(TIMER_DESTINATION),
 			   1 << hard_smp_processor_id());
 		mpic_write(mpic->tmregs,
-			   i * MPIC_INFO(TIMER_STRIDE) +
-			   MPIC_INFO(TIMER_VECTOR_PRI),
+			   offset + MPIC_INFO(TIMER_VECTOR_PRI),
 			   MPIC_VECPRI_MASK |
 			   (9 << MPIC_VECPRI_PRIORITY_SHIFT) |
 			   (mpic->timer_vecs[0] + i));
-- 
1.7.2.2

^ permalink raw reply related

* [PATCH 0/3] powerpc/mpic: Enhancements for FSL MPIC.
From: Varun Sethi @ 2012-06-03  7:40 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Varun Sethi

This patchset adds/fixes the following functionality specific to the
FSL MPIC:
1. Fix support for timer group B interrupts. Previously these were
not getting initialized.

2. Use the MPIC_LARGE_VECTORS flag while intializing FSL MPIC.
This prevents us from eating in to hardware vector number
space (MSIs) while setting up internal sources.

3.Cascaded handling for the MPIC error interrupt. This is possible
with FSL MPIC version >= 4.1.

The patches are based on "next" branch of Benjamin Herrenschmidt's powerpc
linux tree.

Varun Sethi (3):
  Support time group b on freescale chips.
  Use MPIC_LARGE_VECTORS flag for Freescale MPIC.
  Add support for cascaded error interrupt handling.

 arch/powerpc/include/asm/mpic.h          |   22 ++++
 arch/powerpc/sysdev/Makefile             |    2 +-
 arch/powerpc/sysdev/fsl_mpic_err.c       |  157 ++++++++++++++++++++++++++++++
 arch/powerpc/sysdev/mpic.c               |   95 +++++++++++++++----
 arch/powerpc/sysdev/mpic.h               |   22 ++++
 6 files changed, 338 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/sysdev/fsl_mpic_err.c

-- 
1.7.2.2

^ permalink raw reply

* Re: [v3 PATCH 3/3] ppc32/kprobe: don't emulate store when kprobe stwu r1
From: tiejun.chen @ 2012-06-03  7:10 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <4FCB0B47.5080500@windriver.com>

On 06/03/2012 02:59 PM, tiejun.chen wrote:
> On 06/03/2012 01:07 PM, Tiejun Chen wrote:
>> We don't do the real store operation for kprobing 'stwu Rx,(y)R1'
>> since this may corrupt the exception frame, now we will do this
>> operation safely in exception return code after migrate current
>> exception frame below the kprobed function stack.
>>
>> So we only update gpr[1] here and trigger a thread flag to mask
>> this.
>>
>> Note we should make sure if we trigger kernel stack over flow.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
>> ---
>>  arch/powerpc/lib/sstep.c |   37 +++++++++++++++++++++++++++++++++++--
>>  1 files changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
>> index 9a52349..a4ce463 100644
>> --- a/arch/powerpc/lib/sstep.c
>> +++ b/arch/powerpc/lib/sstep.c
>> @@ -566,7 +566,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
>>  	unsigned long int ea;
>>  	unsigned int cr, mb, me, sh;
>>  	int err;
>> -	unsigned long old_ra;
>> +	unsigned long old_ra, val3, r1;
>>  	long ival;
>>  
>>  	opcode = instr >> 26;
>> @@ -1486,11 +1486,44 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
>>  		goto ldst_done;
>>  
>>  	case 36:	/* stw */
>> -	case 37:	/* stwu */
>>  		val = regs->gpr[rd];
>>  		err = write_mem(val, dform_ea(instr, regs), 4, regs);
>>  		goto ldst_done;
>>  
>> +	case 37:	/* stwu */
>> +		__asm__ __volatile__("mr %0,1" : "=r" (r1) :);
> 
> I'll remove this line, please see below.
> 
>> +
>> +		val = regs->gpr[rd];
>> +		val3 = dform_ea(instr, regs);
>> +		/*
>> +		 * For PPC32 we always use stwu to change stack point with r1. So
>> +		 * this emulated store may corrupt the exception frame, now we
>> +		 * have to provide the exception frame trampoline, which is pushed
>> +		 * below the kprobed function stack. So we only update gpr[1] but
>> +		 * don't emulate the real store operation. We will do real store
>> +		 * operation safely in exception return code by checking this flag.
>> +		 */
>> +		if ((ra == 1) && !(regs->msr & MSR_PR) && (val3 >= r1)) {

And I also should change

(val3 >= r1) to (val3 >= (regs->r1 - STACK_INT_FRAME_SIZE)) since its worth
doing this only we'll really overwrite this exception stack.

Tiejun

>> +			/*
>> +			 * Check if we will touch kernel sack overflow
>> +			 */
>> +			if (r1 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {
> 
> OOPS. This line should be:
> 			
> 		if (val3 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {
> 
> Tiejun
> 
>> +				printk(KERN_CRIT "Can't kprobe this since Kernel stack overflow.\n");
>> +				err = -EINVAL;
>> +				break;
>> +			}
>> +
>> +			/*
>> +			 * Check if we already set since that means we'll
>> +			 * lose the previous value.
>> +			 */
>> +			WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
>> +			set_thread_flag(TIF_EMULATE_STACK_STORE);
>> +			err = 0;
>> +		} else
>> +			err = write_mem(val, val3, 4, regs);
>> +		goto ldst_done;
>> +
>>  	case 38:	/* stb */
>>  	case 39:	/* stbu */
>>  		val = regs->gpr[rd];
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 
> 

^ permalink raw reply

* Re: [v3 PATCH 3/3] ppc32/kprobe: don't emulate store when kprobe stwu r1
From: tiejun.chen @ 2012-06-03  6:59 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1338700063-30670-3-git-send-email-tiejun.chen@windriver.com>

On 06/03/2012 01:07 PM, Tiejun Chen wrote:
> We don't do the real store operation for kprobing 'stwu Rx,(y)R1'
> since this may corrupt the exception frame, now we will do this
> operation safely in exception return code after migrate current
> exception frame below the kprobed function stack.
> 
> So we only update gpr[1] here and trigger a thread flag to mask
> this.
> 
> Note we should make sure if we trigger kernel stack over flow.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
> ---
>  arch/powerpc/lib/sstep.c |   37 +++++++++++++++++++++++++++++++++++--
>  1 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 9a52349..a4ce463 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -566,7 +566,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
>  	unsigned long int ea;
>  	unsigned int cr, mb, me, sh;
>  	int err;
> -	unsigned long old_ra;
> +	unsigned long old_ra, val3, r1;
>  	long ival;
>  
>  	opcode = instr >> 26;
> @@ -1486,11 +1486,44 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
>  		goto ldst_done;
>  
>  	case 36:	/* stw */
> -	case 37:	/* stwu */
>  		val = regs->gpr[rd];
>  		err = write_mem(val, dform_ea(instr, regs), 4, regs);
>  		goto ldst_done;
>  
> +	case 37:	/* stwu */
> +		__asm__ __volatile__("mr %0,1" : "=r" (r1) :);

I'll remove this line, please see below.

> +
> +		val = regs->gpr[rd];
> +		val3 = dform_ea(instr, regs);
> +		/*
> +		 * For PPC32 we always use stwu to change stack point with r1. So
> +		 * this emulated store may corrupt the exception frame, now we
> +		 * have to provide the exception frame trampoline, which is pushed
> +		 * below the kprobed function stack. So we only update gpr[1] but
> +		 * don't emulate the real store operation. We will do real store
> +		 * operation safely in exception return code by checking this flag.
> +		 */
> +		if ((ra == 1) && !(regs->msr & MSR_PR) && (val3 >= r1)) {
> +			/*
> +			 * Check if we will touch kernel sack overflow
> +			 */
> +			if (r1 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {

OOPS. This line should be:
			
		if (val3 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {

Tiejun

> +				printk(KERN_CRIT "Can't kprobe this since Kernel stack overflow.\n");
> +				err = -EINVAL;
> +				break;
> +			}
> +
> +			/*
> +			 * Check if we already set since that means we'll
> +			 * lose the previous value.
> +			 */
> +			WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
> +			set_thread_flag(TIF_EMULATE_STACK_STORE);
> +			err = 0;
> +		} else
> +			err = write_mem(val, val3, 4, regs);
> +		goto ldst_done;
> +
>  	case 38:	/* stb */
>  	case 39:	/* stbu */
>  		val = regs->gpr[rd];

^ permalink raw reply

* Re: [PATCH] powerpc: Fix size of st_nlink on 64bit
From: Stephen Rothwell @ 2012-06-03  5:30 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: michael, linuxppc-dev, paulus, viro
In-Reply-To: <20120603134836.4ffbd73d@kryten>

[-- Attachment #1: Type: text/plain, Size: 780 bytes --]

Hi Anton,

On Sun, 3 Jun 2012 13:48:36 +1000 Anton Blanchard <anton@samba.org> wrote:
>
> > > commit e57f93cc53b7 (powerpc: get rid of nlink_t uses, switch to
> > > explicitly-sized type) changed the size of st_nlink on ppc64 from
> > > a long to a short, resulting in boot failures.
> > > 
> > > Signed-off-by: Anton Blanchard <anton@samba.org>
> > 
> > Would this affect my (early user mode) boot problems reported
> > yesterday;
> > 
> > /init: 71: mknod: Permission denied
> > /init: 88: mknod: Permission denied
> > /init: 88: mknod: Permission denied
> 
> Very similar to the errors I was seeing so I think the patch will fix
> it.

Great.  One less thing to bisect tomorrow :-)

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [v3 PATCH 0/3] ppc32/kprobe: Fix a bug for kprobe stwu r1
From: tiejun.chen @ 2012-06-03  5:14 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1338700063-30670-4-git-send-email-tiejun.chen@windriver.com>

On 06/03/2012 01:07 PM, Tiejun Chen wrote:
> Changes from V2:
> 
> * populate those existed codes to reorganize codes
> * add check if we'll trigger kernel stack over flow

BTW, I always validate this on mpc8536ds(UP)/mpc8572ds(SMP) with/without
CONFIG_PREEMPT.

Tiejun

> 
> Changes from V1:
> 
> * use memcpy simply to withdraw copy_exc_stack
> * add !(regs->msr & MSR_PR)) and
> 	WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
>   to make sure we're in goot path.
> * move this migration process inside 'restore'
> * clear TIF flag atomically 
> 
> Tiejun Chen (3):
>       powerpc/kprobe: introduce a new thread flag
>       ppc32/kprobe: complete kprobe and migrate exception frame
>       ppc32/kprobe: don't emulate store when kprobe stwu r1
> 
>  arch/powerpc/include/asm/thread_info.h |    3 ++
>  arch/powerpc/kernel/entry_32.S         |   43 ++++++++++++++++++++++++++-----
>  arch/powerpc/lib/sstep.c               |   37 ++++++++++++++++++++++++++-
>  3 files changed, 74 insertions(+), 9 deletions(-)
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 
> 

^ permalink raw reply

* [v3 PATCH 0/3] ppc32/kprobe: Fix a bug for kprobe stwu r1
From: Tiejun Chen @ 2012-06-03  5:07 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1338700063-30670-1-git-send-email-tiejun.chen@windriver.com>

Changes from V2:

* populate those existed codes to reorganize codes
* add check if we'll trigger kernel stack over flow

Changes from V1:

* use memcpy simply to withdraw copy_exc_stack
* add !(regs->msr & MSR_PR)) and
	WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
  to make sure we're in goot path.
* move this migration process inside 'restore'
* clear TIF flag atomically 

Tiejun Chen (3):
      powerpc/kprobe: introduce a new thread flag
      ppc32/kprobe: complete kprobe and migrate exception frame
      ppc32/kprobe: don't emulate store when kprobe stwu r1

 arch/powerpc/include/asm/thread_info.h |    3 ++
 arch/powerpc/kernel/entry_32.S         |   43 ++++++++++++++++++++++++++-----
 arch/powerpc/lib/sstep.c               |   37 ++++++++++++++++++++++++++-
 3 files changed, 74 insertions(+), 9 deletions(-)

^ permalink raw reply

* [v3 PATCH 3/3] ppc32/kprobe: don't emulate store when kprobe stwu r1
From: Tiejun Chen @ 2012-06-03  5:07 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1338700063-30670-1-git-send-email-tiejun.chen@windriver.com>

We don't do the real store operation for kprobing 'stwu Rx,(y)R1'
since this may corrupt the exception frame, now we will do this
operation safely in exception return code after migrate current
exception frame below the kprobed function stack.

So we only update gpr[1] here and trigger a thread flag to mask
this.

Note we should make sure if we trigger kernel stack over flow.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/lib/sstep.c |   37 +++++++++++++++++++++++++++++++++++--
 1 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 9a52349..a4ce463 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -566,7 +566,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 	unsigned long int ea;
 	unsigned int cr, mb, me, sh;
 	int err;
-	unsigned long old_ra;
+	unsigned long old_ra, val3, r1;
 	long ival;
 
 	opcode = instr >> 26;
@@ -1486,11 +1486,44 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto ldst_done;
 
 	case 36:	/* stw */
-	case 37:	/* stwu */
 		val = regs->gpr[rd];
 		err = write_mem(val, dform_ea(instr, regs), 4, regs);
 		goto ldst_done;
 
+	case 37:	/* stwu */
+		__asm__ __volatile__("mr %0,1" : "=r" (r1) :);
+
+		val = regs->gpr[rd];
+		val3 = dform_ea(instr, regs);
+		/*
+		 * For PPC32 we always use stwu to change stack point with r1. So
+		 * this emulated store may corrupt the exception frame, now we
+		 * have to provide the exception frame trampoline, which is pushed
+		 * below the kprobed function stack. So we only update gpr[1] but
+		 * don't emulate the real store operation. We will do real store
+		 * operation safely in exception return code by checking this flag.
+		 */
+		if ((ra == 1) && !(regs->msr & MSR_PR) && (val3 >= r1)) {
+			/*
+			 * Check if we will touch kernel sack overflow
+			 */
+			if (r1 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {
+				printk(KERN_CRIT "Can't kprobe this since Kernel stack overflow.\n");
+				err = -EINVAL;
+				break;
+			}
+
+			/*
+			 * Check if we already set since that means we'll
+			 * lose the previous value.
+			 */
+			WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
+			set_thread_flag(TIF_EMULATE_STACK_STORE);
+			err = 0;
+		} else
+			err = write_mem(val, val3, 4, regs);
+		goto ldst_done;
+
 	case 38:	/* stb */
 	case 39:	/* stbu */
 		val = regs->gpr[rd];
-- 
1.5.6

^ permalink raw reply related

* [v3 PATCH 2/3] ppc32/kprobe: complete kprobe and migrate exception frame
From: Tiejun Chen @ 2012-06-03  5:07 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1338700063-30670-1-git-send-email-tiejun.chen@windriver.com>

We can't emulate stwu since that may corrupt current exception stack.
So we will have to do real store operation in the exception return code.

Firstly we'll allocate a trampoline exception frame below the kprobed
function stack and copy the current exception frame to the trampoline.
Then we can do this real store operation to implement 'stwu', and reroute
the trampoline frame to r1 to complete this exception migration.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/entry_32.S |   43 +++++++++++++++++++++++++++++++++------
 1 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index ba3aeb4..e7eefdf 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -829,19 +829,50 @@ restore_user:
 	bnel-	load_dbcr0
 #endif
 
-#ifdef CONFIG_PREEMPT
 	b	restore
 
 /* N.B. the only way to get here is from the beq following ret_from_except. */
 resume_kernel:
-	/* check current_thread_info->preempt_count */
+	/* check current_thread_info, _TIF_EMULATE_STACK_STORE */
 	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
-	lwz	r0,TI_PREEMPT(r9)
-	cmpwi	0,r0,0		/* if non-zero, just restore regs and return */
-	bne	restore
 	lwz	r0,TI_FLAGS(r9)
+	andis.	r0,r0,_TIF_EMULATE_STACK_STORE@h
+	beq+	1f
+
+	addi	r8,r1,INT_FRAME_SIZE	/* Get the kprobed function entry */
+
+	lwz	r3,GPR1(r1)
+	subi	r3,r3,INT_FRAME_SIZE	/* dst: Allocate a trampoline exception frame */
+	mr	r4,r1			/* src:  current exception frame */
+	li	r5,INT_FRAME_SIZE	/* size: INT_FRAME_SIZE */
+	mr	r1,r3			/* Reroute the trampoline frame to r1 */
+	bl	memcpy			/* Copy from the original to the trampoline */
+
+	/* Do real store operation to complete stwu */
+	lwz	r5,GPR1(r1)
+	stw	r8,0(r5)
+
+	/* Clear _TIF_EMULATE_STACK_STORE flag */
+	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
+	lis	r11,_TIF_EMULATE_STACK_STORE@h
+	addi	r5,r9,TI_FLAGS
+0:	lwarx	r8,0,r5
+	andc	r8,r8,r11
+#ifdef CONFIG_IBM405_ERR77
+	dcbt	0,r5
+#endif
+	stwcx.	r8,0,r5
+	bne-	0b
+1:
+
+#ifdef CONFIG_PREEMPT
+	/* check current_thread_info->preempt_count */
+	lwz	r8,TI_PREEMPT(r9)
+	cmpwi	0,r8,0		/* if non-zero, just restore regs and return */
+	bne	restore
 	andi.	r0,r0,_TIF_NEED_RESCHED
 	beq+	restore
+	lwz	r3,_MSR(r1)	
 	andi.	r0,r3,MSR_EE	/* interrupts off? */
 	beq	restore		/* don't schedule if so */
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -862,8 +893,6 @@ resume_kernel:
 	 */
 	bl	trace_hardirqs_on
 #endif
-#else
-resume_kernel:
 #endif /* CONFIG_PREEMPT */
 
 	/* interrupts are hard-disabled at this point */
-- 
1.5.6

^ permalink raw reply related

* [v3 PATCH 1/3] powerpc/kprobe: introduce a new thread flag
From: Tiejun Chen @ 2012-06-03  5:07 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev

We need to add a new thread flag, TIF_EMULATE_STACK_STORE,
for emulating stack store operation while exiting exception.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/include/asm/thread_info.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index bcebc75..45d098c 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -110,6 +110,8 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOERROR		12	/* Force successful syscall return */
 #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
+#define	TIF_EMULATE_STACK_STORE	17	/* Is an instruction emulation
+						for stack store? */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -126,6 +128,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NOERROR		(1<<TIF_NOERROR)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
+#define	_TIF_EMULATE_STACK_STORE	(1<<TIF_EMULATE_STACK_STORE)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
 
-- 
1.5.6

^ permalink raw reply related

* Re: [PATCH v2 2/3] ppc32/kprobe: complete kprobe and migrate exception frame
From: tiejun.chen @ 2012-06-03  5:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1336621805.3881.38.camel@pasglop>

On 05/10/2012 11:50 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-12-15 at 19:00 +0800, Tiejun Chen wrote:
>> We can't emulate stwu since that may corrupt current exception stack.
>> So we will have to do real store operation in the exception return code.
>>
>> Firstly we'll allocate a trampoline exception frame below the kprobed
>> function stack and copy the current exception frame to the trampoline.
>> Then we can do this real store operation to implement 'stwu', and reroute
>> the trampoline frame to r1 to complete this exception migration.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
>> ---
>>  arch/powerpc/kernel/entry_32.S |   35 +++++++++++++++++++++++++++++++++++
>>  1 files changed, 35 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
>> index 56212bc..0cdd27d 100644
>> --- a/arch/powerpc/kernel/entry_32.S
>> +++ b/arch/powerpc/kernel/entry_32.S
>> @@ -850,6 +850,41 @@ resume_kernel:
>>  
>>  	/* interrupts are hard-disabled at this point */
>>  restore:
>> +	lwz	r3,_MSR(r1)	/* Returning to user mode? */
>> +	andi.	r0,r3,MSR_PR
>> +	bne	1f	
> 
>  .../...
> 

Sorry for this delay response since I can't take time to do this last week :(

> Wouldn't it be better to use resume_kernel here ?

Agreed :)

> 
> IE. We already have restore_user vs. resume_kernel labels, including
> the PR test. In the !PREEMPT case, resume_kernel is empty but it's
> there, so you can just "populate" it, ie, something like:
> 
> restore_user:
> 	... existing dbcr0 stuff ...
> 	b restore
> 
> resume_kernel: <-- removed the ifdef CONFIG_PREEMPT
> 
> 	... Do the stack store business here...
> 
> #ifdef CONFIG_PREEMPT
> 	... move the preempt code here...
> #endif
> 
> restore:
> 	... existing stuff ...
> 
> Also, the added advantage is that the code to calc
> the thread info pointer and load the TI_FLAG can be
> shared now between your stuff and preempt, ie:
> 
> resume_kernel:
> 	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
> 	lwz	r0,TI_FLAGS(r9)
> 	andis.	r0,r0,_TIF_EMULATE_STACK_STORE@h
> 	bne-	emulate_stack_store
> #ifdef CONFIG_PREEMPT
> 	lwz	r8,TI_PREEMPT(r9) <-- note use of r8 instead of r0,
>                                       I -think- r8 can be clobbered
>                                       here but pls dbl check

Its should be safe.

> 	cmpwi	0,r8,0
> 	bne	restore
> 	andi.	r0,r0,_TIF_NEED_RESCHED
> 	etc...

Please check if next, v3, is fine.

>  	
>> +	/* check current_thread_info, _TIF_EMULATE_STACK_STORE */
>> +	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
>> +	lwz	r0,TI_FLAGS(r9)
>> +	andis.	r0,r0,_TIF_EMULATE_STACK_STORE@h
>> +	beq+	1f	
>> +
>> +	addi	r9,r1,INT_FRAME_SIZE	/* Get the kprobed function entry */
>> +
>> +	lwz	r3,GPR1(r1)
>> +	subi	r3,r3,INT_FRAME_SIZE	/* dst: Allocate a trampoline exception frame */
>> +	mr	r4,r1			/* src:  current exception frame */
>> +	li	r5,INT_FRAME_SIZE	/* size: INT_FRAME_SIZE */
>> +	mr	r1,r3			/* Reroute the trampoline frame to r1 */
>> +	bl	memcpy			/* Copy from the original to the trampoline */
>> +
>> +	/* Do real store operation to complete stwu */
>> +	lwz	r5,GPR1(r1)
>> +	stw	r9,0(r5)
> 
> Ok, I think I -finally- understand your trick of using r1 +
> INT_FRAME_SIZE as the value to store :-) It makes sense and it's
> actually a nice hack :-)
> 
> I would recommend that in the C code part of the emulation, you
> add some sanity checking to make sure we don't overflow the
> kernel stack etc... it should come in generally handy especially
> if what's your trying to debug with kprobes is a kernel stack
> overflow :-)

Added.

> 
>> +	/* Clear _TIF_EMULATE_STACK_STORE flag */
>> +	rlwinm	r9,r1,0,0,(31-THREAD_SHIFT)
>> +	lis	r11,_TIF_EMULATE_STACK_STORE@h
>> +	addi	r9,r9,TI_FLAGS
>> +0:	lwarx	r8,0,r9
>> +	andc	r8,r8,r11
>> +#ifdef CONFIG_IBM405_ERR77
>> +	dcbt	0,r9
>> +#endif
>> +	stwcx.	r8,0,r9
>> +	bne-	0b
>> +1:
>> +
>>  #ifdef CONFIG_44x
>>  BEGIN_MMU_FTR_SECTION
>>  	b	1f
> 
> BTW. Are you going to do a ppc64 variant of that patch ?

I'd like to go ppc64 ASAP once we did on ppc32 is good enough :)

Tiejun

> 
> Cheers,
> Ben.
> 
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc: Fix size of st_nlink on 64bit
From: Anton Blanchard @ 2012-06-03  3:48 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: michael, linuxppc-dev, paulus, viro
In-Reply-To: <20120603122850.3c5142767e429f34eb8a921e@canb.auug.org.au>


Hi Stephen,

> > commit e57f93cc53b7 (powerpc: get rid of nlink_t uses, switch to
> > explicitly-sized type) changed the size of st_nlink on ppc64 from
> > a long to a short, resulting in boot failures.
> > 
> > Signed-off-by: Anton Blanchard <anton@samba.org>
> 
> Would this affect my (early user mode) boot problems reported
> yesterday;
> 
> /init: 71: mknod: Permission denied
> /init: 88: mknod: Permission denied
> /init: 88: mknod: Permission denied

Very similar to the errors I was seeing so I think the patch will fix
it.

Anton

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox