LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [RESEND PATCH 3/3] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy @ 2014-02-10  2:38 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Split timer_interrupt(), which is the local timer interrupt handler on ppc
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
directly call into __timer_interupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   81 +++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3ff97db..df2989b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,47 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+		__get_cpu_var(irq_stat).timer_irqs_event++;
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		/* We may have raced with new irq work */
+		if (test_irq_work_pending())
+			set_dec(1);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-		__get_cpu_var(irq_stat).timer_irqs_event++;
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
-		__get_cpu_var(irq_stat).timer_irqs_others++;
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)

^ permalink raw reply related

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  2:54 UTC (permalink / raw)
  To: Tom Musta
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	Torsten Duwe, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <52F3E255.5050906@gmail.com>

On Thu, 2014-02-06 at 13:28 -0600, Tom Musta wrote:
> My read is consistent with Torsten's ... this looks like a bad idea.
> 
> Look at the RTL for sthcx. on page 692 (Power ISA V2.06) and you will
> see this:
> 
> if RESERVE then
>   if RESERVE_LENGTH = 2 then
>      ...
>   else
>      undefined_case <- 1
> else
>   ...
> 
> A legal implementation might never perform the store.

This is an area where we definitely want to check with the implementors
and if the implementations happen to do what we want (they likely do),
get the architecture changed for future chips and use it anyway.

There's a a *significant* benefit in avoiding an atomic operation in the
unlock case .

The reservation mechanism being based on a granule that is generally a
cache line, I doubt implementations will ever check the actual access
size, but we need to double check.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:02 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>

On Fri, 2014-02-07 at 10:02 +0100, Torsten Duwe wrote:
> > > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > > in the PowerISA doc. If so I think you can do better by being able to
> > > > atomically load both tickets but only storing the head without affecting
> > > > the tail.
> 
> Can I simply write the half word, without a reservation, or will the HW caches
> mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

Yes, you can, I *think*

> > Plus, sthcx doesn't exist on all PPC chips.
> 
> Which ones are lacking it? Do all have at least a simple 16-bit store?

half word atomics (and byte atomics) are new, they've been added in architecture
2.06 I believe so it's fairly recent, but it's still worthwhile to investigate a
way to avoid atomics on unlock on recent processors (we can use instruction patching
if necessary based on CPU features) because there's definitely a significant cost
in doing a larx/stcx. sequence on powerpc, way higher than our current unlock path
of barrier + store.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:05 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Torsten Duwe,
	Anton Blanchard, Scott Wood, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <87C29DBB-41E7-4B6C-9089-3C7756FBAE07@kernel.crashing.org>

On Fri, 2014-02-07 at 09:51 -0600, Kumar Gala wrote:
> On Feb 7, 2014, at 3:02 AM, Torsten Duwe <duwe@lst.de> wrote:
> 
> > On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> >> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> >>> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> >> 
> >>>> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> >>>> in the PowerISA doc. If so I think you can do better by being able to
> >>>> atomically load both tickets but only storing the head without affecting
> >>>> the tail.
> > 
> > Can I simply write the half word, without a reservation, or will the HW caches
> > mess up the other half? Will it ruin the cache coherency on some (sub)architectures?
> 
> The coherency should be fine, I just can’t remember if you’ll lose the reservation by doing this.

Yes you do.

> >> Plus, sthcx doesn't exist on all PPC chips.
> > 
> > Which ones are lacking it? Do all have at least a simple 16-bit store?
> 
> Everything implements a simple 16-bit store, just not everything implements the store conditional of 16-bit data.

Ben.

> - k--
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:10 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207165801.GC2107@lst.de>

On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
>  typedef struct {
> -       volatile unsigned int slock;
> -} arch_spinlock_t;
> +       union {
> +               __ticketpair_t head_tail;
> +               struct __raw_tickets {
> +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> +                       __ticket_t tail, head;
> +#else
> +                       __ticket_t head, tail;
> +#endif
> +               } tickets;
> +       };
> +#if defined(CONFIG_PPC_SPLPAR)
> +       u32 holder;
> +#endif
> +} arch_spinlock_t __aligned(4);

That's still broken with lockref (which we just merged).

We must have the arch_spinlock_t and the ref in the same 64-bit word
otherwise it will break.

We can make it work in theory since the holder doesn't have to be
accessed atomically, but the practicals are a complete mess ...
lockref would essentially have to re-implement the holder handling
of the spinlocks and use lower level ticket stuff.

Unless you can find a sneaky trick ... :-(

Ben.

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-10  3:45 UTC (permalink / raw)
  To: Peter Zijlstra, Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Daniel Lezcano,
	Rafael J. Wysocki, LKML, Ingo Molnar, Thomas Gleixner,
	linuxppc-dev
In-Reply-To: <20140207124140.GB9987@twins.programming.kicks-ass.net>

Hi Peter,

On 02/07/2014 06:11 PM, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote:
>> But observe the idle state "snooze" on powerpc. The power that this idle
>> state saves is through the lowering of the thread priority of the CPU.
>> After it lowers the thread priority, it is done. It cannot
>> "wait_for_interrupts". It will exit my_idle(). It is now upto the
>> generic idle loop to increase the thread priority if the need_resched
>> flag is set. Only an interrupt routine can increase the thread priority.
>> Else we will need to do it explicitly. And in such states which have a
>> polling nature, the cpu will not receive a reschedule IPI.
>>
>> That is why in the snooze_loop() we poll on need_resched. If it is set
>> we up the priority of the thread using HMT_MEDIUM() and then exit the
>> my_idle() loop. In case of interrupts, the priority gets automatically
>> increased.
> 
> You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine
> and get the IPI if that is what you want.
> 
> Depending on how horribly unprovisioned the thread gets at the lowest
> priority, that might actually be faster than polling and raising the
> prio whenever it does get ran.

So I am assuming you mean something like the below:

my_idle()
{
   local_irq_enable();
   /* Remove the setting of the polling flag */
   HMT_low();
   return index;
}

And then exit into the generic idle loop. But the issue I see here is
that the TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So, if on
testing need_resched() immediately after this returns that the
TIF_NEED_RESCHED flag is set, the thread will exit at low priority right?
 We could raise the priority of the thread in arch_cpu_idle_exit() soon
after setting the polling flag but that would mean for cases where the
TIF_NEED_RESCHED flag is not set we unnecessarily raise the priority of
the thread.

Thanks

Regards
Preeti U Murthy

> 

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 11:03 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <alpine.LRH.2.00.1402071348380.10318@ra8135-ec1.am.freescale.net>

On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> 
> > 	Hi Stephen,
> > 
> > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > 
> > > > From: Gabriel Paubert <paubert@iram.es>
> > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > Date: 02/06/2014 07:26 PM
> > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > 
> > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> 
> > > > > 
> > > > >                 mask = 0;
> > > > >                 if (FM & (1 << 0))
> > > > >                         mask |= 0x0000000f;
> > > > >                 if (FM & (1 << 1))
> > > > >                         mask |= 0x000000f0;
> > > > >                 if (FM & (1 << 2))
> > > > >                         mask |= 0x00000f00;
> > > > >                 if (FM & (1 << 3))
> > > > >                         mask |= 0x0000f000;
> > > > >                 if (FM & (1 << 4))
> > > > >                         mask |= 0x000f0000;
> > > > >                 if (FM & (1 << 5))
> > > > >                         mask |= 0x00f00000;
> > > > >                 if (FM & (1 << 6))
> > > > >                         mask |= 0x0f000000;
> > > > >                 if (FM & (1 << 7))
> > > > >                         mask |= 0x90000000;
> > > > > 
> > > > > With the above mask computation I get consistent results for 
> > > > > both the MPC8548 and MPC7410 boards.
> > > > > 
> > > > > Am I missing something subtle?
> > > > 
> > > > No I think you are correct. This said, this code may probably be 
> > > optimized 
> > > > to eliminate a lot of the conditional branches. I think that:
> 
> 
> If the compiler is enabled to generate isel instructions, it would not 
> use a conditional branch for this code. (ignore the andi's values, 
> this is an old compile)
> 
> c0037c2c <mtfsf>:
> c0037c2c:       2c 03 00 00     cmpwi   r3,0
> c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
> c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
> c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
> c0037c3c:       70 66 00 01     andi.   r6,r3,1
> c0037c40:       3d 00 90 00     lis     r8,-28672
> c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
> c0037c48:       70 6a 00 02     andi.   r10,r3,2
> c0037c4c:       65 28 0f 00     oris    r8,r9,3840
> c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c54:       70 66 00 04     andi.   r6,r3,4
> c0037c58:       65 28 00 f0     oris    r8,r9,240
> c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c60:       70 6a 00 08     andi.   r10,r3,8
> c0037c64:       65 28 00 0f     oris    r8,r9,15
> c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c6c:       70 66 00 10     andi.   r6,r3,16
> c0037c70:       61 28 f0 00     ori     r8,r9,61440
> c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c78:       70 6a 00 20     andi.   r10,r3,32
> c0037c7c:       61 28 0f 00     ori     r8,r9,3840
> c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
> c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
> c0037c8c:       70 66 00 40     andi.   r6,r3,64
> c0037c90:       61 28 00 f0     ori     r8,r9,240
> c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
> c0037c9c:       61 29 00 0f     ori     r9,r9,15
> 	...
>  
> However, your other solutions are better.
> 
> 
> > > > 
> > > > mask = (FM & 1);
> > > > mask |= (FM << 3) & 0x10;
> > > > mask |= (FM << 6) & 0x100;
> > > > mask |= (FM << 9) & 0x1000;
> > > > mask |= (FM << 12) & 0x10000;
> > > > mask |= (FM << 15) & 0x100000;
> > > > mask |= (FM << 18) & 0x1000000;
> > > > mask |= (FM << 21) & 0x10000000;
> > > > mask *= 15;
> > > > 
> > > > should do the job, in less code space and without a single branch.
> > > > 
> > > > Each one of the "mask |=" lines should be translated into an
> > > > rlwinm instruction followed by an "or". Actually it should be possible
> > > > to transform each of these lines into a single "rlwimi" instruction
> > > > but I don't know how to coerce gcc to reach this level of optimization.
> > > > 
> > > > Another way of optomizing this could be:
> > > > 
> > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > mask *= 15;
> > > > 
> 
> 
> > Ok, I finally edited my sources and test compiled the suggestions
> > I gave. I'd say that method2 is the best overall indeed. You can
> > actually save one more instruction by setting mask to all ones in 
> > the case FM=0xff, but that's about all in this area.
> 
> 
> My measurements show method1 to be smaller and faster than method2 due 
> to the number of instructions needed to generate the constant masks in 
> method2, but it may depend upon your compiler and hardware.  Both are 
> faster than the original with isel.

Ok, if you have measured that method1 is faster than method2, let us go for it.
I believe method2 would be faster if you had a large out-of-order execution
window, because more parallelism can be extracted from it, but this is probably
only true for high end cores, which do not need FPU emulation in the first place.

The other place where we can optimize is the generation of FEX. Here is 
my current patch:


diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
index dbce92e..b57b3fa8 100644
--- a/arch/powerpc/math-emu/mtfsf.c
+++ b/arch/powerpc/math-emu/mtfsf.c
@@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;
 
-	if (FM == 0)
+	if (likely(FM == 0xff))
+		mask = 0xffffffff;
+	else if (unlikely(FM == 0))
 		return 0;
-
-	if (FM == 0xff)
-		mask = 0x9fffffff;
 	else {
-		mask = 0;
-		if (FM & (1 << 0))
-			mask |= 0x90000000;
-		if (FM & (1 << 1))
-			mask |= 0x0f000000;
-		if (FM & (1 << 2))
-			mask |= 0x00f00000;
-		if (FM & (1 << 3))
-			mask |= 0x000f0000;
-		if (FM & (1 << 4))
-			mask |= 0x0000f000;
-		if (FM & (1 << 5))
-			mask |= 0x00000f00;
-		if (FM & (1 << 6))
-			mask |= 0x000000f0;
-		if (FM & (1 << 7))
-			mask |= 0x0000000f;
+		mask = (FM & 1);
+		mask |= (FM << 3) & 0x10;
+		mask |= (FM << 6) & 0x100;
+		mask |= (FM << 9) & 0x1000;
+		mask |= (FM << 12) & 0x10000;
+		mask |= (FM << 15) & 0x100000;
+		mask |= (FM << 18) & 0x1000000;
+		mask |= (FM << 21) & 0x10000000;
+		mask *= 15;
 	}
 
-	__FPU_FPSCR &= ~(mask);
-	__FPU_FPSCR |= (frB[1] & mask);
+	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+		~(FPSCR_VX | FPSCR_FEX);
 
-	__FPU_FPSCR &= ~(FPSCR_VX);
-	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-		__FPU_FPSCR |= FPSCR_VX;
-
-	fpscr = __FPU_FPSCR;
-	fpscr &= ~(FPSCR_FEX);
-	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-		fpscr |= FPSCR_FEX;
+		fpscr |= FPSCR_VX;
+
+	/* The bit order of exception enables and exception status
+	 * is the same. Simply shift and mask to check for enabled
+	 * exceptions.
+	 */
+	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
 	__FPU_FPSCR = fpscr;
 
 #ifdef DEBUG
 mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 35 deletions(-)


Notes: 

1) I'm really unsure on whether 0xff is frequent or not. So the likely()
statement at the beginning may be wrong. Actually, if it is not very likely,
it might be better to remove the special casef for FM==0xff. A look at 
GCC sources shows that it never generates a mask of 0xff. From glibc
sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

2) it may be better to remove the check for FM==0, after all, the instruction
efectively becomes a nop, and generating the instruction in the first place
would be too stupid for words.

3) if might be better to #define the magic constants (22 and 0xf8) used 
in the last if statement.

	Gabriel
> 

^ permalink raw reply related

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 11:17 UTC (permalink / raw)
  To: 'Gabriel Paubert', James Yang
  Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

> > However, your other solutions are better.
> >
> >
> > > > >
> > > > > mask =3D (FM & 1);
> > > > > mask |=3D (FM << 3) & 0x10;
> > > > > mask |=3D (FM << 6) & 0x100;
> > > > > mask |=3D (FM << 9) & 0x1000;
> > > > > mask |=3D (FM << 12) & 0x10000;
> > > > > mask |=3D (FM << 15) & 0x100000;
> > > > > mask |=3D (FM << 18) & 0x1000000;
> > > > > mask |=3D (FM << 21) & 0x10000000;
> > > > > mask *=3D 15;
> > > > >
> > > > > should do the job, in less code space and without a single branch=
.
...
> > > > > Another way of optomizing this could be:
> > > > >
> > > > > mask =3D (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > mask =3D (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > mask =3D (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > mask *=3D 15;
...
> Ok, if you have measured that method1 is faster than method2, let us go f=
or it.
> I believe method2 would be faster if you had a large out-of-order executi=
on
> window, because more parallelism can be extracted from it, but this is pr=
obably
> only true for high end cores, which do not need FPU emulation in the firs=
t place.

FWIW the second has a long dependency chain on 'mask', whereas the first ca=
n execute
the shift/and in any order and then merge the results.
So on most superscalar cpu, or one with result delays for arithmetic, the f=
irst
is likely to be faster.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 12:21 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBCE9@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 11:17:38AM +0000, David Laight wrote:
> > > However, your other solutions are better.
> > >
> > >
> > > > > >
> > > > > > mask = (FM & 1);
> > > > > > mask |= (FM << 3) & 0x10;
> > > > > > mask |= (FM << 6) & 0x100;
> > > > > > mask |= (FM << 9) & 0x1000;
> > > > > > mask |= (FM << 12) & 0x10000;
> > > > > > mask |= (FM << 15) & 0x100000;
> > > > > > mask |= (FM << 18) & 0x1000000;
> > > > > > mask |= (FM << 21) & 0x10000000;
> > > > > > mask *= 15;
> > > > > >
> > > > > > should do the job, in less code space and without a single branch.
> ...
> > > > > > Another way of optomizing this could be:
> > > > > >
> > > > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > > mask *= 15;
> ...
> > Ok, if you have measured that method1 is faster than method2, let us go for it.
> > I believe method2 would be faster if you had a large out-of-order execution
> > window, because more parallelism can be extracted from it, but this is probably
> > only true for high end cores, which do not need FPU emulation in the first place.
> 
> FWIW the second has a long dependency chain on 'mask', whereas the first can execute
> the shift/and in any order and then merge the results.
> So on most superscalar cpu, or one with result delays for arithmetic, the first
> is likely to be faster.

I disagree, perhaps mostly because the compiler is not clever enough, but right
now the code for solution 1 is (actually I have rewritten the code
and it reads:

	mask = (FM & 1)
			| ((FM << 3) & 0x10)
			| ((FM << 6) & 0x100)
			| ((FM << 9) & 0x1000)
			| ((FM << 12) & 0x10000)
			| ((FM << 15) & 0x100000)
			| ((FM << 18) & 0x1000000)
			| ((FM << 21) & 0x10000000);
to avoid sequence point in case it hampers the compiler)

and the output is:

        rlwinm 10,3,3,27,27      # D.11621, FM,,
        rlwinm 9,3,6,23,23       # D.11621, FM,,
        or 9,10,9        #, D.11621, D.11621, D.11621
        rlwinm 10,3,0,31,31      # D.11621, FM,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,9,19,19      # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,12,15,15     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,15,11,11     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,18,7,7       # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 3,3,21,3,3        # D.11621, FM,,
        or 9,9,3         #, mask, D.11621, D.11621
        mulli 9,9,15     # mask, mask,

see that r9 is used 7 times as both input and output operand, plus
once for rlwinm. This gives a dependency length of 8 at least.

In the other case (I've deleted the code) the dependency length
was significantly shorter. In any case that one is fewer instructions, 
which is good for occasional use. 

	Gabriel

^ permalink raw reply

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 12:32 UTC (permalink / raw)
  To: 'Gabriel Paubert'
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210122138.GA30356@visitor2.iram.es>

> I disagree, perhaps mostly because the compiler is not clever enough, but=
 right
> now the code for solution 1 is (actually I have rewritten the code
> and it reads:
>=20
> 	mask =3D (FM & 1)
> 			| ((FM << 3) & 0x10)
> 			| ((FM << 6) & 0x100)
> 			| ((FM << 9) & 0x1000)
> 			| ((FM << 12) & 0x10000)
> 			| ((FM << 15) & 0x100000)
> 			| ((FM << 18) & 0x1000000)
> 			| ((FM << 21) & 0x10000000);
> to avoid sequence point in case it hampers the compiler)
>=20
> and the output is:
>=20
>         rlwinm 10,3,3,27,27      # D.11621, FM,,
>         rlwinm 9,3,6,23,23       # D.11621, FM,,
>         or 9,10,9        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,0,31,31      # D.11621, FM,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,9,19,19      # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,12,15,15     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,15,11,11     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,18,7,7       # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 3,3,21,3,3        # D.11621, FM,,
>         or 9,9,3         #, mask, D.11621, D.11621
>         mulli 9,9,15     # mask, mask,
>=20
> see that r9 is used 7 times as both input and output operand, plus
> once for rlwinm. This gives a dependency length of 8 at least.
>=20
> In the other case (I've deleted the code) the dependency length
> was significantly shorter. In any case that one is fewer instructions,
> which is good for occasional use.

Hmmm... I hand-counted a dependency length of 8 for the other version.
Maybe there are some ppc instructions that reduce it.

Stupid compiler :-)
Trouble is, I bet that even if you code it as:
 	mask1 =3D (FM & 1) | ((FM << 3) & 0x10);
	mask2 =3D ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 =3D ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 =3D ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |=3D mask2;
	mask3 |=3D mask4;
	mask =3D mask1 | mask3;
the compiler will 'optimise' it to the above before code generation.
If it doesn't adding () to pair the | might be enough.
Then a new version of the compiler will change the behaviour again.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 13:00 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBDA6@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 12:32:18PM +0000, David Laight wrote:
> > I disagree, perhaps mostly because the compiler is not clever enough, but right
> > now the code for solution 1 is (actually I have rewritten the code
> > and it reads:
> > 
> > 	mask = (FM & 1)
> > 			| ((FM << 3) & 0x10)
> > 			| ((FM << 6) & 0x100)
> > 			| ((FM << 9) & 0x1000)
> > 			| ((FM << 12) & 0x10000)
> > 			| ((FM << 15) & 0x100000)
> > 			| ((FM << 18) & 0x1000000)
> > 			| ((FM << 21) & 0x10000000);
> > to avoid sequence point in case it hampers the compiler)
> > 
> > and the output is:
> > 
> >         rlwinm 10,3,3,27,27      # D.11621, FM,,
> >         rlwinm 9,3,6,23,23       # D.11621, FM,,
> >         or 9,10,9        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,0,31,31      # D.11621, FM,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,9,19,19      # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,12,15,15     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,15,11,11     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,18,7,7       # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 3,3,21,3,3        # D.11621, FM,,
> >         or 9,9,3         #, mask, D.11621, D.11621
> >         mulli 9,9,15     # mask, mask,
> > 
> > see that r9 is used 7 times as both input and output operand, plus
> > once for rlwinm. This gives a dependency length of 8 at least.
> > 
> > In the other case (I've deleted the code) the dependency length
> > was significantly shorter. In any case that one is fewer instructions,
> > which is good for occasional use.
> 
> Hmmm... I hand-counted a dependency length of 8 for the other version.
> Maybe there are some ppc instructions that reduce it.

Either I misread the generated code or I got somewhat less.

What helps for method1 is the rotate and mask instructions of PPC. Each of
left shift and mask becomes a single rlwinm. 
> 
> Stupid compiler :-)

Indeed. I've trying to coerce it into generating rlwimi instructions
(in which case the whole building of the mask reduces to 8 assembly
instructions) and failed. It seems that the compiler lacks some patterns
some patterns that would directly map to rlwimi.

> Trouble is, I bet that even if you code it as:
>  	mask1 = (FM & 1) | ((FM << 3) & 0x10);
> 	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
> 	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
> 	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
> 	mask1 |= mask2;
> 	mask3 |= mask4;
> 	mask = mask1 | mask3;
> the compiler will 'optimise' it to the above before code generation.

Indeed it's what it does :-(

I believe that the current suggestion is good enough.

	Gabriel

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-10 15:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <1392001823.3996.21.camel@pasglop>

On Mon, Feb 10, 2014 at 02:10:23PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
> >  typedef struct {
> > -       volatile unsigned int slock;
> > -} arch_spinlock_t;
> > +       union {
> > +               __ticketpair_t head_tail;
> > +               struct __raw_tickets {
> > +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> > +                       __ticket_t tail, head;
> > +#else
> > +                       __ticket_t head, tail;
> > +#endif
> > +               } tickets;
> > +       };
> > +#if defined(CONFIG_PPC_SPLPAR)
> > +       u32 holder;
> > +#endif
> > +} arch_spinlock_t __aligned(4);
> 
> That's still broken with lockref (which we just merged).
> 
> We must have the arch_spinlock_t and the ref in the same 64-bit word
> otherwise it will break.

Well, as far as I can see you'll just not be able to
USE_CMPXCHG_LOCKREF -- with the appropriate performance hit --
the code just falls back into lock&ref on pSeries.
What again was the intention of directed yield in the first place...?

> We can make it work in theory since the holder doesn't have to be
> accessed atomically, but the practicals are a complete mess ...
> lockref would essentially have to re-implement the holder handling
> of the spinlocks and use lower level ticket stuff.
> 
> Unless you can find a sneaky trick ... :-(

What if I squeeze the bits a little?
4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
At the cost that unlock may become an ll/sc operation again.
I could think about a trick against that.
But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/

Doing another round of yields for lockrefs now doesn't
sound so bad any more.

Opinions, anyone?

	Torsten

^ permalink raw reply

* [RFT][PATCH 04/12] drivers/macintosh/adb: change platform power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw, benh; +Cc: shuahkhan, Shuah Khan, linuxppc-dev, linux-kernel, linux-pm
In-Reply-To: <cover.1392040064.git.shuah.kh@samsung.com>

Change adb platform driver to register pm ops using dev_pm_ops instead of
legacy pm_ops. .pm hooks call existing legacy suspend and resume interfaces
by passing in the right pm state.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
---
 drivers/macintosh/adb.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 04a5049..df6b201 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -263,7 +263,7 @@ adb_reset_bus(void)
 /*
  * notify clients before sleep
  */
-static int adb_suspend(struct platform_device *dev, pm_message_t state)
+static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
 	adb_got_sleep = 1;
 	/* We need to get a lock on the probe thread */
@@ -276,10 +276,25 @@ static int adb_suspend(struct platform_device *dev, pm_message_t state)
 	return 0;
 }
 
+static int adb_suspend(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_SUSPEND);
+}
+
+static int adb_freeze(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_FREEZE);
+}
+
+static int adb_poweroff(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_HIBERNATE);
+}
+
 /*
  * reset bus after sleep
  */
-static int adb_resume(struct platform_device *dev)
+static int __adb_resume(struct platform_device *dev)
 {
 	adb_got_sleep = 0;
 	up(&adb_probe_mutex);
@@ -287,6 +302,11 @@ static int adb_resume(struct platform_device *dev)
 
 	return 0;
 }
+
+static int adb_resume(struct device *dev)
+{
+	return __adb_resume(to_platform_device(dev));
+}
 #endif /* CONFIG_PM */
 
 static int __init adb_init(void)
@@ -831,14 +851,25 @@ static const struct file_operations adb_fops = {
 	.release	= adb_release,
 };
 
+#ifdef CONFIG_PM
+static const struct dev_pm_ops adb_dev_pm_ops = {
+	.suspend = adb_suspend,
+	.resume = adb_resume,
+	/* Hibernate hooks */
+	.freeze = adb_freeze,
+	.thaw = adb_resume,
+	.poweroff = adb_poweroff,
+	.restore = adb_resume,
+};
+#endif
+
 static struct platform_driver adb_pfdrv = {
 	.driver = {
 		.name = "adb",
-	},
 #ifdef CONFIG_PM
-	.suspend = adb_suspend,
-	.resume = adb_resume,
+		.pm = &adb_dev_pm_ops,
 #endif
+	},
 };
 
 static struct platform_device adb_pfdev = {
-- 
1.7.10.4

^ permalink raw reply related

* [RFT][PATCH 00/12] change drivers power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw
  Cc: Shuah Khan, heiko.carstens, chris, shuahkhan, dwalker,
	ingo.tuchscherer, sonic.zhang, linux-s390, linux, davidb,
	linux-pm, linux-arm-msm, linux-pcmcia, adi-buildroot-devel,
	mirq-linux, ian, manuel.lauss, linux-arm-kernel, gregkh,
	linux-mmc, linux-kernel, schwidefsky, linux390, linuxppc-dev

Change drivers to register pm ops using dev_pm_ops instead of legacy pm_ops.
.pm hooks call existing legacy suspend and resume interfaces by passing in
the right pm state. Bus drivers suspend and resume routines call .pm driver
hooks if found.

Shuah Khan (12):
  arm: change locomo platform and bus power management to use
    dev_pm_ops
  arm: change sa1111 platform and bus power management to use
    dev_pm_ops
  arm: change scoop platform power management to use dev_pm_ops
  drivers/macintosh/adb: change platform power managemnet to use
    dev_pm_ops
  mmc: change au1xmmc platform power management to use dev_pm_ops
  mmc: change bfin_sdh platform power management to use dev_pm_ops
  isa: change isa bus power managemnet to use dev_pm_ops
  mmc: change cb710-mmc platform power management to use dev_pm_ops
  mmc: change msm_sdcc platform power management to use dev_pm_ops
  mmc: change tmio_mmc platform power management to use dev_pm_ops
  drivers/pcmcia: change ds driver power management to use dev_pm_ops
  drivers/s390/crypto: change ap_bus driver power management to use
    dev_pm_ops

 arch/arm/common/locomo.c     |   93 +++++++++++++++++++++++++++++++++++-------
 arch/arm/common/sa1111.c     |   88 +++++++++++++++++++++++++++++++--------
 arch/arm/common/scoop.c      |   44 ++++++++++++++++----
 drivers/base/isa.c           |   30 ++++++++++++--
 drivers/macintosh/adb.c      |   41 ++++++++++++++++---
 drivers/mmc/host/au1xmmc.c   |   43 +++++++++++++++----
 drivers/mmc/host/bfin_sdh.c  |   40 +++++++++++++++---
 drivers/mmc/host/cb710-mmc.c |   37 +++++++++++++++--
 drivers/mmc/host/msm_sdcc.c  |   42 +++++++++++++++----
 drivers/mmc/host/tmio_mmc.c  |   42 +++++++++++++++----
 drivers/pcmcia/ds.c          |   36 +++++++++++++---
 drivers/s390/crypto/ap_bus.c |   30 ++++++++++++--
 12 files changed, 481 insertions(+), 85 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 16:50 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: James Yang, Chris Proctor, linuxppc-dev
In-Reply-To: <OF80E982EC.BE15A034-ONCA257C7A.006B58B5-CA257C7A.006C3EA8@csc.com>

On Sun, 9 Feb 2014, Stephen N Chivers wrote:

> James Yang <James.Yang@freescale.com> wrote on 02/08/2014 07:49:40 AM:
> 
> > From: James Yang <James.Yang@freescale.com>
> > To: Gabriel Paubert <paubert@iram.es>
> > Cc: Stephen N Chivers <schivers@csc.com.au>, Chris Proctor 
> > <cproctor@csc.com.au>, <linuxppc-dev@lists.ozlabs.org>
> > Date: 02/08/2014 07:49 AM
> > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > 
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > >    Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor 
> <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.
> > > > > > 
> > > > > > Am I missing something subtle?
> > > > > 
> > > > > No I think you are correct. This said, this code may probably be 
> > > > optimized 
> > > > > to eliminate a lot of the conditional branches. I think that:
> > 
> > 
> > If the compiler is enabled to generate isel instructions, it would not 
> > use a conditional branch for this code. (ignore the andi's values, 
> > this is an old compile)
> > 
> From limited research, the 440GP is a processor
> that doesn't implement the isel instruction and it does
> not implement floating point.
> 
> The kernel emulates isel and so using that instruction
> for the 440GP would have a double trap penalty.

Are you writing about something outside the scope of this thread?  OP 
was using MPC8548 not a 440GP.  The compiler should not be using or 
targeting 8548 for a 440GP so having to emulate isel shouldn't be an 
issue because there wouldn't be any.  (The assembly listing I posted 
was generated by gcc targeting 8548.)  Anyway, I had measured the 
non-isel routines to be faster and that works without illop traps.

 
> Correct me if I am wrong, the isel instruction first appears
> in PowerPC ISA v2.04 around mid 2007. 


isel appeared in 2003 in the e500 (v1) core that is in the MPC8540.  
The instruction is Power ISA 2.03 (9/2006).

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 17:03 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: James Yang, Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

On Mon, 10 Feb 2014, Gabriel Paubert wrote:

> On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > > 	Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:


> > 
> > > > > > 
> > > > > >                 mask = 0;
> > > > > >                 if (FM & (1 << 0))
> > > > > >                         mask |= 0x0000000f;
> > > > > >                 if (FM & (1 << 1))
> > > > > >                         mask |= 0x000000f0;
> > > > > >                 if (FM & (1 << 2))
> > > > > >                         mask |= 0x00000f00;
> > > > > >                 if (FM & (1 << 3))
> > > > > >                         mask |= 0x0000f000;
> > > > > >                 if (FM & (1 << 4))
> > > > > >                         mask |= 0x000f0000;
> > > > > >                 if (FM & (1 << 5))
> > > > > >                         mask |= 0x00f00000;
> > > > > >                 if (FM & (1 << 6))
> > > > > >                         mask |= 0x0f000000;
> > > > > >                 if (FM & (1 << 7))
> > > > > >                         mask |= 0x90000000;
> > > > > > 
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.

> Ok, if you have measured that method1 is faster than method2, let us go for it.
> I believe method2 would be faster if you had a large out-of-order execution
> window, because more parallelism can be extracted from it, but this is probably
> only true for high end cores, which do not need FPU emulation in the first place.

Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
compiler generated.   

 
> The other place where we can optimize is the generation of FEX. Here is 
> my current patch:
> 
> 
> diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
> index dbce92e..b57b3fa8 100644
> --- a/arch/powerpc/math-emu/mtfsf.c
> +++ b/arch/powerpc/math-emu/mtfsf.c
> @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
>  	u32 mask;
>  	u32 fpscr;
>  
> -	if (FM == 0)
> +	if (likely(FM == 0xff))
> +		mask = 0xffffffff;
> +	else if (unlikely(FM == 0))
>  		return 0;
> -
> -	if (FM == 0xff)
> -		mask = 0x9fffffff;
>  	else {
> -		mask = 0;
> -		if (FM & (1 << 0))
> -			mask |= 0x90000000;
> -		if (FM & (1 << 1))
> -			mask |= 0x0f000000;
> -		if (FM & (1 << 2))
> -			mask |= 0x00f00000;
> -		if (FM & (1 << 3))
> -			mask |= 0x000f0000;
> -		if (FM & (1 << 4))
> -			mask |= 0x0000f000;
> -		if (FM & (1 << 5))
> -			mask |= 0x00000f00;
> -		if (FM & (1 << 6))
> -			mask |= 0x000000f0;
> -		if (FM & (1 << 7))
> -			mask |= 0x0000000f;
> +		mask = (FM & 1);
> +		mask |= (FM << 3) & 0x10;
> +		mask |= (FM << 6) & 0x100;
> +		mask |= (FM << 9) & 0x1000;
> +		mask |= (FM << 12) & 0x10000;
> +		mask |= (FM << 15) & 0x100000;
> +		mask |= (FM << 18) & 0x1000000;
> +		mask |= (FM << 21) & 0x10000000;
> +		mask *= 15;


Needs to also mask out bits 1 and 2, they aren't to be set from frB.

		mask &= 0x9FFFFFFF;




>  	}
>  
> -	__FPU_FPSCR &= ~(mask);
> -	__FPU_FPSCR |= (frB[1] & mask);
> +	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
> +		~(FPSCR_VX | FPSCR_FEX);
>  
> -	__FPU_FPSCR &= ~(FPSCR_VX);
> -	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
> +	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
>  		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
>  		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
> -		__FPU_FPSCR |= FPSCR_VX;
> -
> -	fpscr = __FPU_FPSCR;
> -	fpscr &= ~(FPSCR_FEX);
> -	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
> -	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
> -	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
> -	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
> -	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
> -		fpscr |= FPSCR_FEX;
> +		fpscr |= FPSCR_VX;
> +
> +	/* The bit order of exception enables and exception status
> +	 * is the same. Simply shift and mask to check for enabled
> +	 * exceptions.
> +	 */
> +	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
>  	__FPU_FPSCR = fpscr;
>  
>  #ifdef DEBUG
>  mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
>  1 file changed, 22 insertions(+), 35 deletions(-)
> 
> 
> Notes: 
> 
> 1) I'm really unsure on whether 0xff is frequent or not. So the likely()
> statement at the beginning may be wrong. Actually, if it is not very likely,
> it might be better to remove the special casef for FM==0xff. A look at 
> GCC sources shows that it never generates a mask of 0xff. From glibc
> sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

Can't handle all cases here.  
 
> 2) it may be better to remove the check for FM==0, after all, the instruction
> efectively becomes a nop, and generating the instruction in the first place
> would be too stupid for words.

Hmm a heavy no-op.  I wonder if it is heavier than a sync.
 
> 3) if might be better to #define the magic constants (22 and 0xf8) used 
> in the last if statement.
> 
> 	Gabriel

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Peter Zijlstra @ 2014-02-10 17:53 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140210155217.GF2107@lst.de>

On Mon, Feb 10, 2014 at 04:52:17PM +0100, Torsten Duwe wrote:
> Opinions, anyone?

Since the holder thing is a performance thing, not a correctness thing;
one thing you could do is something like:

static const int OWNER_HASH_SIZE = CONFIG_NR_CPUS * 4;
static const int OWNER_HASH_BITS = ilog2(OWNER_HASH_SIZE);

u16 lock_owner_array[OWNER_HASH_SIZE] = { 0, };

void set_owner(struct arch_spinlock_t *lock, int owner)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	lock_owner_array[hash] = owner;
}

void yield_to_owner(struct arch_spinlock_t *lock)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	int owner = lock_owner_array[hash];
	yield_to_cpu(owner);
}

And call set_owner() after the ticket lock is acquired, and don't bother
clearing it again; a new acquire will overwrite, a collision we have to
live with.

It should on average get you the right yield and does away with having
to track the owner field in place.

It does however get you an extra cacheline miss on acquire :/

One could consider patching it out when you know your kernel is not
running on an overloaded partition.

^ permalink raw reply

* Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 18:47 UTC (permalink / raw)
  To: B48286@freescale.com
  Cc: Scott Wood, linuxppc-dev@ozlabs.org, Mingkai.Hu@freescale.com,
	linux-mtd@lists.infradead.org, Ezequiel Garcia
In-Reply-To: <66941b677b7e4b8d9ce4cc44304b0955@DM2PR03MB301.namprd03.prod.outlook.com>

Hi Hou,

On Thu, Jan 23, 2014 at 03:29:41AM +0000, B48286@freescale.com wrote:
> > -----Original Message-----
> > From: Brian Norris [mailto:computersforpeace@gmail.com]
> > Sent: Thursday, January 23, 2014 10:12 AM
> > To: Hou Zhiqiang-B48286
> > Cc: linux-mtd@lists.infradead.org; linuxppc-dev@ozlabs.org; Wood Scott-
> > B07421; Hu Mingkai-B21284; Ezequiel Garcia
> > Subject: Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
> > 
> > On Mon, Jan 06, 2014 at 02:34:29PM +0800, Hou Zhiqiang wrote:
> > > --- a/drivers/mtd/devices/m25p80.c
> > > +++ b/drivers/mtd/devices/m25p80.c
> > > @@ -1012,7 +1012,8 @@ static int m25p_probe(struct spi_device *spi)
> > >  	if (data && data->name)
> > >  		flash->mtd.name = data->name;
> > >  	else
> > > -		flash->mtd.name = dev_name(&spi->dev);
> > > +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> > > +				id->name, spi->chip_select);
> > 
> > Changing the mtd.name may have far-reaching consequences for users who
> > already have mtdparts= command lines. But your concern is probably valid
> > for dynamically-determined bus numbers. Perhaps you can edit this patch
> > to only change the name when the busnum is dynamically-allocated?
> >
> It's a good idea, but in the case of mtd_info's name dynamically-allocated
> using "mtdparts=..." in command lines is illegal obviously.

I agree that users should never have relied on the dynamically-allocated
name. But changing the name for non-dynamic schemes (e.g., where the SPI
busnum is a fixed value) is not pleasant for users.

> Would you tell
> me what side-effect will be brought by the change of mtd_info's name.

I can only think of the mtdparts= command line. Otherwise, I don't think
the name is very important.

Brian

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-10 19:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use a helper function, rather than direct access to
direct_node, such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply

* Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 19:39 UTC (permalink / raw)
  To: Hou Zhiqiang; +Cc: scottwood, linuxppc-dev, mingkai.hu, linux-mtd, linux-spi
In-Reply-To: <1390717003-28699-1-git-send-email-b48286@freescale.com>

On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote:
> To give spi flash layout using "mtdparts=..." in cmdline, we must
> give mtd_info a fixed name,because the cmdlinepart's parser will
> match the name given in cmdline with the mtd_info.
> 
> Now, if use OF node, mtd_info's name will be spi->dev->name. It
> consists of spi_master->bus_num, and the spi_master->bus_num maybe
> dynamically fetched.
> So, give the mtd_info a new fiexd name "name.cs", "name" is name of
> spi_device_id and "cs" is chip-select in spi_dev.
> 
> Signed-off-by: Hou Zhiqiang <b48286@freescale.com>
> ---
> v2:
>  - add check for return value of function kasprintf.
>  - whether the spi_master->bus_num is dynamical is determined by spi
>    controller driver, and it can't be check in this driver. So, we can
>    not initial the mtd_info's name by distinguishing the spi_master
>    bus_num dynamically-allocated or not.

How about spi->master->bus_num < 0 ?

>  drivers/mtd/devices/m25p80.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
> index eb558e8..1f494d2 100644
> --- a/drivers/mtd/devices/m25p80.c
> +++ b/drivers/mtd/devices/m25p80.c
> @@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
>  
>  	if (data && data->name)
>  		flash->mtd.name = data->name;
> -	else
> -		flash->mtd.name = dev_name(&spi->dev);
> +	else {
> +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> +				id->name, spi->chip_select);

I don't think this name is specific enough. What if there are more than
one SPI controller? Then there could be one chip with the same
chip-select. You probably still need to incorporate the SPI master
somehow, even if it's not by using the bus number directly (because it's
dynamic).

> +		if (!flash->mtd.name)
> +			return -ENOMEM;
> +	}
>  
>  	flash->mtd.type = MTD_NORFLASH;
>  	flash->mtd.writesize = 1;

Brian

^ permalink raw reply

* Re: [PATCH] Handle vmalloc addresses
From: Benjamin Herrenschmidt @ 2014-02-11  0:19 UTC (permalink / raw)
  To: Nathan Fontenot; +Cc: Rong Song Shen, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <52E92DB0.1050902@linux.vnet.ibm.com>

Hi Nathan !

Please do a better submission :-)

Your subject is to be honest, crap. Something like

[PATCH] crypto/nx/nx-842: Fix handling of vmalloc addresses

Would have been much more informative.

Additionally, this is a patch for drivers/crypto, and while that driver
is powerpc-specific meaning I *could* take that patch, it should at
least be CCed to the crypto list/maintainer since that would
be the normal path for such a patch to be applied.

I'm taking it this time around but I know you can do better !

Cheers,
Ben.

On Wed, 2014-01-29 at 10:34 -0600, Nathan Fontenot wrote:
> The nx-842 compression driver does not currently handle getting
> a physical address for vmalloc addresses. The current driver
> uses __pa() for all addresses which does not properly handle
> vmalloc addresses and thus causes a failure since we do not pass
> a proper physical address to phyp.
> 
> This patch adds a routine to convert an address to a physical
> address by checking for vmalloc addresses and handling them properly.
> 
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>  ---
>  drivers/crypto/nx/nx-842.c |   29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> Index: linux/drivers/crypto/nx/nx-842.c
> ===================================================================
> --- linux.orig/drivers/crypto/nx/nx-842.c	2014-01-22 08:52:55.000000000 -0600
> +++ linux/drivers/crypto/nx/nx-842.c	2014-01-29 08:25:33.000000000 -0600
> @@ -158,6 +158,15 @@
>  	return sl->entry_nr * sizeof(struct nx842_slentry);
>  }
>  
> +static inline unsigned long nx842_get_pa(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_to_phys(vmalloc_to_page(addr))
> +		       + offset_in_page(addr);
> +	else
> +		return __pa(addr);
> +}
> +
>  static int nx842_build_scatterlist(unsigned long buf, int len,
>  			struct nx842_scatterlist *sl)
>  {
> @@ -168,7 +177,7 @@
>  
>  	entry = sl->entries;
>  	while (len) {
> -		entry->ptr = __pa(buf);
> +		entry->ptr = nx842_get_pa((void *)buf);
>  		nextpage = ALIGN(buf + 1, NX842_HW_PAGE_SIZE);
>  		if (nextpage < buf + len) {
>  			/* we aren't at the end yet */
> @@ -370,8 +379,8 @@
>  	op.flags = NX842_OP_COMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> -	op.out = __pa(slout.entries);
> +	op.csbcpb = nx842_get_pa(csbcpb);
> +	op.out = nx842_get_pa(slout.entries);
>  
>  	for (i = 0; i < hdr->blocks_nr; i++) {
>  		/*
> @@ -401,13 +410,13 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = max_sync_size;
>  
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, max_sync_size, &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -565,7 +574,7 @@
>  	op.flags = NX842_OP_DECOMPRESS;
>  	csbcpb = &workmem->csbcpb;
>  	memset(csbcpb, 0, sizeof(*csbcpb));
> -	op.csbcpb = __pa(csbcpb);
> +	op.csbcpb = nx842_get_pa(csbcpb);
>  
>  	/*
>  	 * max_sync_size may have changed since compression,
> @@ -597,12 +606,12 @@
>  		if (likely((inbuf & NX842_HW_PAGE_MASK) ==
>  			((inbuf + hdr->sizes[i] - 1) & NX842_HW_PAGE_MASK))) {
>  			/* Create direct DDE */
> -			op.in = __pa(inbuf);
> +			op.in = nx842_get_pa((void *)inbuf);
>  			op.inlen = hdr->sizes[i];
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(inbuf, hdr->sizes[i] , &slin);
> -			op.in = __pa(slin.entries);
> +			op.in = nx842_get_pa(slin.entries);
>  			op.inlen = -nx842_get_scatterlist_size(&slin);
>  		}
>  
> @@ -613,12 +622,12 @@
>  		 */
>  		if (likely(max_sync_size == NX842_HW_PAGE_SIZE)) {
>  			/* Create direct DDE */
> -			op.out = __pa(outbuf);
> +			op.out = nx842_get_pa((void *)outbuf);
>  			op.outlen = max_sync_size;
>  		} else {
>  			/* Create indirect DDE (scatterlist) */
>  			nx842_build_scatterlist(outbuf, max_sync_size, &slout);
> -			op.out = __pa(slout.entries);
> +			op.out = nx842_get_pa(slout.entries);
>  			op.outlen = -nx842_get_scatterlist_size(&slout);
>  		}
>  
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* [PATCH] powerpc/powernv: Add iommu DMA bypass support for IODA2
From: Benjamin Herrenschmidt @ 2014-02-11  0:38 UTC (permalink / raw)
  To: linuxppc-dev list

this patch adds the support for to create a direct iommu "bypass"
window on IODA2 bridges (such as Power8) allowing to bypass iommu
page translation completely for 64-bit DMA capable devices, thus
significantly improving DMA performances.

Additionally, this adds a hook to the struct iommu_table so that
the IOMMU API / VFIO can disable the bypass when external ownership
is requested, since in that case, the device will be used by an
environment such as userspace or a KVM guest which must not be
allowed to bypass translations.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

This is pretty much identical to the old version I posted a while ago,
except that it does have the iommu calls to enable/disable which I had
forgotten to git add before posting the previous one.

 arch/powerpc/include/asm/dma-mapping.h    |  1 +
 arch/powerpc/include/asm/iommu.h          |  1 +
 arch/powerpc/kernel/dma.c                 | 10 ++--
 arch/powerpc/kernel/iommu.c               | 12 +++++
 arch/powerpc/platforms/powernv/pci-ioda.c | 84 +++++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci.c      | 10 ++++
 arch/powerpc/platforms/powernv/pci.h      |  6 ++-
 arch/powerpc/platforms/powernv/powernv.h  |  4 ++
 arch/powerpc/platforms/powernv/setup.c    |  9 ++++
 9 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index e27e9ad..150866b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -134,6 +134,7 @@ static inline int dma_supported(struct device *dev, u64 mask)
 }
 
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
+extern int __dma_set_mask(struct device *dev, u64 dma_mask);
 
 #define dma_alloc_coherent(d,s,h,f)	dma_alloc_attrs(d,s,h,f,NULL)
 
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f7a8036..42632c7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,6 +77,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
 	struct iommu_group *it_group;
 #endif
+	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 8032b97..ee78f6e 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -191,12 +191,10 @@ EXPORT_SYMBOL(dma_direct_ops);
 
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 
-int dma_set_mask(struct device *dev, u64 dma_mask)
+int __dma_set_mask(struct device *dev, u64 dma_mask)
 {
 	struct dma_map_ops *dma_ops = get_dma_ops(dev);
 
-	if (ppc_md.dma_set_mask)
-		return ppc_md.dma_set_mask(dev, dma_mask);
 	if ((dma_ops != NULL) && (dma_ops->set_dma_mask != NULL))
 		return dma_ops->set_dma_mask(dev, dma_mask);
 	if (!dev->dma_mask || !dma_supported(dev, dma_mask))
@@ -204,6 +202,12 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 	*dev->dma_mask = dma_mask;
 	return 0;
 }
+int dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (ppc_md.dma_set_mask)
+		return ppc_md.dma_set_mask(dev, dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
 EXPORT_SYMBOL(dma_set_mask);
 
 u64 dma_get_required_mask(struct device *dev)
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index d773dd4..88e3ec6 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1088,6 +1088,14 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	memset(tbl->it_map, 0xff, sz);
 	iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
+	/*
+	 * Disable iommu bypass, otherwise the user can DMA to all of
+	 * our physical memory via the bypass window instead of just
+	 * the pages that has been explicitly mapped into the iommu
+	 */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, false);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1102,6 +1110,10 @@ void iommu_release_ownership(struct iommu_table *tbl)
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	/* The kernel owns the device now, we can restore the iommu bypass */
+	if (tbl->set_bypass)
+		tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 7d6dcc6..3b2b4fb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -21,6 +21,7 @@
 #include <linux/irq.h>
 #include <linux/io.h>
 #include <linux/msi.h>
+#include <linux/memblock.h>
 
 #include <asm/sections.h>
 #include <asm/io.h>
@@ -460,9 +461,39 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
 		return;
 
 	pe = &phb->ioda.pe_array[pdn->pe_number];
+	WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
 	set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
 }
 
+static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+				     struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_dn *pdn = pci_get_pdn(pdev);
+	struct pnv_ioda_pe *pe;
+	uint64_t top;
+	bool bypass = false;
+
+	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
+		return -ENODEV;;
+
+	pe = &phb->ioda.pe_array[pdn->pe_number];
+	if (pe->tce_bypass_enabled) {
+		top = pe->tce_bypass_base + memblock_end_of_DRAM() - 1;
+		bypass = (dma_mask >= top);
+	}
+
+	if (bypass) {
+		dev_info(&pdev->dev, "Using 64-bit DMA iommu bypass\n");
+		set_dma_ops(&pdev->dev, &dma_direct_ops);
+		set_dma_offset(&pdev->dev, pe->tce_bypass_base);
+	} else {
+		dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
+		set_dma_ops(&pdev->dev, &dma_iommu_ops);
+		set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+	}
+	return 0;
+}
+
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
 	struct pci_dev *dev;
@@ -657,6 +688,56 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 		__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
+static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+{
+	struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
+					      tce32_table);
+	uint16_t window_id = (pe->pe_number << 1 ) + 1;
+	int64_t rc;
+
+	pe_info(pe, "%sabling 64-bit DMA bypass\n", enable ? "En" : "Dis");
+	if (enable) {
+		phys_addr_t top = memblock_end_of_DRAM();
+
+		top = roundup_pow_of_two(top);
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     top);
+	} else {
+		rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
+						     pe->pe_number,
+						     window_id,
+						     pe->tce_bypass_base,
+						     0);
+
+		/*
+		 * We might want to reset the DMA ops of all devices on
+		 * this PE. However in theory, that shouldn't be necessary
+		 * as this is used for VFIO/KVM pass-through and the device
+		 * hasn't yet been returned to its kernel driver
+		 */
+	}
+	if (rc)
+		pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
+	else
+		pe->tce_bypass_enabled = enable;
+}
+
+static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
+					  struct pnv_ioda_pe *pe)
+{
+	/* TVE #1 is selected by PCI address bit 59 */
+	pe->tce_bypass_base = 1ull << 59;
+
+	/* Install set_bypass callback for VFIO */
+	pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+
+	/* Enable bypass by default */
+	pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+}
+
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 				       struct pnv_ioda_pe *pe)
 {
@@ -727,6 +808,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
 	else
 		pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
+	/* Also create a bypass window */
+	pnv_pci_ioda2_setup_bypass_pe(phb, pe);
 	return;
 fail:
 	if (pe->tce32_seg >= 0)
@@ -1286,6 +1369,7 @@ void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* Setup TCEs */
 	phb->dma_dev_setup = pnv_pci_ioda_dma_dev_setup;
+	phb->dma_set_mask = pnv_pci_ioda_dma_set_mask;
 
 	/* Setup shutdown function for kexec */
 	phb->shutdown = pnv_pci_ioda_shutdown;
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index b555ebc..95633d7 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -634,6 +634,16 @@ static void pnv_pci_dma_dev_setup(struct pci_dev *pdev)
 		pnv_pci_dma_fallback_setup(hose, pdev);
 }
 
+int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
+{
+	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
+	struct pnv_phb *phb = hose->private_data;
+
+	if (phb && phb->dma_set_mask)
+		return phb->dma_set_mask(phb, pdev, dma_mask);
+	return __dma_set_mask(&pdev->dev, dma_mask);
+}
+
 void pnv_pci_shutdown(void)
 {
 	struct pci_controller *hose;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 13f1942..cde1694 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -54,7 +54,9 @@ struct pnv_ioda_pe {
 	struct iommu_table	tce32_table;
 	phys_addr_t		tce_inval_reg_phys;
 
-	/* XXX TODO: Add support for additional 64-bit iommus */
+	/* 64-bit TCE bypass region */
+	bool			tce_bypass_enabled;
+	uint64_t		tce_bypass_base;
 
 	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
 	 * and -1 if not supported. (It's actually identical to the
@@ -113,6 +115,8 @@ struct pnv_phb {
 			 unsigned int hwirq, unsigned int virq,
 			 unsigned int is_64, struct msi_msg *msg);
 	void (*dma_dev_setup)(struct pnv_phb *phb, struct pci_dev *pdev);
+	int (*dma_set_mask)(struct pnv_phb *phb, struct pci_dev *pdev,
+			    u64 dma_mask);
 	void (*fixup_phb)(struct pci_controller *hose);
 	u32 (*bdfn_to_pe)(struct pnv_phb *phb, struct pci_bus *bus, u32 devfn);
 	void (*shutdown)(struct pnv_phb *phb);
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index de6819b..213887a 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -7,12 +7,16 @@ extern void pnv_smp_init(void);
 static inline void pnv_smp_init(void) { }
 #endif
 
+struct pci_dev;
+
 #ifdef CONFIG_PCI
 extern void pnv_pci_init(void);
 extern void pnv_pci_shutdown(void);
+extern int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask);
 #else
 static inline void pnv_pci_init(void) { }
 static inline void pnv_pci_shutdown(void) { }
+static inline int pnv_pci_dma_set_mask(struct pci_dev *pdev, u64 dma_mask) { }
 #endif
 
 extern void pnv_lpc_init(void);
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 21166f6..110f4fb 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -27,6 +27,7 @@
 #include <linux/interrupt.h>
 #include <linux/bug.h>
 #include <linux/cpuidle.h>
+#include <linux/pci.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -141,6 +142,13 @@ static void pnv_progress(char *s, unsigned short hex)
 {
 }
 
+static int pnv_dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (dev_is_pci(dev))
+		return pnv_pci_dma_set_mask(to_pci_dev(dev), dma_mask);
+	return __dma_set_mask(dev, dma_mask);
+}
+
 static void pnv_shutdown(void)
 {
 	/* Let the PCI code clear up IODA tables */
@@ -238,6 +246,7 @@ define_machine(powernv) {
 	.machine_shutdown	= pnv_shutdown,
 	.power_save             = powernv_idle,
 	.calibrate_decr		= generic_calibrate_decr,
+	.dma_set_mask		= pnv_dma_set_mask,
 #ifdef CONFIG_KEXEC
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif

^ permalink raw reply related

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  2:44 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar, Al Viro
In-Reply-To: <20140210155217.GF2107@lst.de>

(Linus, Al, a question for you down there about lockref "ref" size)

On Mon, 2014-02-10 at 16:52 +0100, Torsten Duwe wrote:

> What if I squeeze the bits a little?
> 4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
> At the cost that unlock may become an ll/sc operation again.
> I could think about a trick against that.
> But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/
> 
> Doing another round of yields for lockrefs now doesn't
> sound so bad any more.

So, the ticketpair has to be 16:16 so we can avoid the atomic on unlock

That leaves us with 32 bits to put the ref and the owner. The question
is how big the ref really has to be and can we have a reasonable failure
mode if it overflows ?

If we limit ourselves to, for example, 16-bit for the ref in lockref,
then we can have the second 32-bit split between the owner and the ref.

If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...

So the question is, is it reasonable to have the ref smaller than
32-bit...

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Al Viro @ 2014-02-11  2:56 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1392086660.3996.50.camel@pasglop>

On Tue, Feb 11, 2014 at 01:44:20PM +1100, Benjamin Herrenschmidt wrote:
> That leaves us with 32 bits to put the ref and the owner. The question
> is how big the ref really has to be and can we have a reasonable failure
> mode if it overflows ?
> 
> If we limit ourselves to, for example, 16-bit for the ref in lockref,
> then we can have the second 32-bit split between the owner and the ref.
> 
> If we limit ourselves to 4k CPUs, then we get 4 more bits of ref ...
> 
> So the question is, is it reasonable to have the ref smaller than
> 32-bit...

Every time you open a file, you bump dentry refcount.  Something like
libc or ld.so will be opened on just about every execve(), so I'd say
that 16 bits is far too low.  If nothing else, 32 bits might be too
low on 64bit boxen...

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-11  3:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Tom Musta, Peter Zijlstra, Linus Torvalds, linux-kernel,
	Paul Mackerras, Anton Blanchard, Scott Wood, Torsten Duwe,
	Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140211025645.GJ18016@ZenIV.linux.org.uk>

On Tue, 2014-02-11 at 02:56 +0000, Al Viro wrote:
> > So the question is, is it reasonable to have the ref smaller than
> > 32-bit...
> 
> Every time you open a file, you bump dentry refcount.  Something like
> libc or ld.so will be opened on just about every execve(), so I'd say
> that 16 bits is far too low.  If nothing else, 32 bits might be too
> low on 64bit boxen...

So back to square 1 ... we can't implement together lockref, ticket
locks, and our lock confer mechanism within 64-bit.

I see two options at this stage. Both require a custom implementation
of lockref for powerpc, so some ifdef's such that we can replace the
generic implementation completely.

 - We can use a small ref, and when it's too big, overflow into a larger
one, falling back to the "old style" lock + ref (an overflow bit or a
compare with ffff)

 - We can have lockref "build" it's own lock out of the ticketpair and
ref, keeping the owner in a separate word. The owner doesn't strictly
need to be atomic.

Both are gross though :(

Anybody has a better idea ?

Ben.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox