LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Yijing Wang @ 2014-02-10  1:13 UTC (permalink / raw)
  To: Thomas Gleixner, David Laight
  Cc: linux-mips@linux-mips.org, x86@kernel.org, Kevin Hilman,
	linux@lists.openrisc.net, Hanjun Guo, Sekhar Nori, Michal Simek,
	Paul Mackerras, Ralf Baechle, H. Peter Anvin, Daniel Walker,
	Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim, Russell King,
	Richard Weinberger, Daniel Lezcano, Tony Lindgren, Ingo Molnar,
	microblaze-uclinux@itee.uq.edu.au, David Brown,
	Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-arm-msm@vger.kernel.org, Jeff Dike,
	davinci-linux-open-source@linux.davincidsp.com,
	linux-samsung-soc@vger.kernel.org, John Stultz,
	user-mode-linux-user@lists.sourceforge.net,
	linux-omap@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Barry Song, Jim Cromie, linux-kernel@vger.kernel.org,
	Nicolas Ferre, 'Tony Prisk', Bryan Huntsman,
	uclinux-dist-devel@blackfin.uclinux.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <alpine.DEB.2.02.1402052139560.24986@ionos.tec.linutronix.de>

On 2014/2/6 4:40, Thomas Gleixner wrote:
> Yijing,
> 
> On Thu, 23 Jan 2014, David Laight wrote:
> 
>> From: Linuxppc-dev Tony Prisk
>>> On 23/01/14 20:12, Yijing Wang wrote:
>>>> Currently, clocksource_register() and __clocksource_register_scale()
>>>> functions always return 0, it's pointless, make functions void.
>>>> And remove the dead code that check the clocksource_register_hz()
>>>> return value.
>>> ......
>>>> -static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>> +static inline void clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>>   {
>>>>   	return __clocksource_register_scale(cs, 1, hz);
>>>>   }
>>>
>>> This doesn't make sense - you are still returning a value on a function
>>> declared void, and the return is now from a function that doesn't return
>>> anything either ?!?!
>>> Doesn't this throw a compile-time warning??
>>
>> It depends on the compiler.
>> Recent gcc allow it.
>> I don't know if it is actually valid C though.
>>
>> There is no excuse for it on lines like the above though.
> 
> Can you please resend with that fixed against 3.14-rc1 ?

OK, I will resend later.

Thanks!
Yijing.


> 
> .
> 


-- 
Thanks!
Yijing

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.

Why not inspecting whether we can get the page on the best node such as
numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}

empty_node cannot be set on memoryless node, since page allocation would
succeed on different node.

Thanks.

^ permalink raw reply

* [RESEND PATCH 0/3] powerpc: Free up an IPI message slot for tick broadcast IPIs
From: Preeti U Murthy @ 2014-02-10  2:37 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev

This patchset is a precursor for enabling deep idle states on powerpc,
when the local CPU timers stop. The tick broadcast framework in
the Linux Kernel today handles wakeup of such CPUs at their next timer event
by using an external clock device. At the expiry of this clock device, IPIs
are sent to the CPUs in deep idle states  so that they wakeup to handle their
respective timers. This patchset frees up one of the IPI slots on powerpc
so as to be used to handle the tick broadcast IPI.

On certain implementations of powerpc, such an external clock device is absent.
The support in the tick broadcast framework to handle wakeup of CPUs from
deep idle states on such implementations is currently in the tip tree.
https://lkml.org/lkml/2014/2/7/906
https://lkml.org/lkml/2014/2/7/876
https://lkml.org/lkml/2014/2/7/608

With the above support in place, this patchset is next in line to enable deep
idle states on powerpc.

The patchset has been appended by a RESEND tag since nothing has changed from
the previous post except for an added config condition around
tick_broadcast() which handles sending broadcast IPIs, and the update in the cover
letter.
---

Preeti U Murthy (1):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message


 arch/powerpc/include/asm/smp.h          |    2 -
 arch/powerpc/include/asm/time.h         |    1 
 arch/powerpc/kernel/smp.c               |   25 ++++++---
 arch/powerpc/kernel/time.c              |   86 ++++++++++++++++++-------------
 arch/powerpc/platforms/cell/interrupt.c |    2 -
 arch/powerpc/platforms/ps3/smp.c        |    2 -
 6 files changed, 73 insertions(+), 45 deletions(-)

-- 

^ permalink raw reply

* [RESEND PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Preeti U Murthy @ 2014-02-10  2:38 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
 Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   21 +++++++++++++++++----
 arch/powerpc/kernel/time.c              |    5 +++++
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_UNUSED		2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ee7d76b..e2a4232 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,16 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+#endif
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3dab20..3ff97db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index adf3726..8a106b4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TICK_BROADCAST);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 00d1a7c..b358bec 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
+		BUILD_BUG_ON(PPC_MSG_TICK_BROADCAST   != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related

* [RESEND PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Preeti U Murthy @ 2014-02-10  2:37 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots, (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 084e080..9f7356b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_CALL_FUNC_SINGLE	2
+#define PPC_MSG_UNUSED		2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..ee7d76b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t call_function_single_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_single_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
+	[PPC_MSG_UNUSED] = unused_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
+	[PPC_MSG_UNUSED] = "ipi unused",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
-		if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE))
-			generic_smp_call_function_single_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
 void arch_send_call_function_single_ipi(int cpu)
 {
-	do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
+	do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
 void arch_send_call_function_ipi_mask(const struct cpumask *mask)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..adf3726 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..00d1a7c 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related

* [RESEND PATCH 3/3] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy @ 2014-02-10  2:38 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Split timer_interrupt(), which is the local timer interrupt handler on ppc
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
directly call into __timer_interupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   81 +++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3ff97db..df2989b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,47 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+		__get_cpu_var(irq_stat).timer_irqs_event++;
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		/* We may have raced with new irq work */
+		if (test_irq_work_pending())
+			set_dec(1);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-		__get_cpu_var(irq_stat).timer_irqs_event++;
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
-		__get_cpu_var(irq_stat).timer_irqs_others++;
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)

^ permalink raw reply related

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  2:54 UTC (permalink / raw)
  To: Tom Musta
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	Torsten Duwe, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <52F3E255.5050906@gmail.com>

On Thu, 2014-02-06 at 13:28 -0600, Tom Musta wrote:
> My read is consistent with Torsten's ... this looks like a bad idea.
> 
> Look at the RTL for sthcx. on page 692 (Power ISA V2.06) and you will
> see this:
> 
> if RESERVE then
>   if RESERVE_LENGTH = 2 then
>      ...
>   else
>      undefined_case <- 1
> else
>   ...
> 
> A legal implementation might never perform the store.

This is an area where we definitely want to check with the implementors
and if the implementations happen to do what we want (they likely do),
get the architecture changed for future chips and use it anyway.

There's a a *significant* benefit in avoiding an atomic operation in the
unlock case .

The reservation mechanism being based on a granule that is generally a
cache line, I doubt implementations will ever check the actual access
size, but we need to double check.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:02 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>

On Fri, 2014-02-07 at 10:02 +0100, Torsten Duwe wrote:
> > > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > > in the PowerISA doc. If so I think you can do better by being able to
> > > > atomically load both tickets but only storing the head without affecting
> > > > the tail.
> 
> Can I simply write the half word, without a reservation, or will the HW caches
> mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

Yes, you can, I *think*

> > Plus, sthcx doesn't exist on all PPC chips.
> 
> Which ones are lacking it? Do all have at least a simple 16-bit store?

half word atomics (and byte atomics) are new, they've been added in architecture
2.06 I believe so it's fairly recent, but it's still worthwhile to investigate a
way to avoid atomics on unlock on recent processors (we can use instruction patching
if necessary based on CPU features) because there's definitely a significant cost
in doing a larx/stcx. sequence on powerpc, way higher than our current unlock path
of barrier + store.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:05 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Torsten Duwe,
	Anton Blanchard, Scott Wood, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <87C29DBB-41E7-4B6C-9089-3C7756FBAE07@kernel.crashing.org>

On Fri, 2014-02-07 at 09:51 -0600, Kumar Gala wrote:
> On Feb 7, 2014, at 3:02 AM, Torsten Duwe <duwe@lst.de> wrote:
> 
> > On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> >> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> >>> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> >> 
> >>>> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> >>>> in the PowerISA doc. If so I think you can do better by being able to
> >>>> atomically load both tickets but only storing the head without affecting
> >>>> the tail.
> > 
> > Can I simply write the half word, without a reservation, or will the HW caches
> > mess up the other half? Will it ruin the cache coherency on some (sub)architectures?
> 
> The coherency should be fine, I just can’t remember if you’ll lose the reservation by doing this.

Yes you do.

> >> Plus, sthcx doesn't exist on all PPC chips.
> > 
> > Which ones are lacking it? Do all have at least a simple 16-bit store?
> 
> Everything implements a simple 16-bit store, just not everything implements the store conditional of 16-bit data.

Ben.

> - k--
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:10 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207165801.GC2107@lst.de>

On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
>  typedef struct {
> -       volatile unsigned int slock;
> -} arch_spinlock_t;
> +       union {
> +               __ticketpair_t head_tail;
> +               struct __raw_tickets {
> +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> +                       __ticket_t tail, head;
> +#else
> +                       __ticket_t head, tail;
> +#endif
> +               } tickets;
> +       };
> +#if defined(CONFIG_PPC_SPLPAR)
> +       u32 holder;
> +#endif
> +} arch_spinlock_t __aligned(4);

That's still broken with lockref (which we just merged).

We must have the arch_spinlock_t and the ref in the same 64-bit word
otherwise it will break.

We can make it work in theory since the holder doesn't have to be
accessed atomically, but the practicals are a complete mess ...
lockref would essentially have to re-implement the holder handling
of the spinlocks and use lower level ticket stuff.

Unless you can find a sneaky trick ... :-(

Ben.

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-10  3:45 UTC (permalink / raw)
  To: Peter Zijlstra, Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Daniel Lezcano,
	Rafael J. Wysocki, LKML, Ingo Molnar, Thomas Gleixner,
	linuxppc-dev
In-Reply-To: <20140207124140.GB9987@twins.programming.kicks-ass.net>

Hi Peter,

On 02/07/2014 06:11 PM, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote:
>> But observe the idle state "snooze" on powerpc. The power that this idle
>> state saves is through the lowering of the thread priority of the CPU.
>> After it lowers the thread priority, it is done. It cannot
>> "wait_for_interrupts". It will exit my_idle(). It is now upto the
>> generic idle loop to increase the thread priority if the need_resched
>> flag is set. Only an interrupt routine can increase the thread priority.
>> Else we will need to do it explicitly. And in such states which have a
>> polling nature, the cpu will not receive a reschedule IPI.
>>
>> That is why in the snooze_loop() we poll on need_resched. If it is set
>> we up the priority of the thread using HMT_MEDIUM() and then exit the
>> my_idle() loop. In case of interrupts, the priority gets automatically
>> increased.
> 
> You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine
> and get the IPI if that is what you want.
> 
> Depending on how horribly unprovisioned the thread gets at the lowest
> priority, that might actually be faster than polling and raising the
> prio whenever it does get ran.

So I am assuming you mean something like the below:

my_idle()
{
   local_irq_enable();
   /* Remove the setting of the polling flag */
   HMT_low();
   return index;
}

And then exit into the generic idle loop. But the issue I see here is
that the TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So, if on
testing need_resched() immediately after this returns that the
TIF_NEED_RESCHED flag is set, the thread will exit at low priority right?
 We could raise the priority of the thread in arch_cpu_idle_exit() soon
after setting the polling flag but that would mean for cases where the
TIF_NEED_RESCHED flag is not set we unnecessarily raise the priority of
the thread.

Thanks

Regards
Preeti U Murthy

> 

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 11:03 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <alpine.LRH.2.00.1402071348380.10318@ra8135-ec1.am.freescale.net>

On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> 
> > 	Hi Stephen,
> > 
> > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > 
> > > > From: Gabriel Paubert <paubert@iram.es>
> > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > Date: 02/06/2014 07:26 PM
> > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > 
> > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> 
> > > > > 
> > > > >                 mask = 0;
> > > > >                 if (FM & (1 << 0))
> > > > >                         mask |= 0x0000000f;
> > > > >                 if (FM & (1 << 1))
> > > > >                         mask |= 0x000000f0;
> > > > >                 if (FM & (1 << 2))
> > > > >                         mask |= 0x00000f00;
> > > > >                 if (FM & (1 << 3))
> > > > >                         mask |= 0x0000f000;
> > > > >                 if (FM & (1 << 4))
> > > > >                         mask |= 0x000f0000;
> > > > >                 if (FM & (1 << 5))
> > > > >                         mask |= 0x00f00000;
> > > > >                 if (FM & (1 << 6))
> > > > >                         mask |= 0x0f000000;
> > > > >                 if (FM & (1 << 7))
> > > > >                         mask |= 0x90000000;
> > > > > 
> > > > > With the above mask computation I get consistent results for 
> > > > > both the MPC8548 and MPC7410 boards.
> > > > > 
> > > > > Am I missing something subtle?
> > > > 
> > > > No I think you are correct. This said, this code may probably be 
> > > optimized 
> > > > to eliminate a lot of the conditional branches. I think that:
> 
> 
> If the compiler is enabled to generate isel instructions, it would not 
> use a conditional branch for this code. (ignore the andi's values, 
> this is an old compile)
> 
> c0037c2c <mtfsf>:
> c0037c2c:       2c 03 00 00     cmpwi   r3,0
> c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
> c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
> c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
> c0037c3c:       70 66 00 01     andi.   r6,r3,1
> c0037c40:       3d 00 90 00     lis     r8,-28672
> c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
> c0037c48:       70 6a 00 02     andi.   r10,r3,2
> c0037c4c:       65 28 0f 00     oris    r8,r9,3840
> c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c54:       70 66 00 04     andi.   r6,r3,4
> c0037c58:       65 28 00 f0     oris    r8,r9,240
> c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c60:       70 6a 00 08     andi.   r10,r3,8
> c0037c64:       65 28 00 0f     oris    r8,r9,15
> c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c6c:       70 66 00 10     andi.   r6,r3,16
> c0037c70:       61 28 f0 00     ori     r8,r9,61440
> c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c78:       70 6a 00 20     andi.   r10,r3,32
> c0037c7c:       61 28 0f 00     ori     r8,r9,3840
> c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
> c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
> c0037c8c:       70 66 00 40     andi.   r6,r3,64
> c0037c90:       61 28 00 f0     ori     r8,r9,240
> c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
> c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
> c0037c9c:       61 29 00 0f     ori     r9,r9,15
> 	...
>  
> However, your other solutions are better.
> 
> 
> > > > 
> > > > mask = (FM & 1);
> > > > mask |= (FM << 3) & 0x10;
> > > > mask |= (FM << 6) & 0x100;
> > > > mask |= (FM << 9) & 0x1000;
> > > > mask |= (FM << 12) & 0x10000;
> > > > mask |= (FM << 15) & 0x100000;
> > > > mask |= (FM << 18) & 0x1000000;
> > > > mask |= (FM << 21) & 0x10000000;
> > > > mask *= 15;
> > > > 
> > > > should do the job, in less code space and without a single branch.
> > > > 
> > > > Each one of the "mask |=" lines should be translated into an
> > > > rlwinm instruction followed by an "or". Actually it should be possible
> > > > to transform each of these lines into a single "rlwimi" instruction
> > > > but I don't know how to coerce gcc to reach this level of optimization.
> > > > 
> > > > Another way of optomizing this could be:
> > > > 
> > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > mask *= 15;
> > > > 
> 
> 
> > Ok, I finally edited my sources and test compiled the suggestions
> > I gave. I'd say that method2 is the best overall indeed. You can
> > actually save one more instruction by setting mask to all ones in 
> > the case FM=0xff, but that's about all in this area.
> 
> 
> My measurements show method1 to be smaller and faster than method2 due 
> to the number of instructions needed to generate the constant masks in 
> method2, but it may depend upon your compiler and hardware.  Both are 
> faster than the original with isel.

Ok, if you have measured that method1 is faster than method2, let us go for it.
I believe method2 would be faster if you had a large out-of-order execution
window, because more parallelism can be extracted from it, but this is probably
only true for high end cores, which do not need FPU emulation in the first place.

The other place where we can optimize is the generation of FEX. Here is 
my current patch:


diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
index dbce92e..b57b3fa8 100644
--- a/arch/powerpc/math-emu/mtfsf.c
+++ b/arch/powerpc/math-emu/mtfsf.c
@@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
 	u32 mask;
 	u32 fpscr;
 
-	if (FM == 0)
+	if (likely(FM == 0xff))
+		mask = 0xffffffff;
+	else if (unlikely(FM == 0))
 		return 0;
-
-	if (FM == 0xff)
-		mask = 0x9fffffff;
 	else {
-		mask = 0;
-		if (FM & (1 << 0))
-			mask |= 0x90000000;
-		if (FM & (1 << 1))
-			mask |= 0x0f000000;
-		if (FM & (1 << 2))
-			mask |= 0x00f00000;
-		if (FM & (1 << 3))
-			mask |= 0x000f0000;
-		if (FM & (1 << 4))
-			mask |= 0x0000f000;
-		if (FM & (1 << 5))
-			mask |= 0x00000f00;
-		if (FM & (1 << 6))
-			mask |= 0x000000f0;
-		if (FM & (1 << 7))
-			mask |= 0x0000000f;
+		mask = (FM & 1);
+		mask |= (FM << 3) & 0x10;
+		mask |= (FM << 6) & 0x100;
+		mask |= (FM << 9) & 0x1000;
+		mask |= (FM << 12) & 0x10000;
+		mask |= (FM << 15) & 0x100000;
+		mask |= (FM << 18) & 0x1000000;
+		mask |= (FM << 21) & 0x10000000;
+		mask *= 15;
 	}
 
-	__FPU_FPSCR &= ~(mask);
-	__FPU_FPSCR |= (frB[1] & mask);
+	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
+		~(FPSCR_VX | FPSCR_FEX);
 
-	__FPU_FPSCR &= ~(FPSCR_VX);
-	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
+	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
 		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
 		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
-		__FPU_FPSCR |= FPSCR_VX;
-
-	fpscr = __FPU_FPSCR;
-	fpscr &= ~(FPSCR_FEX);
-	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
-	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
-	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
-	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
-	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
-		fpscr |= FPSCR_FEX;
+		fpscr |= FPSCR_VX;
+
+	/* The bit order of exception enables and exception status
+	 * is the same. Simply shift and mask to check for enabled
+	 * exceptions.
+	 */
+	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
 	__FPU_FPSCR = fpscr;
 
 #ifdef DEBUG
 mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 35 deletions(-)


Notes: 

1) I'm really unsure on whether 0xff is frequent or not. So the likely()
statement at the beginning may be wrong. Actually, if it is not very likely,
it might be better to remove the special casef for FM==0xff. A look at 
GCC sources shows that it never generates a mask of 0xff. From glibc
sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

2) it may be better to remove the check for FM==0, after all, the instruction
efectively becomes a nop, and generating the instruction in the first place
would be too stupid for words.

3) if might be better to #define the magic constants (22 and 0xf8) used 
in the last if statement.

	Gabriel
> 

^ permalink raw reply related

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 11:17 UTC (permalink / raw)
  To: 'Gabriel Paubert', James Yang
  Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

> > However, your other solutions are better.
> >
> >
> > > > >
> > > > > mask =3D (FM & 1);
> > > > > mask |=3D (FM << 3) & 0x10;
> > > > > mask |=3D (FM << 6) & 0x100;
> > > > > mask |=3D (FM << 9) & 0x1000;
> > > > > mask |=3D (FM << 12) & 0x10000;
> > > > > mask |=3D (FM << 15) & 0x100000;
> > > > > mask |=3D (FM << 18) & 0x1000000;
> > > > > mask |=3D (FM << 21) & 0x10000000;
> > > > > mask *=3D 15;
> > > > >
> > > > > should do the job, in less code space and without a single branch=
.
...
> > > > > Another way of optomizing this could be:
> > > > >
> > > > > mask =3D (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > mask =3D (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > mask =3D (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > mask *=3D 15;
...
> Ok, if you have measured that method1 is faster than method2, let us go f=
or it.
> I believe method2 would be faster if you had a large out-of-order executi=
on
> window, because more parallelism can be extracted from it, but this is pr=
obably
> only true for high end cores, which do not need FPU emulation in the firs=
t place.

FWIW the second has a long dependency chain on 'mask', whereas the first ca=
n execute
the shift/and in any order and then merge the results.
So on most superscalar cpu, or one with result delays for arithmetic, the f=
irst
is likely to be faster.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 12:21 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBCE9@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 11:17:38AM +0000, David Laight wrote:
> > > However, your other solutions are better.
> > >
> > >
> > > > > >
> > > > > > mask = (FM & 1);
> > > > > > mask |= (FM << 3) & 0x10;
> > > > > > mask |= (FM << 6) & 0x100;
> > > > > > mask |= (FM << 9) & 0x1000;
> > > > > > mask |= (FM << 12) & 0x10000;
> > > > > > mask |= (FM << 15) & 0x100000;
> > > > > > mask |= (FM << 18) & 0x1000000;
> > > > > > mask |= (FM << 21) & 0x10000000;
> > > > > > mask *= 15;
> > > > > >
> > > > > > should do the job, in less code space and without a single branch.
> ...
> > > > > > Another way of optomizing this could be:
> > > > > >
> > > > > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > > > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > > > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > > > > mask *= 15;
> ...
> > Ok, if you have measured that method1 is faster than method2, let us go for it.
> > I believe method2 would be faster if you had a large out-of-order execution
> > window, because more parallelism can be extracted from it, but this is probably
> > only true for high end cores, which do not need FPU emulation in the first place.
> 
> FWIW the second has a long dependency chain on 'mask', whereas the first can execute
> the shift/and in any order and then merge the results.
> So on most superscalar cpu, or one with result delays for arithmetic, the first
> is likely to be faster.

I disagree, perhaps mostly because the compiler is not clever enough, but right
now the code for solution 1 is (actually I have rewritten the code
and it reads:

	mask = (FM & 1)
			| ((FM << 3) & 0x10)
			| ((FM << 6) & 0x100)
			| ((FM << 9) & 0x1000)
			| ((FM << 12) & 0x10000)
			| ((FM << 15) & 0x100000)
			| ((FM << 18) & 0x1000000)
			| ((FM << 21) & 0x10000000);
to avoid sequence point in case it hampers the compiler)

and the output is:

        rlwinm 10,3,3,27,27      # D.11621, FM,,
        rlwinm 9,3,6,23,23       # D.11621, FM,,
        or 9,10,9        #, D.11621, D.11621, D.11621
        rlwinm 10,3,0,31,31      # D.11621, FM,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,9,19,19      # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,12,15,15     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,15,11,11     # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 10,3,18,7,7       # D.11621, FM,,
        or 9,9,10        #, D.11621, D.11621, D.11621
        rlwinm 3,3,21,3,3        # D.11621, FM,,
        or 9,9,3         #, mask, D.11621, D.11621
        mulli 9,9,15     # mask, mask,

see that r9 is used 7 times as both input and output operand, plus
once for rlwinm. This gives a dependency length of 8 at least.

In the other case (I've deleted the code) the dependency length
was significantly shorter. In any case that one is fewer instructions, 
which is good for occasional use. 

	Gabriel

^ permalink raw reply

* RE: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: David Laight @ 2014-02-10 12:32 UTC (permalink / raw)
  To: 'Gabriel Paubert'
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20140210122138.GA30356@visitor2.iram.es>

> I disagree, perhaps mostly because the compiler is not clever enough, but=
 right
> now the code for solution 1 is (actually I have rewritten the code
> and it reads:
>=20
> 	mask =3D (FM & 1)
> 			| ((FM << 3) & 0x10)
> 			| ((FM << 6) & 0x100)
> 			| ((FM << 9) & 0x1000)
> 			| ((FM << 12) & 0x10000)
> 			| ((FM << 15) & 0x100000)
> 			| ((FM << 18) & 0x1000000)
> 			| ((FM << 21) & 0x10000000);
> to avoid sequence point in case it hampers the compiler)
>=20
> and the output is:
>=20
>         rlwinm 10,3,3,27,27      # D.11621, FM,,
>         rlwinm 9,3,6,23,23       # D.11621, FM,,
>         or 9,10,9        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,0,31,31      # D.11621, FM,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,9,19,19      # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,12,15,15     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,15,11,11     # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 10,3,18,7,7       # D.11621, FM,,
>         or 9,9,10        #, D.11621, D.11621, D.11621
>         rlwinm 3,3,21,3,3        # D.11621, FM,,
>         or 9,9,3         #, mask, D.11621, D.11621
>         mulli 9,9,15     # mask, mask,
>=20
> see that r9 is used 7 times as both input and output operand, plus
> once for rlwinm. This gives a dependency length of 8 at least.
>=20
> In the other case (I've deleted the code) the dependency length
> was significantly shorter. In any case that one is fewer instructions,
> which is good for occasional use.

Hmmm... I hand-counted a dependency length of 8 for the other version.
Maybe there are some ppc instructions that reduce it.

Stupid compiler :-)
Trouble is, I bet that even if you code it as:
 	mask1 =3D (FM & 1) | ((FM << 3) & 0x10);
	mask2 =3D ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
	mask3 =3D ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
	mask4 =3D ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
	mask1 |=3D mask2;
	mask3 |=3D mask4;
	mask =3D mask1 | mask3;
the compiler will 'optimise' it to the above before code generation.
If it doesn't adding () to pair the | might be enough.
Then a new version of the compiler will change the behaviour again.

	David

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-10 13:00 UTC (permalink / raw)
  To: David Laight
  Cc: James Yang, Chris Proctor, Stephen N Chivers,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BBDA6@AcuExch.aculab.com>

On Mon, Feb 10, 2014 at 12:32:18PM +0000, David Laight wrote:
> > I disagree, perhaps mostly because the compiler is not clever enough, but right
> > now the code for solution 1 is (actually I have rewritten the code
> > and it reads:
> > 
> > 	mask = (FM & 1)
> > 			| ((FM << 3) & 0x10)
> > 			| ((FM << 6) & 0x100)
> > 			| ((FM << 9) & 0x1000)
> > 			| ((FM << 12) & 0x10000)
> > 			| ((FM << 15) & 0x100000)
> > 			| ((FM << 18) & 0x1000000)
> > 			| ((FM << 21) & 0x10000000);
> > to avoid sequence point in case it hampers the compiler)
> > 
> > and the output is:
> > 
> >         rlwinm 10,3,3,27,27      # D.11621, FM,,
> >         rlwinm 9,3,6,23,23       # D.11621, FM,,
> >         or 9,10,9        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,0,31,31      # D.11621, FM,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,9,19,19      # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,12,15,15     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,15,11,11     # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 10,3,18,7,7       # D.11621, FM,,
> >         or 9,9,10        #, D.11621, D.11621, D.11621
> >         rlwinm 3,3,21,3,3        # D.11621, FM,,
> >         or 9,9,3         #, mask, D.11621, D.11621
> >         mulli 9,9,15     # mask, mask,
> > 
> > see that r9 is used 7 times as both input and output operand, plus
> > once for rlwinm. This gives a dependency length of 8 at least.
> > 
> > In the other case (I've deleted the code) the dependency length
> > was significantly shorter. In any case that one is fewer instructions,
> > which is good for occasional use.
> 
> Hmmm... I hand-counted a dependency length of 8 for the other version.
> Maybe there are some ppc instructions that reduce it.

Either I misread the generated code or I got somewhat less.

What helps for method1 is the rotate and mask instructions of PPC. Each of
left shift and mask becomes a single rlwinm. 
> 
> Stupid compiler :-)

Indeed. I've trying to coerce it into generating rlwimi instructions
(in which case the whole building of the mask reduces to 8 assembly
instructions) and failed. It seems that the compiler lacks some patterns
some patterns that would directly map to rlwimi.

> Trouble is, I bet that even if you code it as:
>  	mask1 = (FM & 1) | ((FM << 3) & 0x10);
> 	mask2 = ((FM << 6) & 0x100) | ((FM << 9) & 0x1000);
> 	mask3 = ((FM << 12) & 0x10000) | ((FM << 15) & 0x100000);
> 	mask4 = ((FM << 18) & 0x1000000) | ((FM << 21) & 0x10000000);
> 	mask1 |= mask2;
> 	mask3 |= mask4;
> 	mask = mask1 | mask3;
> the compiler will 'optimise' it to the above before code generation.

Indeed it's what it does :-(

I believe that the current suggestion is good enough.

	Gabriel

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-10 15:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <1392001823.3996.21.camel@pasglop>

On Mon, Feb 10, 2014 at 02:10:23PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
> >  typedef struct {
> > -       volatile unsigned int slock;
> > -} arch_spinlock_t;
> > +       union {
> > +               __ticketpair_t head_tail;
> > +               struct __raw_tickets {
> > +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> > +                       __ticket_t tail, head;
> > +#else
> > +                       __ticket_t head, tail;
> > +#endif
> > +               } tickets;
> > +       };
> > +#if defined(CONFIG_PPC_SPLPAR)
> > +       u32 holder;
> > +#endif
> > +} arch_spinlock_t __aligned(4);
> 
> That's still broken with lockref (which we just merged).
> 
> We must have the arch_spinlock_t and the ref in the same 64-bit word
> otherwise it will break.

Well, as far as I can see you'll just not be able to
USE_CMPXCHG_LOCKREF -- with the appropriate performance hit --
the code just falls back into lock&ref on pSeries.
What again was the intention of directed yield in the first place...?

> We can make it work in theory since the holder doesn't have to be
> accessed atomically, but the practicals are a complete mess ...
> lockref would essentially have to re-implement the holder handling
> of the spinlocks and use lower level ticket stuff.
> 
> Unless you can find a sneaky trick ... :-(

What if I squeeze the bits a little?
4k vCPUs, and 256 physical, as a limit to stay within 32 bits?
At the cost that unlock may become an ll/sc operation again.
I could think about a trick against that.
But alas, hw_cpu_id is 16 bit, which makes a lookup table neccessary :-/

Doing another round of yields for lockrefs now doesn't
sound so bad any more.

Opinions, anyone?

	Torsten

^ permalink raw reply

* [RFT][PATCH 04/12] drivers/macintosh/adb: change platform power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw, benh; +Cc: shuahkhan, Shuah Khan, linuxppc-dev, linux-kernel, linux-pm
In-Reply-To: <cover.1392040064.git.shuah.kh@samsung.com>

Change adb platform driver to register pm ops using dev_pm_ops instead of
legacy pm_ops. .pm hooks call existing legacy suspend and resume interfaces
by passing in the right pm state.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
---
 drivers/macintosh/adb.c |   41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 04a5049..df6b201 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -263,7 +263,7 @@ adb_reset_bus(void)
 /*
  * notify clients before sleep
  */
-static int adb_suspend(struct platform_device *dev, pm_message_t state)
+static int __adb_suspend(struct platform_device *dev, pm_message_t state)
 {
 	adb_got_sleep = 1;
 	/* We need to get a lock on the probe thread */
@@ -276,10 +276,25 @@ static int adb_suspend(struct platform_device *dev, pm_message_t state)
 	return 0;
 }
 
+static int adb_suspend(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_SUSPEND);
+}
+
+static int adb_freeze(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_FREEZE);
+}
+
+static int adb_poweroff(struct device *dev)
+{
+	return __adb_suspend(to_platform_device(dev), PMSG_HIBERNATE);
+}
+
 /*
  * reset bus after sleep
  */
-static int adb_resume(struct platform_device *dev)
+static int __adb_resume(struct platform_device *dev)
 {
 	adb_got_sleep = 0;
 	up(&adb_probe_mutex);
@@ -287,6 +302,11 @@ static int adb_resume(struct platform_device *dev)
 
 	return 0;
 }
+
+static int adb_resume(struct device *dev)
+{
+	return __adb_resume(to_platform_device(dev));
+}
 #endif /* CONFIG_PM */
 
 static int __init adb_init(void)
@@ -831,14 +851,25 @@ static const struct file_operations adb_fops = {
 	.release	= adb_release,
 };
 
+#ifdef CONFIG_PM
+static const struct dev_pm_ops adb_dev_pm_ops = {
+	.suspend = adb_suspend,
+	.resume = adb_resume,
+	/* Hibernate hooks */
+	.freeze = adb_freeze,
+	.thaw = adb_resume,
+	.poweroff = adb_poweroff,
+	.restore = adb_resume,
+};
+#endif
+
 static struct platform_driver adb_pfdrv = {
 	.driver = {
 		.name = "adb",
-	},
 #ifdef CONFIG_PM
-	.suspend = adb_suspend,
-	.resume = adb_resume,
+		.pm = &adb_dev_pm_ops,
 #endif
+	},
 };
 
 static struct platform_device adb_pfdev = {
-- 
1.7.10.4

^ permalink raw reply related

* [RFT][PATCH 00/12] change drivers power management to use dev_pm_ops
From: Shuah Khan @ 2014-02-10 16:12 UTC (permalink / raw)
  To: rjw
  Cc: Shuah Khan, heiko.carstens, chris, shuahkhan, dwalker,
	ingo.tuchscherer, sonic.zhang, linux-s390, linux, davidb,
	linux-pm, linux-arm-msm, linux-pcmcia, adi-buildroot-devel,
	mirq-linux, ian, manuel.lauss, linux-arm-kernel, gregkh,
	linux-mmc, linux-kernel, schwidefsky, linux390, linuxppc-dev

Change drivers to register pm ops using dev_pm_ops instead of legacy pm_ops.
.pm hooks call existing legacy suspend and resume interfaces by passing in
the right pm state. Bus drivers suspend and resume routines call .pm driver
hooks if found.

Shuah Khan (12):
  arm: change locomo platform and bus power management to use
    dev_pm_ops
  arm: change sa1111 platform and bus power management to use
    dev_pm_ops
  arm: change scoop platform power management to use dev_pm_ops
  drivers/macintosh/adb: change platform power managemnet to use
    dev_pm_ops
  mmc: change au1xmmc platform power management to use dev_pm_ops
  mmc: change bfin_sdh platform power management to use dev_pm_ops
  isa: change isa bus power managemnet to use dev_pm_ops
  mmc: change cb710-mmc platform power management to use dev_pm_ops
  mmc: change msm_sdcc platform power management to use dev_pm_ops
  mmc: change tmio_mmc platform power management to use dev_pm_ops
  drivers/pcmcia: change ds driver power management to use dev_pm_ops
  drivers/s390/crypto: change ap_bus driver power management to use
    dev_pm_ops

 arch/arm/common/locomo.c     |   93 +++++++++++++++++++++++++++++++++++-------
 arch/arm/common/sa1111.c     |   88 +++++++++++++++++++++++++++++++--------
 arch/arm/common/scoop.c      |   44 ++++++++++++++++----
 drivers/base/isa.c           |   30 ++++++++++++--
 drivers/macintosh/adb.c      |   41 ++++++++++++++++---
 drivers/mmc/host/au1xmmc.c   |   43 +++++++++++++++----
 drivers/mmc/host/bfin_sdh.c  |   40 +++++++++++++++---
 drivers/mmc/host/cb710-mmc.c |   37 +++++++++++++++--
 drivers/mmc/host/msm_sdcc.c  |   42 +++++++++++++++----
 drivers/mmc/host/tmio_mmc.c  |   42 +++++++++++++++----
 drivers/pcmcia/ds.c          |   36 +++++++++++++---
 drivers/s390/crypto/ap_bus.c |   30 ++++++++++++--
 12 files changed, 481 insertions(+), 85 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 16:50 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: James Yang, Chris Proctor, linuxppc-dev
In-Reply-To: <OF80E982EC.BE15A034-ONCA257C7A.006B58B5-CA257C7A.006C3EA8@csc.com>

On Sun, 9 Feb 2014, Stephen N Chivers wrote:

> James Yang <James.Yang@freescale.com> wrote on 02/08/2014 07:49:40 AM:
> 
> > From: James Yang <James.Yang@freescale.com>
> > To: Gabriel Paubert <paubert@iram.es>
> > Cc: Stephen N Chivers <schivers@csc.com.au>, Chris Proctor 
> > <cproctor@csc.com.au>, <linuxppc-dev@lists.ozlabs.org>
> > Date: 02/08/2014 07:49 AM
> > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > 
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > >    Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor 
> <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.
> > > > > > 
> > > > > > Am I missing something subtle?
> > > > > 
> > > > > No I think you are correct. This said, this code may probably be 
> > > > optimized 
> > > > > to eliminate a lot of the conditional branches. I think that:
> > 
> > 
> > If the compiler is enabled to generate isel instructions, it would not 
> > use a conditional branch for this code. (ignore the andi's values, 
> > this is an old compile)
> > 
> From limited research, the 440GP is a processor
> that doesn't implement the isel instruction and it does
> not implement floating point.
> 
> The kernel emulates isel and so using that instruction
> for the 440GP would have a double trap penalty.

Are you writing about something outside the scope of this thread?  OP 
was using MPC8548 not a 440GP.  The compiler should not be using or 
targeting 8548 for a 440GP so having to emulate isel shouldn't be an 
issue because there wouldn't be any.  (The assembly listing I posted 
was generated by gcc targeting 8548.)  Anyway, I had measured the 
non-isel routines to be faster and that works without illop traps.

 
> Correct me if I am wrong, the isel instruction first appears
> in PowerPC ISA v2.04 around mid 2007. 


isel appeared in 2003 in the e500 (v1) core that is in the MPC8540.  
The instruction is Power ISA 2.03 (9/2006).

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-10 17:03 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: James Yang, Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <20140210110342.GA15806@visitor2.iram.es>

On Mon, 10 Feb 2014, Gabriel Paubert wrote:

> On Fri, Feb 07, 2014 at 02:49:40PM -0600, James Yang wrote:
> > On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> > 
> > > 	Hi Stephen,
> > > 
> > > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > > 
> > > > > From: Gabriel Paubert <paubert@iram.es>
> > > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > > Date: 02/06/2014 07:26 PM
> > > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > > 
> > > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:


> > 
> > > > > > 
> > > > > >                 mask = 0;
> > > > > >                 if (FM & (1 << 0))
> > > > > >                         mask |= 0x0000000f;
> > > > > >                 if (FM & (1 << 1))
> > > > > >                         mask |= 0x000000f0;
> > > > > >                 if (FM & (1 << 2))
> > > > > >                         mask |= 0x00000f00;
> > > > > >                 if (FM & (1 << 3))
> > > > > >                         mask |= 0x0000f000;
> > > > > >                 if (FM & (1 << 4))
> > > > > >                         mask |= 0x000f0000;
> > > > > >                 if (FM & (1 << 5))
> > > > > >                         mask |= 0x00f00000;
> > > > > >                 if (FM & (1 << 6))
> > > > > >                         mask |= 0x0f000000;
> > > > > >                 if (FM & (1 << 7))
> > > > > >                         mask |= 0x90000000;
> > > > > > 
> > > > > > With the above mask computation I get consistent results for 
> > > > > > both the MPC8548 and MPC7410 boards.

> Ok, if you have measured that method1 is faster than method2, let us go for it.
> I believe method2 would be faster if you had a large out-of-order execution
> window, because more parallelism can be extracted from it, but this is probably
> only true for high end cores, which do not need FPU emulation in the first place.

Yeah, 8548 can issue 2 SFX instructions per cycle which is what the 
compiler generated.   

 
> The other place where we can optimize is the generation of FEX. Here is 
> my current patch:
> 
> 
> diff --git a/arch/powerpc/math-emu/mtfsf.c b/arch/powerpc/math-emu/mtfsf.c
> index dbce92e..b57b3fa8 100644
> --- a/arch/powerpc/math-emu/mtfsf.c
> +++ b/arch/powerpc/math-emu/mtfsf.c
> @@ -11,48 +11,35 @@ mtfsf(unsigned int FM, u32 *frB)
>  	u32 mask;
>  	u32 fpscr;
>  
> -	if (FM == 0)
> +	if (likely(FM == 0xff))
> +		mask = 0xffffffff;
> +	else if (unlikely(FM == 0))
>  		return 0;
> -
> -	if (FM == 0xff)
> -		mask = 0x9fffffff;
>  	else {
> -		mask = 0;
> -		if (FM & (1 << 0))
> -			mask |= 0x90000000;
> -		if (FM & (1 << 1))
> -			mask |= 0x0f000000;
> -		if (FM & (1 << 2))
> -			mask |= 0x00f00000;
> -		if (FM & (1 << 3))
> -			mask |= 0x000f0000;
> -		if (FM & (1 << 4))
> -			mask |= 0x0000f000;
> -		if (FM & (1 << 5))
> -			mask |= 0x00000f00;
> -		if (FM & (1 << 6))
> -			mask |= 0x000000f0;
> -		if (FM & (1 << 7))
> -			mask |= 0x0000000f;
> +		mask = (FM & 1);
> +		mask |= (FM << 3) & 0x10;
> +		mask |= (FM << 6) & 0x100;
> +		mask |= (FM << 9) & 0x1000;
> +		mask |= (FM << 12) & 0x10000;
> +		mask |= (FM << 15) & 0x100000;
> +		mask |= (FM << 18) & 0x1000000;
> +		mask |= (FM << 21) & 0x10000000;
> +		mask *= 15;


Needs to also mask out bits 1 and 2, they aren't to be set from frB.

		mask &= 0x9FFFFFFF;




>  	}
>  
> -	__FPU_FPSCR &= ~(mask);
> -	__FPU_FPSCR |= (frB[1] & mask);
> +	fpscr = ((__FPU_FPSCR & ~mask) | (frB[1] & mask)) &
> +		~(FPSCR_VX | FPSCR_FEX);
>  
> -	__FPU_FPSCR &= ~(FPSCR_VX);
> -	if (__FPU_FPSCR & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
> +	if (fpscr & (FPSCR_VXSNAN | FPSCR_VXISI | FPSCR_VXIDI |
>  		     FPSCR_VXZDZ | FPSCR_VXIMZ | FPSCR_VXVC |
>  		     FPSCR_VXSOFT | FPSCR_VXSQRT | FPSCR_VXCVI))
> -		__FPU_FPSCR |= FPSCR_VX;
> -
> -	fpscr = __FPU_FPSCR;
> -	fpscr &= ~(FPSCR_FEX);
> -	if (((fpscr & FPSCR_VX) && (fpscr & FPSCR_VE)) ||
> -	    ((fpscr & FPSCR_OX) && (fpscr & FPSCR_OE)) ||
> -	    ((fpscr & FPSCR_UX) && (fpscr & FPSCR_UE)) ||
> -	    ((fpscr & FPSCR_ZX) && (fpscr & FPSCR_ZE)) ||
> -	    ((fpscr & FPSCR_XX) && (fpscr & FPSCR_XE)))
> -		fpscr |= FPSCR_FEX;
> +		fpscr |= FPSCR_VX;
> +
> +	/* The bit order of exception enables and exception status
> +	 * is the same. Simply shift and mask to check for enabled
> +	 * exceptions.
> +	 */
> +	if (fpscr & (fpscr >> 22) &  0xf8) fpscr |= FPSCR_FEX;
>  	__FPU_FPSCR = fpscr;
>  
>  #ifdef DEBUG
>  mtfsf.c |   57 ++++++++++++++++++++++-----------------------------------
>  1 file changed, 22 insertions(+), 35 deletions(-)
> 
> 
> Notes: 
> 
> 1) I'm really unsure on whether 0xff is frequent or not. So the likely()
> statement at the beginning may be wrong. Actually, if it is not very likely,
> it might be better to remove the special casef for FM==0xff. A look at 
> GCC sources shows that it never generates a mask of 0xff. From glibc
> sources, there vast majority of cases uses 0x1, only isnan() uses 0xff.

Can't handle all cases here.  
 
> 2) it may be better to remove the check for FM==0, after all, the instruction
> efectively becomes a nop, and generating the instruction in the first place
> would be too stupid for words.

Hmm a heavy no-op.  I wonder if it is heavier than a sync.
 
> 3) if might be better to #define the magic constants (22 and 0xf8) used 
> in the last if statement.
> 
> 	Gabriel

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Peter Zijlstra @ 2014-02-10 17:53 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140210155217.GF2107@lst.de>

On Mon, Feb 10, 2014 at 04:52:17PM +0100, Torsten Duwe wrote:
> Opinions, anyone?

Since the holder thing is a performance thing, not a correctness thing;
one thing you could do is something like:

static const int OWNER_HASH_SIZE = CONFIG_NR_CPUS * 4;
static const int OWNER_HASH_BITS = ilog2(OWNER_HASH_SIZE);

u16 lock_owner_array[OWNER_HASH_SIZE] = { 0, };

void set_owner(struct arch_spinlock_t *lock, int owner)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	lock_owner_array[hash] = owner;
}

void yield_to_owner(struct arch_spinlock_t *lock)
{
	int hash = hash_ptr(lock, OWNER_HASH_BITS);
	int owner = lock_owner_array[hash];
	yield_to_cpu(owner);
}

And call set_owner() after the ticket lock is acquired, and don't bother
clearing it again; a new acquire will overwrite, a collision we have to
live with.

It should on average get you the right yield and does away with having
to track the owner field in place.

It does however get you an extra cacheline miss on acquire :/

One could consider patching it out when you know your kernel is not
running on an overloaded partition.

^ permalink raw reply

* Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 18:47 UTC (permalink / raw)
  To: B48286@freescale.com
  Cc: Scott Wood, linuxppc-dev@ozlabs.org, Mingkai.Hu@freescale.com,
	linux-mtd@lists.infradead.org, Ezequiel Garcia
In-Reply-To: <66941b677b7e4b8d9ce4cc44304b0955@DM2PR03MB301.namprd03.prod.outlook.com>

Hi Hou,

On Thu, Jan 23, 2014 at 03:29:41AM +0000, B48286@freescale.com wrote:
> > -----Original Message-----
> > From: Brian Norris [mailto:computersforpeace@gmail.com]
> > Sent: Thursday, January 23, 2014 10:12 AM
> > To: Hou Zhiqiang-B48286
> > Cc: linux-mtd@lists.infradead.org; linuxppc-dev@ozlabs.org; Wood Scott-
> > B07421; Hu Mingkai-B21284; Ezequiel Garcia
> > Subject: Re: [PATCH] mtd: m25p80: Make the name of mtd_info fixed
> > 
> > On Mon, Jan 06, 2014 at 02:34:29PM +0800, Hou Zhiqiang wrote:
> > > --- a/drivers/mtd/devices/m25p80.c
> > > +++ b/drivers/mtd/devices/m25p80.c
> > > @@ -1012,7 +1012,8 @@ static int m25p_probe(struct spi_device *spi)
> > >  	if (data && data->name)
> > >  		flash->mtd.name = data->name;
> > >  	else
> > > -		flash->mtd.name = dev_name(&spi->dev);
> > > +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> > > +				id->name, spi->chip_select);
> > 
> > Changing the mtd.name may have far-reaching consequences for users who
> > already have mtdparts= command lines. But your concern is probably valid
> > for dynamically-determined bus numbers. Perhaps you can edit this patch
> > to only change the name when the busnum is dynamically-allocated?
> >
> It's a good idea, but in the case of mtd_info's name dynamically-allocated
> using "mtdparts=..." in command lines is illegal obviously.

I agree that users should never have relied on the dynamically-allocated
name. But changing the name for non-dynamic schemes (e.g., where the SPI
busnum is a fixed value) is not pleasant for users.

> Would you tell
> me what side-effect will be brought by the change of mtd_info's name.

I can only think of the mtdparts= command line. Otherwise, I don't think
the name is very important.

Brian

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-10 19:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1):

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use a helper function, rather than direct access to
direct_node, such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply

* Re: [PATCH v2] mtd: m25p80: Make the name of mtd_info fixed
From: Brian Norris @ 2014-02-10 19:39 UTC (permalink / raw)
  To: Hou Zhiqiang; +Cc: scottwood, linuxppc-dev, mingkai.hu, linux-mtd, linux-spi
In-Reply-To: <1390717003-28699-1-git-send-email-b48286@freescale.com>

On Sun, Jan 26, 2014 at 02:16:43PM +0800, Hou Zhiqiang wrote:
> To give spi flash layout using "mtdparts=..." in cmdline, we must
> give mtd_info a fixed name,because the cmdlinepart's parser will
> match the name given in cmdline with the mtd_info.
> 
> Now, if use OF node, mtd_info's name will be spi->dev->name. It
> consists of spi_master->bus_num, and the spi_master->bus_num maybe
> dynamically fetched.
> So, give the mtd_info a new fiexd name "name.cs", "name" is name of
> spi_device_id and "cs" is chip-select in spi_dev.
> 
> Signed-off-by: Hou Zhiqiang <b48286@freescale.com>
> ---
> v2:
>  - add check for return value of function kasprintf.
>  - whether the spi_master->bus_num is dynamical is determined by spi
>    controller driver, and it can't be check in this driver. So, we can
>    not initial the mtd_info's name by distinguishing the spi_master
>    bus_num dynamically-allocated or not.

How about spi->master->bus_num < 0 ?

>  drivers/mtd/devices/m25p80.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mtd/devices/m25p80.c b/drivers/mtd/devices/m25p80.c
> index eb558e8..1f494d2 100644
> --- a/drivers/mtd/devices/m25p80.c
> +++ b/drivers/mtd/devices/m25p80.c
> @@ -1011,8 +1011,12 @@ static int m25p_probe(struct spi_device *spi)
>  
>  	if (data && data->name)
>  		flash->mtd.name = data->name;
> -	else
> -		flash->mtd.name = dev_name(&spi->dev);
> +	else {
> +		flash->mtd.name = kasprintf(GFP_KERNEL, "%s.%d",
> +				id->name, spi->chip_select);

I don't think this name is specific enough. What if there are more than
one SPI controller? Then there could be one chip with the same
chip-select. You probably still need to incorporate the SPI master
somehow, even if it's not by using the bus number directly (because it's
dynamic).

> +		if (!flash->mtd.name)
> +			return -ENOMEM;
> +	}
>  
>  	flash->mtd.type = MTD_NORFLASH;
>  	flash->mtd.writesize = 1;

Brian

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox