LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 03/18] powerpc/boot: use prom_arg_t in oflib
From: Cédric Le Goater @ 2014-02-07 15:59 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1391788771-16405-1-git-send-email-clg@fr.ibm.com>

This patch updates the wrapper code to converge with the kernel code in
prom_init.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---
 arch/powerpc/boot/oflib.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/boot/oflib.c b/arch/powerpc/boot/oflib.c
index c3288a3446b3..3b0c9458504f 100644
--- a/arch/powerpc/boot/oflib.c
+++ b/arch/powerpc/boot/oflib.c
@@ -16,6 +16,8 @@
 
 #include "of.h"
 
+typedef u32 prom_arg_t;
+
 /* The following structure is used to communicate with open firmware.
  * All arguments in and out are in big endian format. */
 struct prom_args {
@@ -46,7 +48,7 @@ int of_call_prom(const char *service, int nargs, int nret, ...)
 
 	va_start(list, nret);
 	for (i = 0; i < nargs; i++)
-		args.args[i] = va_arg(list, unsigned int);
+		args.args[i] = va_arg(list, prom_arg_t);
 	va_end(list);
 
 	for (i = 0; i < nret; i++)
@@ -59,7 +61,7 @@ int of_call_prom(const char *service, int nargs, int nret, ...)
 }
 
 static int of_call_prom_ret(const char *service, int nargs, int nret,
-			    unsigned int *rets, ...)
+			    prom_arg_t *rets, ...)
 {
 	int i;
 	struct prom_args args;
@@ -71,7 +73,7 @@ static int of_call_prom_ret(const char *service, int nargs, int nret,
 
 	va_start(list, rets);
 	for (i = 0; i < nargs; i++)
-		args.args[i] = va_arg(list, unsigned int);
+		args.args[i] = va_arg(list, prom_arg_t);
 	va_end(list);
 
 	for (i = 0; i < nret; i++)
@@ -148,7 +150,7 @@ static int check_of_version(void)
 void *of_claim(unsigned long virt, unsigned long size, unsigned long align)
 {
 	int ret;
-	unsigned int result;
+	prom_arg_t result;
 
 	if (need_map < 0)
 		need_map = check_of_version();
-- 
1.7.10.4

^ permalink raw reply related

* [RFC PATCH 01/18] powerpc/boot: fix do_div for 64bit wrapper
From: Cédric Le Goater @ 2014-02-07 15:59 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1391788771-16405-1-git-send-email-clg@fr.ibm.com>

When the boot wrapper is compiled in 64bit, there is no need to
use __div64_32.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---
 arch/powerpc/boot/stdio.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/boot/stdio.c b/arch/powerpc/boot/stdio.c
index 5b57800bbc67..a701261b1781 100644
--- a/arch/powerpc/boot/stdio.c
+++ b/arch/powerpc/boot/stdio.c
@@ -21,6 +21,18 @@ size_t strnlen(const char * s, size_t count)
 	return sc - s;
 }
 
+#ifdef __powerpc64__
+
+# define do_div(n, base) ({						\
+	unsigned int __base = (base);					\
+	unsigned int __rem;						\
+	__rem = ((unsigned long long)(n)) % __base;			\
+	(n) = ((unsigned long long)(n)) / __base;			\
+	__rem;								\
+})
+
+#else
+
 extern unsigned int __div64_32(unsigned long long *dividend,
 			       unsigned int divisor);
 
@@ -39,6 +51,8 @@ extern unsigned int __div64_32(unsigned long long *dividend,
 	__rem;								\
  })
 
+#endif /* __powerpc64__ */
+
 static int skip_atoi(const char **s)
 {
 	int i, c;
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Kumar Gala @ 2014-02-07 15:51 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>


On Feb 7, 2014, at 3:02 AM, Torsten Duwe <duwe@lst.de> wrote:

> On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
>> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
>>> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
>>=20
>>>> Can you pair lwarx with sthcx ? I couldn't immediately find the =
answer
>>>> in the PowerISA doc. If so I think you can do better by being able =
to
>>>> atomically load both tickets but only storing the head without =
affecting
>>>> the tail.
>=20
> Can I simply write the half word, without a reservation, or will the =
HW caches
> mess up the other half? Will it ruin the cache coherency on some =
(sub)architectures?

The coherency should be fine, I just can=92t remember if you=92ll lose =
the reservation by doing this.

>> Plus, sthcx doesn't exist on all PPC chips.
>=20
> Which ones are lacking it? Do all have at least a simple 16-bit store?

Everything implements a simple 16-bit store, just not everything =
implements the store conditional of 16-bit data.

- k=

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 15:43 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207151847.GB3104@twins.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 04:18:47PM +0100, Peter Zijlstra wrote:
> void ticket_lock(tickets_t *lock)
> {
> 	tickets_t t;
> 
> 	/*
> 	 * Because @head is MSB, the direct increment wrap doesn't disturb
> 	 * @tail.
> 	 */
> 	t.pair = xadd(&lock->pair, 1<<16);
> 
> 	if (likely(t.head == t.tail)) {
> 		__lwsync(); /* acquire */
> 		return;
> 	}
> 
> 	while (smp_load_acquire(&lock->tail) != t.tail)
> 		cpu_relax();

That should be "!= t.head", for that contains our ticket.

I'm hopelessly scatter brained today it seems :/


> }
> 
> void ticket_unlock(tickets_t *lock)
> {
> 	ticket_t tail = lock->tail + 1;
> 
> 	/*
> 	 * The store is save against the xadd for it will make the ll/sc fail
> 	 * and try again. Aside from that PowerISA guarantees single-copy
> 	 * atomicy for half-word writes.
> 	 *
> 	 * And since only the lock owner will ever write the tail, we're good.
> 	 */
> 	smp_store_release(&lock->tail, tail);
> }

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 15:18 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207122837.GA3104@twins.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 01:28:37PM +0100, Peter Zijlstra wrote:
> Anyway, you can do a version with lwarx/stwcx if you're looking get rid
> of lharx.

the below seems to compile into relatively ok asm. It can be done better
if you write the entire thing by hand though.

---

typedef unsigned short ticket_t;

typedef struct {
	union {
		unsigned int pair;
		struct {
			/* ensure @head is the MSB */
#ifdef __BIG_ENDIAN__
			ticket_t head,tail;
#else
			ticket_t tail,head;
#endif
		};
	};
} tickets_t;

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

#define barrier()	__asm__ __volatile__ ("" : : :"memory")
#define __lwsync()	__asm__ __volatile__ ("lwsync" : : :"memory")

#define smp_store_release(p, v)                                         \
do {                                                                    \
        __lwsync();                                                     \
        ACCESS_ONCE(*p) = (v);                                          \
} while (0)

#define smp_load_acquire(p)                                             \
({                                                                      \
        typeof(*p) ___p1 = ACCESS_ONCE(*p);                             \
        __lwsync();                                                     \
        ___p1;                                                          \
})

#define likely(x)	__builtin_expect(!!(x), 1)

#define cpu_relax()	barrier();

static inline unsigned int xadd(unsigned int *v, unsigned int i)
{
	int t, ret;
	
	__asm__ __volatile__ (
"1:	lwarx	%0, 0, %4\n"
"	mr	%1, %0\n"
"	add	%0, %3, %0\n"
"	stwcx.	%0, %0, %4\n"
"	bne-	1b\n"
	: "=&r" (t), "=&r" (ret), "+m" (*v)
	: "r" (i), "r" (v)
	: "cc");

	return ret;
}

void ticket_lock(tickets_t *lock)
{
	tickets_t t;

	/*
	 * Because @head is MSB, the direct increment wrap doesn't disturb
	 * @tail.
	 */
	t.pair = xadd(&lock->pair, 1<<16);

	if (likely(t.head == t.tail)) {
		__lwsync(); /* acquire */
		return;
	}

	while (smp_load_acquire(&lock->tail) != t.tail)
		cpu_relax();
}

void ticket_unlock(tickets_t *lock)
{
	ticket_t tail = lock->tail + 1;

	/*
	 * The store is save against the xadd for it will make the ll/sc fail
	 * and try again. Aside from that PowerISA guarantees single-copy
	 * atomicy for half-word writes.
	 *
	 * And since only the lock owner will ever write the tail, we're good.
	 */
	smp_store_release(&lock->tail, tail);
}

^ permalink raw reply

* [PATCH V2] powerpc: thp: Fix crash on mremap
From: Aneesh Kumar K.V @ 2014-02-07 13:51 UTC (permalink / raw)
  To: benh, paulus, stable; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This patch fix the below crash

NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
LR [c0000000000439ac] .hash_page+0x18c/0x5e0
...
Call Trace:
[c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
[437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
[437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58

On ppc64 we use the pgtable for storing the hpte slot information and
store address to the pgtable at a constant offset (PTRS_PER_PMD) from
pmd. On mremap, when we switch the pmd, we need to withdraw and deposit
the pgtable again, so that we find the pgtable at PTRS_PER_PMD offset
from new pmd.

We also want to move the withdraw and deposit before the set_pmd so
that, when page fault find the pmd as trans huge we can be sure that
pgtable can be located at the offset.

variant of upstream SHA1: b3084f4db3aeb991c507ca774337c7e7893ed04f
for 3.12 stable series

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/Kconfig                           |  3 +++
 arch/powerpc/platforms/Kconfig.cputype |  1 +
 mm/huge_memory.c                       | 12 ++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index af2cc6eabcc7..bca9e7a18bd2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 config HAVE_ARCH_SOFT_DIRTY
 	bool
 
+config ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
+	bool
+
 config HAVE_MOD_ARCH_SPECIFIC
 	bool
 	help
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 6704e2e20e6b..0225011231ea 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -71,6 +71,7 @@ config PPC_BOOK3S_64
 	select PPC_FPU
 	select PPC_HAVE_PMU_SUPPORT
 	select SYS_SUPPORTS_HUGETLBFS
+	select ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
 
 config PPC_BOOK3E_64
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292a266e0d42..89b7a647f1cb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1474,8 +1474,20 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 
 	ret = __pmd_trans_huge_lock(old_pmd, vma);
 	if (ret == 1) {
+#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
+		pgtable_t pgtable;
+#endif
 		pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
+#ifdef CONFIG_ARCH_THP_MOVE_PMD_ALWAYS_WITHDRAW
+		/*
+		 * Archs like ppc64 use pgtable to store per pmd
+		 * specific information. So when we switch the pmd,
+		 * we should also withdraw and deposit the pgtable
+		 */
+		pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
+		pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
+#endif
 		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
 		spin_unlock(&mm->page_table_lock);
 	}
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Peter Zijlstra @ 2014-02-07 12:41 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev
In-Reply-To: <52F4C666.4050308@linux.vnet.ibm.com>

On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote:
> But observe the idle state "snooze" on powerpc. The power that this idle
> state saves is through the lowering of the thread priority of the CPU.
> After it lowers the thread priority, it is done. It cannot
> "wait_for_interrupts". It will exit my_idle(). It is now upto the
> generic idle loop to increase the thread priority if the need_resched
> flag is set. Only an interrupt routine can increase the thread priority.
> Else we will need to do it explicitly. And in such states which have a
> polling nature, the cpu will not receive a reschedule IPI.
> 
> That is why in the snooze_loop() we poll on need_resched. If it is set
> we up the priority of the thread using HMT_MEDIUM() and then exit the
> my_idle() loop. In case of interrupts, the priority gets automatically
> increased.

You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine
and get the IPI if that is what you want.

Depending on how horribly unprovisioned the thread gets at the lowest
priority, that might actually be faster than polling and raising the
prio whenever it does get ran.

^ permalink raw reply

* Re: [PATCH v2 6/6] cpu/idle.c: move to sched/idle.c
From: Peter Zijlstra @ 2014-02-07 12:32 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linaro-kernel, Russell King, linux-sh, linux-pm, Daniel Lezcano,
	Rafael J. Wysocki, linux-kernel, Paul Mundt, Preeti U Murthy,
	Thomas Gleixner, linuxppc-dev, Ingo Molnar, linux-arm-kernel
In-Reply-To: <alpine.LFD.2.11.1402071051140.1906@knanqh.ubzr>

On Fri, Feb 07, 2014 at 11:09:23AM +0000, Nicolas Pitre wrote:
> On Thu, 6 Feb 2014, Peter Zijlstra wrote:
> > tree, tree, what's in a word.
> 
> Something you may plant on a patch of grass?  "Merging" becomes a 
> strange concept in that context though.  :-)

I do know some farmers who splice trees thought :-)

> > Its in my patch stack yes. 
> 
> Quilt I suppose?? (yet another word.)

Yes, I'm one of the refugee Quilt users. Comes in handy when cold too
:-)

> > I should get some of that into tip I suppose, been side-tracked a bit 
> > this week. Sorry for the delay.
> 
> If you prefer we pile those patches (and future ones after revew) 
> ourselves just let me know.  Future patches are likely to be more 
> intimate with the scheduler so I just need to know who to upstream them 
> through afterwards.

Normally I get Ingo to pick up the queue 1-2 times a week so latency
shouldn't be too bad. But we just had the merge window and then I got
side-tracked rewriting all atomic implementations.

So usually submit patches against tip/master unless you know there's
other pending bits that conflict, in which case you can grab my queue on
top of tip/master -- but I try to make sure that's usually not needed.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 12:28 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207114949.GA2107@lst.de>

On Fri, Feb 07, 2014 at 12:49:49PM +0100, Torsten Duwe wrote:
> On Fri, Feb 07, 2014 at 11:45:30AM +0100, Peter Zijlstra wrote:
> > 
> > That might need to be lhz too, I'm confused on all the load variants.
> 
> ;-)
> 
> > > unlock:
> > > 	lhz	%0, 0, &tail
> > > 	addic	%0, %0, 1
> 
> No carry with this one, I'd say.

Right you are, add immediate it is.

> Besides, unlock increments the head.

No, unlock increments the tail, lock increments the head and waits until
the tail matches the pre-inc value.

That said, why do the atomic_inc() primitives do an carry add? (that's
where I borrowed it from).

> > > 	lwsync
> > > 	sth	%0, 0, &tail
> > > 
> 
> Given the beauty and simplicity of this, may I ask Ingo:
> you signed off 314cdbefd1fd0a7acf3780e9628465b77ea6a836;
> can you explain why head and tail must live on the same cache
> line? Or is it just a space saver? I just ported it to ppc,
> I didn't think about alternatives.

spinlock_t should, ideally, be 32bits.

> What about
> 
> atomic_t tail;
> volatile int head; ?
> 
> Admittedly, that's usually 8 bytes instead of 4...

That still won't straddle a cacheline unless you do weird alignement
things which will bloat all the various data structures more still.

Anyway, you can do a version with lwarx/stwcx if you're looking get rid
of lharx.

^ permalink raw reply

* Re: [PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Rafael J. Wysocki @ 2014-02-07 12:07 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: deepthi, linux-pm, peterz, rafael.j.wysocki, linux-kernel, paulus,
	srivatsa.bhat, fweisbec, tglx, paulmck, linuxppc-dev, mingo
In-Reply-To: <20140207080652.17187.66344.stgit@preeti.in.ibm.com>

On Friday, February 07, 2014 01:36:52 PM Preeti U Murthy wrote:
> Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
> local timers stop. The cpuidle_idle_call() currently handles such idle states
> by calling into the broadcast framework so as to wakeup CPUs at their next
> wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
> into the broadcast frameowork can fail for archs that do not have an external
> clock device to handle wakeups and the CPU in question has to thus be made
> the stand by CPU. This patch handles such cases by failing the call into
> cpuidle so that the arch can take some default action. The arch will certainly
> not enter a similar idle state because a failed cpuidle call will also implicitly
> indicate that the broadcast framework has not registered this CPU to be woken up.
> Hence we are safe if we fail the cpuidle call.
> 
> In the process move the functions that trace idle statistics just before and
> after the entry and exit into idle states respectively. In other
> scenarios where the call to cpuidle fails, we end up not tracing idle
> entry and exit since a decision on an idle state could not be taken. Similarly
> when the call to broadcast framework fails, we skip tracing idle statistics
> because we are in no further position to take a decision on an alternative
> idle state to enter into.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> 
>  drivers/cpuidle/cpuidle.c |   14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a55e68f..8beb0f02 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
>  		return 0;
>  	}
>  
> -	trace_cpu_idle_rcuidle(next_state, dev->cpu);
> -
>  	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
>  
> -	if (broadcast)
> -		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
> +	if (broadcast &&
> +		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
> +		return -EBUSY;
> +
> +
> +	trace_cpu_idle_rcuidle(next_state, dev->cpu);
>  
>  	if (cpuidle_state_is_coupled(dev, drv, next_state))
>  		entered_state = cpuidle_enter_state_coupled(dev, drv,
> @@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
>  	else
>  		entered_state = cpuidle_enter_state(dev, drv, next_state);
>  
> +	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> +
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
>  
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> -
>  	/* give the governor an opportunity to reflect on the outcome */
>  	if (cpuidle_curr_governor->reflect)
>  		cpuidle_curr_governor->reflect(dev, entered_state);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207104530.GG5126@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 11:45:30AM +0100, Peter Zijlstra wrote:
> 
> That might need to be lhz too, I'm confused on all the load variants.

;-)

> > unlock:
> > 	lhz	%0, 0, &tail
> > 	addic	%0, %0, 1

No carry with this one, I'd say.
Besides, unlock increments the head.

> > 	lwsync
> > 	sth	%0, 0, &tail
> > 

Given the beauty and simplicity of this, may I ask Ingo:
you signed off 314cdbefd1fd0a7acf3780e9628465b77ea6a836;
can you explain why head and tail must live on the same cache
line? Or is it just a space saver? I just ported it to ppc,
I didn't think about alternatives.

What about

atomic_t tail;
volatile int head; ?

Admittedly, that's usually 8 bytes instead of 4...

	Torsten

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07 11:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev
In-Reply-To: <alpine.LFD.2.11.1402071022490.1906@knanqh.ubzr>

Hi Nicolas,

On 02/07/2014 04:18 PM, Nicolas Pitre wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
> 
>> Hi Nicolas,
>>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>>
>>> What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in 
>>> arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and 
>>> ppc64_runlatch_on() respectively from there instead?  Would that work?  
>>> That would make the idle consolidation much easier afterwards.
>>
>> I would not suggest doing this. The ppc64_runlatch_*() routines need to
>> be called when we are sure that the cpu is about to enter or has exit an
>> idle state. Moving the ppc64_runlatch_on() routine to
>> arch_cpu_idle_enter() for instance is not a good idea because there are
>> places where the cpu can decide not to enter any idle state before the
>> call to cpuidle_idle_call() itself. In that case communicating
>> prematurely that we are in an idle state would not be a good idea.
>>
>> So its best to add the ppc64_runlatch_* calls in the powernv cpuidle
>> driver IMO. We could however create idle_loop_prologue/epilogue()
>> variants inside it so that in addition to the runlatch routines we could
>> potentially add more such similar routines that are powernv specific.
>>   If there are cases where there is work to be done prior to and post an
>> entry into an idle state common to both pseries and powernv, we will
>> probably put them in arch_cpu_idle_enter/exit(). But the runlatch
>> routines are not suitable to be moved there as far as I can see.
> 
> OK.
> 
> However, one thing we need to do as much as possible is to remove those 
> loops based on need_resched() from idle backend drivers.  A somewhat 
> common pattern is:
> 
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	while (!need_resched()) {
> 		lowpower_wait_for_interrupts();
> 		local_irq_enable();
> 		/* IRQ serviced from here */
> 		local_irq_disable();
> 	}
> 	local_irq_enable();
> 	/* interrupts enabled on exit */
> }
> 
> To be able to keep statistics on the actual idleness of the CPU we'd 
> need for all idle backends to always return to generic code on every 
> interrupt similar to this:
> 
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	lowpower_wait_for_interrupts();

You can do this for the idle states which do not have the polling
nature. IOW, these idle states are capable of doing what you describe as
"wait_for_interrupts". They do some kind of spinning at the hardware
level with interrupts enabled. A reschedule IPI or any other interrupt
will wake them up to enter the generic idle loop where they check for
the cause of the interrupt.

But observe the idle state "snooze" on powerpc. The power that this idle
state saves is through the lowering of the thread priority of the CPU.
After it lowers the thread priority, it is done. It cannot
"wait_for_interrupts". It will exit my_idle(). It is now upto the
generic idle loop to increase the thread priority if the need_resched
flag is set. Only an interrupt routine can increase the thread priority.
Else we will need to do it explicitly. And in such states which have a
polling nature, the cpu will not receive a reschedule IPI.

That is why in the snooze_loop() we poll on need_resched. If it is set
we up the priority of the thread using HMT_MEDIUM() and then exit the
my_idle() loop. In case of interrupts, the priority gets automatically
increased.

This might not be required to be done for similar idle routines on other
archs but this is the consequence of applying this idea of simplified
cpuidle backend driver on powerpc.

I would say you could let the backend cpuidle drivers be in this regard,
it could complicate the generic idle loop IMO depending on how the
polling states are implemented in each architecture.

> The generic code would be responsible for dealing with need_resched() 
> and call back into the backend right away if necessary after updating 
> some stats.
> 
> Do you see a problem with the runlatch calls happening around each 
> interrrupt from such a simplified idle backend?

The runlatch calls could be moved outside the loop.They do not need to
be called each time.

Thanks

Regards
Preeti U Murthy
> 
> 
> Nicolas
> 

^ permalink raw reply

* Re: ppc9a board support
From: Martyn Welch @ 2014-02-07 11:23 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <CAOB9jCJmXTo7NVR4jdV+3AkRsZviR447PKi2_TFyDdjoO0to8g@mail.gmail.com>

Hi Chris,

Support for the PPC9A is already in the Linux kernel, along with VME 
drivers.

Martyn

On 04/02/14 14:47, Chris Enrique wrote:
> hello,
>
> sorry if this message doesn't clearly hit the topic of this list but i
> would like to ask if anybody has information about bringing a linux-3.10
> or later kernel to ge fanuc ppc9a powerpc board (existing bsp? vme
> drivers? etc.?) to prevent starting from scratch for this task.
>
> any ideas would be much appreciated.
>
>
> kind regards,
> chris
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

-- 
--
Martyn Welch (Lead Software Engineer)  | Registered in England and Wales
GE Intelligent Platforms               | (3828642) at 100 Barbirolli Square
T +44(0)1327322748                     | Manchester, M2 3AB
E martyn.welch@ge.com                  | VAT:GB 927559189

^ permalink raw reply

* Re: [PATCH v2 6/6] cpu/idle.c: move to sched/idle.c
From: Nicolas Pitre @ 2014-02-07 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linaro-kernel, Russell King, linux-sh, linux-pm, Daniel Lezcano,
	Rafael J. Wysocki, linux-kernel, Paul Mundt, Preeti U Murthy,
	Thomas Gleixner, linuxppc-dev, Ingo Molnar, linux-arm-kernel
In-Reply-To: <20140206164336.GU2936@laptop.programming.kicks-ass.net>

On Thu, 6 Feb 2014, Peter Zijlstra wrote:

> On Thu, Feb 06, 2014 at 02:09:59PM +0000, Nicolas Pitre wrote:
> > Hi Peter,
> > 
> > Did you merge those patches in your tree? 
> 
> tree, tree, what's in a word.

Something you may plant on a patch of grass?  "Merging" becomes a 
strange concept in that context though.  :-)

> Its in my patch stack yes. 

Quilt I suppose?? (yet another word.)

> I should get some of that into tip I suppose, been side-tracked a bit 
> this week. Sorry for the delay.

If you prefer we pile those patches (and future ones after revew) 
ourselves just let me know.  Future patches are likely to be more 
intimate with the scheduler so I just need to know who to upstream them 
through afterwards.

Nicolas

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07 11:00 UTC (permalink / raw)
  To: Deepthi Dharwar
  Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
	Peter Zijlstra, Daniel Lezcano, Rafael J. Wysocki, LKML,
	Ingo Molnar, Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <52F4AB54.801@linux.vnet.ibm.com>

Hi Deepthi,

On 02/07/2014 03:15 PM, Deepthi Dharwar wrote:
> Hi Preeti,
> 
> Thanks for the patch.
> 
> On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
>> Hi Nicolas,
>>
>> Find below the patch that will need to be squashed with this one.
>> This patch is based on the mainline.Adding Deepthi, the author of
>> the patch which introduced the powernv cpuidle driver. Deepthi,
>> do you think the below patch looks right? We do not need to do an
>> explicit local_irq_enable() since we are in the call path of
>> cpuidle driver and that explicitly enables irqs on exit from
>> idle states.
> 
> Yes, We enable irqs explicitly while entering snooze loop and we always
> have interrupts enabled in the snooze state.
> For NAP state, we exit out of this state with interrupts enabled so we
> do not need an explicit enable of irqs.
> 
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>>> Hi Nico,
>>>>>
>>>>>
>>>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>>>
>>>>>> The core idle loop now takes care of it.
>>>>>>
>>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>>> ---
>>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>>> index 21166f65c9..a932feb290 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>>> @@ -26,7 +26,6 @@
>>>>>>  #include <linux/of_fdt.h>
>>>>>>  #include <linux/interrupt.h>
>>>>>>  #include <linux/bug.h>
>>>>>> -#include <linux/cpuidle.h>
>>>>>>
>>>>>>  #include <asm/machdep.h>
>>>>>>  #include <asm/firmware.h>
>>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>>         return 1;
>>>>>>  }
>>>>>>
>>>>>> -void powernv_idle(void)
>>>>>> -{
>>>>>> -       /* Hook to cpuidle framework if available, else
>>>>>> -        * call on default platform idle code
>>>>>> -        */
>>>>>> -       if (cpuidle_idle_call()) {
>>>>>> -               power7_idle();
>>>>>> -       }
>>>>>>
>>
>>  drivers/cpuidle/cpuidle-powernv.c |    4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 78fd174..130f081 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>>  	set_thread_flag(TIF_POLLING_NRFLAG);
>>
>>  	while (!need_resched()) {
>> +		ppc64_runlatch_off();
>                 ^^^^^^^^^^^^^^^
> We could move this before the while() loop.
> It would ideal to turn off latch when we enter snooze and
> turn it on when we are about to exit, rather than doing
> it over and over in the while loop.

You are right, this can be moved out of the loop.

Thanks

Regards
Preeti U Murthy

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Nicolas Pitre @ 2014-02-07 10:48 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev
In-Reply-To: <52F46EB3.5080403@linux.vnet.ibm.com>

On Fri, 7 Feb 2014, Preeti U Murthy wrote:

> Hi Nicolas,
> 
> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> > 
> > What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in 
> > arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and 
> > ppc64_runlatch_on() respectively from there instead?  Would that work?  
> > That would make the idle consolidation much easier afterwards.
> 
> I would not suggest doing this. The ppc64_runlatch_*() routines need to
> be called when we are sure that the cpu is about to enter or has exit an
> idle state. Moving the ppc64_runlatch_on() routine to
> arch_cpu_idle_enter() for instance is not a good idea because there are
> places where the cpu can decide not to enter any idle state before the
> call to cpuidle_idle_call() itself. In that case communicating
> prematurely that we are in an idle state would not be a good idea.
> 
> So its best to add the ppc64_runlatch_* calls in the powernv cpuidle
> driver IMO. We could however create idle_loop_prologue/epilogue()
> variants inside it so that in addition to the runlatch routines we could
> potentially add more such similar routines that are powernv specific.
>   If there are cases where there is work to be done prior to and post an
> entry into an idle state common to both pseries and powernv, we will
> probably put them in arch_cpu_idle_enter/exit(). But the runlatch
> routines are not suitable to be moved there as far as I can see.

OK.

However, one thing we need to do as much as possible is to remove those 
loops based on need_resched() from idle backend drivers.  A somewhat 
common pattern is:

my_idle()
{
	/* interrupts disabled on entry */
	while (!need_resched()) {
		lowpower_wait_for_interrupts();
		local_irq_enable();
		/* IRQ serviced from here */
		local_irq_disable();
	}
	local_irq_enable();
	/* interrupts enabled on exit */
}

To be able to keep statistics on the actual idleness of the CPU we'd 
need for all idle backends to always return to generic code on every 
interrupt similar to this:

my_idle()
{
	/* interrupts disabled on entry */
	lowpower_wait_for_interrupts();
	local_irq_enable();
	/* interrupts enabled on exit */
}

The generic code would be responsible for dealing with need_resched() 
and call back into the backend right away if necessary after updating 
some stats.

Do you see a problem with the runlatch calls happening around each 
interrrupt from such a simplified idle backend?


Nicolas

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:45 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207103139.GP5002@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 11:31:39AM +0100, Peter Zijlstra wrote:
> Anyway, what might work is something like (please forgive my ppc asm, I
> can barely read the thing, I've never before attempted writing it):
> 
> lock:
> 1:	lharx	%0, 0, &head
> 	mov	%1, %0
> 	addic	%0, %0, 1
> 	stwcd   %0, 0, &head
> 	bne-	1b
> 2:	lhax	%0, 0, &tail

That might need to be lhz too, I'm confused on all the load variants.

> 	lwsync
> 	cmp	0, %0, %0

	cmp	0, %0, %1

So we compare the &tail load to the xadd return %1 above.

> 	bne-	2b
> 
> 
> unlock:
> 	lhz	%0, 0, &tail
> 	addic	%0, %0, 1
> 	lwsync
> 	sth	%0, 0, &tail
> 

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:36 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207103139.GP5002@laptop.programming.kicks-ass.net>

> So if you have ll/sc on the whole word concurrent with the half-word
> store, you can loose the half-word store like:
> 
>   lwarx &tickets
>   ...			sth &tail
>   stwcd &tickets
> 
> 
> The stwcd will over-write the tail store.

Oh wait, that's stupid, it will invalidate the lock and fail the store
and make it try again, so you could try and combine the load, but you'd
need an extra shift instruction instead of an extra load.

Not sure that's a valid trade-off..

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:31 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>

On Fri, Feb 07, 2014 at 10:02:48AM +0100, Torsten Duwe wrote:
> On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> > On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> > > On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> > 
> > > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > > in the PowerISA doc. If so I think you can do better by being able to
> > > > atomically load both tickets but only storing the head without affecting
> > > > the tail.
> 
> Can I simply write the half word, without a reservation, or will the HW caches
> mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

So if you have ll/sc on the whole word concurrent with the half-word
store, you can loose the half-word store like:

  lwarx &tickets
  ...			sth &tail
  stwcd &tickets

The stwcd will over-write the tail store.

Anyway, what might work is something like (please forgive my ppc asm, I
can barely read the thing, I've never before attempted writing it):

lock:
1:	lharx	%0, 0, &head
	mov	%1, %0
	addic	%0, %0, 1
	stwcd   %0, 0, &head
	bne-	1b

2:	lhax	%0, 0, &tail
	lwsync
	cmp	0, %0, %0
	bne-	2b

unlock:
	lhz	%0, 0, &tail
	addic	%0, %0, 1
	lwsync
	sth	%0, 0, &tail

Which would somewhat translate into C as:

static inline void ticket_spin_lock(tickets_t *lock)
{
	ticket_t mine = xadd(&lock->head);

	while (smp_load_acquire(&lock->tail) != mine)
		cpu_relax();
}

static inline void ticket_spin_unlock(tickets_t *lock)
{
	ticket_t tail = lock->tail + 1;

	smp_store_release(&lock->tail, tail);
}

Where xadd() returns the value before addition and we assume half word
single-copy atomicy, such that the head and tail updates will not
interfere.

The x86 implementation uses the 32bit xadd and places the head at the
MSB end to get the atomic add + tail load in a single instruction, but
for PPC its much better to have an extra load (to an already hot
cacheline) and avoid a second ll/sc pair, as the ll/sc things are stupid
slow for your arch afaik.

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-07 10:10 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: Chris Proctor, linuxppc-dev
In-Reply-To: <OFA4C1AF9F.B23E83EC-ONCA257C78.00044235-CA257C78.00080D50@csc.com>

	Hi Stephen,

On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> 
> > From: Gabriel Paubert <paubert@iram.es>
> > To: Stephen N Chivers <schivers@csc.com.au>
> > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > Date: 02/06/2014 07:26 PM
> > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > 
> > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > > I have a MPC8548e based board and an application that makes
> > > extensive use of floating point including numerous calls to cos.
> > > In the same program there is the use of an sqlite database.
> > > 
> > > The kernel is derived from 2.6.31 and is compiled with math emulation.
> > > 
> > > At some point after the reading of the SQLITE database, the
> > > return result from cos goes from an in range value to an out
> > > of range value.
> > > 
> > > This is as a result of the FP rounding mode mutating from "round to 
> > > nearest"
> > > to "round toward zero".
> > > 
> > > The cos in the glibc version being used is known to be sensitive to 
> > > rounding
> > > direction and Joseph Myers has previously fixed glibc.
> > > 
> > > The failure does not occur on a machine that has a hardware floating
> > > point unit (a MPC7410 processor).
> > > 
> > > I have traced the mutation to the following series of instructions:
> > > 
> > >         mffs            f0
> > >         mtfsb1          4*cr7+so
> > >         mtfsb0          4*cr7+eq
> > >         fadd            f13,f1,f2
> > >         mtfsf           1, f0
> > > 
> > > The instructions are part of the stream emitted by gcc for the 
> conversion
> > > of a 128 bit floating point value into an integer in the sqlite 
> database 
> > > read.
> > > 
> > > Immediately before the execution of the mffs instruction the "rounding
> > > mode" is "round to nearest".
> > > 
> > > On the MPC8548 board, the execution of the mtfsf instruction does not
> > > restore the rounding mode to "round to nearest".
> > > 
> > > I believe that the mask computation in mtfsf.c is incorrect and is 
> > > reversed.
> > > 
> > > In the latest version of the file (linux-3.14-rc1), the mask is 
> computed 
> > > by:
> > > 
> > >                  mask = 0;
> > >                  if (FM & (1 << 0))
> > >                         mask |= 0x90000000;
> > >                  if (FM & (1 << 1))
> > >                         mask |= 0x0f000000;
> > >                  if (FM & (1 << 2))
> > >                         mask |= 0x00f00000;
> > >                  if (FM & (1 << 3))
> > >                         mask |= 0x000f0000;
> > >                  if (FM & (1 << 4))
> > >                         mask |= 0x0000f000;
> > >                  if (FM & (1 << 5))
> > >                         mask |= 0x00000f00;
> > >                  if (FM & (1 << 6))
> > >                         mask |= 0x000000f0;
> > >                  if (FM & (1 << 7))
> > >                         mask |= 0x0000000f;
> > > 
> > > I think it should be:
> > > 
> > >                 mask = 0;
> > >                 if (FM & (1 << 0))
> > >                         mask |= 0x0000000f;
> > >                 if (FM & (1 << 1))
> > >                         mask |= 0x000000f0;
> > >                 if (FM & (1 << 2))
> > >                         mask |= 0x00000f00;
> > >                 if (FM & (1 << 3))
> > >                         mask |= 0x0000f000;
> > >                 if (FM & (1 << 4))
> > >                         mask |= 0x000f0000;
> > >                 if (FM & (1 << 5))
> > >                         mask |= 0x00f00000;
> > >                 if (FM & (1 << 6))
> > >                         mask |= 0x0f000000;
> > >                 if (FM & (1 << 7))
> > >                         mask |= 0x90000000;
> > > 
> > > With the above mask computation I get consistent results for both the 
> > > MPC8548
> > > and MPC7410 boards.
> > > 
> > > Am I missing something subtle?
> > 
> > No I think you are correct. This said, this code may probably be 
> optimized 
> > to eliminate a lot of the conditional branches. I think that:
> > 
> > mask = (FM & 1);
> > mask |= (FM << 3) & 0x10;
> > mask |= (FM << 6) & 0x100;
> > mask |= (FM << 9) & 0x1000;
> > mask |= (FM << 12) & 0x10000;
> > mask |= (FM << 15) & 0x100000;
> > mask |= (FM << 18) & 0x1000000;
> > mask |= (FM << 21) & 0x10000000;
> > mask *= 15;
> > 
> > should do the job, in less code space and without a single branch.
> > 
> > Each one of the "mask |=" lines should be translated into an
> > rlwinm instruction followed by an "or". Actually it should be possible
> > to transform each of these lines into a single "rlwimi" instruction
> > but I don't know how to coerce gcc to reach this level of optimization.
> > 
> > Another way of optomizing this could be:
> > 
> > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > mask *= 15;
> > 
> > It's not easy to see which of the solutions is faster, the second one
> > needs to create quite a few constants, but its dependency length is 
> > lower. It is very likely that the first solution is faster in cache-cold
> > case and the second in cache-hot. 
> > 
> > Regardless, the original code is rather naïve, larger and slower in all 
> cases,
> > with timing variation depending on branch mispredictions.
> 
> Thanks for the response, it is appreciated.
> 
> I have tried simple test versions of the two suggestions above, both
> produce the same results as the original formulation.
> 
> My toolchain is gcc-4.1.2 with binutils 2.17.
> 
> When compiled without optimization I get:
> 
>         original        58 instructions 20 memory accesses 9 branches
>         method1 53 instructions 27 memory accesses 0 branches
>         method2 37 instructions 13 memory accesses 0 branches
> 
> with optimization:
> 
>         original        25 instructions 0 memory accesses 8 branches
>         method1 18 instructions 0 memory accesses 0 branches
>         method2 21 instructions 0 memory accesses 0 branches
> 
> The memory accesses do not include "setup" such as moving FM to
> a register.
> 
> The instruction counts for method1 and method2 include an extra
> and operation to preserve the original behaviour wrt the sticky
> FX bit (I think) although maybe that is also something else
> that it is wrong with the original implementation.

These bits are not sticky, they are actually a logical combination 
of other bits in the register. The code in mtfsf.c recomputes 
the FPSCR_VX and FPSCR_VEX bits from the contents of the new FPSCR, 
so, unless I miss something, the and operation is unnecessary.

> In my naivety I would go with method2 as it generates fewer
> instructions when  not optimized and isn't far from method1 when
> optimized.

I would never count the size of non-optimized gcc code as a valid 
benchmark. At least -O1 is necessary to implement basic, trivial
optimizations, as you see in the number of memory accesses.

Ok, I finally edited my sources and test compiled the suggestions
I gave. I'd say that method2 is the best overall indeed. You can
actually save one more instruction by setting mask to all ones in 
the case FM=0xff, but that's about all in this area.

This said, there are better ways to write the end of the routine
to be more compact. The __FPU_FPSCR is a global variable which is
accessed very often, and it would be better to use more the local
fpscr variable. I see really stupid things in the code, for example
the following:

        and. 8,10,0      #, tmp184, D.11439
==>     stw 0,740(2)     # current.0_21->thread.fp_state.fpscr, D.11439
        beq- 0,.L4       #
        oris 0,0,0x2000  #, tmp188, D.11439,
==>     stw 0,740(2)     # current.0_21->thread.fp_state.fpscr, tmp188
.L4:
==>     lwz 0,740(9)     # current.0_21->thread.fp_state.fpscr, fpscr
        lis 11,0x2000    # tmp227,

(and yes gcc also copies around r2, the current thread pointer, to
other registers, for no good reason).

Anyway, the first step would be to get the current patch in, then
work on more optimizations.

	Regards,
	Gabriel

^ permalink raw reply

* Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()
From: Catalin Marinas @ 2014-02-07  9:50 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, Linux Kernel Mailing List,
	Ingo Molnar, Preeti U Murthy, Thomas Gleixner, linuxppc-dev,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <1391696188-14540-2-git-send-email-nicolas.pitre@linaro.org>

On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> The core idle loop now takes care of it.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Deepthi Dharwar @ 2014-02-07  9:45 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
	Peter Zijlstra, Daniel Lezcano, Rafael J. Wysocki, LKML,
	Ingo Molnar, Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <52F484D9.6020604@linux.vnet.ibm.com>

Hi Preeti,

Thanks for the patch.

On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
> Hi Nicolas,
> 
> Find below the patch that will need to be squashed with this one.
> This patch is based on the mainline.Adding Deepthi, the author of
> the patch which introduced the powernv cpuidle driver. Deepthi,
> do you think the below patch looks right? We do not need to do an
> explicit local_irq_enable() since we are in the call path of
> cpuidle driver and that explicitly enables irqs on exit from
> idle states.

Yes, We enable irqs explicitly while entering snooze loop and we always
have interrupts enabled in the snooze state.
For NAP state, we exit out of this state with interrupts enabled so we
do not need an explicit enable of irqs.

> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>
>>> Hi Daniel,
>>>
>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>> Hi Nico,
>>>>
>>>>
>>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>>
>>>>> The core idle loop now takes care of it.
>>>>>
>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>> ---
>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>> index 21166f65c9..a932feb290 100644
>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>> @@ -26,7 +26,6 @@
>>>>>  #include <linux/of_fdt.h>
>>>>>  #include <linux/interrupt.h>
>>>>>  #include <linux/bug.h>
>>>>> -#include <linux/cpuidle.h>
>>>>>
>>>>>  #include <asm/machdep.h>
>>>>>  #include <asm/firmware.h>
>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>         return 1;
>>>>>  }
>>>>>
>>>>> -void powernv_idle(void)
>>>>> -{
>>>>> -       /* Hook to cpuidle framework if available, else
>>>>> -        * call on default platform idle code
>>>>> -        */
>>>>> -       if (cpuidle_idle_call()) {
>>>>> -               power7_idle();
>>>>> -       }
>>>>>
> 
>  drivers/cpuidle/cpuidle-powernv.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 78fd174..130f081 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>  	set_thread_flag(TIF_POLLING_NRFLAG);
> 
>  	while (!need_resched()) {
> +		ppc64_runlatch_off();
                ^^^^^^^^^^^^^^^
We could move this before the while() loop.
It would ideal to turn off latch when we enter snooze and
turn it on when we are about to exit, rather than doing
it over and over in the while loop.

>  		HMT_low();
>  		HMT_very_low();
>  	}
> 
>  	HMT_medium();
> +	ppc64_runlatch_on();
>  	clear_thread_flag(TIF_POLLING_NRFLAG);
>  	smp_mb();
>  	return index;
> @@ -45,7 +47,9 @@ static int nap_loop(struct cpuidle_device *dev,
>  			struct cpuidle_driver *drv,
>  			int index)
>  {
> +	ppc64_runlatch_off();
>  	power7_idle();
> +	ppc64_runlatch_on();
>  	return index;
>  }
> 
> Thanks
> 
> Regards
> Preeti U Murthy
> 

Regards,
Deepthi

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07  9:02 UTC (permalink / raw)
  To: Scott Wood
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1391717992.6733.232.camel@snotra.buserror.net>

On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> > On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> 
> > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > in the PowerISA doc. If so I think you can do better by being able to
> > > atomically load both tickets but only storing the head without affecting
> > > the tail.

Can I simply write the half word, without a reservation, or will the HW caches
mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

> Plus, sthcx doesn't exist on all PPC chips.

Which ones are lacking it? Do all have at least a simple 16-bit store?

	Torsten

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206180826.GI5002@laptop.programming.kicks-ass.net>

On Thu, Feb 06, 2014 at 07:08:26PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2014 at 06:37:27PM +0100, Torsten Duwe wrote:
> > I must admit that I haven't tested the patch on non-pseries ppc64 nor on
> > ppc32. Only ppc64 has the ldarx and I tried to atomically replace the 
> > holder along with the locks. That might prove unneccessary.
> 
> But what is the holder for? Can't we do away with that field?

Scott, Peter: good questions.
The conditional is wrong because I confused pSeries with ppc64 CPUs with
64-bit kernels. I got deluded by the LOCK_TOKEN definition above. Is that
correctly ifdef'd, with PPC64? The holder field should be ifdef'd
CONFIG_PPC_SPLPAR, independent of ppc64.

It is an advisory performance hint, and doesn't need to be updated atomically
with the lock; this and the above are 2 reasons to drop the asm string
operand size voodoo as well.

Thanks,
	Torsten

^ permalink raw reply

* [PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Preeti U Murthy @ 2014-02-07  8:06 UTC (permalink / raw)
  To: linux-pm, peterz, benh, rafael.j.wysocki, linux-kernel, tglx,
	linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck
In-Reply-To: <20140207075618.17187.20149.stgit@preeti.in.ibm.com>

Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. The cpuidle_idle_call() currently handles such idle states
by calling into the broadcast framework so as to wakeup CPUs at their next
wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
into the broadcast frameowork can fail for archs that do not have an external
clock device to handle wakeups and the CPU in question has to thus be made
the stand by CPU. This patch handles such cases by failing the call into
cpuidle so that the arch can take some default action. The arch will certainly
not enter a similar idle state because a failed cpuidle call will also implicitly
indicate that the broadcast framework has not registered this CPU to be woken up.
Hence we are safe if we fail the cpuidle call.

In the process move the functions that trace idle statistics just before and
after the entry and exit into idle states respectively. In other
scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit since a decision on an idle state could not be taken. Similarly
when the call to broadcast framework fails, we skip tracing idle statistics
because we are in no further position to take a decision on an alternative
idle state to enter into.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8beb0f02 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
 		return 0;
 	}

-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);

-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast &&
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
+		return -EBUSY;
+
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);

 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);

+	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);

-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
-
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox