LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Rafael J. Wysocki @ 2014-02-07 12:07 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: deepthi, linux-pm, peterz, rafael.j.wysocki, linux-kernel, paulus,
	srivatsa.bhat, fweisbec, tglx, paulmck, linuxppc-dev, mingo
In-Reply-To: <20140207080652.17187.66344.stgit@preeti.in.ibm.com>

On Friday, February 07, 2014 01:36:52 PM Preeti U Murthy wrote:
> Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
> local timers stop. The cpuidle_idle_call() currently handles such idle states
> by calling into the broadcast framework so as to wakeup CPUs at their next
> wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
> into the broadcast frameowork can fail for archs that do not have an external
> clock device to handle wakeups and the CPU in question has to thus be made
> the stand by CPU. This patch handles such cases by failing the call into
> cpuidle so that the arch can take some default action. The arch will certainly
> not enter a similar idle state because a failed cpuidle call will also implicitly
> indicate that the broadcast framework has not registered this CPU to be woken up.
> Hence we are safe if we fail the cpuidle call.
> 
> In the process move the functions that trace idle statistics just before and
> after the entry and exit into idle states respectively. In other
> scenarios where the call to cpuidle fails, we end up not tracing idle
> entry and exit since a decision on an idle state could not be taken. Similarly
> when the call to broadcast framework fails, we skip tracing idle statistics
> because we are in no further position to take a decision on an alternative
> idle state to enter into.
> 
> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
> 
>  drivers/cpuidle/cpuidle.c |   14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a55e68f..8beb0f02 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
>  		return 0;
>  	}
>  
> -	trace_cpu_idle_rcuidle(next_state, dev->cpu);
> -
>  	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);
>  
> -	if (broadcast)
> -		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
> +	if (broadcast &&
> +		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
> +		return -EBUSY;
> +
> +
> +	trace_cpu_idle_rcuidle(next_state, dev->cpu);
>  
>  	if (cpuidle_state_is_coupled(dev, drv, next_state))
>  		entered_state = cpuidle_enter_state_coupled(dev, drv,
> @@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
>  	else
>  		entered_state = cpuidle_enter_state(dev, drv, next_state);
>  
> +	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> +
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
>  
> -	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
> -
>  	/* give the governor an opportunity to reflect on the outcome */
>  	if (cpuidle_curr_governor->reflect)
>  		cpuidle_curr_governor->reflect(dev, entered_state);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207104530.GG5126@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 11:45:30AM +0100, Peter Zijlstra wrote:
> 
> That might need to be lhz too, I'm confused on all the load variants.

;-)

> > unlock:
> > 	lhz	%0, 0, &tail
> > 	addic	%0, %0, 1

No carry with this one, I'd say.
Besides, unlock increments the head.

> > 	lwsync
> > 	sth	%0, 0, &tail
> > 

Given the beauty and simplicity of this, may I ask Ingo:
you signed off 314cdbefd1fd0a7acf3780e9628465b77ea6a836;
can you explain why head and tail must live on the same cache
line? Or is it just a space saver? I just ported it to ppc,
I didn't think about alternatives.

What about

atomic_t tail;
volatile int head; ?

Admittedly, that's usually 8 bytes instead of 4...

	Torsten

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07 11:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev
In-Reply-To: <alpine.LFD.2.11.1402071022490.1906@knanqh.ubzr>

Hi Nicolas,

On 02/07/2014 04:18 PM, Nicolas Pitre wrote:
> On Fri, 7 Feb 2014, Preeti U Murthy wrote:
> 
>> Hi Nicolas,
>>
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>>
>>> What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in 
>>> arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and 
>>> ppc64_runlatch_on() respectively from there instead?  Would that work?  
>>> That would make the idle consolidation much easier afterwards.
>>
>> I would not suggest doing this. The ppc64_runlatch_*() routines need to
>> be called when we are sure that the cpu is about to enter or has exit an
>> idle state. Moving the ppc64_runlatch_on() routine to
>> arch_cpu_idle_enter() for instance is not a good idea because there are
>> places where the cpu can decide not to enter any idle state before the
>> call to cpuidle_idle_call() itself. In that case communicating
>> prematurely that we are in an idle state would not be a good idea.
>>
>> So its best to add the ppc64_runlatch_* calls in the powernv cpuidle
>> driver IMO. We could however create idle_loop_prologue/epilogue()
>> variants inside it so that in addition to the runlatch routines we could
>> potentially add more such similar routines that are powernv specific.
>>   If there are cases where there is work to be done prior to and post an
>> entry into an idle state common to both pseries and powernv, we will
>> probably put them in arch_cpu_idle_enter/exit(). But the runlatch
>> routines are not suitable to be moved there as far as I can see.
> 
> OK.
> 
> However, one thing we need to do as much as possible is to remove those 
> loops based on need_resched() from idle backend drivers.  A somewhat 
> common pattern is:
> 
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	while (!need_resched()) {
> 		lowpower_wait_for_interrupts();
> 		local_irq_enable();
> 		/* IRQ serviced from here */
> 		local_irq_disable();
> 	}
> 	local_irq_enable();
> 	/* interrupts enabled on exit */
> }
> 
> To be able to keep statistics on the actual idleness of the CPU we'd 
> need for all idle backends to always return to generic code on every 
> interrupt similar to this:
> 
> my_idle()
> {
> 	/* interrupts disabled on entry */
> 	lowpower_wait_for_interrupts();

You can do this for the idle states which do not have the polling
nature. IOW, these idle states are capable of doing what you describe as
"wait_for_interrupts". They do some kind of spinning at the hardware
level with interrupts enabled. A reschedule IPI or any other interrupt
will wake them up to enter the generic idle loop where they check for
the cause of the interrupt.

But observe the idle state "snooze" on powerpc. The power that this idle
state saves is through the lowering of the thread priority of the CPU.
After it lowers the thread priority, it is done. It cannot
"wait_for_interrupts". It will exit my_idle(). It is now upto the
generic idle loop to increase the thread priority if the need_resched
flag is set. Only an interrupt routine can increase the thread priority.
Else we will need to do it explicitly. And in such states which have a
polling nature, the cpu will not receive a reschedule IPI.

That is why in the snooze_loop() we poll on need_resched. If it is set
we up the priority of the thread using HMT_MEDIUM() and then exit the
my_idle() loop. In case of interrupts, the priority gets automatically
increased.

This might not be required to be done for similar idle routines on other
archs but this is the consequence of applying this idea of simplified
cpuidle backend driver on powerpc.

I would say you could let the backend cpuidle drivers be in this regard,
it could complicate the generic idle loop IMO depending on how the
polling states are implemented in each architecture.

> The generic code would be responsible for dealing with need_resched() 
> and call back into the backend right away if necessary after updating 
> some stats.
> 
> Do you see a problem with the runlatch calls happening around each 
> interrrupt from such a simplified idle backend?

The runlatch calls could be moved outside the loop.They do not need to
be called each time.

Thanks

Regards
Preeti U Murthy
> 
> 
> Nicolas
> 

^ permalink raw reply

* Re: ppc9a board support
From: Martyn Welch @ 2014-02-07 11:23 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <CAOB9jCJmXTo7NVR4jdV+3AkRsZviR447PKi2_TFyDdjoO0to8g@mail.gmail.com>

Hi Chris,

Support for the PPC9A is already in the Linux kernel, along with VME 
drivers.

Martyn

On 04/02/14 14:47, Chris Enrique wrote:
> hello,
>
> sorry if this message doesn't clearly hit the topic of this list but i
> would like to ask if anybody has information about bringing a linux-3.10
> or later kernel to ge fanuc ppc9a powerpc board (existing bsp? vme
> drivers? etc.?) to prevent starting from scratch for this task.
>
> any ideas would be much appreciated.
>
>
> kind regards,
> chris
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

-- 
--
Martyn Welch (Lead Software Engineer)  | Registered in England and Wales
GE Intelligent Platforms               | (3828642) at 100 Barbirolli Square
T +44(0)1327322748                     | Manchester, M2 3AB
E martyn.welch@ge.com                  | VAT:GB 927559189

^ permalink raw reply

* Re: [PATCH v2 6/6] cpu/idle.c: move to sched/idle.c
From: Nicolas Pitre @ 2014-02-07 11:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linaro-kernel, Russell King, linux-sh, linux-pm, Daniel Lezcano,
	Rafael J. Wysocki, linux-kernel, Paul Mundt, Preeti U Murthy,
	Thomas Gleixner, linuxppc-dev, Ingo Molnar, linux-arm-kernel
In-Reply-To: <20140206164336.GU2936@laptop.programming.kicks-ass.net>

On Thu, 6 Feb 2014, Peter Zijlstra wrote:

> On Thu, Feb 06, 2014 at 02:09:59PM +0000, Nicolas Pitre wrote:
> > Hi Peter,
> > 
> > Did you merge those patches in your tree? 
> 
> tree, tree, what's in a word.

Something you may plant on a patch of grass?  "Merging" becomes a 
strange concept in that context though.  :-)

> Its in my patch stack yes. 

Quilt I suppose?? (yet another word.)

> I should get some of that into tip I suppose, been side-tracked a bit 
> this week. Sorry for the delay.

If you prefer we pile those patches (and future ones after revew) 
ourselves just let me know.  Future patches are likely to be more 
intimate with the scheduler so I just need to know who to upstream them 
through afterwards.

Nicolas

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07 11:00 UTC (permalink / raw)
  To: Deepthi Dharwar
  Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
	Peter Zijlstra, Daniel Lezcano, Rafael J. Wysocki, LKML,
	Ingo Molnar, Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <52F4AB54.801@linux.vnet.ibm.com>

Hi Deepthi,

On 02/07/2014 03:15 PM, Deepthi Dharwar wrote:
> Hi Preeti,
> 
> Thanks for the patch.
> 
> On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
>> Hi Nicolas,
>>
>> Find below the patch that will need to be squashed with this one.
>> This patch is based on the mainline.Adding Deepthi, the author of
>> the patch which introduced the powernv cpuidle driver. Deepthi,
>> do you think the below patch looks right? We do not need to do an
>> explicit local_irq_enable() since we are in the call path of
>> cpuidle driver and that explicitly enables irqs on exit from
>> idle states.
> 
> Yes, We enable irqs explicitly while entering snooze loop and we always
> have interrupts enabled in the snooze state.
> For NAP state, we exit out of this state with interrupts enabled so we
> do not need an explicit enable of irqs.
> 
>> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>>> Hi Nico,
>>>>>
>>>>>
>>>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>>>
>>>>>> The core idle loop now takes care of it.
>>>>>>
>>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>>> ---
>>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>>> index 21166f65c9..a932feb290 100644
>>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>>> @@ -26,7 +26,6 @@
>>>>>>  #include <linux/of_fdt.h>
>>>>>>  #include <linux/interrupt.h>
>>>>>>  #include <linux/bug.h>
>>>>>> -#include <linux/cpuidle.h>
>>>>>>
>>>>>>  #include <asm/machdep.h>
>>>>>>  #include <asm/firmware.h>
>>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>>         return 1;
>>>>>>  }
>>>>>>
>>>>>> -void powernv_idle(void)
>>>>>> -{
>>>>>> -       /* Hook to cpuidle framework if available, else
>>>>>> -        * call on default platform idle code
>>>>>> -        */
>>>>>> -       if (cpuidle_idle_call()) {
>>>>>> -               power7_idle();
>>>>>> -       }
>>>>>>
>>
>>  drivers/cpuidle/cpuidle-powernv.c |    4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index 78fd174..130f081 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>>  	set_thread_flag(TIF_POLLING_NRFLAG);
>>
>>  	while (!need_resched()) {
>> +		ppc64_runlatch_off();
>                 ^^^^^^^^^^^^^^^
> We could move this before the while() loop.
> It would ideal to turn off latch when we enter snooze and
> turn it on when we are about to exit, rather than doing
> it over and over in the while loop.

You are right, this can be moved out of the loop.

Thanks

Regards
Preeti U Murthy

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Nicolas Pitre @ 2014-02-07 10:48 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev
In-Reply-To: <52F46EB3.5080403@linux.vnet.ibm.com>

On Fri, 7 Feb 2014, Preeti U Murthy wrote:

> Hi Nicolas,
> 
> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> > 
> > What about creating arch_cpu_idle_enter() and arch_cpu_idle_exit() in 
> > arch/powerpc/kernel/idle.c and calling ppc64_runlatch_off() and 
> > ppc64_runlatch_on() respectively from there instead?  Would that work?  
> > That would make the idle consolidation much easier afterwards.
> 
> I would not suggest doing this. The ppc64_runlatch_*() routines need to
> be called when we are sure that the cpu is about to enter or has exit an
> idle state. Moving the ppc64_runlatch_on() routine to
> arch_cpu_idle_enter() for instance is not a good idea because there are
> places where the cpu can decide not to enter any idle state before the
> call to cpuidle_idle_call() itself. In that case communicating
> prematurely that we are in an idle state would not be a good idea.
> 
> So its best to add the ppc64_runlatch_* calls in the powernv cpuidle
> driver IMO. We could however create idle_loop_prologue/epilogue()
> variants inside it so that in addition to the runlatch routines we could
> potentially add more such similar routines that are powernv specific.
>   If there are cases where there is work to be done prior to and post an
> entry into an idle state common to both pseries and powernv, we will
> probably put them in arch_cpu_idle_enter/exit(). But the runlatch
> routines are not suitable to be moved there as far as I can see.

OK.

However, one thing we need to do as much as possible is to remove those 
loops based on need_resched() from idle backend drivers.  A somewhat 
common pattern is:

my_idle()
{
	/* interrupts disabled on entry */
	while (!need_resched()) {
		lowpower_wait_for_interrupts();
		local_irq_enable();
		/* IRQ serviced from here */
		local_irq_disable();
	}
	local_irq_enable();
	/* interrupts enabled on exit */
}

To be able to keep statistics on the actual idleness of the CPU we'd 
need for all idle backends to always return to generic code on every 
interrupt similar to this:

my_idle()
{
	/* interrupts disabled on entry */
	lowpower_wait_for_interrupts();
	local_irq_enable();
	/* interrupts enabled on exit */
}

The generic code would be responsible for dealing with need_resched() 
and call back into the backend right away if necessary after updating 
some stats.

Do you see a problem with the runlatch calls happening around each 
interrrupt from such a simplified idle backend?


Nicolas

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:45 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207103139.GP5002@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 11:31:39AM +0100, Peter Zijlstra wrote:
> Anyway, what might work is something like (please forgive my ppc asm, I
> can barely read the thing, I've never before attempted writing it):
> 
> lock:
> 1:	lharx	%0, 0, &head
> 	mov	%1, %0
> 	addic	%0, %0, 1
> 	stwcd   %0, 0, &head
> 	bne-	1b
> 2:	lhax	%0, 0, &tail

That might need to be lhz too, I'm confused on all the load variants.

> 	lwsync
> 	cmp	0, %0, %0

	cmp	0, %0, %1

So we compare the &tail load to the xadd return %1 above.

> 	bne-	2b
> 
> 
> unlock:
> 	lhz	%0, 0, &tail
> 	addic	%0, %0, 1
> 	lwsync
> 	sth	%0, 0, &tail
> 

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:36 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207103139.GP5002@laptop.programming.kicks-ass.net>

> So if you have ll/sc on the whole word concurrent with the half-word
> store, you can loose the half-word store like:
> 
>   lwarx &tickets
>   ...			sth &tail
>   stwcd &tickets
> 
> 
> The stwcd will over-write the tail store.

Oh wait, that's stupid, it will invalidate the lock and fail the store
and make it try again, so you could try and combine the load, but you'd
need an extra shift instruction instead of an extra load.

Not sure that's a valid trade-off..

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 10:31 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>

On Fri, Feb 07, 2014 at 10:02:48AM +0100, Torsten Duwe wrote:
> On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> > On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> > > On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> > 
> > > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > > in the PowerISA doc. If so I think you can do better by being able to
> > > > atomically load both tickets but only storing the head without affecting
> > > > the tail.
> 
> Can I simply write the half word, without a reservation, or will the HW caches
> mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

So if you have ll/sc on the whole word concurrent with the half-word
store, you can loose the half-word store like:

  lwarx &tickets
  ...			sth &tail
  stwcd &tickets

The stwcd will over-write the tail store.

Anyway, what might work is something like (please forgive my ppc asm, I
can barely read the thing, I've never before attempted writing it):

lock:
1:	lharx	%0, 0, &head
	mov	%1, %0
	addic	%0, %0, 1
	stwcd   %0, 0, &head
	bne-	1b

2:	lhax	%0, 0, &tail
	lwsync
	cmp	0, %0, %0
	bne-	2b

unlock:
	lhz	%0, 0, &tail
	addic	%0, %0, 1
	lwsync
	sth	%0, 0, &tail

Which would somewhat translate into C as:

static inline void ticket_spin_lock(tickets_t *lock)
{
	ticket_t mine = xadd(&lock->head);

	while (smp_load_acquire(&lock->tail) != mine)
		cpu_relax();
}

static inline void ticket_spin_unlock(tickets_t *lock)
{
	ticket_t tail = lock->tail + 1;

	smp_store_release(&lock->tail, tail);
}

Where xadd() returns the value before addition and we assume half word
single-copy atomicy, such that the head and tail updates will not
interfere.

The x86 implementation uses the 32bit xadd and places the head at the
MSB end to get the atomic add + tail load in a single instruction, but
for PPC its much better to have an extra load (to an already hot
cacheline) and avoid a second ll/sc pair, as the ll/sc things are stupid
slow for your arch afaik.

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Gabriel Paubert @ 2014-02-07 10:10 UTC (permalink / raw)
  To: Stephen N Chivers; +Cc: Chris Proctor, linuxppc-dev
In-Reply-To: <OFA4C1AF9F.B23E83EC-ONCA257C78.00044235-CA257C78.00080D50@csc.com>

	Hi Stephen,

On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> 
> > From: Gabriel Paubert <paubert@iram.es>
> > To: Stephen N Chivers <schivers@csc.com.au>
> > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > Date: 02/06/2014 07:26 PM
> > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > 
> > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> > > I have a MPC8548e based board and an application that makes
> > > extensive use of floating point including numerous calls to cos.
> > > In the same program there is the use of an sqlite database.
> > > 
> > > The kernel is derived from 2.6.31 and is compiled with math emulation.
> > > 
> > > At some point after the reading of the SQLITE database, the
> > > return result from cos goes from an in range value to an out
> > > of range value.
> > > 
> > > This is as a result of the FP rounding mode mutating from "round to 
> > > nearest"
> > > to "round toward zero".
> > > 
> > > The cos in the glibc version being used is known to be sensitive to 
> > > rounding
> > > direction and Joseph Myers has previously fixed glibc.
> > > 
> > > The failure does not occur on a machine that has a hardware floating
> > > point unit (a MPC7410 processor).
> > > 
> > > I have traced the mutation to the following series of instructions:
> > > 
> > >         mffs            f0
> > >         mtfsb1          4*cr7+so
> > >         mtfsb0          4*cr7+eq
> > >         fadd            f13,f1,f2
> > >         mtfsf           1, f0
> > > 
> > > The instructions are part of the stream emitted by gcc for the 
> conversion
> > > of a 128 bit floating point value into an integer in the sqlite 
> database 
> > > read.
> > > 
> > > Immediately before the execution of the mffs instruction the "rounding
> > > mode" is "round to nearest".
> > > 
> > > On the MPC8548 board, the execution of the mtfsf instruction does not
> > > restore the rounding mode to "round to nearest".
> > > 
> > > I believe that the mask computation in mtfsf.c is incorrect and is 
> > > reversed.
> > > 
> > > In the latest version of the file (linux-3.14-rc1), the mask is 
> computed 
> > > by:
> > > 
> > >                  mask = 0;
> > >                  if (FM & (1 << 0))
> > >                         mask |= 0x90000000;
> > >                  if (FM & (1 << 1))
> > >                         mask |= 0x0f000000;
> > >                  if (FM & (1 << 2))
> > >                         mask |= 0x00f00000;
> > >                  if (FM & (1 << 3))
> > >                         mask |= 0x000f0000;
> > >                  if (FM & (1 << 4))
> > >                         mask |= 0x0000f000;
> > >                  if (FM & (1 << 5))
> > >                         mask |= 0x00000f00;
> > >                  if (FM & (1 << 6))
> > >                         mask |= 0x000000f0;
> > >                  if (FM & (1 << 7))
> > >                         mask |= 0x0000000f;
> > > 
> > > I think it should be:
> > > 
> > >                 mask = 0;
> > >                 if (FM & (1 << 0))
> > >                         mask |= 0x0000000f;
> > >                 if (FM & (1 << 1))
> > >                         mask |= 0x000000f0;
> > >                 if (FM & (1 << 2))
> > >                         mask |= 0x00000f00;
> > >                 if (FM & (1 << 3))
> > >                         mask |= 0x0000f000;
> > >                 if (FM & (1 << 4))
> > >                         mask |= 0x000f0000;
> > >                 if (FM & (1 << 5))
> > >                         mask |= 0x00f00000;
> > >                 if (FM & (1 << 6))
> > >                         mask |= 0x0f000000;
> > >                 if (FM & (1 << 7))
> > >                         mask |= 0x90000000;
> > > 
> > > With the above mask computation I get consistent results for both the 
> > > MPC8548
> > > and MPC7410 boards.
> > > 
> > > Am I missing something subtle?
> > 
> > No I think you are correct. This said, this code may probably be 
> optimized 
> > to eliminate a lot of the conditional branches. I think that:
> > 
> > mask = (FM & 1);
> > mask |= (FM << 3) & 0x10;
> > mask |= (FM << 6) & 0x100;
> > mask |= (FM << 9) & 0x1000;
> > mask |= (FM << 12) & 0x10000;
> > mask |= (FM << 15) & 0x100000;
> > mask |= (FM << 18) & 0x1000000;
> > mask |= (FM << 21) & 0x10000000;
> > mask *= 15;
> > 
> > should do the job, in less code space and without a single branch.
> > 
> > Each one of the "mask |=" lines should be translated into an
> > rlwinm instruction followed by an "or". Actually it should be possible
> > to transform each of these lines into a single "rlwimi" instruction
> > but I don't know how to coerce gcc to reach this level of optimization.
> > 
> > Another way of optomizing this could be:
> > 
> > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > mask *= 15;
> > 
> > It's not easy to see which of the solutions is faster, the second one
> > needs to create quite a few constants, but its dependency length is 
> > lower. It is very likely that the first solution is faster in cache-cold
> > case and the second in cache-hot. 
> > 
> > Regardless, the original code is rather naïve, larger and slower in all 
> cases,
> > with timing variation depending on branch mispredictions.
> 
> Thanks for the response, it is appreciated.
> 
> I have tried simple test versions of the two suggestions above, both
> produce the same results as the original formulation.
> 
> My toolchain is gcc-4.1.2 with binutils 2.17.
> 
> When compiled without optimization I get:
> 
>         original        58 instructions 20 memory accesses 9 branches
>         method1 53 instructions 27 memory accesses 0 branches
>         method2 37 instructions 13 memory accesses 0 branches
> 
> with optimization:
> 
>         original        25 instructions 0 memory accesses 8 branches
>         method1 18 instructions 0 memory accesses 0 branches
>         method2 21 instructions 0 memory accesses 0 branches
> 
> The memory accesses do not include "setup" such as moving FM to
> a register.
> 
> The instruction counts for method1 and method2 include an extra
> and operation to preserve the original behaviour wrt the sticky
> FX bit (I think) although maybe that is also something else
> that it is wrong with the original implementation.

These bits are not sticky, they are actually a logical combination 
of other bits in the register. The code in mtfsf.c recomputes 
the FPSCR_VX and FPSCR_VEX bits from the contents of the new FPSCR, 
so, unless I miss something, the and operation is unnecessary.

> In my naivety I would go with method2 as it generates fewer
> instructions when  not optimized and isn't far from method1 when
> optimized.

I would never count the size of non-optimized gcc code as a valid 
benchmark. At least -O1 is necessary to implement basic, trivial
optimizations, as you see in the number of memory accesses.

Ok, I finally edited my sources and test compiled the suggestions
I gave. I'd say that method2 is the best overall indeed. You can
actually save one more instruction by setting mask to all ones in 
the case FM=0xff, but that's about all in this area.

This said, there are better ways to write the end of the routine
to be more compact. The __FPU_FPSCR is a global variable which is
accessed very often, and it would be better to use more the local
fpscr variable. I see really stupid things in the code, for example
the following:

        and. 8,10,0      #, tmp184, D.11439
==>     stw 0,740(2)     # current.0_21->thread.fp_state.fpscr, D.11439
        beq- 0,.L4       #
        oris 0,0,0x2000  #, tmp188, D.11439,
==>     stw 0,740(2)     # current.0_21->thread.fp_state.fpscr, tmp188
.L4:
==>     lwz 0,740(9)     # current.0_21->thread.fp_state.fpscr, fpscr
        lis 11,0x2000    # tmp227,

(and yes gcc also copies around r2, the current thread pointer, to
other registers, for no good reason).

Anyway, the first step would be to get the current patch in, then
work on more optimizations.

	Regards,
	Gabriel

^ permalink raw reply

* Re: [PATCH 2/2] ARM64: powernv: remove redundant cpuidle_idle_call()
From: Catalin Marinas @ 2014-02-07  9:50 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, Linux Kernel Mailing List,
	Ingo Molnar, Preeti U Murthy, Thomas Gleixner, linuxppc-dev,
	linux-arm-kernel@lists.infradead.org
In-Reply-To: <1391696188-14540-2-git-send-email-nicolas.pitre@linaro.org>

On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> The core idle loop now takes care of it.
>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Deepthi Dharwar @ 2014-02-07  9:45 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Nicolas Pitre, Lists linaro-kernel, linux-pm@vger.kernel.org,
	Peter Zijlstra, Daniel Lezcano, Rafael J. Wysocki, LKML,
	Ingo Molnar, Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <52F484D9.6020604@linux.vnet.ibm.com>

Hi Preeti,

Thanks for the patch.

On 02/07/2014 12:31 PM, Preeti U Murthy wrote:
> Hi Nicolas,
> 
> Find below the patch that will need to be squashed with this one.
> This patch is based on the mainline.Adding Deepthi, the author of
> the patch which introduced the powernv cpuidle driver. Deepthi,
> do you think the below patch looks right? We do not need to do an
> explicit local_irq_enable() since we are in the call path of
> cpuidle driver and that explicitly enables irqs on exit from
> idle states.

Yes, We enable irqs explicitly while entering snooze loop and we always
have interrupts enabled in the snooze state.
For NAP state, we exit out of this state with interrupts enabled so we
do not need an explicit enable of irqs.

> On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
>> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
>>
>>> Hi Daniel,
>>>
>>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>>> Hi Nico,
>>>>
>>>>
>>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>>
>>>>> The core idle loop now takes care of it.
>>>>>
>>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>>> ---
>>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>>> index 21166f65c9..a932feb290 100644
>>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>>> @@ -26,7 +26,6 @@
>>>>>  #include <linux/of_fdt.h>
>>>>>  #include <linux/interrupt.h>
>>>>>  #include <linux/bug.h>
>>>>> -#include <linux/cpuidle.h>
>>>>>
>>>>>  #include <asm/machdep.h>
>>>>>  #include <asm/firmware.h>
>>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>>         return 1;
>>>>>  }
>>>>>
>>>>> -void powernv_idle(void)
>>>>> -{
>>>>> -       /* Hook to cpuidle framework if available, else
>>>>> -        * call on default platform idle code
>>>>> -        */
>>>>> -       if (cpuidle_idle_call()) {
>>>>> -               power7_idle();
>>>>> -       }
>>>>>
> 
>  drivers/cpuidle/cpuidle-powernv.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index 78fd174..130f081 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
>  	set_thread_flag(TIF_POLLING_NRFLAG);
> 
>  	while (!need_resched()) {
> +		ppc64_runlatch_off();
                ^^^^^^^^^^^^^^^
We could move this before the while() loop.
It would ideal to turn off latch when we enter snooze and
turn it on when we are about to exit, rather than doing
it over and over in the while loop.

>  		HMT_low();
>  		HMT_very_low();
>  	}
> 
>  	HMT_medium();
> +	ppc64_runlatch_on();
>  	clear_thread_flag(TIF_POLLING_NRFLAG);
>  	smp_mb();
>  	return index;
> @@ -45,7 +47,9 @@ static int nap_loop(struct cpuidle_device *dev,
>  			struct cpuidle_driver *drv,
>  			int index)
>  {
> +	ppc64_runlatch_off();
>  	power7_idle();
> +	ppc64_runlatch_on();
>  	return index;
>  }
> 
> Thanks
> 
> Regards
> Preeti U Murthy
> 

Regards,
Deepthi

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07  9:02 UTC (permalink / raw)
  To: Scott Wood
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <1391717992.6733.232.camel@snotra.buserror.net>

On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> > On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> 
> > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > in the PowerISA doc. If so I think you can do better by being able to
> > > atomically load both tickets but only storing the head without affecting
> > > the tail.

Can I simply write the half word, without a reservation, or will the HW caches
mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

> Plus, sthcx doesn't exist on all PPC chips.

Which ones are lacking it? Do all have at least a simple 16-bit store?

	Torsten

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Paul Mackerras, Anton Blanchard, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20140206180826.GI5002@laptop.programming.kicks-ass.net>

On Thu, Feb 06, 2014 at 07:08:26PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 06, 2014 at 06:37:27PM +0100, Torsten Duwe wrote:
> > I must admit that I haven't tested the patch on non-pseries ppc64 nor on
> > ppc32. Only ppc64 has the ldarx and I tried to atomically replace the 
> > holder along with the locks. That might prove unneccessary.
> 
> But what is the holder for? Can't we do away with that field?

Scott, Peter: good questions.
The conditional is wrong because I confused pSeries with ppc64 CPUs with
64-bit kernels. I got deluded by the LOCK_TOKEN definition above. Is that
correctly ifdef'd, with PPC64? The holder field should be ifdef'd
CONFIG_PPC_SPLPAR, independent of ppc64.

It is an advisory performance hint, and doesn't need to be updated atomically
with the lock; this and the above are 2 reasons to drop the asm string
operand size voodoo as well.

Thanks,
	Torsten

^ permalink raw reply

* [PATCH V4 3/3] time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set
From: Preeti U Murthy @ 2014-02-07  8:06 UTC (permalink / raw)
  To: linux-pm, peterz, benh, rafael.j.wysocki, linux-kernel, tglx,
	linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck
In-Reply-To: <20140207075618.17187.20149.stgit@preeti.in.ibm.com>

Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
local timers stop. The cpuidle_idle_call() currently handles such idle states
by calling into the broadcast framework so as to wakeup CPUs at their next
wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
into the broadcast frameowork can fail for archs that do not have an external
clock device to handle wakeups and the CPU in question has to thus be made
the stand by CPU. This patch handles such cases by failing the call into
cpuidle so that the arch can take some default action. The arch will certainly
not enter a similar idle state because a failed cpuidle call will also implicitly
indicate that the broadcast framework has not registered this CPU to be woken up.
Hence we are safe if we fail the cpuidle call.

In the process move the functions that trace idle statistics just before and
after the entry and exit into idle states respectively. In other
scenarios where the call to cpuidle fails, we end up not tracing idle
entry and exit since a decision on an idle state could not be taken. Similarly
when the call to broadcast framework fails, we skip tracing idle statistics
because we are in no further position to take a decision on an alternative
idle state to enter into.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..8beb0f02 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -140,12 +140,14 @@ int cpuidle_idle_call(void)
 		return 0;
 	}

-	trace_cpu_idle_rcuidle(next_state, dev->cpu);
-
 	broadcast = !!(drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP);

-	if (broadcast)
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
+	if (broadcast &&
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu))
+		return -EBUSY;
+
+
+	trace_cpu_idle_rcuidle(next_state, dev->cpu);

 	if (cpuidle_state_is_coupled(dev, drv, next_state))
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
@@ -153,11 +155,11 @@ int cpuidle_idle_call(void)
 	else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);

+	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
+
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);

-	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
-
 	/* give the governor an opportunity to reflect on the outcome */
 	if (cpuidle_curr_governor->reflect)
 		cpuidle_curr_governor->reflect(dev, entered_state);

^ permalink raw reply related

* [PATCH V4 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast
From: Preeti U Murthy @ 2014-02-07  8:06 UTC (permalink / raw)
  To: linux-pm, peterz, benh, rafael.j.wysocki, linux-kernel, tglx,
	linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck
In-Reply-To: <20140207075618.17187.20149.stgit@preeti.in.ibm.com>

From: Thomas Gleixner <tglx@linutronix.de>

On some architectures, in certain CPU deep idle states the local timers stop.
An external clock device is used to wakeup these CPUs. The kernel support for the
wakeup of these CPUs is provided by the tick broadcast framework by using the
external clock device as the wakeup source.

However not all implementations of architectures provide such an external
clock device. This patch includes support in the broadcast framework to handle
the wakeup of the CPUs in deep idle states on such systems by queuing a hrtimer
on one of the CPUs, which is meant to handle the wakeup of CPUs in deep idle states.

This patchset introduces a pseudo clock device which can be registered by the
archs as tick_broadcast_device in the absence of a real external clock
device. Once registered, the broadcast framework will work as is for these
architectures as long as the archs take care of the BROADCAST_ENTER
notification failing for one of the CPUs. This CPU is made the stand by CPU to
handle wakeup of the CPUs in deep idle and it *must not enter deep idle states*.

The CPU with the earliest wakeup is chosen to be this CPU. Hence this way the
stand by CPU dynamically moves around and so does the hrtimer which is queued
to trigger at the next earliest wakeup time. This is consistent with the case where
an external clock device is present. The smp affinity of this clock device is
set to the CPU with the earliest wakeup. This patchset handles the hotplug of
the stand by CPU as well by moving the hrtimer on to the CPU handling the CPU_DEAD
notification.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[Added Changelog and code to handle reprogramming of hrtimer]
---

 include/linux/clockchips.h           |    9 +++
 kernel/time/Makefile                 |    2 -
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   54 +++++++++++++++++
 4 files changed, 166 insertions(+), 4 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index e0c5a6c..dbe9e14 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -62,6 +62,11 @@ enum clock_event_mode {
 #define CLOCK_EVT_FEAT_DYNIRQ		0x000020
 #define CLOCK_EVT_FEAT_PERCPU		0x000040
 
+/*
+ * Clockevent device is based on a hrtimer for broadcast
+ */
+#define CLOCK_EVT_FEAT_HRTIMER		0x000080
+
 /**
  * struct clock_event_device - clock event device descriptor
  * @event_handler:	Assigned by the framework to be called by the low
@@ -83,6 +88,7 @@ enum clock_event_mode {
  * @name:		ptr to clock event name
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
+ * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
  * @list:		list head for the management code
  * @owner:		module reference
@@ -113,6 +119,7 @@ struct clock_event_device {
 	const char		*name;
 	int			rating;
 	int			irq;
+	int			bound_on;
 	const struct cpumask	*cpumask;
 	struct list_head	list;
 	struct module		*owner;
@@ -180,9 +187,11 @@ extern int tick_receive_broadcast(void);
 #endif
 
 #if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_TICK_ONESHOT)
+extern void tick_setup_hrtimer_broadcast(void);
 extern int tick_check_broadcast_expired(void);
 #else
 static inline int tick_check_broadcast_expired(void) { return 0; }
+static void tick_setup_hrtimer_broadcast(void) {};
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 9250130..06151ef 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -3,7 +3,7 @@ obj-y += timeconv.o posix-clock.o alarmtimer.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD)		+= clockevents.o
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)		+= tick-common.o
-obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o
+obj-$(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST)	+= tick-broadcast.o tick-broadcast-hrtimer.o
 obj-$(CONFIG_GENERIC_SCHED_CLOCK)		+= sched_clock.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-oneshot.o
 obj-$(CONFIG_TICK_ONESHOT)			+= tick-sched.o
diff --git a/kernel/time/tick-broadcast-hrtimer.c b/kernel/time/tick-broadcast-hrtimer.c
new file mode 100644
index 0000000..af1e119
--- /dev/null
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -0,0 +1,105 @@
+/*
+ * linux/kernel/time/tick-broadcast-hrtimer.c
+ * This file emulates a local clock event device
+ * via a pseudo clock device.
+ */
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/profile.h>
+#include <linux/clockchips.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/module.h>
+
+#include "tick-internal.h"
+
+static struct hrtimer bctimer;
+
+static void bc_set_mode(enum clock_event_mode mode,
+			struct clock_event_device *bc)
+{
+	switch (mode) {
+	case CLOCK_EVT_MODE_SHUTDOWN:
+		/*
+		 * Note, we cannot cancel the timer here as we might
+		 * run into the following live lock scenario:
+		 *
+		 * cpu 0		cpu1
+		 * lock(broadcast_lock);
+		 *			hrtimer_interrupt()
+		 *			bc_handler()
+		 *			   tick_handle_oneshot_broadcast();
+		 *			    lock(broadcast_lock);
+		 * hrtimer_cancel()
+		 *  wait_for_callback()
+		 */
+		hrtimer_try_to_cancel(&bctimer);
+		break;
+	default:
+		break;
+	}
+}
+
+/*
+ * This is called from the guts of the broadcast code when the cpu
+ * which is about to enter idle has the earliest broadcast timer event.
+ */
+static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
+{
+	/*
+	 * We try to cancel the timer first. If the callback is on
+	 * flight on some other cpu then we let it handle it. If we
+	 * were able to cancel the timer nothing can rearm it as we
+	 * own broadcast_lock.
+	 *
+	 * However we can also be called from the event handler of
+	 * ce_broadcast_hrtimer itself when it expires. We cannot therefore
+	 * restart the timer since it is on flight on the same CPU. But
+	 * due to the same reason we can reset it.
+	 */
+	if (hrtimer_try_to_cancel(&bctimer) >= 0) {
+		hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED);
+		/* Bind the "device" to the cpu */
+		bc->bound_on = smp_processor_id();
+	} else if (bc->bound_on == smp_processor_id()) {
+		hrtimer_set_expires(&bctimer, expires);
+	}
+	return 0;
+}
+
+static struct clock_event_device ce_broadcast_hrtimer = {
+	.set_mode		= bc_set_mode,
+	.set_next_ktime		= bc_set_next,
+	.features		= CLOCK_EVT_FEAT_ONESHOT |
+				  CLOCK_EVT_FEAT_KTIME |
+				  CLOCK_EVT_FEAT_HRTIMER,
+	.rating			= 0,
+	.bound_on		= -1,
+	.min_delta_ns		= 1,
+	.max_delta_ns		= KTIME_MAX,
+	.min_delta_ticks	= 1,
+	.max_delta_ticks	= KTIME_MAX,
+	.mult			= 1,
+	.shift			= 0,
+	.cpumask		= cpu_all_mask,
+};
+
+static enum hrtimer_restart bc_handler(struct hrtimer *t)
+{
+	ce_broadcast_hrtimer.event_handler(&ce_broadcast_hrtimer);
+
+	if (ce_broadcast_hrtimer.next_event.tv64 == KTIME_MAX)
+		return HRTIMER_NORESTART;
+
+	return HRTIMER_RESTART;
+}
+
+void tick_setup_hrtimer_broadcast(void)
+{
+	hrtimer_init(&bctimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	bctimer.function = bc_handler;
+	clockevents_register_device(&ce_broadcast_hrtimer);
+}
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index ddf2ac2..2f013c3 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -630,6 +630,42 @@ again:
 	raw_spin_unlock(&tick_broadcast_lock);
 }
 
+static int broadcast_needs_cpu(struct clock_event_device *bc, int cpu)
+{
+	if (!(bc->features & CLOCK_EVT_FEAT_HRTIMER))
+		return 0;
+	if (bc->next_event.tv64 == KTIME_MAX)
+		return 0;
+	return bc->bound_on == cpu ? -EBUSY : 0;
+}
+
+static void broadcast_shutdown_local(struct clock_event_device *bc,
+				     struct clock_event_device *dev)
+{
+	/*
+	 * For hrtimer based broadcasting we cannot shutdown the cpu
+	 * local device if our own event is the first one to expire or
+	 * if we own the broadcast timer.
+	 */
+	if (bc->features & CLOCK_EVT_FEAT_HRTIMER) {
+		if (broadcast_needs_cpu(bc, smp_processor_id()))
+			return;
+		if (dev->next_event.tv64 < bc->next_event.tv64)
+			return;
+	}
+	clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+}
+
+static void broadcast_move_bc(int deadcpu)
+{
+	struct clock_event_device *bc = tick_broadcast_device.evtdev;
+
+	if (!bc || !broadcast_needs_cpu(bc, deadcpu))
+		return;
+	/* This moves the broadcast assignment to this cpu */
+	clockevents_program_event(bc, bc->next_event, 1);
+}
+
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
@@ -648,7 +684,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 	 * states
 	 */
 	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
-		return;
+		return 0;
 
 	/*
 	 * We are called with preemtion disabled from the depth of the
@@ -659,7 +695,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 	dev = td->evtdev;
 
 	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
-		return;
+		return 0;
 
 	bc = tick_broadcast_device.evtdev;
 
@@ -667,7 +703,7 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
 		if (!cpumask_test_and_set_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			WARN_ON_ONCE(cpumask_test_cpu(cpu, tick_broadcast_pending_mask));
-			clockevents_set_mode(dev, CLOCK_EVT_MODE_SHUTDOWN);
+			broadcast_shutdown_local(bc, dev);
 			/*
 			 * We only reprogram the broadcast timer if we
 			 * did not mark ourself in the force mask and
@@ -680,6 +716,16 @@ int tick_broadcast_oneshot_control(unsigned long reason)
 			    dev->next_event.tv64 < bc->next_event.tv64)
 				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
 		}
+		/*
+		 * If the current CPU owns the hrtimer broadcast
+		 * mechanism, it cannot go deep idle and we remove the
+		 * CPU from the broadcast mask. We don't have to go
+		 * through the EXIT path as the local timer is not
+		 * shutdown.
+		 */
+		ret = broadcast_needs_cpu(bc, cpu);
+		if (ret)
+			cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
 	} else {
 		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
@@ -853,6 +899,8 @@ void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 	cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
 	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);
 
+	broadcast_move_bc(cpu);
+
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 

^ permalink raw reply related

* [PATCH V4 1/3] time: Change the return type of clockevents_notify() to integer
From: Preeti U Murthy @ 2014-02-07  8:06 UTC (permalink / raw)
  To: linux-pm, peterz, benh, rafael.j.wysocki, linux-kernel, tglx,
	linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck
In-Reply-To: <20140207075618.17187.20149.stgit@preeti.in.ibm.com>

The broadcast framework can potentially be made use of by archs which do not have an
external clock device as well. Then, it is required that one of the CPUs need
to handle the broadcasting of wakeup IPIs to the CPUs in deep idle. As a
result its local timers should remain functional all the time. For such
a CPU, the BROADCAST_ENTER notification has to fail indicating that its clock
device cannot be shutdown. To make way for this support, change the return
type of tick_broadcast_oneshot_control() and hence clockevents_notify() to
indicate such scenarios.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 include/linux/clockchips.h   |    6 +++---
 kernel/time/clockevents.c    |    8 +++++---
 kernel/time/tick-broadcast.c |    6 ++++--
 kernel/time/tick-internal.h  |    6 +++---
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..e0c5a6c 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
@@ -196,7 +196,7 @@ static inline void clockevents_notify(unsigned long reason, void *arg) {}
 static inline void clockevents_suspend(void) {}
 static inline void clockevents_resume(void) {}
 
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) { return 0; }
 static inline int tick_check_broadcast_expired(void) { return 0; }
 
 #endif
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..79b8685 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns 0 on success, any other value on error
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,7 +543,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
@@ -585,6 +586,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 43780ab..ddf2ac2 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -633,14 +633,15 @@ again:
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ * Returns 0 on success, -EBUSY if the cpu is used to broadcast wakeups.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
 	struct clock_event_device *bc, *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
@@ -746,6 +747,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 8329669..f0dc03c 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,7 +46,7 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
@@ -58,7 +58,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_broadcast_switch_to_oneshot(void) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_broadcast_oneshot_active(void) { return 0; }
@@ -87,7 +87,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { return 0; }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {

^ permalink raw reply related

* [PATCH V4 0/3] time/cpuidle: Support in tick broadcast framework in absence of external clock device
From: Preeti U Murthy @ 2014-02-07  8:05 UTC (permalink / raw)
  To: linux-pm, peterz, benh, rafael.j.wysocki, linux-kernel, tglx,
	linuxppc-dev, mingo
  Cc: deepthi, fweisbec, paulus, srivatsa.bhat, paulmck

This patchset provides support in the tick broadcast framework for such
architectures so as to enable the CPUs to get into deep idle.

Presently we are in need of this support on certain implementations of
PowerPC. This patchset has thus been tested on the same.

This patchset has been based on the idea discussed here:
http://www.kernelhub.org/?p=2&msg=399516

Changes in V4:
1. Cleared the stand by CPU from the oneshot mask. As a result PATCH 3/3
was simplified.
2. Fixed compile time warnings.
---

Preeti U Murthy (2):
      time: Change the return type of clockevents_notify() to integer
      time/cpuidle:Handle failed call to BROADCAST_ENTER on archs with CPUIDLE_FLAG_TIMER_STOP set

Thomas Gleixner (1):
      tick/cpuidle: Initialize hrtimer mode of broadcast


 drivers/cpuidle/cpuidle.c            |   14 +++--
 include/linux/clockchips.h           |   15 ++++-
 kernel/time/Makefile                 |    2 -
 kernel/time/clockevents.c            |    8 ++-
 kernel/time/tick-broadcast-hrtimer.c |  105 ++++++++++++++++++++++++++++++++++
 kernel/time/tick-broadcast.c         |   60 ++++++++++++++++++-
 kernel/time/tick-internal.h          |    6 +-
 7 files changed, 189 insertions(+), 21 deletions(-)
 create mode 100644 kernel/time/tick-broadcast-hrtimer.c

-- 

^ permalink raw reply

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Joonsoo Kim @ 2014-02-07  8:03 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206192812.GC7845@linux.vnet.ibm.com>

On Thu, Feb 06, 2014 at 11:28:12AM -0800, Nishanth Aravamudan wrote:
> On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:
> > On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:
> > > On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> > > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > > 
> > > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > > the system showing the original problem, configured to have 15GB of
> > > > > > memory.
> > > > > > 
> > > > > > With your patch after boot:
> > > > > > 
> > > > > > MemTotal:       15604736 kB
> > > > > > MemFree:         8768192 kB
> > > > > > Slab:            3882560 kB
> > > > > > SReclaimable:     105408 kB
> > > > > > SUnreclaim:      3777152 kB
> > > > > > 
> > > > > > With Anton's patch after boot:
> > > > > > 
> > > > > > MemTotal:       15604736 kB
> > > > > > MemFree:        11195008 kB
> > > > > > Slab:            1427968 kB
> > > > > > SReclaimable:     109184 kB
> > > > > > SUnreclaim:      1318784 kB
> > > > > > 
> > > > > > 
> > > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > > 
> > > > > 
> > > > > I don't think the goal of the discussion is to reduce the amount of slab 
> > > > > allocated, but rather get the most local slab memory possible by use of 
> > > > > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > > > > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > > > > my patch is allocating it on the most local node; Anton's patch is 
> > > > > allocating it on whatever happened to be the cpu slab.
> > > > > 
> > > > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > > > --- a/mm/slub.c
> > > > > > > +++ b/mm/slub.c
> > > > > > > @@ -2278,10 +2278,14 @@ redo:
> > > > > > > 
> > > > > > >  	if (unlikely(!node_match(page, node))) {
> > > > > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > > > > -		deactivate_slab(s, page, c->freelist);
> > > > > > > -		c->page = NULL;
> > > > > > > -		c->freelist = NULL;
> > > > > > > -		goto new_slab;
> > > > > > > +		if (unlikely(!node_present_pages(node)))
> > > > > > > +			node = numa_mem_id();
> > > > > > > +		if (!node_match(page, node)) {
> > > > > > > +			deactivate_slab(s, page, c->freelist);
> > > > > > > +			c->page = NULL;
> > > > > > > +			c->freelist = NULL;
> > > > > > > +			goto new_slab;
> > > > > > > +		}
> > > > > > 
> > > > > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > > > > if we have a memoryless node, we expect the page's locality to be that
> > > > > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > > > > Just wanting to make sure I understand the intent.
> > > > > > 
> > > > > 
> > > > > Yeah, the default policy should be to fallback to local memory if the node 
> > > > > passed is memoryless.
> > > > > 
> > > > > > What I find odd is that there are only 2 nodes on this system, node 0
> > > > > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > > > > should be coming from node 1 (thus node_match() should always be true?)
> > > > > > 
> > > > > 
> > > > > The nice thing about slub is its debugging ability, what is 
> > > > > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > > > > patches?
> > > > 
> > > > Ok, I finally got around to writing a script that compares the objects
> > > > output from both kernels.
> > > > 
> > > > log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > > and Joonsoo's patch.
> > > > 
> > > > log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > > and Anton's patch.
> > > > 
> > > > slab                           objects    objects   percent
> > > >                                log1       log2      change
> > > > -----------------------------------------------------------
> > > > :t-0000104                     71190      85680      20.353982 %
> > > > UDP                            4352       3392       22.058824 %
> > > > inode_cache                    54302      41923      22.796582 %
> > > > fscache_cookie_jar             3276       2457       25.000000 %
> > > > :t-0000896                     438        292        33.333333 %
> > > > :t-0000080                     310401     195323     37.073978 %
> > > > ext4_inode_cache               335        201        40.000000 %
> > > > :t-0000192                     89408      128898     44.168307 %
> > > > :t-0000184                     151300     81880      45.882353 %
> > > > :t-0000512                     49698      73648      48.191074 %
> > > > :at-0000192                    242867     120948     50.199904 %
> > > > xfs_inode                      34350      15221      55.688501 %
> > > > :t-0016384                     11005      17257      56.810541 %
> > > > proc_inode_cache               103868     34717      66.575846 %
> > > > tw_sock_TCP                    768        256        66.666667 %
> > > > :t-0004096                     15240      25672      68.451444 %
> > > > nfs_inode_cache                1008       315        68.750000 %
> > > > :t-0001024                     14528      24720      70.154185 %
> > > > :t-0032768                     655        1312       100.305344%
> > > > :t-0002048                     14242      30720      115.700042%
> > > > :t-0000640                     1020       2550       150.000000%
> > > > :t-0008192                     10005      27905      178.910545%
> > > > 
> > > > FWIW, the configuration of this LPAR has slightly changed. It is now configured
> > > > for maximally 400 CPUs, of which 200 are present. The result is that even with
> > > > Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> > > > script reports:
> > > > 
> > > > slab                                   mem     objs    slabs
> > > >                                       used   active   active
> > > > ------------------------------------------------------------
> > > > kmalloc-512                        1182 MB    2.03%  100.00%
> > > > kmalloc-192                        1182 MB    1.38%  100.00%
> > > > kmalloc-16384                       966 MB   17.66%  100.00%
> > > > kmalloc-4096                        353 MB   15.92%  100.00%
> > > > kmalloc-8192                        259 MB   27.28%  100.00%
> > > > kmalloc-32768                       207 MB    9.86%  100.00%
> > > > 
> > > > In comparison (log2 above):
> > > > 
> > > > slab                                   mem     objs    slabs
> > > >                                       used   active   active
> > > > ------------------------------------------------------------
> > > > kmalloc-16384                       273 MB   98.76%  100.00%
> > > > kmalloc-8192                        225 MB   98.67%  100.00%
> > > > pgtable-2^11                        114 MB  100.00%  100.00%
> > > > pgtable-2^12                        109 MB  100.00%  100.00%
> > > > kmalloc-4096                        104 MB   98.59%  100.00%
> > > > 
> > > > I appreciate all the help so far, if anyone has any ideas how best to
> > > > proceed further, or what they'd like debugged more, I'm happy to get
> > > > this fixed. We're hitting this on a couple of different systems and I'd
> > > > like to find a good resolution to the problem.
> > > 
> > > Hello,
> > > 
> > > I have no memoryless system, so, to debug it, I need your help. :)
> > > First, please let me know node information on your system.
> > 
> > [    0.000000] Node 0 Memory:
> > [    0.000000] Node 1 Memory: 0x0-0x200000000
> > 
> > [    0.000000] On node 0 totalpages: 0
> > [    0.000000] On node 1 totalpages: 131072
> > [    0.000000]   DMA zone: 112 pages used for memmap
> > [    0.000000]   DMA zone: 0 pages reserved
> > [    0.000000]   DMA zone: 131072 pages, LIFO batch:1
> > 
> > [    0.638391] Node 0 CPUs: 0-199
> > [    0.638394] Node 1 CPUs:
> > 
> > Do you need anything else?
> > 
> > > I'm preparing 3 another patches which are nearly same with previous patch,
> > > but slightly different approach. Could you test them on your system?
> > > I will send them soon.
> > 
> > Test results are in the attached tarball [1].
> > 
> > > And I think that same problem exists if CONFIG_SLAB is enabled. Could you
> > > confirm that?
> > 
> > I will test and let you know.
> 
> Ok, with your patches applied and CONFIG_SLAB enabled:
> 
> MemTotal:        8264640 kB
> MemFree:         7119680 kB
> Slab:             207232 kB
> SReclaimable:      32896 kB
> SUnreclaim:       174336 kB
> 
> For reference, same kernel with CONFIG_SLUB:
> 
> MemTotal:        8264640 kB
> MemFree:         4264000 kB
> Slab:            3065408 kB
> SReclaimable:     104704 kB
> SUnreclaim:      2960704 kB
> 


Hello,

First of all, thanks for testing!

My patch only affects CONFIG_SLUB. Request to test on CONFIG_SLAB is just
for reference. It seems that my patches doesn't have any effect to your case.
Could you check that numa_mem_id() and get_numa_mem() returns correctly?
I think that numa_mem_id() for all cpus and get_numa_mem() for all nodes
should return 1 on your system.

I will investigate further on my side.

Thanks!

^ permalink raw reply

* Re: [PATCH 2/2] power8, perf: Export raw event codes through sysfs interface
From: Michael Ellerman @ 2014-02-07  7:25 UTC (permalink / raw)
  To: Anshuman Khandual; +Cc: linuxppc-dev, mikey, sukadev
In-Reply-To: <1381902780-2719-3-git-send-email-khandual@linux.vnet.ibm.com>

On Wed, 2013-10-16 at 11:23 +0530, Anshuman Khandual wrote:
> This patch exports a set of POWER8 PMU raw event codes through
> sysfs interface. Right now the raw event set matches the entire
> set of POWER8 events found in libpfm4 library.

As for the POWER7 events, these all need to be in the ABI file too.

Please update & resend.

cheers

^ permalink raw reply

* Re: [PATCH V2] power7, perf: Make some new raw event codes available in sysfs
From: Michael Ellerman @ 2014-02-07  7:22 UTC (permalink / raw)
  To: Anshuman Khandual; +Cc: Linux PPC dev
In-Reply-To: <52F1FB75.9090007@linux.vnet.ibm.com>

On Wed, 2014-02-05 at 14:21 +0530, Anshuman Khandual wrote:
> On 01/02/2014 10:32 AM, Anshuman Khandual wrote:
> > This patchset adds some missing event list for POWER7 PMU raw
> > events which are exported through sysfs interface. Also updates
> > the ABI documentation to add all the sysfs exported raw events.
> > 
> > Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> > ---
> >  .../testing/sysfs-bus-event_source-devices-events  | 517 +++++++++++++++++++++
> >  arch/powerpc/perf/power7-events-list.h             |  10 +
> >  2 files changed, 527 insertions(+)
> 
> Hey Michael,
> 
> Any updates on this patch ? This does not seem to be merged yet.

No updates. I thought I asked Ben to merge it but I guess I didn't, or he
didn't. Will queue it up.

cheers

^ permalink raw reply

* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-07  7:01 UTC (permalink / raw)
  To: Nicolas Pitre, Deepthi Dharwar
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Peter Zijlstra,
	Daniel Lezcano, Rafael J. Wysocki, LKML, Ingo Molnar,
	Thomas Gleixner, linuxppc-dev, Linux ARM Kernel ML
In-Reply-To: <alpine.LFD.2.11.1402070105170.1906@knanqh.ubzr>

Hi Nicolas,

Find below the patch that will need to be squashed with this one.
This patch is based on the mainline.Adding Deepthi, the author of
the patch which introduced the powernv cpuidle driver. Deepthi,
do you think the below patch looks right? We do not need to do an
explicit local_irq_enable() since we are in the call path of
cpuidle driver and that explicitly enables irqs on exit from
idle states.

On 02/07/2014 06:47 AM, Nicolas Pitre wrote:
> On Thu, 6 Feb 2014, Preeti U Murthy wrote:
> 
>> Hi Daniel,
>>
>> On 02/06/2014 09:55 PM, Daniel Lezcano wrote:
>>> Hi Nico,
>>>
>>>
>>> On 6 February 2014 14:16, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>>
>>>> The core idle loop now takes care of it.
>>>>
>>>> Signed-off-by: Nicolas Pitre <nico@linaro.org>
>>>> ---
>>>>  arch/powerpc/platforms/powernv/setup.c | 13 +------------
>>>>  1 file changed, 1 insertion(+), 12 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/powernv/setup.c
>>>> b/arch/powerpc/platforms/powernv/setup.c
>>>> index 21166f65c9..a932feb290 100644
>>>> --- a/arch/powerpc/platforms/powernv/setup.c
>>>> +++ b/arch/powerpc/platforms/powernv/setup.c
>>>> @@ -26,7 +26,6 @@
>>>>  #include <linux/of_fdt.h>
>>>>  #include <linux/interrupt.h>
>>>>  #include <linux/bug.h>
>>>> -#include <linux/cpuidle.h>
>>>>
>>>>  #include <asm/machdep.h>
>>>>  #include <asm/firmware.h>
>>>> @@ -217,16 +216,6 @@ static int __init pnv_probe(void)
>>>>         return 1;
>>>>  }
>>>>
>>>> -void powernv_idle(void)
>>>> -{
>>>> -       /* Hook to cpuidle framework if available, else
>>>> -        * call on default platform idle code
>>>> -        */
>>>> -       if (cpuidle_idle_call()) {
>>>> -               power7_idle();
>>>> -       }
>>>>

 drivers/cpuidle/cpuidle-powernv.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..130f081 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -31,11 +31,13 @@ static int snooze_loop(struct cpuidle_device *dev,
 	set_thread_flag(TIF_POLLING_NRFLAG);
 
 	while (!need_resched()) {
+		ppc64_runlatch_off();
 		HMT_low();
 		HMT_very_low();
 	}
 
 	HMT_medium();
+	ppc64_runlatch_on();
 	clear_thread_flag(TIF_POLLING_NRFLAG);
 	smp_mb();
 	return index;
@@ -45,7 +47,9 @@ static int nap_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
 {
+	ppc64_runlatch_off();
 	power7_idle();
+	ppc64_runlatch_on();
 	return index;
 }
 
Thanks

Regards
Preeti U Murthy

^ permalink raw reply related

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-07  5:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1402061248450.9567@chino.kir.corp.google.com>

On Thu, Feb 06, 2014 at 12:52:11PM -0800, David Rientjes wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> 
> > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date: Thu, 6 Feb 2014 17:07:05 +0900
> > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> > determining the
> >  fallback node
> > 
> > We need to determine the fallback node in slub allocator if the allocation
> > target node is memoryless node. Without it, the SLUB wrongly select
> > the node which has no memory and can't use a partial slab, because of node
> > mismatch. Introduced function, node_numa_mem(X), will return
> > a node Y with memory that has the nearest distance. If X is memoryless
> > node, it will return nearest distance node, but, if
> > X is normal node, it will return itself.
> > 
> > We will use this function in following patch to determine the fallback
> > node.
> > 
> 
> I like the approach and it may fix the problem today, but it may not be 
> sufficient in the future: nodes may not only be memoryless but they may 
> also be cpuless.  It's possible that a node can only have I/O, networking, 
> or storage devices and we can define affinity for them that is remote from 
> every cpu and/or memory by the ACPI specification.
> 
> It seems like a better approach would be to do this when a node is brought 
> online and determine the fallback node based not on the zonelists as you 
> do here but rather on locality (such as through a SLIT if provided, see 
> node_distance()).

Hmm...
I guess that zonelist is base on locality. Zonelist is generated using
node_distance(), so I think that it reflects locality. But, I'm not expert
on NUMA, so please let me know what I am missing here :)

> Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
> a lot of sense in generic code.  I'd suggest something like 
> node_to_mem_node().

It's much better!
If this patch eventually will be needed, I will update it.

Thanks.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-07  5:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140206191131.GB7845@linux.vnet.ibm.com>

On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >   */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
> 
> Should be static, I think?

Yes, will update it.

Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox