LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH 1/2] PPC: powernv: remove redundant cpuidle_idle_call()
From: Preeti U Murthy @ 2014-02-10  3:45 UTC (permalink / raw)
  To: Peter Zijlstra, Nicolas Pitre
  Cc: Lists linaro-kernel, linux-pm@vger.kernel.org, Daniel Lezcano,
	Rafael J. Wysocki, LKML, Ingo Molnar, Thomas Gleixner,
	linuxppc-dev
In-Reply-To: <20140207124140.GB9987@twins.programming.kicks-ass.net>

Hi Peter,

On 02/07/2014 06:11 PM, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:11:26PM +0530, Preeti U Murthy wrote:
>> But observe the idle state "snooze" on powerpc. The power that this idle
>> state saves is through the lowering of the thread priority of the CPU.
>> After it lowers the thread priority, it is done. It cannot
>> "wait_for_interrupts". It will exit my_idle(). It is now up to the
>> generic idle loop to increase the thread priority if the need_resched
>> flag is set. Only an interrupt routine can increase the thread priority.
>> Else we will need to do it explicitly. And in such states which have a
>> polling nature, the cpu will not receive a reschedule IPI.
>>
>> That is why in the snooze_loop() we poll on need_resched. If it is set
>> we up the priority of the thread using HMT_MEDIUM() and then exit the
>> my_idle() loop. In case of interrupts, the priority gets automatically
>> increased.
> 
> You can poll without setting TS_POLLING/TIF_POLLING_NRFLAGS just fine
> and get the IPI if that is what you want.
> 
> Depending on how horribly unprovisioned the thread gets at the lowest
> priority, that might actually be faster than polling and raising the
> prio whenever it does get run.

So I am assuming you mean something like the below:

my_idle()
{
   local_irq_enable();
   /* Remove the setting of the polling flag */
   HMT_low();
   return index;
}

And then exit into the generic idle loop. But the issue I see here is
that TS_POLLING/TIF_POLLING_NRFLAGS gets set immediately. So if
need_resched(), tested immediately after this returns, finds the
TIF_NEED_RESCHED flag set, the thread will exit at low priority, right?
We could raise the priority of the thread in arch_cpu_idle_exit() soon
after setting the polling flag, but that would mean that in cases where the
TIF_NEED_RESCHED flag is not set we raise the priority of the thread
unnecessarily.
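
For reference, the polling behaviour of snooze described in this thread can
be modelled in user space. This is only a sketch under assumptions:
hmt_low(), hmt_medium() and the fields of struct cpu_model are hypothetical
stand-ins for HMT_low(), HMT_MEDIUM(), TIF_POLLING_NRFLAG and
TIF_NEED_RESCHED. It is not the kernel's snooze_loop().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not real kernel symbols. */
enum prio { PRIO_LOW, PRIO_MEDIUM };

struct cpu_model {
	enum prio prio;			/* models the SMT thread priority */
	bool polling;			/* models TIF_POLLING_NRFLAG */
	bool need_resched;		/* models TIF_NEED_RESCHED */
	int polls_until_resched;	/* test knob: flag is set after N polls */
};

static void hmt_low(struct cpu_model *c)    { c->prio = PRIO_LOW; }
static void hmt_medium(struct cpu_model *c) { c->prio = PRIO_MEDIUM; }

/*
 * Polling idle state: advertise that we poll (so no reschedule IPI is
 * sent), spin at low priority, and restore the priority explicitly
 * before leaving, because no interrupt will do it for us.
 * Returns the number of poll iterations executed.
 */
static int snooze_model(struct cpu_model *c)
{
	int polls = 0;

	c->polling = true;
	while (!c->need_resched) {
		hmt_low(c);
		polls++;
		/* Simulate another CPU setting the flag remotely. */
		if (polls >= c->polls_until_resched)
			c->need_resched = true;
	}
	hmt_medium(c);
	c->polling = false;
	return polls;
}
```

The point of the model is the ordering: the priority is raised on the exit
path of the polling loop itself, so the caller never sees a low-priority
thread with need_resched set.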

Thanks

Regards
Preeti U Murthy

> 


* Re: [PATCH v2] powerpc ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:10 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207165801.GC2107@lst.de>

On Fri, 2014-02-07 at 17:58 +0100, Torsten Duwe wrote:
>  typedef struct {
> -       volatile unsigned int slock;
> -} arch_spinlock_t;
> +       union {
> +               __ticketpair_t head_tail;
> +               struct __raw_tickets {
> +#ifdef __BIG_ENDIAN__          /* The "tail" part should be in the MSBs */
> +                       __ticket_t tail, head;
> +#else
> +                       __ticket_t head, tail;
> +#endif
> +               } tickets;
> +       };
> +#if defined(CONFIG_PPC_SPLPAR)
> +       u32 holder;
> +#endif
> +} arch_spinlock_t __aligned(4);

That's still broken with lockref (which we just merged).

We must have the arch_spinlock_t and the ref in the same 64-bit word
otherwise it will break.

We can make it work in theory since the holder doesn't have to be
accessed atomically, but the practicalities are a complete mess ...
lockref would essentially have to re-implement the holder handling
of the spinlocks and use lower level ticket stuff.

Unless you can find a sneaky trick ... :-(
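
To make the size constraint concrete, here is a sketch of the arithmetic.
The type names below are illustrative assumptions, not kernel definitions.
The point is only that a 32-bit ticket pair plus a 32-bit holder already
fills 64 bits, which leaves no room for a reference count in the same word.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ticket_t;

/* Ticket pair only: 32 bits. */
struct ticket_lock_only {
	ticket_t head, tail;
};

/* Ticket pair plus holder, as in the quoted patch with SPLPAR: 64 bits. */
struct ticket_lock_with_holder {
	ticket_t head, tail;
	uint32_t holder;
};

/* Hypothetical lock + count pairs, modelled on what lockref needs. */
struct lockref_model_small {
	struct ticket_lock_only lock;
	int32_t count;
};

struct lockref_model_big {
	struct ticket_lock_with_holder lock;
	int32_t count;
};

/* Returns 1 if an object of this size fits in a single 64-bit word. */
static int fits_in_u64(size_t size)
{
	return size <= sizeof(uint64_t);
}
```

On common ABIs the small variant is 8 bytes and the variant with the holder
is 12, so only the former can be updated with a single 64-bit
compare-and-exchange.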

Ben.


* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:05 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Torsten Duwe,
	Anton Blanchard, Scott Wood, Paul Mackerras, Paul E. McKenney,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <87C29DBB-41E7-4B6C-9089-3C7756FBAE07@kernel.crashing.org>

On Fri, 2014-02-07 at 09:51 -0600, Kumar Gala wrote:
> On Feb 7, 2014, at 3:02 AM, Torsten Duwe <duwe@lst.de> wrote:
> 
> > On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> >> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> >>> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> >> 
> >>>> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> >>>> in the PowerISA doc. If so I think you can do better by being able to
> >>>> atomically load both tickets but only storing the head without affecting
> >>>> the tail.
> > 
> > Can I simply write the half word, without a reservation, or will the HW caches
> > mess up the other half? Will it ruin the cache coherency on some (sub)architectures?
> 
> The coherency should be fine, I just can’t remember if you’ll lose the reservation by doing this.

Yes you do.

> >> Plus, sthcx doesn't exist on all PPC chips.
> > 
> > Which ones are lacking it? Do all have at least a simple 16-bit store?
> 
> Everything implements a simple 16-bit store, just not everything implements the store conditional of 16-bit data.

Ben.


* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  3:02 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, Peter Zijlstra, linux-kernel, Paul Mackerras,
	Anton Blanchard, Scott Wood, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20140207090248.GB26811@lst.de>

On Fri, 2014-02-07 at 10:02 +0100, Torsten Duwe wrote:
> > > > Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> > > > in the PowerISA doc. If so I think you can do better by being able to
> > > > atomically load both tickets but only storing the head without affecting
> > > > the tail.
> 
> Can I simply write the half word, without a reservation, or will the HW caches
> mess up the other half? Will it ruin the cache coherency on some (sub)architectures?

Yes, you can, I *think*

> > Plus, sthcx doesn't exist on all PPC chips.
> 
> Which ones are lacking it? Do all have at least a simple 16-bit store?

Half word atomics (and byte atomics) are new; they were added in architecture
2.06, I believe, so they're fairly recent. It's still worthwhile to investigate a
way to avoid atomics on unlock on recent processors (we can use instruction patching
if necessary, based on CPU features), because there's definitely a significant cost
in doing a larx/stcx. sequence on powerpc, way higher than our current unlock path
of barrier + store.
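
As an illustration of what "barrier + store" on unlock means for a ticket
lock, here is a minimal user-space sketch. It assumes C11 atomics as a
stand-in for the larx/stcx. sequence and the barrier; the names are
hypothetical, and it ignores the holder field and the lockref problem
entirely. It is not the proposed powerpc implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ticket_lock {
	_Atomic uint16_t head;	/* ticket currently being served */
	_Atomic uint16_t tail;	/* next ticket to hand out */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->head, 0);
	atomic_init(&l->tail, 0);
}

/* Taking a ticket needs an atomic read-modify-write on the tail. */
static void ticket_lock(struct ticket_lock *l)
{
	uint16_t me = atomic_fetch_add_explicit(&l->tail, 1,
						memory_order_relaxed);

	while (atomic_load_explicit(&l->head, memory_order_acquire) != me)
		;	/* spin until our ticket is served */
}

/* Succeeds only if nobody holds or waits for the lock (tail == head). */
static bool ticket_trylock(struct ticket_lock *l)
{
	uint16_t head = atomic_load_explicit(&l->head, memory_order_relaxed);
	uint16_t expected = head;

	return atomic_compare_exchange_strong_explicit(&l->tail, &expected,
			(uint16_t)(head + 1),
			memory_order_acquire, memory_order_relaxed);
}

/* Only the owner ever writes head, so unlock is a release store. */
static void ticket_unlock(struct ticket_lock *l)
{
	uint16_t head = atomic_load_explicit(&l->head, memory_order_relaxed);

	atomic_store_explicit(&l->head, (uint16_t)(head + 1),
			      memory_order_release);
}
```

The unlock path has no read-modify-write: it relies on head being a
separately addressable half word that only the lock owner stores to, which
is exactly the property the half word store question above is about.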

Cheers,
Ben.


* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Benjamin Herrenschmidt @ 2014-02-10  2:54 UTC (permalink / raw)
  To: Tom Musta
  Cc: Peter Zijlstra, linux-kernel, Paul Mackerras, Anton Blanchard,
	Torsten Duwe, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <52F3E255.5050906@gmail.com>

On Thu, 2014-02-06 at 13:28 -0600, Tom Musta wrote:
> My read is consistent with Torsten's ... this looks like a bad idea.
> 
> Look at the RTL for sthcx. on page 692 (Power ISA V2.06) and you will
> see this:
> 
> if RESERVE then
>   if RESERVE_LENGTH = 2 then
>      ...
>   else
>      undefined_case <- 1
> else
>   ...
> 
> A legal implementation might never perform the store.

This is an area where we definitely want to check with the implementors
and if the implementations happen to do what we want (they likely do),
get the architecture changed for future chips and use it anyway.

There's a *significant* benefit in avoiding an atomic operation in the
unlock case.

The reservation mechanism being based on a granule that is generally a
cache line, I doubt implementations will ever check the actual access
size, but we need to double check.

Cheers,
Ben.


* [RESEND PATCH 3/3] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy @ 2014-02-10  2:38 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Split timer_interrupt(), which is the local timer interrupt handler on ppc,
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
call directly into __timer_interrupt(). One of the use cases of this is the
tick broadcast IPI handling, in which the sleeping CPUs need to handle the local
timers that have expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   81 +++++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 3ff97db..df2989b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,47 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+		__get_cpu_var(irq_stat).timer_irqs_event++;
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+		/* We may have raced with new irq work */
+		if (test_irq_work_pending())
+			set_dec(1);
+		__get_cpu_var(irq_stat).timer_irqs_others++;
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +527,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -519,39 +558,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-		__get_cpu_var(irq_stat).timer_irqs_event++;
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
-		__get_cpu_var(irq_stat).timer_irqs_others++;
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -828,6 +835,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)


* [RESEND PATCH 1/3] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Preeti U Murthy @ 2014-02-10  2:37 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 084e080..9f7356b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_CALL_FUNC_SINGLE	2
+#define PPC_MSG_UNUSED		2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..ee7d76b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t call_function_single_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_single_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
+	[PPC_MSG_UNUSED] = unused_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
+	[PPC_MSG_UNUSED] = "ipi unused",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
-		if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE))
-			generic_smp_call_function_single_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
 void arch_send_call_function_single_ipi(int cpu)
 {
-	do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
+	do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
 void arch_send_call_function_ipi_mask(const struct cpumask *mask)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..adf3726 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..00d1a7c 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {


* [RESEND PATCH 2/3] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Preeti U Murthy @ 2014-02-10  2:38 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev
In-Reply-To: <20140210023503.19345.30567.stgit@preeti>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
 Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   21 +++++++++++++++++----
 arch/powerpc/kernel/time.c              |    5 +++++
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_UNUSED		2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ee7d76b..e2a4232 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,16 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+#endif
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3dab20..3ff97db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -825,6 +825,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index adf3726..8a106b4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TICK_BROADCAST);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 00d1a7c..b358bec 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
+		BUILD_BUG_ON(PPC_MSG_TICK_BROADCAST   != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {


* [RESEND PATCH 0/3] powerpc: Free up an IPI message slot for tick broadcast IPIs
From: Preeti U Murthy @ 2014-02-10  2:37 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev

This patchset is a precursor for enabling deep idle states on powerpc,
in which the local CPU timers stop. The tick broadcast framework in
the Linux kernel today handles wakeup of such CPUs at their next timer event
by using an external clock device. At the expiry of this clock device, IPIs
are sent to the CPUs in deep idle states so that they wake up to handle their
respective timers. This patchset frees up one of the IPI slots on powerpc
so that it can be used to handle the tick broadcast IPI.
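
The wakeup decision described above can be sketched as a small user-space
model. Everything here is an assumption made for illustration: struct
bc_model, MODEL_NR_CPUS and both helpers are hypothetical and are not the
tick broadcast framework's data structures or API.

```c
#include <stdint.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this sketch */

struct bc_model {
	uint64_t next_event[MODEL_NR_CPUS];	/* per-CPU next timer expiry */
	unsigned int deep_idle_mask;		/* CPUs whose local timer is stopped */
};

/*
 * Called when the broadcast clock device expires at time 'now'.
 * Returns the mask of CPUs that must receive a tick broadcast IPI.
 */
static unsigned int bc_expired_mask(const struct bc_model *bc, uint64_t now)
{
	unsigned int mask = 0;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(bc->deep_idle_mask & (1u << cpu)))
			continue;	/* local timer still running */
		if (bc->next_event[cpu] <= now)
			mask |= 1u << cpu;
	}
	return mask;
}

/*
 * Earliest event after 'now' among deep-idle CPUs, used to reprogram
 * the broadcast device. Returns UINT64_MAX if nothing is pending.
 */
static uint64_t bc_next_wakeup(const struct bc_model *bc, uint64_t now)
{
	uint64_t next = UINT64_MAX;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(bc->deep_idle_mask & (1u << cpu)))
			continue;
		if (bc->next_event[cpu] > now && bc->next_event[cpu] < next)
			next = bc->next_event[cpu];
	}
	return next;
}
```

In the patchset, the mask computed this way corresponds to the argument of
tick_broadcast(), which sends PPC_MSG_TICK_BROADCAST to each CPU in it.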

On certain implementations of powerpc, such an external clock device is absent.
The support in the tick broadcast framework to handle wakeup of CPUs from
deep idle states on such implementations is currently in the tip tree.
https://lkml.org/lkml/2014/2/7/906
https://lkml.org/lkml/2014/2/7/876
https://lkml.org/lkml/2014/2/7/608

With the above support in place, this patchset is next in line to enable deep
idle states on powerpc.

The patchset carries a RESEND tag since nothing has changed from
the previous post except for an added config condition around
tick_broadcast(), which handles sending broadcast IPIs, and the update in the cover
letter.
---

Preeti U Murthy (1):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message

 arch/powerpc/include/asm/smp.h          |    2 -
 arch/powerpc/include/asm/time.h         |    1 
 arch/powerpc/kernel/smp.c               |   25 ++++++---
 arch/powerpc/kernel/time.c              |   86 ++++++++++++++++++-------------
 arch/powerpc/platforms/cell/interrupt.c |    2 -
 arch/powerpc/platforms/ps3/smp.c        |    2 -
 6 files changed, 73 insertions(+), 45 deletions(-)

-- 


* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.

Why not check whether we can get the page from the best node, such as the
numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}

empty_node cannot be set for a memoryless node, since the page allocation
would succeed on a different node.
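
A small user-space model shows the effect. It is a sketch under
assumptions: struct numa_model and both functions are hypothetical, and the
fallback here walks nodes in index order, whereas the real page allocator
follows the zonelist.

```c
#define MODEL_MAX_NODES 4
#define NO_PAGE (-1)

struct numa_model {
	int free_pages[MODEL_MAX_NODES];	/* 0 for a memoryless node */
	int empty_node[MODEL_MAX_NODES];	/* flag from the draft patch */
};

/*
 * Page allocator model: try the requested node, then fall back to the
 * others. Returns the node that supplied the page, or NO_PAGE.
 */
static int alloc_page_model(struct numa_model *m, int node)
{
	for (int i = 0; i < MODEL_MAX_NODES; i++) {
		int n = (node + i) % MODEL_MAX_NODES;

		if (m->free_pages[n] > 0) {
			m->free_pages[n]--;
			return n;
		}
	}
	return NO_PAGE;
}

/* Mirrors the draft's logic: mark the node empty only if allocation fails. */
static int new_slab_model(struct numa_model *m, int node)
{
	int got = alloc_page_model(m, node);

	if (got == NO_PAGE)
		m->empty_node[node] = 1;
	return got;
}
```

With node 0 memoryless and node 1 populated, a request for node 0 is served
from node 1 and empty_node[0] stays clear. In this model the flag is set
only once every node has run out of pages.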

Thanks.


* Re: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Yijing Wang @ 2014-02-10  1:13 UTC (permalink / raw)
  To: Thomas Gleixner, David Laight
  Cc: linux-mips@linux-mips.org, x86@kernel.org, Kevin Hilman,
	linux@lists.openrisc.net, Hanjun Guo, Sekhar Nori, Michal Simek,
	Paul Mackerras, Ralf Baechle, H. Peter Anvin, Daniel Walker,
	Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim, Russell King,
	Richard Weinberger, Daniel Lezcano, Tony Lindgren, Ingo Molnar,
	microblaze-uclinux@itee.uq.edu.au, David Brown,
	Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-arm-msm@vger.kernel.org, Jeff Dike,
	davinci-linux-open-source@linux.davincidsp.com,
	linux-samsung-soc@vger.kernel.org, John Stultz,
	user-mode-linux-user@lists.sourceforge.net,
	linux-omap@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Barry Song, Jim Cromie, linux-kernel@vger.kernel.org,
	Nicolas Ferre, 'Tony Prisk', Bryan Huntsman,
	uclinux-dist-devel@blackfin.uclinux.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <alpine.DEB.2.02.1402052139560.24986@ionos.tec.linutronix.de>

On 2014/2/6 4:40, Thomas Gleixner wrote:
> Yijing,
> 
> On Thu, 23 Jan 2014, David Laight wrote:
> 
>> From: Linuxppc-dev Tony Prisk
>>> On 23/01/14 20:12, Yijing Wang wrote:
>>>> Currently, clocksource_register() and __clocksource_register_scale()
>>>> functions always return 0, it's pointless, make functions void.
>>>> And remove the dead code that check the clocksource_register_hz()
>>>> return value.
>>> ......
>>>> -static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>> +static inline void clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>>   {
>>>>   	return __clocksource_register_scale(cs, 1, hz);
>>>>   }
>>>
>>> This doesn't make sense - you are still returning a value on a function
>>> declared void, and the return is now from a function that doesn't return
>>> anything either ?!?!
>>> Doesn't this throw a compile-time warning??
>>
>> It depends on the compiler.
>> Recent gcc allow it.
>> I don't know if it is actually valid C though.
>>
>> There is no excuse for it on lines like the above though.
> 
> Can you please resend with that fixed against 3.14-rc1 ?

OK, I will resend later.
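
For reference, the fix requested above amounts to dropping the return
keyword in the wrapper. ISO C does not allow a return statement with an
expression in a function whose return type is void; GCC accepts it as an
extension when the expression itself has type void and reports it under
-Wpedantic. The sketch below uses hypothetical names (clocksource_model,
register_scale_model, register_hz_model), not the kernel's.

```c
#include <stdint.h>

struct clocksource_model {
	uint32_t scale;
	uint32_t freq;
	int registered;
};

/* Stand-in for a registration routine that can no longer fail. */
static void register_scale_model(struct clocksource_model *cs,
				 uint32_t scale, uint32_t freq)
{
	cs->scale = scale;
	cs->freq = freq;
	cs->registered = 1;
}

/* Plain call: no 'return expr;' in a function returning void. */
static inline void register_hz_model(struct clocksource_model *cs,
				     uint32_t hz)
{
	register_scale_model(cs, 1, hz);
}
```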

Thanks!
Yijing.


> 
> .
> 


-- 
Thanks!
Yijing


* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Joonsoo Kim @ 2014-02-10  1:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
	Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071147390.15168@nuc>

On Fri, Feb 07, 2014 at 11:49:57AM -0600, Christoph Lameter wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > This check would need to be something that checks for other contingencies
> > > in the page allocator as well. A simple solution would be to actually run
> > > a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
> > > If that fails then fallback. See how fallback_alloc() does it in slab.
> > >
> >
> > Hello, Christoph.
> >
> > This !node_present_pages() ensures that allocation on this node cannot succeed.
> > So we can directly use numa_mem_id() here.
> 
> Yes of course we can use numa_mem_id().
> 
> But the check is only for not having any memory at all on a node. There
> are other reasons for allocations to fail on a certain node. The node could
> have memory that cannot be reclaimed, all dirty, beyond certain
> thresholds, not in the current set of allowed nodes etc etc.

Yes. There are many other cases, but I prefer that we consider them separately.
Maybe they need another approach. For now, to solve the memoryless node problem,
my solution is sufficient and safe.
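
The check under discussion can be sketched as follows. This is a user-space
model with assumed names: struct topo_model, its nearest_mem_node[] table
and slab_target_node() are hypothetical and only stand in for the
node_present_pages() test and the numa_mem_id()-style fallback.

```c
#define TOPO_MAX_NODES 4

struct topo_model {
	unsigned long present_pages[TOPO_MAX_NODES];	/* 0 = memoryless */
	int nearest_mem_node[TOPO_MAX_NODES];		/* precomputed fallback */
};

/*
 * Redirect a request for a memoryless node to its fallback node.
 * Nodes that have memory are returned unchanged, even if an allocation
 * there could still fail for other reasons.
 */
static int slab_target_node(const struct topo_model *t, int node)
{
	if (t->present_pages[node] == 0)
		return t->nearest_mem_node[node];
	return node;
}
```

As the quoted objection points out, this covers only the case of a node
with no memory at all; a node with memory that is exhausted or not allowed
is passed through as-is.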

Thanks.


* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:15 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207213855.GA24989@linux.vnet.ibm.com>

On Fri, Feb 07, 2014 at 01:38:55PM -0800, Nishanth Aravamudan wrote:
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> 
> Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Hello,

I guess that your system has another problem that makes my patches ineffective.
It may also affect Christoph's patch. Could you check page_to_nid(),
numa_mem_id() and node_present_pages()? I am mostly suspicious of page_to_nid().

Thanks.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1402080154140.9668@chino.kir.corp.google.com>

On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > It seems like a better approach would be to do this when a node is brought 
> > > online and determine the fallback node based not on the zonelists as you 
> > > do here but rather on locality (such as through a SLIT if provided, see 
> > > node_distance()).
> > 
> > Hmm...
> > > I guess that zonelist is based on locality. Zonelist is generated using
> > > node_distance(), so I think that it reflects locality. But, I'm not an expert
> > on NUMA, so please let me know what I am missing here :)
> > 
> 
> The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> If your solution is going to become the generic kernel API that determines 
> what node has local memory for a particular node, then it will have to 
> support all definitions of node.  That includes nodes that consist solely 
> of I/O, chipsets, networking, or storage devices.  These nodes may not 
> have memory or cpus, so doing it as part of onlining cpus isn't going to 
> be generic enough.  You want a node_to_mem_node() API for all possible 
> node types (the possible node types listed above are straight from the 
> ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> X and we can optimize for that, but any solution that relies on cpu online 
> is probably shortsighted right now.
> 
> I think it would be much better to do this as a part of setting a node to 
> be online.

Okay. I got your point.
I will change it to rely on node online if this patch is really needed.
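
For illustration, a rough userspace sketch of what a distance-based
node_to_mem_node() table could look like if it were built when nodes come
online. This is only an assumption about the shape of such code: the
sketch_* names are hypothetical, distance[][] plays the role of
node_distance(), and has_memory[] says whether a node has present pages.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_NR_NODES 4

/*
 * For each node, record the closest node that has memory. A node with
 * memory maps to itself as long as its self-distance is the smallest
 * entry in its row, which is the usual SLIT convention. Entries are
 * set to -1 when no node has memory at all.
 */
static void sketch_build_mem_node_map(
		const int distance[SKETCH_NR_NODES][SKETCH_NR_NODES],
		const bool has_memory[SKETCH_NR_NODES],
		int mem_node[SKETCH_NR_NODES])
{
	for (int n = 0; n < SKETCH_NR_NODES; n++) {
		int best = -1;
		int best_dist = INT_MAX;

		for (int m = 0; m < SKETCH_NR_NODES; m++) {
			if (!has_memory[m])
				continue;
			assert(distance[n][m] >= 0);
			if (distance[n][m] < best_dist) {
				best_dist = distance[n][m];
				best = m;
			}
		}
		mem_node[n] = best;
	}
}
```

Since the table depends only on distances and on which nodes have memory,
it works the same for cpuless and I/O-only nodes, which is the case David
raises.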

Thanks!

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Stephen N Chivers @ 2014-02-09 19:42 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, linuxppc-dev, Stephen N Chivers
In-Reply-To: <alpine.LRH.2.00.1402071348380.10318@ra8135-ec1.am.freescale.net>

James Yang <James.Yang@freescale.com> wrote on 02/08/2014 07:49:40 AM:

> From: James Yang <James.Yang@freescale.com>
> To: Gabriel Paubert <paubert@iram.es>
> Cc: Stephen N Chivers <schivers@csc.com.au>, Chris Proctor 
> <cproctor@csc.com.au>, <linuxppc-dev@lists.ozlabs.org>
> Date: 02/08/2014 07:49 AM
> Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> 
> On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> 
> >    Hi Stephen,
> > 
> > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > 
> > > > From: Gabriel Paubert <paubert@iram.es>
> > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > > Date: 02/06/2014 07:26 PM
> > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > 
> > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> 
> > > > > 
> > > > >                 mask = 0;
> > > > >                 if (FM & (1 << 0))
> > > > >                         mask |= 0x0000000f;
> > > > >                 if (FM & (1 << 1))
> > > > >                         mask |= 0x000000f0;
> > > > >                 if (FM & (1 << 2))
> > > > >                         mask |= 0x00000f00;
> > > > >                 if (FM & (1 << 3))
> > > > >                         mask |= 0x0000f000;
> > > > >                 if (FM & (1 << 4))
> > > > >                         mask |= 0x000f0000;
> > > > >                 if (FM & (1 << 5))
> > > > >                         mask |= 0x00f00000;
> > > > >                 if (FM & (1 << 6))
> > > > >                         mask |= 0x0f000000;
> > > > >                 if (FM & (1 << 7))
> > > > >                         mask |= 0x90000000;
> > > > > 
> > > > > With the above mask computation I get consistent results for 
> > > > > both the MPC8548 and MPC7410 boards.
> > > > > 
> > > > > Am I missing something subtle?
> > > > 
> > > > No I think you are correct. This said, this code may probably be optimized
> > > > to eliminate a lot of the conditional branches. I think that:
> 
> 
> If the compiler is enabled to generate isel instructions, it would not 
> use a conditional branch for this code. (ignore the andi's values, 
> this is an old compile)
> 
From limited research, the 440GP is a processor
that doesn't implement the isel instruction and it does
not implement floating point.

The kernel emulates isel and so using that instruction
for the 440GP would have a double trap penalty.

Correct me if I am wrong, the isel instruction first appears
in PowerPC ISA v2.04 around mid 2007. 
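
As an aside, the corrected mask computation quoted above can also be
written without data-dependent branches and without isel. The following
is only a sketch under my own assumptions, not the form proposed earlier
in the thread (that part of the quote is trimmed). The function names are
made up for the example. Each FM bit is spread to the bottom of its
nibble, the nibbles are filled by multiplying by 15, and bits 29 and 30
are then cleared so that FM bit 7 yields 0x90000000 as in the quoted code.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form, transcribed from the corrected computation quoted above. */
static uint32_t mtfsf_mask_branchy(uint32_t fm)
{
	uint32_t mask = 0;

	assert(fm <= 0xff);	/* FM is an 8-bit field */
	if (fm & (1u << 0))
		mask |= 0x0000000f;
	if (fm & (1u << 1))
		mask |= 0x000000f0;
	if (fm & (1u << 2))
		mask |= 0x00000f00;
	if (fm & (1u << 3))
		mask |= 0x0000f000;
	if (fm & (1u << 4))
		mask |= 0x000f0000;
	if (fm & (1u << 5))
		mask |= 0x00f00000;
	if (fm & (1u << 6))
		mask |= 0x0f000000;
	if (fm & (1u << 7))
		mask |= 0x90000000;
	return mask;
}

/* Same result with no branch that depends on the value of fm. */
static uint32_t mtfsf_mask_branchless(uint32_t fm)
{
	uint32_t spread = 0;
	int i;

	assert(fm <= 0xff);
	/* Move FM bit i to bit 4*i; the trip count is fixed at 8. */
	for (i = 0; i < 8; i++)
		spread |= ((fm >> i) & 1u) << (4 * i);

	/* Fill each selected nibble, then drop bits 29 and 30. */
	return (spread * 0xfu) & 0x9fffffffu;
}
```

Whether this is actually faster than the branching form on a given core
would need measuring.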

Stephen Chivers,
CSC Australia Pty. Ltd. 

^ permalink raw reply

* [PATCH RFC/RFT v2 5/8] powerpc: move cacheinfo sysfs to generic cacheinfo infrastructure
From: Sudeep Holla @ 2014-02-07 16:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Mackerras, linuxppc-dev, sudeep.holla
In-Reply-To: <1391791763-28518-1-git-send-email-sudeep.holla@arm.com>

From: Sudeep Holla <sudeep.holla@arm.com>

This patch removes the redundant sysfs cacheinfo code by making use of
the newly introduced generic cacheinfo infrastructure.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/cacheinfo.c | 828 ++++++----------------------------------
 arch/powerpc/kernel/cacheinfo.h |   8 -
 arch/powerpc/kernel/sysfs.c     |   4 -
 3 files changed, 109 insertions(+), 731 deletions(-)
 delete mode 100644 arch/powerpc/kernel/cacheinfo.h

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index abfa011..05b7580 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -10,38 +10,10 @@
  * 2 as published by the Free Software Foundation.
  */
 
+#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
-#include <linux/cpumask.h>
 #include <linux/kernel.h>
-#include <linux/kobject.h>
-#include <linux/list.h>
-#include <linux/notifier.h>
 #include <linux/of.h>
-#include <linux/percpu.h>
-#include <linux/slab.h>
-#include <asm/prom.h>
-
-#include "cacheinfo.h"
-
-/* per-cpu object for tracking:
- * - a "cache" kobject for the top-level directory
- * - a list of "index" objects representing the cpu's local cache hierarchy
- */
-struct cache_dir {
-	struct kobject *kobj; /* bare (not embedded) kobject for cache
-			       * directory */
-	struct cache_index_dir *index; /* list of index objects */
-};
-
-/* "index" object: each cpu's cache directory has an index
- * subdirectory corresponding to a cache object associated with the
- * cpu.  This object's lifetime is managed via the embedded kobject.
- */
-struct cache_index_dir {
-	struct kobject kobj;
-	struct cache_index_dir *next; /* next index in parent directory */
-	struct cache *cache;
-};
 
 /* Template for determining which OF properties to query for a given
  * cache type */
@@ -60,11 +32,6 @@ struct cache_type_info {
 	const char *nr_sets_prop;
 };
 
-/* These are used to index the cache_type_info array. */
-#define CACHE_TYPE_UNIFIED     0
-#define CACHE_TYPE_INSTRUCTION 1
-#define CACHE_TYPE_DATA        2
-
 static const struct cache_type_info cache_type_info[] = {
 	{
 		/* PowerPC Processor binding says the [di]-cache-*
@@ -77,246 +44,115 @@ static const struct cache_type_info cache_type_info[] = {
 		.nr_sets_prop    = "d-cache-sets",
 	},
 	{
-		.name            = "Instruction",
-		.size_prop       = "i-cache-size",
-		.line_size_props = { "i-cache-line-size",
-				     "i-cache-block-size", },
-		.nr_sets_prop    = "i-cache-sets",
-	},
-	{
 		.name            = "Data",
 		.size_prop       = "d-cache-size",
 		.line_size_props = { "d-cache-line-size",
 				     "d-cache-block-size", },
 		.nr_sets_prop    = "d-cache-sets",
 	},
+	{
+		.name            = "Instruction",
+		.size_prop       = "i-cache-size",
+		.line_size_props = { "i-cache-line-size",
+				     "i-cache-block-size", },
+		.nr_sets_prop    = "i-cache-sets",
+	},
 };
 
-/* Cache object: each instance of this corresponds to a distinct cache
- * in the system.  There are separate objects for Harvard caches: one
- * each for instruction and data, and each refers to the same OF node.
- * The refcount of the OF node is elevated for the lifetime of the
- * cache object.  A cache object is released when its shared_cpu_map
- * is cleared (see cache_cpu_clear).
- *
- * A cache object is on two lists: an unsorted global list
- * (cache_list) of cache objects; and a singly-linked list
- * representing the local cache hierarchy, which is ordered by level
- * (e.g. L1d -> L1i -> L2 -> L3).
- */
-struct cache {
-	struct device_node *ofnode;    /* OF node for this cache, may be cpu */
-	struct cpumask shared_cpu_map; /* online CPUs using this cache */
-	int type;                      /* split cache disambiguation */
-	int level;                     /* level not explicit in device tree */
-	struct list_head list;         /* global list of cache objects */
-	struct cache *next_local;      /* next cache of >= level */
-};
-
-static DEFINE_PER_CPU(struct cache_dir *, cache_dir_pcpu);
-
-/* traversal/modification of this list occurs only at cpu hotplug time;
- * access is serialized by cpu hotplug locking
- */
-static LIST_HEAD(cache_list);
-
-static struct cache_index_dir *kobj_to_cache_index_dir(struct kobject *k)
-{
-	return container_of(k, struct cache_index_dir, kobj);
-}
-
-static const char *cache_type_string(const struct cache *cache)
+static inline int get_cacheinfo_idx(enum cache_type type)
 {
-	return cache_type_info[cache->type].name;
-}
-
-static void cache_init(struct cache *cache, int type, int level,
-		       struct device_node *ofnode)
-{
-	cache->type = type;
-	cache->level = level;
-	cache->ofnode = of_node_get(ofnode);
-	INIT_LIST_HEAD(&cache->list);
-	list_add(&cache->list, &cache_list);
-}
-
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
-{
-	struct cache *cache;
-
-	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
-	if (cache)
-		cache_init(cache, type, level, ofnode);
-
-	return cache;
-}
-
-static void release_cache_debugcheck(struct cache *cache)
-{
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list)
-		WARN_ONCE(iter->next_local == cache,
-			  "cache for %s(%s) refers to cache for %s(%s)\n",
-			  iter->ofnode->full_name,
-			  cache_type_string(iter),
-			  cache->ofnode->full_name,
-			  cache_type_string(cache));
-}
-
-static void release_cache(struct cache *cache)
-{
-	if (!cache)
-		return;
-
-	pr_debug("freeing L%d %s cache for %s\n", cache->level,
-		 cache_type_string(cache), cache->ofnode->full_name);
-
-	release_cache_debugcheck(cache);
-	list_del(&cache->list);
-	of_node_put(cache->ofnode);
-	kfree(cache);
-}
-
-static void cache_cpu_set(struct cache *cache, int cpu)
-{
-	struct cache *next = cache;
-
-	while (next) {
-		WARN_ONCE(cpumask_test_cpu(cpu, &next->shared_cpu_map),
-			  "CPU %i already accounted in %s(%s)\n",
-			  cpu, next->ofnode->full_name,
-			  cache_type_string(next));
-		cpumask_set_cpu(cpu, &next->shared_cpu_map);
-		next = next->next_local;
-	}
+	if (type == CACHE_TYPE_UNIFIED)
+		return 0;
+	else
+		return type;
 }
 
-static int cache_size(const struct cache *cache, unsigned int *ret)
+static int cache_size(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *cache_size;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].size_prop;
-
-	cache_size = of_get_property(cache->ofnode, propname, NULL);
-	if (!cache_size)
-		return -ENODEV;
-
-	*ret = of_read_number(cache_size, 1);
-	return 0;
-}
-
-static int cache_size_kb(const struct cache *cache, unsigned int *ret)
-{
-	unsigned int size;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].size_prop;
 
-	if (cache_size(cache, &size))
+	cache_size = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!cache_size) {
+		this_leaf->size = 0;
 		return -ENODEV;
-
-	*ret = size / 1024;
-	return 0;
+	} else {
+		this_leaf->size = of_read_number(cache_size, 1);
+		return 0;
+	}
 }
 
 /* not cache_line_size() because that's a macro in include/linux/cache.h */
-static int cache_get_line_size(const struct cache *cache, unsigned int *ret)
+static int cache_get_line_size(struct cache_info *this_leaf)
 {
 	const __be32 *line_size;
-	int i, lim;
+	int i, lim, ct_idx;
 
-	lim = ARRAY_SIZE(cache_type_info[cache->type].line_size_props);
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	lim = ARRAY_SIZE(cache_type_info[ct_idx].line_size_props);
 
 	for (i = 0; i < lim; i++) {
 		const char *propname;
 
-		propname = cache_type_info[cache->type].line_size_props[i];
-		line_size = of_get_property(cache->ofnode, propname, NULL);
+		propname = cache_type_info[ct_idx].line_size_props[i];
+		line_size = of_get_property(this_leaf->of_node, propname, NULL);
 		if (line_size)
 			break;
 	}
 
-	if (!line_size)
+	if (!line_size) {
+		this_leaf->coherency_line_size = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(line_size, 1);
-	return 0;
+	} else {
+		this_leaf->coherency_line_size = of_read_number(line_size, 1);
+		return 0;
+	}
 }
 
-static int cache_nr_sets(const struct cache *cache, unsigned int *ret)
+static int cache_nr_sets(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *nr_sets;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].nr_sets_prop;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].nr_sets_prop;
 
-	nr_sets = of_get_property(cache->ofnode, propname, NULL);
-	if (!nr_sets)
+	nr_sets = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!nr_sets) {
+		this_leaf->number_of_sets = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(nr_sets, 1);
-	return 0;
+	} else {
+		this_leaf->number_of_sets = of_read_number(nr_sets, 1);
+		return 0;
+	}
 }
 
-static int cache_associativity(const struct cache *cache, unsigned int *ret)
+static int cache_associativity(struct cache_info *this_leaf)
 {
-	unsigned int line_size;
-	unsigned int nr_sets;
-	unsigned int size;
-
-	if (cache_nr_sets(cache, &nr_sets))
-		goto err;
+	unsigned int line_size = this_leaf->coherency_line_size;
+	unsigned int nr_sets = this_leaf->number_of_sets;
+	unsigned int size = this_leaf->size;
 
 	/* If the cache is fully associative, there is no need to
 	 * check the other properties.
 	 */
 	if (nr_sets == 1) {
-		*ret = 0;
+		this_leaf->ways_of_associativity = 0;
 		return 0;
 	}
 
-	if (cache_get_line_size(cache, &line_size))
-		goto err;
-	if (cache_size(cache, &size))
-		goto err;
-
-	if (!(nr_sets > 0 && size > 0 && line_size > 0))
-		goto err;
-
-	*ret = (size / nr_sets) / line_size;
-	return 0;
-err:
-	return -ENODEV;
-}
-
-/* helper for dealing with split caches */
-static struct cache *cache_find_first_sibling(struct cache *cache)
-{
-	struct cache *iter;
-
-	if (cache->type == CACHE_TYPE_UNIFIED)
-		return cache;
-
-	list_for_each_entry(iter, &cache_list, list)
-		if (iter->ofnode == cache->ofnode && iter->next_local == cache)
-			return iter;
-
-	return cache;
-}
-
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
-{
-	struct cache *cache = NULL;
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list) {
-		if (iter->ofnode != node)
-			continue;
-		cache = cache_find_first_sibling(iter);
-		break;
+	if (!(nr_sets > 0 && size > 0 && line_size > 0)) {
+		this_leaf->ways_of_associativity = 0;
+		return -ENODEV;
+	} else {
+		this_leaf->ways_of_associativity = (size / nr_sets) / line_size;
+		return 0;
 	}
-
-	return cache;
 }
 
 static bool cache_node_is_unified(const struct device_node *np)
@@ -324,520 +160,74 @@ static bool cache_node_is_unified(const struct device_node *np)
 	return of_get_property(np, "cache-unified", NULL);
 }
 
-static struct cache *cache_do_one_devnode_unified(struct device_node *node,
-						  int level)
-{
-	struct cache *cache;
-
-	pr_debug("creating L%d ucache for %s\n", level, node->full_name);
-
-	cache = new_cache(CACHE_TYPE_UNIFIED, level, node);
-
-	return cache;
-}
-
-static struct cache *cache_do_one_devnode_split(struct device_node *node,
-						int level)
+static void ci_leaf_init(struct cache_info *this_leaf,
+				enum cache_type type, unsigned int level)
 {
-	struct cache *dcache, *icache;
-
-	pr_debug("creating L%d dcache and icache for %s\n", level,
-		 node->full_name);
-
-	dcache = new_cache(CACHE_TYPE_DATA, level, node);
-	icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node);
-
-	if (!dcache || !icache)
-		goto err;
-
-	dcache->next_local = icache;
-
-	return dcache;
-err:
-	release_cache(dcache);
-	release_cache(icache);
-	return NULL;
+	this_leaf->level = level;
+	this_leaf->type = type;
+	cache_size(this_leaf);
+	cache_get_line_size(this_leaf);
+	cache_nr_sets(this_leaf);
+	cache_associativity(this_leaf);
 }
 
-static struct cache *cache_do_one_devnode(struct device_node *node, int level)
+int init_cache_level(unsigned int cpu)
 {
-	struct cache *cache;
-
-	if (cache_node_is_unified(node))
-		cache = cache_do_one_devnode_unified(node, level);
-	else
-		cache = cache_do_one_devnode_split(node, level);
-
-	return cache;
-}
+	struct device_node *np;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	unsigned int level = 0, leaves = 0;
 
-static struct cache *cache_lookup_or_instantiate(struct device_node *node,
-						 int level)
-{
-	struct cache *cache;
-
-	cache = cache_lookup_by_node(node);
-
-	WARN_ONCE(cache && cache->level != level,
-		  "cache level mismatch on lookup (got %d, expected %d)\n",
-		  cache->level, level);
-
-	if (!cache)
-		cache = cache_do_one_devnode(node, level);
-
-	return cache;
-}
-
-static void link_cache_lists(struct cache *smaller, struct cache *bigger)
-{
-	while (smaller->next_local) {
-		if (smaller->next_local == bigger)
-			return; /* already linked */
-		smaller = smaller->next_local;
-	}
-
-	smaller->next_local = bigger;
-}
-
-static void do_subsidiary_caches_debugcheck(struct cache *cache)
-{
-	WARN_ON_ONCE(cache->level != 1);
-	WARN_ON_ONCE(strcmp(cache->ofnode->type, "cpu"));
-}
-
-static void do_subsidiary_caches(struct cache *cache)
-{
-	struct device_node *subcache_node;
-	int level = cache->level;
-
-	do_subsidiary_caches_debugcheck(cache);
-
-	while ((subcache_node = of_find_next_cache_node(cache->ofnode))) {
-		struct cache *subcache;
-
-		level++;
-		subcache = cache_lookup_or_instantiate(subcache_node, level);
-		of_node_put(subcache_node);
-		if (!subcache)
-			break;
-
-		link_cache_lists(cache, subcache);
-		cache = subcache;
-	}
-}
-
-static struct cache *cache_chain_instantiate(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cpu_cache = NULL;
-
-	pr_debug("creating cache object(s) for CPU %i\n", cpu_id);
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		goto out;
-
-	cpu_cache = cache_lookup_or_instantiate(cpu_node, 1);
-	if (!cpu_cache)
-		goto out;
-
-	do_subsidiary_caches(cpu_cache);
-
-	cache_cpu_set(cpu_cache, cpu_id);
-out:
-	of_node_put(cpu_node);
-
-	return cpu_cache;
-}
-
-static struct cache_dir *cacheinfo_create_cache_dir(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct device *dev;
-	struct kobject *kobj = NULL;
-
-	dev = get_cpu_device(cpu_id);
-	WARN_ONCE(!dev, "no dev for CPU %i\n", cpu_id);
-	if (!dev)
-		goto err;
-
-	kobj = kobject_create_and_add("cache", &dev->kobj);
-	if (!kobj)
-		goto err;
-
-	cache_dir = kzalloc(sizeof(*cache_dir), GFP_KERNEL);
-	if (!cache_dir)
-		goto err;
-
-	cache_dir->kobj = kobj;
-
-	WARN_ON_ONCE(per_cpu(cache_dir_pcpu, cpu_id) != NULL);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = cache_dir;
-
-	return cache_dir;
-err:
-	kobject_put(kobj);
-	return NULL;
-}
-
-static void cache_index_release(struct kobject *kobj)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(kobj);
-
-	pr_debug("freeing index directory for L%d %s cache\n",
-		 index->cache->level, cache_type_string(index->cache));
-
-	kfree(index);
-}
-
-static ssize_t cache_index_show(struct kobject *k, struct attribute *attr, char *buf)
-{
-	struct kobj_attribute *kobj_attr;
-
-	kobj_attr = container_of(attr, struct kobj_attribute, attr);
-
-	return kobj_attr->show(k, kobj_attr, buf);
-}
-
-static struct cache *index_kobj_to_cache(struct kobject *k)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(k);
-
-	return index->cache;
-}
-
-static ssize_t size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int size_kb;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_size_kb(cache, &size_kb))
+	if (!cpu_dev) {
+		pr_err("No cpu device for CPU %d\n", cpu);
 		return -ENODEV;
-
-	return sprintf(buf, "%uK\n", size_kb);
-}
-
-static struct kobj_attribute cache_size_attr =
-	__ATTR(size, 0444, size_show, NULL);
-
-
-static ssize_t line_size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int line_size;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_get_line_size(cache, &line_size))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", line_size);
-}
-
-static struct kobj_attribute cache_line_size_attr =
-	__ATTR(coherency_line_size, 0444, line_size_show, NULL);
-
-static ssize_t nr_sets_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int nr_sets;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_nr_sets(cache, &nr_sets))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", nr_sets);
-}
-
-static struct kobj_attribute cache_nr_sets_attr =
-	__ATTR(number_of_sets, 0444, nr_sets_show, NULL);
-
-static ssize_t associativity_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int associativity;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_associativity(cache, &associativity))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", associativity);
-}
-
-static struct kobj_attribute cache_assoc_attr =
-	__ATTR(ways_of_associativity, 0444, associativity_show, NULL);
-
-static ssize_t type_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	return sprintf(buf, "%s\n", cache_type_string(cache));
-}
-
-static struct kobj_attribute cache_type_attr =
-	__ATTR(type, 0444, type_show, NULL);
-
-static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-
-	return sprintf(buf, "%d\n", cache->level);
-}
-
-static struct kobj_attribute cache_level_attr =
-	__ATTR(level, 0444, level_show, NULL);
-
-static ssize_t shared_cpu_map_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-	int len;
-	int n = 0;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-	len = PAGE_SIZE - 2;
-
-	if (len > 1) {
-		n = cpumask_scnprintf(buf, len, &cache->shared_cpu_map);
-		buf[n++] = '\n';
-		buf[n] = '\0';
 	}
-	return n;
-}
-
-static struct kobj_attribute cache_shared_cpu_map_attr =
-	__ATTR(shared_cpu_map, 0444, shared_cpu_map_show, NULL);
-
-/* Attributes which should always be created -- the kobject/sysfs core
- * does this automatically via kobj_type->default_attrs.  This is the
- * minimum data required to uniquely identify a cache.
- */
-static struct attribute *cache_index_default_attrs[] = {
-	&cache_type_attr.attr,
-	&cache_level_attr.attr,
-	&cache_shared_cpu_map_attr.attr,
-	NULL,
-};
-
-/* Attributes which should be created if the cache device node has the
- * right properties -- see cacheinfo_create_index_opt_attrs
- */
-static struct kobj_attribute *cache_index_opt_attrs[] = {
-	&cache_size_attr,
-	&cache_line_size_attr,
-	&cache_nr_sets_attr,
-	&cache_assoc_attr,
-};
-
-static const struct sysfs_ops cache_index_ops = {
-	.show = cache_index_show,
-};
-
-static struct kobj_type cache_index_type = {
-	.release = cache_index_release,
-	.sysfs_ops = &cache_index_ops,
-	.default_attrs = cache_index_default_attrs,
-};
-
-static void cacheinfo_create_index_opt_attrs(struct cache_index_dir *dir)
-{
-	const char *cache_name;
-	const char *cache_type;
-	struct cache *cache;
-	char *buf;
-	int i;
-
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!buf)
-		return;
-
-	cache = dir->cache;
-	cache_name = cache->ofnode->full_name;
-	cache_type = cache_type_string(cache);
-
-	/* We don't want to create an attribute that can't provide a
-	 * meaningful value.  Check the return value of each optional
-	 * attribute's ->show method before registering the
-	 * attribute.
-	 */
-	for (i = 0; i < ARRAY_SIZE(cache_index_opt_attrs); i++) {
-		struct kobj_attribute *attr;
-		ssize_t rc;
-
-		attr = cache_index_opt_attrs[i];
-
-		rc = attr->show(&dir->kobj, attr, buf);
-		if (rc <= 0) {
-			pr_debug("not creating %s attribute for "
-				 "%s(%s) (rc = %zd)\n",
-				 attr->attr.name, cache_name,
-				 cache_type, rc);
-			continue;
-		}
-		if (sysfs_create_file(&dir->kobj, &attr->attr))
-			pr_debug("could not create %s attribute for %s(%s)\n",
-				 attr->attr.name, cache_name, cache_type);
+	np = cpu_dev->of_node;
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
 
-	kfree(buf);
-}
-
-static void cacheinfo_create_index_dir(struct cache *cache, int index,
-				       struct cache_dir *cache_dir)
-{
-	struct cache_index_dir *index_dir;
-	int rc;
-
-	index_dir = kzalloc(sizeof(*index_dir), GFP_KERNEL);
-	if (!index_dir)
-		goto err;
-
-	index_dir->cache = cache;
-
-	rc = kobject_init_and_add(&index_dir->kobj, &cache_index_type,
-				  cache_dir->kobj, "index%d", index);
-	if (rc)
-		goto err;
-
-	index_dir->next = cache_dir->index;
-	cache_dir->index = index_dir;
-
-	cacheinfo_create_index_opt_attrs(index_dir);
-
-	return;
-err:
-	kfree(index_dir);
-}
-
-static void cacheinfo_sysfs_populate(unsigned int cpu_id,
-				     struct cache *cache_list)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-	int index = 0;
-
-	cache_dir = cacheinfo_create_cache_dir(cpu_id);
-	if (!cache_dir)
-		return;
-
-	cache = cache_list;
-	while (cache) {
-		cacheinfo_create_index_dir(cache, index, cache_dir);
-		index++;
-		cache = cache->next_local;
+	while (np) {
+		leaves += cache_node_is_unified(np) ? 1 : 2;
+		level++;
+		of_node_put(np);
+		np = of_find_next_cache_node(np);
 	}
-}
-
-void cacheinfo_cpu_online(unsigned int cpu_id)
-{
-	struct cache *cache;
-
-	cache = cache_chain_instantiate(cpu_id);
-	if (!cache)
-		return;
-
-	cacheinfo_sysfs_populate(cpu_id, cache);
-}
-
-#ifdef CONFIG_HOTPLUG_CPU /* functions needed for cpu offline */
+	this_cpu_ci->num_levels = level;
+	this_cpu_ci->num_leaves = leaves;
 
-static struct cache *cache_lookup_by_cpu(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cache;
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		return NULL;
-
-	cache = cache_lookup_by_node(cpu_node);
-	of_node_put(cpu_node);
-
-	return cache;
+	return 0;
 }
 
-static void remove_index_dirs(struct cache_dir *cache_dir)
+int populate_cache_leaves(unsigned int cpu)
 {
-	struct cache_index_dir *index;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cache_info *this_leaf = this_cpu_ci->info_list;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct device_node *np;
+	unsigned int level, idx;
 
-	index = cache_dir->index;
-
-	while (index) {
-		struct cache_index_dir *next;
-
-		next = index->next;
-		kobject_put(&index->kobj);
-		index = next;
+	np = of_node_get(cpu_dev->of_node);
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
-}
 
-static void remove_cache_dir(struct cache_dir *cache_dir)
-{
-	remove_index_dirs(cache_dir);
-
-	kobject_put(cache_dir->kobj);
-
-	kfree(cache_dir);
-}
-
-static void cache_cpu_clear(struct cache *cache, int cpu)
-{
-	while (cache) {
-		struct cache *next = cache->next_local;
-
-		WARN_ONCE(!cpumask_test_cpu(cpu, &cache->shared_cpu_map),
-			  "CPU %i not accounted in %s(%s)\n",
-			  cpu, cache->ofnode->full_name,
-			  cache_type_string(cache));
-
-		cpumask_clear_cpu(cpu, &cache->shared_cpu_map);
-
-		/* Release the cache object if all the cpus using it
-		 * are offline */
-		if (cpumask_empty(&cache->shared_cpu_map))
-			release_cache(cache);
-
-		cache = next;
+	for (idx = 0, level = 1; level <= this_cpu_ci->num_levels &&
+			idx < this_cpu_ci->num_leaves; idx++, level++) {
+		if (!this_leaf)
+			return -EINVAL;
+
+		this_leaf->of_node = np;
+		if (cache_node_is_unified(np)) {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_UNIFIED, level);
+		} else {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_DATA, level);
+			ci_leaf_init(this_leaf++, CACHE_TYPE_INST, level);
+		}
+		np = of_find_next_cache_node(np);
 	}
+	return 0;
 }
 
-void cacheinfo_cpu_offline(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-
-	/* Prevent userspace from seeing inconsistent state - remove
-	 * the sysfs hierarchy first */
-	cache_dir = per_cpu(cache_dir_pcpu, cpu_id);
-
-	/* careful, sysfs population may have failed */
-	if (cache_dir)
-		remove_cache_dir(cache_dir);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = NULL;
-
-	/* clear the CPU's bit in its cache chain, possibly freeing
-	 * cache objects */
-	cache = cache_lookup_by_cpu(cpu_id);
-	if (cache)
-		cache_cpu_clear(cache, cpu_id);
-}
-#endif /* CONFIG_HOTPLUG_CPU */
diff --git a/arch/powerpc/kernel/cacheinfo.h b/arch/powerpc/kernel/cacheinfo.h
deleted file mode 100644
index a7b74d3..0000000
--- a/arch/powerpc/kernel/cacheinfo.h
+++ /dev/null
@@ -1,8 +0,0 @@
-#ifndef _PPC_CACHEINFO_H
-#define _PPC_CACHEINFO_H
-
-/* These are just hooks for sysfs.c to use. */
-extern void cacheinfo_cpu_online(unsigned int cpu_id);
-extern void cacheinfo_cpu_offline(unsigned int cpu_id);
-
-#endif /* _PPC_CACHEINFO_H */
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index d4a43e6..935929b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,8 +19,6 @@
 #include <asm/pmc.h>
 #include <asm/firmware.h>
 
-#include "cacheinfo.h"
-
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 #include <asm/lppaca.h>
@@ -732,7 +730,6 @@ static void register_cpu_online(unsigned int cpu)
 		device_create_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_online(cpu);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -813,7 +810,6 @@ static void unregister_cpu_online(unsigned int cpu)
 		device_remove_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_offline(cpu);
 }
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH v2] powerpc/powernv: Platform dump interface
From: Anton Blanchard @ 2014-02-08 21:20 UTC (permalink / raw)
  To: Vasant Hegde; +Cc: linuxppc-dev
In-Reply-To: <20140116121411.624.55662.stgit@hegdevasant.in.ibm.com>


Hi Vasant,

> +static void free_dump_sg_list(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg1;
> +	while (list) {
> +		sg1 = list->next;
> +		kfree(list);
> +		list = sg1;
> +	}
> +	list = NULL;
> +}
> +
> +/*
> + * Build dump buffer scatter gather list
> + */
> +static struct opal_sg_list *dump_data_to_sglist(void)
> +{
> +	struct opal_sg_list *sg1, *list = NULL;
> +	void *addr;
> +	int64_t size;
> +
> +	addr = dump_record.buffer;
> +	size = dump_record.size;
> +
> +	sg1 = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!sg1)
> +		goto nomem;
> +
> +	list = sg1;
> +	sg1->num_entries = 0;
> +	while (size > 0) {
> +		/* Translate virtual address to physical address */
> +		sg1->entry[sg1->num_entries].data =
> +			(void *)(vmalloc_to_pfn(addr) << PAGE_SHIFT);
> +
> +		if (size > PAGE_SIZE)
> +			sg1->entry[sg1->num_entries].length = PAGE_SIZE;
> +		else
> +			sg1->entry[sg1->num_entries].length = size;
> +
> +		sg1->num_entries++;
> +		if (sg1->num_entries >= SG_ENTRIES_PER_NODE) {
> +			sg1->next = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +			if (!sg1->next)
> +				goto nomem;
> +
> +			sg1 = sg1->next;
> +			sg1->num_entries = 0;
> +		}
> +		addr += PAGE_SIZE;
> +		size -= PAGE_SIZE;
> +	}
> +	return list;
> +
> +nomem:
> +	pr_err("%s : Failed to allocate memory\n", __func__);
> +	free_dump_sg_list(list);
> +	return NULL;
> +}
> +
> +/*
> + * Translate sg list address to absolute
> + */
> +static void sglist_to_phy_addr(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg, *next;
> +
> +	for (sg = list; sg; sg = next) {
> +		next = sg->next;
> +		/* Don't translate NULL pointer for last entry */
> +		if (sg->next)
> +			sg->next = (struct opal_sg_list *)__pa(sg->next);
> +		else
> +			sg->next = NULL;
> +
> +		/* Convert num_entries to length */
> +		sg->num_entries =
> +			sg->num_entries * sizeof(struct opal_sg_entry) + 16;
> +	}
> +}
> +
> +static void free_dump_data_buf(void)
> +{
> +	vfree(dump_record.buffer);
> +	dump_record.size = 0;
> +}

This looks identical to the code in opal-flash.c. Considering how
complicated it is, can we put it somewhere common?

Anton

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: David Rientjes @ 2014-02-08  9:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054819.GC28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought 
> > online and determine the fallback node based not on the zonelists as you 
> > do here but rather on locality (such as through a SLIT if provided, see 
> > node_distance()).
> 
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated using
> node_distance(), so I think that it reflects locality. But I'm not an expert
> on NUMA, so please let me know what I am missing here :)
> 

The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
If your solution is going to become the generic kernel API that determines 
what node has local memory for a particular node, then it will have to 
support all definitions of node.  That includes nodes that consist solely 
of I/O, chipsets, networking, or storage devices.  These nodes may not 
have memory or cpus, so doing it as part of onlining cpus isn't going to 
be generic enough.  You want a node_to_mem_node() API for all possible 
node types (the possible node types listed above are straight from the 
ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
X and we can optimize for that, but any solution that relies on cpu online 
is probably shortsighted right now.

I think it would be much better to do this as a part of setting a node to 
be online.
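
To make the suggestion concrete, here is a minimal userspace sketch of
filling in a node_to_mem_node() table when a node is brought online,
independent of whether that node has cpus. Everything in it is an
assumption made for illustration: the four-node topology, the SLIT-style
distance values, and the helper names node_online() and online_all() are
invented, and only node_to_mem_node() is a name taken from the text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/* Hypothetical topology: node 1 is memoryless, node 3 is an I/O-only node. */
static const bool has_memory[NR_NODES] = { true, false, true, false };

/* SLIT-style distances (10 == local); the values are invented. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static int mem_node[NR_NODES];

/* When a node comes online, record the nearest node that has memory. */
void node_online(int node)
{
	int best = NO_NODE, best_dist = INT_MAX, n;

	assert(node >= 0 && node < NR_NODES);
	for (n = 0; n < NR_NODES; n++) {
		if (has_memory[n] && distance[node][n] < best_dist) {
			best = n;
			best_dist = distance[node][n];
		}
	}
	mem_node[node] = best;
}

/* For a node with memory this is the node itself (distance 10 wins). */
int node_to_mem_node(int node)
{
	assert(node >= 0 && node < NR_NODES);
	return mem_node[node];
}

/* Bring every node in the model online. */
void online_all(void)
{
	int n;

	for (n = 0; n < NR_NODES; n++)
		node_online(n);
}
```

In this model the lookup happens once per node at online time, so the
common case where a node has its own memory costs a single table read.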

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-07 21:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.

Hi Christoph, should this be tested instead of Joonsoo's patch 2 (and 3)?

Thanks,
Nish

^ permalink raw reply

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-07 20:49 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <20140207101036.GA823@visitor2.iram.es>

On Fri, 7 Feb 2014, Gabriel Paubert wrote:

> 	Hi Stephen,
> 
> On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > 
> > > From: Gabriel Paubert <paubert@iram.es>
> > > To: Stephen N Chivers <schivers@csc.com.au>
> > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > Date: 02/06/2014 07:26 PM
> > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > 
> > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:

> > > > 
> > > >                 mask = 0;
> > > >                 if (FM & (1 << 0))
> > > >                         mask |= 0x0000000f;
> > > >                 if (FM & (1 << 1))
> > > >                         mask |= 0x000000f0;
> > > >                 if (FM & (1 << 2))
> > > >                         mask |= 0x00000f00;
> > > >                 if (FM & (1 << 3))
> > > >                         mask |= 0x0000f000;
> > > >                 if (FM & (1 << 4))
> > > >                         mask |= 0x000f0000;
> > > >                 if (FM & (1 << 5))
> > > >                         mask |= 0x00f00000;
> > > >                 if (FM & (1 << 6))
> > > >                         mask |= 0x0f000000;
> > > >                 if (FM & (1 << 7))
> > > >                         mask |= 0x90000000;
> > > > 
> > > > With the above mask computation I get consistent results for 
> > > > both the MPC8548 and MPC7410 boards.
> > > > 
> > > > Am I missing something subtle?
> > > 
> > > No, I think you are correct. That said, this code could probably be
> > > optimized to eliminate a lot of the conditional branches. I think that:


If the compiler is allowed to generate isel instructions, it would not
use a conditional branch for this code. (Ignore the andi. operand values;
this is from an old compile.)

c0037c2c <mtfsf>:
c0037c2c:       2c 03 00 00     cmpwi   r3,0
c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
c0037c3c:       70 66 00 01     andi.   r6,r3,1
c0037c40:       3d 00 90 00     lis     r8,-28672
c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
c0037c48:       70 6a 00 02     andi.   r10,r3,2
c0037c4c:       65 28 0f 00     oris    r8,r9,3840
c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c54:       70 66 00 04     andi.   r6,r3,4
c0037c58:       65 28 00 f0     oris    r8,r9,240
c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c60:       70 6a 00 08     andi.   r10,r3,8
c0037c64:       65 28 00 0f     oris    r8,r9,15
c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c6c:       70 66 00 10     andi.   r6,r3,16
c0037c70:       61 28 f0 00     ori     r8,r9,61440
c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c78:       70 6a 00 20     andi.   r10,r3,32
c0037c7c:       61 28 0f 00     ori     r8,r9,3840
c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
c0037c8c:       70 66 00 40     andi.   r6,r3,64
c0037c90:       61 28 00 f0     ori     r8,r9,240
c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
c0037c9c:       61 29 00 0f     ori     r9,r9,15
	...
 
However, your other solutions are better.


> > > 
> > > mask = (FM & 1);
> > > mask |= (FM << 3) & 0x10;
> > > mask |= (FM << 6) & 0x100;
> > > mask |= (FM << 9) & 0x1000;
> > > mask |= (FM << 12) & 0x10000;
> > > mask |= (FM << 15) & 0x100000;
> > > mask |= (FM << 18) & 0x1000000;
> > > mask |= (FM << 21) & 0x10000000;
> > > mask *= 15;
> > > 
> > > should do the job, in less code space and without a single branch.
> > > 
> > > Each one of the "mask |=" lines should be translated into an
> > > rlwinm instruction followed by an "or". Actually it should be possible
> > > to transform each of these lines into a single "rlwimi" instruction
> > > but I don't know how to coerce gcc to reach this level of optimization.
> > > 
> > > Another way of optimizing this could be:
> > > 
> > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > mask *= 15;
> > > 


> Ok, I finally edited my sources and test compiled the suggestions
> I gave. I'd say that method2 is the best overall indeed. You can
> actually save one more instruction by setting mask to all ones in 
> the case FM=0xff, but that's about all in this area.


My measurements show method1 to be smaller and faster than method2 due 
to the number of instructions needed to generate the constant masks in 
method2, but it may depend upon your compiler and hardware.  Both are 
faster than the original with isel.
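
For anyone who wants to check the two branch-free variants quoted above on
a host machine, here is a minimal userspace sketch. The function names
mask_method1(), mask_method2() and methods_agree() are made up for the
example and do not come from mtfsf.c. Both variants expand each of the
eight FM bits into a full 0xf nibble, so they do not reproduce the
0x90000000 value used for FM bit 7 in the branchy version.

```c
#include <assert.h>
#include <stdint.h>

/* method1: move each FM bit to the low bit of its nibble, then scale by 15. */
uint32_t mask_method1(uint32_t fm)
{
	uint32_t mask;

	assert(fm <= 0xff);	/* FM is an 8-bit field */
	mask  = fm & 1;
	mask |= (fm << 3) & 0x10;
	mask |= (fm << 6) & 0x100;
	mask |= (fm << 9) & 0x1000;
	mask |= (fm << 12) & 0x10000;
	mask |= (fm << 15) & 0x100000;
	mask |= (fm << 18) & 0x1000000;
	mask |= (fm << 21) & 0x10000000;
	return mask * 15;
}

/* method2: spread the bits in three halving steps, then scale by 15. */
uint32_t mask_method2(uint32_t fm)
{
	uint32_t mask;

	assert(fm <= 0xff);
	/* bits 0-3 stay put, bits 4-7 move to bits 16-19 */
	mask = (fm & 0x0f) | ((fm << 12) & 0x000f0000);
	/* split each group of four into two groups of two, one per byte */
	mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
	/* split each pair so every bit sits at the bottom of its own nibble */
	mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
	return mask * 15;
}

/* Returns 1 if both methods agree for every possible 8-bit FM value. */
int methods_agree(void)
{
	uint32_t fm;

	for (fm = 0; fm <= 0xff; fm++)
		if (mask_method1(fm) != mask_method2(fm))
			return 0;
	return 1;
}
```

Calling methods_agree() walks all 256 FM values, which is cheap enough to
run as a sanity check before comparing code size or timing.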

^ permalink raw reply

* [PATCH] powerpc: fix build failure in sysdev/mpic.c for MPIC_WEIRD=y
From: Paul Gortmaker @ 2014-02-07 19:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras; +Cc: Paul Gortmaker, linuxppc-dev

Commit 446f6d06fab0b49c61887ecbe8286d6aaa796637 ("powerpc/mpic: Properly
set default triggers") breaks the mpc7447_hpc_defconfig as follows:

  CC      arch/powerpc/sysdev/mpic.o
arch/powerpc/sysdev/mpic.c: In function 'mpic_set_irq_type':
arch/powerpc/sysdev/mpic.c:886:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:890:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:894:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:898:9: error: case label does not reduce to an integer constant

Looking at the cpp output (gcc 4.7.3), I see:

   case mpic->hw_set[MPIC_IDX_VECPRI_SENSE_EDGE] |
        mpic->hw_set[MPIC_IDX_VECPRI_POLARITY_POSITIVE]:

The pointer into an array appears because CONFIG_MPIC_WEIRD=y is set
for this platform, thus enabling the following:

  -------------------
  #ifdef CONFIG_MPIC_WEIRD
  static u32 mpic_infos[][MPIC_IDX_END] = {
        [0] = { /* Original OpenPIC compatible MPIC */

  [...]

  #define MPIC_INFO(name) mpic->hw_set[MPIC_IDX_##name]

  #else /* CONFIG_MPIC_WEIRD */

  #define MPIC_INFO(name) MPIC_##name

  #endif /* CONFIG_MPIC_WEIRD */
  -------------------

Here we convert the case section to if/else if, and also add
the equivalent of a default case to warn about unknown types.
Boot tested on sbc8548, build tested on all defconfigs.
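
As an illustration outside the kernel: C requires every case label to be
an integer constant expression, so a label that expands to an array load
cannot compile, while an if/else chain compares against run-time values
without complaint. The sketch below is a userspace model only. The hw_set
table, the IDX_* and FLOW_* enums, the INFO() macro and decode_flow() are
stand-ins invented for the example, and the table values are made up.

```c
#include <assert.h>

enum { IDX_SENSE_EDGE, IDX_SENSE_LEVEL, IDX_POL_POSITIVE, IDX_POL_NEGATIVE, IDX_END };

enum flow {
	FLOW_UNKNOWN,
	FLOW_EDGE_RISING,
	FLOW_EDGE_FALLING,
	FLOW_LEVEL_HIGH,
	FLOW_LEVEL_LOW,
};

/* Run-time table standing in for a per-variant register layout. */
static const unsigned int hw_set[IDX_END] = {
	[IDX_SENSE_EDGE]   = 0x0,
	[IDX_SENSE_LEVEL]  = 0x2,
	[IDX_POL_POSITIVE] = 0x1,
	[IDX_POL_NEGATIVE] = 0x0,
};

/* Expands to an array load, which is not an integer constant expression. */
#define INFO(name) hw_set[IDX_##name]

enum flow decode_flow(unsigned int bits)
{
	/*
	 * A "case INFO(SENSE_EDGE) | INFO(POL_POSITIVE):" label would be
	 * rejected by the compiler here, so compare at run time instead.
	 */
	if (bits == (INFO(SENSE_EDGE) | INFO(POL_POSITIVE)))
		return FLOW_EDGE_RISING;
	else if (bits == (INFO(SENSE_EDGE) | INFO(POL_NEGATIVE)))
		return FLOW_EDGE_FALLING;
	else if (bits == (INFO(SENSE_LEVEL) | INFO(POL_POSITIVE)))
		return FLOW_LEVEL_HIGH;
	else if (bits == (INFO(SENSE_LEVEL) | INFO(POL_NEGATIVE)))
		return FLOW_LEVEL_LOW;

	/* Equivalent of the default case: nothing matched. */
	return FLOW_UNKNOWN;
}
```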

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 arch/powerpc/sysdev/mpic.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index 0e166ed4cd16..8209744b2829 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -886,25 +886,25 @@ int mpic_set_irq_type(struct irq_data *d, unsigned int flow_type)
 
 	/* Default: read HW settings */
 	if (flow_type == IRQ_TYPE_DEFAULT) {
-		switch(vold & (MPIC_INFO(VECPRI_POLARITY_MASK) |
-			       MPIC_INFO(VECPRI_SENSE_MASK))) {
-			case MPIC_INFO(VECPRI_SENSE_EDGE) |
-			     MPIC_INFO(VECPRI_POLARITY_POSITIVE):
-				flow_type = IRQ_TYPE_EDGE_RISING;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_EDGE) |
-			     MPIC_INFO(VECPRI_POLARITY_NEGATIVE):
-				flow_type = IRQ_TYPE_EDGE_FALLING;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_LEVEL) |
-			     MPIC_INFO(VECPRI_POLARITY_POSITIVE):
-				flow_type = IRQ_TYPE_LEVEL_HIGH;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_LEVEL) |
-			     MPIC_INFO(VECPRI_POLARITY_NEGATIVE):
-				flow_type = IRQ_TYPE_LEVEL_LOW;
-				break;
-		}
+		int vold_ps;
+
+		vold_ps = vold & (MPIC_INFO(VECPRI_POLARITY_MASK) |
+				  MPIC_INFO(VECPRI_SENSE_MASK));
+
+		if (vold_ps == (MPIC_INFO(VECPRI_SENSE_EDGE) |
+				MPIC_INFO(VECPRI_POLARITY_POSITIVE)))
+			flow_type = IRQ_TYPE_EDGE_RISING;
+		else if	(vold_ps == (MPIC_INFO(VECPRI_SENSE_EDGE) |
+				     MPIC_INFO(VECPRI_POLARITY_NEGATIVE)))
+			flow_type = IRQ_TYPE_EDGE_FALLING;
+		else if (vold_ps == (MPIC_INFO(VECPRI_SENSE_LEVEL) |
+				     MPIC_INFO(VECPRI_POLARITY_POSITIVE)))
+			flow_type = IRQ_TYPE_LEVEL_HIGH;
+		else if (vold_ps == (MPIC_INFO(VECPRI_SENSE_LEVEL) |
+				     MPIC_INFO(VECPRI_POLARITY_NEGATIVE)))
+			flow_type = IRQ_TYPE_LEVEL_LOW;
+		else
+			WARN_ONCE(1, "mpic: unknown IRQ type %d\n", vold);
 	}
 
 	/* Apply to irq desc */
-- 
1.8.5.2

^ permalink raw reply related

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-07 18:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071150090.15168@nuc>

Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if it's there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fall back to a possible node and
that will become the current slab. The next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since it's global and there may
be thread-specific allocation restrictions, but it would be expensive to do
an allocation attempt via the page allocator to make sure that there is
really no page available from the page allocator.
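
The matching rule described above can be modelled in userspace roughly as
follows. This is a sketch that follows the prose description and does not
reproduce the diff below line by line. In particular it marks the
requested node as empty whenever the page came from somewhere else, which
is one reading of "not satisfied". The names node_empty, note_alloc() and
slab_matches() are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4
#define NO_NODE   (-1)

/* Set when an allocation aimed at a node could not be satisfied from it. */
static bool node_empty[MAX_NODES];

/*
 * Record the outcome of a slab allocation. 'requested' is the node asked
 * for, 'got' is the node the page came from, or NO_NODE if it failed.
 */
void note_alloc(int requested, int got)
{
	assert(requested < MAX_NODES && got < MAX_NODES);

	/* A successful alloc on a node clears its flag. */
	if (got != NO_NODE)
		node_empty[got] = false;

	/* The requested node did not supply the page: remember that. */
	if (requested != NO_NODE && got != requested)
		node_empty[requested] = true;
}

/* Decide whether the current slab may serve a request for 'requested'. */
bool slab_matches(bool have_slab, int slab_node, int requested)
{
	/* No current slab means no match. */
	if (!have_slab)
		return false;

	/* The caller does not care about the node. */
	if (requested == NO_NODE)
		return true;

	/* The slab is on the requested node. */
	if (slab_node == requested)
		return true;

	/* The requested node is known to be empty: take what we have. */
	return node_empty[requested];
}
```

The global flag array has the drawback already mentioned: it cannot
express per-thread allocation restrictions.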

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-07 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207171224.GR5002@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 06:12:24PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:58:01PM +0100, Torsten Duwe wrote:
> > +static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
> >  {
> > +	register struct __raw_tickets old, tmp,
> > +		inc = { .tail = TICKET_LOCK_INC };
> > +
> >  	CLEAR_IO_SYNC;
> > +	__asm__ __volatile__(
> > +"1:	lwarx	%0,0,%4		# arch_spin_lock\n"
> > +"	add	%1,%3,%0\n"
> > +	PPC405_ERR77(0, "%4")
> > +"	stwcx.	%1,0,%4\n"
> > +"	bne-	1b"
> > +	: "=&r" (old), "=&r" (tmp), "+m" (lock->tickets)
> > +	: "r" (inc), "r" (&lock->tickets)
> > +	: "cc");
> > +
> > +	if (likely(old.head == old.tail))
> > +		goto out;
> 
> I would have expected an lwsync someplace hereabouts.

Let me reconsider this. The v1 code worked on an 8 core,
maybe I didn't beat it enough.

> >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >  {
> > +	arch_spinlock_t old, new;
> > +
> > +#if defined(CONFIG_PPC_SPLPAR)
> > +	lock->holder = 0;
> > +#endif
> > +	do {
> > +		old.tickets = ACCESS_ONCE(lock->tickets);
> > +		new.tickets.head = old.tickets.head + TICKET_LOCK_INC;
> > +		new.tickets.tail = old.tickets.tail;
> > +	} while (unlikely(__arch_spin_cmpxchg_eq(lock,
> > +						 old.head_tail,
> > +						 new.head_tail)));
> >  	SYNC_IO;
> >  	__asm__ __volatile__("# arch_spin_unlock\n\t"
> >  				PPC_RELEASE_BARRIER: : :"memory");
> 
> > Doesn't your cmpxchg_eq already imply an lwsync?

Right.

> > -	lock->slock = 0;
> >  }
> 
> I'm still failing to see why you need an ll/sc pair for unlock.

Like so:
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	arch_spinlock_t tmp;

#if defined(CONFIG_PPC_SPLPAR)
	lock->holder = 0;
#endif
	tmp.tickets = ACCESS_ONCE(lock->tickets);
	tmp.tickets.head += TICKET_LOCK_INC;
	lock->tickets.head = tmp.tickets.head;
	SYNC_IO;
	__asm__ __volatile__("# arch_spin_unlock\n\t"
				PPC_RELEASE_BARRIER: : :"memory");
}
?
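
For comparison, the same idea can be sketched in portable C11 atomics
rather than PowerPC assembly. This is a minimal model under stated
assumptions, not the proposed kernel code: head and tail are separate
16-bit fields, the memory orderings are the generic acquire/release ones,
and the ticket_lock_* names are invented. It shows that only taking a
ticket needs an atomic read-modify-write, because only the lock holder
ever writes head.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct ticket_lock {
	_Atomic uint16_t head;	/* ticket currently being served */
	_Atomic uint16_t tail;	/* next ticket to hand out */
};

void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->head, 0);
	atomic_init(&l->tail, 0);
}

void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Taking a ticket must be atomic: several CPUs may race here. */
	uint16_t me = atomic_fetch_add_explicit(&l->tail, 1,
						memory_order_relaxed);

	/* Spin until our number is served; acquire orders what follows. */
	while (atomic_load_explicit(&l->head, memory_order_acquire) != me)
		;
}

void ticket_lock_release(struct ticket_lock *l)
{
	/* Only the holder writes head, so a plain load plus store suffices. */
	uint16_t next = (uint16_t)(atomic_load_explicit(&l->head,
						memory_order_relaxed) + 1);

	/* Release store: critical-section writes are visible before handover. */
	atomic_store_explicit(&l->head, next, memory_order_release);
}

/* Non-zero while some ticket is outstanding (holder or waiters). */
int ticket_lock_is_held(struct ticket_lock *l)
{
	return atomic_load(&l->head) != atomic_load(&l->tail);
}
```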

I'll wrap it all up next week. I only wanted to post an updated v2
with the agreed-upon changes for BenH.

Thanks so far!

	Torsten

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-07 17:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054819.GC28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> >
> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated using
> node_distance(), so I think that it reflects locality. But I'm not an expert
> on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one node to another.
The building of the fallback list for every node in build_zonelists()
relies on that.
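
A userspace sketch of that relationship, under invented assumptions: the
four-node distance table is made up, and build_fallback_order(),
first_node_with_memory() and pick_node() are hypothetical helpers rather
than kernel functions. The order is sorted by distance from the starting
node, and the walk returns the first entry that still has free pages.

```c
#include <assert.h>

#define NR_NODES 4

/* SLIT-style distances (10 == local); the values are invented. */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Fill order[] with all node ids sorted by increasing distance from 'node'. */
void build_fallback_order(int node, int order[NR_NODES])
{
	int i, j;

	assert(node >= 0 && node < NR_NODES);
	for (i = 0; i < NR_NODES; i++) {
		/* Insertion sort; ties keep the lower node id first. */
		for (j = i; j > 0 &&
		     node_dist[node][order[j - 1]] > node_dist[node][i]; j--)
			order[j] = order[j - 1];
		order[j] = i;
	}
}

/* Walk the order and return the first node with free pages, or -1. */
int first_node_with_memory(const int order[NR_NODES],
			   const long free_pages[NR_NODES])
{
	int i;

	for (i = 0; i < NR_NODES; i++)
		if (free_pages[order[i]] > 0)
			return order[i];
	return -1;
}

/* Convenience wrapper: build the order for 'node', then walk it. */
int pick_node(int node, const long free_pages[NR_NODES])
{
	int order[NR_NODES];

	build_fallback_order(node, order);
	return first_node_with_memory(order, free_pages);
}
```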

^ permalink raw reply

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Christoph Lameter @ 2014-02-07 17:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
	Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054119.GA28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > This check would need to be something that checks for other contingencies
> > in the page allocator as well. A simple solution would be to actually run
> > a GFP_THISNODE alloc to see if you can grab a page from the proper node.
> > If that fails then fall back. See how fallback_alloc() does it in slab.
> >
>
> Hello, Christoph.
>
> This !node_present_pages() check ensures that allocation on this node cannot
> succeed. So we can directly use numa_mem_id() here.

Yes of course we can use numa_mem_id().

But the check is only for not having any memory at all on a node. There
are other reasons for allocations to fail on a certain node. The node could
have memory that cannot be reclaimed, all dirty, beyond certain
thresholds, not in the current set of allowed nodes, etc.

^ permalink raw reply
