linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV
@ 2013-11-29 10:41 Preeti U Murthy
  2013-11-29 10:41 ` [PATCH V4 1/9] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message Preeti U Murthy
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:41 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

On PowerPC, when CPUs enter certain deep idle states, the local timers stop and the
time base could go out of sync with the rest of the cores in the system.

This patchset adds support to wake up CPUs in such idle states by broadcasting
IPIs to them at their next timer events. We refer to these IPIs as tick
broadcast IPIs throughout this patchset.

The patchset also includes resyncing of time base with the rest of the cores
in the system as soon as the CPUs wake up from deep idle states.

"Fast-Sleep" is a deep idle state on Power8 in which the above mentioned challenges
exist. With the required support for deep idle states thus in place, the patchset
adds Fast-Sleep into cpuidle. Fast-Sleep can yield us significantly more power
savings than the idle states that we have in cpuidle so far.

This patchset is based on mainline-3.13-rc1 and the cpuidle driver for power
posted by Deepthi Dharwar: https://lkml.org/lkml/2013/11/11/29

Changes in V4:

1. Add Fast Sleep CPU idle state on PowerNV.

2. Add the required context management for Fast Sleep and the call to OPAL
to synchronize time base after wakeup from fast sleep.

4. Add parsing of CPU idle states from the device tree to populate the cpuidle
state table.

5. Rename ambiguous functions in the code around waking up of CPUs from fast
sleep.

6. Fixed a bug in re-programming of the hrtimer that is queued to wake up the
CPUs in fast sleep and modified Changelogs.

7. Added the ARCH_HAS_TICK_BROADCAST option. This signifies that we have an
arch-specific function to perform broadcast.

Changes in V3:
http://thread.gmane.org/gmane.linux.power-management.general/38113

1. Fix the way in which a broadcast ipi is handled on the idling cpus. Timer
handling on a broadcast ipi is being done now without missing out any timer
stats generation.

2. Fix a bug in the programming of the hrtimer meant to do broadcast. Program
it to trigger at the earlier of a "broadcast period", and the next wakeup
event. By introducing the "broadcast period" as the maximum period after
which the broadcast hrtimer can fire, we ensure that we do not miss
wakeups in corner cases.

3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast
to fire immediately on the new broadcast cpu. This ensures that we do not miss
a broadcast that is pending in the near future.

4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while
initializing bc_hrtimer since we are in an atomic context and cannot sleep.

5. Use the broadcast ipi to wake up the newly nominated broadcast cpu on
hotplug of the old instead of smp_call_function_single(). This is because
interrupts are disabled at this point and we should not be using
smp_call_function_single or its children in this context to send an ipi.

6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig.

7. Fix coding style issues.
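
The expiry rule described in item 2 above can be sketched as a small
userspace model. This is only an illustration under stated assumptions: the
names ns_t and bc_next_expiry() are hypothetical and do not appear in the
patchset, and the handling of a wakeup that is already in the past (fire
immediately) is an assumption of this sketch, not something taken from the
patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type and function names, used only for this sketch. */
typedef uint64_t ns_t;

/*
 * Return the absolute time at which the broadcast hrtimer should fire:
 * the earlier of (now + bc_period) and the next pending wakeup event.
 */
static ns_t bc_next_expiry(ns_t now, ns_t bc_period, ns_t next_wakeup)
{
	ns_t period_end;

	assert(bc_period > 0);

	/* Saturate instead of wrapping if now + bc_period overflows. */
	period_end = now + bc_period;
	if (period_end < now)
		period_end = UINT64_MAX;

	/* Assumption: a wakeup that is already due fires immediately. */
	if (next_wakeup <= now)
		return now;

	return next_wakeup < period_end ? next_wakeup : period_end;
}
```

Capping the expiry at the end of the broadcast period means the timer is
re-evaluated at least once per period, even if the recorded next wakeup is
far away or stale.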

Changes in V2: https://lkml.org/lkml/2013/8/14/239

1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast
CPU by queueing a hrtimer dedicated to do broadcast.

V1 posting: https://lkml.org/lkml/2013/7/25/740.

1. Added the infrastructure to wakeup CPUs in deep idle states in which the
local timers stop.

---

Preeti U Murthy (5):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
      cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc
      cpuidle/powernv: Add "Fast-Sleep" CPU idle state
      cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old
      cpuidle/powernv: Parse device tree to setup idle states

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message

Vaidyanathan Srinivasan (2):
      powernv/cpuidle: Add context management for Fast Sleep
      powermgt: Add OPAL call to resync timebase on wakeup


 arch/powerpc/Kconfig                           |    2 
 arch/powerpc/include/asm/opal.h                |    2 
 arch/powerpc/include/asm/processor.h           |    1 
 arch/powerpc/include/asm/smp.h                 |    2 
 arch/powerpc/include/asm/time.h                |    4 
 arch/powerpc/kernel/exceptions-64s.S           |   10 +
 arch/powerpc/kernel/idle_power7.S              |   90 +++++++--
 arch/powerpc/kernel/smp.c                      |   23 ++
 arch/powerpc/kernel/time.c                     |  137 ++++++++++----
 arch/powerpc/platforms/cell/interrupt.c        |    2 
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 
 arch/powerpc/platforms/ps3/smp.c               |    2 
 drivers/cpuidle/cpuidle-powerpc-book3s.c       |  241 +++++++++++++++++++++++-
 13 files changed, 443 insertions(+), 74 deletions(-)

-- 


* [PATCH V4 1/9] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
@ 2013-11-29 10:41 ` Preeti U Murthy
  2013-11-29 10:41 ` [PATCH V4 2/9] powerpc: Implement tick broadcast IPI as a fixed " Preeti U Murthy
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:41 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 084e080..9f7356b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_CALL_FUNC_SINGLE	2
+#define PPC_MSG_UNUSED		2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a3b64f3..c2bd8d6 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t call_function_single_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_single_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
+	[PPC_MSG_UNUSED] = unused_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
+	[PPC_MSG_UNUSED] = "ipi unused",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
-		if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE))
-			generic_smp_call_function_single_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
 void arch_send_call_function_single_ipi(int cpu)
 {
-	do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
+	do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
 void arch_send_call_function_ipi_mask(const struct cpumask *mask)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..adf3726 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..00d1a7c 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {


* [PATCH V4 2/9] powerpc: Implement tick broadcast IPI as a fixed IPI message
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
  2013-11-29 10:41 ` [PATCH V4 1/9] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message Preeti U Murthy
@ 2013-11-29 10:41 ` Preeti U Murthy
  2013-11-29 10:42 ` [PATCH V4 3/9] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines Preeti U Murthy
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:41 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.
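
As a rough illustration of why a fixed message is cheap, the following
single-threaded userspace model mimics the send and demux paths. It is a
sketch only: the array sizes and the names message_pass(),
tick_broadcast_model() and ipi_demux() are hypothetical stand-ins, a plain
bitmask stands in for struct cpumask, and the real kernel code uses atomic
operations and memory barriers that are omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for the model */

/* Mirrors the four fixed message slots; names are local to this sketch. */
enum {
	MSG_CALL_FUNCTION,
	MSG_RESCHEDULE,
	MSG_TICK_BROADCAST,
	MSG_DEBUGGER_BREAK,
	MSG_COUNT
};

static uint32_t pending[NR_CPUS_MODEL];			/* one bit per message */
static unsigned int handled[NR_CPUS_MODEL][MSG_COUNT];	/* handler run counts */

/* Sending a fixed message only sets a bit: no queueing, no locking. */
static void message_pass(unsigned int cpu, int msg)
{
	assert(cpu < NR_CPUS_MODEL && msg >= 0 && msg < MSG_COUNT);
	pending[cpu] |= 1u << msg;
}

/* Post the tick broadcast message to every CPU whose bit is set in cpumask. */
static void tick_broadcast_model(uint32_t cpumask)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpumask & (1u << cpu))
			message_pass(cpu, MSG_TICK_BROADCAST);
}

/* Receiver side: grab all pending bits, then run each fixed handler once. */
static void ipi_demux(unsigned int cpu)
{
	assert(cpu < NR_CPUS_MODEL);
	while (pending[cpu]) {
		uint32_t all = pending[cpu];

		pending[cpu] = 0;
		for (int msg = 0; msg < MSG_COUNT; msg++)
			if (all & (1u << msg))
				handled[cpu][msg]++;
	}
}
```

In this model the sender never touches a per-CPU function queue, which is
the property the changelog above relies on.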

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
 Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   19 +++++++++++++++----
 arch/powerpc/kernel/time.c              |    5 +++++
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_UNUSED		2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c2bd8d6..c77c6d7 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3b1441..42269c7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,6 +813,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index adf3726..8a106b4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TICK_BROADCAST);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 00d1a7c..b358bec 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
+		BUILD_BUG_ON(PPC_MSG_TICK_BROADCAST   != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {


* [PATCH V4 3/9] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
  2013-11-29 10:41 ` [PATCH V4 1/9] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message Preeti U Murthy
  2013-11-29 10:41 ` [PATCH V4 2/9] powerpc: Implement tick broadcast IPI as a fixed " Preeti U Murthy
@ 2013-11-29 10:42 ` Preeti U Murthy
  2013-11-29 10:42 ` [PATCH V4 4/9] powernv/cpuidle: Add context management for Fast Sleep Preeti U Murthy
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:42 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

Split timer_interrupt(), which is the local timer interrupt handler on ppc,
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
directly call into __timer_interrupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.
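
The control flow of the split can be modelled in userspace as follows. This
is a simplified sketch, not the kernel code: struct cpu_timer,
run_local_timers() and broadcast_ipi() are hypothetical names, DEC_MAX is a
stand-in for DECREMENTER_MAX, and the event handler is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

#define DEC_MAX 0x7fffffffULL	/* stand-in for DECREMENTER_MAX */

/* Hypothetical per-CPU timer state for this model. */
struct cpu_timer {
	uint64_t next_tb;	/* timebase value of the next timer event */
	uint64_t dec;		/* last value programmed into the "decrementer" */
	unsigned int events;	/* number of times the event handler ran */
};

/* Models __timer_interrupt(): run the handler if due, else re-arm. */
static void run_local_timers(struct cpu_timer *t, uint64_t now)
{
	assert(t);
	if (now >= t->next_tb) {
		t->next_tb = ~(uint64_t)0;	/* nothing pending until reprogrammed */
		t->events++;			/* stands in for evt->event_handler(evt) */
	} else {
		uint64_t delta = t->next_tb - now;

		if (delta <= DEC_MAX)
			t->dec = delta;		/* stands in for set_dec() */
	}
}

/* Models tick_broadcast_ipi_handler(): force the event to be due now. */
static void broadcast_ipi(struct cpu_timer *t, uint64_t now)
{
	t->next_tb = now;
	run_local_timers(t, now);
}
```

Setting next_tb to the current time before calling the common routine is
what makes the woken CPU process its timers immediately instead of
re-arming the decrementer.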

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   73 +++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 42269c7..42cb603 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,42 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+static void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	__get_cpu_var(irq_stat).timer_irqs++;
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs)
 	 */
 	may_hard_irq_enable();
 
-	__get_cpu_var(irq_stat).timer_irqs++;
-
 #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
 	if (atomic_read(&ppc_n_lost_interrupts) != 0)
 		do_IRQ(regs);
@@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -816,6 +821,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)


* [PATCH V4 4/9] powernv/cpuidle: Add context management for Fast Sleep
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (2 preceding siblings ...)
  2013-11-29 10:42 ` [PATCH V4 3/9] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines Preeti U Murthy
@ 2013-11-29 10:42 ` Preeti U Murthy
  2013-11-29 10:42 ` [PATCH V4 5/9] powermgt: Add OPAL call to resync timebase on wakeup Preeti U Murthy
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:42 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/processor.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S |   10 ++++-
 arch/powerpc/kernel/idle_power7.S    |   63 ++++++++++++++++++++++++----------
 3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 4f7b047..d7633d0 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -444,6 +444,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
+extern void power7_sleep(void);
 
 #ifdef CONFIG_CPU_IDLE_POWERPC_BOOK3S
 extern void update_smt_snooze_delay(int cpu, int residency);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 9f905e4..b8139fb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,9 +121,10 @@ BEGIN_FTR_SECTION
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal, we could try to use the
 	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * but for now, let's just stay stuck here
+	 * OPAL v3 based powernv platforms have new idle states
+	 * which fall in this category.
 	 */
-	bgt	cr1,.
+	bgt	cr1,8f
 	GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -141,6 +142,11 @@ BEGIN_FTR_SECTION
 	beq	cr1,2f
 	b	.power7_wakeup_noloss
 2:	b	.power7_wakeup_loss
+
+	/* Fast Sleep wakeup on PowerNV */
+8:	GET_PACA(r13)
+	b 	.power7_wakeup_loss
+
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 847e40e..e4bbca2 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -20,17 +20,27 @@
 
 #undef DEBUG
 
-	.text
+/* Idle state entry routines */
 
-_GLOBAL(power7_idle)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDRBASE(r3,powersave_nap)
-	lwz	r4,ADDROFF(powersave_nap)(r3)
-	cmpwi	0,r4,0
-	beqlr
-	/* fall through */
+#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);					\
+	ptesync;						\
+	ld	r0,0(r1);					\
+1:	cmp	cr0,r0,r0;					\
+	bne	1b;						\
+	IDLE_INST;						\
+	b	.
 
-_GLOBAL(power7_nap)
+	.text
+
+/*
+ * Pass requested state in r3:
+ * 	0 - nap
+ * 	1 - sleep
+ */
+_GLOBAL(power7_powersave_common)
+	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
 	 * stick a pointer to it in PACAR1. We really only
@@ -79,8 +89,8 @@ _GLOBAL(power7_nap)
 	/* Continue saving state */
 	SAVE_GPR(2, r1)
 	SAVE_NVGPRS(r1)
-	mfcr	r3
-	std	r3,_CCR(r1)
+	mfcr	r4
+	std	r4,_CCR(r1)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
@@ -89,15 +99,30 @@ _GLOBAL(power7_nap)
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
+	cmpwi	cr0,r3,1
+	beq	2f
+	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	/* No return */
+2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	/* No return */
 
-	/* Magic NAP mode enter sequence */
-	std	r0,0(r1)
-	ptesync
-	ld	r0,0(r1)
-1:	cmp	cr0,r0,r0
-	bne	1b
-	PPC_NAP
-	b	.
+_GLOBAL(power7_idle)
+	/* Now check if user or arch enabled NAP mode */
+	LOAD_REG_ADDRBASE(r3,powersave_nap)
+	lwz	r4,ADDROFF(powersave_nap)(r3)
+	cmpwi	0,r4,0
+	beqlr
+	/* fall through */
+
+_GLOBAL(power7_nap)
+	li	r3,0
+	b	power7_powersave_common
+	/* No return */
+
+_GLOBAL(power7_sleep)
+	li	r3,1
+	b	power7_powersave_common
+	/* No return */
 
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)


* [PATCH V4 5/9] powermgt: Add OPAL call to resync timebase on wakeup
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (3 preceding siblings ...)
  2013-11-29 10:42 ` [PATCH V4 4/9] powernv/cpuidle: Add context management for Fast Sleep Preeti U Murthy
@ 2013-11-29 10:42 ` Preeti U Murthy
  2013-11-29 10:43 ` [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc Preeti U Murthy
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:42 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

During "Fast-Sleep" and deeper power saving states, the decrementer and
timebase could be stopped, leaving them out of sync with the rest
of the cores in the system.

Add a firmware call to request the platform to resync the timebase
using low level platform methods.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/opal.h                |    2 ++
 arch/powerpc/kernel/exceptions-64s.S           |    2 +-
 arch/powerpc/kernel/idle_power7.S              |   27 ++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 +
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 033c06b..a662d06 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -132,6 +132,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_FLASH_VALIDATE			76
 #define OPAL_FLASH_MANAGE			77
 #define OPAL_FLASH_UPDATE			78
+#define OPAL_RESYNC_TIMEBASE			79
 
 #ifndef __ASSEMBLY__
 
@@ -763,6 +764,7 @@ extern void opal_flash_init(void);
 extern int opal_machine_check(struct pt_regs *regs);
 
 extern void opal_shutdown(void);
+extern int opal_resync_timebase(void);
 
 extern void opal_lpc_init(void);
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b8139fb..91e6417 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -145,7 +145,7 @@ BEGIN_FTR_SECTION
 
 	/* Fast Sleep wakeup on PowerNV */
 8:	GET_PACA(r13)
-	b 	.power7_wakeup_loss
+	b 	.power7_wakeup_tb_loss
 
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index e4bbca2..34c71e8 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -17,6 +17,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/hw_irq.h>
 #include <asm/kvm_book3s_asm.h>
+#include <asm/opal.h>
 
 #undef DEBUG
 
@@ -124,6 +125,32 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_wakeup_tb_loss)
+	ld	r2,PACATOC(r13);
+	ld	r1,PACAR1(r13)
+
+	/* Time base re-sync */
+	li	r0,OPAL_RESYNC_TIMEBASE
+	LOAD_REG_ADDR(r11,opal);
+	ld	r12,8(r11);
+	ld	r2,0(r11);
+	mtctr	r12
+	bctrl
+
+	/* TODO: Check r3 for failure */
+
+	REST_NVGPRS(r1)
+	REST_GPR(2, r1)
+	ld	r3,_CCR(r1)
+	ld	r4,_MSR(r1)
+	ld	r5,_NIP(r1)
+	addi	r1,r1,INT_FRAME_SIZE
+	mtcr	r3
+	mfspr	r3,SPRN_SRR1		/* Return SRR1 */
+	mtspr	SPRN_SRR1,r4
+	mtspr	SPRN_SRR0,r5
+	rfid
+
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
 	REST_NVGPRS(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index e780650..ddfe95a 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -126,3 +126,4 @@ OPAL_CALL(opal_return_cpu,			OPAL_RETURN_CPU);
 OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
+OPAL_CALL(opal_resync_timebase,			OPAL_RESYNC_TIMEBASE);


* [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (4 preceding siblings ...)
  2013-11-29 10:42 ` [PATCH V4 5/9] powermgt: Add OPAL call to resync timebase on wakeup Preeti U Murthy
@ 2013-11-29 10:43 ` Preeti U Murthy
  2013-11-29 11:58   ` Thomas Gleixner
  2013-11-29 10:43 ` [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state Preeti U Murthy
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:43 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

On ppc there are certain deep CPU idle states in which the local timers stop. One
such idle state on Power8 is "Fast-Sleep". However, we do not have an external
timer to wake up these CPUs. Hence we prevent one of the CPUs from entering
Fast-Sleep so that it can wake up the remaining CPUs in this state.

We still rely on the broadcast framework[1] in the kernel to keep
track of the CPUs in deep idle and the time at which to wake them up. To enable
this framework, we need to register a clock device that does not stop in deep idle
states. Without such a device, the broadcast framework does not take any
action when CPUs enter and exit deep idle states, since it believes that there
is no clock device to wake up the CPUs in deep idle states.

A local timer does not satisfy this condition, and hence we introduce a
pseudo clock device, called the broadcast_clockevent, and register it
with the broadcast framework. This is done to trick the broadcast framework
into believing that we have an external timer to wake up the CPUs. But this
device is not programmable; it just enables us to make use of the broadcast framework.
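
To illustrate why the decrementer alone is not enough, here is a minimal
user-space sketch of the selection rule described above: a device that stops
in deep idle is never accepted as the broadcast device. The struct, the flag
values and the function names are hypothetical stand-ins for this example
only, not the kernel's definitions, and the decrementer's rating is
illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the clockevent feature bits (values illustrative). */
#define FEAT_ONESHOT 0x1u
#define FEAT_C3STOP  0x2u	/* device stops in deep idle */

struct clockdev_model {
	const char *name;
	unsigned int features;
	int rating;
};

/* A device that stops in deep idle cannot wake up other CPUs. */
static bool usable_for_broadcast(const struct clockdev_model *dev)
{
	return !(dev->features & FEAT_C3STOP);
}

/* Pick the highest rated usable device; NULL means broadcast stays inactive. */
static const struct clockdev_model *
pick_broadcast_device(const struct clockdev_model *devs, size_t n)
{
	const struct clockdev_model *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!usable_for_broadcast(&devs[i]))
			continue;
		if (!best || devs[i].rating > best->rating)
			best = &devs[i];
	}
	return best;
}
```

With only a C3STOP decrementer in the list the function returns NULL, which
models the framework taking no action; once a second device without C3STOP is
added, that device is picked.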

[1]http://lwn.net/Articles/574591/

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/Kconfig            |    2 +
 arch/powerpc/include/asm/time.h |    1 +
 arch/powerpc/kernel/time.c      |   58 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b44b52c..cafa788 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,6 +129,8 @@ config PPC
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
+	select GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 1d428e6..4057425 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,6 +24,7 @@ extern unsigned long tb_ticks_per_jiffy;
 extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
+extern struct clock_event_device broadcast_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 42cb603..d2e582b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -42,6 +42,7 @@
 #include <linux/timex.h>
 #include <linux/kernel_stat.h>
 #include <linux/time.h>
+#include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/profile.h>
 #include <linux/cpu.h>
@@ -97,6 +98,10 @@ static struct clocksource clocksource_timebase = {
 
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev);
+static int broadcast_set_next_event(unsigned long evt,
+				      struct clock_event_device *dev);
+static void broadcast_set_mode(enum clock_event_mode mode,
+				 struct clock_event_device *dev);
 static void decrementer_set_mode(enum clock_event_mode mode,
 				 struct clock_event_device *dev);
 
@@ -106,12 +111,23 @@ struct clock_event_device decrementer_clockevent = {
 	.irq            = 0,
 	.set_next_event = decrementer_set_next_event,
 	.set_mode       = decrementer_set_mode,
-	.features       = CLOCK_EVT_FEAT_ONESHOT,
+	.features       = CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_ONESHOT,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
 
+struct clock_event_device broadcast_clockevent = {
+	.name           = "broadcast",
+	.rating         = 200,
+	.irq            = 0,
+	.set_next_event = broadcast_set_next_event,
+	.set_mode       = broadcast_set_mode,
+	.features       = CLOCK_EVT_FEAT_ONESHOT,
+};
+EXPORT_SYMBOL(broadcast_clockevent);
+
 DEFINE_PER_CPU(u64, decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
+static struct clock_event_device bc_timer;
 
 #define XSEC_PER_SEC (1024*1024)
 
@@ -811,6 +827,19 @@ static int decrementer_set_next_event(unsigned long evt,
 	return 0;
 }
 
+static int broadcast_set_next_event(unsigned long evt,
+					struct clock_event_device *dev)
+{
+	return 0;
+}
+
+static void broadcast_set_mode(enum clock_event_mode mode,
+				 struct clock_event_device *dev)
+{
+	if (mode != CLOCK_EVT_MODE_ONESHOT)
+		broadcast_set_next_event(DECREMENTER_MAX, dev);
+}
+
 static void decrementer_set_mode(enum clock_event_mode mode,
 				 struct clock_event_device *dev)
 {
@@ -840,6 +869,19 @@ static void register_decrementer_clockevent(int cpu)
 	clockevents_register_device(dec);
 }
 
+static void register_broadcast_clockevent(int cpu)
+{
+	struct clock_event_device *bc_evt = &bc_timer;
+
+	*bc_evt = broadcast_clockevent;
+	bc_evt->cpumask = cpu_possible_mask;
+
+	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
+		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
+
+	clockevents_register_device(bc_evt);
+}
+
 static void __init init_decrementer_clockevent(void)
 {
 	int cpu = smp_processor_id();
@@ -854,6 +896,19 @@ static void __init init_decrementer_clockevent(void)
 	register_decrementer_clockevent(cpu);
 }
 
+static void __init init_broadcast_clockevent(void)
+{
+	int cpu = smp_processor_id();
+
+	clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
+
+	broadcast_clockevent.max_delta_ns =
+		clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
+	broadcast_clockevent.min_delta_ns =
+		clockevent_delta2ns(2, &broadcast_clockevent);
+	register_broadcast_clockevent(cpu);
+}
+
 void secondary_cpu_time_init(void)
 {
 	/* Start the decrementer on CPUs that have manual control
@@ -930,6 +985,7 @@ void __init time_init(void)
 	clocksource_init();
 
 	init_decrementer_clockevent();
+	init_broadcast_clockevent();
 }
 
 


* [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (5 preceding siblings ...)
  2013-11-29 10:43 ` [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc Preeti U Murthy
@ 2013-11-29 10:43 ` Preeti U Murthy
  2013-11-29 14:39   ` Thomas Gleixner
  2013-11-29 10:43 ` [PATCH V4 8/9] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old Preeti U Murthy
  2013-11-29 10:43 ` [PATCH V4 9/9] cpuidle/powernv: Parse device tree to setup idle states Preeti U Murthy
  8 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:43 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. Now that the basic support for fast sleep has been added,
enable it in the cpuidle framework on PowerNV.

On ppc, since we do not have an external device that can wake up CPUs in deep
idle, the local timer of one of the CPUs needs to be nominated to do this job.
This CPU is called the broadcast CPU/bc_cpu. Only if the bc_cpu is nominated
will the remaining CPUs be allowed to enter deep idle state after notifying
the broadcast framework. The bc_cpu is not allowed to enter deep idle state.

The bc_cpu queues a hrtimer onto itself to handle the wakeup of CPUs in deep
idle state. The hrtimer handler calls into the broadcast framework, which takes
care of sending IPIs to all those CPUs in deep idle whose wakeup times have expired.
On each expiry of the hrtimer, it is reprogrammed to the earlier of the
next wakeup time of the CPUs in deep idle and a safety period, so as to not miss
any wakeups. This safety period is currently maintained at a jiffy.
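
The reprogramming rule above can be sketched in plain C as follows. This is a
user-space model, not the kernel code: times are plain nanosecond integers
instead of ktime_t, the function names are made up for this example, and the
clamp of an already expired interval to zero is an assumption of the sketch
that the patch itself does not make.

```c
#include <stdint.h>

/*
 * Length of one jiffy in nanoseconds, using the same arithmetic as
 * get_next_bc_tick() in the patch.
 */
static uint64_t jiffy_period_ns(uint64_t tb_ticks_per_jiffy,
				uint64_t tb_ticks_per_usec)
{
	return (tb_ticks_per_jiffy / tb_ticks_per_usec) * 1000;
}

/*
 * Delay until the broadcast hrtimer should fire next: the earlier of the
 * next wakeup of a CPU in deep idle and the safety period.
 */
static int64_t next_bc_delay_ns(int64_t now_ns, int64_t next_event_ns,
				int64_t safety_ns)
{
	int64_t interval = next_event_ns - now_ns;

	if (interval < 0)
		interval = 0;	/* assumption of this sketch, not in the patch */
	return interval < safety_ns ? interval : safety_ns;
}
```

For example, with a 512 MHz time base and HZ=100 (5120000 ticks per jiffy,
512 ticks per usec) the safety period works out to 10000000 ns, so a wakeup
due in 3 ms is served after 3 ms while one due in 50 ms leads to a re-check
after one jiffy.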

But having a dedicated bc_cpu would mean overloading just one CPU with the
broadcast work, which could hinder its performance apart from leading to thermal
imbalance on the chip. Therefore the first CPU that enters deep idle state is
nominated as the bc_cpu. It gets unassigned when there are no more CPUs in deep
idle to be woken up. No bc_cpu exists until a CPU next enters the deep idle
state and is nominated as the bc_cpu, and the cycle repeats.

Protect the nomination, de-nomination and check for existence of the broadcast
CPU with a lock to ensure synchronization between them.
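
The nomination cycle can be modelled roughly as below. This is a simplified
single-threaded sketch with hypothetical names: it tracks the number of CPUs
in deep idle with a counter, whereas the patch checks whether the broadcast
device's next_event is KTIME_MAX, and every function here is assumed to be
called with the equivalent of fastsleep_idle_lock held.

```c
enum bc_status { BC_PRESENT, BC_SELF };

struct bc_state {
	int bc_cpu;		/* -1 when no broadcast CPU is nominated */
	int nr_deep_idle;	/* CPUs currently in deep idle */
};

/* Called by a CPU that wants to enter deep idle. */
static enum bc_status enter_idle(struct bc_state *s, int cpu)
{
	if (s->bc_cpu == -1)
		s->bc_cpu = cpu;	/* first CPU to idle nominates itself */
	if (s->bc_cpu == cpu)
		return BC_SELF;		/* must stay out of deep idle */
	s->nr_deep_idle++;
	return BC_PRESENT;		/* may enter deep idle */
}

/* Called by a CPU that has woken up from deep idle. */
static void exit_deep_idle(struct bc_state *s)
{
	if (s->nr_deep_idle > 0)
		s->nr_deep_idle--;
}

/*
 * Called from the broadcast timer handler. Returns 1 if the timer should
 * be rearmed, 0 if the bc_cpu has been de-nominated.
 */
static int broadcast_tick(struct bc_state *s)
{
	if (s->nr_deep_idle == 0) {
		s->bc_cpu = -1;
		return 0;
	}
	return 1;
}
```

Starting from bc_cpu == -1, the first caller of enter_idle() becomes the
bc_cpu, later callers are allowed into deep idle, and once they have all
woken up the next broadcast_tick() clears bc_cpu so that the cycle can start
again.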

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/time.h          |    1 
 arch/powerpc/kernel/time.c               |    2 
 drivers/cpuidle/cpuidle-powerpc-book3s.c |  152 ++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 4057425..a6604b7 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -25,6 +25,7 @@ extern unsigned long tb_ticks_per_usec;
 extern unsigned long tb_ticks_per_sec;
 extern struct clock_event_device decrementer_clockevent;
 extern struct clock_event_device broadcast_clockevent;
+extern struct clock_event_device bc_timer;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index d2e582b..f0603a0 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -127,7 +127,7 @@ EXPORT_SYMBOL(broadcast_clockevent);
 
 DEFINE_PER_CPU(u64, decrementers_next_tb);
 static DEFINE_PER_CPU(struct clock_event_device, decrementers);
-static struct clock_event_device bc_timer;
+struct clock_event_device bc_timer;
 
 #define XSEC_PER_SEC (1024*1024)
 
diff --git a/drivers/cpuidle/cpuidle-powerpc-book3s.c b/drivers/cpuidle/cpuidle-powerpc-book3s.c
index 25e8a99..649c330 100644
--- a/drivers/cpuidle/cpuidle-powerpc-book3s.c
+++ b/drivers/cpuidle/cpuidle-powerpc-book3s.c
@@ -12,12 +12,19 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
+#include <linux/tick.h>
+#include <linux/hrtimer.h>
+#include <linux/ktime.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
 
 #include <asm/paca.h>
 #include <asm/reg.h>
 #include <asm/machdep.h>
 #include <asm/firmware.h>
 #include <asm/runlatch.h>
+#include <asm/time.h>
 #include <asm/plpar_wrappers.h>
 
 struct cpuidle_driver powerpc_book3s_idle_driver = {
@@ -28,6 +35,26 @@ struct cpuidle_driver powerpc_book3s_idle_driver = {
 static int max_idle_state;
 static struct cpuidle_state *cpuidle_state_table;
 
+static int bc_cpu = -1;
+static struct hrtimer *bc_hrtimer;
+static int bc_hrtimer_initialized = 0;
+
+/*
+ * Bits to indicate if a cpu can enter deep idle where local timer gets
+ * switched off.
+ * BROADCAST_CPU_PRESENT : Enter deep idle since bc_cpu is assigned
+ * BROADCAST_CPU_SELF	 : Do not enter deep idle since you are bc_cpu; also
+ *			   returned when there was no bc_cpu and you have
+ *			   just nominated yourself as bc_cpu
+ * BROADCAST_CPU_ERROR	:  Do not enter deep idle since there is no bc_cpu
+ *			   and the broadcast hrtimer could not be initialized.
+ */
+enum broadcast_cpu_status {
+	BROADCAST_CPU_PRESENT,
+	BROADCAST_CPU_SELF,
+	BROADCAST_CPU_ERROR,
+};
+
 static inline void idle_loop_prolog(unsigned long *in_purr)
 {
 	*in_purr = mfspr(SPRN_PURR);
@@ -48,6 +75,8 @@ static inline void idle_loop_epilog(unsigned long in_purr)
 	get_lppaca()->idle = 0;
 }
 
+static DEFINE_SPINLOCK(fastsleep_idle_lock);
+
 static int snooze_loop(struct cpuidle_device *dev,
 			struct cpuidle_driver *drv,
 			int index)
@@ -143,6 +172,122 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+/* Functions supporting broadcasting in fastsleep */
+static ktime_t get_next_bc_tick(void)
+{
+	u64 next_bc_ns;
+
+	next_bc_ns = (tb_ticks_per_jiffy / tb_ticks_per_usec) * 1000;
+	return ns_to_ktime(next_bc_ns);
+}
+
+static int restart_broadcast(struct clock_event_device *bc_evt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&fastsleep_idle_lock, flags);
+	bc_evt->event_handler(bc_evt);
+
+	if (bc_evt->next_event.tv64 == KTIME_MAX)
+		bc_cpu = -1;
+
+	spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+	return (bc_cpu != -1);
+}
+
+static enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer)
+{
+	struct clock_event_device *bc_evt = &bc_timer;
+	ktime_t interval, next_bc_tick, now;
+
+	now = ktime_get();
+
+	if (!restart_broadcast(bc_evt))
+		return HRTIMER_NORESTART;
+
+	interval = ktime_sub(bc_evt->next_event, now);
+	next_bc_tick = get_next_bc_tick();
+
+	if (interval.tv64 < next_bc_tick.tv64)
+		hrtimer_forward_now(hrtimer, interval);
+	else
+		hrtimer_forward_now(hrtimer, next_bc_tick);
+
+	return HRTIMER_RESTART;
+}
+
+static enum broadcast_cpu_status can_enter_deep_idle(int cpu)
+{
+	if (bc_cpu != -1 && cpu != bc_cpu) {
+		return BROADCAST_CPU_PRESENT;
+	} else if (bc_cpu != -1 && cpu == bc_cpu) {
+		return BROADCAST_CPU_SELF;
+	} else {
+		if (!bc_hrtimer_initialized) {
+			bc_hrtimer = kmalloc(sizeof(*bc_hrtimer), GFP_NOWAIT);
+			if (!bc_hrtimer)
+				return BROADCAST_CPU_ERROR;
+			hrtimer_init(bc_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+			bc_hrtimer->function = handle_broadcast;
+			hrtimer_start(bc_hrtimer, get_next_bc_tick(),
+				HRTIMER_MODE_REL_PINNED);
+			bc_hrtimer_initialized = 1;
+		} else {
+			hrtimer_start(bc_hrtimer, get_next_bc_tick(), HRTIMER_MODE_REL_PINNED);
+		}
+
+		bc_cpu = cpu;
+		return BROADCAST_CPU_SELF;
+	}
+}
+
+/* Emulate sleep, with long nap.
+ * During sleep, the core does not receive decrementer interrupts.
+ * Emulate sleep using long nap with decrementer interrupts disabled.
+ * This is an initial prototype to test the broadcast framework for ppc.
+ */
+static int fastsleep_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	int cpu = dev->cpu;
+	unsigned long old_lpcr = mfspr(SPRN_LPCR);
+	unsigned long new_lpcr;
+	unsigned long flags;
+	int bc_cpu_status;
+
+	new_lpcr = old_lpcr;
+	new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* Exit powersave upon external interrupt, but not decrementer
+	 * interrupt, to emulate sleep.
+	 */
+	new_lpcr |= LPCR_PECE0;
+
+	spin_lock_irqsave(&fastsleep_idle_lock, flags);
+	bc_cpu_status = can_enter_deep_idle(cpu);
+
+	if (bc_cpu_status == BROADCAST_CPU_PRESENT) {
+		mtspr(SPRN_LPCR, new_lpcr);
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+		spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+		power7_sleep();
+		spin_lock_irqsave(&fastsleep_idle_lock, flags);
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+		spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+	} else if (bc_cpu_status == BROADCAST_CPU_SELF) {
+		new_lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, new_lpcr);
+		spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+		power7_nap();
+	} else {
+		spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+	}
+
+	mtspr(SPRN_LPCR, old_lpcr);
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -191,6 +336,13 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	 { /* Fastsleep */
+		.name = "fastsleep",
+		.desc = "fastsleep",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &fastsleep_loop },
 };
 
 void update_smt_snooze_delay(int cpu, int residency)


* [PATCH V4 8/9] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (6 preceding siblings ...)
  2013-11-29 10:43 ` [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state Preeti U Murthy
@ 2013-11-29 10:43 ` Preeti U Murthy
  2013-11-29 10:43 ` [PATCH V4 9/9] cpuidle/powernv: Parse device tree to setup idle states Preeti U Murthy
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:43 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

On hotplug of the broadcast CPU, cancel the hrtimer queued to do
broadcast and nominate a new broadcast CPU.

We choose the new broadcast CPU from among the CPUs in deep idle and thus
send an IPI to wake it up to continue the duty of broadcast. The new
broadcast CPU needs to find out if it woke up to resume broadcast.
If so, it needs to restart the broadcast hrtimer on itself.

It is possible that the old broadcast CPU was hotplugged out when the broadcast
hrtimer was about to fire on it. Therefore the newly nominated broadcast CPU
should set the broadcast hrtimer on itself to expire immediately so as to not
miss wakeups under such scenarios.
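
A rough user-space model of this handover is given below. The struct and the
function names are hypothetical, the set of CPUs in deep idle is represented
as a 64-bit mask rather than a cpumask, and "arming" the timer is reduced to
setting a flag; the caller is assumed to hold the equivalent of
fastsleep_idle_lock.

```c
#include <stdint.h>

struct handover {
	int bc_cpu;		/* -1 when no broadcast CPU is nominated */
	int timer_armed;	/* models the broadcast hrtimer being queued */
	int ipi_target;		/* CPU to send the wakeup IPI to, -1 if none */
};

/* Lowest numbered CPU set in the mask, or -1 if the mask is empty. */
static int first_cpu(uint64_t mask)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Models the CPU_DYING case of the hotplug notifier. */
static void cpu_dying(struct handover *h, int dying_cpu,
		      uint64_t deep_idle_mask)
{
	int next;

	h->ipi_target = -1;
	if (dying_cpu != h->bc_cpu)
		return;

	h->bc_cpu = -1;
	h->timer_armed = 0;		/* cancel the old broadcast timer */

	next = first_cpu(deep_idle_mask);
	if (next >= 0) {
		h->bc_cpu = next;	/* nominate a CPU from deep idle */
		h->ipi_target = next;	/* and wake it up */
	}
}

/* Models the check done on entry to the tick broadcast IPI handler. */
static void broadcast_ipi_entry(struct handover *h, int this_cpu)
{
	if (this_cpu == h->bc_cpu)
		h->timer_armed = 1;	/* rearm, set to expire immediately */
}
```

If CPU 1 is the bc_cpu and goes offline while CPUs 4 and 6 are in deep idle,
CPU 4 is nominated and receives the IPI; when it handles the IPI it finds that
it is now the bc_cpu and rearms the timer.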

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/time.h          |    1 +
 arch/powerpc/kernel/time.c               |    1 +
 drivers/cpuidle/cpuidle-powerpc-book3s.c |   22 ++++++++++++++++++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index a6604b7..e24ebb4 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -31,6 +31,7 @@ struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
 extern void tick_broadcast_ipi_handler(void);
+extern void broadcast_irq_entry(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f0603a0..021a5c5 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -852,6 +852,7 @@ void tick_broadcast_ipi_handler(void)
 {
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
 
+	broadcast_irq_entry();
 	*next_tb = get_tb_or_rtc();
 	__timer_interrupt();
 }
diff --git a/drivers/cpuidle/cpuidle-powerpc-book3s.c b/drivers/cpuidle/cpuidle-powerpc-book3s.c
index 649c330..59cd529 100644
--- a/drivers/cpuidle/cpuidle-powerpc-book3s.c
+++ b/drivers/cpuidle/cpuidle-powerpc-book3s.c
@@ -288,6 +288,12 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+void broadcast_irq_entry(void)
+{
+	if (smp_processor_id() == bc_cpu)
+		hrtimer_start(bc_hrtimer, ns_to_ktime(0), HRTIMER_MODE_REL_PINNED);
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -366,6 +372,7 @@ static int powerpc_book3s_cpuidle_add_cpu_notifier(struct notifier_block *n,
 			unsigned long action, void *hcpu)
 {
 	int hotcpu = (unsigned long)hcpu;
+	unsigned long flags;
 	struct cpuidle_device *dev =
 			per_cpu(cpuidle_devices, hotcpu);
 
@@ -378,6 +385,21 @@ static int powerpc_book3s_cpuidle_add_cpu_notifier(struct notifier_block *n,
 			cpuidle_resume_and_unlock();
 			break;
 
+		case CPU_DYING:
+		case CPU_DYING_FROZEN:
+			spin_lock_irqsave(&fastsleep_idle_lock, flags);
+			if (hotcpu == bc_cpu) {
+				bc_cpu = -1;
+				hrtimer_cancel(bc_hrtimer);
+				if (!cpumask_empty(tick_get_broadcast_oneshot_mask())) {
+					bc_cpu = cpumask_first(
+							tick_get_broadcast_oneshot_mask());
+					tick_broadcast(cpumask_of(bc_cpu));
+				}
+			}
+			spin_unlock_irqrestore(&fastsleep_idle_lock, flags);
+			break;
+
 		case CPU_DEAD:
 		case CPU_DEAD_FROZEN:
 			cpuidle_pause_and_lock();


* [PATCH V4 9/9] cpuidle/powernv: Parse device tree to setup idle states
  2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
                   ` (7 preceding siblings ...)
  2013-11-29 10:43 ` [PATCH V4 8/9] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old Preeti U Murthy
@ 2013-11-29 10:43 ` Preeti U Murthy
  8 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-11-29 10:43 UTC (permalink / raw)
  To: fweisbec, paul.gortmaker, paulus, shangw, rjw, galak, benh,
	paulmck, arnd, linux-pm, rostedt, michael, john.stultz, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

Add deep idle states such as nap and fast sleep to the cpuidle state table
only if they are discovered from the device tree during cpuidle initialization.
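
The discovery step can be sketched in user space as follows. The flag values
and MAX_POWERNV_IDLE_STATES are taken from this patch; the function name and
the names[] output array are made up for the example, the flags are assumed to
be already in CPU byte order, and the bound check against the table size is an
addition of this sketch that the patch does not have.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_POWERNV_IDLE_STATES	8
#define IDLE_USE_INST_NAP	0x00010000 /* Use nap instruction */
#define IDLE_USE_INST_SLEEP	0x00020000 /* Use sleep instruction */

/*
 * Fill names[] (MAX_POWERNV_IDLE_STATES entries) with the idle states
 * found in the flags array and return how many states were set up.
 * Snooze is always present as state 0.
 */
static int collect_idle_states(const uint32_t *flags, size_t n,
			       const char **names)
{
	int nr = 0;
	size_t i;

	names[nr++] = "snooze";

	for (i = 0; i < n; i++) {
		/* Bound checks are an assumption of this sketch. */
		if ((flags[i] & IDLE_USE_INST_NAP) &&
		    nr < MAX_POWERNV_IDLE_STATES)
			names[nr++] = "Nap";

		if ((flags[i] & IDLE_USE_INST_SLEEP) &&
		    nr < MAX_POWERNV_IDLE_STATES)
			names[nr++] = "FastSleep";
	}
	return nr;
}
```

A property holding one nap entry and one sleep entry yields three states
(snooze, Nap, FastSleep); an empty property leaves snooze as the only state.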

Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle-powerpc-book3s.c |   81 ++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powerpc-book3s.c b/drivers/cpuidle/cpuidle-powerpc-book3s.c
index 59cd529..b80ee9b 100644
--- a/drivers/cpuidle/cpuidle-powerpc-book3s.c
+++ b/drivers/cpuidle/cpuidle-powerpc-book3s.c
@@ -18,6 +18,7 @@
 #include <linux/ktime.h>
 #include <linux/spinlock.h>
 #include <linux/slab.h>
+#include <linux/of.h>
 
 #include <asm/paca.h>
 #include <asm/reg.h>
@@ -27,6 +28,12 @@
 #include <asm/time.h>
 #include <asm/plpar_wrappers.h>
 
+/* Flags and constants used in PowerNV platform */
+
+#define MAX_POWERNV_IDLE_STATES	8
+#define IDLE_USE_INST_NAP	0x00010000 /* Use nap instruction */
+#define IDLE_USE_INST_SLEEP	0x00020000 /* Use sleep instruction */
+
 struct cpuidle_driver powerpc_book3s_idle_driver = {
 	.name             = "powerpc_book3s_idle",
 	.owner            = THIS_MODULE,
@@ -327,7 +334,7 @@ static struct cpuidle_state shared_states[] = {
 		.enter = &shared_cede_loop },
 };
 
-static struct cpuidle_state powernv_states[] = {
+static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = {
 	{ /* Snooze */
 		.name = "snooze",
 		.desc = "snooze",
@@ -335,20 +342,6 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
-	{ /* NAP */
-		.name = "NAP",
-		.desc = "NAP",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &nap_loop },
-	 { /* Fastsleep */
-		.name = "fastsleep",
-		.desc = "fastsleep",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &fastsleep_loop },
 };
 
 void update_smt_snooze_delay(int cpu, int residency)
@@ -418,6 +411,60 @@ static struct notifier_block setup_hotplug_notifier = {
 	.notifier_call = powerpc_book3s_cpuidle_add_cpu_notifier,
 };
 
+static int powernv_add_idle_states(void)
+{
+	struct device_node *power_mgt;
+	struct property *prop;
+	int nr_idle_states = 1; /* Snooze */
+	int dt_idle_states;
+	u32 *flags;
+	int i;
+
+	/* Currently we have snooze statically defined */
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return nr_idle_states;
+	}
+
+	prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+	if (!prop) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return nr_idle_states;
+	}
+
+	dt_idle_states = prop->length / sizeof(u32);
+	flags = (u32 *) prop->value;
+
+	for (i = 0; i < dt_idle_states; i++) {
+
+		if (flags[i] & IDLE_USE_INST_NAP) {
+			/* Add NAP state */
+			strcpy(powernv_states[nr_idle_states].name, "Nap");
+			strcpy(powernv_states[nr_idle_states].desc, "Nap");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 10;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &nap_loop;
+			nr_idle_states++;
+		}
+
+		if (flags[i] & IDLE_USE_INST_SLEEP) {
+			/* Add FASTSLEEP state */
+			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 300;
+			powernv_states[nr_idle_states].target_residency = 1000000;
+			powernv_states[nr_idle_states].enter = &fastsleep_loop;
+			nr_idle_states++;
+		}
+	}
+
+	return nr_idle_states;
+}
+
 /*
  * powerpc_book3s_cpuidle_driver_init()
  */
@@ -448,7 +495,6 @@ static int powerpc_book3s_cpuidle_driver_init(void)
  */
 static int powerpc_book3s_idle_probe(void)
 {
-
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return -ENODEV;
 
@@ -463,7 +509,8 @@ static int powerpc_book3s_idle_probe(void)
 
 	} else if (firmware_has_feature(FW_FEATURE_OPALv3)) {
 		cpuidle_state_table = powernv_states;
-		max_idle_state = ARRAY_SIZE(powernv_states);
+		/* Device tree can indicate more idle states */
+		max_idle_state = powernv_add_idle_states();
 
 	} else
 		return -ENODEV;


* Re: [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc
  2013-11-29 10:43 ` [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc Preeti U Murthy
@ 2013-11-29 11:58   ` Thomas Gleixner
  2013-12-02 15:27     ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2013-11-29 11:58 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: fweisbec, paul.gortmaker, paulus, shangw, rjw, paulmck, arnd,
	linux-pm, rostedt, michael, john.stultz, chenhui.zhao, deepthi,
	r58472, geoff, linux-kernel, srivatsa.bhat, schwidefsky,
	linuxppc-dev

On Fri, 29 Nov 2013, Preeti U Murthy wrote:
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index b44b52c..cafa788 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -129,6 +129,8 @@ config PPC
>  	select GENERIC_CMOS_UPDATE
>  	select GENERIC_TIME_VSYSCALL_OLD
>  	select GENERIC_CLOCKEVENTS
> +	select GENERIC_CLOCKEVENTS_BROADCAST
> +	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST

What's the point of this config switch? It's nowhere used.

> +static int broadcast_set_next_event(unsigned long evt,
> +					struct clock_event_device *dev)
> +{
> +	return 0;
> +}
> +
> +static void broadcast_set_mode(enum clock_event_mode mode,
> +				 struct clock_event_device *dev)
> +{
> +	if (mode != CLOCK_EVT_MODE_ONESHOT)
> +		broadcast_set_next_event(DECREMENTER_MAX, dev);

What's the point of calling an empty function?  

> +}
> +
>  static void decrementer_set_mode(enum clock_event_mode mode,
>  				 struct clock_event_device *dev)
>  {
> @@ -840,6 +869,19 @@ static void register_decrementer_clockevent(int cpu)
>  	clockevents_register_device(dec);
>  }
>  
> +static void register_broadcast_clockevent(int cpu)
> +{
> +	struct clock_event_device *bc_evt = &bc_timer;
> +
> +	*bc_evt = broadcast_clockevent;
> +	bc_evt->cpumask = cpu_possible_mask;
> +
> +	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
> +		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
> +
> +	clockevents_register_device(bc_evt);
> +}
> +
>  static void __init init_decrementer_clockevent(void)
>  {
>  	int cpu = smp_processor_id();
> @@ -854,6 +896,19 @@ static void __init init_decrementer_clockevent(void)
>  	register_decrementer_clockevent(cpu);
>  }
>  
> +static void __init init_broadcast_clockevent(void)
> +{
> +	int cpu = smp_processor_id();
> +
> +	clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
> +
> +	broadcast_clockevent.max_delta_ns =
> +		clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
> +	broadcast_clockevent.min_delta_ns =
> +		clockevent_delta2ns(2, &broadcast_clockevent);

clockevents_config()

Thanks,

	tglx


* Re: [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
  2013-11-29 10:43 ` [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state Preeti U Murthy
@ 2013-11-29 14:39   ` Thomas Gleixner
  2013-12-02 15:07     ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2013-11-29 14:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: fweisbec, paul.gortmaker, paulus, shangw, rjw, paulmck, arnd,
	linux-pm, rostedt, michael, john.stultz, chenhui.zhao, deepthi,
	r58472, geoff, linux-kernel, srivatsa.bhat, schwidefsky,
	linuxppc-dev

On Fri, 29 Nov 2013, Preeti U Murthy wrote:
> +static enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer)
> +{
> +	struct clock_event_device *bc_evt = &bc_timer;
> +	ktime_t interval, next_bc_tick, now;
> +
> +	now = ktime_get();
> +
> +	if (!restart_broadcast(bc_evt))
> +		return HRTIMER_NORESTART;
> +
> +	interval = ktime_sub(bc_evt->next_event, now);
> +	next_bc_tick = get_next_bc_tick();

So you're seriously using a hrtimer to poll in HZ frequency for
updates of bc->next_event?

To be honest, this design sucks.

First of all, why is this a PPC specific feature? There are probably
other architectures which could make use of this. So this should be
implemented in the core code to begin with.

And a lot of the things you need for this are already available in the
core in one form or the other.

For a start you can stick the broadcast hrtimer to the cpu which does
the timekeeping. The handover in the hotplug case is handled there as
well as is the handover for the NOHZ case.

This needs to be extended for this hrtimer broadcast thingy to work,
but it shouldn't be that hard to do so.

Now for the polling. That's a complete trainwreck.

This can be solved via the broadcast IPI as well. When a CPU which
goes down into deep idle sets the broadcast to expire earlier than the
active value it can denote that and send the timer broadcast IPI over
to the CPU which has the honour of dealing with this.

This supports HIGHRES and NO_HZ if done right, without polling at
all. So you can even let the last CPU which handles the broadcast
hrtimer go for a long sleep, just not in the deepest idle state.

Thanks,

	tglx


* Re: [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
  2013-11-29 14:39   ` Thomas Gleixner
@ 2013-12-02 15:07     ` Preeti U Murthy
  2013-12-02 18:00       ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Preeti U Murthy @ 2013-12-02 15:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: michael, r58472, shangw, arnd, linux-pm, geoff, fweisbec,
	linux-kernel, rostedt, deepthi, rjw, paul.gortmaker, paulus,
	srivatsa.bhat, schwidefsky, john.stultz, paulmck, linuxppc-dev,
	chenhui.zhao

Hi Thomas,

On 11/29/2013 08:09 PM, Thomas Gleixner wrote:
> On Fri, 29 Nov 2013, Preeti U Murthy wrote:
>> +static enum hrtimer_restart handle_broadcast(struct hrtimer *hrtimer)
>> +{
>> +	struct clock_event_device *bc_evt = &bc_timer;
>> +	ktime_t interval, next_bc_tick, now;
>> +
>> +	now = ktime_get();
>> +
>> +	if (!restart_broadcast(bc_evt))
>> +		return HRTIMER_NORESTART;
>> +
>> +	interval = ktime_sub(bc_evt->next_event, now);
>> +	next_bc_tick = get_next_bc_tick();
> 
> So you're seriously using a hrtimer to poll in HZ frequency for
> updates of bc->next_event?
> 
> To be honest, this design sucks.
> 
> First of all, why is this a PPC specific feature? There are probably
> other architectures which could make use of this. So this should be
> implemented in the core code to begin with.
> 
> And a lot of the things you need for this are already available in the
> core in one form or the other.
> 
> For a start you can stick the broadcast hrtimer to the cpu which does
> the timekeeping. The handover in the hotplug case is handled there as
> well as is the handover for the NOHZ case.
> 
> This needs to be extended for this hrtimer broadcast thingy to work,
> but it shouldn't be that hard to do so.
> 
> Now for the polling. That's a complete trainwreck.
> 
> This can be solved via the broadcast IPI as well. When a CPU which
> goes down into deep idle sets the broadcast to expire earlier than the
> active value it can denote that and send the timer broadcast IPI over
> to the CPU which has the honour of dealing with this.
> 
> This supports HIGHRES and NO_HZ if done right, without polling at
> all. So you can even let the last CPU which handles the broadcast
> hrtimer go for a long sleep, just not in the deepest idle state.

Thank you for the review. The above points are all valid. I will rework
the design to:

1. Eliminate the concept of a broadcast CPU and integrate its
functionality in the timekeeping CPU.

2. Avoid polling by using IPIs to communicate the next wakeup of the
CPUs in deep idle state so as to reprogram the broadcast hrtimer.

3. Make this feature generic and not arch-specific.

Regards
Preeti U Murthy
> 
> Thanks,
> 
> 	tglx


* Re: [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc
  2013-11-29 11:58   ` Thomas Gleixner
@ 2013-12-02 15:27     ` Preeti U Murthy
  0 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-12-02 15:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: fweisbec, paul.gortmaker, paulus, shangw, rjw, paulmck, arnd,
	linux-pm, rostedt, michael, john.stultz, chenhui.zhao, deepthi,
	r58472, geoff, linux-kernel, srivatsa.bhat, schwidefsky,
	linuxppc-dev

Hi Thomas,

On 11/29/2013 05:28 PM, Thomas Gleixner wrote:
> On Fri, 29 Nov 2013, Preeti U Murthy wrote:
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index b44b52c..cafa788 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -129,6 +129,8 @@ config PPC
>>  	select GENERIC_CMOS_UPDATE
>>  	select GENERIC_TIME_VSYSCALL_OLD
>>  	select GENERIC_CLOCKEVENTS
>> +	select GENERIC_CLOCKEVENTS_BROADCAST
>> +	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> 
> What's the point of this config switch? It's nowhere used.

When broadcast IPIs are to be sent, either the "broadcast" method
associated with the local timers is used or an arch-specific method
tick_broadcast() is invoked. For the latter to be invoked, the
ARCH_HAS_TICK_BROADCAST config option needs to be set. On PowerPC, the
broadcast method is not associated with the local timer. Hence we invoke
tick_broadcast(). This function has been added in [PATCH 2/9].
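
The fallback described above can be sketched roughly as follows. This is a userspace model, not the kernel's implementation: struct model_clockevent, model_setup_broadcast_func() and model_arch_tick_broadcast() are hypothetical names. Passing NULL as the fallback models a build without ARCH_HAS_TICK_BROADCAST.

```c
#include <stddef.h>

typedef void (*broadcast_fn)(unsigned int cpu_mask);

/* Hypothetical stand-in for a clock event device. */
struct model_clockevent {
	const char *name;
	broadcast_fn broadcast;   /* device-specific method, may be NULL */
};

/* Records which CPUs the modelled arch hook would have sent IPIs to. */
unsigned int model_ipi_mask;

/* Models the arch-provided tick_broadcast(): send the IPI to each CPU. */
void model_arch_tick_broadcast(unsigned int cpu_mask)
{
	model_ipi_mask |= cpu_mask;
}

/*
 * Pick the broadcast method: keep the device's own method if it has
 * one, otherwise fall back to the arch hook. Returns the chosen method,
 * or NULL if neither exists.
 */
broadcast_fn model_setup_broadcast_func(struct model_clockevent *dev,
					broadcast_fn arch_fallback)
{
	if (!dev->broadcast)
		dev->broadcast = arch_fallback;
	return dev->broadcast;
}
```

A device with no broadcast method of its own, like the local timer described above, ends up using the arch hook when one is supplied.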
> 
>> +static int broadcast_set_next_event(unsigned long evt,
>> +					struct clock_event_device *dev)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void broadcast_set_mode(enum clock_event_mode mode,
>> +				 struct clock_event_device *dev)
>> +{
>> +	if (mode != CLOCK_EVT_MODE_ONESHOT)
>> +		broadcast_set_next_event(DECREMENTER_MAX, dev);
> 
> What's the point of calling an empty function?  

You are right, this should have remained a dummy function like
broadcast_set_next_event() as per the design of this patchset.
> 
>> +}
>> +
>>  static void decrementer_set_mode(enum clock_event_mode mode,
>>  				 struct clock_event_device *dev)
>>  {
>> @@ -840,6 +869,19 @@ static void register_decrementer_clockevent(int cpu)
>>  	clockevents_register_device(dec);
>>  }
>>  
>> +static void register_broadcast_clockevent(int cpu)
>> +{
>> +	struct clock_event_device *bc_evt = &bc_timer;
>> +
>> +	*bc_evt = broadcast_clockevent;
>> +	bc_evt->cpumask = cpu_possible_mask;
>> +
>> +	printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
>> +		    bc_evt->name, bc_evt->mult, bc_evt->shift, cpu);
>> +
>> +	clockevents_register_device(bc_evt);
>> +}
>> +
>>  static void __init init_decrementer_clockevent(void)
>>  {
>>  	int cpu = smp_processor_id();
>> @@ -854,6 +896,19 @@ static void __init init_decrementer_clockevent(void)
>>  	register_decrementer_clockevent(cpu);
>>  }
>>  
>> +static void __init init_broadcast_clockevent(void)
>> +{
>> +	int cpu = smp_processor_id();
>> +
>> +	clockevents_calc_mult_shift(&broadcast_clockevent, ppc_tb_freq, 4);
>> +
>> +	broadcast_clockevent.max_delta_ns =
>> +		clockevent_delta2ns(DECREMENTER_MAX, &broadcast_clockevent);
>> +	broadcast_clockevent.min_delta_ns =
>> +		clockevent_delta2ns(2, &broadcast_clockevent);
> 
> clockevents_config()

Right, I will change this to call clockevents_config(). I see that this
needs to be done during the initialization of the decrementer as well.
Will do the same.
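
For illustration, the arithmetic that the open-coded min_delta_ns/max_delta_ns setup performs, and that a single config helper would subsume, can be modelled in userspace roughly as below. This is a simplified sketch, not the kernel code: the model_* names are hypothetical, the shift is passed in rather than computed, and the kernel's rounding and overflow handling are omitted.

```c
#include <stdint.h>

#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Hypothetical container for the derived conversion parameters. */
struct model_clockevent_cfg {
	uint32_t mult;
	unsigned int shift;
	uint64_t min_delta_ns;
	uint64_t max_delta_ns;
};

/* mult is chosen so that ticks = (ns * mult) >> shift. */
uint32_t model_calc_mult(uint32_t freq_hz, unsigned int shift)
{
	return (uint32_t)(((uint64_t)freq_hz << shift) / MODEL_NSEC_PER_SEC);
}

/* Inverse conversion: device ticks to nanoseconds (truncating). */
uint64_t model_delta2ns(uint64_t ticks, uint32_t mult, unsigned int shift)
{
	return (ticks << shift) / mult;
}

/* Derive mult and both nanosecond limits from tick limits in one call. */
void model_config(struct model_clockevent_cfg *c, uint32_t freq_hz,
		  unsigned int shift, uint64_t min_ticks, uint64_t max_ticks)
{
	c->shift = shift;
	c->mult = model_calc_mult(freq_hz, shift);
	c->min_delta_ns = model_delta2ns(min_ticks, c->mult, shift);
	c->max_delta_ns = model_delta2ns(max_ticks, c->mult, shift);
}
```

With an assumed 512 MHz timebase, a shift of 24, and tick limits of 2 and 0x7fffffff, this yields a minimum of 3 ns and a maximum of roughly 4.19 s.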

Thank you

Regards
Preeti U Murthy
> 
> Thanks,
> 
> 	tglx
> 


* Re: [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
  2013-12-02 15:07     ` Preeti U Murthy
@ 2013-12-02 18:00       ` Thomas Gleixner
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2013-12-02 18:00 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: michael, r58472, shangw, arnd, linux-pm, geoff, fweisbec,
	linux-kernel, rostedt, deepthi, rjw, paul.gortmaker, paulus,
	srivatsa.bhat, schwidefsky, john.stultz, paulmck, linuxppc-dev,
	chenhui.zhao

On Mon, 2 Dec 2013, Preeti U Murthy wrote:
> On 11/29/2013 08:09 PM, Thomas Gleixner wrote:
> > This supports HIGHRES and NO_HZ if done right, without polling at
> > all. So you can even let the last CPU which handles the broadcast
> > hrtimer go for a long sleep, just not in the deepest idle state.
> 
> Thank you for the review. The above points are all valid. I will rework
> the design to:
> 
> 1. Eliminate the concept of a broadcast CPU and integrate its
> functionality in the timekeeping CPU.
> 
> 2. Avoid polling by using IPIs to communicate the next wakeup of the
> CPUs in deep idle state so as to reprogram the broadcast hrtimer.
> 
> 3. Make this feature generic and not arch-specific.

Great. If you need help with the generic bits, please let me know.

Thanks,

	tglx


Thread overview: 15+ messages
2013-11-29 10:41 [PATCH V4 0/9] cpuidle/ppc: Enable deep idle states on PowerNV Preeti U Murthy
2013-11-29 10:41 ` [PATCH V4 1/9] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message Preeti U Murthy
2013-11-29 10:41 ` [PATCH V4 2/9] powerpc: Implement tick broadcast IPI as a fixed " Preeti U Murthy
2013-11-29 10:42 ` [PATCH V4 3/9] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines Preeti U Murthy
2013-11-29 10:42 ` [PATCH V4 4/9] powernv/cpuidle: Add context management for Fast Sleep Preeti U Murthy
2013-11-29 10:42 ` [PATCH V4 5/9] powermgt: Add OPAL call to resync timebase on wakeup Preeti U Murthy
2013-11-29 10:43 ` [PATCH V4 6/9] cpuidle/ppc: Add basic infrastructure to enable the broadcast framework on ppc Preeti U Murthy
2013-11-29 11:58   ` Thomas Gleixner
2013-12-02 15:27     ` Preeti U Murthy
2013-11-29 10:43 ` [PATCH V4 7/9] cpuidle/powernv: Add "Fast-Sleep" CPU idle state Preeti U Murthy
2013-11-29 14:39   ` Thomas Gleixner
2013-12-02 15:07     ` Preeti U Murthy
2013-12-02 18:00       ` Thomas Gleixner
2013-11-29 10:43 ` [PATCH V4 8/9] cpuidle/ppc: Nominate new broadcast cpu on hotplug of the old Preeti U Murthy
2013-11-29 10:43 ` [PATCH V4 9/9] cpuidle/powernv: Parse device tree to setup idle states Preeti U Murthy
