LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: Disable sleep states on P7+
From: Deepthi Dharwar @ 2014-01-15 11:02 UTC (permalink / raw)
  To: Steven Pratt; +Cc: linuxppc-dev
In-Reply-To: <52D54B60.4090807@austin.ibm.com>

Hi Steven,

On 01/14/2014 08:06 PM, Steven Pratt wrote:
> I am looking for info on when and how we are able to disable power saving features of current (P7, P7+) chips in order to reduce latency. This is often done in latency sensitive applications when

 power consumption is not an issue. On Intel boxes we can disable
P-state frequency changes as well as disabling C-State or sleep state
changes. In fact we can control how deep a sleep the processor can go
into.

I know we have control Dynamic Processor Scaling and Idle Power Savings,
but what states do these really affect?  Can I really disable Nap mode
of a processor? If so how?  Can I disable even the lightest winkle mode?

Looking for current information (read RHEL 6 and SLES11), future changes
are interesting.
> 

On POWERVM platforms idle states currently supported are:
Snooze - reducing thread priority.
Nap
Sleep.

Snooze and Nap can be controlled through in-band kernel mechanisms and
Sleep state through AEM.

Currently you can turn off Idle Power Savings mode and run in dynamic
processor scaling mode. By doing so you will disable entry into Sleep
state on all CPUS.

If you further want to disable NAP, then you could just boot the kernel
with powersave=off. This will disable the entry into any of the idle
state like nap and just reduce the priority of the thread when there is
no work to be done. This is part of cpuidle framework which is available
on SLES 11 SP3 and RHEL7.

In the newer kernels cpuidle framework is adopted for POWERVM platform
but in case if you are using RHEL 6 or SLES 11 SP1/2, then you could use
the ppc64_cpu util and set a high smt-snooze-delay value say 1000.first

#> ppc64_cpu --smt-snooze-delay=1000.

smt-snooze-delay variable potentially delays entry to NAP state.
So if the idle time predicted on a cpu = 1000us and smt-snooze-delay is
set to 100 (which is default value), then on RHEL 6 and SLES 11 SP1/2
kernels cpus would reduce the thread priority and spin for  first 100us
and if the cpu continues to be idle further then automatically go to NAP
state for remaining (1000-100us) time.

By setting a very high value, one would always be looping  This could
potentially delay ure entry to NAP state and effectively disable entry
into NAP state most of the time.

Please let me know if you have any queries around it.

Regards,
Deepthi






> Steve
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2014-01-15 10:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list
In-Reply-To: <CA+55aFyfY5kEezwx8-jhgdgOX+EG__v3QRZmGnGstDpUP-UVWQ@mail.gmail.com>

On Wed, 2014-01-15 at 15:05 +0700, Linus Torvalds wrote:
> On Wed, Jan 15, 2014 at 12:01 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > My original intend was to put it in powerpc-next and then shoot it to
> > stable, but it got a tad annoying (due to churn it needs to be applied
> > at least on rc4 or later while my next is at rc1 and clean that way), so
> > I put it in the merge branch.
> 
> Quite frankly, I'll prefer to not merge it now, and then 3.13 will get
> it from stable, when it does things like this.
> 
> Partly because it fixes a power-only bug, but potentially changes
> non-power behavior. If it was all in arch/powerpc, I wouldn't mind.

Right, I wasn't too comfortable either. I'll resend the pull request
after the merge window is open.

Cheers,
Ben.

>                Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* [PATCH V5 8/8] cpuidle/powernv: Parse device tree to setup idle states
From: Preeti U Murthy @ 2014-01-15  8:10 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

Add deep idle states such as nap and fast sleep to the cpuidle state table
only if they are discovered from the device tree during cpuidle initialization.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 drivers/cpuidle/cpuidle-powernv.c |   81 +++++++++++++++++++++++++++++--------
 1 file changed, 64 insertions(+), 17 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index e3aa62f..b01987d 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -12,10 +12,17 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/clockchips.h>
+#include <linux/of.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
 
+/* Flags and constants used in PowerNV platform */
+
+#define MAX_POWERNV_IDLE_STATES	8
+#define IDLE_USE_INST_NAP	0x00010000 /* Use nap instruction */
+#define IDLE_USE_INST_SLEEP	0x00020000 /* Use sleep instruction */
+
 struct cpuidle_driver powernv_idle_driver = {
 	.name             = "powernv_idle",
 	.owner            = THIS_MODULE,
@@ -84,7 +91,7 @@ static int fastsleep_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state powernv_states[] = {
+static struct cpuidle_state powernv_states[MAX_POWERNV_IDLE_STATES] = {
 	{ /* Snooze */
 		.name = "snooze",
 		.desc = "snooze",
@@ -92,20 +99,6 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 0,
 		.target_residency = 0,
 		.enter = &snooze_loop },
-	{ /* NAP */
-		.name = "NAP",
-		.desc = "NAP",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &nap_loop },
-	 { /* Fastsleep */
-		.name = "fastsleep",
-		.desc = "fastsleep",
-		.flags = CPUIDLE_FLAG_TIME_VALID,
-		.exit_latency = 10,
-		.target_residency = 100,
-		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,
@@ -166,19 +159,73 @@ static int powernv_cpuidle_driver_init(void)
 	return 0;
 }
 
+static int powernv_add_idle_states(void)
+{
+	struct device_node *power_mgt;
+	struct property *prop;
+	int nr_idle_states = 1; /* Snooze */
+	int dt_idle_states;
+	u32 *flags;
+	int i;
+
+	/* Currently we have snooze statically defined */
+
+	power_mgt = of_find_node_by_path("/ibm,opal/power-mgt");
+	if (!power_mgt) {
+		pr_warn("opal: PowerMgmt Node not found\n");
+		return nr_idle_states;
+	}
+
+	prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+	if (!prop) {
+		pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+		return nr_idle_states;
+	}
+
+	dt_idle_states = prop->length / sizeof(u32);
+	flags = (u32 *) prop->value;
+
+	for (i = 0; i < dt_idle_states; i++) {
+
+		if (flags[i] & IDLE_USE_INST_NAP) {
+			/* Add NAP state */
+			strcpy(powernv_states[nr_idle_states].name, "Nap");
+			strcpy(powernv_states[nr_idle_states].desc, "Nap");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 10;
+			powernv_states[nr_idle_states].target_residency = 100;
+			powernv_states[nr_idle_states].enter = &nap_loop;
+			nr_idle_states++;
+		}
+
+		if (flags[i] & IDLE_USE_INST_SLEEP) {
+			/* Add FASTSLEEP state */
+			strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+			strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+			powernv_states[nr_idle_states].flags = CPUIDLE_FLAG_TIME_VALID;
+			powernv_states[nr_idle_states].exit_latency = 300;
+			powernv_states[nr_idle_states].target_residency = 1000000;
+			powernv_states[nr_idle_states].enter = &fastsleep_loop;
+			nr_idle_states++;
+		}
+	}
+
+	return nr_idle_states;
+}
+
 /*
  * powernv_idle_probe()
  * Choose state table for shared versus dedicated partition
  */
 static int powernv_idle_probe(void)
 {
-
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		return -ENODEV;
 
 	if (firmware_has_feature(FW_FEATURE_OPALv3)) {
 		cpuidle_state_table = powernv_states;
-		max_idle_state = ARRAY_SIZE(powernv_states);
+		/* Device tree can indicate more idle states */
+		max_idle_state = powernv_add_idle_states();
  	} else
  		return -ENODEV;
 

^ permalink raw reply related

* [PATCH V5 7/8] cpuidle/powernv: Add "Fast-Sleep" CPU idle state
From: Preeti U Murthy @ 2014-01-15  8:10 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

Fast sleep is one of the deep idle states on Power8 in which local timers of
CPUs stop. On PowerPC we do not have an external clock device which can
handle wakeup of such CPUs. Now that we have the support in the tick broadcast
framework for archs that do not sport such a device and the low level support
for fast sleep, enable it in the cpuidle framework on PowerNV.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/Kconfig              |    2 ++
 arch/powerpc/kernel/time.c        |    2 +-
 drivers/cpuidle/cpuidle-powernv.c |   39 +++++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b44b52c..cafa788 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,6 +129,8 @@ config PPC
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
+	select GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select HAVE_MOD_ARCH_SPECIFIC
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 42cb603..d9efd93 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -106,7 +106,7 @@ struct clock_event_device decrementer_clockevent = {
 	.irq            = 0,
 	.set_next_event = decrementer_set_next_event,
 	.set_mode       = decrementer_set_mode,
-	.features       = CLOCK_EVT_FEAT_ONESHOT,
+	.features       = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP,
 };
 EXPORT_SYMBOL(decrementer_clockevent);
 
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 78fd174..e3aa62f 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -11,6 +11,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <linux/notifier.h>
+#include <linux/clockchips.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -49,6 +50,37 @@ static int nap_loop(struct cpuidle_device *dev,
 	return index;
 }
 
+static int fastsleep_loop(struct cpuidle_device *dev,
+				struct cpuidle_driver *drv,
+				int index)
+{
+	int cpu = dev->cpu;
+	unsigned long old_lpcr = mfspr(SPRN_LPCR);
+	unsigned long new_lpcr;
+
+	new_lpcr = old_lpcr;
+	new_lpcr &= ~(LPCR_MER | LPCR_PECE); /* lpcr[mer] must be 0 */
+
+	/* exit powersave upon external interrupt, but not decrementer
+	 * interrupt, Emulate sleep.
+	 */
+	new_lpcr |= LPCR_PECE0;
+
+	if (clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu)) {
+		new_lpcr |= LPCR_PECE1;
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_nap();
+	} else {
+		mtspr(SPRN_LPCR, new_lpcr);
+		power7_sleep();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+	}
+
+	mtspr(SPRN_LPCR, old_lpcr);
+
+	return index;
+}
+
 /*
  * States for dedicated partition case.
  */
@@ -67,6 +99,13 @@ static struct cpuidle_state powernv_states[] = {
 		.exit_latency = 10,
 		.target_residency = 100,
 		.enter = &nap_loop },
+	 { /* Fastsleep */
+		.name = "fastsleep",
+		.desc = "fastsleep",
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 10,
+		.target_residency = 100,
+		.enter = &fastsleep_loop },
 };
 
 static int powernv_cpuidle_add_cpu_notifier(struct notifier_block *n,

^ permalink raw reply related

* [PATCH V5 6/8] time/cpuidle: Support in tick broadcast framework in the absence of external clock device
From: Preeti U Murthy @ 2014-01-15  8:09 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

On some architectures, in certain CPU deep idle states the local timers stop.
An external clock device is used to wakeup these CPUs. The kernel support for the
wakeup of these CPUs is provided by the tick broadcast framework by using the
external clock device as the wakeup source.

However not all implementations of architectures provide such an external
clock device such as some PowerPC ones. This patch includes support in the
broadcast framework to handle the wakeup of the CPUs in deep idle states on such
systems by queuing a hrtimer on one of the CPUs, meant to handle the wakeup of
CPUs in deep idle states. This CPU is identified as the bc_cpu.

Each time the hrtimer expires, it is reprogrammed for the next wakeup of the
CPUs in deep idle state after handling broadcast. However when a CPU is about
to enter  deep idle state with its wakeup time earlier than the time at which
the hrtimer is currently programmed, it *becomes the new bc_cpu* and restarts
the hrtimer on itself. This way the job of doing broadcast is handed around to
the CPUs that ask for the earliest wakeup just before entering deep idle
state. This is consistent with what happens in cases where an external clock
device is present. The smp affinity of this clock device is set to the CPU
with the earliest wakeup.

The important point here is that the bc_cpu cannot enter deep idle state
since it has a hrtimer queued to wakeup the other CPUs in deep idle. Hence it
cannot have its local timer stopped. Therefore for such a CPU, the
BROADCAST_ENTER notification has to fail implying that it cannot enter deep
idle state. On architectures where an external clock device is present, all
CPUs can enter deep idle.

During hotplug of the bc_cpu, the job of doing a broadcast is assigned to the
first cpu in the broadcast mask. This newly nominated bc_cpu is woken up by
an IPI so as to queue the above mentioned hrtimer on it.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 include/linux/clockchips.h   |    4 -
 kernel/time/clockevents.c    |    9 +-
 kernel/time/tick-broadcast.c |  192 ++++++++++++++++++++++++++++++++++++++----
 kernel/time/tick-internal.h  |    8 +-
 4 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 493aa02..bbda37b 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -186,9 +186,9 @@ static inline int tick_check_broadcast_expired(void) { return 0; }
 #endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
-extern void clockevents_notify(unsigned long reason, void *arg);
+extern int clockevents_notify(unsigned long reason, void *arg);
 #else
-static inline void clockevents_notify(unsigned long reason, void *arg) {}
+static inline int clockevents_notify(unsigned long reason, void *arg) {}
 #endif
 
 #else /* CONFIG_GENERIC_CLOCKEVENTS_BUILD */
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 086ad60..d61404e 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -524,12 +524,13 @@ void clockevents_resume(void)
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 /**
  * clockevents_notify - notification about relevant events
+ * Returns non zero on error.
  */
-void clockevents_notify(unsigned long reason, void *arg)
+int clockevents_notify(unsigned long reason, void *arg)
 {
 	struct clock_event_device *dev, *tmp;
 	unsigned long flags;
-	int cpu;
+	int cpu, ret = 0;
 
 	raw_spin_lock_irqsave(&clockevents_lock, flags);
 
@@ -542,11 +543,12 @@ void clockevents_notify(unsigned long reason, void *arg)
 
 	case CLOCK_EVT_NOTIFY_BROADCAST_ENTER:
 	case CLOCK_EVT_NOTIFY_BROADCAST_EXIT:
-		tick_broadcast_oneshot_control(reason);
+		ret = tick_broadcast_oneshot_control(reason);
 		break;
 
 	case CLOCK_EVT_NOTIFY_CPU_DYING:
 		tick_handover_do_timer(arg);
+		tick_handover_broadcast_cpu(arg);
 		break;
 
 	case CLOCK_EVT_NOTIFY_SUSPEND:
@@ -585,6 +587,7 @@ void clockevents_notify(unsigned long reason, void *arg)
 		break;
 	}
 	raw_spin_unlock_irqrestore(&clockevents_lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(clockevents_notify);
 
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 9532690..1c23912 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/smp.h>
 #include <linux/module.h>
+#include <linux/slab.h>
 
 #include "tick-internal.h"
 
@@ -35,6 +36,15 @@ static cpumask_var_t tmpmask;
 static DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
 static int tick_broadcast_force;
 
+/*
+ * Helper variables for handling broadcast in the absence of a
+ * tick_broadcast_device.
+ * */
+static struct hrtimer *bc_hrtimer;
+static int bc_cpu = -1;
+static ktime_t bc_next_wakeup;
+static int hrtimer_initialized = 0;
+
 #ifdef CONFIG_TICK_ONESHOT
 static void tick_broadcast_clear_oneshot(int cpu);
 #else
@@ -528,6 +538,20 @@ static int tick_broadcast_set_event(struct clock_event_device *bc, int cpu,
 	return ret;
 }
 
+static void tick_broadcast_set_next_wakeup(int cpu, ktime_t expires, int force)
+{
+	struct clock_event_device *bc;
+
+	bc = tick_broadcast_device.evtdev;
+
+	if (bc) {
+		tick_broadcast_set_event(bc, cpu, expires, force);
+	} else {
+		hrtimer_start(bc_hrtimer, expires, HRTIMER_MODE_ABS_PINNED);
+		bc_cpu = cpu;
+	}
+}
+
 int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {
 	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);
@@ -558,15 +582,13 @@ void tick_check_oneshot_broadcast(int cpu)
 /*
  * Handle oneshot mode broadcasting
  */
-static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
+static int tick_oneshot_broadcast(void)
 {
 	struct tick_device *td;
 	ktime_t now, next_event;
 	int cpu, next_cpu = 0;
 
-	raw_spin_lock(&tick_broadcast_lock);
-again:
-	dev->next_event.tv64 = KTIME_MAX;
+	bc_next_wakeup.tv64 = KTIME_MAX;
 	next_event.tv64 = KTIME_MAX;
 	cpumask_clear(tmpmask);
 	now = ktime_get();
@@ -620,34 +642,95 @@ again:
 	 * in the event mask
 	 */
 	if (next_event.tv64 != KTIME_MAX) {
-		/*
-		 * Rearm the broadcast device. If event expired,
-		 * repeat the above
-		 */
-		if (tick_broadcast_set_event(dev, next_cpu, next_event, 0))
+		bc_next_wakeup = next_event;
+	}
+
+	return next_cpu;
+}
+
+/*
+ * Handler in oneshot mode for the external clock device
+ */
+static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
+{
+	int next_cpu;
+
+	raw_spin_lock(&tick_broadcast_lock);
+
+again:	next_cpu = tick_oneshot_broadcast();
+	/*
+	 * Rearm the broadcast device. If event expired,
+	 * repeat the above
+	 */
+	if (bc_next_wakeup.tv64 != KTIME_MAX)
+		if (tick_broadcast_set_event(dev, next_cpu, bc_next_wakeup, 0))
 			goto again;
+
+	raw_spin_unlock(&tick_broadcast_lock);
+}
+
+/*
+ * Handler in oneshot mode for the hrtimer queued when there is no external
+ * clock device.
+ */
+static enum hrtimer_restart handle_broadcast(struct hrtimer *hrtmr)
+{
+	ktime_t now, interval;
+
+	raw_spin_lock(&tick_broadcast_lock);
+	tick_oneshot_broadcast();
+
+	now = ktime_get();
+
+	if (bc_next_wakeup.tv64 != KTIME_MAX) {
+		interval = ktime_sub(bc_next_wakeup, now);
+		hrtimer_forward_now(bc_hrtimer, interval);
+		raw_spin_unlock(&tick_broadcast_lock);
+		return HRTIMER_RESTART;
 	}
 	raw_spin_unlock(&tick_broadcast_lock);
+	return HRTIMER_NORESTART;
+}
+
+/* The CPU could be asked to take over from the previous bc_cpu,
+ * if it is being hotplugged out.
+ */
+static void tick_broadcast_exit_check(int cpu)
+{
+	if (cpu == bc_cpu)
+		hrtimer_start(bc_hrtimer, bc_next_wakeup,
+				HRTIMER_MODE_ABS_PINNED);
+}
+
+static int can_enter_broadcast(int cpu)
+{
+	return cpu != bc_cpu;
 }
 
 /*
  * Powerstate information: The system enters/leaves a state, where
  * affected devices might stop
+ *
+ * Returns non zero value if the entry into broadcast framework failed
+ * This scenario can arise on certain implementations of archs which do
+ * not have an external clock device to do the broadcast. Then one of the
+ * CPUs get nominated to handle broadcasting.
+ * Such a CPU cannot enter a state where its tick device can stop.
  */
-void tick_broadcast_oneshot_control(unsigned long reason)
+int tick_broadcast_oneshot_control(unsigned long reason)
 {
-	struct clock_event_device *bc, *dev;
+	struct clock_event_device *dev;
 	struct tick_device *td;
 	unsigned long flags;
 	ktime_t now;
-	int cpu;
+	int cpu, ret = 0;
 
 	/*
 	 * Periodic mode does not care about the enter/exit of power
 	 * states
 	 */
 	if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
-		return;
+		return ret;
 
 	/*
 	 * We are called with preemtion disabled from the depth of the
@@ -658,9 +741,8 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	dev = td->evtdev;
 
 	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
-		return;
+		return ret;
 
-	bc = tick_broadcast_device.evtdev;
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 	if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
@@ -676,12 +758,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 			 * woken by the IPI right away.
 			 */
 			if (!cpumask_test_cpu(cpu, tick_broadcast_force_mask) &&
-			    dev->next_event.tv64 < bc->next_event.tv64)
-				tick_broadcast_set_event(bc, cpu, dev->next_event, 1);
+			    dev->next_event.tv64 < bc_next_wakeup.tv64) {
+				bc_next_wakeup = dev->next_event;
+				tick_broadcast_set_next_wakeup(cpu, dev->next_event, 1);
+			}
+
+			if (!can_enter_broadcast(cpu)) {
+				cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
+				clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
+				ret = 1;
+			}
 		}
 	} else {
 		if (cpumask_test_and_clear_cpu(cpu, tick_broadcast_oneshot_mask)) {
 			clockevents_set_mode(dev, CLOCK_EVT_MODE_ONESHOT);
+
+			tick_broadcast_exit_check(cpu);
 			/*
 			 * The cpu which was handling the broadcast
 			 * timer marked this cpu in the broadcast
@@ -746,6 +838,7 @@ void tick_broadcast_oneshot_control(unsigned long reason)
 	}
 out:
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 /*
@@ -821,17 +914,57 @@ void tick_broadcast_switch_to_oneshot(void)
 {
 	struct clock_event_device *bc;
 	unsigned long flags;
+	int cpu = smp_processor_id();
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 
+	bc_next_wakeup.tv64 = KTIME_MAX;
+
 	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
 	bc = tick_broadcast_device.evtdev;
-	if (bc)
+	if (bc) {
 		tick_broadcast_setup_oneshot(bc);
+		bc_next_wakeup = bc->next_event;
+	} else if (hrtimer_initialized) {
+
+		/*
+		 * There may be CPUs waiting for periodic broadcast. We need
+		 * to set the oneshot bits for those and program the hrtimer
+		 * to fire at the next tick period.
+ 		 */
+		cpumask_copy(tmpmask, tick_broadcast_mask);
+		cpumask_clear_cpu(cpu, tmpmask);
+		cpumask_or(tick_broadcast_oneshot_mask,
+			   tick_broadcast_oneshot_mask, tmpmask);
+
+		if (!cpumask_empty(tmpmask)) {
+			tick_broadcast_init_next_event(tmpmask,
+						       tick_next_period);
+			hrtimer_start(bc_hrtimer, tick_next_period, HRTIMER_MODE_ABS_PINNED);
+			bc_next_wakeup = tick_next_period;
+		}
+	}
 
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
 
+/*
+ * Use the broadcast function itself to wake up the new broadcast cpu
+ */
+void tick_handover_broadcast_cpu(int *cpup)
+{
+	struct tick_device *td;
+
+	if (*cpup == bc_cpu) {
+		int cpu = cpumask_first(tick_broadcast_oneshot_mask);
+
+		bc_cpu = (cpu < nr_cpu_ids) ? cpu : -1;
+		if (bc_cpu != -1) {
+			td = &per_cpu(tick_cpu_device, bc_cpu);
+			td->evtdev->broadcast(cpumask_of(bc_cpu));
+		}
+	}
+}
 
 /*
  * Remove a dead CPU from broadcasting
@@ -868,8 +1001,29 @@ int tick_broadcast_oneshot_active(void)
 bool tick_broadcast_oneshot_available(void)
 {
 	struct clock_event_device *bc = tick_broadcast_device.evtdev;
+	bool ret = true;
+	unsigned long flags;
 
-	return bc ? bc->features & CLOCK_EVT_FEAT_ONESHOT : false;
+	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
+
+	if (bc) {
+		ret = bc->features & CLOCK_EVT_FEAT_ONESHOT;
+	} else if (!hrtimer_initialized) {
+		/* An alternative to tick_broadcast_device on archs which do not have
+		 * an external device
+		 */
+		bc_hrtimer = kmalloc(sizeof(*bc_hrtimer), GFP_NOWAIT);
+		if (!bc_hrtimer) {
+			ret = false;
+			goto out;
+		}
+		hrtimer_init(bc_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+		bc_hrtimer->function = handle_broadcast;
+		hrtimer_initialized = 1;
+	}
+
+out:	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	return ret;
 }
 
 #endif
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 18e71f7..9e42177 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -46,23 +46,25 @@ extern int tick_switch_to_oneshot(void (*handler)(struct clock_event_device *));
 extern void tick_resume_oneshot(void);
 # ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 extern void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
-extern void tick_broadcast_oneshot_control(unsigned long reason);
+extern int tick_broadcast_oneshot_control(unsigned long reason);
 extern void tick_broadcast_switch_to_oneshot(void);
 extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
 extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 extern int tick_broadcast_oneshot_active(void);
 extern void tick_check_oneshot_broadcast(int cpu);
+extern void tick_handover_broadcast_cpu(int *cpup);
 bool tick_broadcast_oneshot_available(void);
 # else /* BROADCAST */
 static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { }
 static inline void tick_broadcast_switch_to_oneshot(void) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_broadcast_oneshot_active(void) { return 0; }
 static inline void tick_check_oneshot_broadcast(int cpu) { }
+static inline void tick_handover_broadcast_cpu(int *cpup) {}
 static inline bool tick_broadcast_oneshot_available(void) { return true; }
 # endif /* !BROADCAST */
 
@@ -87,7 +89,7 @@ static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 {
 	BUG();
 }
-static inline void tick_broadcast_oneshot_control(unsigned long reason) { }
+static inline int tick_broadcast_oneshot_control(unsigned long reason) { }
 static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
 static inline int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {

^ permalink raw reply related

* [PATCH V5 5/8] powermgt: Add OPAL call to resync timebase on wakeup
From: Preeti U Murthy @ 2014-01-15  8:09 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

During "Fast-sleep" and deeper power savings state, decrementer and
timebase could be stopped making it out of sync with rest
of the cores in the system.

Add a firmware call to request platform to resync timebase
using low level platform methods.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/opal.h                |    2 ++
 arch/powerpc/kernel/exceptions-64s.S           |    2 +-
 arch/powerpc/kernel/idle_power7.S              |   27 ++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 +
 4 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 033c06b..a662d06 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -132,6 +132,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_FLASH_VALIDATE			76
 #define OPAL_FLASH_MANAGE			77
 #define OPAL_FLASH_UPDATE			78
+#define OPAL_RESYNC_TIMEBASE			79
 
 #ifndef __ASSEMBLY__
 
@@ -763,6 +764,7 @@ extern void opal_flash_init(void);
 extern int opal_machine_check(struct pt_regs *regs);
 
 extern void opal_shutdown(void);
+extern int opal_resync_timebase(void);
 
 extern void opal_lpc_init(void);
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b8139fb..91e6417 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -145,7 +145,7 @@ BEGIN_FTR_SECTION
 
 	/* Fast Sleep wakeup on PowerNV */
 8:	GET_PACA(r13)
-	b 	.power7_wakeup_loss
+	b 	.power7_wakeup_tb_loss
 
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index e4bbca2..34c71e8 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -17,6 +17,7 @@
 #include <asm/ppc-opcode.h>
 #include <asm/hw_irq.h>
 #include <asm/kvm_book3s_asm.h>
+#include <asm/opal.h>
 
 #undef DEBUG
 
@@ -124,6 +125,32 @@ _GLOBAL(power7_sleep)
 	b	power7_powersave_common
 	/* No return */
 
+_GLOBAL(power7_wakeup_tb_loss)
+	ld	r2,PACATOC(r13);
+	ld	r1,PACAR1(r13)
+
+	/* Time base re-sync */
+	li	r0,OPAL_RESYNC_TIMEBASE
+	LOAD_REG_ADDR(r11,opal);
+	ld	r12,8(r11);
+	ld	r2,0(r11);
+	mtctr	r12
+	bctrl
+
+	/* TODO: Check r3 for failure */
+
+	REST_NVGPRS(r1)
+	REST_GPR(2, r1)
+	ld	r3,_CCR(r1)
+	ld	r4,_MSR(r1)
+	ld	r5,_NIP(r1)
+	addi	r1,r1,INT_FRAME_SIZE
+	mtcr	r3
+	mfspr	r3,SPRN_SRR1		/* Return SRR1 */
+	mtspr	SPRN_SRR1,r4
+	mtspr	SPRN_SRR0,r5
+	rfid
+
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)
 	REST_NVGPRS(r1)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index e780650..ddfe95a 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -126,3 +126,4 @@ OPAL_CALL(opal_return_cpu,			OPAL_RETURN_CPU);
 OPAL_CALL(opal_validate_flash,			OPAL_FLASH_VALIDATE);
 OPAL_CALL(opal_manage_flash,			OPAL_FLASH_MANAGE);
 OPAL_CALL(opal_update_flash,			OPAL_FLASH_UPDATE);
+OPAL_CALL(opal_resync_timebase,			OPAL_RESYNC_TIMEBASE);

^ permalink raw reply related

* [PATCH V5 4/8] powernv/cpuidle: Add context management for Fast Sleep
From: Preeti U Murthy @ 2014-01-15  8:09 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

From: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/include/asm/processor.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S |   10 ++++-
 arch/powerpc/kernel/idle_power7.S    |   63 ++++++++++++++++++++++++----------
 3 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 027fefd..22e547a 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -444,6 +444,7 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern void power7_nap(void);
+extern void power7_sleep(void);
 extern void flush_instruction_cache(void);
 extern void hard_reset_now(void);
 extern void poweroff_now(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 9f905e4..b8139fb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -121,9 +121,10 @@ BEGIN_FTR_SECTION
 	cmpwi	cr1,r13,2
 	/* Total loss of HV state is fatal, we could try to use the
 	 * PIR to locate a PACA, then use an emergency stack etc...
-	 * but for now, let's just stay stuck here
+	 * OPAL v3 based powernv platforms have new idle states
+	 * which fall in this catagory.
 	 */
-	bgt	cr1,.
+	bgt	cr1,8f
 	GET_PACA(r13)
 
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -141,6 +142,11 @@ BEGIN_FTR_SECTION
 	beq	cr1,2f
 	b	.power7_wakeup_noloss
 2:	b	.power7_wakeup_loss
+
+	/* Fast Sleep wakeup on PowerNV */
+8:	GET_PACA(r13)
+	b 	.power7_wakeup_loss
+
 9:
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif /* CONFIG_PPC_P7_NAP */
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index 847e40e..e4bbca2 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -20,17 +20,27 @@
 
 #undef DEBUG
 
-	.text
+/* Idle state entry routines */
 
-_GLOBAL(power7_idle)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDRBASE(r3,powersave_nap)
-	lwz	r4,ADDROFF(powersave_nap)(r3)
-	cmpwi	0,r4,0
-	beqlr
-	/* fall through */
+#define	IDLE_STATE_ENTER_SEQ(IDLE_INST)				\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);					\
+	ptesync;						\
+	ld	r0,0(r1);					\
+1:	cmp	cr0,r0,r0;					\
+	bne	1b;						\
+	IDLE_INST;						\
+	b	.
 
-_GLOBAL(power7_nap)
+	.text
+
+/*
+ * Pass requested state in r3:
+ * 	0 - nap
+ * 	1 - sleep
+ */
+_GLOBAL(power7_powersave_common)
+	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
 	 * stick a pointer to it in PACAR1. We really only
@@ -79,8 +89,8 @@ _GLOBAL(power7_nap)
 	/* Continue saving state */
 	SAVE_GPR(2, r1)
 	SAVE_NVGPRS(r1)
-	mfcr	r3
-	std	r3,_CCR(r1)
+	mfcr	r4
+	std	r4,_CCR(r1)
 	std	r9,_MSR(r1)
 	std	r1,PACAR1(r13)
 
@@ -89,15 +99,30 @@ _GLOBAL(power7_nap)
 	li	r4,KVM_HWTHREAD_IN_NAP
 	stb	r4,HSTATE_HWTHREAD_STATE(r13)
 #endif
+	cmpwi	cr0,r3,1
+	beq	2f
+	IDLE_STATE_ENTER_SEQ(PPC_NAP)
+	/* No return */
+2:	IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
+	/* No return */
 
-	/* Magic NAP mode enter sequence */
-	std	r0,0(r1)
-	ptesync
-	ld	r0,0(r1)
-1:	cmp	cr0,r0,r0
-	bne	1b
-	PPC_NAP
-	b	.
+_GLOBAL(power7_idle)
+	/* Now check if user or arch enabled NAP mode */
+	LOAD_REG_ADDRBASE(r3,powersave_nap)
+	lwz	r4,ADDROFF(powersave_nap)(r3)
+	cmpwi	0,r4,0
+	beqlr
+	/* fall through */
+
+_GLOBAL(power7_nap)
+	li	r3,0
+	b	power7_powersave_common
+	/* No return */
+
+_GLOBAL(power7_sleep)
+	li	r3,1
+	b	power7_powersave_common
+	/* No return */
 
 _GLOBAL(power7_wakeup_loss)
 	ld	r1,PACAR1(r13)

^ permalink raw reply related

* [PATCH V5 3/8] cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
From: Preeti U Murthy @ 2014-01-15  8:08 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

Split timer_interrupt(), which is the local timer interrupt handler on ppc
into routines called during regular interrupt handling and __timer_interrupt(),
which takes care of running local timers and collecting time related stats.

This will enable callers interested only in running expired local timers to
directly call into __timer_interupt(). One of the use cases of this is the
tick broadcast IPI handling in which the sleeping CPUs need to handle the local
timers that have expired.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---

 arch/powerpc/kernel/time.c |   73 +++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 42269c7..42cb603 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -478,6 +478,42 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
+static void __timer_interrupt(void)
+{
+	struct pt_regs *regs = get_irq_regs();
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+	struct clock_event_device *evt = &__get_cpu_var(decrementers);
+	u64 now;
+
+	__get_cpu_var(irq_stat).timer_irqs++;
+	trace_timer_interrupt_entry(regs);
+
+	if (test_irq_work_pending()) {
+		clear_irq_work_pending();
+		irq_work_run();
+	}
+
+	now = get_tb_or_rtc();
+	if (now >= *next_tb) {
+		*next_tb = ~(u64)0;
+		if (evt->event_handler)
+			evt->event_handler(evt);
+	} else {
+		now = *next_tb - now;
+		if (now <= DECREMENTER_MAX)
+			set_dec((int)now);
+	}
+
+#ifdef CONFIG_PPC64
+	/* collect purr register values often, for accurate calculations */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
+		cu->current_tb = mfspr(SPRN_PURR);
+	}
+#endif
+	trace_timer_interrupt_exit(regs);
+}
+
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
@@ -486,8 +522,6 @@ void timer_interrupt(struct pt_regs * regs)
 {
 	struct pt_regs *old_regs;
 	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
-	struct clock_event_device *evt = &__get_cpu_var(decrementers);
-	u64 now;
 
 	/* Ensure a positive value is written to the decrementer, or else
 	 * some CPUs will continue to take decrementer exceptions.
@@ -510,8 +544,6 @@ void timer_interrupt(struct pt_regs * regs)
 	 */
 	may_hard_irq_enable();
 
-	__get_cpu_var(irq_stat).timer_irqs++;
-
 #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
 	if (atomic_read(&ppc_n_lost_interrupts) != 0)
 		do_IRQ(regs);
@@ -520,34 +552,7 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
-	trace_timer_interrupt_entry(regs);
-
-	if (test_irq_work_pending()) {
-		clear_irq_work_pending();
-		irq_work_run();
-	}
-
-	now = get_tb_or_rtc();
-	if (now >= *next_tb) {
-		*next_tb = ~(u64)0;
-		if (evt->event_handler)
-			evt->event_handler(evt);
-	} else {
-		now = *next_tb - now;
-		if (now <= DECREMENTER_MAX)
-			set_dec((int)now);
-	}
-
-#ifdef CONFIG_PPC64
-	/* collect purr register values often, for accurate calculations */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		struct cpu_usage *cu = &__get_cpu_var(cpu_usage_array);
-		cu->current_tb = mfspr(SPRN_PURR);
-	}
-#endif
-
-	trace_timer_interrupt_exit(regs);
-
+	__timer_interrupt();
 	irq_exit();
 	set_irq_regs(old_regs);
 }
@@ -816,6 +821,10 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 /* Interrupt handler for the timer broadcast IPI */
 void tick_broadcast_ipi_handler(void)
 {
+	u64 *next_tb = &__get_cpu_var(decrementers_next_tb);
+
+	*next_tb = get_tb_or_rtc();
+	__timer_interrupt();
 }
 
 static void register_decrementer_clockevent(int cpu)

^ permalink raw reply related

* [PATCH V5 2/8] powerpc: Implement tick broadcast IPI as a fixed IPI message
From: Preeti U Murthy @ 2014-01-15  8:08 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

For scalability and performance reasons, we want the tick broadcast IPIs
to be handled as efficiently as possible. Fixed IPI messages
are one of the most efficient mechanisms available - they are faster than
the smp_call_function mechanism because the IPI handlers are fixed and hence
they don't involve costly operations such as adding IPI handlers to the target
CPU's function queue, acquiring locks for synchronization etc.

Luckily we have an unused IPI message slot, so use that to implement
tick broadcast IPIs efficiently.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
[Functions renamed to tick_broadcast* and Changelog modified by
 Preeti U. Murthy<preeti@linux.vnet.ibm.com>]
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/include/asm/time.h         |    1 +
 arch/powerpc/kernel/smp.c               |   19 +++++++++++++++----
 arch/powerpc/kernel/time.c              |    5 +++++
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 6 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9f7356b..ff51046 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_UNUSED		2
+#define PPC_MSG_TICK_BROADCAST	2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c1f2676..1d428e6 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,6 +28,7 @@ extern struct clock_event_device decrementer_clockevent;
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
 extern void GregorianDay(struct rtc_time *tm);
+extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c2bd8d6..c77c6d7 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <asm/ptrace.h>
 #include <linux/atomic.h>
 #include <asm/irq.h>
+#include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/prom.h>
@@ -145,9 +146,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t unused_action(int irq, void *data)
+static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-	/* This slot is unused and hence available for use, if needed */
+	tick_broadcast_ipi_handler();
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +169,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_UNUSED] = unused_action,
+	[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_UNUSED] = "ipi unused",
+	[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,6 +252,8 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
+		if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
+			tick_broadcast_ipi_handler();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -289,6 +292,14 @@ void arch_send_call_function_ipi_mask(const struct cpumask *mask)
 		do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
+void tick_broadcast(const struct cpumask *mask)
+{
+	unsigned int cpu;
+
+	for_each_cpu(cpu, mask)
+		do_message_pass(cpu, PPC_MSG_TICK_BROADCAST);
+}
+
 #if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
 void smp_send_debugger_break(void)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b3b1441..42269c7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -813,6 +813,11 @@ static void decrementer_set_mode(enum clock_event_mode mode,
 		decrementer_set_next_event(DECREMENTER_MAX, dev);
 }
 
+/* Interrupt handler for the timer broadcast IPI */
+void tick_broadcast_ipi_handler(void)
+{
+}
+
 static void register_decrementer_clockevent(int cpu)
 {
 	struct clock_event_device *dec = &per_cpu(decrementers, cpu);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index adf3726..8a106b4 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_UNUSED);
+	iic_request_ipi(PPC_MSG_TICK_BROADCAST);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 00d1a7c..b358bec 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
+		BUILD_BUG_ON(PPC_MSG_TICK_BROADCAST   != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related

* [PATCH V5 1/8] powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
From: Preeti U Murthy @ 2014-01-15  8:08 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev
In-Reply-To: <20140115080555.20446.27238.stgit@preeti.in.ibm.com>

From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

The IPI handlers for both PPC_MSG_CALL_FUNC and PPC_MSG_CALL_FUNC_SINGLE map
to a common implementation - generic_smp_call_function_single_interrupt(). So,
we can consolidate them and save one of the IPI message slots, (which are
precious on powerpc, since only 4 of those slots are available).

So, implement the functionality of PPC_MSG_CALL_FUNC_SINGLE using
PPC_MSG_CALL_FUNC itself and release its IPI message slot, so that it can be
used for something else in the future, if desired.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org> [For the PS3 part]
---

 arch/powerpc/include/asm/smp.h          |    2 +-
 arch/powerpc/kernel/smp.c               |   12 +++++-------
 arch/powerpc/platforms/cell/interrupt.c |    2 +-
 arch/powerpc/platforms/ps3/smp.c        |    2 +-
 4 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 084e080..9f7356b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -120,7 +120,7 @@ extern int cpu_to_core_id(int cpu);
  * in /proc/interrupts will be wrong!!! --Troy */
 #define PPC_MSG_CALL_FUNCTION   0
 #define PPC_MSG_RESCHEDULE      1
-#define PPC_MSG_CALL_FUNC_SINGLE	2
+#define PPC_MSG_UNUSED		2
 #define PPC_MSG_DEBUGGER_BREAK  3
 
 /* for irq controllers that have dedicated ipis per message (4) */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a3b64f3..c2bd8d6 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -145,9 +145,9 @@ static irqreturn_t reschedule_action(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static irqreturn_t call_function_single_action(int irq, void *data)
+static irqreturn_t unused_action(int irq, void *data)
 {
-	generic_smp_call_function_single_interrupt();
+	/* This slot is unused and hence available for use, if needed */
 	return IRQ_HANDLED;
 }
 
@@ -168,14 +168,14 @@ static irqreturn_t debug_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
 	[PPC_MSG_CALL_FUNCTION] =  call_function_action,
 	[PPC_MSG_RESCHEDULE] = reschedule_action,
-	[PPC_MSG_CALL_FUNC_SINGLE] = call_function_single_action,
+	[PPC_MSG_UNUSED] = unused_action,
 	[PPC_MSG_DEBUGGER_BREAK] = debug_ipi_action,
 };
 
 const char *smp_ipi_name[] = {
 	[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
 	[PPC_MSG_RESCHEDULE] = "ipi reschedule",
-	[PPC_MSG_CALL_FUNC_SINGLE] = "ipi call function single",
+	[PPC_MSG_UNUSED] = "ipi unused",
 	[PPC_MSG_DEBUGGER_BREAK] = "ipi debugger",
 };
 
@@ -251,8 +251,6 @@ irqreturn_t smp_ipi_demux(void)
 			generic_smp_call_function_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
 			scheduler_ipi();
-		if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNC_SINGLE))
-			generic_smp_call_function_single_interrupt();
 		if (all & IPI_MESSAGE(PPC_MSG_DEBUGGER_BREAK))
 			debug_ipi_action(0, NULL);
 	} while (info->messages);
@@ -280,7 +278,7 @@ EXPORT_SYMBOL_GPL(smp_send_reschedule);
 
 void arch_send_call_function_single_ipi(int cpu)
 {
-	do_message_pass(cpu, PPC_MSG_CALL_FUNC_SINGLE);
+	do_message_pass(cpu, PPC_MSG_CALL_FUNCTION);
 }
 
 void arch_send_call_function_ipi_mask(const struct cpumask *mask)
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 2d42f3b..adf3726 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -215,7 +215,7 @@ void iic_request_IPIs(void)
 {
 	iic_request_ipi(PPC_MSG_CALL_FUNCTION);
 	iic_request_ipi(PPC_MSG_RESCHEDULE);
-	iic_request_ipi(PPC_MSG_CALL_FUNC_SINGLE);
+	iic_request_ipi(PPC_MSG_UNUSED);
 	iic_request_ipi(PPC_MSG_DEBUGGER_BREAK);
 }
 
diff --git a/arch/powerpc/platforms/ps3/smp.c b/arch/powerpc/platforms/ps3/smp.c
index 4b35166..00d1a7c 100644
--- a/arch/powerpc/platforms/ps3/smp.c
+++ b/arch/powerpc/platforms/ps3/smp.c
@@ -76,7 +76,7 @@ static int __init ps3_smp_probe(void)
 
 		BUILD_BUG_ON(PPC_MSG_CALL_FUNCTION    != 0);
 		BUILD_BUG_ON(PPC_MSG_RESCHEDULE       != 1);
-		BUILD_BUG_ON(PPC_MSG_CALL_FUNC_SINGLE != 2);
+		BUILD_BUG_ON(PPC_MSG_UNUSED	      != 2);
 		BUILD_BUG_ON(PPC_MSG_DEBUGGER_BREAK   != 3);
 
 		for (i = 0; i < MSG_COUNT; i++) {

^ permalink raw reply related

* [PATCH V5 0/8] cpuidle/ppc: Enable deep idle states on PowerNV
From: Preeti U Murthy @ 2014-01-15  8:07 UTC (permalink / raw)
  To: daniel.lezcano, peterz, fweisbec, agraf, paul.gortmaker, paulus,
	mingo, mikey, shangw, rafael.j.wysocki, galak, benh, paulmck,
	arnd, linux-pm, rostedt, michael, john.stultz, anton, tglx,
	chenhui.zhao, deepthi, r58472, geoff, linux-kernel, srivatsa.bhat,
	schwidefsky, svaidy, linuxppc-dev

On PowerPC, when CPUs enter certain deep idle states, the local timers stop
and the time base could go out of sync with the rest of the cores in the system.

This patchset adds support to wake up CPUs in such idle states by
broadcasting IPIs to them at their next timer events using the tick broadcast
framework in the Linux kernel. We refer to these IPIs as the tick
broadcast IPIs in this patchset.

However the tick broadcast framework as it exists today makes use of an external
clock device to wakeup CPUs in such idle states. But not all implementations of
PowerPC provides such an external clock device.

Hence Patch[6/8]:
[time/cpuidle: Support in tick broadcast framework for archs without external
clock device] adds support in the tick broadcast framework for such
use cases by queuing a hrtimer on one of the CPUs which is meant to handle the wakeup
of CPUs in deep idle states.
This patch was posted separately at: https://lkml.org/lkml/2013/12/12/687.

Patches 1-3 adds support in powerpc to hook onto the tick broadcast framework.

The patchset also includes support for resyncing of time base with the rest of the
cores in the system and context management for fast sleep. PATCH[4/8] and
PATCH[5/8] address these issues.

With the required support for deep idle states thus in place, the
patchset adds "Fast-Sleep" idle state into cpuidle (Patches 7 and 8). "Fast-Sleep"
is a deep idle state on Power8 in which the above mentioned challenges
exist. Fast-Sleep can yield us significantly more power
savings than the idle states that we have in cpuidle so far.

This patchset is based on mainline commit-id:8ae516aa8b8161254d3,  and the
cpuidle driver for powernv posted by Deepthi Dharwar:
https://lkml.org/lkml/2014/1/14/172


Changes in V5:
-------------
The primary change in this version is in Patch[6/8].
As per the discussions in V4 posting of this patchset, it was decided to
refine handling the wakeup of CPUs in fast-sleep by doing the following:

1. In V4, a polling mechanism was used by the CPU handling broadcast to
find out the time of next wakeup of the CPUs in deep idle states. V5 avoids
polling by a way described under PATCH[6/8] in this patchset.

2. The mechanism of broadcast handling of CPUs in deep idle in the absence of an
external wakeup device should be generic and not arch specific code. Hence in this
version this functionality has been integrated into the tick broadcast framework in
the kernel unlike before where it was handled in powerpc specific code.

3. It was suggested that the "broadcast cpu" can be the time keeping cpu
itself. However this has challenges of its own:

 a. The time keeping cpu need not exist when all cpus are idle. Hence there
are phases in time when time keeping cpu is absent. But for the use case that
this patchset is trying to address we rely on the presence of a broadcast cpu
all the time.

 b. The nomination and un-assignment of the time keeping cpu is not protected
by a lock today and need not be as well since such is its use case in the
kernel. However we would need locks if we double up the time keeping cpu as the
broadcast cpu.

Hence the broadcast cpu is independent of the time-keeping cpu. However PATCH[6/8]
proposes a simpler solution to pick a broadcast cpu in this version.



Changes in V4:
-------------
https://lkml.org/lkml/2013/11/29/97

1. Add Fast Sleep CPU idle state on PowerNV.

2. Add the required context management for Fast Sleep and the call to OPAL
to synchronize time base after wakeup from fast sleep.

4. Add parsing of CPU idle states from the device tree to populate the
cpuidle
state table.

5. Rename ambiguous functions in the code around waking up of CPUs from fast
sleep.

6. Fixed a bug in re-programming of the hrtimer that is queued to wakeup the
CPUs in fast sleep and modified Changelogs.

7. Added the ARCH_HAS_TICK_BROADCAST option. This signifies that we have a
arch specific function to perform broadcast.


Changes in V3:
-------------
http://thread.gmane.org/gmane.linux.power-management.general/38113

1. Fix the way in which a broadcast ipi is handled on the idling cpus. Timer
handling on a broadcast ipi is being done now without missing out any timer
stats generation.

2. Fix a bug in the programming of the hrtimer meant to do broadcast. Program
it to trigger at the earlier of a "broadcast period", and the next wakeup
event. By introducing the "broadcast period" as the maximum period after
which the broadcast hrtimer can fire, we ensure that we do not miss
wakeups in corner cases.

3. On hotplug of a broadcast cpu, trigger the hrtimer meant to do broadcast
to fire immediately on the new broadcast cpu. This will ensure we do not miss
doing a broadcast pending in the nearest future.

4. Change the type of allocation from GFP_KERNEL to GFP_NOWAIT while
initializing bc_hrtimer since we are in an atomic context and cannot sleep.

5. Use the broadcast ipi to wakeup the newly nominated broadcast cpu on
hotplug of the old instead of smp_call_function_single(). This is because we
are interrupt disabled at this point and should not be using
smp_call_function_single or its children in this context to send an ipi.

6. Move GENERIC_CLOCKEVENTS_BROADCAST to arch/powerpc/Kconfig.

7. Fix coding style issues.


Changes in V2:
-------------
https://lkml.org/lkml/2013/8/14/239

1. Dynamically pick a broadcast CPU, instead of having a dedicated one.
2. Remove the constraint of having to disable tickless idle on the broadcast
CPU by queueing a hrtimer dedicated to do broadcast.



V1 posting: https://lkml.org/lkml/2013/7/25/740.

1. Added the infrastructure to wakeup CPUs in deep idle states in which the
local timers stop.
---

Preeti U Murthy (4):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines
      time/cpuidle: Support in tick broadcast framework in the absence of external clock device
      cpuidle/powernv: Add "Fast-Sleep" CPU idle state
      cpuidle/powernv: Parse device tree to setup idle states

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message

Vaidyanathan Srinivasan (2):
      powernv/cpuidle: Add context management for Fast Sleep
      powermgt: Add OPAL call to resync timebase on wakeup


 arch/powerpc/Kconfig                           |    2 
 arch/powerpc/include/asm/opal.h                |    2 
 arch/powerpc/include/asm/processor.h           |    1 
 arch/powerpc/include/asm/smp.h                 |    2 
 arch/powerpc/include/asm/time.h                |    1 
 arch/powerpc/kernel/exceptions-64s.S           |   10 +
 arch/powerpc/kernel/idle_power7.S              |   90 +++++++++--
 arch/powerpc/kernel/smp.c                      |   23 ++-
 arch/powerpc/kernel/time.c                     |   80 ++++++----
 arch/powerpc/platforms/cell/interrupt.c        |    2 
 arch/powerpc/platforms/powernv/opal-wrappers.S |    1 
 arch/powerpc/platforms/ps3/smp.c               |    2 
 drivers/cpuidle/cpuidle-powernv.c              |  106 ++++++++++++-
 include/linux/clockchips.h                     |    4 -
 kernel/time/clockevents.c                      |    9 +
 kernel/time/tick-broadcast.c                   |  192 ++++++++++++++++++++++--
 kernel/time/tick-internal.h                    |    8 +
 17 files changed, 434 insertions(+), 101 deletions(-)

-- 

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Linus Torvalds @ 2014-01-15  8:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list
In-Reply-To: <1389762075.6933.76.camel@pasglop>

On Wed, Jan 15, 2014 at 12:01 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> My original intend was to put it in powerpc-next and then shoot it to
> stable, but it got a tad annoying (due to churn it needs to be applied
> at least on rc4 or later while my next is at rc1 and clean that way), so
> I put it in the merge branch.

Quite frankly, I'll prefer to not merge it now, and then 3.13 will get
it from stable, when it does things like this.

Partly because it fixes a power-only bug, but potentially changes
non-power behavior. If it was all in arch/powerpc, I wouldn't mind.

               Linus

^ permalink raw reply

* [PATCH 2/2] powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
From: Michael Ellerman @ 2014-01-15  7:14 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Neuling, Linus Torvalds, linux-kernel
In-Reply-To: <1389770069-28733-1-git-send-email-mpe@ellerman.id.au>

At a glance these are just the inverse of each other. The one subtlety
is that arch_spin_value_unlocked() takes the lock by value, rather than
as a pointer, which is important for the lockref code.

On the other hand arch_spin_is_locked() doesn't really care, so
implement it in terms of arch_spin_value_unlocked().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/spinlock.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 5162f8c..a30ef69 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -28,8 +28,6 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define arch_spin_is_locked(x)		((x)->slock != 0)
-
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
 #ifdef __BIG_ENDIAN__
@@ -59,6 +57,11 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 	return lock.slock == 0;
 }
 
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+	return !arch_spin_value_unlocked(*lock);
+}
+
 /*
  * This returns the old value in the lock, so we succeeded
  * in getting the lock if the return value is 0.
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH 1/2] powerpc: Add support for the optimised lockref implementation
From: Michael Ellerman @ 2014-01-15  7:14 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Michael Neuling, Linus Torvalds, linux-kernel

This commit adds the architecture support required to enable the
optimised implementation of lockrefs.

That's as simple as defining arch_spin_value_unlocked() and selecting
the Kconfig option.

We also define cmpxchg64_relaxed(), because the lockref code does not
need the cmpxchg to have barrier semantics.

Using Linus' test case[1] on one system I see a 4x improvement for the
basic enablement, and a further 1.3x for cmpxchg64_relaxed(), for a
total of 5.3x vs the baseline.

On another system I see more like 2x improvement.

[1]: http://marc.info/?l=linux-fsdevel&m=137782380714721&w=4

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/Kconfig                | 1 +
 arch/powerpc/include/asm/cmpxchg.h  | 1 +
 arch/powerpc/include/asm/spinlock.h | 5 +++++
 3 files changed, 7 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b44b52c..b34b53d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -139,6 +139,7 @@ config PPC
 	select OLD_SIGACTION if PPC32
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK
+	select ARCH_USE_CMPXCHG_LOCKREF if PPC64
 
 config GENERIC_CSUM
 	def_bool CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index e245aab..d463c68 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -300,6 +300,7 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
 	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
 	cmpxchg_local((ptr), (o), (n));					\
   })
+#define cmpxchg64_relaxed	cmpxchg64_local
 #else
 #include <asm-generic/cmpxchg-local.h>
 #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 5f54a74..5162f8c 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -54,6 +54,11 @@
 #define SYNC_IO
 #endif
 
+static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+{
+	return lock.slock == 0;
+}
+
 /*
  * This returns the old value in the lock, so we succeeded
  * in getting the lock if the return value is 0.
-- 
1.8.3.2

^ permalink raw reply related

* [PATCH] powerpc, perf: Define perf_event_print_debug() to print PMU register values
From: Anshuman Khandual @ 2014-01-15  6:50 UTC (permalink / raw)
  To: Linux PPC dev; +Cc: Michael Ellerman

As of now "echo p > /proc/sysrq-trigger" command does not print anything on
the console as we have a blank perf_event_print_debug function. This patch
defines perf_event_print_debug function to print various PMU registers.

With this patch, "echo p > /proc/sysrq-trigger" command on a POWER8 system
generates this sample output on the console.

echo p > /proc/sysrq-trigger
SysRq : Show Regs
CPU#5 PMC#1:  00000000 PMC#2:  00000000
CPU#5 PMC#3:  00000000 PMC#4:  00000000
CPU#5 PMC#5:  d03737ba PMC#6:  843aaf8c
CPU#5 MMCR0:  0000000000000000 MMCR1:  0000000000000000
CPU#5 MMCRA:  0000000000000000 SIAR:   0000000000000000
CPU#5 SDAR:   0000000000000000
CPU#5 SIER:   0000000000000000
CPU#5 MMCR2:  0000000000000000 EBBHR:  0000000000000000
CPU#5 EBBRR:  0000000000000000 BESCR:  0000000000000000

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/perf/core-book3s.c | 54 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 29b89e8..ac35aae 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -562,9 +562,63 @@ out:
 #endif /* CONFIG_PPC64 */
 
 static void perf_event_interrupt(struct pt_regs *regs);
+static unsigned long read_pmc(int idx);
 
+/* Called from generic sysrq dump register handler */
 void perf_event_print_debug(void)
 {
+	unsigned long flags;
+	int cpu, idx;
+
+	if (!ppmu->n_counter)
+		return;
+
+	local_irq_save(flags);
+
+	cpu = smp_processor_id();
+
+	/* General PMU counters */
+	for (idx = 1; idx <= ppmu->n_counter; idx = idx + 2)
+		pr_info("CPU#%d PMC#%d:  %08lx PMC#%d:  %08lx\n",
+			cpu, idx, read_pmc(idx), idx + 1, read_pmc(idx + 1));
+
+	/* General PMU config registers */
+	pr_info("CPU#%d MMCR0:  %016lx MMCR1:  %016lx\n", cpu,
+					mfspr(SPRN_MMCR0), mfspr(SPRN_MMCR1));
+	pr_info("CPU#%d MMCRA:  %016lx SIAR:   %016lx\n", cpu,
+					mfspr(SPRN_MMCRA), mfspr(SPRN_SIAR));
+
+#ifdef CONFIG_PPC64
+	pr_info("CPU#%d SDAR:   %016lx\n", cpu, mfspr(SPRN_SDAR));
+#endif /* CONFIG_PPC64 */
+
+	/* PMU specific config registers */
+	if (ppmu->flags & PPMU_HAS_SIER)
+		pr_info("CPU#%d SIER:   %016lx\n", cpu, mfspr(SPRN_SIER));
+
+
+	if (ppmu->flags & PPMU_EBB) {
+		pr_info("CPU#%d MMCR2:  %016lx EBBHR:  %016lx\n", cpu,
+					mfspr(SPRN_MMCR2), mfspr(SPRN_EBBHR));
+		pr_info("CPU#%d EBBRR:  %016lx BESCR:  %016lx\n", cpu,
+					mfspr(SPRN_EBBRR), mfspr(SPRN_BESCR));
+	}
+
+	if (ppmu->flags & PPMU_BHRB) {
+		u64 val;
+
+		for (idx = 0; idx < ppmu->bhrb_nr; idx++) {
+			val = read_bhrb(idx);
+
+			/* BHRB terminal marker */
+			if (!val)
+				break;
+
+			pr_info("CPU#%d BHRBE[%d]:  %016llx\n", cpu, idx, val);
+		}
+	}
+
+	local_irq_restore(flags);
 }
 
 /*
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH 0/4] powernv: kvm: numa fault improvement
From: Liu ping fan @ 2014-01-15  6:36 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev, Aneesh Kumar K.V, kvm-ppc
In-Reply-To: <DB55CAD2-72AF-4740-9904-193071C2740B@suse.de>

On Thu, Jan 9, 2014 at 8:08 PM, Alexander Graf <agraf@suse.de> wrote:
>
> On 11.12.2013, at 09:47, Liu Ping Fan <kernelfans@gmail.com> wrote:
>
>> This series is based on Aneesh's series  "[PATCH -V2 0/5] powerpc: mm: Numa faults support for ppc64"
>>
>> For this series, I apply the same idea from the previous thread "[PATCH 0/3] optimize for powerpc _PAGE_NUMA"
>> (for which, I still try to get a machine to show nums)
>>
>> But for this series, I think that I have a good justification -- the fact of heavy cost when switching context between guest and host,
>> which is  well known.
>
> This cover letter isn't really telling me anything. Please put a proper description of what you're trying to achieve, why you're trying to achieve what you're trying and convince your readers that it's a good idea to do it the way you do it.
>
Sorry for the unclear message. After introducing the _PAGE_NUMA,
kvmppc_do_h_enter() can not fill up the hpte for guest. Instead, it
should rely on host's kvmppc_book3s_hv_page_fault() to call
do_numa_page() to do the numa fault check. This incurs the overhead
when exiting from rmode to vmode.  My idea is that in
kvmppc_do_h_enter(), we do a quick check, if the page is right placed,
there is no need to exit to vmode (i.e saving htab, slab switching)

>> If my suppose is correct, will CCing kvm@vger.kernel.org from next version.
>
> This translates to me as "This is an RFC"?
>
Yes, I am not quite sure about it. I have no bare-metal to verify it.
So I hope at least, from the theory, it is correct.

Thanks and regards,
Ping Fan
>
> Alex
>

^ permalink raw reply

* powerpc/powernv: Call OPAL sync before kexec'ing
From: Benjamin Herrenschmidt @ 2014-01-15  6:02 UTC (permalink / raw)
  To: linuxppc-dev list

From: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
    
Its possible that OPAL may be writing to host memory during
kexec (like dump retrieve scenario). In this situation we might
end up corrupting host memory.
    
This patch makes OPAL sync call to make sure OPAL stops
writing to host memory before kexec'ing.
    
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 0ac6f04..1920f70 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -162,6 +162,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_DUMP_ACK				84
 #define OPAL_GET_MSG				85
 #define OPAL_CHECK_ASYNC_COMPLETION		86
+#define OPAL_SYNC_HOST_REBOOT			87
 
 #ifndef __ASSEMBLY__
 
@@ -841,6 +842,7 @@ int64_t opal_dump_read(uint32_t dump_id, uint64_t buffer);
 int64_t opal_dump_ack(uint32_t dump_id);
 int64_t opal_get_msg(uint64_t buffer, size_t size);
 int64_t opal_check_completion(uint64_t buffer, size_t size, uint64_t token);
+int64_t opal_sync_host_reboot(void);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname, int depth, void *data);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index d58fcae..9e7ca21 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -127,3 +127,4 @@ OPAL_CALL(opal_dump_read,			OPAL_DUMP_READ);
 OPAL_CALL(opal_dump_ack,			OPAL_DUMP_ACK);
 OPAL_CALL(opal_get_msg,				OPAL_GET_MSG);
 OPAL_CALL(opal_check_completion,		OPAL_CHECK_ASYNC_COMPLETION);
+OPAL_CALL(opal_sync_host_reboot,		OPAL_SYNC_HOST_REBOOT);
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 0ba1ccb..619b94a 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -21,6 +21,7 @@
 #include <linux/kobject.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
+#include <linux/delay.h>
 #include <asm/opal.h>
 #include <asm/firmware.h>
 #include <asm/mce.h>
@@ -581,10 +582,25 @@ subsys_initcall(opal_init);
 void opal_shutdown(void)
 {
 	unsigned int i;
+	long rc = OPAL_BUSY;
 
+	/* First free interrupts, which will also mask them */
 	for (i = 0; i < opal_irq_count; i++) {
 		if (opal_irqs[i])
 			free_irq(opal_irqs[i], 0);
 		opal_irqs[i] = 0;
 	}
+
+	/*
+	 * Then sync with OPAL which ensure anything that can
+	 * potentially write to our memory has completed such
+	 * as an ongoing dump retrieval
+	 */
+	while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
+		rc = opal_sync_host_reboot();
+		if (rc == OPAL_BUSY)
+			opal_poll_events(NULL);
+		else
+			mdelay(10);
+	}
 }
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index ddc9690..86733d6 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -153,8 +153,10 @@ static void pnv_shutdown(void)
 	/* Let the PCI code clear up IODA tables */
 	pnv_pci_shutdown();
 
-	/* And unregister all OPAL interrupts so they don't fire
-	 * up while we kexec
+	/*
+	 * Stop OPAL activity: Unregister all OPAL interrupts so they
+	 * don't fire up while we kexec and make sure all potentially
+	 * DMA'ing ops are complete (such as dump retrieval).
 	 */
 	opal_shutdown();
 }

^ permalink raw reply related

* [PATCH v2 2/3] powerpc/eeh: Hotplug improvement
From: Gavin Shan @ 2014-01-15  5:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1389762974-14176-1-git-send-email-shangw@linux.vnet.ibm.com>

When EEH error comes to one specific PCI device before its driver
is loaded, we will apply hotplug to recover the error. During the
plug time, the PCI device will be probed and its driver is loaded.
Then we wrongly calls to the error handlers if the driver supports
EEH explicitly.

The patch intends to fix by introducing flag EEH_DEV_NO_HANDLER and
set it before we remove the PCI device. In turn, we can avoid wrongly
calls the error handlers of the PCI device after its driver loaded.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h   |    3 ++-
 arch/powerpc/kernel/eeh.c        |   15 +++++++++++++++
 arch/powerpc/kernel/eeh_driver.c |   10 +++++++---
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index e37db7f..8e31dad 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -90,7 +90,8 @@ struct eeh_pe {
 #define EEH_DEV_IRQ_DISABLED	(1 << 3)	/* Interrupt disabled	*/
 #define EEH_DEV_DISCONNECTED	(1 << 4)	/* Removing from PE	*/
 
-#define EEH_DEV_SYSFS		(1 << 8)	/* Sysfs created        */
+#define EEH_DEV_NO_HANDLER	(1 << 8)	/* No error handler	*/
+#define EEH_DEV_SYSFS		(1 << 9)	/* Sysfs created	*/
 
 struct eeh_dev {
 	int mode;			/* EEH mode			*/
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 4bd687d..6a118db 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -921,6 +921,13 @@ void eeh_add_device_late(struct pci_dev *dev)
 		eeh_sysfs_remove_device(edev->pdev);
 		edev->mode &= ~EEH_DEV_SYSFS;
 
+		/*
+		 * We definitely should have the PCI device removed
+		 * though it wasn't correctly. So we needn't call
+		 * into error handler afterwards.
+		 */
+		edev->mode |= EEH_DEV_NO_HANDLER;
+
 		edev->pdev = NULL;
 		dev->dev.archdata.edev = NULL;
 	}
@@ -1023,6 +1030,14 @@ void eeh_remove_device(struct pci_dev *dev)
 	else
 		edev->mode |= EEH_DEV_DISCONNECTED;
 
+	/*
+	 * We're removing from the PCI subsystem, that means
+	 * the PCI device driver can't support EEH or not
+	 * well. So we rely on hotplug completely to do recovery
+	 * for the specific PCI device.
+	 */
+	edev->mode |= EEH_DEV_NO_HANDLER;
+
 	eeh_addr_cache_rmv_dev(dev);
 	eeh_sysfs_remove_device(dev);
 	edev->mode &= ~EEH_DEV_SYSFS;
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index d3a132c..ce3a698 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -217,7 +217,8 @@ static void *eeh_report_mmio_enabled(void *data, void *userdata)
 	if (!driver) return NULL;
 
 	if (!driver->err_handler ||
-	    !driver->err_handler->mmio_enabled) {
+	    !driver->err_handler->mmio_enabled ||
+	    (edev->mode & EEH_DEV_NO_HANDLER)) {
 		eeh_pcid_put(dev);
 		return NULL;
 	}
@@ -258,7 +259,8 @@ static void *eeh_report_reset(void *data, void *userdata)
 	eeh_enable_irq(dev);
 
 	if (!driver->err_handler ||
-	    !driver->err_handler->slot_reset) {
+	    !driver->err_handler->slot_reset ||
+	    (edev->mode & EEH_DEV_NO_HANDLER)) {
 		eeh_pcid_put(dev);
 		return NULL;
 	}
@@ -297,7 +299,9 @@ static void *eeh_report_resume(void *data, void *userdata)
 	eeh_enable_irq(dev);
 
 	if (!driver->err_handler ||
-	    !driver->err_handler->resume) {
+	    !driver->err_handler->resume ||
+	    (edev->mode & EEH_DEV_NO_HANDLER)) {
+		edev->mode &= ~EEH_DEV_NO_HANDLER;
 		eeh_pcid_put(dev);
 		return NULL;
 	}
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2 1/3] powerpc/eeh: Handle multiple EEH errors
From: Gavin Shan @ 2014-01-15  5:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan

For one PCI error relevant OPAL event, we possibly have multiple
EEH errors for that. For example, multiple frozen PEs detected on
different PHBs. Unfortunately, we didn't cover the case. The patch
enumarates the return value from eeh_ops::next_error() and change
eeh_handle_special_event() and eeh_ops::next_error() to handle all
existing EEH errors.

As Ben pointed out, we needn't list_for_each_entry_safe() since we
are not deleting any PHB from the hose_list and the EEH serialized
lock should be held while purging EEH events. The patch covers those
suggestions as well.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h            |   10 ++
 arch/powerpc/kernel/eeh_driver.c          |  150 +++++++++++++++--------------
 arch/powerpc/platforms/powernv/eeh-ioda.c |   39 +++++---
 3 files changed, 112 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index d3e5e9b..e37db7f 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -117,6 +117,16 @@ static inline struct pci_dev *eeh_dev_to_pci_dev(struct eeh_dev *edev)
 	return edev ? edev->pdev : NULL;
 }
 
+/* Return values from eeh_ops::next_error */
+enum {
+	EEH_NEXT_ERR_NONE = 0,
+	EEH_NEXT_ERR_INF,
+	EEH_NEXT_ERR_FROZEN_PE,
+	EEH_NEXT_ERR_FENCED_PHB,
+	EEH_NEXT_ERR_DEAD_PHB,
+	EEH_NEXT_ERR_DEAD_IOC
+};
+
 /*
  * The struct is used to trace the registered EEH operation
  * callback functions. Actually, those operation callback
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 36bed5a..d3a132c 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -626,84 +626,90 @@ static void eeh_handle_special_event(void)
 {
 	struct eeh_pe *pe, *phb_pe;
 	struct pci_bus *bus;
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	unsigned long flags;
-	int rc = 0;
+	int rc;
 
-	/*
-	 * The return value from next_error() has been classified as follows.
-	 * It might be good to enumerate them. However, next_error() is only
-	 * supported by PowerNV platform for now. So it would be fine to use
-	 * integer directly:
-	 *
-	 * 4 - Dead IOC           3 - Dead PHB
-	 * 2 - Fenced PHB         1 - Frozen PE
-	 * 0 - No error found
-	 *
-	 */
-	rc = eeh_ops->next_error(&pe);
-	if (rc <= 0)
-		return;
 
-	switch (rc) {
-	case 4:
-		/* Mark all PHBs in dead state */
-		eeh_serialize_lock(&flags);
-		list_for_each_entry_safe(hose, tmp,
-				&hose_list, list_node) {
-			phb_pe = eeh_phb_pe_get(hose);
-			if (!phb_pe) continue;
-
-			eeh_pe_state_mark(phb_pe,
-				EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+	do {
+		rc = eeh_ops->next_error(&pe);
+
+		switch (rc) {
+		case EEH_NEXT_ERR_DEAD_IOC:
+			/* Mark all PHBs in dead state */
+			eeh_serialize_lock(&flags);
+
+			/* Purge all events */
+			eeh_remove_event(NULL);
+
+			list_for_each_entry(hose, &hose_list, list_node) {
+				phb_pe = eeh_phb_pe_get(hose);
+				if (!phb_pe) continue;
+
+				eeh_pe_state_mark(phb_pe,
+					EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+			}
+
+			eeh_serialize_unlock(flags);
+
+			break;
+		case EEH_NEXT_ERR_FROZEN_PE:
+		case EEH_NEXT_ERR_FENCED_PHB:
+		case EEH_NEXT_ERR_DEAD_PHB:
+			/* Mark the PE in fenced state */
+			eeh_serialize_lock(&flags);
+
+			/* Purge all events of the PHB */
+			eeh_remove_event(pe);
+
+			if (rc == EEH_NEXT_ERR_DEAD_PHB)
+				eeh_pe_state_mark(pe,
+					EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
+			else
+				eeh_pe_state_mark(pe,
+					EEH_PE_ISOLATED | EEH_PE_RECOVERING);
+
+			eeh_serialize_unlock(flags);
+
+			break;
+		case EEH_NEXT_ERR_NONE:
+			return;
+		default:
+			pr_warn("%s: Invalid value %d from next_error()\n",
+				__func__, rc);
+			return;
 		}
-		eeh_serialize_unlock(flags);
-
-		/* Purge all events */
-		eeh_remove_event(NULL);
-		break;
-	case 3:
-	case 2:
-	case 1:
-		/* Mark the PE in fenced state */
-		eeh_serialize_lock(&flags);
-		if (rc == 3)
-			eeh_pe_state_mark(pe,
-				EEH_PE_ISOLATED | EEH_PE_PHB_DEAD);
-		else
-			eeh_pe_state_mark(pe,
-				EEH_PE_ISOLATED | EEH_PE_RECOVERING);
-		eeh_serialize_unlock(flags);
-
-		/* Purge all events of the PHB */
-		eeh_remove_event(pe);
-		break;
-	default:
-		pr_err("%s: Invalid value %d from next_error()\n",
-		       __func__, rc);
-		return;
-	}
 
-	/*
-	 * For fenced PHB and frozen PE, it's handled as normal
-	 * event. We have to remove the affected PHBs for dead
-	 * PHB and IOC
-	 */
-	if (rc == 2 || rc == 1)
-		eeh_handle_normal_event(pe);
-	else {
-		list_for_each_entry_safe(hose, tmp,
-			&hose_list, list_node) {
-			phb_pe = eeh_phb_pe_get(hose);
-			if (!phb_pe || !(phb_pe->state & EEH_PE_PHB_DEAD))
-				continue;
-
-			bus = eeh_pe_bus_get(phb_pe);
-			/* Notify all devices that they're about to go down. */
-			eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
-			pcibios_remove_pci_devices(bus);
+		/*
+		 * For fenced PHB and frozen PE, it's handled as normal
+		 * event. We have to remove the affected PHBs for dead
+		 * PHB and IOC
+		 */
+		if (rc == EEH_NEXT_ERR_FROZEN_PE ||
+		    rc == EEH_NEXT_ERR_FENCED_PHB) {
+			eeh_handle_normal_event(pe);
+		} else {
+			list_for_each_entry(hose, &hose_list, list_node) {
+				phb_pe = eeh_phb_pe_get(hose);
+				if (!phb_pe ||
+				    !(phb_pe->state & EEH_PE_PHB_DEAD))
+					continue;
+
+				/* Notify all devices to be down */
+				bus = eeh_pe_bus_get(phb_pe);
+				eeh_pe_dev_traverse(pe,
+					eeh_report_failure, NULL);
+				pcibios_remove_pci_devices(bus);
+			}
 		}
-	}
+
+		/*
+		 * If we have detected dead IOC, we needn't proceed
+		 * any more since all PHBs would have been removed
+		 */
+		if (rc == EEH_NEXT_ERR_DEAD_IOC)
+			break;
+	} while (rc != EEH_NEXT_ERR_NONE);
 }
 
 /**
diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index d7ddcee..e0b12d0 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -884,12 +884,12 @@ static int ioda_eeh_get_pe(struct pci_controller *hose,
  */
 static int ioda_eeh_next_error(struct eeh_pe **pe)
 {
-	struct pci_controller *hose, *tmp;
+	struct pci_controller *hose;
 	struct pnv_phb *phb;
 	u64 frozen_pe_no;
 	u16 err_type, severity;
 	long rc;
-	int ret = 1;
+	int ret = EEH_NEXT_ERR_NONE;
 
 	/*
 	 * While running here, it's safe to purge the event queue.
@@ -899,7 +899,7 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 	eeh_remove_event(NULL);
 	opal_notifier_update_evt(OPAL_EVENT_PCI_ERROR, 0x0ul);
 
-	list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
+	list_for_each_entry(hose, &hose_list, list_node) {
 		/*
 		 * If the subordinate PCI buses of the PHB has been
 		 * removed, we needn't take care of it any more.
@@ -938,19 +938,19 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 		switch (err_type) {
 		case OPAL_EEH_IOC_ERROR:
 			if (severity == OPAL_EEH_SEV_IOC_DEAD) {
-				list_for_each_entry_safe(hose, tmp,
-						&hose_list, list_node) {
+				list_for_each_entry(hose, &hose_list,
+						    list_node) {
 					phb = hose->private_data;
 					phb->eeh_state |= PNV_EEH_STATE_REMOVED;
 				}
 
 				pr_err("EEH: dead IOC detected\n");
-				ret = 4;
-				goto out;
+				ret = EEH_NEXT_ERR_DEAD_IOC;
 			} else if (severity == OPAL_EEH_SEV_INF) {
 				pr_info("EEH: IOC informative error "
 					"detected\n");
 				ioda_eeh_hub_diag(hose);
+				ret = EEH_NEXT_ERR_NONE;
 			}
 
 			break;
@@ -962,21 +962,20 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 				pr_err("EEH: dead PHB#%x detected\n",
 					hose->global_number);
 				phb->eeh_state |= PNV_EEH_STATE_REMOVED;
-				ret = 3;
-				goto out;
+				ret = EEH_NEXT_ERR_DEAD_PHB;
 			} else if (severity == OPAL_EEH_SEV_PHB_FENCED) {
 				if (ioda_eeh_get_phb_pe(hose, pe))
 					break;
 
 				pr_err("EEH: fenced PHB#%x detected\n",
 					hose->global_number);
-				ret = 2;
-				goto out;
+				ret = EEH_NEXT_ERR_FENCED_PHB;
 			} else if (severity == OPAL_EEH_SEV_INF) {
 				pr_info("EEH: PHB#%x informative error "
 					"detected\n",
 					hose->global_number);
 				ioda_eeh_phb_diag(hose);
+				ret = EEH_NEXT_ERR_NONE;
 			}
 
 			break;
@@ -986,13 +985,23 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 
 			pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
 				(*pe)->addr, (*pe)->phb->global_number);
-			ret = 1;
-			goto out;
+			ret = EEH_NEXT_ERR_FROZEN_PE;
+			break;
+		default:
+			pr_warn("%s: Unexpected error type %d\n",
+				__func__, err_type);
 		}
+
+		/*
+		 * If we have no errors on the specific PHB or only
+		 * informative error there, we continue poking it.
+		 * Otherwise, we need actions to be taken by upper
+		 * layer.
+		 */
+		if (ret > EEH_NEXT_ERR_INF)
+			break;
 	}
 
-	ret = 0;
-out:
 	return ret;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2 3/3] powerpc/eeh: Escalate error on non-existing PE
From: Gavin Shan @ 2014-01-15  5:16 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Gavin Shan
In-Reply-To: <1389762974-14176-1-git-send-email-shangw@linux.vnet.ibm.com>

Sometimes, especially in sinario of loading another kernel with kdump,
we got EEH error on non-existing PE. That means the PEEV / PEST in
the corresponding PHB would be messy and we can't handle that case.
The patch escalates the error to fenced PHB so that the PHB could be
rested in order to revoer the errors on non-existing PEs.

Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/eeh-ioda.c |   31 +++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-ioda.c b/arch/powerpc/platforms/powernv/eeh-ioda.c
index e0b12d0..92aa1f9 100644
--- a/arch/powerpc/platforms/powernv/eeh-ioda.c
+++ b/arch/powerpc/platforms/powernv/eeh-ioda.c
@@ -862,11 +862,7 @@ static int ioda_eeh_get_pe(struct pci_controller *hose,
 	dev.phb = hose;
 	dev.pe_config_addr = pe_no;
 	dev_pe = eeh_pe_get(&dev);
-	if (!dev_pe) {
-		pr_warning("%s: Can't find PE for PHB#%x - PE#%x\n",
-			   __func__, hose->global_number, pe_no);
-		return -EEXIST;
-	}
+	if (!dev_pe) return -EEXIST;
 
 	*pe = dev_pe;
 	return 0;
@@ -980,12 +976,27 @@ static int ioda_eeh_next_error(struct eeh_pe **pe)
 
 			break;
 		case OPAL_EEH_PE_ERROR:
-			if (ioda_eeh_get_pe(hose, frozen_pe_no, pe))
-				break;
+			/*
+			 * If we can't find the corresponding PE, the
+			 * PEEV / PEST would be messy. So we force an
+			 * fenced PHB so that it can be recovered.
+			 */
+			if (ioda_eeh_get_pe(hose, frozen_pe_no, pe)) {
+				if (!ioda_eeh_get_phb_pe(hose, pe)) {
+					pr_err("EEH: Escalated fenced PHB#%x "
+					       "detected for PE#%llx\n",
+						hose->global_number,
+						frozen_pe_no);
+					ret = EEH_NEXT_ERR_FENCED_PHB;
+				} else {
+					ret = EEH_NEXT_ERR_NONE;
+				}
+			} else {
+				pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
+					(*pe)->addr, (*pe)->phb->global_number);
+				ret = EEH_NEXT_ERR_FROZEN_PE;
+			}
 
-			pr_err("EEH: Frozen PE#%x on PHB#%x detected\n",
-				(*pe)->addr, (*pe)->phb->global_number);
-			ret = EEH_NEXT_ERR_FROZEN_PE;
 			break;
 		default:
 			pr_warn("%s: Unexpected error type %d\n",
-- 
1.7.10.4

^ permalink raw reply related

* [git pull] Please pull powerpc.git merge branch
From: Benjamin Herrenschmidt @ 2014-01-15  5:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, Andrew Morton, Linux Kernel list

Hi Linus !

So you make the call onto whether taking that one now or waiting for the
merge window. It's a bug fix for a crash in mremap that occurs on
powerpc with THP enabled.

The fix however requires a small change in the generic code. It moves a
condition into a helper we can override from the arch which is harmless,
but it *also* slightly changes the order of the set_pmd and the withdraw
& deposit, which should be fine according to Kirill (who wrote that
code) but I agree -rc8 is a bit late...

It was acked by Kirill and Andrew told me to just merge it via powerpc.

My original intend was to put it in powerpc-next and then shoot it to
stable, but it got a tad annoying (due to churn it needs to be applied
at least on rc4 or later while my next is at rc1 and clean that way), so
I put it in the merge branch.

>From there, you tell me if you want to take it now, if not, I'll send
you that branch along with my normal next one after you open the merge
window.

Cheers,
Ben.

The following changes since commit a6da83f98267bc8ee4e34aa899169991eb0ceb93:

  Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc (2014-01-13 10:59:05 +0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

for you to fetch changes up to b3084f4db3aeb991c507ca774337c7e7893ed04f:

  powerpc/thp: Fix crash on mremap (2014-01-15 15:46:38 +1100)

----------------------------------------------------------------
Aneesh Kumar K.V (1):
      powerpc/thp: Fix crash on mremap

 arch/powerpc/include/asm/pgtable-ppc64.h | 14 ++++++++++++++
 include/asm-generic/pgtable.h            | 12 ++++++++++++
 mm/huge_memory.c                         | 14 +++++---------
 3 files changed, 31 insertions(+), 9 deletions(-)

^ permalink raw reply

* Re: Pull request: scottwood/linux.git
From: Benjamin Herrenschmidt @ 2014-01-15  3:54 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev
In-Reply-To: <20140111004451.GA7353@home.buserror.net>

On Fri, 2014-01-10 at 18:44 -0600, Scott Wood wrote:
> Highlights include 32-bit booke relocatable support, e6500 hardware
> tablewalk support, various e500 SPE fixes, some new/revived boards, and
> e6500 deeper idle and altivec powerdown modes.

This breaks WSP (A2) build with 64K pages:

/home/benh/linux-powerpc-test/arch/powerpc/mm/tlb_low_64e.S: Assembler messages:
/home/benh/linux-powerpc-test/arch/powerpc/mm/tlb_low_64e.S:334: Error: can't resolve `L0^A' {*ABS* section} - `PUD_SHIFT' {*UND* section}
/home/benh/linux-powerpc-test/arch/powerpc/mm/tlb_low_64e.S:334: Error: expression too complex
/home/benh/linux-powerpc-test/arch/powerpc/mm/tlb_low_64e.S:334: Error: operand out of range (67 is not between 0 and 63)
make[2]: *** [arch/powerpc/mm/tlb_low_64e.o] Error 1

I'm merging anyway because nobody uses WSP anymore (I'm keen to remove it by 3.15 or so)
but in the meantime you may want to fix it (probably just ifdef the PUD level walk on
64k pages, look at what I do elsewhere).

Cheers,
Ben.

> The following changes since commit dece8ada993e1764a115bdff0f1effffaa5fc8dc:
> 
>   Merge branch 'merge' into next (2013-12-30 15:19:31 +1100)
> 
> are available in the git repository at:
> 
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next
> 
> for you to fetch changes up to d064f30e5063ec54ab50af08c64fb5055e759bfd:
> 
>   powerpc/fsl_pci: add versionless pci compatible (2014-01-10 17:38:56 -0600)
> 
> ----------------------------------------------------------------
> Anton Blanchard (1):
>       drivers/tty: ehv_bytechan fails to build as a module
> 
> Christian Engelmayer (1):
>       powerpc/sysdev: Fix a pci section mismatch for Book E
> 
> Diana Craciun (1):
>       powerpc: Replaced tlbilx with tlbwe in the initialization code
> 
> Joseph Myers (6):
>       powerpc: fix exception clearing in e500 SPE float emulation
>       powerpc: fix e500 SPE float rounding inexactness detection
>       math-emu: fix floating-point to integer unsigned saturation
>       math-emu: fix floating-point to integer overflow detection
>       powerpc: fix e500 SPE float to integer and fixed-point conversions
>       powerpc: fix e500 SPE float SIGFPE generation
> 
> Kevin Hao (11):
>       powerpc/85xx: don't init the mpic ipi for the SoC which has doorbell support
>       powerpc/fsl_booke: protect the access to MAS7
>       powerpc/fsl_booke: introduce get_phys_addr function
>       powerpc: introduce macro LOAD_REG_ADDR_PIC
>       powerpc: enable the relocatable support for the fsl booke 32bit kernel
>       powerpc/fsl_booke: set the tlb entry for the kernel address in AS1
>       powerpc: introduce early_get_first_memblock_info
>       powerpc/fsl_booke: introduce map_mem_in_cams_addr
>       powerpc/fsl_booke: make sure PAGE_OFFSET map to memstart_addr for relocatable kernel
>       powerpc/fsl_booke: smp support for booting a relocatable kernel above 64M
>       powerpc/fsl_booke: enable the relocatable for the kdump kernel
> 
> LEROY Christophe (1):
>       powerpc 8xx: defconfig: slice by 4 is more efficient than the default slice by 8 on Powerpc 8xx.
> 
> Lijun Pan (1):
>       powerpc/85xx: Merge 85xx/p1023_defconfig into mpc85xx_smp and mpc85xx
> 
> Mihai Caraman (1):
>       powerpc/booke64: Add LRAT error exception handler
> 
> Paul Gortmaker (1):
>       powerpc: fix 8xx and 6xx final link failures
> 
> Scott Wood (5):
>       powerpc/fsl-booke: Use SPRN_SPRGn rather than mfsprg/mtsprg
>       powerpc: add barrier after writing kernel PTE
>       powerpc/e6500: TLB miss handler with hardware tablewalk support
>       powerpc/fsl-book3e-64: Use paca for hugetlb TLB1 entry selection
>       powerpc/booke-64: fix tlbsrx. path in bolted tlb handler
> 
> Shaohui Xie (1):
>       powerpc/85xx: handle the eLBC error interrupt if it exists in dts
> 
> Shengzhou Liu (2):
>       powerpc/85xx/dts: add third elo3 dma component
>       powerpc/fsl_pci: add versionless pci compatible
> 
> Stephen Chivers (1):
>       powerpc/embedded6xx: Add support for Motorola/Emerson MVME5100
> 
> Wang Dongsheng (9):
>       powerpc/fsl: add E6500 PVR and SPRN_PWRMGTCR0 define
>       powerpc/85xx: add hardware automatically enter altivec idle state
>       powerpc/85xx: add hardware automatically enter pw20 state
>       powerpc/85xx: add sysfs for pw20 state and altivec idle
>       powerpc/p1022ds: fix rtc compatible string
>       powerpc/p1022ds: add a interrupt for rtc node
>       powerpc/mpic_timer: fix the time is not accurate caused by GTCRR toggle bit
>       powerpc/mpic_timer: fix convert ticks to time subtraction overflow
>       powerpc/dts: fix lbc lack of error interrupt
> 
> Xie Xiaobo (2):
>       powerpc/85xx: Add QE common init function
>       powerpc/85xx: Add TWR-P1025 board support
> 
> Zhao Qiang (3):
>       powerpc/p1010rdb:update dts to adapt to both old and new p1010rdb
>       powerpc/p1010rdb:update mtd of nand to adapt to both old and new p1010rdb
>       powerpc/p1010rdb-pa: modify phy interrupt.
> 
>  .../devicetree/bindings/video/ssd1289fb.txt        |  13 +
>  arch/powerpc/Kconfig                               |   5 +-
>  arch/powerpc/boot/Makefile                         |   7 +-
>  arch/powerpc/boot/dts/fsl/elo3-dma-2.dtsi          |  82 ++++++
>  arch/powerpc/boot/dts/fsl/p1020si-post.dtsi        |   3 +-
>  arch/powerpc/boot/dts/fsl/p1021si-post.dtsi        |   3 +-
>  arch/powerpc/boot/dts/fsl/p1022si-post.dtsi        |   3 +-
>  arch/powerpc/boot/dts/fsl/p1023si-post.dtsi        |   3 +-
>  arch/powerpc/boot/dts/mvme5100.dts                 | 185 ++++++++++++
>  arch/powerpc/boot/dts/p1010rdb-pa.dts              |  23 ++
>  arch/powerpc/boot/dts/p1010rdb-pa.dtsi             |  85 ++++++
>  .../dts/{p1010rdb_36b.dts => p1010rdb-pa_36b.dts}  |  47 +--
>  arch/powerpc/boot/dts/p1010rdb-pb.dts              |  35 +++
>  arch/powerpc/boot/dts/p1010rdb-pb_36b.dts          |  58 ++++
>  arch/powerpc/boot/dts/p1010rdb.dts                 |  66 -----
>  arch/powerpc/boot/dts/p1010rdb.dtsi                |  43 +--
>  arch/powerpc/boot/dts/p1010rdb_32b.dtsi            |  79 ++++++
>  arch/powerpc/boot/dts/p1010rdb_36b.dtsi            |  79 ++++++
>  arch/powerpc/boot/dts/p1022ds.dtsi                 |   3 +-
>  arch/powerpc/boot/dts/p1025twr.dts                 |  95 +++++++
>  arch/powerpc/boot/dts/p1025twr.dtsi                | 280 ++++++++++++++++++
>  arch/powerpc/boot/mvme5100.c                       |  27 ++
>  arch/powerpc/boot/wrapper                          |   4 +
>  arch/powerpc/configs/85xx/p1023_defconfig          | 188 ------------
>  arch/powerpc/configs/adder875_defconfig            |   1 +
>  arch/powerpc/configs/ep88xc_defconfig              |   1 +
>  arch/powerpc/configs/mpc85xx_defconfig             |   3 +
>  arch/powerpc/configs/mpc85xx_smp_defconfig         |   3 +
>  arch/powerpc/configs/mpc866_ads_defconfig          |   1 +
>  arch/powerpc/configs/mpc885_ads_defconfig          |   1 +
>  arch/powerpc/configs/mvme5100_defconfig            | 144 ++++++++++
>  arch/powerpc/configs/tqm8xx_defconfig              |   1 +
>  arch/powerpc/include/asm/fsl_lbc.h                 |   2 +-
>  arch/powerpc/include/asm/kvm_asm.h                 |   1 +
>  arch/powerpc/include/asm/mmu-book3e.h              |  13 +
>  arch/powerpc/include/asm/mmu.h                     |  21 +-
>  arch/powerpc/include/asm/paca.h                    |   6 +
>  arch/powerpc/include/asm/ppc_asm.h                 |  13 +
>  arch/powerpc/include/asm/processor.h               |   6 +-
>  arch/powerpc/include/asm/reg.h                     |   2 +
>  arch/powerpc/include/asm/reg_booke.h               |  10 +
>  arch/powerpc/kernel/asm-offsets.c                  |   9 +
>  arch/powerpc/kernel/cpu_setup_fsl_booke.S          |  54 ++++
>  arch/powerpc/kernel/exceptions-64e.S               |  27 +-
>  arch/powerpc/kernel/fsl_booke_entry_mapping.S      |   2 +
>  arch/powerpc/kernel/head_fsl_booke.S               | 266 +++++++++++++++--
>  arch/powerpc/kernel/paca.c                         |   5 +
>  arch/powerpc/kernel/process.c                      |  30 +-
>  arch/powerpc/kernel/prom.c                         |  41 ++-
>  arch/powerpc/kernel/setup_64.c                     |  31 ++
>  arch/powerpc/kernel/swsusp_booke.S                 |  32 +--
>  arch/powerpc/kernel/sysfs.c                        | 316 +++++++++++++++++++++
>  arch/powerpc/kvm/bookehv_interrupts.S              |   2 +
>  arch/powerpc/math-emu/math_efp.c                   | 316 ++++++++++++++++-----
>  arch/powerpc/mm/fsl_booke_mmu.c                    |  80 +++++-
>  arch/powerpc/mm/hugetlbpage-book3e.c               |  54 +++-
>  arch/powerpc/mm/mem.c                              |   6 +
>  arch/powerpc/mm/mmu_decl.h                         |   2 +
>  arch/powerpc/mm/pgtable_32.c                       |   1 +
>  arch/powerpc/mm/pgtable_64.c                       |  12 +
>  arch/powerpc/mm/tlb_low_64e.S                      | 174 +++++++++++-
>  arch/powerpc/mm/tlb_nohash.c                       |  93 ++++--
>  arch/powerpc/mm/tlb_nohash_low.S                   |   4 +-
>  arch/powerpc/platforms/85xx/Kconfig                |   6 +
>  arch/powerpc/platforms/85xx/Makefile               |   1 +
>  arch/powerpc/platforms/85xx/common.c               |  38 +++
>  arch/powerpc/platforms/85xx/mpc85xx.h              |   6 +
>  arch/powerpc/platforms/85xx/mpc85xx_mds.c          |  29 +-
>  arch/powerpc/platforms/85xx/mpc85xx_rdb.c          |  25 +-
>  arch/powerpc/platforms/85xx/smp.c                  |  17 +-
>  arch/powerpc/platforms/85xx/twr_p102x.c            | 147 ++++++++++
>  arch/powerpc/platforms/embedded6xx/Kconfig         |  13 +-
>  arch/powerpc/platforms/embedded6xx/Makefile        |   1 +
>  arch/powerpc/platforms/embedded6xx/mvme5100.c      | 221 ++++++++++++++
>  arch/powerpc/sysdev/fsl_lbc.c                      |  31 +-
>  arch/powerpc/sysdev/fsl_pci.c                      |   3 +-
>  arch/powerpc/sysdev/indirect_pci.c                 |   6 +-
>  arch/powerpc/sysdev/mpic_timer.c                   |  10 +-
>  drivers/tty/Kconfig                                |   2 +-
>  include/linux/of_fdt.h                             |   1 +
>  include/math-emu/op-common.h                       |   9 +-
>  81 files changed, 3154 insertions(+), 614 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/video/ssd1289fb.txt
>  create mode 100644 arch/powerpc/boot/dts/fsl/elo3-dma-2.dtsi
>  create mode 100644 arch/powerpc/boot/dts/mvme5100.dts
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb-pa.dts
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb-pa.dtsi
>  rename arch/powerpc/boot/dts/{p1010rdb_36b.dts => p1010rdb-pa_36b.dts} (64%)
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb-pb.dts
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb-pb_36b.dts
>  delete mode 100644 arch/powerpc/boot/dts/p1010rdb.dts
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb_32b.dtsi
>  create mode 100644 arch/powerpc/boot/dts/p1010rdb_36b.dtsi
>  create mode 100644 arch/powerpc/boot/dts/p1025twr.dts
>  create mode 100644 arch/powerpc/boot/dts/p1025twr.dtsi
>  create mode 100644 arch/powerpc/boot/mvme5100.c
>  delete mode 100644 arch/powerpc/configs/85xx/p1023_defconfig
>  create mode 100644 arch/powerpc/configs/mvme5100_defconfig
>  create mode 100644 arch/powerpc/platforms/85xx/twr_p102x.c
>  create mode 100644 arch/powerpc/platforms/embedded6xx/mvme5100.c

^ permalink raw reply

* Re: [PATCH] powerpc: dma-mapping: Return dma_direct_ops variable when dev == NULL
From: Benjamin Herrenschmidt @ 2014-01-15  3:47 UTC (permalink / raw)
  To: Chunhe Lan; +Cc: linux-pci, linuxppc-dev, Chunhe Lan
In-Reply-To: <52D60238.2040204@freescale.com>

On Wed, 2014-01-15 at 11:36 +0800, Chunhe Lan wrote:

> >
> >> Signed-off-by: Chunhe Lan <Chunhe.Lan@freescale.com>
> >> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> >> Tested-by: Chunhe Lan <Chunhe.Lan@freescale.com>
> >> ---
> >>   arch/powerpc/include/asm/dma-mapping.h |   13 +++++++++----
> >>   1 files changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> >> index e27e9ad..b8c10de 100644
> >> --- a/arch/powerpc/include/asm/dma-mapping.h
> >> +++ b/arch/powerpc/include/asm/dma-mapping.h
> >> @@ -84,10 +84,15 @@ static inline struct dma_map_ops *get_dma_ops(struct device *dev)
>          I see the get_dma_ops function in 
> arch/*x86*/include/asm/dma-mapping.h as the following:
> 
>   32 static inline struct dma_map_ops *get_dma_ops(struct device *dev)
>   33 {
>   34 #ifndef CONFIG_X86_DEV_DMA_OPS
>   35         return dma_ops;
>   36 #else
>   37         if (unlikely(!dev) || !dev->archdata.dma_ops)
>   38                 return dma_ops;
>   39         else
>   40                 return dev->archdata.dma_ops;
>   41 #endif
>   42 }
> 
>          And also  see the get_dma_ops function in  
> arch/*arm*/include/asm/dma-mapping.h as the following:
> 
>   18 static inline struct dma_map_ops *get_dma_ops(struct device *dev)
>   19 {
>   20         if (dev && dev->archdata.dma_ops)
>   21                 return dev->archdata.dma_ops;
>   22         return &arm_dma_ops;
>   23 }
> 
>        Why not powerpc use this method to process dev == NULL ?

Because we don't :-) We used to and removed this. Due to how our HW
works it might not be correct. When an iommu is enabled for example
you simply cannot use the direct ops.

So the right fix is to properly establish the iommu for the VFs like
we do for the PFs.

> Thanks,
> -Chunhe
> 
> >>   	 * only ISA DMA device we support is the floppy and we have a hack
> >>   	 * in the floppy driver directly to get a device for us.
> >>   	 */
> >> -	if (unlikely(dev == NULL))
> >> -		return NULL;
> >> -
> >> -	return dev->archdata.dma_ops;
> >> +	if (dev && dev->archdata.dma_ops)
> >> +		return dev->archdata.dma_ops;
> >> +	/*
> >> +	 * In some cases (for example, use the Intel(R) 10 Gigabit PCI
> >> +	 * expression Virtual Function Network Driver -- ixgbevf.ko),
> >> +	 * their value of dev is the NULL. If return NULL, the driver is
> >> +	 * aborting. So return dma_direct_ops variable when dev == NULL.
> >> +	 */
> >> +	return &dma_direct_ops;
> >>   }
> >>   
> >>   static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
> >
> >
> >
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc: dma-mapping: Return dma_direct_ops variable when dev == NULL
From: Chunhe Lan @ 2014-01-15  3:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-pci, linuxppc-dev, Chunhe Lan
In-Reply-To: <1389694446.6933.14.camel@pasglop>

On 01/14/2014 06:14 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2014-01-14 at 17:44 +0800, Chunhe Lan wrote:
>> Without this patch, kind of below error will be dumped if
>> 'insmod ixgbevf.ko' is executed:
>>
>>      ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function
>>               Network Driver - version 2.7.12-k
>>      ixgbevf: Copyright (c) 2009 - 2012 Intel Corporation.
>>      ixgbevf 0000:01:10.0: enabling device (0000 -> 0002)
>>      ixgbevf 0000:01:10.0: No usable DMA configuration, aborting
>>      ixgbevf: probe of 0000:01:10.0 failed with error -5
>>      ......
>>      ......
> That's not right. The DMA ops must be set properly for the VF somewhere
> in the arch code instead. When creating VFs, is there a hook allowing
> the arch to fix things up ?
>
> (Also adding linux-pci on CC)
>
> Ben.
>
>> Signed-off-by: Chunhe Lan <Chunhe.Lan@freescale.com>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Tested-by: Chunhe Lan <Chunhe.Lan@freescale.com>
>> ---
>>   arch/powerpc/include/asm/dma-mapping.h |   13 +++++++++----
>>   1 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
>> index e27e9ad..b8c10de 100644
>> --- a/arch/powerpc/include/asm/dma-mapping.h
>> +++ b/arch/powerpc/include/asm/dma-mapping.h
>> @@ -84,10 +84,15 @@ static inline struct dma_map_ops *get_dma_ops(struct device *dev)
         I see the get_dma_ops function in 
arch/*x86*/include/asm/dma-mapping.h as the following:

  32 static inline struct dma_map_ops *get_dma_ops(struct device *dev)
  33 {
  34 #ifndef CONFIG_X86_DEV_DMA_OPS
  35         return dma_ops;
  36 #else
  37         if (unlikely(!dev) || !dev->archdata.dma_ops)
  38                 return dma_ops;
  39         else
  40                 return dev->archdata.dma_ops;
  41 #endif
  42 }

         And also  see the get_dma_ops function in 
arch/*arm*/include/asm/dma-mapping.h as the following:

  18 static inline struct dma_map_ops *get_dma_ops(struct device *dev)
  19 {
  20         if (dev && dev->archdata.dma_ops)
  21                 return dev->archdata.dma_ops;
  22         return &arm_dma_ops;
  23 }

       Why not powerpc use this method to process dev == NULL ?

Thanks,
-Chunhe

>>   	 * only ISA DMA device we support is the floppy and we have a hack
>>   	 * in the floppy driver directly to get a device for us.
>>   	 */
>> -	if (unlikely(dev == NULL))
>> -		return NULL;
>> -
>> -	return dev->archdata.dma_ops;
>> +	if (dev && dev->archdata.dma_ops)
>> +		return dev->archdata.dma_ops;
>> +	/*
>> +	 * In some cases (for example, use the Intel(R) 10 Gigabit PCI
>> +	 * expression Virtual Function Network Driver -- ixgbevf.ko),
>> +	 * their value of dev is the NULL. If return NULL, the driver is
>> +	 * aborting. So return dma_direct_ops variable when dev == NULL.
>> +	 */
>> +	return &dma_direct_ops;
>>   }
>>   
>>   static inline void set_dma_ops(struct device *dev, struct dma_map_ops *ops)
>
>
>

^ permalink raw reply

* RE: [PATCH 2/3] powerpc/85xx: Provide two functions to save/restore the core registers
From: Dongsheng.Wang @ 2014-01-15  3:30 UTC (permalink / raw)
  To: Scott Wood
  Cc: anton@enomsg.org, linuxppc-dev@lists.ozlabs.org,
	chenhui.zhao@freescale.com
In-Reply-To: <1389743456.24905.143.camel@snotra.buserror.net>

DQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogV29vZCBTY290dC1CMDc0
MjENCj4gU2VudDogV2VkbmVzZGF5LCBKYW51YXJ5IDE1LCAyMDE0IDc6NTEgQU0NCj4gVG86IFdh
bmcgRG9uZ3NoZW5nLUI0MDUzNA0KPiBDYzogYmVuaEBrZXJuZWwuY3Jhc2hpbmcub3JnOyBaaGFv
IENoZW5odWktQjM1MzM2OyBhbnRvbkBlbm9tc2cub3JnOyBsaW51eHBwYy0NCj4gZGV2QGxpc3Rz
Lm96bGFicy5vcmcNCj4gU3ViamVjdDogUmU6IFtQQVRDSCAyLzNdIHBvd2VycGMvODV4eDogUHJv
dmlkZSB0d28gZnVuY3Rpb25zIHRvIHNhdmUvcmVzdG9yZSB0aGUNCj4gY29yZSByZWdpc3RlcnMN
Cj4gDQo+IE9uIFR1ZSwgMjAxNC0wMS0xNCBhdCAxNTo1OSArMDgwMCwgRG9uZ3NoZW5nIFdhbmcg
d3JvdGU6DQo+ID4gRnJvbTogV2FuZyBEb25nc2hlbmcgPGRvbmdzaGVuZy53YW5nQGZyZWVzY2Fs
ZS5jb20+DQo+ID4NCj4gPiBBZGQgZnNsX2NwdV9zdGF0ZV9zYXZlL2ZzbF9jcHVfc3RhdGVfcmVz
dG9yZSBmdW5jdGlvbnMsIHVzZWQgZm9yIGRlZXANCj4gPiBzbGVlcCBhbmQgaGliZXJuYXRpb24g
dG8gc2F2ZS9yZXN0b3JlIGNvcmUgcmVnaXN0ZXJzLiBXZSBhYnN0cmFjdCBvdXQNCj4gPiBzYXZl
L3Jlc3RvcmUgY29kZSBmb3IgdXNlIGluIHZhcmlvdXMgbW9kdWxlcywgdG8gbWFrZSB0aGVtIGRv
bid0IG5lZWQNCj4gPiB0byBtYWludGFpbi4NCj4gPg0KPiA+IEN1cnJlbnRseSBzdXBwb3J0ZWQg
cHJvY2Vzc29ycyB0eXBlIGFyZSBFNjUwMCwgRTU1MDAsIEU1MDBNQywgRTUwMHYyDQo+ID4gYW5k
IEU1MDB2MS4NCj4gPg0KPiA+IFNpZ25lZC1vZmYtYnk6IFdhbmcgRG9uZ3NoZW5nIDxkb25nc2hl
bmcud2FuZ0BmcmVlc2NhbGUuY29tPg0KPiANCj4gV2hhdCBpcyB0aGVyZSB0aGF0IGlzIHNwZWNm
aWMgdG8gYSBwYXJ0aWN1bGFyIGNvcmUgdHlwZSB0aGF0IGNhbid0IGJlIGhhbmRsZWQNCj4gZnJv
bSBDIGNvZGU/DQo+IA0KDQpJbiB0aGUgY29udGV4dCBvZiB0aGUgY2FsbGluZywgbWF5YmUgbm90
IGluIEMgZW52aXJvbm1lbnQuKERlZXAgc2xlZXAgd2l0aG91dA0KQyBlbnZpcm9ubWVudCB3aGVu
IGNhbGxpbmcgdGhvc2UgaW50ZXJmYWNlcykNCg0KPiA+ICsJLyoNCj4gPiArCSAqIE5lZWQgdG8g
c2F2ZSBmbG9hdC1wb2ludCByZWdpc3RlcnMgaWYgTVNSW0ZQXSA9IDEuDQo+ID4gKwkgKi8NCj4g
PiArCW1mbXNyCXIxMg0KPiA+ICsJYW5kaS4JcjEyLCByMTIsIE1TUl9GUA0KPiA+ICsJYmVxCTFm
DQo+ID4gKwlkb19zcl9mcHJfcmVncyhzYXZlKQ0KPiANCj4gQyBjb2RlIHNob3VsZCBoYXZlIGFs
cmVhZHkgZW5zdXJlZCB0aGF0IE1TUltGUF0gaXMgbm90IDEgKGFuZCB0aHVzIHRoZSBGUA0KPiBj
b250ZXh0IGhhcyBiZWVuIHNhdmVkKS4NCj4gDQoNClllcywgcmlnaHQuIEJ1dCBJIG1lYW4gaWYg
dGhlIEZQIHN0aWxsIHVzZSBpbiBjb3JlIHNhdmUgZmxvdywgd2UgbmVlZCB0byBzYXZlIGl0Lg0K
SW4gdGhpcyBwcm9jZXNzLCBpIGRvbid0IGNhcmUgd2hhdCBvdGhlciBjb2RlIGRvLCB3ZSBuZWVk
IHRvIGZvY3VzIG9uIG5vdCBsb3NpbmcNCnZhbHVhYmxlIGRhdGEuDQoNCj4gPiArLyoNCj4gPiAr
ICogcjMgPSB0aGUgdmlydHVhbCBhZGRyZXNzIG9mIGJ1ZmZlcg0KPiA+ICsgKiByNCA9IHN1c3Bl
bmQgdHlwZSwgMC1CQVNFX1NBVkUsIDEtQUxMX1NBVkUNCj4gDQo+ICNkZWZpbmUgdGhlc2UgbWFn
aWMgbnVtYmVycywgYW5kIGRlZmluZSB3aGF0IGlzIG1lYW50IGJ5ICJiYXNlIHNhdmUiDQo+IHZl
cnN1cyAiYWxsIHNhdmUiLg0KDQpPaywgdGhhbmtzLg0KDQotRG9uZ3NoZW5nDQoNCg==

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox