LinuxPPC-Dev Archive on lore.kernel.org
* Re: MPC8536 PCI rescan to discover FPGA
From: Benjamin Herrenschmidt @ 2009-09-22  7:23 UTC
  To: David Hawkins; +Cc: linuxppc-dev@ozlabs.org list, Felix Radensky, Ira Snyder
In-Reply-To: <4AB7A411.3030406@ovro.caltech.edu>

On Mon, 2009-09-21 at 09:04 -0700, David Hawkins wrote:
> This can be made to work using the kernel hot-swap
> interface. PCI devices have an ENUM# interrupt that
> they assert when inserted or extracted, and the host
> hot-swap driver can be hooked up to it. PCI-E may
> have a similar mechanism; if it does, then when your
> FPGA configures as a PCI-E device, it can assert that
> interrupt line (or send the appropriate PCI-E
> message to simulate that interrupt).
> 
> However, even if PCI-E does not have the concept of
> an ENUM# interrupt there is a way to generate a fake
> hot-swap event and generate a re-scan of the PCI bus.
> 
> I haven't tested the kernel hot-swap interface, but I
> know that Ira did, so I'll cc him on this mail, and he
> can let you know what he tested.

Right. However, if getting hot-swap implemented on the machine is
too much work, you may still be able to do something simpler from
your platform code after you've finished loading the FPGA. I assume
the FPGA doesn't contain a P2P bridge that would require probing
further below the FPGA itself.

The basic idea is to call pci_scan_slot() on the devfn where
the FPGA is supposed to respond.

Then you also need to do some fixup. First, call
pcibios_setup_bus_devices(). This will wire up the device
to an OF node if you have one, set up some default DMA ops,
etc.

Note that this function walks over all devices on that bus,
which matters since some of them may already have been
fully set up at boot. Hopefully that isn't a problem. If it
were to become one, we would have to figure out a way to skip
devices that have already been set up.

And finally you call pcibios_finish_adding_to_bus() which will
do the resource allocation pass on all new devices on the bus
and register them with the core device layer.
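
The order of the three calls can be sketched as below. This is a userspace
mock, not kernel code: struct pci_bus and the three functions are stubs
standing in for the kernel routines named above, with simplified, assumed
signatures, and fpga_rescan() is a hypothetical helper name. It only shows
the call order and the early exit when nothing responds at the devfn.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct pci_bus; only what the mock needs. */
struct pci_bus {
	int fpga_present;	/* assumption: 1 once the FPGA is loaded */
	int ndevices;		/* devices currently known on the bus */
	int registered;		/* devices registered with the device core */
};

/* Stub: the real pci_scan_slot() probes config space at devfn. */
static int pci_scan_slot(struct pci_bus *bus, int devfn)
{
	(void)devfn;
	if (!bus->fpga_present)
		return 0;
	bus->ndevices++;
	return 1;		/* number of devices found */
}

/* Stub: the real one walks all devices on the bus (OF node, DMA ops). */
static void pcibios_setup_bus_devices(struct pci_bus *bus)
{
	(void)bus;
}

/* Stub: the real one allocates resources and registers new devices. */
static void pcibios_finish_adding_to_bus(struct pci_bus *bus)
{
	bus->registered = bus->ndevices;
}

/* Hypothetical helper: call after the FPGA bitstream has been loaded. */
int fpga_rescan(struct pci_bus *bus, int devfn)
{
	assert(bus != NULL);
	if (pci_scan_slot(bus, devfn) == 0)
		return -1;	/* nothing answered at devfn */
	pcibios_setup_bus_devices(bus);
	pcibios_finish_adding_to_bus(bus);
	return 0;
}
```

In real platform code the bus pointer and devfn would come from wherever
the FPGA is wired on the board; the mock leaves that out.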

Cheers,
Ben.


* [PATCH] powerpc: Fix ibm,client-architecture-support printout
From: Anton Blanchard @ 2009-09-22  6:47 UTC
  To: benh; +Cc: linuxppc-dev


On machines without the ibm,client-architecture-support call we were missing a
newline. We may as well print the full name in all its glory too - it's
ibm,client-architecture-support, not ibm,client-architecture as I mistakenly
wrote (a name only an IBM architect could love).

For my penance I will write out ibm,client-architecture-support 100 times.

Before:

Calling ibm,client-architecture...command line: root=/dev/sda6 console=hvc0  quiet

After:

Calling ibm,client-architecture-support... not implemented
command line: root=/dev/sda6 console=hvc0  

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/prom_init.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/prom_init.c	2009-09-22 16:18:12.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/prom_init.c	2009-09-22 16:37:59.000000000 +1000
@@ -800,7 +800,7 @@ static void __init prom_send_capabilitie
 	root = call_prom("open", 1, 1, ADDR("/"));
 	if (root != 0) {
 		/* try calling the ibm,client-architecture-support method */
-		prom_printf("Calling ibm,client-architecture...");
+		prom_printf("Calling ibm,client-architecture-support...");
 		if (call_prom_ret("call-method", 3, 2, &ret,
 				  ADDR("ibm,client-architecture-support"),
 				  root,
@@ -814,6 +814,7 @@ static void __init prom_send_capabilitie
 			return;
 		}
 		call_prom("close", 1, 0, root);
+		prom_printf(" not implemented\n");
 	}
 
 	/* no ibm,client-architecture-support call, try the old way */


* [PATCH] powerpc: Increase NODES_SHIFT on 64bit from 4 to 8
From: Anton Blanchard @ 2009-09-22  5:56 UTC
  To: benh; +Cc: linuxppc-dev


Some System p configurations can already have more than 16 nodes, so we
need to increase NODES_SHIFT. I chose 8 (256 nodes) to give us some room to
grow in the future, although we can look at something smaller if the memory
bloat is considered too much.

Unless we clamp MAX_ACTIVE_REGIONS we end up with 300kB of extra bloat in
early_node_map in mm/page_alloc.c:

< 6144   early_node_map
> 307200 early_node_map

due to:

    #if MAX_NUMNODES >= 32
      /* If there can be many nodes, allow up to 50 holes per node */
      #define MAX_ACTIVE_REGIONS (MAX_NUMNODES*50)
    #else
      /* By default, allow up to 256 distinct regions */
      #define MAX_ACTIVE_REGIONS 256
    #endif

Since our memory is mostly contiguous it seems reasonable to keep this
at 256 for now. I also set 32bit to 32 to save space (is there any chance
a 32bit system will have more than 32 discontiguous memory ranges?).
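
To make the early_node_map numbers concrete, here is a small userspace
calculation that reproduces them. It is only a sketch: the function names
are made up for illustration, ENTRY_BYTES is inferred from the sizes above
(6144 / 256 = 24 bytes per entry), and MAX_NUMNODES is taken to be
1 << NODES_SHIFT.

```c
#include <assert.h>

/* Bytes per early_node_map entry, inferred from the mail: 6144 / 256. */
#define ENTRY_BYTES 24UL

/* Mirrors the #if quoted above when no override is configured. */
unsigned long max_active_regions(unsigned int nodes_shift)
{
	unsigned long max_numnodes;

	assert(nodes_shift < 16);	/* keep the sketch's shift sane */
	max_numnodes = 1UL << nodes_shift;
	return max_numnodes >= 32 ? max_numnodes * 50 : 256;
}

/* Size of early_node_map[] for a given NODES_SHIFT, in bytes. */
unsigned long early_node_map_bytes(unsigned int nodes_shift)
{
	return max_active_regions(nodes_shift) * ENTRY_BYTES;
}
```

With NODES_SHIFT=4 this gives 256 regions and 6144 bytes; with
NODES_SHIFT=8 it gives 12800 regions and 307200 bytes, matching the
before/after sizes listed above.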

Even with that fixed we have a few data structures that grow:

< 896   bootmem_node_data
> 14336 bootmem_node_data

< 1280  node_devices
> 20480 node_devices

< 25088 kmalloc_caches
> 59648 kmalloc_caches

< 1632  hstates
> 21792 hstates

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig	2009-09-22 15:25:27.000000000 +1000
+++ linux.trees.git/arch/powerpc/Kconfig	2009-09-22 15:52:16.000000000 +1000
@@ -385,9 +385,15 @@ config NUMA
 
 config NODES_SHIFT
 	int
+	default "8" if PPC64
 	default "4"
 	depends on NEED_MULTIPLE_NODES
 
+config MAX_ACTIVE_REGIONS
+	int
+	default "256" if PPC64
+	default "32"
+
 config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on PPC64


* [v5 RFC PATCH 7/7]: POWER/pSeries: implement pSeries processor idle module.
From: Arun R Bharadwaj @ 2009-09-22  5:41 UTC
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

This patch creates arch/powerpc/platforms/pseries/processor_idle.c,
which implements the cpuidle infrastructure for pseries.
It implements pseries_cpuidle_loop(), which becomes the main idle loop
called from cpu_idle(). Based on the decision taken by the cpuidle
governor, it enters either cede1 or cede2 on a dedicated LPAR, or
shared_cede on a shared-processor LPAR.
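
The idle loop in the diff below also returns the time spent idle in
microseconds, clamped to INT_MAX. A userspace sketch of that
measure-and-clamp pattern follows. It assumes clock_gettime() with
CLOCK_MONOTONIC as a stand-in for ktime_get(), and the function names are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <time.h>

/* Clamp a microsecond count to what an int return value can carry. */
int clamp_us_to_int(int64_t us)
{
	assert(us >= 0);	/* a monotonic clock never goes backwards */
	return us > INT_MAX ? INT_MAX : (int)us;
}

/* Run idle_fn and return the time it took, in microseconds. */
int timed_idle(void (*idle_fn)(void))
{
	struct timespec t1, t2;
	int64_t ns;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	idle_fn();
	clock_gettime(CLOCK_MONOTONIC, &t2);

	/* Total elapsed nanoseconds, then convert to microseconds. */
	ns = (int64_t)(t2.tv_sec - t1.tv_sec) * 1000000000LL
	   + (int64_t)(t2.tv_nsec - t1.tv_nsec);
	return clamp_us_to_int(ns / 1000);
}
```

Passing any void function to timed_idle() returns the microseconds it
ran for, the userspace equivalent of what the cpuidle core receives from
an enter callback in this series.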

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/Makefile         |    1 
 arch/powerpc/platforms/pseries/processor_idle.c |  191 ++++++++++++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h        |    9 +
 arch/powerpc/platforms/pseries/setup.c          |    8 -
 4 files changed, 207 insertions(+), 2 deletions(-)

Index: linux.trees.git/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/Makefile
+++ linux.trees.git/arch/powerpc/platforms/pseries/Makefile
@@ -26,3 +26,4 @@ obj-$(CONFIG_HCALL_STATS)	+= hvCall_inst
 obj-$(CONFIG_PHYP_DUMP)	+= phyp_dump.o
 obj-$(CONFIG_CMM)		+= cmm.o
 obj-$(CONFIG_DTL)		+= dtl.o
+obj-$(CONFIG_PSERIES_PROCESSOR_IDLE)	+= processor_idle.o
Index: linux.trees.git/arch/powerpc/platforms/pseries/pseries.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/pseries.h
+++ linux.trees.git/arch/powerpc/platforms/pseries/pseries.h
@@ -10,6 +10,8 @@
 #ifndef _PSERIES_PSERIES_H
 #define _PSERIES_PSERIES_H
 
+#include <linux/cpuidle.h>
+
 extern void __init fw_feature_init(const char *hypertas, unsigned long len);
 
 struct pt_regs;
@@ -40,4 +42,11 @@ extern unsigned long rtas_poweron_auto;
 
 extern void find_udbg_vterm(void);
 
+DECLARE_PER_CPU(unsigned long, smt_snooze_delay);
+
+#ifdef CONFIG_PSERIES_PROCESSOR_IDLE
+int pseries_processor_idle_init(void);
+extern struct cpuidle_driver pseries_idle_driver;
+#endif
+
 #endif /* _PSERIES_PSERIES_H */
Index: linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
===================================================================
--- /dev/null
+++ linux.trees.git/arch/powerpc/platforms/pseries/processor_idle.c
@@ -0,0 +1,191 @@
+/*
+ *  processor_idle - idle state cpuidle driver.
+ *  Adapted from drivers/acpi/processor_idle.c
+ *
+ *  Arun R Bharadwaj <arun@linux.vnet.ibm.com>
+ *
+ *  Copyright (C) 2009 IBM Corporation.
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or (at
+ *  your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful, but
+ *  WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License along
+ *  with this program; if not, write to the Free Software Foundation, Inc.,
+ *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/cpuidle.h>
+
+#include <asm/paca.h>
+#include <asm/reg.h>
+#include <asm/system.h>
+#include <asm/machdep.h>
+#include <asm/firmware.h>
+
+#include "plpar_wrappers.h"
+#include "pseries.h"
+
+MODULE_AUTHOR("Arun R Bharadwaj");
+MODULE_DESCRIPTION("pSeries Idle State Driver");
+MODULE_LICENSE("GPL");
+
+struct cpuidle_driver pseries_idle_driver = {
+	.name =		"pseries_idle",
+	.owner =	THIS_MODULE,
+};
+
+DEFINE_PER_CPU(struct cpuidle_device, pseries_dev);
+
+#define IDLE_STATE_COUNT	2
+
+static int pseries_idle_init(struct cpuidle_device *dev)
+{
+	return cpuidle_register_device(dev);
+}
+
+static void shared_cede(void)
+{
+	get_lppaca()->idle = 1;
+	cede_processor();
+	get_lppaca()->idle = 0;
+}
+
+static void cede1(void)
+{
+	local_irq_enable();
+	set_thread_flag(TIF_POLLING_NRFLAG);
+	while (!need_resched()) {
+		ppc64_runlatch_off();
+		HMT_low();
+		HMT_very_low();
+	}
+	HMT_medium();
+	clear_thread_flag(TIF_POLLING_NRFLAG);
+	smp_mb();
+	local_irq_disable();
+}
+
+static void cede2(void)
+{
+	ppc64_runlatch_off();
+	HMT_medium();
+	cede_processor();
+}
+
+static int pseries_cpuidle_loop(struct cpuidle_device *dev,
+				struct cpuidle_state *st)
+{
+	ktime_t t1, t2;
+	s64 diff;
+	int ret;
+	unsigned long in_purr, out_purr;
+
+	get_lppaca()->idle = 1;
+	get_lppaca()->donate_dedicated_cpu = 1;
+	in_purr = mfspr(SPRN_PURR);
+
+	t1 = ktime_get();
+
+	if (strcmp(st->desc, "shared_cede") == 0)
+		shared_cede();
+	else if (strcmp(st->desc, "cede1") == 0)
+		cede1();
+	else
+		cede2();
+
+	t2 = ktime_get();
+	diff = ktime_to_us(ktime_sub(t2, t1));
+	if (diff > INT_MAX)
+		diff = INT_MAX;
+
+	ret = (int) diff;
+
+	out_purr = mfspr(SPRN_PURR);
+	get_lppaca()->wait_state_cycles += out_purr - in_purr;
+	get_lppaca()->donate_dedicated_cpu = 0;
+	get_lppaca()->idle = 0;
+
+	return ret;
+}
+
+static int pseries_setup_cpuidle(struct cpuidle_device *dev, int cpu)
+{
+	int i;
+	struct cpuidle_state *state;
+
+	dev->cpu = cpu;
+
+	if (get_lppaca()->shared_proc) {
+		state = &dev->states[0];
+		snprintf(state->name, CPUIDLE_NAME_LEN, "IDLE");
+		state->enter = pseries_cpuidle_loop;
+		strncpy(state->desc, "shared_cede", CPUIDLE_DESC_LEN);
+		state->exit_latency = 0;
+		state->target_residency = 0;
+		return 0;
+	}
+
+	for (i = 0; i < IDLE_STATE_COUNT; i++) {
+		state = &dev->states[i];
+
+		snprintf(state->name, CPUIDLE_NAME_LEN, "IDLE%d", i);
+		state->enter = pseries_cpuidle_loop;
+
+		switch (i) {
+		case 0:
+			strncpy(state->desc, "cede1", CPUIDLE_DESC_LEN);
+			state->exit_latency = 0;
+			state->target_residency = 0;
+			break;
+
+		case 1:
+			strncpy(state->desc, "cede2", CPUIDLE_DESC_LEN);
+			state->exit_latency = 1;
+			state->target_residency =
+					__get_cpu_var(smt_snooze_delay);
+			break;
+		}
+	}
+	dev->state_count = IDLE_STATE_COUNT;
+
+	return 0;
+}
+
+int __init pseries_processor_idle_init(void)
+{
+	int cpu;
+	int result = cpuidle_register_driver(&pseries_idle_driver);
+
+	if (result < 0)
+		return result;
+
+	printk(KERN_DEBUG "pSeries idle driver registered\n");
+
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR)) {
+		printk(KERN_DEBUG "Using default idle\n");
+		return 0;
+	}
+
+	for_each_online_cpu(cpu) {
+		pseries_setup_cpuidle(&per_cpu(pseries_dev, cpu), cpu);
+		pseries_idle_init(&per_cpu(pseries_dev, cpu));
+	}
+
+	printk(KERN_DEBUG "Using cpuidle idle loop\n");
+
+	return 0;
+}
Index: linux.trees.git/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/setup.c
+++ linux.trees.git/arch/powerpc/platforms/pseries/setup.c
@@ -297,9 +297,13 @@ static void __init pSeries_setup_arch(vo
 
 	pSeries_nvram_init();
 
-	/* Choose an idle loop */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR))
+	/* Register an idle loop with cpuidle */
+	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
 		vpa_init(boot_cpuid);
+#ifdef CONFIG_PSERIES_PROCESSOR_IDLE
+		pseries_processor_idle_init();
+#endif
+	}
 
 	if (firmware_has_feature(FW_FEATURE_LPAR))
 		ppc_md.enable_pmcs = pseries_lpar_enable_pmcs;


* [v5 RFC PATCH 6/7]: POWER: add a default_idle idle loop for POWER.
From: Arun R Bharadwaj @ 2009-09-22  5:40 UTC
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

In arch/powerpc/kernel/idle.c, create a default_idle() routine by moving
out the fallback branch of the cpu_idle() idle loop. The cpuidle
infrastructure needs this so it can call default_idle() when no other idle
routines are registered yet. Functionality remains the same; the code is
only moved around slightly.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig              |    3 +++
 arch/powerpc/include/asm/system.h |    1 +
 arch/powerpc/kernel/idle.c        |    6 ++++++
 3 files changed, 10 insertions(+)

Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig
+++ linux.trees.git/arch/powerpc/Kconfig
@@ -88,6 +88,9 @@ config ARCH_HAS_ILOG2_U64
 	bool
 	default y if 64BIT
 
+config ARCH_HAS_DEFAULT_IDLE
+	def_bool y
+
 config GENERIC_HWEIGHT
 	bool
 	default y
Index: linux.trees.git/arch/powerpc/include/asm/system.h
===================================================================
--- linux.trees.git.orig/arch/powerpc/include/asm/system.h
+++ linux.trees.git/arch/powerpc/include/asm/system.h
@@ -218,6 +218,7 @@ extern unsigned long klimit;
 extern void *alloc_maybe_bootmem(size_t size, gfp_t mask);
 extern void *zalloc_maybe_bootmem(size_t size, gfp_t mask);
 
+extern void default_idle(void);
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 
 /*
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -94,6 +94,12 @@ void cpu_idle(void)
 	}
 }
 
+void default_idle(void)
+{
+	HMT_low();
+	HMT_very_low();
+}
+
 int powersave_nap;
 
 #ifdef CONFIG_SYSCTL


* [v5 RFC PATCH 5/7]: POWER/pSeries: remove dedicated/shared idle loops, which will be moved to arch/powerpc/platforms/pseries/processor_idle.c
From: Arun R Bharadwaj @ 2009-09-22  5:39 UTC
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

This patch removes the routines pseries_shared_idle_sleep and
pseries_dedicated_idle_sleep, since they are now implemented as part
of arch/powerpc/platforms/pseries/processor_idle.c.

Also, similar to x86, call cpuidle_idle_call from the cpu_idle() idle
loop instead of ppc_md.power_save.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/idle.c             |   50 +++++++-----------
 arch/powerpc/platforms/pseries/setup.c |   89 ---------------------------------
 2 files changed, 22 insertions(+), 117 deletions(-)

Index: linux.trees.git/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/platforms/pseries/setup.c
+++ linux.trees.git/arch/powerpc/platforms/pseries/setup.c
@@ -75,9 +75,6 @@ EXPORT_SYMBOL(CMO_PageSize);
 
 int fwnmi_active;  /* TRUE if an FWNMI handler is present */
 
-static void pseries_shared_idle_sleep(void);
-static void pseries_dedicated_idle_sleep(void);
-
 static struct device_node *pSeries_mpic_node;
 
 static void pSeries_show_cpuinfo(struct seq_file *m)
@@ -301,18 +298,8 @@ static void __init pSeries_setup_arch(vo
 	pSeries_nvram_init();
 
 	/* Choose an idle loop */
-	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
+	if (firmware_has_feature(FW_FEATURE_SPLPAR))
 		vpa_init(boot_cpuid);
-		if (get_lppaca()->shared_proc) {
-			printk(KERN_DEBUG "Using shared processor idle loop\n");
-			ppc_md.power_save = pseries_shared_idle_sleep;
-		} else {
-			printk(KERN_DEBUG "Using dedicated idle loop\n");
-			ppc_md.power_save = pseries_dedicated_idle_sleep;
-		}
-	} else {
-		printk(KERN_DEBUG "Using default idle loop\n");
-	}
 
 	if (firmware_has_feature(FW_FEATURE_LPAR))
 		ppc_md.enable_pmcs = pseries_lpar_enable_pmcs;
@@ -500,80 +487,6 @@ static int __init pSeries_probe(void)
 	return 1;
 }
 
-
-DECLARE_PER_CPU(unsigned long, smt_snooze_delay);
-
-static void pseries_dedicated_idle_sleep(void)
-{ 
-	unsigned int cpu = smp_processor_id();
-	unsigned long start_snooze;
-	unsigned long in_purr, out_purr;
-
-	/*
-	 * Indicate to the HV that we are idle. Now would be
-	 * a good time to find other work to dispatch.
-	 */
-	get_lppaca()->idle = 1;
-	get_lppaca()->donate_dedicated_cpu = 1;
-	in_purr = mfspr(SPRN_PURR);
-
-	/*
-	 * We come in with interrupts disabled, and need_resched()
-	 * has been checked recently.  If we should poll for a little
-	 * while, do so.
-	 */
-	if (__get_cpu_var(smt_snooze_delay)) {
-		start_snooze = get_tb() +
-			__get_cpu_var(smt_snooze_delay) * tb_ticks_per_usec;
-		local_irq_enable();
-		set_thread_flag(TIF_POLLING_NRFLAG);
-
-		while (get_tb() < start_snooze) {
-			if (need_resched() || cpu_is_offline(cpu))
-				goto out;
-			ppc64_runlatch_off();
-			HMT_low();
-			HMT_very_low();
-		}
-
-		HMT_medium();
-		clear_thread_flag(TIF_POLLING_NRFLAG);
-		smp_mb();
-		local_irq_disable();
-		if (need_resched() || cpu_is_offline(cpu))
-			goto out;
-	}
-
-	cede_processor();
-
-out:
-	HMT_medium();
-	out_purr = mfspr(SPRN_PURR);
-	get_lppaca()->wait_state_cycles += out_purr - in_purr;
-	get_lppaca()->donate_dedicated_cpu = 0;
-	get_lppaca()->idle = 0;
-}
-
-static void pseries_shared_idle_sleep(void)
-{
-	/*
-	 * Indicate to the HV that we are idle. Now would be
-	 * a good time to find other work to dispatch.
-	 */
-	get_lppaca()->idle = 1;
-
-	/*
-	 * Yield the processor to the hypervisor.  We return if
-	 * an external interrupt occurs (which are driven prior
-	 * to returning here) or if a prod occurs from another
-	 * processor. When returning here, external interrupts
-	 * are enabled.
-	 */
-	cede_processor();
-
-	get_lppaca()->idle = 0;
-}
-
 static int pSeries_pci_probe_mode(struct pci_bus *bus)
 {
 	if (firmware_has_feature(FW_FEATURE_LPAR))
Index: linux.trees.git/arch/powerpc/kernel/idle.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/idle.c
+++ linux.trees.git/arch/powerpc/kernel/idle.c
@@ -25,6 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/sysctl.h>
 #include <linux/tick.h>
+#include <linux/cpuidle.h>
 
 #include <asm/system.h>
 #include <asm/processor.h>
@@ -60,35 +61,26 @@ void cpu_idle(void)
 		while (!need_resched() && !cpu_should_die()) {
 			ppc64_runlatch_off();
 
-			if (ppc_md.power_save) {
-				clear_thread_flag(TIF_POLLING_NRFLAG);
-				/*
-				 * smp_mb is so clearing of TIF_POLLING_NRFLAG
-				 * is ordered w.r.t. need_resched() test.
-				 */
-				smp_mb();
-				local_irq_disable();
-
-				/* Don't trace irqs off for idle */
-				stop_critical_timings();
-
-				/* check again after disabling irqs */
-				if (!need_resched() && !cpu_should_die())
-					ppc_md.power_save();
-
-				start_critical_timings();
-
-				local_irq_enable();
-				set_thread_flag(TIF_POLLING_NRFLAG);
-
-			} else {
-				/*
-				 * Go into low thread priority and possibly
-				 * low power mode.
-				 */
-				HMT_low();
-				HMT_very_low();
-			}
+			clear_thread_flag(TIF_POLLING_NRFLAG);
+			/*
+			 * smp_mb is so clearing of TIF_POLLING_NRFLAG
+			 * is ordered w.r.t. need_resched() test.
+			 */
+			smp_mb();
+			local_irq_disable();
+
+			/* Don't trace irqs off for idle */
+			stop_critical_timings();
+
+			/* check again after disabling irqs */
+			if (!need_resched() && !cpu_should_die())
+				cpuidle_idle_call();
+
+			start_critical_timings();
+
+			local_irq_enable();
+			set_thread_flag(TIF_POLLING_NRFLAG);
+
 		}
 
 		HMT_medium();


* [v5 RFC PATCH 4/7]: POWER: enable cpuidle for POWER.
From: Arun R Bharadwaj @ 2009-09-22  5:38 UTC
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

This patch enables the cpuidle option in Kconfig for pSeries.

Currently cpuidle infrastructure is enabled only for x86 and ARM.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux.trees.git/arch/powerpc/Kconfig
===================================================================
--- linux.trees.git.orig/arch/powerpc/Kconfig
+++ linux.trees.git/arch/powerpc/Kconfig
@@ -243,6 +243,20 @@ source "kernel/Kconfig.freezer"
 source "arch/powerpc/sysdev/Kconfig"
 source "arch/powerpc/platforms/Kconfig"
 
+menu "Power management options"
+
+source "drivers/cpuidle/Kconfig"
+
+config PSERIES_PROCESSOR_IDLE
+	bool "Idle Power Management Support for pSeries"
+	depends on PPC_PSERIES && CPU_IDLE
+	default y
+	help
+	  Idle Power Management Support for pSeries. This hooks onto cpuidle
+	  infrastructure to help in idle cpu power management.
+
+endmenu
+
 menu "Kernel options"
 
 config HIGHMEM


* [v5 RFC PATCH 3/7]: x86: refactor x86 idle power management code and remove all instances of pm_idle.
From: Arun R Bharadwaj @ 2009-09-22  5:37 UTC
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

This patch removes all instances of pm_idle from x86.

pm_idle, which was previously called from the cpu_idle() idle loop,
is replaced by cpuidle_idle_call.

x86 also registers with cpuidle when the idle routine is selected,
by populating the cpuidle_device data structure for each cpu.

The same is done for the apm module and for xen, which also used pm_idle.


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 arch/x86/kernel/apm_32.c     |   37 +++++++++++++++++++++--
 arch/x86/kernel/process.c    |   69 ++++++++++++++++++++++++++++++++++---------
 arch/x86/kernel/process_32.c |    3 +
 arch/x86/kernel/process_64.c |    3 +
 arch/x86/xen/setup.c         |   22 +++++++++++++
 5 files changed, 114 insertions(+), 20 deletions(-)

Index: linux.trees.git/arch/x86/kernel/process.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process.c
+++ linux.trees.git/arch/x86/kernel/process.c
@@ -9,6 +9,8 @@
 #include <linux/pm.h>
 #include <linux/clockchips.h>
 #include <linux/random.h>
+#include <linux/cpuidle.h>
+
 #include <trace/power.h>
 #include <asm/system.h>
 #include <asm/apic.h>
@@ -247,12 +249,6 @@ int sys_vfork(struct pt_regs *regs)
 unsigned long boot_option_idle_override = 0;
 EXPORT_SYMBOL(boot_option_idle_override);
 
-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
-
 #ifdef CONFIG_X86_32
 /*
  * This halt magic was a workaround for ancient floppy DMA
@@ -531,15 +527,58 @@ static void c1e_idle(void)
 		default_idle();
 }
 
+static void (*local_idle)(void);
+DEFINE_PER_CPU(struct cpuidle_device, idle_devices);
+
+struct cpuidle_driver cpuidle_default_driver = {
+	.name =         "cpuidle_default",
+};
+
+static int local_idle_loop(struct cpuidle_device *dev, struct cpuidle_state *st)
+{
+	ktime_t t1, t2;
+	s64 diff;
+	int ret;
+
+	t1 = ktime_get();
+	local_idle();
+	t2 = ktime_get();
+
+	diff = ktime_to_us(ktime_sub(t2, t1));
+	if (diff > INT_MAX)
+		diff = INT_MAX;
+	ret = (int) diff;
+
+	return ret;
+}
+static int __cpuinit setup_cpuidle_simple(void)
+{
+	struct cpuidle_device *dev;
+	int cpu;
+
+	if (!cpuidle_curr_driver)
+		cpuidle_register_driver(&cpuidle_default_driver);
+
+	for_each_online_cpu(cpu) {
+		dev = &per_cpu(idle_devices, cpu);
+		dev->cpu = cpu;
+		dev->states[0].enter = local_idle_loop;
+		dev->state_count = 1;
+		cpuidle_register_device(dev);
+	}
+	return 0;
+}
+late_initcall(setup_cpuidle_simple);
+
 void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_SMP
-	if (pm_idle == poll_idle && smp_num_siblings > 1) {
+	if (local_idle == poll_idle && smp_num_siblings > 1) {
 		printk(KERN_WARNING "WARNING: polling idle and HT enabled,"
 			" performance may degrade.\n");
 	}
 #endif
-	if (pm_idle)
+	if (local_idle)
 		return;
 
 	if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
@@ -547,18 +586,20 @@ void __cpuinit select_idle_routine(const
 		 * One CPU supports mwait => All CPUs supports mwait
 		 */
 		printk(KERN_INFO "using mwait in idle threads.\n");
-		pm_idle = mwait_idle;
+		local_idle = mwait_idle;
 	} else if (check_c1e_idle(c)) {
 		printk(KERN_INFO "using C1E aware idle routine\n");
-		pm_idle = c1e_idle;
+		local_idle = c1e_idle;
 	} else
-		pm_idle = default_idle;
+		local_idle = default_idle;
+
+	return;
 }
 
 void __init init_c1e_mask(void)
 {
 	/* If we're using c1e_idle, we need to allocate c1e_mask. */
-	if (pm_idle == c1e_idle) {
+	if (local_idle == c1e_idle) {
 		alloc_cpumask_var(&c1e_mask, GFP_KERNEL);
 		cpumask_clear(c1e_mask);
 	}
@@ -571,7 +612,7 @@ static int __init idle_setup(char *str)
 
 	if (!strcmp(str, "poll")) {
 		printk("using polling idle threads.\n");
-		pm_idle = poll_idle;
+		local_idle = poll_idle;
 	} else if (!strcmp(str, "mwait"))
 		force_mwait = 1;
 	else if (!strcmp(str, "halt")) {
@@ -582,7 +623,7 @@ static int __init idle_setup(char *str)
 		 * To continue to load the CPU idle driver, don't touch
 		 * the boot_option_idle_override.
 		 */
-		pm_idle = default_idle;
+		local_idle = default_idle;
 		idle_halt = 1;
 		return 0;
 	} else if (!strcmp(str, "nomwait")) {
Index: linux.trees.git/arch/x86/kernel/process_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process_32.c
+++ linux.trees.git/arch/x86/kernel/process_32.c
@@ -40,6 +40,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/kdebug.h>
+#include <linux/cpuidle.h>
 
 #include <asm/pgtable.h>
 #include <asm/system.h>
@@ -113,7 +114,7 @@ void cpu_idle(void)
 			local_irq_disable();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
-			pm_idle();
+			cpuidle_idle_call();
 			start_critical_timings();
 		}
 		tick_nohz_restart_sched_tick();
Index: linux.trees.git/arch/x86/kernel/process_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/process_64.c
+++ linux.trees.git/arch/x86/kernel/process_64.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/ftrace.h>
 #include <linux/dmi.h>
+#include <linux/cpuidle.h>
 
 #include <asm/pgtable.h>
 #include <asm/system.h>
@@ -142,7 +143,7 @@ void cpu_idle(void)
 			enter_idle();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
-			pm_idle();
+			cpuidle_idle_call();
 			start_critical_timings();
 			/* In many cases the interrupt that ended idle
 			   has already called exit_idle. But some idle
Index: linux.trees.git/arch/x86/kernel/apm_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/apm_32.c
+++ linux.trees.git/arch/x86/kernel/apm_32.c
@@ -2257,6 +2257,38 @@ static struct dmi_system_id __initdata a
 	{ }
 };
 
+DEFINE_PER_CPU(struct cpuidle_device, apm_idle_devices);
+
+struct cpuidle_driver cpuidle_apm_driver = {
+	.name =         "cpuidle_apm",
+};
+
+void __cpuinit setup_cpuidle_apm(void)
+{
+	struct cpuidle_device *dev;
+
+	if (!cpuidle_curr_driver)
+		cpuidle_register_driver(&cpuidle_apm_driver);
+
+	dev = &per_cpu(apm_idle_devices, smp_processor_id());
+	dev->cpu = smp_processor_id();
+	dev->states[0].enter = apm_cpu_idle;
+	dev->state_count = 1;
+	cpuidle_register_device(dev);
+}
+
+void exit_cpuidle_apm(void)
+{
+	struct cpuidle_device *dev;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		dev = &per_cpu(apm_idle_devices, cpu);
+		cpuidle_unregister_device(dev);
+	}
+}
+
+
 /*
  * Just start the APM thread. We do NOT want to do APM BIOS
  * calls from anything but the APM thread, if for no other reason
@@ -2394,8 +2426,7 @@ static int __init apm_init(void)
 	if (HZ != 100)
 		idle_period = (idle_period * HZ) / 100;
 	if (idle_threshold < 100) {
-		original_pm_idle = pm_idle;
-		pm_idle  = apm_cpu_idle;
+		setup_cpuidle_apm();
 		set_pm_idle = 1;
 	}
 
@@ -2407,7 +2438,7 @@ static void __exit apm_exit(void)
 	int error;
 
 	if (set_pm_idle) {
-		pm_idle = original_pm_idle;
+		exit_cpuidle_apm();
 		/*
 		 * We are about to unload the current idle thread pm callback
 		 * (pm_idle), Wait for all processors to update cached/local
Index: linux.trees.git/arch/x86/xen/setup.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/setup.c
+++ linux.trees.git/arch/x86/xen/setup.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 #include <linux/pm.h>
+#include <linux/cpuidle.h>
 
 #include <asm/elf.h>
 #include <asm/vdso.h>
@@ -151,6 +152,25 @@ void __cpuinit xen_enable_syscall(void)
 #endif /* CONFIG_X86_64 */
 }
 
+DEFINE_PER_CPU(struct cpuidle_device, idle_devices);
+struct cpuidle_driver cpuidle_xen_driver = {
+	.name =         "cpuidle_xen",
+};
+
+void __cpuinit setup_cpuidle_xen(void)
+{
+	struct cpuidle_device *dev;
+
+	if (!cpuidle_curr_driver)
+		cpuidle_register_driver(&cpuidle_xen_driver);
+
+	dev = &per_cpu(idle_devices, smp_processor_id());
+	dev->cpu = smp_processor_id();
+	dev->states[0].enter = xen_idle;
+	dev->state_count = 1;
+	cpuidle_register_device(dev);
+}
+
 void __init xen_arch_setup(void)
 {
 	struct physdev_set_iopl set_iopl;
@@ -186,7 +206,7 @@ void __init xen_arch_setup(void)
 	       MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
 	       COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
 
-	pm_idle = xen_idle;
+	setup_cpuidle_xen();
 
 	paravirt_disable_iospace();
 

* [v5 RFC PATCH 2/7]: cpuidle: implement a list based approach to register a set of idle routines.
From: Arun R Bharadwaj @ 2009-09-22  5:36 UTC (permalink / raw)
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

Implement a list-based registration mechanism for architectures that
have multiple sets of idle routines to register.

Currently, x86 does this by merely setting pm_idle = idle_routine,
and managing this pm_idle pointer is messy.

To give an example of how this mechanism works:
On x86, the idle routine is initially selected from the set of
poll/mwait/c1e/default idle loops. The selected idle loop is registered
with cpuidle as a cpuidle device with a single idle state. Once ACPI
comes up, it registers another set of idle states on top of this one.
If a module later registers yet another set of idle loops, that set is
added to the list as well.

This provides a clean way of registering and unregistering idle state
routines.

In the current implementation, pm_idle points to the idle routine in
use and the previous idle routine has to be saved separately, so
confusion arises when a module registers or unregisters an idle routine.
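
As a rough illustration of the list behaviour described above, here is a
minimal user-space sketch, not the kernel code. The names idle_dev,
idle_list and the idle_list_* helpers are hypothetical. It assumes that
the most recently registered device on a CPU is the active one, and that
unregistering it makes the previous registration active again.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct cpuidle_device; only what the sketch needs. */
struct idle_dev {
	const char *name;
	struct idle_dev *next;	/* next older registration on this CPU */
};

/* One CPU's list; the head entry is the active set of idle states. */
struct idle_list {
	struct idle_dev *head;
};

/* Register: the newest device goes to the front and becomes active. */
void idle_list_add(struct idle_list *l, struct idle_dev *dev)
{
	dev->next = l->head;
	l->head = dev;
}

/* Unregister: unlink dev wherever it sits and stop once it is found. */
void idle_list_remove(struct idle_list *l, struct idle_dev *dev)
{
	struct idle_dev **pp;

	for (pp = &l->head; *pp; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			dev->next = NULL;
			return;
		}
	}
}

/* The device whose idle states are currently in use, or NULL. */
struct idle_dev *idle_list_active(const struct idle_list *l)
{
	return l->head;
}
```

With a default, an ACPI and a module registration added in that order,
the module's device is active; removing it makes the ACPI one active
again without anyone having to remember an "old" pointer.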


Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle.c |   50 ++++++++++++++++++++++++++++++++++++++++++----
 include/linux/cpuidle.h   |    1 
 2 files changed, 47 insertions(+), 4 deletions(-)

Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -21,6 +21,7 @@
 #include "cpuidle.h"
 
 DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
+DEFINE_PER_CPU(struct list_head, cpuidle_devices_list);
 
 DEFINE_MUTEX(cpuidle_lock);
 LIST_HEAD(cpuidle_detected_devices);
@@ -100,6 +101,44 @@ void cpuidle_resume_and_unlock(void)
 
 EXPORT_SYMBOL_GPL(cpuidle_resume_and_unlock);
 
+int cpuidle_add_to_list(struct cpuidle_device *dev)
+{
+	int ret, cpu = dev->cpu;
+	struct cpuidle_device *old_dev;
+
+	if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+		old_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+				struct cpuidle_device, percpu_list[cpu]);
+		cpuidle_remove_state_sysfs(old_dev);
+	}
+
+	list_add(&dev->percpu_list[cpu], &per_cpu(cpuidle_devices_list, cpu));
+	ret = cpuidle_add_state_sysfs(dev);
+	return ret;
+}
+
+void cpuidle_remove_from_list(struct cpuidle_device *dev)
+{
+	struct cpuidle_device *temp_dev;
+	struct list_head *pos;
+	int ret, cpu = dev->cpu;
+
+	list_for_each(pos, &per_cpu(cpuidle_devices_list, cpu)) {
+		temp_dev = container_of(pos, struct cpuidle_device,
+					percpu_list[cpu]);
+		if (dev == temp_dev) {
+			list_del(&temp_dev->percpu_list[cpu]);
+			cpuidle_remove_state_sysfs(temp_dev);
+		}
+	}
+
+	if (!list_empty(&per_cpu(cpuidle_devices_list, cpu))) {
+		temp_dev = list_first_entry(&per_cpu(cpuidle_devices_list, cpu),
+				struct cpuidle_device, percpu_list[cpu]);
+		ret = cpuidle_add_state_sysfs(temp_dev);
+	}
+}
+
 /**
  * cpuidle_enable_device - enables idle PM for a CPU
  * @dev: the CPU
@@ -124,7 +163,7 @@ int cpuidle_enable_device(struct cpuidle
 			return ret;
 	}
 
-	if ((ret = cpuidle_add_state_sysfs(dev)))
+	if ((ret = cpuidle_add_to_list(dev)))
 		return ret;
 
 	if (cpuidle_curr_governor->enable &&
@@ -145,7 +184,7 @@ int cpuidle_enable_device(struct cpuidle
 	return 0;
 
 fail_sysfs:
-	cpuidle_remove_state_sysfs(dev);
+	cpuidle_remove_from_list(dev);
 
 	return ret;
 }
@@ -171,7 +210,7 @@ void cpuidle_disable_device(struct cpuid
 	if (cpuidle_curr_governor->disable)
 		cpuidle_curr_governor->disable(dev);
 
-	cpuidle_remove_state_sysfs(dev);
+	cpuidle_remove_from_list(dev);
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -339,12 +378,15 @@ static inline void latency_notifier_init
  */
 static int __init cpuidle_init(void)
 {
-	int ret;
+	int ret, cpu;
 
 	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
 	if (ret)
 		return ret;
 
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(cpuidle_devices_list, cpu));
+
 	latency_notifier_init(&cpuidle_latency_notifier);
 
 	return 0;
Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -93,6 +93,7 @@ struct cpuidle_device {
 	struct cpuidle_state	*last_state;
 
 	struct list_head 	device_list;
+	struct list_head	percpu_list[NR_CPUS];
 	struct kobject		kobj;
 	struct completion	kobj_unregister;
 	void			*governor_data;

* [v5 RFC PATCH 1/7]: cpuidle: cleanup drivers/cpuidle/cpuidle.c
From: Arun R Bharadwaj @ 2009-09-22  5:35 UTC (permalink / raw)
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20090922053314.GA6417@linux.vnet.ibm.com>

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-09-22 11:03:14]:

This patch cleans up drivers/cpuidle/cpuidle.c.
Previously, cpuidle assumed pm_idle was the default idle loop. Break
that assumption and make it more generic. cpuidle_idle_call(), the
main idle loop of cpuidle, is to be called by architectures that have
registered with cpuidle.

Remove the routines cpuidle_install/uninstall_idle_handler() and
cpuidle_kick_cpus(), which are no longer needed.
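
One behavioural change in the diff below is that the governor is only
consulted when a device has more than one idle state. A minimal
user-space sketch of that selection step follows. The names sketch_dev,
select_fn, pick_next_state and deepest_state are hypothetical and only
model the logic, they are not kernel interfaces.

```c
/* Hypothetical, trimmed-down device: only the state count matters here. */
struct sketch_dev {
	int state_count;
};

/* Type of a governor's select() hook in this sketch. */
typedef int (*select_fn)(struct sketch_dev *dev);

/*
 * Pick the next idle state: ask the governor only when there is a real
 * choice. A device with a single state always uses state 0.
 */
int pick_next_state(struct sketch_dev *dev, select_fn governor_select)
{
	if (dev->state_count > 1)
		return governor_select(dev);
	return 0;
}

/* Example governor for the sketch: always choose the deepest state. */
int deepest_state(struct sketch_dev *dev)
{
	return dev->state_count - 1;
}
```

This matters for devices such as the single-state APM and Xen
registrations, where there is nothing for a governor to decide.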

Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle.c  |   59 ++++++---------------------------------------
 drivers/cpuidle/cpuidle.h  |    1 
 drivers/cpuidle/governor.c |    3 --
 include/linux/cpuidle.h    |    3 ++
 4 files changed, 11 insertions(+), 55 deletions(-)

Index: linux.trees.git/drivers/cpuidle/cpuidle.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.c
+++ linux.trees.git/drivers/cpuidle/cpuidle.c
@@ -24,20 +24,6 @@ DEFINE_PER_CPU(struct cpuidle_device *, 
 
 DEFINE_MUTEX(cpuidle_lock);
 LIST_HEAD(cpuidle_detected_devices);
-static void (*pm_idle_old)(void);
-
-static int enabled_devices;
-
-#if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
-static void cpuidle_kick_cpus(void)
-{
-	cpu_idle_wait();
-}
-#elif defined(CONFIG_SMP)
-# error "Arch needs cpu_idle_wait() equivalent here"
-#else /* !CONFIG_ARCH_HAS_CPU_IDLE_WAIT && !CONFIG_SMP */
-static void cpuidle_kick_cpus(void) {}
-#endif
 
 static int __cpuidle_register_device(struct cpuidle_device *dev);
 
@@ -46,7 +32,7 @@ static int __cpuidle_register_device(str
  *
  * NOTE: no locks or semaphores should be used here
  */
-static void cpuidle_idle_call(void)
+void cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __get_cpu_var(cpuidle_devices);
 	struct cpuidle_state *target_state;
@@ -54,13 +40,10 @@ static void cpuidle_idle_call(void)
 
 	/* check if the device is ready */
 	if (!dev || !dev->enabled) {
-		if (pm_idle_old)
-			pm_idle_old();
-		else
 #if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
-			default_idle();
+		default_idle();
 #else
-			local_irq_enable();
+		local_irq_enable();
 #endif
 		return;
 	}
@@ -74,7 +57,11 @@ static void cpuidle_idle_call(void)
 	hrtimer_peek_ahead_timers();
 #endif
 	/* ask the governor for the next state */
-	next_state = cpuidle_curr_governor->select(dev);
+	if (dev->state_count > 1)
+		next_state = cpuidle_curr_governor->select(dev);
+	else
+		next_state = 0;
+
 	if (need_resched())
 		return;
 	target_state = &dev->states[next_state];
@@ -94,35 +81,11 @@ static void cpuidle_idle_call(void)
 }
 
 /**
- * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
- */
-void cpuidle_install_idle_handler(void)
-{
-	if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
-		/* Make sure all changes finished before we switch to new idle */
-		smp_wmb();
-		pm_idle = cpuidle_idle_call;
-	}
-}
-
-/**
- * cpuidle_uninstall_idle_handler - uninstalls the cpuidle idle loop handler
- */
-void cpuidle_uninstall_idle_handler(void)
-{
-	if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
-		pm_idle = pm_idle_old;
-		cpuidle_kick_cpus();
-	}
-}
-
-/**
  * cpuidle_pause_and_lock - temporarily disables CPUIDLE
  */
 void cpuidle_pause_and_lock(void)
 {
 	mutex_lock(&cpuidle_lock);
-	cpuidle_uninstall_idle_handler();
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock);
@@ -132,7 +95,6 @@ EXPORT_SYMBOL_GPL(cpuidle_pause_and_lock
  */
 void cpuidle_resume_and_unlock(void)
 {
-	cpuidle_install_idle_handler();
 	mutex_unlock(&cpuidle_lock);
 }
 
@@ -180,7 +142,6 @@ int cpuidle_enable_device(struct cpuidle
 
 	dev->enabled = 1;
 
-	enabled_devices++;
 	return 0;
 
 fail_sysfs:
@@ -211,7 +172,6 @@ void cpuidle_disable_device(struct cpuid
 		cpuidle_curr_governor->disable(dev);
 
 	cpuidle_remove_state_sysfs(dev);
-	enabled_devices--;
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -303,7 +263,6 @@ int cpuidle_register_device(struct cpuid
 	}
 
 	cpuidle_enable_device(dev);
-	cpuidle_install_idle_handler();
 
 	mutex_unlock(&cpuidle_lock);
 
@@ -382,8 +341,6 @@ static int __init cpuidle_init(void)
 {
 	int ret;
 
-	pm_idle_old = pm_idle;
-
 	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
 	if (ret)
 		return ret;
Index: linux.trees.git/drivers/cpuidle/governor.c
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/governor.c
+++ linux.trees.git/drivers/cpuidle/governor.c
@@ -48,8 +48,6 @@ int cpuidle_switch_governor(struct cpuid
 	if (gov == cpuidle_curr_governor)
 		return 0;
 
-	cpuidle_uninstall_idle_handler();
-
 	if (cpuidle_curr_governor) {
 		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
 			cpuidle_disable_device(dev);
@@ -63,7 +61,6 @@ int cpuidle_switch_governor(struct cpuid
 			return -EINVAL;
 		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
 			cpuidle_enable_device(dev);
-		cpuidle_install_idle_handler();
 		printk(KERN_INFO "cpuidle: using governor %s\n", gov->name);
 	}
 
Index: linux.trees.git/include/linux/cpuidle.h
===================================================================
--- linux.trees.git.orig/include/linux/cpuidle.h
+++ linux.trees.git/include/linux/cpuidle.h
@@ -112,6 +112,9 @@ static inline int cpuidle_get_last_resid
 	return dev->last_residency;
 }
 
+extern void cpuidle_idle_call(void);
+extern struct cpuidle_driver *cpuidle_curr_driver;
+
 
 /****************************
  * CPUIDLE DRIVER INTERFACE *
Index: linux.trees.git/drivers/cpuidle/cpuidle.h
===================================================================
--- linux.trees.git.orig/drivers/cpuidle/cpuidle.h
+++ linux.trees.git/drivers/cpuidle/cpuidle.h
@@ -9,7 +9,6 @@
 
 /* For internal use only */
 extern struct cpuidle_governor *cpuidle_curr_governor;
-extern struct cpuidle_driver *cpuidle_curr_driver;
 extern struct list_head cpuidle_governors;
 extern struct list_head cpuidle_detected_devices;
 extern struct mutex cpuidle_lock;

* [v5 RFC PATCH 0/7]: cpuidle/x86/POWER (REDESIGN): Cleanup idle power management code in x86, cleanup drivers/cpuidle/cpuidle.c and introduce cpuidle to POWER.
From: Arun R Bharadwaj @ 2009-09-22  5:33 UTC (permalink / raw)
  To: Joel Schopp, Benjamin Herrenschmidt, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Vaidyanathan Srinivasan,
	Dipankar Sarma, Balbir Singh, Gautham R Shenoy, Arun Bharadwaj
  Cc: linuxppc-dev, linux-kernel

Hi,

******** This is an RFC, not for inclusion **********

This patchset introduces cpuidle infrastructure to POWER, prototyping
for pSeries, and also does a major refactoring of current x86 idle
power management and a cleanup of cpuidle infrastructure.

My earlier iterations can be found at:

v4 --> http://lkml.org/lkml/2009/9/1/133
v3 --> http://lkml.org/lkml/2009/8/27/124
v2 --> http://lkml.org/lkml/2009/8/26/233
v1 --> http://lkml.org/lkml/2009/8/19/150

Major Changes in this iteration:
------------------------------------------

* Refactoring x86 idle power management code
        Remove all instances of pm_idle and make cpuidle_idle_call
        _the_ idle routine in x86. cpuidle_idle_call will therefore be
        called from the main idle loop, cpu_idle, instead of through
        the function pointer pm_idle. pm_idle was also used by the apm
        module and xen. Change those users so that they register with
        cpuidle instead.

* Cleanup drivers/cpuidle/cpuidle.c
        Currently, the cpuidle framework has a weakness: an exported
        pm_idle function pointer is manipulated by various subsystems.
        The proposed framework has a registration mechanism to cleanly
        add and remove idle routines from different subsystems.

* Implement cpuidle for pSeries
        Implement the processor_idle module for pseries, which
        registers idle loops with cpuidle. Also clean up
        arch/powerpc/platforms/pseries/setup.c and remove the
        redundant pseries_dedicated/shared_idle_sleep, which is now
        implemented in processor_idle.c.
        Also, remove all instances of ppc_md.power_save, for the same
        reason as that given for pm_idle.

TODO:
---------------------------------------------

* Currently, the list based approach that I'm using here does not
  work cleanly on a few x86 platforms which have multiple sleep
  states, leading to kernel panics. I am working on resolving that.

* ppc_md.power_save has been replaced by cpuidle_idle_call only for
  pseries. So this needs to be done for all POWER platforms so that
  ppc_md.power_save is completely removed.

Patches included in this series:
---------------------------------------------

1/7 - cleanup drivers/cpuidle/cpuidle.c
2/7 - implement a list based approach to register a set of idle
      routines.
3/7 - refactor x86 idle power management code and remove all instances
      of pm_idle.
4/7 - enable cpuidle for POWER.
5/7 - remove dedicated/shared idle loops, which will be moved to
      arch/powerpc/platforms/pseries/processor_idle.c
6/7 - add a default_idle idle loop for POWER.
7/7 - implement pSeries processor idle module.

Any comments on the design are welcome.

--arun

* Re: [PATCH] [V2] USB: Add support for Xilinx USB host controller
From: Grant Likely @ 2009-09-22  5:11 UTC (permalink / raw)
  To: Julie Zhu; +Cc: greg, linux-usb, linuxppc-dev, john.linn
In-Reply-To: <20090921220815.D63D854004B@mail96-sin.bigfish.com>

On Mon, Sep 21, 2009 at 3:08 PM, Julie Zhu <julie.zhu@xilinx.com> wrote:
> Add bus glue driver for Xilinx USB host controller. The controller can be
> configured as HS only or HS/FS hybrid. The driver uses the device tree file
> to configure the driver according to the setting in the hardware system.
>
> This driver has been tested with usbtest using the NET2280 PCI card.
>
> Signed-off-by: Julie Zhu <julie.zhu@xilinx.com>
> Signed-off-by: John Linn <john.linn@xilinx.com>

Acked-by: Grant Likely <grant.likely@secretlab.ca>

-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.

* Re: High load average  but low cpu (xenomai can be the explanation?)
From: Brad Boyer @ 2009-09-22  4:10 UTC (permalink / raw)
  To: dibacco@libero.it; +Cc: linuxppc-dev
In-Reply-To: <17851044.2911671253545985517.JavaMail.root@wmail27>

On Mon, Sep 21, 2009 at 05:13:05PM +0200, dibacco@libero.it wrote:
> I have an MPC880 @133MHz. If I look into the load (with uptime) I get 
> values around 3.0 but my CPU is always under 5 percent (top). How could I 
> explain this? I'm using linux 2.6.19 with xenomai but no xenomai application is 
> running at all. I have a cramfs on a nor flash. What could be the problem? If I 
> kill the process I developed the average load goes down.

Does your system have any kernel modules that are used as part of the
application you are creating? I've seen this happen with custom written
kernel modules that sleep in an uninterruptible wait during a call
from a program or inside a kernel thread. This shows up as 'D' state
in your ps output. Any thread in state 'D' counts against the load
average but doesn't show in CPU usage if it is actually sleeping.

I don't think this is the most likely explanation since you said you
can kill your process, but I thought I should mention it.
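
If it helps to check this without relying on ps, the state letter can
be read from /proc/<pid>/stat. Below is a small sketch of a parser for
one such line. The helper name task_state_from_stat is hypothetical,
and the sketch assumes the usual "pid (comm) state ..." layout of that
file.

```c
#include <string.h>

/*
 * Return the state letter from the contents of /proc/<pid>/stat, or '?'
 * if the line is malformed. The command name is wrapped in parentheses
 * and may itself contain spaces or ')' characters, so search for the
 * last ')' and take the field after it.
 */
char task_state_from_stat(const char *stat_line)
{
	const char *p = strrchr(stat_line, ')');

	if (!p || p[1] != ' ' || p[2] == '\0')
		return '?';
	return p[2];
}
```

Counting the tasks for which this returns 'D' gives the number of
uninterruptible sleepers that add to the load average.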

	Brad Boyer
	flar@allandria.com

* [PATCH] powerpc/perf_counter: Enable SDAR in continous sample mode
From: Anton Blanchard @ 2009-09-22  3:01 UTC (permalink / raw)
  To: paulus, benh, a.p.zijlstra, mingo; +Cc: linuxppc-dev


In continuous sampling mode we want the SDAR to update. While we can
select between dcache misses and erat misses, a decent default is to
enable both.
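
For reference, a small stand-alone sketch of the default value this
produces. The bit positions are the ones defined in the patch below.
The function name sdar_default_mmcra is hypothetical, and 1UL is used
here rather than the plain 1 in the patch so the shift is done in
unsigned long.

```c
/* Bit positions taken from the patch (shift counts from the low bit). */
#define MMCRA_SDAR_DCACHE_MISS	30
#define MMCRA_SDAR_ERAT_MISS	29

/* Default MMCRA value: update the SDAR on both dcache and ERAT misses. */
unsigned long sdar_default_mmcra(void)
{
	return (1UL << MMCRA_SDAR_DCACHE_MISS) |
	       (1UL << MMCRA_SDAR_ERAT_MISS);
}
```

The result is 0x60000000, which each compute_mmcr routine then uses as
its starting MMCRA value instead of 0.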

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/power7-pmu.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/power7-pmu.c	2009-07-22 08:41:50.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/power7-pmu.c	2009-07-22 09:22:54.000000000 +1000
@@ -54,6 +54,10 @@
  * Bits in MMCRA
  */
 
+/* These bits control when the SDAR updates when in continuous sampling mode */
+#define MMCRA_SDAR_DCACHE_MISS	30
+#define MMCRA_SDAR_ERAT_MISS	29
+
 /*
  * Layout of constraint bits:
  * 6666555555555544444444443333333333222222222211111111110000000000
@@ -230,7 +234,8 @@ static int power7_compute_mmcr(u64 event
 			       unsigned int hwc[], unsigned long mmcr[])
 {
 	unsigned long mmcr1 = 0;
-	unsigned long mmcra = 0;
+	unsigned long mmcra = (1 << MMCRA_SDAR_DCACHE_MISS) |
+				(1 << MMCRA_SDAR_ERAT_MISS);
 	unsigned int pmc, unit, combine, l2sel, psel;
 	unsigned int pmc_inuse = 0;
 	int i;
Index: linux.trees.git/arch/powerpc/kernel/power5-pmu.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/power5-pmu.c	2009-07-22 09:07:17.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/power5-pmu.c	2009-07-22 09:22:52.000000000 +1000
@@ -76,6 +76,10 @@
  * Bits in MMCRA
  */
 
+/* These bits control when the SDAR updates when in continuous sampling mode */
+#define MMCRA_SDAR_DCACHE_MISS	30
+#define MMCRA_SDAR_ERAT_MISS	29
+
 /*
  * Layout of constraint bits:
  * 6666555555555544444444443333333333222222222211111111110000000000
@@ -390,7 +394,8 @@ static int power5_compute_mmcr(u64 event
 			       unsigned int hwc[], unsigned long mmcr[])
 {
 	unsigned long mmcr1 = 0;
-	unsigned long mmcra = 0;
+	unsigned long mmcra = (1 << MMCRA_SDAR_DCACHE_MISS) |
+				(1 << MMCRA_SDAR_ERAT_MISS);
 	unsigned int pmc, unit, byte, psel;
 	unsigned int ttm, grp;
 	int i, isbus, bit, grsel;
Index: linux.trees.git/arch/powerpc/kernel/power6-pmu.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/power6-pmu.c	2009-07-22 09:07:14.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/power6-pmu.c	2009-07-22 09:22:45.000000000 +1000
@@ -51,6 +51,14 @@
 #define MMCR1_PMCSEL_MSK	0xff
 
 /*
+ * Bits in MMCRA
+ */
+
+/* These bits control when the SDAR updates when in continuous sampling mode */
+#define MMCRA_SDAR_DCACHE_MISS	30
+#define MMCRA_SDAR_ERAT_MISS	29
+
+/*
  * Map of which direct events on which PMCs are marked instruction events.
  * Indexed by PMCSEL value >> 1.
  * Bottom 4 bits are a map of which PMCs are interesting,
@@ -178,7 +186,8 @@ static int p6_compute_mmcr(u64 event[], 
 			   unsigned int hwc[], unsigned long mmcr[])
 {
 	unsigned long mmcr1 = 0;
-	unsigned long mmcra = 0;
+	unsigned long mmcra = (1 << MMCRA_SDAR_DCACHE_MISS) |
+				(1 << MMCRA_SDAR_ERAT_MISS);
 	int i;
 	unsigned int pmc, ev, b, u, s, psel;
 	unsigned int ttmset = 0;

* [PATCH] powerpc/perf_counter: Fix vdso detection
From: Anton Blanchard @ 2009-09-22  2:57 UTC (permalink / raw)
  To: paulus, benh, a.p.zijlstra, mingo; +Cc: linuxppc-dev


perf_counter uses arch_vma_name() to detect a vdso region which in turn uses
current->mm->context.vdso_base. We need to initialise this before doing
the mmap or else we fail to detect the vdso.
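
The ordering issue can be modelled in user space roughly as follows.
Everything named sketch_* is a hypothetical stand-in: sketch_is_vdso()
models the arch_vma_name() check, and sketch_install_mapping() models a
mapping call whose tracking hook runs before it returns. This is only
an illustration of why the base has to be set first, not the kernel
code.

```c
/* Hypothetical stand-ins for the mm context and the mapping call. */
struct sketch_mm {
	unsigned long vdso_base;
	int seen_as_vdso;	/* what the mmap tracking hook concluded */
};

/* Models arch_vma_name(): a mapping is the vDSO if it starts at vdso_base. */
int sketch_is_vdso(const struct sketch_mm *mm, unsigned long start)
{
	return mm->vdso_base && start == mm->vdso_base;
}

/* Models the mapping call: the tracking hook runs inside it. */
int sketch_install_mapping(struct sketch_mm *mm, unsigned long start, int fail)
{
	if (fail)
		return -1;
	mm->seen_as_vdso = sketch_is_vdso(mm, start);
	return 0;
}

/* Set the base first, and clear it again if the mapping fails. */
int sketch_setup_vdso(struct sketch_mm *mm, unsigned long base, int fail)
{
	int rc;

	mm->vdso_base = base;
	rc = sketch_install_mapping(mm, base, fail);
	if (rc)
		mm->vdso_base = 0;
	return rc;
}
```

If vdso_base were assigned only after the mapping call, the hook would
compare against 0 and never recognise the region.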

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/vdso.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/vdso.c	2009-07-15 10:00:41.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/vdso.c	2009-07-15 10:01:40.000000000 +1000
@@ -240,6 +240,13 @@
 	}
 
 	/*
+	 * Put vDSO base into mm struct. We need to do this before calling
+	 * install_special_mapping or the perf counter mmap tracking code
+	 * will fail to recognise it as a vDSO (since arch_vma_name fails).
+	 */
+	current->mm->context.vdso_base = vdso_base;
+
+	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
 	 * allowed to write those pages.
 	 * gdb can break that with ptrace interface, and thus trigger COW on
@@ -259,11 +266,10 @@
 				     VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
 				     VM_ALWAYSDUMP,
 				     vdso_pagelist);
-	if (rc)
+	if (rc) {
+		current->mm->context.vdso_base = 0;
 		goto fail_mmapsem;
-
-	/* Put vDSO base into mm struct */
-	current->mm->context.vdso_base = vdso_base;
+	}
 
 	up_write(&mm->mmap_sem);
 	return 0;

* [PATCH] powerpc/perf_counter: Log invalid data addresses as all 1s
From: Anton Blanchard @ 2009-09-22  2:56 UTC (permalink / raw)
  To: paulus, benh, a.p.zijlstra, mingo; +Cc: linuxppc-dev


When we take an exception and the SDAR isn't synchronised we currently
log 0 as the address. Unfortunately this is a pretty common value, so
use ~0ULL instead.
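
A tiny sketch of the idea, with hypothetical names (SKETCH_ADDR_INVALID
and sketch_sample_addr are not kernel symbols): an all-ones sentinel
keeps a genuine address of 0 distinguishable from "no valid address".

```c
#include <stdint.h>

/* Sentinel meaning "no valid data address was captured". */
#define SKETCH_ADDR_INVALID	(~0ULL)

/* Report the sampled address only when the SDAR contents can be trusted. */
uint64_t sketch_sample_addr(int sdar_valid, uint64_t sdar)
{
	return sdar_valid ? sdar : SKETCH_ADDR_INVALID;
}
```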

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/arch/powerpc/kernel/perf_event.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/perf_event.c	2009-09-22 11:54:32.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/perf_event.c	2009-09-22 11:59:33.000000000 +1000
@@ -1162,7 +1162,7 @@ static void record_and_restart(struct pe
 	 */
 	if (record) {
 		struct perf_sample_data data = {
-			.addr	= 0,
+			.addr	= ~0ULL,
 			.period	= event->hw.last_period,
 		};
 

* powerpc: Move 64bit heap above 1TB on machines with 1TB segments
From: Anton Blanchard @ 2009-09-22  2:52 UTC (permalink / raw)
  To: benh, MELGOR; +Cc: linuxppc-dev


If we are using 1TB segments and we are allowed to randomise the heap, we can
put it above 1TB so it is backed by a 1TB segment. Otherwise the heap will be
in the bottom 1TB which always uses 256MB segments and this may result in a
performance penalty.
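
The base selection can be sketched in isolation like this. It leaves
out the random offset and page alignment that the real function applies
afterwards. The names SKETCH_1TB and sketch_brk_base are hypothetical,
and the sketch assumes the 1TB boundary is 2^40 bytes, which the patch
expresses as 1UL << SID_SHIFT_1T.

```c
#include <stdint.h>

/* 1TB = 2^40 bytes. */
#define SKETCH_1TB	(UINT64_C(1) << 40)

/*
 * Choose where heap randomisation starts from: at or above 1TB for a
 * 64-bit task with 1TB segments, otherwise at the current brk.
 */
uint64_t sketch_brk_base(uint64_t brk, int is_64bit_task, int has_1t_segments)
{
	if (is_64bit_task && has_1t_segments && brk < SKETCH_1TB)
		return SKETCH_1TB;
	return brk;
}
```

A brk that is already above 1TB is left alone, which is what the
max_t() in the patch achieves.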

This functionality is disabled when heap randomisation is turned off:

echo 1 > /proc/sys/kernel/randomize_va_space

which may be useful when trying to allocate the maximum amount of 16M or 16G
pages.

On a microbenchmark that repeatedly touches 32GB of memory with a stride of
256MB + 4kB (designed to stress 256MB segments while still mapping nicely into
the L1 cache), we see the improvement:

Force malloc to use heap all the time:
# export MALLOC_MMAP_MAX_=0 MALLOC_TRIM_THRESHOLD_=-1

Disable heap randomization:
# echo 1 > /proc/sys/kernel/randomize_va_space
# time ./test 
12.51s

Enable heap randomization:
# echo 2 > /proc/sys/kernel/randomize_va_space
# time ./test 
1.70s

Signed-off-by: Anton Blanchard <anton@samba.org>
---

I've cc-ed Mel on this one. As you can see it definitely helps the base
page size performance, but I'm a bit worried about the impact of taking away
another of our 1TB slices.

Index: linux.trees.git/arch/powerpc/kernel/process.c
===================================================================
--- linux.trees.git.orig/arch/powerpc/kernel/process.c	2009-09-17 15:47:46.000000000 +1000
+++ linux.trees.git/arch/powerpc/kernel/process.c	2009-09-17 15:49:11.000000000 +1000
@@ -1165,7 +1165,22 @@ static inline unsigned long brk_rnd(void
 
 unsigned long arch_randomize_brk(struct mm_struct *mm)
 {
-	unsigned long ret = PAGE_ALIGN(mm->brk + brk_rnd());
+	unsigned long base = mm->brk;
+	unsigned long ret;
+
+#ifdef CONFIG_PPC64
+	/*
+	 * If we are using 1TB segments and we are allowed to randomise
+	 * the heap, we can put it above 1TB so it is backed by a 1TB
+	 * segment. Otherwise the heap will be in the bottom 1TB
+	 * which always uses 256MB segments and this may result in a
+	 * performance penalty.
+	 */
+	if (!is_32bit_task() && (mmu_highuser_ssize == MMU_SEGSIZE_1T))
+		base = max_t(unsigned long, mm->brk, 1UL << SID_SHIFT_1T);
+#endif
+
+	ret = PAGE_ALIGN(base + brk_rnd());
 
 	if (ret < mm->brk)
 		return mm->brk;

* Re: [PATCH] perf_event, powerpc: Fix compilation after big perf_counter rename
From: Benjamin Herrenschmidt @ 2009-09-22  1:56 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Ingo Molnar,
	Linus Torvalds, David S. Miller
In-Reply-To: <19128.4280.813369.589704@cargo.ozlabs.ibm.com>

On Tue, 2009-09-22 at 09:48 +1000, Paul Mackerras wrote:
> This fixes two places in the powerpc perf_event (perf_counter) code
> where 'list_entry' needs to be changed to 'group_entry', but were
> missed in commit 65abc865 ("perf_counter: Rename list_entry ->
> group_entry, counter_list -> group_list").

Ingo: This is becoming a recurring one now... the powerpc build upstream
is broken approximately every day by some new perfctr build breakage.

You really aren't build testing architectures other than x86, right?

Ben.

> This also changes 'event' back to 'counter' in a couple of contexts:
> 
> * Field and function names that deal with the limited-function
>   counters: it's really the hardware counters whose function is
>   limited, not the events that they count.  Hence:
> 
>   MAX_LIMITED_HWEVENTS -> MAX_LIMITED_HWCOUNTERS
>   limited_event -> limited_counter
>   freeze/thaw_limited_events -> freeze/thaw_limited_counters
> 
> * The machine-specific PMU description struct (struct power_pmu): this
>   renames 'n_event' back to 'n_counter' since it really describes how
>   many hardware counters the machine has.  (Renaming this back avoids
>   a compile error in each of the machine-specific PMU back-ends where
>   they initialize their power_pmu struct.)
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  arch/powerpc/include/asm/perf_event.h |    4 +--
>  arch/powerpc/kernel/perf_event.c      |   38 +++++++++++++++++-----------------
>  2 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/perf_event.h b/arch/powerpc/include/asm/perf_event.h
> index 2499aaa..3288ce3 100644
> --- a/arch/powerpc/include/asm/perf_event.h
> +++ b/arch/powerpc/include/asm/perf_event.h
> @@ -14,7 +14,7 @@
>  
>  #define MAX_HWEVENTS		8
>  #define MAX_EVENT_ALTERNATIVES	8
> -#define MAX_LIMITED_HWEVENTS	2
> +#define MAX_LIMITED_HWCOUNTERS	2
>  
>  /*
>   * This struct provides the constants and functions needed to
> @@ -22,7 +22,7 @@
>   */
>  struct power_pmu {
>  	const char	*name;
> -	int		n_event;
> +	int		n_counter;
>  	int		max_alternatives;
>  	unsigned long	add_fields;
>  	unsigned long	test_adder;
> diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
> index 197b7d9..bbcbae1 100644
> --- a/arch/powerpc/kernel/perf_event.c
> +++ b/arch/powerpc/kernel/perf_event.c
> @@ -30,8 +30,8 @@ struct cpu_hw_events {
>  	u64 events[MAX_HWEVENTS];
>  	unsigned int flags[MAX_HWEVENTS];
>  	unsigned long mmcr[3];
> -	struct perf_event *limited_event[MAX_LIMITED_HWEVENTS];
> -	u8  limited_hwidx[MAX_LIMITED_HWEVENTS];
> +	struct perf_event *limited_counter[MAX_LIMITED_HWCOUNTERS];
> +	u8  limited_hwidx[MAX_LIMITED_HWCOUNTERS];
>  	u64 alternatives[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
>  	unsigned long amasks[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
>  	unsigned long avalues[MAX_HWEVENTS][MAX_EVENT_ALTERNATIVES];
> @@ -253,7 +253,7 @@ static int power_check_constraints(struct cpu_hw_events *cpuhw,
>  	unsigned long addf = ppmu->add_fields;
>  	unsigned long tadd = ppmu->test_adder;
>  
> -	if (n_ev > ppmu->n_event)
> +	if (n_ev > ppmu->n_counter)
>  		return -1;
>  
>  	/* First see if the events will go on as-is */
> @@ -426,7 +426,7 @@ static int is_limited_pmc(int pmcnum)
>  		&& (pmcnum == 5 || pmcnum == 6);
>  }
>  
> -static void freeze_limited_events(struct cpu_hw_events *cpuhw,
> +static void freeze_limited_counters(struct cpu_hw_events *cpuhw,
>  				    unsigned long pmc5, unsigned long pmc6)
>  {
>  	struct perf_event *event;
> @@ -434,7 +434,7 @@ static void freeze_limited_events(struct cpu_hw_events *cpuhw,
>  	int i;
>  
>  	for (i = 0; i < cpuhw->n_limited; ++i) {
> -		event = cpuhw->limited_event[i];
> +		event = cpuhw->limited_counter[i];
>  		if (!event->hw.idx)
>  			continue;
>  		val = (event->hw.idx == 5) ? pmc5 : pmc6;
> @@ -445,7 +445,7 @@ static void freeze_limited_events(struct cpu_hw_events *cpuhw,
>  	}
>  }
>  
> -static void thaw_limited_events(struct cpu_hw_events *cpuhw,
> +static void thaw_limited_counters(struct cpu_hw_events *cpuhw,
>  				  unsigned long pmc5, unsigned long pmc6)
>  {
>  	struct perf_event *event;
> @@ -453,7 +453,7 @@ static void thaw_limited_events(struct cpu_hw_events *cpuhw,
>  	int i;
>  
>  	for (i = 0; i < cpuhw->n_limited; ++i) {
> -		event = cpuhw->limited_event[i];
> +		event = cpuhw->limited_counter[i];
>  		event->hw.idx = cpuhw->limited_hwidx[i];
>  		val = (event->hw.idx == 5) ? pmc5 : pmc6;
>  		atomic64_set(&event->hw.prev_count, val);
> @@ -495,9 +495,9 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, unsigned long mmcr0)
>  		       "i" (SPRN_PMC5), "i" (SPRN_PMC6));
>  
>  	if (mmcr0 & MMCR0_FC)
> -		freeze_limited_events(cpuhw, pmc5, pmc6);
> +		freeze_limited_counters(cpuhw, pmc5, pmc6);
>  	else
> -		thaw_limited_events(cpuhw, pmc5, pmc6);
> +		thaw_limited_counters(cpuhw, pmc5, pmc6);
>  
>  	/*
>  	 * Write the full MMCR0 including the event overflow interrupt
> @@ -653,7 +653,7 @@ void hw_perf_enable(void)
>  			continue;
>  		idx = hwc_index[i] + 1;
>  		if (is_limited_pmc(idx)) {
> -			cpuhw->limited_event[n_lim] = event;
> +			cpuhw->limited_counter[n_lim] = event;
>  			cpuhw->limited_hwidx[n_lim] = idx;
>  			++n_lim;
>  			continue;
> @@ -702,7 +702,7 @@ static int collect_events(struct perf_event *group, int max_count,
>  		flags[n] = group->hw.event_base;
>  		events[n++] = group->hw.config;
>  	}
> -	list_for_each_entry(event, &group->sibling_list, list_entry) {
> +	list_for_each_entry(event, &group->sibling_list, group_entry) {
>  		if (!is_software_event(event) &&
>  		    event->state != PERF_EVENT_STATE_OFF) {
>  			if (n >= max_count)
> @@ -742,7 +742,7 @@ int hw_perf_group_sched_in(struct perf_event *group_leader,
>  		return 0;
>  	cpuhw = &__get_cpu_var(cpu_hw_events);
>  	n0 = cpuhw->n_events;
> -	n = collect_events(group_leader, ppmu->n_event - n0,
> +	n = collect_events(group_leader, ppmu->n_counter - n0,
>  			   &cpuhw->event[n0], &cpuhw->events[n0],
>  			   &cpuhw->flags[n0]);
>  	if (n < 0)
> @@ -764,7 +764,7 @@ int hw_perf_group_sched_in(struct perf_event *group_leader,
>  	cpuctx->active_oncpu += n;
>  	n = 1;
>  	event_sched_in(group_leader, cpu);
> -	list_for_each_entry(sub, &group_leader->sibling_list, list_entry) {
> +	list_for_each_entry(sub, &group_leader->sibling_list, group_entry) {
>  		if (sub->state != PERF_EVENT_STATE_OFF) {
>  			event_sched_in(sub, cpu);
>  			++n;
> @@ -797,7 +797,7 @@ static int power_pmu_enable(struct perf_event *event)
>  	 */
>  	cpuhw = &__get_cpu_var(cpu_hw_events);
>  	n0 = cpuhw->n_events;
> -	if (n0 >= ppmu->n_event)
> +	if (n0 >= ppmu->n_counter)
>  		goto out;
>  	cpuhw->event[n0] = event;
>  	cpuhw->events[n0] = event->hw.config;
> @@ -848,11 +848,11 @@ static void power_pmu_disable(struct perf_event *event)
>  		}
>  	}
>  	for (i = 0; i < cpuhw->n_limited; ++i)
> -		if (event == cpuhw->limited_event[i])
> +		if (event == cpuhw->limited_counter[i])
>  			break;
>  	if (i < cpuhw->n_limited) {
>  		while (++i < cpuhw->n_limited) {
> -			cpuhw->limited_event[i-1] = cpuhw->limited_event[i];
> +			cpuhw->limited_counter[i-1] = cpuhw->limited_counter[i];
>  			cpuhw->limited_hwidx[i-1] = cpuhw->limited_hwidx[i];
>  		}
>  		--cpuhw->n_limited;
> @@ -1078,7 +1078,7 @@ const struct pmu *hw_perf_event_init(struct perf_event *event)
>  	 */
>  	n = 0;
>  	if (event->group_leader != event) {
> -		n = collect_events(event->group_leader, ppmu->n_event - 1,
> +		n = collect_events(event->group_leader, ppmu->n_counter - 1,
>  				   ctrs, events, cflags);
>  		if (n < 0)
>  			return ERR_PTR(-EINVAL);
> @@ -1230,7 +1230,7 @@ static void perf_event_interrupt(struct pt_regs *regs)
>  	int nmi;
>  
>  	if (cpuhw->n_limited)
> -		freeze_limited_events(cpuhw, mfspr(SPRN_PMC5),
> +		freeze_limited_counters(cpuhw, mfspr(SPRN_PMC5),
>  					mfspr(SPRN_PMC6));
>  
>  	perf_read_regs(regs);
> @@ -1260,7 +1260,7 @@ static void perf_event_interrupt(struct pt_regs *regs)
>  	 * Any that we processed in the previous loop will not be negative.
>  	 */
>  	if (!found) {
> -		for (i = 0; i < ppmu->n_event; ++i) {
> +		for (i = 0; i < ppmu->n_counter; ++i) {
>  			if (is_limited_pmc(i + 1))
>  				continue;
>  			val = read_pmc(i + 1);
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  1:09 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	linuxppc-dev, bhutchings, prodyut hazarika, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE81A@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:53 -0700, Prodyut Hazarika wrote:
> 
> In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
> both on Tx and Rx per channel (physical not virtual), which can be
> enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
> defined in the dts file. But in the older revs, there is only a global
> EOP_Int_Enable in the MAL configuration register. There can be a
> possible way even for older SoCs if we use the MAL descriptor I bit and
> have to go and set/clear the I bit in whole of MAL descriptor ring for
> that channel. That might be really inefficient.
> 
> What would you suggest?

I wouldn't bother with the old SoCs, we should keep the current
workaround we have today for them. For the new ones, I'll have a look
and see how we can get the driver upgraded to avoid the workaround.

Don't bother with this for now. I'll dig at some stage.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-22  0:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, prodyut hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <1253579943.7103.194.camel@pasglop>

Hi Ben,

> Well... the above is a HW limitation :-) IE. I was suggesting you fix
> the HW, but in the case where you already did and the current MAL in
> your SoC can indeed mask the interrupt per-channel, then that's great
> and we should definitely look into having the driver go back to a more
> standard NAPI model on MALs that have that capability.

In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
both on Tx and Rx per channel (physical not virtual), which can be
enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
defined in the dts file. But in the older revs, there is only a global
EOP_Int_Enable in the MAL configuration register. There can be a
possible way even for older SoCs if we use the MAL descriptor I bit and
not the global EOP_Int_Enable. But to turn on/off the channel, we will
have to go and set/clear the I bit in whole of MAL descriptor ring for
that channel. That might be really inefficient.
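
A rough sketch of what the per-descriptor approach would cost, assuming a
hypothetical descriptor layout and a hypothetical position for the I bit
(neither is taken from the real MAL headers). Every mask or unmask has to
touch each descriptor in the channel's ring:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DESC_I_BIT 0x0001u	/* hypothetical bit position */

/* Hypothetical descriptor layout, for illustration only. */
struct sketch_desc {
	uint16_t ctrl;
	uint16_t len;
	uint32_t data_ptr;
};

/* Set or clear the I bit in every descriptor of one channel's ring.
 * The cost is O(ring size) per enable/disable, which is the
 * inefficiency described above. Other control bits are preserved. */
static void sketch_ring_set_irq(struct sketch_desc *ring, size_t n, int enable)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (enable)
			ring[i].ctrl |= SKETCH_DESC_I_BIT;
		else
			ring[i].ctrl &= (uint16_t)~SKETCH_DESC_I_BIT;
	}
}
```

A per-channel mask in hardware would reduce this to a single register
write.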

What would you suggest?

Thanks
Prodyut

^ permalink raw reply

* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:39 UTC (permalink / raw)
  To: prodyut hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <49c0ff980909211728s2d39e356p6900d047c6918826@mail.gmail.com>

On Mon, 2009-09-21 at 17:28 -0700, prodyut hazarika wrote:
> > BTW. If you guys are ever going to do another change to MAL, please
> > please please, add the -one- major missing feature that's causing all the
> > pain and complication in the current design: Add a per-channel interrupt
> > masking option.
> >
> > The lack of ability to mask the interrupt per MAL channel is what forces
> > us to create that fake netdev structure in order to share the napi
> > device instance between all the EMACs in the system. This is very
> > inefficient too. We would be able to make things run a lot smoother if
> > we could just have a napi instance per EMAC, but for that, we need
> > per-channel interrupt masking.
> >
> 
> I will add a patch for the above as soon as I am done incorporating
> your comments on the MAL coalescing support.
> 
Well... the above is a HW limitation :-) IE. I was suggesting you fix
the HW, but in the case where you already did and the current MAL in
your SoC can indeed mask the interrupt per-channel, then that's great
and we should definitely look into having the driver go back to a more
standard NAPI model on MALs that have that capability.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: prodyut hazarika @ 2009-09-22  0:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <1253578361.7103.180.camel@pasglop>

Hi Ben,

>
> BTW. If you guys are ever going to do another change to MAL, please
> please please, add the -one- major missing feature that's causing all the
> pain and complication in the current design: Add a per-channel interrupt
> masking option.
>
> The lack of ability to mask the interrupt per MAL channel is what forces
> us to create that fake netdev structure in order to share the napi
> device instance between all the EMACs in the system. This is very
> inefficient too. We would be able to make things run a lot smoother if
> we could just have a napi instance per EMAC, but for that, we need
> per-channel interrupt masking.
>

I will add a patch for the above as soon as I am done incorporating
your comments on the MAL coalescing support.

Thanks
Prodyut

^ permalink raw reply

* Re: [LTP] mmapstress03 weirdness? (fwd)
From: Benjamin Herrenschmidt @ 2009-09-22  0:19 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux/PPC Development, Linux Kernel Development,
	Linux Test Project
In-Reply-To: <alpine.LRH.2.00.0909211539520.16077@vixen.sonytel.be>

On Mon, 2009-09-21 at 15:40 +0200, Geert Uytterhoeven wrote:

> 
> With 32-bit userland, this boils down to:
> 
> | mmap addr 0x7fff0000 size 0x7fff0000
> | mmap returned 0x7fff0000
> 
> i.e. mmap() succeeds, but (1) the test expects it to fail, so the test returns
> TFAIL, but (2) ltp-pan still reports that the tests passed?

What is the output of /proc/<pid>/maps after that mmap ?

With a 64-bit kernel, 32-bit userspace has access to the entire 4G
address space, so mapping 2G-64k at the 2G-64k point can work, provided
you aren't overlapping an existing mapping such as the stack.
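
The address arithmetic can be checked with a small sketch. The helper name
and the limits passed to it are assumptions for illustration: 4G for 32-bit
userspace, and 2^44 as a stand-in for "something like 16T" on 64-bit. It
only tests whether the range fits below the limit, not whether it collides
with an existing mapping such as the stack:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if [addr, addr + len) lies entirely below limit.
 * Compares against limit - addr so that addr + len cannot overflow. */
static bool range_fits(uint64_t addr, uint64_t len, uint64_t limit)
{
	if (addr > limit)
		return false;
	return len <= limit - addr;
}
```

With addr = size = 0x7fff0000 the range ends at 0xfffe0000, which is below
4G, so range_fits(0x7fff0000, 0x7fff0000, 1ULL << 32) is true. The 64-bit
values from the test start far above a 2^44 limit, so the same check is
false there.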

> In addition, sometimes mmapstress03 fails due to SEGV. I created a small test
> program that just does the above mmap(), and depending on the distro and what
> else I print later it crashes with a SEGV, too. Probably this happens because
> the mmap() did succeed, and corrupted some existing mappings, cfr. the notes
> for MAP_FIXED:

That's possible.

>        MAP_FIXED
>               Don’t  interpret  addr  as  a hint: place the mapping at exactly
>               that address.  addr must be a multiple of the page size.  If the
>               memory  region  specified  by addr and len overlaps pages of any
>               existing mapping(s), then the overlapped part  of  the  existing
>               mapping(s)  will  be discarded.  If the specified address cannot
>               be used, mmap() will fail.  Because requiring  a  fixed  address
>               for  a  mapping is less portable, the use of this option is dis‐
>               couraged.

Yeah, I suppose the test might be wiping out its own stack, for example.

IE. I think that test is just bogus :-)

> JFYI, with 64-bit userland, this boils down to:
> 
> | mmap addr 0x7fffffffffff0000 size 0x7fffffffffff0000
> | mmap returned 0xffffffffffffffff
> 
> i.e. mmap() fails as expected, and the test succeeds.

Right, because with 64-bit userspace you are only allowed something like
16T of address space.

> Does all of this sound OK?
> Thanks for your comments!

Yes, I think so far, it's just bogus tests :-)

Cheers,
Ben.

> With kind regards,
> 
> Geert Uytterhoeven
> Software Architect
> Techsoft Centre
> 

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:12 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE802@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:05 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks again for your comments.
> 
> > Same goes with the SDR register definitions. Prefix them with the SOC
> > name but don't make them conditionally compiled.
> 
> I will add the base address in the Device tree, and make all register
> definitions based on offset from the base in the next version of this
> patch.

That's a good idea. In fact, you can also use the dcr_read/write
variants of the accessors rather than the low level mfdcri/mtdcri. This
wouldn't make much of a difference unless you ever release a SoC with
those same registers behind an MMIO mapping but it's cleaner.

> Thanks for this comment. I will hookup ethtool with the EMAC driver, but
> the MAL driver will come up with default coalesce options (as defined in
> the appropriate defconfig file). The user will be able to change these
> parameters as needed using ethtool.

That's ok. I don't have an objection in using Kconfig to set the
defaults.

> I will get all the changes in place in the next version of this patch.

Thanks !

BTW. If you guys are ever going to do another change to MAL, please
please please, add the -one- major missing feature that's causing all the
pain and complication in the current design: Add a per-channel interrupt
masking option.

The lack of ability to mask the interrupt per MAL channel is what forces
us to create that fake netdev structure in order to share the napi
device instance between all the EMACs in the system. This is very
inefficient too. We would be able to make things run a lot smoother if
we could just have a napi instance per EMAC, but for that, we need
per-channel interrupt masking.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:07 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE7FF@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 16:49 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks for your comments.
> 
> 
> > What happens if we build a kernel that is supposed to boot with two
> > different variants of 405 or 440 ?
> 
> We cannot build a kernel with H/W Interrupt coalescing other than in
> 405EX/460EX/GT.
> This is controlled via KConfig (config IBM_NEW_EMAC_INTR_COALESCE
> depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX))
> Is this approach acceptable (via Kconfig)?

No. That's my point. All of this must be runtime options. The kernel
must be buildable for 460EX -and- 460GT - and an old 440EP if I want to
in a single image, and this -with- the coalescing option enabled. It
would obviously only be available when running on the cores that support
it, but it should -not- be a compile time decision.

IE. All your ifdef's should be turned into runtime checks. If you have
conflicting #define for register names and bits, then prefix them with
the SoC name.
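
A minimal sketch of the idea, in plain C rather than real driver code. All
names here (struct mal_sketch, mal_sketch_probe, the SoC strings) are
hypothetical; in the kernel the detection would presumably come from the
device tree rather than a string table:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-device state; not the real ibm_newemac structures. */
struct mal_sketch {
	bool has_coalescing;	/* decided once, at probe time */
};

/* Hypothetical names for the SoCs mentioned in this thread. */
static const char *const coalesce_socs[] = { "405ex", "460ex", "460gt" };

/* Probe-time detection takes the place of the former #ifdef. */
static void mal_sketch_probe(struct mal_sketch *mal, const char *soc)
{
	size_t i;

	mal->has_coalescing = false;
	for (i = 0; i < sizeof(coalesce_socs) / sizeof(coalesce_socs[0]); i++)
		if (strcmp(soc, coalesce_socs[i]) == 0)
			mal->has_coalescing = true;
}

/* Returns 0 on success, -1 if this hardware lacks the feature. */
static int mal_sketch_set_coalesce(const struct mal_sketch *mal)
{
	if (!mal->has_coalescing)
		return -1;
	/* The SoC-prefixed register writes would go here. */
	return 0;
}
```

The same image then runs on every SoC, and the coalescing path is simply
refused on hardware that lacks it.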

The only acceptable compile-time option is to have the ability to not
compile the coalescing support at all, thus avoiding bloat when building
configs that are only targeted toward processors that don't have it or
setups that don't want it. 

> > There are existing mechanisms via ethtool to configure coalescing. You
> > should hookup onto these.
> 
> I will start looking at the ethtool options

Thanks.

Cheers,
Ben.

^ permalink raw reply

