Linux Power Management development
 help / color / mirror / Atom feed
* [PATCH v10 08/14] x86-32, hotplug: Add start_cpu0() entry point to head_32.S
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

start_cpu0() is defined in head_32.S for 32-bit. The function sets up stack and
jumps to start_secondary() for CPU0 wake up.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/head_32.S |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 957a47a..a013e73 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -266,6 +266,19 @@ num_subarch_entries = (. - subarch_entries) / 4
 	jmp default_entry
 #endif /* CONFIG_PARAVIRT */
 
+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * Boot CPU0 entry point. It's called from play_dead(). Everything has been set
+ * up already except stack. We just set up stack here. Then call
+ * start_secondary().
+ */
+ENTRY(start_cpu0)
+	movl stack_start, %ecx
+	movl %ecx, %esp
+	jmp  *(initial_code)
+ENDPROC(start_cpu0)
+#endif
+
 /*
  * Non-boot CPU entry point; entered from trampoline.S
  * We can't lgdt here, because lgdt itself uses a data segment, but
-- 
1.7.2

^ permalink raw reply related

* [PATCH v10 10/14] x86, hotplug: During CPU0 online, enable x2apic, set_numa_node.
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

Previously these functions were not run on the BSP (CPU 0, the boot processor)
since the boot processor init would only be executed before this functionality
was initialized.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/common.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7505f7b..ca165ac 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1237,7 +1237,7 @@ void __cpuinit cpu_init(void)
 	oist = &per_cpu(orig_ist, cpu);
 
 #ifdef CONFIG_NUMA
-	if (cpu != 0 && this_cpu_read(numa_node) == 0 &&
+	if (this_cpu_read(numa_node) == 0 &&
 	    early_cpu_to_node(cpu) != NUMA_NO_NODE)
 		set_numa_node(early_cpu_to_node(cpu));
 #endif
@@ -1269,8 +1269,7 @@ void __cpuinit cpu_init(void)
 	barrier();
 
 	x86_configure_nx();
-	if (cpu != 0)
-		enable_x2apic();
+	enable_x2apic();
 
 	/*
 	 * set up and load the per-CPU TSS
-- 
1.7.2

^ permalink raw reply related

* [PATCH v10 11/14] x86, hotplug: The first online processor saves the MTRR state
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

Ask the first online CPU to save mtrr instead of asking BSP. BSP could be
offline when mtrr_save_state() is called.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/cpu/mtrr/main.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
index 6b96110..e4c1a41 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -695,11 +695,16 @@ void mtrr_ap_init(void)
 }
 
 /**
- * Save current fixed-range MTRR state of the BSP
+ * Save current fixed-range MTRR state of the first cpu in cpu_online_mask.
  */
 void mtrr_save_state(void)
 {
-	smp_call_function_single(0, mtrr_save_fixed_ranges, NULL, 1);
+	int first_cpu;
+
+	get_online_cpus();
+	first_cpu = cpumask_first(cpu_online_mask);
+	smp_call_function_single(first_cpu, mtrr_save_fixed_ranges, NULL, 1);
+	put_online_cpus();
 }
 
 void set_mtrr_aps_delayed_init(void)
-- 
1.7.2

^ permalink raw reply related

* [PATCH v10 12/14] x86, hotplug: Handle retrigger irq by the first available CPU
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

The first cpu in irq cfg->domain is likely to be CPU 0 and may not be available
when CPU 0 is offline. Instead of using CPU 0 to handle retriggered irq, we use
first available CPU which is online and in this irq's domain.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/apic/io_apic.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index b134f0b..c94bac8 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2199,9 +2199,11 @@ static int ioapic_retrigger_irq(struct irq_data *data)
 {
 	struct irq_cfg *cfg = data->chip_data;
 	unsigned long flags;
+	int cpu;
 
 	raw_spin_lock_irqsave(&vector_lock, flags);
-	apic->send_IPI_mask(cpumask_of(cpumask_first(cfg->domain)), cfg->vector);
+	cpu = cpumask_first_and(cfg->domain, cpu_online_mask);
+	apic->send_IPI_mask(cpumask_of(cpu), cfg->vector);
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 
 	return 1;
-- 
1.7.2

^ permalink raw reply related

* [PATCH v10 13/14] x86/i387.c: Initialize thread xstate only on CPU0 only once
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

init_thread_xstate() is only called once to avoid overriding xstate_size during
boot time or during CPU hotplug.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/kernel/i387.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 675a050..245a71d 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -175,7 +175,11 @@ void __cpuinit fpu_init(void)
 		cr0 |= X86_CR0_EM;
 	write_cr0(cr0);
 
-	if (!smp_processor_id())
+	/*
+	 * init_thread_xstate is only called once to avoid overriding
+	 * xstate_size during boot time or during CPU hotplug.
+	 */
+	if (xstate_size == 0)
 		init_thread_xstate();
 
 	mxcsr_feature_mask_init();
-- 
1.7.2

^ permalink raw reply related

* [PATCH v10 14/14] x86, topology: Debug CPU00 hotplug
From: Fenghua Yu @ 2012-11-13 19:32 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H Peter Anvin, Linus Torvalds,
	Andrew Morton, Asit K Mallick, Tony Luck, Arjan Dan De Ven,
	Suresh B Siddha, Len Brown, Srivatssa S. Bhat, Randy Dunlap,
	Rafael J. Wysocki, Chen Gong, linux-kernel, linux-pm, x86
  Cc: Fenghua Yu
In-Reply-To: <1352835171-3958-1-git-send-email-fenghua.yu@intel.com>

From: Fenghua Yu <fenghua.yu@intel.com>

CONFIG_DEBUG_HOTPLUG_CPU0 is for debugging the CPU0 hotplug feature. The switch
offlines CPU0 as soon as possible and boots userspace up with CPU0 offlined.
User can online CPU0 back after boot time. The default value of the switch is
off.

To debug CPU0 hotplug, you need to enable CPU0 offline/online feature by either
turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during compilation or giving
cpu0_hotplug kernel parameter at boot.

It's safe and early place to take down CPU0 after all hotplug notifiers
are installed and SMP is booted.

Please note that some applications or drivers, e.g. some versions of udevd,
during boot time may put CPU0 online again in this CPU0 hotplug debug mode.

In this debug mode, setup_local_APIC() may report a warning on max_loops<=0
when CPU0 is onlined back after boot time. This is because pending interrupt in
IRR can not move to ISR. The warning is not CPU0 specfic and it can happen on
other CPUs as well. It is harmless except the first CPU0 online takes a bit
longer time. And so this debug mode is useful to expose this issue. I'll send
a seperate patch to fix this generic warning issue.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
 arch/x86/Kconfig           |   15 +++++++++++++
 arch/x86/include/asm/cpu.h |    3 ++
 arch/x86/kernel/topology.c |   51 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/power/cpu.c       |   38 ++++++++++++++++++++++++++++++++
 4 files changed, 107 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 933ff87..ea5852f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1759,6 +1759,21 @@ config BOOTPARAM_HOTPLUG_CPU0
 	  You still can enable the CPU0 hotplug feature at boot by kernel
 	  parameter cpu0_hotplug.
 
+config DEBUG_HOTPLUG_CPU0
+	def_bool n
+	prompt "Debug CPU0 hotplug"
+	depends on HOTPLUG_CPU && EXPERIMENTAL
+	---help---
+	  Enabling this option offlines CPU0 (if CPU0 can be offlined) as
+	  soon as possible and boots up userspace with CPU0 offlined. User
+	  can online CPU0 back after boot time.
+
+	  To debug CPU0 hotplug, you need to enable CPU0 offline/online
+	  feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during
+	  compilation or giving cpu0_hotplug kernel parameter at boot.
+
+	  If unsure, say N.
+
 config COMPAT_VDSO
 	def_bool y
 	prompt "Compat VDSO support"
diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index a119572..5f9a124 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -29,6 +29,9 @@ struct x86_cpu {
 extern int arch_register_cpu(int num);
 extern void arch_unregister_cpu(int);
 extern void __cpuinit start_cpu0(void);
+#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
+extern int _debug_hotplug_cpu(int cpu, int action);
+#endif
 #endif
 
 DECLARE_PER_CPU(int, cpu_state);
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
index 0e7b4a7..6e60b5f 100644
--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -50,6 +50,57 @@ static int __init enable_cpu0_hotplug(char *str)
 __setup("cpu0_hotplug", enable_cpu0_hotplug);
 #endif
 
+#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
+/*
+ * This function offlines a CPU as early as possible and allows userspace to
+ * boot up without the CPU. The CPU can be onlined back by user after boot.
+ *
+ * This is only called for debugging CPU offline/online feature.
+ */
+int __ref _debug_hotplug_cpu(int cpu, int action)
+{
+	struct device *dev = get_cpu_device(cpu);
+	int ret;
+
+	if (!cpu_is_hotpluggable(cpu))
+		return -EINVAL;
+
+	cpu_hotplug_driver_lock();
+
+	switch (action) {
+	case 0:
+		ret = cpu_down(cpu);
+		if (!ret) {
+			pr_info("CPU %u is now offline\n", cpu);
+			kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
+		} else
+			pr_debug("Can't offline CPU%d.\n", cpu);
+		break;
+	case 1:
+		ret = cpu_up(cpu);
+		if (!ret)
+			kobject_uevent(&dev->kobj, KOBJ_ONLINE);
+		else
+			pr_debug("Can't online CPU%d.\n", cpu);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	cpu_hotplug_driver_unlock();
+
+	return ret;
+}
+
+static int __init debug_hotplug_cpu(void)
+{
+	_debug_hotplug_cpu(0, 0);
+	return 0;
+}
+
+late_initcall_sync(debug_hotplug_cpu);
+#endif /* CONFIG_DEBUG_HOTPLUG_CPU0 */
+
 int __ref arch_register_cpu(int num)
 {
 	struct cpuinfo_x86 *c = &cpu_data(num);
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index adde775..120cee1 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -21,6 +21,7 @@
 #include <asm/suspend.h>
 #include <asm/debugreg.h>
 #include <asm/fpu-internal.h> /* pcntxt_mask */
+#include <asm/cpu.h>
 
 #ifdef CONFIG_X86_32
 static struct saved_context saved_context;
@@ -263,6 +264,43 @@ static int bsp_pm_callback(struct notifier_block *nb, unsigned long action,
 	case PM_HIBERNATION_PREPARE:
 		ret = bsp_check();
 		break;
+#ifdef CONFIG_DEBUG_HOTPLUG_CPU0
+	case PM_RESTORE_PREPARE:
+		/*
+		 * When system resumes from hibernation, online CPU0 because
+		 * 1. it's required for resume and
+		 * 2. the CPU was online before hibernation
+		 */
+		if (!cpu_online(0))
+			_debug_hotplug_cpu(0, 1);
+		break;
+	case PM_POST_RESTORE:
+		/*
+		 * When a resume really happens, this code won't be called.
+		 *
+		 * This code is called only when user space hibernation software
+		 * prepares for snapshot device during boot time. So we just
+		 * call _debug_hotplug_cpu() to restore to CPU0's state prior to
+		 * preparing the snapshot device.
+		 *
+		 * This works for normal boot case in our CPU0 hotplug debug
+		 * mode, i.e. CPU0 is offline and user mode hibernation
+		 * software initializes during boot time.
+		 *
+		 * If CPU0 is online and user application accesses snapshot
+		 * device after boot time, this will offline CPU0 and user may
+		 * see different CPU0 state before and after accessing
+		 * the snapshot device. But hopefully this is not a case when
+		 * user debugging CPU0 hotplug. Even if users hit this case,
+		 * they can easily online CPU0 back.
+		 *
+		 * To simplify this debug code, we only consider normal boot
+		 * case. Otherwise we need to remember CPU0's state and restore
+		 * to that state and resolve racy conditions etc.
+		 */
+		_debug_hotplug_cpu(0, 0);
+		break;
+#endif
 	default:
 		break;
 	}
-- 
1.7.2

^ permalink raw reply related

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Paul E. McKenney @ 2012-11-13 21:16 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Linux PM, LKML, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven
In-Reply-To: <1352757831-5202-4-git-send-email-jacob.jun.pan@linux.intel.com>

On Mon, Nov 12, 2012 at 02:03:51PM -0800, Jacob Pan wrote:
> Intel PowerClamp driver performs synchronized idle injection across
> all online CPUs. The goal is to maintain a given package level C-state
> ratio.
> 
> Compared to other throttling methods already exist in the kernel,
> such as ACPI PAD (taking CPUs offline) and clock modulation, this is often
> more efficient in terms of performance per watt.
> 
> Please refer to Documentation/thermal/intel_powerclamp.txt for more details.

If I read this correctly, this forces a group of CPUs into idle for
about 600 milliseconds at a time.  This would indeed delay grace periods,
which could easily result in user complaints.  Also, given the default
RCU_BOOST_DELAY of 500 milliseconds in kernels enabling RCU_BOOST,
you would see needless RCU priority boosting.

Of course, if the idle period extended for longer, you would see RCU
CPU stall warnings.  And if the idle period extended indefinitely, you
could hang the system -- the RCU callbacks on the idled CPU could not
be invoked, and if one of those RCU callbacks was waking someone up,
that someone would not be woken up.

It looks like you could end up with part of the system powerclamped
in some situations, and with all of it powerclamped in other situations.
Is that the case, or am I confused?

							Thanx, Paul

> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  Documentation/thermal/intel_powerclamp.txt |  307 +++++++++++
>  drivers/thermal/Kconfig                    |   10 +
>  drivers/thermal/Makefile                   |    1 +
>  drivers/thermal/intel_powerclamp.c         |  766 ++++++++++++++++++++++++++++
>  4 files changed, 1084 insertions(+)
>  create mode 100644 Documentation/thermal/intel_powerclamp.txt
>  create mode 100644 drivers/thermal/intel_powerclamp.c
> 
> diff --git a/Documentation/thermal/intel_powerclamp.txt b/Documentation/thermal/intel_powerclamp.txt
> new file mode 100644
> index 0000000..332de4a
> --- /dev/null
> +++ b/Documentation/thermal/intel_powerclamp.txt
> @@ -0,0 +1,307 @@
> +			 =======================
> +			 INTEL POWERCLAMP DRIVER
> +			 =======================
> +By: Arjan van de Ven <arjan@linux.intel.com>
> +    Jacob Pan <jacob.jun.pan@linux.intel.com>
> +
> +Contents:
> +	(*) Introduction
> +	    - Goals and Objectives
> +
> +	(*) Theory of Operation
> +	    - Idle Injection
> +	    - Calibration
> +
> +	(*) Performance Analysis
> +	    - Effectiveness and Limitations
> +	    - Power vs Performance
> +	    - Scalability
> +	    - Calibration
> +	    - Comparison with Alternative Techniques
> +
> +	(*) Usage and Interfaces
> +	    - Generic Thermal Layer (sysfs)
> +	    - Kernel APIs (TBD)
> +
> +============
> +INTRODUCTION
> +============
> +
> +Consider the situation where a system’s power consumption must be
> +reduced at runtime, due to power budget, thermal constraint, or noise
> +level, and where active cooling is not preferred. Software managed
> +passive power reduction must be performed to prevent the hardware
> +actions that are designed for catastrophic scenarios.
> +
> +Currently, P-states, T-states (clock modulation), and CPU offlining
> +are used for CPU throttling.
> +
> +On Intel CPUs, C-states provide effective power reduction, but so far
> +they’re only used opportunistically, based on workload. With the
> +development of intel_powerclamp driver, the method of synchronizing
> +idle injection across all online CPU threads was introduced. The goal
> +is to achieve forced and controllable C-state residency.
> +
> +Test/Analysis has been made in the areas of power, performance,
> +scalability, and user experience. In many cases, clear advantage is
> +shown over taking the CPU offline or modulating the CPU clock.
> +
> +
> +===================
> +THEORY OF OPERATION
> +===================
> +
> +Idle Injection
> +--------------
> +
> +On modern Intel processors (Nehalem or later), package level C-state
> +residency is available in MSRs, thus also available to the kernel.
> +
> +These MSRs are:
> +      #define MSR_PKG_C2_RESIDENCY	0x60D
> +      #define MSR_PKG_C3_RESIDENCY	0x3F8
> +      #define MSR_PKG_C6_RESIDENCY	0x3F9
> +      #define MSR_PKG_C7_RESIDENCY	0x3FA
> +
> +If the kernel can also inject idle time to the system, then a
> +closed-loop control system can be established that manages package
> +level C-state. The intel_powerclamp driver is conceived as such a
> +control system, where the target set point is a user-selected idle
> +ratio (based on power reduction), and the error is the difference
> +between the actual package level C-state residency ratio and the target idle
> +ratio.
> +
> +Injection is controlled by high priority kernel threads, spawned for
> +each online CPU.
> +
> +These kernel threads, with SCHED_FIFO class, are created to perform
> +clamping actions of controlled duty ratio and duration. Each per-CPU
> +thread synchronizes its idle time and duration, based on the rounding
> +of jiffies, so accumulated errors can be prevented to avoid a jittery
> +effect. Threads are also bound to the CPU such that they cannot be
> +migrated, unless the CPU is taken offline. In this case, threads
> +belong to the offlined CPUs will be terminated immediately.
> +
> +Running as SCHED_FIFO and relatively high priority, also allows such
> +scheme to work for both preemptable and non-preemptable kernels.
> +Alignment of idle time around jiffies ensures scalability for HZ
> +values. This effect can be better visualized using a Perf timechart.
> +The following diagram shows the behavior of kernel thread
> +kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
> +for a given "duration", then relinquishes the CPU to other tasks,
> +until the next time interval.
> +
> +The NOHZ schedule tick is disabled during idle time, but interrupts
> +are not masked. Tests show that the extra wakeups from scheduler tick
> +have a dramatic impact on the effectiveness of the powerclamp driver
> +on large scale systems (Westmere system with 80 processors).
> +
> +CPU0
> +		  ____________          ____________
> +kidle_inject/0   |   sleep    |  mwait |  sleep     |
> +	_________|            |________|            |_______
> +			       duration
> +CPU1
> +		  ____________          ____________
> +kidle_inject/1   |   sleep    |  mwait |  sleep     |
> +	_________|            |________|            |_______
> +			      ^
> +			      |
> +			      |
> +			      roundup(jiffies, interval)
> +
> +Only one CPU is allowed to collect statistics and update global
> +control parameters. This CPU is referred to as the controlling CPU in
> +this document. The controlling CPU is elected at runtime, with a
> +policy that favors BSP, taking into account the possibility of a CPU
> +hot-plug.
> +
> +In terms of dynamics of the idle control system, package level idle
> +time is considered largely as a non-causal system where its behavior
> +cannot be based on the past or current input. Therefore, the
> +intel_powerclamp driver attempts to enforce the desired idle time
> +instantly as given input (target idle ratio). After injection,
> +powerclamp moniors the actual idle for a given time window and adjust
> +the next injection accordingly to avoid over/under correction.
> +
> +When used in a causal control system, such as a temperature control,
> +it is up to the user of this driver to implement algorithms where
> +past samples and outputs are included in the feedback. For example, a
> +PID-based thermal controller can use the powerclamp driver to
> +maintain a desired target temperature, based on integral and
> +derivative gains of the past samples.
> +
> +
> +
> +Calibration
> +-----------
> +During scalability testing, it is observed that synchronized actions
> +among CPUs become challenging as the number of cores grows. This is
> +also true for the ability of a system to enter package level C-states.
> +
> +To make sure the intel_powerclamp driver scales well, online
> +calibration is implemented. The goals for doing such a calibration
> +are:
> +
> +a) determine the effective range of idle injection ratio
> +b) determine the amount of compensation needed at each target ratio
> +
> +Compensation to each target ratio consists of two parts:
> +
> +        a) steady state error compensation
> +	This is to offset the error occurring when the system can
> +	enter idle without extra wakeups (such as external interrupts).
> +
> +	b) dynamic error compensation
> +	When an excessive amount of wakeups occurs during idle, an
> +	additional idle ratio can be added to quiet interrupts, by
> +	slowing down CPU activities.
> +
> +A debugfs file is provided for the user to examine compensation
> +progress and results, such as on a Westmere system.
> +[jacob@nex01 ~]$ cat
> +/sys/kernel/debug/intel_powerclamp/powerclamp_calib
> +controlling cpu: 0
> +pct confidence steady dynamic (compensation)
> +0	0	0	0
> +1	1	0	0
> +2	1	1	0
> +3	3	1	0
> +4	3	1	0
> +5	3	1	0
> +6	3	1	0
> +7	3	1	0
> +8	3	1	0
> +...
> +30	3	2	0
> +31	3	2	0
> +32	3	1	0
> +33	3	2	0
> +34	3	1	0
> +35	3	2	0
> +36	3	1	0
> +37	3	2	0
> +38	3	1	0
> +39	3	2	0
> +40	3	3	0
> +41	3	1	0
> +42	3	2	0
> +43	3	1	0
> +44	3	1	0
> +45	3	2	0
> +46	3	3	0
> +47	3	0	0
> +48	3	2	0
> +49	3	3	0
> +
> +Calibration occurs during runtime. No offline method is available.
> +Steady state compensation is used only when confidence levels of all
> +adjacent ratios have reached satisfactory level. A confidence level
> +is accumulated based on clean data collected at runtime. Data
> +collected during a period without extra interrupts is considered
> +clean.
> +
> +To compensate for excessive amounts of wakeup during idle, additional
> +idle time is injected when such a condition is detected. Currently,
> +we have a simple algorithm to double the injection ratio. A possible
> +enhancement might be to throttle the offending IRQ, such as delaying
> +EOI for level triggered interrupts. But it is a challenge to be
> +non-intrusive to the scheduler or the IRQ core code.
> +
> +
> +CPU Online/Offline
> +------------------
> +Per-CPU kernel threads are started/stopped upon receiving
> +notifications of CPU hotplug activities. The intel_powerclamp driver
> +keeps track of clamping kernel threads, even after they are migrated
> +to other CPUs, after a CPU offline event.
> +
> +
> +=====================
> +Performance Analysis
> +=====================
> +This section describes the general performance data collected on
> +multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
> +
> +Effectiveness and Limitations
> +-----------------------------
> +The maximum range that idle injection is allowed is capped at 50
> +percent. As mentioned earlier, since interrupts are allowed during
> +forced idle time, excessive interrupts could result in less
> +effectiveness. The extreme case would be doing a ping -f to generated
> +flooded network interrupts without much CPU acknowledgement. In this
> +case, little can be done from the idle injection threads. In most
> +normal cases, such as scp a large file, applications can be throttled
> +by the powerclamp driver, since slowing down the CPU also slows down
> +network protocol processing, which in turn reduces interrupts.
> +
> +When control parameters change at runtime by the controlling CPU, it
> +may take an additional period for the rest of the CPUs to catch up
> +with the changes. During this time, idle injection is out of sync,
> +thus not able to enter package C- states at the expected ratio. But
> +this effect is minor, in that in most cases change to the target
> +ratio is updated much less frequently than the idle injection
> +frequency.
> +
> +Scalability
> +-----------
> +Tests also show a minor, but measurable, difference between the 4P/8P
> +Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
> +More compensation is needed on Westmere for the same amount of
> +target idle ratio. The compensation also increases as the idle ratio
> +gets larger. The above reason constitutes the need for the
> +calibration code.
> +
> +On the IVB 8P system, compared to an offline CPU, powerclamp can
> +achieve up to 40% better performance per watt. (measured by a spin
> +counter summed over per CPU counting threads spawned for all running
> +CPUs).
> +
> +====================
> +Usage and Interfaces
> +====================
> +The powerclamp driver is registered to the generic thermal layer as a
> +cooling device. Currently, it’s not bound to any thermal zones.
> +
> +jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
> +cur_state:0
> +max_state:50
> +type:intel_powerclamp
> +
> +Example usage:
> +- To inject 25% idle time
> +$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
> +"
> +
> +If the system is not busy and has more than 25% idle time already,
> +then the powerclamp driver will not start idle injection. Using Top
> +will not show idle injection kernel threads.
> +
> +If the system is busy (spin test below) and has less than 25% natural
> +idle time, powerclamp kernel threads will do idle injection, which
> +appear running to the scheduler. But the overall system idle is still
> +reflected. In this example, 24.1% idle is shown. This helps the
> +system admin or user determine the cause of slowdown, when a
> +powerclamp driver is in action.
> +
> +
> +Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
> +Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> +Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
> +Swap:  4087804k total,        0k used,  4087804k free,   945336k cached
> +
> +  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> + 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
> + 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
> + 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
> + 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
> + 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
> + 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
> + 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
> + 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz
> +
> +Tests have shown that by using the powerclamp driver as a cooling
> +device, a PID based userspace thermal controller can manage to
> +control CPU temperature effectively, when no other thermal influence
> +is added. For example, a UltraBook user can compile the kernel under
> +certain temperature (below most active trip points).
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index e1cb6bd..4d99c4b 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -55,3 +55,13 @@ config EXYNOS_THERMAL
>  	help
>  	  If you say yes here you get support for TMU (Thermal Managment
>  	  Unit) on SAMSUNG EXYNOS series of SoC.
> +
> +config INTEL_POWERCLAMP
> +	tristate "Intel PowerClamp idle injection driver"
> +	depends on THERMAL
> +	depends on X86
> +	depends on CPU_SUP_INTEL
> +	help
> +	  Enable this to enable Intel PowerClamp idle injection driver. This
> +	  enforce idle time which results in more package C-state residency. The
> +	  user interface is exposed via generic thermal framework.
> diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
> index 885550d..03e4479 100644
> --- a/drivers/thermal/Makefile
> +++ b/drivers/thermal/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_CPU_THERMAL)		+= cpu_cooling.o
>  obj-$(CONFIG_SPEAR_THERMAL)		+= spear_thermal.o
>  obj-$(CONFIG_RCAR_THERMAL)	+= rcar_thermal.o
>  obj-$(CONFIG_EXYNOS_THERMAL)		+= exynos_thermal.o
> +obj-$(CONFIG_INTEL_POWERCLAMP)	+= intel_powerclamp.o
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> new file mode 100644
> index 0000000..cc9fa17
> --- /dev/null
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -0,0 +1,766 @@
> +/*
> + * intel_powerclamp.c - package c-state idle injection
> + *
> + * Copyright (c) 2012, Intel Corporation.
> + *
> + * Authors:
> + *     Arjan van de Ven <arjan@linux.intel.com>
> + *     Jacob Pan <jacob.jun.pan@linux.intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
> + *
> + *
> + *	TODO:
> + *           1. better handle wakeup from external interrupts, currently a fixed
> + *              compensation is added to clamping duration when excessive amount
> + *              of wakeups are observed during idle time. the reason is that in
> + *              case of external interrupts without need for ack, clamping down
> + *              cpu in non-irq context does not reduce irq. for majority of the
> + *              cases, clamping down cpu does help reduce irq as well, we should
> + *              be able to differenciate the two cases and give a quantitative
> + *              solution for the irqs that we can control. perhaps based on
> + *              get_cpu_iowait_time_us()
> + *
> + *	     2. synchronization with other hw blocks
> + *
> + *
> + */
> +
> +/* #define DEBUG */
> +
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
> +#include <linux/cpu.h>
> +#include <linux/thermal.h>
> +#include <linux/slab.h>
> +#include <linux/tick.h>
> +#include <linux/debugfs.h>
> +#include <linux/seq_file.h>
> +#include <linux/nmi.h>
> +
> +#include <asm/msr.h>
> +#include <asm/mwait.h>
> +#include <asm/cpu_device_id.h>
> +#include <asm/idle.h>
> +#include <asm/hardirq.h>
> +
> +#define MSR_PKG_C2_RESIDENCY		0x60D
> +#define MSR_PKG_C3_RESIDENCY		0x3F8
> +#define MSR_PKG_C6_RESIDENCY		0x3F9
> +#define MSR_PKG_C7_RESIDENCY		0x3FA
> +
> +#define MAX_TARGET_RATIO (50)
> +/* For each undisturbed clamping period (no extra wake ups during idle time),
> + * we increment the confidence counter for the given target ratio.
> + * CONFIDENCE_OK defines the level where runtime calibration results are
> + * valid.
> + */
> +#define CONFIDENCE_OK (3)
> +/* Default idle injection duration, driver adjust sleep time to meet target
> + * idle ratio. Similar to frequency modulation.
> + */
> +#define DEFAULT_DURATION_JIFFIES (6)
> +
> +static unsigned int target_mwait;
> +static struct dentry *debug_dir;
> +
> +/* user selected target */
> +static unsigned int set_target_ratio;
> +static unsigned int current_ratio;
> +static bool should_skip;
> +static bool reduce_irq;
> +static atomic_t idle_wakeup_counter;
> +static unsigned int control_cpu; /* The cpu assigned to collect stat and update
> +				  * control parameters. default to BSP but BSP
> +				  * can be offlined.
> +				  */
> +static int clamping;
> +
> +
> +static struct task_struct __percpu **powerclamp_thread;
> +static struct thermal_cooling_device *cooling_dev;
> +static unsigned long *cpu_clamping_mask;  /* bit map for tracking per cpu
> +					   * clamping thread
> +					   */
> +static int duration;
> +module_param(duration, int, 0600);
> +MODULE_PARM_DESC(duration, "forced idle time for each attempt in msec.");
> +
> +static unsigned int pkg_cstate_ratio_cur;
> +static unsigned int window_size;
> +
> +struct powerclamp_calibration_data {
> +	unsigned long confidence;  /* used for calibration, basically a counter
> +				    * gets incremented each time a clamping
> +				    * period is completed without extra wakeups
> +				    * once that counter is reached given level,
> +				    * compensation is deemed usable.
> +				    */
> +	unsigned long steady_comp; /* steady state compensation used when
> +				    * no extra wakeups occurred.
> +				    */
> +	unsigned long dynamic_comp; /* compensate excessive wakeup from idle
> +				     * mostly from external interrupts.
> +				     */
> +};
> +
> +static struct powerclamp_calibration_data cal_data[MAX_TARGET_RATIO];
> +
> +static int window_size_set(const char *arg, const struct kernel_param *kp)
> +{
> +	int ret = 0;
> +	unsigned long new_window_size;
> +
> +	ret = kstrtoul(arg, 10, &new_window_size);
> +	if (ret)
> +		goto exit_win;
> +	if (new_window_size >= 10 || new_window_size < 2) {
> +		pr_err("PowerClamp: invalid window size %lu, between 2-10\n",
> +			new_window_size);
> +		ret = -EINVAL;
> +	}
> +
> +	window_size = new_window_size;
> +	smp_mb();
> +
> +exit_win:
> +
> +	return ret;
> +}
> +static struct kernel_param_ops window_size_ops = {
> +	.set = window_size_set,
> +	.get = param_get_int,
> +};
> +
> +module_param_cb(window_size, &window_size_ops, &window_size, 0644);
> +MODULE_PARM_DESC(window_size, "sliding window in number of clamping cycles\n"
> +	"\tpowerclamp controls idle ratio within this window. larger\n"
> +	"\twindow size results in slower response time but more smooth\n"
> +	"\tclamping results. default to 2.");
> +
> +static void find_target_mwait(void)
> +{
> +	unsigned int eax, ebx, ecx, edx;
> +	unsigned int highest_cstate = 0;
> +	unsigned int highest_subcstate = 0;
> +	int i;
> +
> +	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
> +		return;
> +
> +	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
> +
> +	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
> +	    !(ecx & CPUID5_ECX_INTERRUPT_BREAK))
> +		return;
> +
> +	edx >>= MWAIT_SUBSTATE_SIZE;
> +	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
> +		if (edx & MWAIT_SUBSTATE_MASK) {
> +			highest_cstate = i;
> +			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
> +		}
> +	}
> +	target_mwait = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
> +		(highest_subcstate - 1);
> +
> +}
> +
> +static u64 pkg_state_counter(void)
> +{
> +	u64 val;
> +	u64 count = 0;
> +
> +	static int skip_c2;
> +	static int skip_c3;
> +	static int skip_c6;
> +	static int skip_c7;
> +
> +	if (!skip_c2) {
> +		if (!rdmsrl_safe(MSR_PKG_C2_RESIDENCY, &val))
> +			count += val;
> +		else
> +			skip_c2 = 1;
> +	}
> +
> +	if (!skip_c3) {
> +		if (!rdmsrl_safe(MSR_PKG_C3_RESIDENCY, &val))
> +			count += val;
> +		else
> +			skip_c3 = 1;
> +	}
> +
> +	if (!skip_c6) {
> +		if (!rdmsrl_safe(MSR_PKG_C6_RESIDENCY, &val))
> +			count += val;
> +		else
> +			skip_c6 = 1;
> +	}
> +
> +	if (!skip_c7) {
> +		if (!rdmsrl_safe(MSR_PKG_C7_RESIDENCY, &val))
> +			count += val;
> +		else
> +			skip_c7 = 1;
> +	}
> +
> +	return count;
> +}
> +
> +static void noop_timer(unsigned long foo)
> +{
> +	/* empty... just the fact that we get the interrupt wakes us up */
> +}
> +
> +static unsigned int get_compensation(int ratio)
> +{
> +	unsigned int comp = 0;
> +
> +	/* we only use compensation if all adjacent ones are good */
> +	if (ratio == 1 &&
> +		cal_data[ratio].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio + 1].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio + 2].confidence >= CONFIDENCE_OK) {
> +		comp = (cal_data[ratio].steady_comp +
> +			cal_data[ratio + 1].steady_comp +
> +			cal_data[ratio + 2].steady_comp) / 3;
> +	} else if (ratio == MAX_TARGET_RATIO - 1 &&
> +		cal_data[ratio].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio - 2].confidence >= CONFIDENCE_OK) {
> +		comp = (cal_data[ratio].steady_comp +
> +			cal_data[ratio - 1].steady_comp +
> +			cal_data[ratio - 2].steady_comp) / 3;
> +	} else if (cal_data[ratio].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio - 1].confidence >= CONFIDENCE_OK &&
> +		cal_data[ratio + 1].confidence >= CONFIDENCE_OK) {
> +		comp = (cal_data[ratio].steady_comp +
> +			cal_data[ratio - 1].steady_comp +
> +			cal_data[ratio + 1].steady_comp) / 3;
> +	}
> +
> +	/* REVISIT: simple penalty of double idle injection */
> +	if (reduce_irq)
> +		comp = ratio;
> +	/* do not exceed limit */
> +	if (comp + ratio >= MAX_TARGET_RATIO)
> +		comp = MAX_TARGET_RATIO - ratio - 1;
> +
> +	return comp;
> +}
> +
> +static void adjust_compensation(int target_ratio, unsigned int win)
> +{
> +	int delta;
> +
> +	/*
> +	 * adjust compensations if confidence level has not been reached or
> +	 * there are too many wakeups during the last idle injection period, we
> +	 * cannot trust the data for compensation.
> +	 */
> +	if (cal_data[target_ratio].confidence >= CONFIDENCE_OK ||
> +		atomic_read(&idle_wakeup_counter) >
> +		win * num_online_cpus())
> +		return;
> +
> +	delta = set_target_ratio - current_ratio;
> +	/* filter out bad data */
> +	if (delta >= 0 && delta <= (1+target_ratio/10)) {
> +		if (cal_data[target_ratio].steady_comp)
> +			cal_data[target_ratio].steady_comp =
> +				roundup(delta+
> +					cal_data[target_ratio].steady_comp,
> +					2)/2;
> +		else
> +			cal_data[target_ratio].steady_comp = delta;
> +		cal_data[target_ratio].confidence++;
> +	}
> +}
> +
> +static bool powerclamp_adjust_controls(unsigned int target_ratio,
> +				unsigned int guard, unsigned int win)
> +{
> +	static u64 msr_last, tsc_last;
> +	u64 msr_now, tsc_now;
> +
> +	/* check result for the last window */
> +	msr_now = pkg_state_counter();
> +	rdtscll(tsc_now);
> +
> +	/* calculate pkg cstate vs tsc ratio */
> +	if (!msr_last || !tsc_last)
> +		current_ratio = 1;
> +	else if (tsc_now-tsc_last)
> +		current_ratio = 100*(msr_now-msr_last)/
> +			(tsc_now-tsc_last);
> +
> +	/* update record */
> +	msr_last = msr_now;
> +	tsc_last = tsc_now;
> +
> +	adjust_compensation(target_ratio, win);
> +	/*
> +	 * too many external interrupts, set flag such
> +	 * that we can take measure later.
> +	 */
> +	reduce_irq = atomic_read(&idle_wakeup_counter) >=
> +		2 * win * num_online_cpus();
> +
> +	atomic_set(&idle_wakeup_counter, 0);
> +	/* if we are above target+guard, skip */
> +	return set_target_ratio + guard <= current_ratio;
> +}
> +
> +static int clamp_thread(void *arg)
> +{
> +	int cpunr = (unsigned long)arg;
> +	DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
> +	static const struct sched_param param = {
> +		.sched_priority = MAX_USER_RT_PRIO/2,
> +	};
> +	unsigned int count = 0;
> +	unsigned int target_ratio;
> +
> +	set_bit(cpunr, cpu_clamping_mask);
> +	set_freezable();
> +	init_timer_on_stack(&wakeup_timer);
> +	sched_setscheduler(current, SCHED_FIFO, &param);
> +
> +	while (clamping && !kthread_should_stop() && cpu_online(cpunr)) {
> +		int sleeptime;
> +		unsigned long target_jiffies;
> +		unsigned int guard;
> +		unsigned int compensation = 0;
> +		int interval; /* jiffies to sleep for each attempt */
> +		unsigned int duration_jiffies = msecs_to_jiffies(duration);
> +		unsigned int window_size_now;
> +
> +		try_to_freeze();
> +		/*
> +		 * make sure user selected ratio does not take effect until
> +		 * the next round. adjust target_ratio if user has changed
> +		 * target such that we can converge quickly.
> +		 */
> +		target_ratio = set_target_ratio;
> +		guard = 1 + target_ratio/20;
> +		window_size_now = window_size;
> +		count++;
> +
> +		/*
> +		 * systems may have different ability to enter package level
> +		 * c-states, thus we need to compensate the injected idle ratio
> +		 * to achieve the actual target reported by the HW.
> +		 */
> +		compensation = get_compensation(target_ratio);
> +		interval = duration_jiffies*100/(target_ratio+compensation);
> +
> +		/* align idle time */
> +		target_jiffies = roundup(jiffies, interval);
> +		sleeptime = target_jiffies - jiffies;
> +		if (sleeptime <= 0)
> +			sleeptime = 1;
> +		schedule_timeout_interruptible(sleeptime);
> +		/*
> +		 * only elected controlling cpu can collect stats and update
> +		 * control parameters.
> +		 */
> +		if (cpunr == control_cpu && !(count%window_size_now)) {
> +			should_skip =
> +				powerclamp_adjust_controls(target_ratio,
> +							guard, window_size_now);
> +			smp_mb();
> +		}
> +
> +		if (should_skip)
> +			continue;
> +
> +		target_jiffies = jiffies + duration_jiffies;
> +		mod_timer(&wakeup_timer, target_jiffies);
> +		if (unlikely(local_softirq_pending()))
> +			continue;
> +		/*
> +		 * stop tick sched during idle time, interrupts are still
> +		 * allowed. thus jiffies are updated properly.
> +		 */
> +		preempt_disable();
> +		tick_nohz_idle_enter();
> +		/* mwait until target jiffies is reached */
> +		while (time_before(jiffies, target_jiffies)) {
> +			unsigned long ecx = 1;
> +			unsigned long eax = target_mwait;
> +
> +			/*
> +			 * REVISIT: may call enter_idle() to notify drivers who
> +			 * can save power during cpu idle. same for exit_idle()
> +			 */
> +			local_touch_nmi();
> +			stop_critical_timings();
> +			__monitor((void *)&current_thread_info()->flags, 0, 0);
> +			cpu_relax(); /* allow HT sibling to run */
> +			__mwait(eax, ecx);
> +			start_critical_timings();
> +			atomic_inc(&idle_wakeup_counter);
> +		}
> +		tick_nohz_idle_exit();
> +		preempt_enable_no_resched();
> +	}
> +	del_timer_sync(&wakeup_timer);
> +	clear_bit(cpunr, cpu_clamping_mask);
> +
> +	return 0;
> +}
> +
> +/*
> + * 1 HZ polling while clamping is active, useful for userspace
> + * to monitor actual idle ratio.
> + */
> +static void poll_pkg_cstate(struct work_struct *dummy);
> +static DECLARE_DELAYED_WORK(poll_pkg_cstate_work, poll_pkg_cstate);
> +static void poll_pkg_cstate(struct work_struct *dummy)
> +{
> +	static u64 msr_last;
> +	static u64 tsc_last;
> +	static unsigned long jiffies_last;
> +
> +	u64 msr_now;
> +	unsigned long jiffies_now;
> +	u64 tsc_now;
> +
> +	msr_now = pkg_state_counter();
> +	rdtscll(tsc_now);
> +	jiffies_now = jiffies;
> +
> +	/* calculate pkg cstate vs tsc ratio */
> +	if (!msr_last || !tsc_last)
> +		pkg_cstate_ratio_cur = 1;
> +	else {
> +		if (tsc_now - tsc_last)
> +			pkg_cstate_ratio_cur = 100 * (msr_now - msr_last)/
> +				(tsc_now - tsc_last);
> +	}
> +
> +	/* update record */
> +	msr_last = msr_now;
> +	jiffies_last = jiffies_now;
> +	tsc_last = tsc_now;
> +
> +	if (clamping)
> +		schedule_delayed_work(&poll_pkg_cstate_work, HZ);
> +}
> +
> +static int start_power_clamp(void)
> +{
> +	unsigned long cpu;
> +	struct task_struct *thread;
> +
> +	/* check if pkg cstate counter is completely 0, abort in this case */
> +	if (!pkg_state_counter()) {
> +		pr_err("pkg cstate counter not functional, abort\n");
> +		return -EINVAL;
> +	}
> +
> +	if (set_target_ratio > MAX_TARGET_RATIO)
> +		set_target_ratio = MAX_TARGET_RATIO;
> +
> +	/* prevent cpu hotplug */
> +	get_online_cpus();
> +
> +	/* prefer BSP */
> +	control_cpu = 0;
> +	if (!cpu_online(control_cpu))
> +		control_cpu = smp_processor_id();
> +
> +	clamping = 1;
> +	schedule_delayed_work(&poll_pkg_cstate_work, 0);
> +
> +	/* start one thread per online cpu */
> +	for_each_online_cpu(cpu) {
> +		struct task_struct **p =
> +			per_cpu_ptr(powerclamp_thread, cpu);
> +
> +		thread = kthread_create_on_node(clamp_thread,
> +						(void *) cpu,
> +						cpu_to_node(cpu),
> +						"kidle_inject/%ld", cpu);
> +		/* bind to cpu here */
> +		if (likely(!IS_ERR(thread))) {
> +			kthread_bind(thread, cpu);
> +			wake_up_process(thread);
> +			*p = thread;
> +		}
> +
> +	}
> +	put_online_cpus();
> +
> +	return 0;
> +}
> +
> +static void end_power_clamp(void)
> +{
> +	int i;
> +	struct task_struct *thread;
> +
> +	clamping = 0;
> +	/*
> +	 * make clamping visible to other cpus and give per cpu clamping threads
> +	 * sometime to exit, or gets killed later.
> +	 */
> +	smp_mb();
> +	msleep(20);
> +	if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
> +		for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
> +			pr_debug("clamping thread for cpu %d alive, kill\n", i);
> +			thread = *per_cpu_ptr(powerclamp_thread, i);
> +			kthread_stop(thread);
> +		}
> +	}
> +}
> +
> +static int powerclamp_cpu_callback(struct notifier_block *nfb,
> +				unsigned long action, void *hcpu)
> +{
> +	unsigned long cpu = (unsigned long)hcpu;
> +	struct task_struct *thread;
> +	struct task_struct **percpu_thread =
> +		per_cpu_ptr(powerclamp_thread, cpu);
> +
> +	if (!clamping)
> +		goto exit_ok;
> +
> +	switch (action) {
> +	case CPU_ONLINE:
> +		thread = kthread_create_on_node(clamp_thread,
> +						(void *) cpu,
> +						cpu_to_node(cpu),
> +						"kidle_inject/%lu", cpu);
> +		if (likely(!IS_ERR(thread))) {
> +			kthread_bind(thread, cpu);
> +			wake_up_process(thread);
> +			*percpu_thread = thread;
> +		}
> +		/* prefer BSP as controlling CPU */
> +		if (cpu == 0) {
> +			control_cpu = 0;
> +			smp_mb();
> +		}
> +		break;
> +	case CPU_DEAD:
> +		if (test_bit(cpu, cpu_clamping_mask)) {
> +			pr_err("cpu %lu dead but powerclamping thread is not\n",
> +				cpu);
> +			kthread_stop(*percpu_thread);
> +		}
> +		if (cpu == control_cpu) {
> +			control_cpu = smp_processor_id();
> +			smp_mb();
> +		}
> +	}
> +
> +exit_ok:
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block powerclamp_cpu_notifier = {
> +	.notifier_call = powerclamp_cpu_callback,
> +};
> +
> +static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> +				 unsigned long *state)
> +{
> +	*state = MAX_TARGET_RATIO;
> +
> +	return 0;
> +}
> +
> +static int powerclamp_get_cur_state(struct thermal_cooling_device *cdev,
> +				 unsigned long *state)
> +{
> +	if (clamping)
> +		*state = pkg_cstate_ratio_cur;
> +	else
> +		/* to save power, do not poll idle ratio while not clamping */
> +		*state = -1; /* indicates invalid state */
> +
> +	return 0;
> +}
> +
> +static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> +				 unsigned long new_target_ratio)
> +{
> +	int ret = 0;
> +
> +	if (new_target_ratio >= MAX_TARGET_RATIO)
> +		new_target_ratio = MAX_TARGET_RATIO - 1;
> +
> +	if (set_target_ratio == 0 && new_target_ratio > 0) {
> +		pr_info("Start idle injection to reduce power\n");
> +		set_target_ratio = new_target_ratio;
> +		ret = start_power_clamp();
> +		goto exit_set;
> +	} else	if (set_target_ratio > 0 && new_target_ratio == 0) {
> +		pr_info("Stop forced idle injection\n");
> +		set_target_ratio = 0;
> +		end_power_clamp();
> +	} else	/* adjust currently running */ {
> +		set_target_ratio = new_target_ratio;
> +		/* make new set_target_ratio visible to other cpus */
> +		smp_mb();
> +	}
> +
> +exit_set:
> +	return ret;
> +}
> +
> +/* bind to generic thermal layer as cooling device*/
> +static struct thermal_cooling_device_ops powerclamp_cooling_ops = {
> +	.get_max_state = powerclamp_get_max_state,
> +	.get_cur_state = powerclamp_get_cur_state,
> +	.set_cur_state = powerclamp_set_cur_state,
> +};
> +
> +/* runs on Nehalem and later */
> +static const struct x86_cpu_id intel_powerclamp_ids[] = {
> +	{ X86_VENDOR_INTEL, 6, 0x1a},
> +	{ X86_VENDOR_INTEL, 6, 0x1c},
> +	{ X86_VENDOR_INTEL, 6, 0x1e},
> +	{ X86_VENDOR_INTEL, 6, 0x1f},
> +	{ X86_VENDOR_INTEL, 6, 0x25},
> +	{ X86_VENDOR_INTEL, 6, 0x26},
> +	{ X86_VENDOR_INTEL, 6, 0x2a},
> +	{ X86_VENDOR_INTEL, 6, 0x2c},
> +	{ X86_VENDOR_INTEL, 6, 0x2d},
> +	{ X86_VENDOR_INTEL, 6, 0x2e},
> +	{ X86_VENDOR_INTEL, 6, 0x2f},
> +	{ X86_VENDOR_INTEL, 6, 0x3a},
> +	{}
> +};
> +MODULE_DEVICE_TABLE(x86cpu, intel_powerclamp_ids);
> +
> +static int powerclamp_probe(void)
> +{
> +	if (!x86_match_cpu(intel_powerclamp_ids)) {
> +		pr_err("Intel powerclamp does not run on family %d model %d\n",
> +				boot_cpu_data.x86, boot_cpu_data.x86_model);
> +		return -ENODEV;
> +	}
> +	if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
> +		!boot_cpu_has(X86_FEATURE_CONSTANT_TSC) ||
> +		!boot_cpu_has(X86_FEATURE_MWAIT) ||
> +		!boot_cpu_has(X86_FEATURE_ARAT))
> +		return -ENODEV;
> +
> +	/* find the deepest mwait value */
> +	find_target_mwait();
> +
> +	return 0;
> +}
> +
> +static int powerclamp_debug_show(struct seq_file *m, void *unused)
> +{
> +	int i = 0;
> +
> +	seq_printf(m, "controlling cpu: %d\n", control_cpu);
> +	seq_printf(m, "pct confidence steady dynamic (compensation)\n");
> +	for (i = 0; i < MAX_TARGET_RATIO; i++) {
> +		seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
> +			i,
> +			cal_data[i].confidence,
> +			cal_data[i].steady_comp,
> +			cal_data[i].dynamic_comp);
> +	}
> +
> +	return 0;
> +}
> +
> +static int powerclamp_debug_open(struct inode *inode,
> +			struct file *file)
> +{
> +	return single_open(file, powerclamp_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations powerclamp_debug_fops = {
> +	.open		= powerclamp_debug_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +	.owner		= THIS_MODULE,
> +};
> +
> +static inline void powerclamp_create_debug_files(void)
> +{
> +	debug_dir = debugfs_create_dir("intel_powerclamp", NULL);
> +	if (!debug_dir)
> +		return;
> +
> +	if (!debugfs_create_file("powerclamp_calib", S_IRUGO, debug_dir,
> +					cal_data, &powerclamp_debug_fops))
> +		goto file_error;
> +
> +	return;
> +
> +file_error:
> +	debugfs_remove_recursive(debug_dir);
> +}
> +
> +static int powerclamp_init(void)
> +{
> +	int retval;
> +	int bitmap_size;
> +
> +	bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
> +	cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
> +	if (!cpu_clamping_mask)
> +		return -ENOMEM;
> +
> +	/* probe cpu features and ids here */
> +	retval = powerclamp_probe();
> +	if (retval)
> +		return retval;
> +	/* set default limit, maybe adjusted during runtime based on feedback */
> +	window_size = 2;
> +	register_hotcpu_notifier(&powerclamp_cpu_notifier);
> +	powerclamp_thread = alloc_percpu(struct task_struct *);
> +	cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
> +						&powerclamp_cooling_ops);
> +	if (IS_ERR(cooling_dev))
> +		return -ENODEV;
> +
> +	if (!duration)
> +		duration = jiffies_to_msecs(DEFAULT_DURATION_JIFFIES);
> +	powerclamp_create_debug_files();
> +
> +	return 0;
> +}
> +module_init(powerclamp_init);
> +
> +static void powerclamp_exit(void)
> +{
> +	unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> +	end_power_clamp();
> +	free_percpu(powerclamp_thread);
> +	thermal_cooling_device_unregister(cooling_dev);
> +	kfree(cpu_clamping_mask);
> +
> +	cancel_delayed_work_sync(&poll_pkg_cstate_work);
> +	debugfs_remove_recursive(debug_dir);
> +}
> +module_exit(powerclamp_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
> +MODULE_AUTHOR("Jacob Pan <jacob.jun.pan@linux.intel.com>");
> +MODULE_DESCRIPTION("Package Level C-state Idle Injection for Intel CPUs");
> -- 
> 1.7.9.5
> 
> 

^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Jacob Pan @ 2012-11-13 21:39 UTC (permalink / raw)
  To: paulmck
  Cc: Linux PM, LKML, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven
In-Reply-To: <20121113211602.GA30150@linux.vnet.ibm.com>

On Tue, 13 Nov 2012 13:16:02 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> > Please refer to Documentation/thermal/intel_powerclamp.txt for more
> > details.  
> 
> If I read this correctly, this forces a group of CPUs into idle for
> about 600 milliseconds at a time.  This would indeed delay grace
> periods, which could easily result in user complaints.  Also, given
> the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
> RCU_BOOST, you would see needless RCU priority boosting.
> 
the default idle injection duration is 6ms. we adjust the sleep
interval to ensure idle ratio. So the idle duration stays the same once
set. So would it be safe to delay grace period for this small amount in
exchange for less over head in each injection period?
> Of course, if the idle period extended for longer, you would see RCU
> CPU stall warnings.  And if the idle period extended indefinitely, you
> could hang the system -- the RCU callbacks on the idled CPU could not
> be invoked, and if one of those RCU callbacks was waking someone up,
> that someone would not be woken up.
> 
for the same algorithm, idle duration is not extended. the injected
idle loop also yield to pending softirqs, i guess that is what rcu
callbacks are using?
> It looks like you could end up with part of the system powerclamped
> in some situations, and with all of it powerclamped in other
> situations. Is that the case, or am I confused?
> 
could you explain the part that is partially powerclamped?

> 							Thanx, Paul
[Jacob Pan]

-- 
Thanks,

Jacob

^ permalink raw reply

* [PATCH] cpuidle: Measure idle state durations with monotonic clock
From: Julius Werner @ 2012-11-13 21:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Len Brown, Rafael J. Wysocki, Kevin Hilman, Andrew Morton,
	Srivatsa S. Bhat, linux-acpi, linux-pm, linuxppc-dev,
	Deepthi Dharwar, Trinabh Gupta, Daniel Lezcano, Sameer Nanda,
	Julius Werner

Many cpuidle drivers measure their time spent in an idle state by
reading the wallclock time before and after idling and calculating the
difference. This leads to erroneous results when the wallclock time gets
updated by another processor in the meantime, adding that clock
adjustment to the idle state's time counter.

If the clock adjustment was negative, the result is even worse due to an
erroneous cast from int to unsigned long long of the last_residency
variable. The negative 32 bit integer will zero-extend and result in a
forward time jump of roughly four billion milliseconds or 1.3 hours on
the idle state residency counter.

This patch changes all affected cpuidle drivers to use the monotonic
clock for their measurements instead. It also removes the erroneous
cast, making sure that negative residency values are applied correctly
even though they should not appear anymore.

Signed-off-by: Julius Werner <jwerner@chromium.org>
---
 arch/powerpc/platforms/pseries/processor_idle.c |    4 ++--
 drivers/acpi/processor_idle.c                   |   12 ++++++------
 drivers/cpuidle/cpuidle.c                       |    3 +--
 drivers/idle/intel_idle.c                       |   13 ++++---------
 4 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/processor_idle.c b/arch/powerpc/platforms/pseries/processor_idle.c
index 45d00e5..4d806b4 100644
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ b/arch/powerpc/platforms/pseries/processor_idle.c
@@ -36,7 +36,7 @@ static struct cpuidle_state *cpuidle_state_table;
 static inline void idle_loop_prolog(unsigned long *in_purr, ktime_t *kt_before)
 {
 
-	*kt_before = ktime_get_real();
+	*kt_before = ktime_get();
 	*in_purr = mfspr(SPRN_PURR);
 	/*
 	 * Indicate to the HV that we are idle. Now would be
@@ -50,7 +50,7 @@ static inline  s64 idle_loop_epilog(unsigned long in_purr, ktime_t kt_before)
 	get_lppaca()->wait_state_cycles += mfspr(SPRN_PURR) - in_purr;
 	get_lppaca()->idle = 0;
 
-	return ktime_to_us(ktime_sub(ktime_get_real(), kt_before));
+	return ktime_to_us(ktime_sub(ktime_get(), kt_before));
 }
 
 static int snooze_loop(struct cpuidle_device *dev,
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index e8086c7..8c98d73 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -751,9 +751,9 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
 
 
 	lapic_timer_state_broadcast(pr, cx, 1);
-	kt1 = ktime_get_real();
+	kt1 = ktime_get();
 	acpi_idle_do_entry(cx);
-	kt2 = ktime_get_real();
+	kt2 = ktime_get();
 	idle_time =  ktime_to_us(ktime_sub(kt2, kt1));
 
 	/* Update device last_residency*/
@@ -843,11 +843,11 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
-	kt1 = ktime_get_real();
+	kt1 = ktime_get();
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
 	acpi_idle_do_entry(cx);
-	kt2 = ktime_get_real();
+	kt2 = ktime_get();
 	idle_time_ns = ktime_to_ns(ktime_sub(kt2, kt1));
 	idle_time = idle_time_ns;
 	do_div(idle_time, NSEC_PER_USEC);
@@ -934,7 +934,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 	 */
 	lapic_timer_state_broadcast(pr, cx, 1);
 
-	kt1 = ktime_get_real();
+	kt1 = ktime_get();
 	/*
 	 * disable bus master
 	 * bm_check implies we need ARB_DIS
@@ -965,7 +965,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 		c3_cpu_count--;
 		raw_spin_unlock(&c3_lock);
 	}
-	kt2 = ktime_get_real();
+	kt2 = ktime_get();
 	idle_time_ns = ktime_to_ns(ktime_sub(kt2, kt1));
 	idle_time = idle_time_ns;
 	do_div(idle_time, NSEC_PER_USEC);
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 7f15b85..1536edd 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -109,8 +109,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		/* This can be moved to within driver enter routine
 		 * but that results in multiple copies of same code.
 		 */
-		dev->states_usage[entered_state].time +=
-				(unsigned long long)dev->last_residency;
+		dev->states_usage[entered_state].time += dev->last_residency;
 		dev->states_usage[entered_state].usage++;
 	} else {
 		dev->last_residency = 0;
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index b0f6b4c..6329a97 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -56,7 +56,7 @@
 #include <linux/kernel.h>
 #include <linux/cpuidle.h>
 #include <linux/clockchips.h>
-#include <linux/hrtimer.h>	/* ktime_get_real() */
+#include <linux/hrtimer.h>	/* ktime_get() */
 #include <trace/events/power.h>
 #include <linux/sched.h>
 #include <linux/notifier.h>
@@ -281,8 +281,7 @@ static int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state_usage *state_usage = &dev->states_usage[index];
 	unsigned long eax = (unsigned long)cpuidle_get_statedata(state_usage);
 	unsigned int cstate;
-	ktime_t kt_before, kt_after;
-	s64 usec_delta;
+	ktime_t kt_before;
 	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
@@ -297,7 +296,7 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	kt_before = ktime_get_real();
+	kt_before = ktime_get();
 
 	stop_critical_timings();
 	if (!need_resched()) {
@@ -310,17 +309,13 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	start_critical_timings();
 
-	kt_after = ktime_get_real();
-	usec_delta = ktime_to_us(ktime_sub(kt_after, kt_before));
+	dev->last_residency = ktime_to_us(ktime_sub(ktime_get(), kt_before));
 
 	local_irq_enable();
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
 
-	/* Update cpuidle counters */
-	dev->last_residency = (int)usec_delta;
-
 	return index;
 }
 
-- 
1.7.8.6

^ permalink raw reply related

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Arjan van de Ven @ 2012-11-13 21:56 UTC (permalink / raw)
  To: paulmck
  Cc: Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <20121113211602.GA30150@linux.vnet.ibm.com>

On 11/13/2012 1:16 PM, Paul E. McKenney wrote:
> On Mon, Nov 12, 2012 at 02:03:51PM -0800, Jacob Pan wrote:
>> Intel PowerClamp driver performs synchronized idle injection across
>> all online CPUs. The goal is to maintain a given package level C-state
>> ratio.
>>
>> Compared to other throttling methods already exist in the kernel,
>> such as ACPI PAD (taking CPUs offline) and clock modulation, this is often
>> more efficient in terms of performance per watt.
>>
>> Please refer to Documentation/thermal/intel_powerclamp.txt for more details.
> 
> If I read this correctly, this forces a group of CPUs into idle for
> about 600 milliseconds at a time.  This would indeed delay grace periods,

6ms not 600



^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Paul E. McKenney @ 2012-11-13 22:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Linux PM, LKML, Rafael Wysocki, Len Brown, Thomas Gleixner,
	H. Peter Anvin, Ingo Molnar, Zhang Rui, Rob Landley,
	Arjan van de Ven
In-Reply-To: <20121113133922.47144a50@chromoly>

On Tue, Nov 13, 2012 at 01:39:22PM -0800, Jacob Pan wrote:
> On Tue, 13 Nov 2012 13:16:02 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > Please refer to Documentation/thermal/intel_powerclamp.txt for more
> > > details.  
> > 
> > If I read this correctly, this forces a group of CPUs into idle for
> > about 600 milliseconds at a time.  This would indeed delay grace
> > periods, which could easily result in user complaints.  Also, given
> > the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
> > RCU_BOOST, you would see needless RCU priority boosting.
> > 
> the default idle injection duration is 6ms. we adjust the sleep
> interval to ensure idle ratio. So the idle duration stays the same once
> set. So would it be safe to delay grace period for this small amount in
> exchange for less over head in each injection period?

Ah, 6ms of delay is much better than 600ms.  Should be OK (famous last
words!).

> > Of course, if the idle period extended for longer, you would see RCU
> > CPU stall warnings.  And if the idle period extended indefinitely, you
> > could hang the system -- the RCU callbacks on the idled CPU could not
> > be invoked, and if one of those RCU callbacks was waking someone up,
> > that someone would not be woken up.
> > 
> for the same algorithm, idle duration is not extended. the injected
> idle loop also yield to pending softirqs, i guess that is what rcu
> callbacks are using?

For most kernel configuration options, it does use softirq.  And yes,
the kthread you are using would yield to softirqs -- but only as long
as softirq processing hasn't moved over to ksoftirqd.  Longer term,
RCU will be moving from softirq to kthreads, though, and these might be
prempted by your powerclamp kthread, depending on priorities.  It looks
like you use RT prio 50, which would usually preempt the RCU kthreads
(unless someone changed the priorities).

> > It looks like you could end up with part of the system powerclamped
> > in some situations, and with all of it powerclamped in other
> > situations. Is that the case, or am I confused?
> > 
> could you explain the part that is partially powerclamped?

Suppose that a given system has two sockets.  Are the two sockets
powerclamped independently, or at the same time?  My guess was the
former, but looking at the code again, it looks like the latter.
So it is a good thing I asked, I guess.  ;-)

 							Thanx, Paul

> [Jacob Pan]
> 
> -- 
> Thanks,
> 
> Jacob
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Arjan van de Ven @ 2012-11-13 22:45 UTC (permalink / raw)
  To: paulmck
  Cc: Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <20121113222350.GH2489@linux.vnet.ibm.com>

On 11/13/2012 2:23 PM, Paul E. McKenney wrote:
> On Tue, Nov 13, 2012 at 01:39:22PM -0800, Jacob Pan wrote:
>> On Tue, 13 Nov 2012 13:16:02 -0800
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>>
>>>> Please refer to Documentation/thermal/intel_powerclamp.txt for more
>>>> details.  
>>>
>>> If I read this correctly, this forces a group of CPUs into idle for
>>> about 600 milliseconds at a time.  This would indeed delay grace
>>> periods, which could easily result in user complaints.  Also, given
>>> the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
>>> RCU_BOOST, you would see needless RCU priority boosting.
>>>
>> the default idle injection duration is 6ms. we adjust the sleep
>> interval to ensure idle ratio. So the idle duration stays the same once
>> set. So would it be safe to delay grace period for this small amount in
>> exchange for less over head in each injection period?
> 
> Ah, 6ms of delay is much better than 600ms.  Should be OK (famous last
> words!).

well... power clamping is not "free".
You're going to lose performance as a trade off for dropping instantaneous power consumption....
in the measurements we've done comparing various methods.. this one is doing remarkably well.

> 
> For most kernel configuration options, it does use softirq.  And yes,
> the kthread you are using would yield to softirqs -- but only as long
> as softirq processing hasn't moved over to ksoftirqd.  Longer term,
> RCU will be moving from softirq to kthreads, though, and these might be
> prempted by your powerclamp kthread, depending on priorities.  It looks
> like you use RT prio 50, which would usually preempt the RCU kthreads
> (unless someone changed the priorities).

we tried to pick a "middle of the road" value, so that usages that really really
want to run, still get to run, but things that are more loose about it, get put on hold.


> 
>>> It looks like you could end up with part of the system powerclamped
>>> in some situations, and with all of it powerclamped in other
>>> situations. Is that the case, or am I confused?
>>>
>> could you explain the part that is partially powerclamped?
> 
> Suppose that a given system has two sockets.  Are the two sockets
> powerclamped independently, or at the same time?  My guess was the
> former, but looking at the code again, it looks like the latter.
> So it is a good thing I asked, I guess.  ;-)

they are clamped together, and they have to.
you don't get (on the systems where this driver works) any "package" C state unless
all packages are idle completely.
And it's these package C states where the real deep power savings happen, that's
why they are such a juicy target for power clamping ;-)


^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Rafael J. Wysocki @ 2012-11-13 23:02 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: paulmck, Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <50A2CD77.7000403@linux.intel.com>

On Tuesday, November 13, 2012 02:45:11 PM Arjan van de Ven wrote:
> On 11/13/2012 2:23 PM, Paul E. McKenney wrote:
> > On Tue, Nov 13, 2012 at 01:39:22PM -0800, Jacob Pan wrote:
> >> On Tue, 13 Nov 2012 13:16:02 -0800
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >>>> Please refer to Documentation/thermal/intel_powerclamp.txt for more
> >>>> details.  
> >>>
> >>> If I read this correctly, this forces a group of CPUs into idle for
> >>> about 600 milliseconds at a time.  This would indeed delay grace
> >>> periods, which could easily result in user complaints.  Also, given
> >>> the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
> >>> RCU_BOOST, you would see needless RCU priority boosting.
> >>>
> >> the default idle injection duration is 6ms. we adjust the sleep
> >> interval to ensure idle ratio. So the idle duration stays the same once
> >> set. So would it be safe to delay grace period for this small amount in
> >> exchange for less over head in each injection period?
> > 
> > Ah, 6ms of delay is much better than 600ms.  Should be OK (famous last
> > words!).
> 
> well... power clamping is not "free".
> You're going to lose performance as a trade off for dropping instantaneous power consumption....

Yes.  It is good to realize that when the clamping triggers, we already
have some more to worry about than losing some performance. :-)

The problem here is to find a way to lose as little performance as we possibly
can and prevent the system from overheating at the same time.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-13 23:43 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211121110520.26926-100000@netrider.rowland.org>

On Monday, November 12, 2012 11:32:26 AM Alan Stern wrote:
> On Mon, 12 Nov 2012, Huang Ying wrote:
> 
> > > > Is it absolute necessary to call pm_runtime_set_suspended?  If the
> > > > device is disabled, the transition to SUSPENDED state will not be
> > > > triggered even if the device is ACTIVE.
> > > 
> > > It's not absolutely necessary to do this, but we ought to because it 
> > > will allow the device's parent to be suspended.  If we leave the device 
> > > in the ACTIVE state then the parent can't be suspended, even when the 
> > > device is disabled.
> > 
> > I think this is the hard part of the issue.  Now "disabled" and
> > SUSPENDED state is managed by hand.  IMHO, if we changed
> > pm_runtime_allow() as you said, we need to change the rule too.  Maybe
> > something as follow:
> > 
> > - remove pm_runtime_set_suspended/pm_runtime_set_active
> 
> We can't remove them entirely.  They are needed for situations where 
> the device's physical state is different from what the PM core thinks; 
> they tell the PM core what the actual state is.
> 
> > - in pm_runtime_disable/pm_runtime_allow, put device into SUSPENDED
> > state if runtime PM is not forbidden.
> 
> That doesn't make sense.  Runtime PM is never forbidden after 
> pm_runtime_allow is called; that's its purpose.
> 
> > - in pm_runtime_forbid/pm_runtime_enable, put device into ACTIVE state.
> 
> Why should pm_runtime_enable put the device into the ACTIVE state?
> 
> No, I think a better approach is simply to say that the device will
> never be allowed to be in the SUSPENDED state if runtime PM is
> forbidden.  We already enforce this when the device is enabled for 
> runtime PM, so we would have to start enforcing it when the device is 
> disabled.

Sorry for the delay, I needed to take care of some ACPI changes urgently.

(Without reading the rest of the thread yet) I think that would be
a reasonable approach.

> This means:
> 
> 	pm_runtime_set_suspended should fail if dev->power.runtime_auto
> 	is clear.
> 
> 	pm_runtime_forbid should call pm_runtime_set_active if
> 	dev->power.disable_depth > 0.  (This would run into a problem
> 	if the parent is suspended and disabled.  Maybe 
> 	pm_runtime_forbid should fail when this happens.)
> 
> Finally, we probably should make a third change even though it isn't
> strictly necessary:
> 
> 	pm_runtime_allow should call pm_runtime_set_suspended if
> 	dev->power.disable_depth > 0.
> 
> Rafael, what do you think?

As I said, sounds reasonable.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH v2 1/6] usb: Register usb port's acpi power resources
From: Rafael J. Wysocki @ 2012-11-13 23:56 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Sergei Shtylyov, Lan Tianyu, gregkh, sarah.a.sharp, stern,
	oneukum, linux-usb, Linux PM list
In-Reply-To: <50A23EBF.1070304@gmail.com>

On Tuesday, November 13, 2012 08:36:15 PM Lan Tianyu wrote:
> 于 2012/11/13 19:07, Sergei Shtylyov 写道:
> > Hello.
> >
> > On 13-11-2012 12:00, Lan Tianyu wrote:
> >
> >> This patch is to register usb port's acpi power resources. Create
> >> link between usb port device and its acpi power resource.
> >
> >> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> > [...]
> >
> >> diff --git a/drivers/usb/core/usb-acpi.c b/drivers/usb/core/usb-acpi.c
> >> index cef4252..c58ebc0 100644
> >> --- a/drivers/usb/core/usb-acpi.c
> >> +++ b/drivers/usb/core/usb-acpi.c
> >> @@ -216,6 +216,28 @@ static struct acpi_bus_type usb_acpi_bus = {
> >>       .find_device = usb_acpi_find_device,
> >>   };
> >>
> >> +int usb_acpi_register_power_resources(struct device *dev)
> >> +{
> >> +    acpi_handle port_handle = DEVICE_ACPI_HANDLE(dev);
> >> +
> >> +    if (!port_handle)
> >> +        return -ENODEV;
> >> +
> >> +    if (acpi_power_resource_register_device(dev, port_handle) < 0)
> >> +        return -ENODEV;
> >> +    return 0;
> >> +}
> >> +
> >> +void usb_acpi_unregister_power_resources(struct device *dev)
> >> +{
> >> +    acpi_handle port_handle = DEVICE_ACPI_HANDLE(dev);
> >> +
> >> +    if (!port_handle)
> >> +        return;
> >> +
> >> +    acpi_power_resource_register_device(dev, port_handle);
> >
> >     I thinbk you have been askied already, but shouldn't it be
> > acpi_power_resource_unregister_device()?
> >
> Oh. Sorry. Too focus on the other modification. Thanks for your reminder.

Besides, it would be a bit more natural to do

if (port_handle)
	acpi_power_resource_unregister_device(dev, port_handle);

instead of doing that return when port_handle is NULL.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH v2 2/6] usb: Add driver/usb/core/port.c file to fill usb port related code.
From: Rafael J. Wysocki @ 2012-11-14  0:04 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: gregkh, sarah.a.sharp, stern, oneukum, linux-usb, Linux PM list
In-Reply-To: <1352793605-4168-3-git-send-email-tianyu.lan@intel.com>

On Tuesday, November 13, 2012 04:00:01 PM Lan Tianyu wrote:
> This patch is to create driver/usb/core/port.c and move usb port related
> code into it.

It does seem to make functional changes in addition to that, however.

If I'm not mistaken and that really is the case, can you (briefly)
describe those changes here too?

Rafael


> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  drivers/usb/core/Makefile |    1 +
>  drivers/usb/core/hub.c    |  113 +++++++--------------------------------------
>  drivers/usb/core/port.c   |   82 ++++++++++++++++++++++++++++++++
>  drivers/usb/core/usb.h    |    3 ++
>  include/linux/usb.h       |   16 +++++++
>  5 files changed, 118 insertions(+), 97 deletions(-)
>  create mode 100644 drivers/usb/core/port.c
> 
> diff --git a/drivers/usb/core/Makefile b/drivers/usb/core/Makefile
> index 26059b9..5e847ad 100644
> --- a/drivers/usb/core/Makefile
> +++ b/drivers/usb/core/Makefile
> @@ -7,6 +7,7 @@ ccflags-$(CONFIG_USB_DEBUG) := -DDEBUG
>  usbcore-y := usb.o hub.o hcd.o urb.o message.o driver.o
>  usbcore-y += config.o file.o buffer.o sysfs.o endpoint.o
>  usbcore-y += devio.o notify.o generic.o quirks.o devices.o
> +usbcore-y += port.o
>  
>  usbcore-$(CONFIG_PCI)		+= hcd-pci.o
>  usbcore-$(CONFIG_ACPI)		+= usb-acpi.o
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 3c85fe1c..60746aa 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -42,13 +42,6 @@
>  #define USB_VENDOR_GENESYS_LOGIC		0x05e3
>  #define HUB_QUIRK_CHECK_PORT_AUTOSUSPEND	0x01
>  
> -struct usb_port {
> -	struct usb_device *child;
> -	struct device dev;
> -	struct dev_state *port_owner;
> -	enum usb_port_connect_type connect_type;
> -};
> -
>  struct usb_hub {
>  	struct device		*intfdev;	/* the "interface" device */
>  	struct usb_device	*hdev;
> @@ -170,9 +163,6 @@ EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
>  #define HUB_DEBOUNCE_STEP	  25
>  #define HUB_DEBOUNCE_STABLE	 100
>  
> -#define to_usb_port(_dev) \
> -	container_of(_dev, struct usb_port, dev)
> -
>  static int usb_reset_and_verify_device(struct usb_device *udev);
>  
>  static inline char *portspeed(struct usb_hub *hub, int portstatus)
> @@ -1237,57 +1227,12 @@ static int hub_post_reset(struct usb_interface *intf)
>  	return 0;
>  }
>  
> -static void usb_port_device_release(struct device *dev)
> -{
> -	struct usb_port *port_dev = to_usb_port(dev);
> -
> -	usb_acpi_unregister_power_resources(dev);
> -	kfree(port_dev);
> -}
> -
>  static void usb_hub_remove_port_device(struct usb_hub *hub,
>  				       int port1)
>  {
>  	device_unregister(&hub->ports[port1 - 1]->dev);
>  }
>  
> -struct device_type usb_port_device_type = {
> -	.name =		"usb_port",
> -	.release =	usb_port_device_release,
> -};
> -
> -static int usb_hub_create_port_device(struct usb_hub *hub,
> -				      int port1)
> -{
> -	struct usb_port *port_dev = NULL;
> -	int retval;
> -
> -	port_dev = kzalloc(sizeof(*port_dev), GFP_KERNEL);
> -	if (!port_dev) {
> -		retval = -ENOMEM;
> -		goto exit;
> -	}
> -
> -	hub->ports[port1 - 1] = port_dev;
> -	port_dev->dev.parent = hub->intfdev;
> -	port_dev->dev.groups = port_dev_group;
> -	port_dev->dev.type = &usb_port_device_type;
> -	dev_set_name(&port_dev->dev, "port%d", port1);
> -
> -	retval = device_register(&port_dev->dev);
> -	if (retval)
> -		goto error_register;
> -
> -	usb_acpi_register_power_resources(&port_dev->dev);
> -
> -	return 0;
> -
> -error_register:
> -	put_device(&port_dev->dev);
> -exit:
> -	return retval;
> -}
> -
>  static int hub_configure(struct usb_hub *hub,
>  	struct usb_endpoint_descriptor *endpoint)
>  {
> @@ -1548,10 +1493,24 @@ static int hub_configure(struct usb_hub *hub,
>  	if (hub->has_indicators && blinkenlights)
>  		hub->indicator [0] = INDICATOR_CYCLE;
>  
> -	for (i = 0; i < hdev->maxchild; i++)
> -		if (usb_hub_create_port_device(hub, i + 1) < 0)
> +	for (i = 0; i < hdev->maxchild; i++) {
> +		struct usb_port *port_dev = NULL;
> +
> +		port_dev = kzalloc(sizeof(*port_dev), GFP_KERNEL);
> +		if (!port_dev) {
> +			dev_err(hub->intfdev,
> +				"couldn't create port%d device due to lack mem.\n", i + 1);
> +			continue;
> +		}
> +
> +		hub->ports[i] = port_dev;
> +
> +		if (usb_hub_create_port_device(hub->intfdev, i + 1, port_dev) < 0) {
>  			dev_err(hub->intfdev,
>  				"couldn't create port%d device.\n", i + 1);
> +			hub->ports[i] = NULL;
> +		}
> +	}
>  
>  	if (!hub_is_superspeed(hdev)) {
>  		for (i = 1; i <= hdev->maxchild; i++)
> @@ -4765,46 +4724,6 @@ static int hub_thread(void *__unused)
>  	return 0;
>  }
>  
> -static ssize_t show_port_connect_type(struct device *dev,
> -	struct device_attribute *attr, char *buf)
> -{
> -	struct usb_port *port_dev = to_usb_port(dev);
> -	char *result;
> -
> -	switch (port_dev->connect_type) {
> -	case USB_PORT_CONNECT_TYPE_HOT_PLUG:
> -		result = "hotplug";
> -		break;
> -	case USB_PORT_CONNECT_TYPE_HARD_WIRED:
> -		result = "hardwired";
> -		break;
> -	case USB_PORT_NOT_USED:
> -		result = "not used";
> -		break;
> -	default:
> -		result = "unknown";
> -		break;
> -	}
> -
> -	return sprintf(buf, "%s\n", result);
> -}
> -static DEVICE_ATTR(connect_type, S_IRUGO, show_port_connect_type,
> -		NULL);
> -
> -static struct attribute *port_dev_attrs[] = {
> -	&dev_attr_connect_type.attr,
> -	NULL,
> -};
> -
> -static struct attribute_group port_dev_attr_grp = {
> -	.attrs = port_dev_attrs,
> -};
> -
> -static const struct attribute_group *port_dev_group[] = {
> -	&port_dev_attr_grp,
> -	NULL,
> -};
> -
>  static const struct usb_device_id hub_id_table[] = {
>      { .match_flags = USB_DEVICE_ID_MATCH_VENDOR
>  	           | USB_DEVICE_ID_MATCH_INT_CLASS,
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> new file mode 100644
> index 0000000..6523a03
> --- /dev/null
> +++ b/drivers/usb/core/port.c
> @@ -0,0 +1,82 @@
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/module.h>
> +#include <linux/usb.h>
> +
> +#include "usb.h"
> +
> +static ssize_t show_port_connect_type(struct device *dev,
> +	struct device_attribute *attr, char *buf)
> +{
> +	struct usb_port *port_dev = to_usb_port(dev);
> +	char *result;
> +
> +	switch (port_dev->connect_type) {
> +	case USB_PORT_CONNECT_TYPE_HOT_PLUG:
> +		result = "hotplug";
> +		break;
> +	case USB_PORT_CONNECT_TYPE_HARD_WIRED:
> +		result = "hardwired";
> +		break;
> +	case USB_PORT_NOT_USED:
> +		result = "not used";
> +		break;
> +	default:
> +		result = "unknown";
> +		break;
> +	}
> +
> +	return sprintf(buf, "%s\n", result);
> +}
> +static DEVICE_ATTR(connect_type, S_IRUGO, show_port_connect_type,
> +		NULL);
> +
> +static struct attribute *port_dev_attrs[] = {
> +	&dev_attr_connect_type.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group port_dev_attr_grp = {
> +	.attrs = port_dev_attrs,
> +};
> +
> +static const struct attribute_group *port_dev_group[] = {
> +	&port_dev_attr_grp,
> +	NULL,
> +};
> +
> +static void usb_port_device_release(struct device *dev)
> +{
> +	struct usb_port *port_dev = to_usb_port(dev);
> +
> +	usb_acpi_unregister_power_resources(dev);
> +	kfree(port_dev);
> +}
> +
> +struct device_type usb_port_device_type = {
> +	.name =		"usb_port",
> +	.release =	usb_port_device_release,
> +};
> +
> +int usb_hub_create_port_device(struct device *intfdev,
> +		int port1, struct usb_port *port_dev)
> +{
> +	int retval;
> +
> +	port_dev->dev.parent = intfdev;
> +	port_dev->dev.groups = port_dev_group;
> +	port_dev->dev.type = &usb_port_device_type;
> +	dev_set_name(&port_dev->dev, "port%d", port1);
> +
> +	retval = device_register(&port_dev->dev);
> +	if (retval)
> +		goto error_register;
> +
> +	usb_acpi_register_power_resources(&port_dev->dev);
> +
> +	return 0;
> +
> +error_register:
> +	put_device(&port_dev->dev);
> +	return retval;
> +}
> diff --git a/drivers/usb/core/usb.h b/drivers/usb/core/usb.h
> index 38958fb..de7434b 100644
> --- a/drivers/usb/core/usb.h
> +++ b/drivers/usb/core/usb.h
> @@ -60,6 +60,9 @@ extern void usb_hub_cleanup(void);
>  extern int usb_major_init(void);
>  extern void usb_major_cleanup(void);
>  
> +extern int usb_hub_create_port_device(struct device *intfdev,
> +		int port1, struct usb_port *port_dev);
> +
>  #ifdef	CONFIG_PM
>  
>  extern int usb_suspend(struct device *dev, pm_message_t msg);
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index e996bb6..c1f1346 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -572,6 +572,22 @@ struct usb_device {
>  };
>  #define	to_usb_device(d) container_of(d, struct usb_device, dev)
>  
> +/**
> + * struct usb port - kernel's representation of a usb port
> + * @child: usb device attatched to the port
> + * @dev: generic device interface
> + * @port_owner: port's owner
> + * @connect_type: port's connect type
> + */
> +struct usb_port {
> +	struct usb_device *child;
> +	struct device dev;
> +	struct dev_state *port_owner;
> +	enum usb_port_connect_type connect_type;
> +};
> +#define to_usb_port(_dev) \
> +	container_of(_dev, struct usb_port, dev)
> +
>  static inline struct usb_device *interface_to_usbdev(struct usb_interface *intf)
>  {
>  	return to_usb_device(intf->dev.parent);
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Paul E. McKenney @ 2012-11-14  0:03 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <50A2CD77.7000403@linux.intel.com>

On Tue, Nov 13, 2012 at 02:45:11PM -0800, Arjan van de Ven wrote:
> On 11/13/2012 2:23 PM, Paul E. McKenney wrote:
> > On Tue, Nov 13, 2012 at 01:39:22PM -0800, Jacob Pan wrote:
> >> On Tue, 13 Nov 2012 13:16:02 -0800
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >>>> Please refer to Documentation/thermal/intel_powerclamp.txt for more
> >>>> details.  
> >>>
> >>> If I read this correctly, this forces a group of CPUs into idle for
> >>> about 600 milliseconds at a time.  This would indeed delay grace
> >>> periods, which could easily result in user complaints.  Also, given
> >>> the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
> >>> RCU_BOOST, you would see needless RCU priority boosting.
> >>>
> >> the default idle injection duration is 6ms. we adjust the sleep
> >> interval to ensure idle ratio. So the idle duration stays the same once
> >> set. So would it be safe to delay grace period for this small amount in
> >> exchange for less over head in each injection period?
> > 
> > Ah, 6ms of delay is much better than 600ms.  Should be OK (famous last
> > words!).
> 
> well... power clamping is not "free".
> You're going to lose performance as a trade off for dropping instantaneous power consumption....
> in the measurements we've done comparing various methods.. this one is doing remarkably well.

No argument here.  My concern is not performance in this situation, but
rather in-kernel confusion, particularly any such confusion involving RCU.

And understood, you can get similar effects from virtualization.
For all I know, the virtualization guys might leverage your experience
with power clamping to push for gang scheduling once more.  ;-)

> > For most kernel configuration options, it does use softirq.  And yes,
> > the kthread you are using would yield to softirqs -- but only as long
> > as softirq processing hasn't moved over to ksoftirqd.  Longer term,
> > RCU will be moving from softirq to kthreads, though, and these might be
> > prempted by your powerclamp kthread, depending on priorities.  It looks
> > like you use RT prio 50, which would usually preempt the RCU kthreads
> > (unless someone changed the priorities).
> 
> we tried to pick a "middle of the road" value, so that usages that really really
> want to run, still get to run, but things that are more loose about it, get put on hold.

Makes sense.

> >>> It looks like you could end up with part of the system powerclamped
> >>> in some situations, and with all of it powerclamped in other
> >>> situations. Is that the case, or am I confused?
> >>>
> >> could you explain the part that is partially powerclamped?
> > 
> > Suppose that a given system has two sockets.  Are the two sockets
> > powerclamped independently, or at the same time?  My guess was the
> > former, but looking at the code again, it looks like the latter.
> > So it is a good thing I asked, I guess.  ;-)
> 
> they are clamped together, and they have to.
> you don't get (on the systems where this driver works) any "package" C state unless
> all packages are idle completely.
> And it's these package C states where the real deep power savings happen, that's
> why they are such a juicy target for power clamping ;-)

OK, so the point of clamping all sockets simultaneously is to be able
to power down the electronics surrounding the sockets as well as the
sockets themselves?  If all you cared about was the individual sockets,
I don't see why you couldn't power the sockets down individually rather
than in sync with each other.

Just to make sure I am really understanding what is happening, let's
suppose we have a HZ=1000 system that has a few tasks that occasionally
run at prio 99.  These tasks would run during the clamp interval,
but would (for example) see the jiffies counter remaining at the
value at the beginning of the clamp interval until the end of that
interval, when the jiffies counter would suddenly jump by roughly
six counts, right?

If so, this could cause some (minor) RCU issues, such as RCU
deciding to force quiescent states right at the end of a clamping
interval, even though none of the RCU readers would have had a
chance to do anything in the meantime.  Shouldn't result in a
bug though, just wasted motion.

I think I know, but I feel the need to ask anyway.  Why not tell
RCU about the clamping?

							Thanx, Paul

^ permalink raw reply

* Re: [RFC PATCH v2 3/6] usb: add runtime pm support for usb port device
From: Rafael J. Wysocki @ 2012-11-14  0:08 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: gregkh, sarah.a.sharp, stern, oneukum, linux-usb, Linux PM list
In-Reply-To: <1352793605-4168-4-git-send-email-tianyu.lan@intel.com>

On Tuesday, November 13, 2012 04:00:02 PM Lan Tianyu wrote:
> This patch is to add runtime pm callback for usb port device.
> Set/clear PORT_POWER feature in the resume/suspend callbak.
> Add portnum for struct usb_port to record port number.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

This one looks reasonable to me.  From the PM side

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/usb/core/hub.c  |   14 ++++++++++++++
>  drivers/usb/core/port.c |   39 +++++++++++++++++++++++++++++++++++++++
>  drivers/usb/core/usb.h  |    2 ++
>  include/linux/usb.h     |    2 ++
>  4 files changed, 57 insertions(+)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 60746aa..2cb414e 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -765,6 +765,20 @@ static void hub_tt_work(struct work_struct *work)
>  	spin_unlock_irqrestore (&hub->tt.lock, flags);
>  }
>  
> +int usb_hub_set_port_power(struct usb_device *hdev, int port1,
> +		bool set)
> +{
> +	int ret;
> +
> +	if (set)
> +		ret = set_port_feature(hdev, port1,
> +				USB_PORT_FEAT_POWER);
> +	else
> +		ret = clear_port_feature(hdev, port1,
> +				USB_PORT_FEAT_POWER);
> +	return ret;
> +}
> +
>  /**
>   * usb_hub_clear_tt_buffer - clear control/bulk TT state in high speed hub
>   * @urb: an URB associated with the failed or incomplete split transaction
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> index 6523a03..1cda766 100644
> --- a/drivers/usb/core/port.c
> +++ b/drivers/usb/core/port.c
> @@ -53,9 +53,45 @@ static void usb_port_device_release(struct device *dev)
>  	kfree(port_dev);
>  }
>  
> +static int usb_port_runtime_resume(struct device *dev)
> +{
> +	struct usb_port *port_dev = to_usb_port(dev);
> +	struct usb_device *hdev =
> +		to_usb_device(dev->parent->parent);
> +
> +	return usb_hub_set_port_power(hdev, port_dev->portnum,
> +			true);
> +}
> +
> +static int usb_port_runtime_suspend(struct device *dev)
> +{
> +	struct usb_port *port_dev = to_usb_port(dev);
> +	struct usb_device *hdev =
> +		to_usb_device(dev->parent->parent);
> +
> +	return usb_hub_set_port_power(hdev, port_dev->portnum,
> +			false);
> +}
> +
> +static int usb_port_runtime_idle(struct device *dev)
> +{
> +	struct usb_port *port_dev = to_usb_port(dev);
> +
> +	return pm_runtime_suspend(&port_dev->dev);
> +}
> +
> +static const struct dev_pm_ops usb_port_pm_ops = {
> +#ifdef CONFIG_USB_SUSPEND
> +.runtime_suspend =	usb_port_runtime_suspend,
> +.runtime_resume =	usb_port_runtime_resume,
> +.runtime_idle =		usb_port_runtime_idle,
> +#endif
> +};
> +
>  struct device_type usb_port_device_type = {
>  	.name =		"usb_port",
>  	.release =	usb_port_device_release,
> +	.pm = &usb_port_pm_ops,
>  };
>  
>  int usb_hub_create_port_device(struct device *intfdev,
> @@ -63,6 +99,7 @@ int usb_hub_create_port_device(struct device *intfdev,
>  {
>  	int retval;
>  
> +	port_dev->portnum = port1;
>  	port_dev->dev.parent = intfdev;
>  	port_dev->dev.groups = port_dev_group;
>  	port_dev->dev.type = &usb_port_device_type;
> @@ -72,6 +109,8 @@ int usb_hub_create_port_device(struct device *intfdev,
>  	if (retval)
>  		goto error_register;
>  
> +	pm_runtime_set_active(&port_dev->dev);
> +	pm_runtime_enable(&port_dev->dev);
>  	usb_acpi_register_power_resources(&port_dev->dev);
>  
>  	return 0;
> diff --git a/drivers/usb/core/usb.h b/drivers/usb/core/usb.h
> index de7434b..47d77d4 100644
> --- a/drivers/usb/core/usb.h
> +++ b/drivers/usb/core/usb.h
> @@ -62,6 +62,8 @@ extern void usb_major_cleanup(void);
>  
>  extern int usb_hub_create_port_device(struct device *intfdev,
>  		int port1, struct usb_port *port_dev);
> +extern int usb_hub_set_port_power(struct usb_device *hdev,
> +		int port1, bool set);
>  
>  #ifdef	CONFIG_PM
>  
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index c1f1346..71510bf 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -578,12 +578,14 @@ struct usb_device {
>   * @dev: generic device interface
>   * @port_owner: port's owner
>   * @connect_type: port's connect type
> + * @portnum: port index num based one
>   */
>  struct usb_port {
>  	struct usb_device *child;
>  	struct device dev;
>  	struct dev_state *port_owner;
>  	enum usb_port_connect_type connect_type;
> +	u8 portnum;
>  };
>  #define to_usb_port(_dev) \
>  	container_of(_dev, struct usb_port, dev)
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Paul E. McKenney @ 2012-11-14  0:03 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Arjan van de Ven, Jacob Pan, Linux PM, LKML, Rafael Wysocki,
	Len Brown, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	Zhang Rui, Rob Landley
In-Reply-To: <21157900.tI2QCFTxxq@vostro.rjw.lan>

On Wed, Nov 14, 2012 at 12:02:00AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, November 13, 2012 02:45:11 PM Arjan van de Ven wrote:
> > On 11/13/2012 2:23 PM, Paul E. McKenney wrote:
> > > On Tue, Nov 13, 2012 at 01:39:22PM -0800, Jacob Pan wrote:
> > >> On Tue, 13 Nov 2012 13:16:02 -0800
> > >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >>
> > >>>> Please refer to Documentation/thermal/intel_powerclamp.txt for more
> > >>>> details.  
> > >>>
> > >>> If I read this correctly, this forces a group of CPUs into idle for
> > >>> about 600 milliseconds at a time.  This would indeed delay grace
> > >>> periods, which could easily result in user complaints.  Also, given
> > >>> the default RCU_BOOST_DELAY of 500 milliseconds in kernels enabling
> > >>> RCU_BOOST, you would see needless RCU priority boosting.
> > >>>
> > >> the default idle injection duration is 6ms. we adjust the sleep
> > >> interval to ensure idle ratio. So the idle duration stays the same once
> > >> set. So would it be safe to delay grace period for this small amount in
> > >> exchange for less over head in each injection period?
> > > 
> > > Ah, 6ms of delay is much better than 600ms.  Should be OK (famous last
> > > words!).
> > 
> > well... power clamping is not "free".
> > You're going to lose performance as a trade off for dropping instantaneous power consumption....
> 
> Yes.  It is good to realize that when the clamping triggers, we already
> have some more to worry about than losing some performance. :-)
> 
> The problem here is to find a way to lose as little performance as we possibly
> can and prevent the system from overheating at the same time.

Understood.  My concern is in-kernel confusion rather than performance.

							Thanx, Paul


^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Arjan van de Ven @ 2012-11-14  0:08 UTC (permalink / raw)
  To: paulmck
  Cc: Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <20121114000259.GK2489@linux.vnet.ibm.com>


> 
> OK, so the point of clamping all sockets simultaneously is to be able
> to power down the electronics surrounding the sockets as well as the
> sockets themselves?  

yup; memory can go to self refresh etc etc

>If all you cared about was the individual sockets,
> I don't see why you couldn't power the sockets down individually rather
> than in sync with each other.

the hardware that this driver supports does not support powering down sockets individually.

(since the memory controllers are part of the "socket"... it would increase latency
etc etc, and likely wreak havoc with the cache coherency protocols)

> I think I know, but I feel the need to ask anyway.  Why not tell
> RCU about the clamping?

I don't mind telling RCU, but what cannot happen is a bunch of CPU time suddenly getting used
(since that is the opposite of what is needed at the specific point in time of going idle)


^ permalink raw reply

* Re: [RFC PATCH v2 4/6] usb: add usb port auto power off mechanism
From: Rafael J. Wysocki @ 2012-11-14  0:16 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: gregkh, sarah.a.sharp, stern, oneukum, linux-usb, Linux PM list
In-Reply-To: <1352793605-4168-5-git-send-email-tianyu.lan@intel.com>

On Tuesday, November 13, 2012 04:00:03 PM Lan Tianyu wrote:
> This patch is to add usb port auto power off mechanism.
> When usb device is suspending, usb core will suspend usb port and
> usb port runtime pm callback will clear PORT_POWER feature to
> power off port if all conditions were met.These conditions are
> remote wakeup disable, pm qos NO_POWER_OFF flag clear and persist
> enable. When device is suspended and power off, usb core
> will ignore port's connect and enable change event to keep the device
> not being disconnected. When it resumes, power on port again.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

Again, from the PM side:

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

although as far as the USB internals are concerned, I'm far from being an expert. :-)

Thanks,
Rafael


> ---
>  drivers/usb/core/hub.c  |   76 +++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/usb/core/port.c |    1 +
>  include/linux/usb.h     |    2 ++
>  3 files changed, 76 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 2cb414e..267e9d7 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -26,6 +26,7 @@
>  #include <linux/mutex.h>
>  #include <linux/freezer.h>
>  #include <linux/random.h>
> +#include <linux/pm_qos.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/byteorder.h>
> @@ -159,11 +160,14 @@ MODULE_PARM_DESC(use_both_schemes,
>  DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
>  EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
>  
> +#define HUB_PORT_RECONNECT_TRIES	20
> +
>  #define HUB_DEBOUNCE_TIMEOUT	1500
>  #define HUB_DEBOUNCE_STEP	  25
>  #define HUB_DEBOUNCE_STABLE	 100
>  
>  static int usb_reset_and_verify_device(struct usb_device *udev);
> +static int hub_port_debounce(struct usb_hub *hub, int port1);
>  
>  static inline char *portspeed(struct usb_hub *hub, int portstatus)
>  {
> @@ -855,7 +859,11 @@ static unsigned hub_power_on(struct usb_hub *hub, bool do_delay)
>  		dev_dbg(hub->intfdev, "trying to enable port power on "
>  				"non-switchable hub\n");
>  	for (port1 = 1; port1 <= hub->descriptor->bNbrPorts; port1++)
> -		set_port_feature(hub->hdev, port1, USB_PORT_FEAT_POWER);
> +		if (hub->ports[port1 - 1]->power_on)
> +			set_port_feature(hub->hdev, port1, USB_PORT_FEAT_POWER);
> +		else
> +			clear_port_feature(hub->hdev, port1,
> +						USB_PORT_FEAT_POWER);
>  
>  	/* Wait at least 100 msec for power to become stable */
>  	delay = max(pgood_delay, (unsigned) 100);
> @@ -2852,6 +2860,7 @@ EXPORT_SYMBOL_GPL(usb_enable_ltm);
>  int usb_port_suspend(struct usb_device *udev, pm_message_t msg)
>  {
>  	struct usb_hub	*hub = hdev_to_hub(udev->parent);
> +	struct usb_port *port_dev = hub->ports[udev->portnum - 1];
>  	int		port1 = udev->portnum;
>  	int		status;
>  
> @@ -2946,6 +2955,19 @@ int usb_port_suspend(struct usb_device *udev, pm_message_t msg)
>  		udev->port_is_suspended = 1;
>  		msleep(10);
>  	}
> +
> +	/*
> +	 * Check whether current status is meeting requirement of
> +	 * usb port power off mechanism
> +	 */
> +	if (!udev->do_remote_wakeup
> +			&& (dev_pm_qos_flags(&port_dev->dev,
> +			PM_QOS_FLAG_NO_POWER_OFF) != PM_QOS_FLAGS_ALL)
> +			&& udev->persist_enabled
> +			&& !status)
> +		if (!pm_runtime_put_sync(&port_dev->dev))
> +			port_dev->power_on = false;
> +
>  	usb_mark_last_busy(hub->hdev);
>  	return status;
>  }
> @@ -3029,6 +3051,25 @@ static int finish_port_resume(struct usb_device *udev)
>  	return status;
>  }
>  
> +static int usb_port_wait_for_connected(struct usb_hub *hub, int port1)
> +{
> +	int status, i;
> +
> +	for (i = 0; i < HUB_PORT_RECONNECT_TRIES; i++) {
> +		status = hub_port_debounce(hub, port1);
> +		if (status & USB_PORT_STAT_CONNECTION) {
> +			/*
> +			 * just clear enable-change feature since debounce
> +			 *  has cleared connect-change feature.
> +			 */
> +			clear_port_feature(hub->hdev, port1,
> +					USB_PORT_FEAT_C_ENABLE);
> +			return 0;
> +		}
> +	}
> +	return -ETIMEDOUT;
> +}
> +
>  /*
>   * usb_port_resume - re-activate a suspended usb device's upstream port
>   * @udev: device to re-activate, not a root hub
> @@ -3066,10 +3107,36 @@ static int finish_port_resume(struct usb_device *udev)
>  int usb_port_resume(struct usb_device *udev, pm_message_t msg)
>  {
>  	struct usb_hub	*hub = hdev_to_hub(udev->parent);
> +	struct usb_port *port_dev = hub->ports[udev->portnum  - 1];
>  	int		port1 = udev->portnum;
>  	int		status;
>  	u16		portchange, portstatus;
>  
> +
> +	set_bit(port1, hub->busy_bits);
> +
> +	if (!port_dev->power_on) {
> +		status = pm_runtime_get_sync(&port_dev->dev);
> +		if (status) {
> +			dev_dbg(&udev->dev, "can't resume usb port, status %d\n",
> +					status);
> +			return status;
> +		}
> +
> +		port_dev->power_on = true;
> +
> +		/*
> +		 * Wait for usb hub port to be reconnected in order to make
> +		 * the resume procedure successful.
> +		 */
> +		status = usb_port_wait_for_connected(hub, port1);
> +		if (status < 0) {
> +			dev_dbg(&udev->dev, "can't get reconnection after setting port  power on, status %d\n",
> +					status);
> +			return status;
> +		}
> +	}
> +
>  	/* Skip the initial Clear-Suspend step for a remote wakeup */
>  	status = hub_port_status(hub, port1, &portstatus, &portchange);
>  	if (status == 0 && !port_is_suspended(hub, portstatus))
> @@ -3077,8 +3144,6 @@ int usb_port_resume(struct usb_device *udev, pm_message_t msg)
>  
>  	// dev_dbg(hub->intfdev, "resume port %d\n", port1);
>  
> -	set_bit(port1, hub->busy_bits);
> -
>  	/* see 7.1.7.7; affects power usage, but not budgeting */
>  	if (hub_is_superspeed(hub->hdev))
>  		status = set_port_feature(hub->hdev,
> @@ -4256,6 +4321,11 @@ static void hub_port_connect_change(struct usb_hub *hub, int port1,
>  		}
>  	}
>  
> +	if (!hub->ports[port1 - 1]->power_on) {
> +		clear_bit(port1, hub->change_bits);
> +		return;
> +	}
> +
>  	/* Disconnect any existing devices under this port */
>  	if (udev)
>  		usb_disconnect(&hub->ports[port1 - 1]->child);
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> index 1cda766..b7388fd 100644
> --- a/drivers/usb/core/port.c
> +++ b/drivers/usb/core/port.c
> @@ -99,6 +99,7 @@ int usb_hub_create_port_device(struct device *intfdev,
>  {
>  	int retval;
>  
> +	port_dev->power_on = true;
>  	port_dev->portnum = port1;
>  	port_dev->dev.parent = intfdev;
>  	port_dev->dev.groups = port_dev_group;
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index 71510bf..8002640 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -579,6 +579,7 @@ struct usb_device {
>   * @port_owner: port's owner
>   * @connect_type: port's connect type
>   * @portnum: port index num based one
> + * @power_on: port's power state
>   */
>  struct usb_port {
>  	struct usb_device *child;
> @@ -586,6 +587,7 @@ struct usb_port {
>  	struct dev_state *port_owner;
>  	enum usb_port_connect_type connect_type;
>  	u8 portnum;
> +	bool power_on;
>  };
>  #define to_usb_port(_dev) \
>  	container_of(_dev, struct usb_port, dev)
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH v2 5/6] usb: expose usb port's pm qos flags to user space
From: Rafael J. Wysocki @ 2012-11-14  0:21 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: gregkh, sarah.a.sharp, stern, oneukum, linux-usb, Linux PM list
In-Reply-To: <1352793605-4168-6-git-send-email-tianyu.lan@intel.com>

On Tuesday, November 13, 2012 04:00:04 PM Lan Tianyu wrote:
> This patch is to expose usb port's pm qos flags(pm_qos_no_power_off,
> pm_qos_remote_wakeup) to user space. User can set pm_qos_no_power_off
> flag to prohibit the port from being power off.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
> ---
>  drivers/usb/core/port.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> index b7388fd..5a7a833 100644
> --- a/drivers/usb/core/port.c
> +++ b/drivers/usb/core/port.c
> @@ -2,6 +2,7 @@
>  #include <linux/errno.h>
>  #include <linux/module.h>
>  #include <linux/usb.h>
> +#include <linux/pm_qos.h>
>  
>  #include "usb.h"
>  
> @@ -48,7 +49,7 @@ static const struct attribute_group *port_dev_group[] = {
>  static void usb_port_device_release(struct device *dev)
>  {
>  	struct usb_port *port_dev = to_usb_port(dev);
> -
> +	dev_pm_qos_hide_flags(dev);
>  	usb_acpi_unregister_power_resources(dev);
>  	kfree(port_dev);
>  }
> @@ -110,12 +111,19 @@ int usb_hub_create_port_device(struct device *intfdev,
>  	if (retval)
>  		goto error_register;
>  
> +	retval = dev_pm_qos_expose_flags(&port_dev->dev,
> +			PM_QOS_FLAG_NO_POWER_OFF);
> +	if (retval)
> +		goto error_expose_pm_qos;
> +
>  	pm_runtime_set_active(&port_dev->dev);
>  	pm_runtime_enable(&port_dev->dev);
>  	usb_acpi_register_power_resources(&port_dev->dev);
>  
>  	return 0;
>  
> +error_expose_pm_qos:
> +	device_del(&port_dev->dev);

That seems to be a bit drastic. :-)

Why don't you simply avoid enabling runtime PM in that case?

Rafael


>  error_register:
>  	put_device(&port_dev->dev);
>  	return retval;
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH v2 6/6] usb: add usb port's pm qos flags request to change NO_POWER_OFF flag
From: Rafael J. Wysocki @ 2012-11-14  0:23 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: gregkh, sarah.a.sharp, stern, oneukum, linux-usb, Linux PM list
In-Reply-To: <1352793605-4168-7-git-send-email-tianyu.lan@intel.com>

On Tuesday, November 13, 2012 04:00:05 PM Lan Tianyu wrote:
> Some usb devices can't be resumed correctly after power off. This
> patch is to add pm qos flags request to change NO_POWER_OFF and
> provide usb_device_allow_power_off() for device drivers to allow or
> prohibit usb core to power off the device.
> 
> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>

This looks reasonable to me.  From the PM perspective:

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/usb/core/hub.c  |   36 ++++++++++++++++++++++++++++++++++++
>  drivers/usb/core/port.c |   12 ++++++++++++
>  include/linux/usb.h     |    3 +++
>  3 files changed, 51 insertions(+)
> 
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 267e9d7..7aaac720 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -189,6 +189,42 @@ static struct usb_hub *hdev_to_hub(struct usb_device *hdev)
>  	return usb_get_intfdata(hdev->actconfig->interface[0]);
>  }
>  
> +/**
> + * usb_device_allow_power_off - Allow or prohibit power off device.
> + * @udev: target usb device
> + * @set: choice of allow or prohibit
> + *
> + * Clearing or setting usb port's pm qos flag PM_QOS_FLAG_NO_POWER_OFF
> + * to allow or prohibit target usb device to be power off.
> + */
> +int usb_device_allow_power_off(struct usb_device *udev, bool set)
> +{
> +	struct usb_port *port_dev;
> +	struct dev_pm_qos_request *pm_qos;
> +	s32 value;
> +	int ret;
> +
> +	if (!udev->parent)
> +		return -EINVAL;
> +
> +	port_dev =
> +		hdev_to_hub(udev->parent)->ports[udev->portnum - 1];
> +	pm_qos = &port_dev->pm_qos;
> +	value = pm_qos->data.flr.flags;
> +
> +	if (set)
> +		value &= ~PM_QOS_FLAG_NO_POWER_OFF;
> +	else
> +		value |= PM_QOS_FLAG_NO_POWER_OFF;
> +
> +	pm_runtime_get_sync(&port_dev->dev);
> +	ret = dev_pm_qos_update_request(pm_qos, value);
> +	pm_runtime_put(&port_dev->dev);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(usb_device_allow_power_off);
> +
>  static int usb_device_supports_lpm(struct usb_device *udev)
>  {
>  	/* USB 2.1 (and greater) devices indicate LPM support through
> diff --git a/drivers/usb/core/port.c b/drivers/usb/core/port.c
> index 5a7a833..077a494 100644
> --- a/drivers/usb/core/port.c
> +++ b/drivers/usb/core/port.c
> @@ -49,6 +49,11 @@ static const struct attribute_group *port_dev_group[] = {
>  static void usb_port_device_release(struct device *dev)
>  {
>  	struct usb_port *port_dev = to_usb_port(dev);
> +
> +	pm_runtime_get_sync(dev);
> +	dev_pm_qos_remove_request(&port_dev->pm_qos);
> +	pm_runtime_put(dev);
> +
>  	dev_pm_qos_hide_flags(dev);
>  	usb_acpi_unregister_power_resources(dev);
>  	kfree(port_dev);
> @@ -116,12 +121,19 @@ int usb_hub_create_port_device(struct device *intfdev,
>  	if (retval)
>  		goto error_expose_pm_qos;
>  
> +	retval = dev_pm_qos_add_request(&port_dev->dev, &port_dev->pm_qos,
> +			DEV_PM_QOS_FLAGS, 0);
> +	if (retval)
> +		goto error_add_qos_request;
> +
>  	pm_runtime_set_active(&port_dev->dev);
>  	pm_runtime_enable(&port_dev->dev);
>  	usb_acpi_register_power_resources(&port_dev->dev);
>  
>  	return 0;
>  
> +error_add_qos_request:
> +	pm_runtime_put(&port_dev->dev);
>  error_expose_pm_qos:
>  	device_del(&port_dev->dev);
>  error_register:
> diff --git a/include/linux/usb.h b/include/linux/usb.h
> index 8002640..aa201bd 100644
> --- a/include/linux/usb.h
> +++ b/include/linux/usb.h
> @@ -21,6 +21,7 @@
>  #include <linux/sched.h>	/* for current && schedule_timeout */
>  #include <linux/mutex.h>	/* for struct mutex */
>  #include <linux/pm_runtime.h>	/* for runtime PM */
> +#include <linux/pm_qos.h>	/* for PM Qos */
>  
>  struct usb_device;
>  struct usb_driver;
> @@ -577,6 +578,7 @@ struct usb_device {
>   * @child: usb device attatched to the port
>   * @dev: generic device interface
>   * @port_owner: port's owner
> + * @pm_qos: port's pm qos flags request
>   * @connect_type: port's connect type
>   * @portnum: port index num based one
>   * @power_on: port's power state
> @@ -585,6 +587,7 @@ struct usb_port {
>  	struct usb_device *child;
>  	struct device dev;
>  	struct dev_state *port_owner;
> +	struct dev_pm_qos_request pm_qos;
>  	enum usb_port_connect_type connect_type;
>  	u8 portnum;
>  	bool power_on;
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-14  1:08 UTC (permalink / raw)
  To: Alan Stern; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211131102330.1292-100000@iolanthe.rowland.org>

On Tue, 2012-11-13 at 11:10 -0500, Alan Stern wrote:
> On Tue, 13 Nov 2012, Huang Ying wrote:
> 
> > > This is not quite right.  Consider a device that is in runtime suspend 
> > > when a system sleep starts.  When the system sleep ends, the device 
> > > will be resumed but the PM core will still think its state is 
> > > SUSPENDED.  The subsystem has to tell the PM core that the device is 
> > > now ACTIVE.  Currently, subsystems do this by calling 
> > > pm_runtime_disable, pm_runtime_set_active, pm_runtime_enable.  Under 
> > > your scheme this wouldn't work; the pm_runtime_set_active call would 
> > > fail because the device was !forbidden.
> > 
> > Thanks for your information.  For this specific situation, is it
> > possible to call pm_runtime_resume() or pm_request_resume() for the
> > device?
> 
> No, because the device already is at full power.  The subsystem just
> needs to tell the PM core that it is.
> 
> > > > PM.  Device can always work with full power.
> > > 
> > > It can't if the parent is in SUSPEND.  If necessary, the user can write 
> > > "on" to the parent's power/control attribute first.
> > 
> > Is it possible to call pm_runtime_set_active() for the parent if the
> > parent is disabled and SUSPENDED.
> 
> Doing that is possible, but it might not work.  The parent might
> actually be at low power; calling pm_runtime_set_active wouldn't change
> the physical power level.  Basically, it's not safe to assume anything
> about devices that are disabled for runtime PM.
> 
> > It appears that there is race condition between this and the
> > pm_runtime_disable, pm_runtime_set_active, pm_runtime_enable sequence
> > you mentioned ealier.
> > 
> > thread 1			thread 2
> > pm_runtime_disable
> > pm_runtime_set_active
> > 				pm_runtime_allow
> > 				  pm_runtime_set_suspended
> > pm_runtime_enable
> 
> This can't happen in the situation I described earlier because during
> system sleep transitions, no other user threads are allowed to run.  
> All of them except the one actually carrying out the transition are
> frozen.

Thanks for your kind explanation.

After talking with you, my feeling is that the disabled state is obscure
and error-prone.  So I suggest not to use it if possible.  Maybe we can

 - make changes suggested by Alan to make disabled state better.
 - use Rafael's solution to solve this specific issue, and avoid the
usage of disabled state here.

Best Regards,
Huang Ying



^ permalink raw reply

* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Jacob Pan @ 2012-11-14  1:14 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: paulmck, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <50A2E116.8000400@linux.intel.com>

On Tue, 13 Nov 2012 16:08:54 -0800
Arjan van de Ven <arjan@linux.intel.com> wrote:

> > I think I know, but I feel the need to ask anyway.  Why not tell
> > RCU about the clamping?  
> 
> I don't mind telling RCU, but what cannot happen is a bunch of CPU
> time suddenly getting used (since that is the opposite of what is
> needed at the specific point in time of going idle)
Another reason is my observation that there are some assumptions/checks
to make sure only idle thread can tell rcu it is idle. Is it ok to
extend that to other kthreads?

-- 
Thanks,

Jacob

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox