Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Tejun Heo @ 2012-12-05 20:57 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, oleg, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BFAF27.9060203@linux.vnet.ibm.com>

Hello,

On Thu, Dec 06, 2012 at 02:01:35AM +0530, Srivatsa S. Bhat wrote:
> Yes, that _sounds_ sufficient, but IMHO it won't be, in practice. The
> *number* of call-sites that you need to convert from preempt_disable/enable
> to get/put_online_cpus_atomic() won't be too many, however the *frequency*
> of usage of those call-sites can potentially be very high.

I don't think that will be the case and, even if it is, doing it this
way would make it difficult to tell.  The right thing to do is
replacing stop_machine with finer grained percpu locking first.
Refining it further should happen iff that isn't enough and there
isn't an simpler solution.  So, let's please do the simple conversion
first.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 20:31 UTC (permalink / raw)
  To: tj
  Cc: Srivatsa S. Bhat, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF99FA.8060109@linux.vnet.ibm.com>

> Replaying what Tejun wrote:
> 
> On 12/06/2012 12:13 AM, Srivatsa S. Bhat wrote:
>> Some of the atomic hotplug readers cannot tolerate CPUs going offline while
>> they are in their critical section. That is, they can't get away with just
>> synchronizing with the updates to the cpu_online_mask; they really need to
>> synchronize with the entire CPU tear-down sequence, because they are very
>> much involved in the hotplug related code paths.
>>
>> Such "full" atomic hotplug readers need a way to *actually* and *truly*
>> prevent CPUs from going offline while they are active.
>>
> 
> I don't think this is a good idea.  You really should just need
> get/put_online_cpus() and get/put_online_cpus_atomic().  The former
> the same as they are.  The latter replacing what
> preempt_disable/enable() was protecting.  Let's please not go
> overboard unless we know they're necessary.  I strongly suspect that
> breaking up reader side from preempt_disable and making writer side a
> bit lighter should be enough.  Conceptually, it really should be a
> simple conversion - convert preempt_disable/enable() pairs protecting
> CPU on/offlining w/ get/put_cpu_online_atomic() and wrap the
> stop_machine() section with the matching write lock.
> 

Yes, that _sounds_ sufficient, but IMHO it won't be, in practice. The
*number* of call-sites that you need to convert from preempt_disable/enable
to get/put_online_cpus_atomic() won't be too many, however the *frequency*
of usage of those call-sites can potentially be very high.

For example, the IPI path (smp_call_function_*) needs to use the new APIs
instead of preempt_disable(); and this is quite a hot path. So if we replace
preempt_disable/enable() with a synchronization mechanism that spins
the reader *throughout* the CPU offline operation, and provide no light-weight
alternative API, then even such very hot readers will have to bear the wrath.

And IPIs and interrupts are the work-generators in a system. Since they
can be hotplug readers, if we spin them like this, we effectively end up
recreating the stop_machine() "effect", without even using stop_machine().

This is what I meant in my yesterday's reply too:
https://lkml.org/lkml/2012/12/4/349

That's why we need a light-weight variant IMHO, so that we can use them
atleast where feasible, like IPI path (smp_call_function_*) for example.
That'll help us avoid the "stop_machine effect", hoping that most readers
are of the light-type. As I mentioned in the cover-letter, most readers
_are_ of the light-type (eg: 5 patches in this series deal with light
readers, only 1 patch deals with a heavy/full reader). I don't see why
we should unnecessarily slow down every reader just because a minority of
readers actually need full synchronization with CPU offline.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 20:14 UTC (permalink / raw)
  To: tj, oleg
  Cc: Srivatsa S. Bhat, tglx, peterz, paulmck, rusty, mingo, akpm,
	namhyung, vincent.guittot, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF999C.6030707@linux.vnet.ibm.com>

> Replaying what Tejun wrote:
> 
> Hello, Oleg.
> 
>> Replaying what Oleg wrote:
>>> Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
>>> is about replacing preempt_disable with something finer grained and
>>> less heavy on the writer side
>>
>> If only I understood why preempt_disable() is bad ;-)
>>
>> OK, I guess "less heavy on the writer side" is the hint, and in the
>> previous email you mentioned that "stop_machine() itself is extremely
>> heavy".
>>
>> Looks like, you are going to remove stop_machine() from cpu_down ???
>>
> 
> Yeah, that's what Srivatsa is trying to do.  The problem seems to be
> that cpu up/down is very frequent on certain mobile platforms for
> power management and as currently implemented cpu hotplug is too heavy
> and latency-inducing.
>   
>>> The problem seems that we don't have percpu_rwlock yet.  It shouldn't
>>> be too difficult to implement, right?
>>>
>>
>> Oh, I am not sure... unless you simply copy-and-paste the lglock code
>> and replace spinlock_t with rwlock_t.
>>
> 
> Ah... right, so that's where brlock ended up.  So, lglock is the new
> thing and brlock is a wrapper around it.
> 
>> We probably want something more efficient, but I bet we can't avoid
>> the barriers on the read side.
>>
>> And somehow we should avoid the livelocks. Say, we can't simply add
>> the per_cpu_reader_counter, _read_lock should spin if the writer is
>> active. But at the same time _read_lock should be recursive.
>>
> 
> I think we should just go with lglock.  It does involve local atomic
> ops but atomic ops themselves aren't that expensive and it's not like
> we can avoid memory barriers.  Also, that's the non-sleeping
> counterpart of percpu_rwsem.  If it's not good enough for some reason,
> we should improve it rather than introducing something else.
> 

While working on the v2 yesterday, I had actually used rwlocks for
the light readers and atomic ops for the full-readers. (Later I changed
both to rwlocks while posting this v2). Anyway, the atomic ops version
looked something like shown below.

I'll take a look at lglocks and see if that helps in our case.

Regards,
Srivatsa S. Bhat


---

 include/linux/cpu.h |    4 ++
 kernel/cpu.c        |   98 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)


diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index c64b6ed..5011c7d 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -177,6 +177,8 @@ extern void get_online_cpus(void);
 extern void put_online_cpus(void);
 extern void get_online_cpus_stable_atomic(void);
 extern void put_online_cpus_stable_atomic(void);
+extern void get_online_cpus_atomic(void);
+extern void put_online_cpus_atomic(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -202,6 +204,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 #define put_online_cpus()	do { } while (0)
 #define get_online_cpus_stable_atomic()	do { } while (0)
 #define put_online_cpus_stable_atomic()	do { } while (0)
+#define get_online_cpus_atomic()	do { } while (0)
+#define put_online_cpus_atomic()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 8c9eecc..76b07f7 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/gfp.h>
 #include <linux/suspend.h>
+#include <linux/atomic.h>
 
 #include "smpboot.h"
 
@@ -104,6 +105,58 @@ void put_online_cpus_stable_atomic(void)
 }
 EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
 
+static DEFINE_PER_CPU(atomic_t, atomic_reader_refcount);
+
+#define writer_active(v)	((v) < 0)
+#define reader_active(v)	((v) > 0)
+
+/*
+ * Invoked by hotplug reader, to prevent CPUs from going offline.
+ * Increments its per-cpu 'atomic_reader_refcount' to mark itself as being
+ * active.
+ *
+ * If 'atomic_reader_refcount' is negative, it means that a CPU offline
+ * operation is in progress (hotplug writer). Wait for it to complete
+ * and then mark your presence (increment the count) and return.
+ *
+ * You can call this recursively, because it doesn't hold any locks.
+ *
+ * Returns with preemption disabled.
+ */
+void get_online_cpus_atomic(void)
+{
+	int c, old;
+
+	preempt_disable();
+	read_lock(&hotplug_rwlock);
+
+	for (;;) {
+		c = atomic_read(&__get_cpu_var(atomic_reader_refcount));
+		if (unlikely(writer_active(c))) {
+			cpu_relax();
+			continue;
+		}
+
+		old = atomic_cmpxchg(&__get_cpu_var(atomic_reader_refcount),
+				     c, c + 1);
+
+		if (likely(old == c))
+			break;
+
+		c = old;
+	}
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
+
+void put_online_cpus_atomic(void)
+{
+	atomic_dec(&__get_cpu_var(atomic_reader_refcount));
+	smp_mb__after_atomic_dec();
+	read_unlock(&hotplug_rwlock);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -292,6 +345,42 @@ static inline void check_for_tasks(int cpu)
 	write_unlock_irq(&tasklist_lock);
 }
 
+/*
+ * Invoked by hotplug writer, in preparation to take a CPU offline.
+ * Decrements the per-cpu 'atomic_reader_refcount' to mark itself as being
+ * active.
+ *
+ * If 'atomic_reader_refcount' is positive, it means that there are active
+ * hotplug readers (those that prevent hot-unplug). Wait for them to complete
+ * and then mark your presence (decrement the count) and return.
+ */
+static void disable_atomic_reader(unsigned int cpu)
+{
+	int c, old;
+
+	for (;;) {
+		c = atomic_read(&per_cpu(atomic_reader_refcount, cpu));
+		if (likely(reader_active(c))) {
+			cpu_relax();
+			continue;
+		}
+
+		old = atomic_cmpxchg(&per_cpu(atomic_reader_refcount, cpu),
+				     c, c - 1);
+
+		if (likely(old == c))
+			break;
+
+		c = old;
+	}
+}
+
+static void enable_atomic_reader(unsigned int cpu)
+{
+	atomic_inc(&per_cpu(atomic_reader_refcount, cpu));
+	smp_mb__after_atomic_inc();
+}
+
 struct take_cpu_down_param {
 	unsigned long mod;
 	void *hcpu;
@@ -302,6 +391,7 @@ static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
 	unsigned long flags;
+	unsigned int cpu;
 	int err;
 
 	/*
@@ -317,6 +407,10 @@ static int __ref take_cpu_down(void *_param)
 		return err;
 	}
 
+	/* Disable the atomic hotplug readers who need full synchronization */
+	for_each_online_cpu(cpu)
+		disable_atomic_reader(cpu);
+
 	/*
 	 * We have successfully removed the CPU from the cpu_online_mask.
 	 * So release the lock, so that the light-weight atomic readers (who care
@@ -330,6 +424,10 @@ static int __ref take_cpu_down(void *_param)
 
 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
 
+	/* Enable the atomic hotplug readers who need full synchronization */
+	for_each_online_cpu(cpu)
+		enable_atomic_reader(cpu);
+
 	local_irq_restore(flags);
 	return 0;
 }


^ permalink raw reply related

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 19:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205190703.GA13795@redhat.com>

On 12/06/2012 12:37 AM, Oleg Nesterov wrote:
> I'll try to read this series later,
> 
> one minor and almost offtopic nit.
> 
> On 12/06, Srivatsa S. Bhat wrote:
>>
>>  static int __ref take_cpu_down(void *_param)
>>  {
>>  	struct take_cpu_down_param *param = _param;
>> +	unsigned long flags;
>>  	int err;
>>
>> +	/*
>> +	 *  __cpu_disable() is the step where the CPU is removed from the
>> +	 *  cpu_online_mask. Protect it with the light-lock held for write.
>> +	 */
>> +	write_lock_irqsave(&light_hotplug_rwlock, flags);
>> +
>>  	/* Ensure this CPU doesn't handle any more interrupts. */
>>  	err = __cpu_disable();
>> -	if (err < 0)
>> +	if (err < 0) {
>> +		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
>>  		return err;
>> +	}
>> +
>> +	/*
>> +	 * We have successfully removed the CPU from the cpu_online_mask.
>> +	 * So release the light-lock, so that the light-weight atomic readers
>> +	 * (who care only about the cpu_online_mask updates, and not really
>> +	 * about the actual cpu-take-down operation) can continue.
>> +	 *
>> +	 * But don't enable interrupts yet, because we still have work left to
>> +	 * do, to actually bring the CPU down.
>> +	 */
>> +	write_unlock(&light_hotplug_rwlock);
>>
>>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
>> +
>> +	local_irq_restore(flags);
>>  	return 0;
> 
> This is subjective, but imho _irqsave and the fat comment look confusing.
> 
> Currently take_cpu_down() is always called with irqs disabled, so you
> do not need to play with interrupts.
> 
> 10/10 does s/__stop_machine/stop_cpus/ and that patch could simply add
> local_irq_disable/enable into take_cpu_down().
> 

Hmm, we could certainly do that, but somehow I felt it would be easier to
read if we tinker and fix up the take_cpu_down() logic at one place, as a
whole, instead of breaking up into pieces in different patches. And that
also makes the last patch look really cute: it just replaces stop_machine()
with stop_cpus(), as the changelog intended.

I'll see if doing like what you suggested improves the readability, and
if yes, I'll change it. Thank you!

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [RFC PATCH v2 10/10] cpu: No more __stop_machine() in _cpu_down()
From: Srivatsa S. Bhat @ 2012-12-05 19:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205190836.GB13795@redhat.com>

On 12/06/2012 12:38 AM, Oleg Nesterov wrote:
> On 12/06, Srivatsa S. Bhat wrote:
>>
>> @@ -418,7 +418,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
>>  	}
>>  	smpboot_park_threads(cpu);
>>
>> -	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
>> +	err = stop_cpus(cpumask_of(cpu), take_cpu_down, &tcd_param);
> 
> stop_one_cpu(cpu) ?
> 

Even I was thinking of using that. Paul, any particular reason you chose
stop_cpus() over stop_one_cpu() in [1]?

[1]. https://lkml.org/lkml/2012/10/30/359

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* Re: [RFC PATCH v2 10/10] cpu: No more __stop_machine() in _cpu_down()
From: Oleg Nesterov @ 2012-12-05 19:08 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184456.3750.35005.stgit@srivatsabhat.in.ibm.com>

On 12/06, Srivatsa S. Bhat wrote:
>
> @@ -418,7 +418,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
>  	}
>  	smpboot_park_threads(cpu);
>
> -	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
> +	err = stop_cpus(cpumask_of(cpu), take_cpu_down, &tcd_param);

stop_one_cpu(cpu) ?

Oleg.


^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Oleg Nesterov @ 2012-12-05 19:07 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, sbw, amit.kucheria, rostedt, rjw, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184258.3750.31879.stgit@srivatsabhat.in.ibm.com>

I'll try to read this series later,

one minor and almost offtopic nit.

On 12/06, Srivatsa S. Bhat wrote:
>
>  static int __ref take_cpu_down(void *_param)
>  {
>  	struct take_cpu_down_param *param = _param;
> +	unsigned long flags;
>  	int err;
>
> +	/*
> +	 *  __cpu_disable() is the step where the CPU is removed from the
> +	 *  cpu_online_mask. Protect it with the light-lock held for write.
> +	 */
> +	write_lock_irqsave(&light_hotplug_rwlock, flags);
> +
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
> -	if (err < 0)
> +	if (err < 0) {
> +		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
>  		return err;
> +	}
> +
> +	/*
> +	 * We have successfully removed the CPU from the cpu_online_mask.
> +	 * So release the light-lock, so that the light-weight atomic readers
> +	 * (who care only about the cpu_online_mask updates, and not really
> +	 * about the actual cpu-take-down operation) can continue.
> +	 *
> +	 * But don't enable interrupts yet, because we still have work left to
> +	 * do, to actually bring the CPU down.
> +	 */
> +	write_unlock(&light_hotplug_rwlock);
>
>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
> +
> +	local_irq_restore(flags);
>  	return 0;

This is subjective, but imho _irqsave and the fat comment look confusing.

Currently take_cpu_down() is always called with irqs disabled, so you
do not need to play with interrupts.

10/10 does s/__stop_machine/stop_cpus/ and that patch could simply add
local_irq_disable/enable into take_cpu_down().

But again this is minor and subjective, I won't insist.

Oleg.


^ permalink raw reply

* [PATCH RFC 1/1] cpufreq/x86: Add P-state driver for sandy bridge.
From: dirk.brandewie @ 2012-12-05 19:01 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: rjw, deneen.t.dock, arjan, Dirk Brandewie
In-Reply-To: <1354734091-29870-1-git-send-email-dirk.brandewie@gmail.com>

From: Dirk Brandewie <dirk.brandewie@gmail.com>

Add a P-state driver for the Sandy bridge processor.

This driver provides better power efficiency than the current
governors of the Intel architecture.  The driver masquerades as a
frequency governor to the cpufreq subsystem but does not use cpufreq
to change frequency.

Issues:
  does not report current frequency via cpufreq subsystem so this
  confuses some tools.


Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
---
 drivers/cpufreq/Kconfig.x86   |    8 +
 drivers/cpufreq/Makefile      |    1 +
 drivers/cpufreq/cpufreq_snb.c |  727 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 736 insertions(+), 0 deletions(-)
 create mode 100644 drivers/cpufreq/cpufreq_snb.c

diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
index 934854a..8c8acd3 100644
--- a/drivers/cpufreq/Kconfig.x86
+++ b/drivers/cpufreq/Kconfig.x86
@@ -2,6 +2,14 @@
 # x86 CPU Frequency scaling drivers
 #
 
+config X86_SNB_CPUFREQ
+       tristate "SandyBridge frequency Governor"
+       help
+         This driver will override the CPU_FREQ subsystem when
+	 the system has a SandyBridge processor
+
+	 If in doubt, say N.
+
 config X86_PCC_CPUFREQ
 	tristate "Processor Clocking Control interface driver"
 	depends on ACPI && ACPI_PROCESSOR
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 1bc90e1..71ad49e 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_X86_SPEEDSTEP_SMI)		+= speedstep-smi.o
 obj-$(CONFIG_X86_SPEEDSTEP_CENTRINO)	+= speedstep-centrino.o
 obj-$(CONFIG_X86_P4_CLOCKMOD)		+= p4-clockmod.o
 obj-$(CONFIG_X86_CPUFREQ_NFORCE2)	+= cpufreq-nforce2.o
+obj-$(CONFIG_X86_SNB_CPUFREQ)		+= cpufreq_snb.o
 
 ##################################################################################
 # ARM SoC drivers
diff --git a/drivers/cpufreq/cpufreq_snb.c b/drivers/cpufreq/cpufreq_snb.c
new file mode 100644
index 0000000..0d46862
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_snb.c
@@ -0,0 +1,727 @@
+/*
+ * cpufreq_snb.c: Native P state management for Intel processors
+ *
+ * (C) Copyright 2012 Intel Corporation
+ * Author: Dirk Brandewie <dirk.j.brandewie@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+
+#include <linux/kernel.h>
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/cpufreq.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#include <trace/events/power.h>
+
+#include <asm/div64.h>
+#include <asm/msr.h>
+#include <asm/cpu_device_id.h>
+
+#define SAMPLE_COUNT		3
+
+struct sampling_state {
+	int idle_mode;
+	int first_sample;
+};
+
+struct sample {
+	ktime_t start_time;
+	ktime_t end_time;
+	int core_pct_busy;
+	int freq_pct_busy;
+	u64 duration_us;
+	u64 idletime_us;
+	u64 aperf;
+	u64 mperf;
+};
+
+struct freqdata {
+	int	current_freq;
+	int	min_freq;
+	int	max_freq;
+	int	turbo_freq;
+};
+
+struct _pid {
+	int setpoint;
+	int32_t integral;
+	int32_t p_gain;
+	int32_t i_gain;
+	int32_t d_gain;
+	int deadband;
+	int last_err;
+};
+
+struct cpudata {
+	int cpu;
+
+	char name[64];
+
+	struct timer_list timer;
+
+	struct freq_adjust_policy *freq_policy;
+	struct freqdata clock;
+	struct sampling_state sampling_state;
+	struct _pid pid;
+	struct _pid idle_pid;
+
+	int min_freq_count;
+
+	ktime_t prev_sample;
+	u64	prev_idle_time_us;
+	u64	prev_aperf;
+	u64	prev_mperf;
+	int	sample_ptr;
+	struct sample samples[SAMPLE_COUNT];
+};
+
+static unsigned int snb_usage;
+static DEFINE_MUTEX(snb_mutex);
+
+struct cpudata **all_cpu_data;
+struct freq_adjust_policy {
+	int sample_rate_ms;    /* sample rate */
+	int deadband; /*adjust freq on last sample or average */
+	int setpoint; /* starting freq when we have no info */
+	int p_gain_pct;
+	int d_gain_pct;
+	int i_gain_pct;
+};
+
+struct freq_adjust_policy default_policy = {
+	.sample_rate_ms = 10,
+	.deadband = 0,
+	.setpoint = 109,
+	.p_gain_pct = 17,
+	.d_gain_pct = 0,
+	.i_gain_pct = 4,
+};
+
+#define FRAC_BITS 8
+#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
+#define fp_toint(X) ((X) >> FRAC_BITS)
+
+static inline int32_t mul_fp(int32_t x, int32_t y)
+{
+	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
+}
+
+static inline int32_t div_fp(int32_t x, int32_t y)
+{
+	return div_s64((int64_t)x << FRAC_BITS, (int64_t)y);
+}
+
+
+static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
+			int deadband, int integral) {
+	pid->setpoint = setpoint;
+	pid->deadband  = deadband;
+	pid->integral  = int_tofp(integral);
+	pid->last_err  = setpoint - busy;
+}
+
+static inline void pid_p_gain_set(struct _pid *pid, int percent)
+{
+	pid->p_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_i_gain_set(struct _pid *pid, int percent)
+{
+	pid->i_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_d_gain_set(struct _pid *pid, int percent)
+{
+
+	pid->d_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline int pid_calc(struct _pid *pid, int busy)
+{
+	int err, result;
+	int32_t pterm, dterm, fp_error;
+	int32_t integral_limit;
+
+	integral_limit = int_tofp(30);
+	err = pid->setpoint - busy;
+
+	if (abs(err) <= pid->deadband)
+		return 0;
+
+	fp_error = int_tofp(err);
+	pterm = mul_fp(pid->p_gain, fp_error);
+	pid->integral += mul_fp(pid->i_gain, fp_error);
+
+	/* limit the integral term */
+	if (pid->integral > integral_limit)
+		pid->integral = integral_limit;
+	if (pid->integral < -integral_limit)
+		pid->integral = -integral_limit;
+
+	dterm = mul_fp(pid->d_gain, (err - pid->last_err));
+	result = pterm + pid->integral + dterm;
+
+	pid->last_err = err;
+	return fp_toint(result);
+}
+
+
+static inline void snb_busy_pid_reset(struct cpudata *cpu)
+{
+	pid_reset(&cpu->pid,
+		cpu->freq_policy->setpoint,
+		100,
+		cpu->freq_policy->deadband,
+		0);
+
+	pid_p_gain_set(&cpu->pid, cpu->freq_policy->p_gain_pct);
+	pid_d_gain_set(&cpu->pid, cpu->freq_policy->d_gain_pct);
+	pid_i_gain_set(&cpu->pid, cpu->freq_policy->i_gain_pct);
+}
+
+static inline void snb_idle_pid_reset(struct cpudata *cpu)
+{
+	pid_reset(&cpu->idle_pid,
+		75,
+		50,
+		cpu->freq_policy->deadband,
+		0);
+
+	pid_p_gain_set(&cpu->idle_pid, cpu->freq_policy->p_gain_pct);
+	pid_d_gain_set(&cpu->idle_pid, cpu->freq_policy->d_gain_pct);
+	pid_i_gain_set(&cpu->idle_pid, cpu->freq_policy->i_gain_pct);
+}
+
+static inline void snb_reset_all_pid(void)
+{
+	unsigned int cpu;
+	for_each_online_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			snb_busy_pid_reset(all_cpu_data[cpu]);
+	}
+}
+
+/************************** sysfs begin ************************/
+#define show_one(file_name, object)					\
+	static ssize_t show_##file_name					\
+	(struct kobject *kobj, struct attribute *attr, char *buf)	\
+	{								\
+		return sprintf(buf, "%u\n", default_policy.object);	\
+	}
+
+static ssize_t store_sample_rate_ms(struct kobject *a, struct attribute *b,
+				const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.sample_rate_ms = input;
+	snb_reset_all_pid();
+	return count;
+}
+
+static ssize_t store_d_gain_pct(struct kobject *a, struct attribute *b,
+				const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.d_gain_pct = input;
+	snb_reset_all_pid();
+
+	return count;
+}
+
+static ssize_t store_i_gain_pct(struct kobject *a, struct attribute *b,
+				const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.i_gain_pct = input;
+	snb_reset_all_pid();
+
+	return count;
+}
+
+static ssize_t store_deadband(struct kobject *a, struct attribute *b,
+			const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.deadband = input;
+	snb_reset_all_pid();
+
+	return count;
+}
+
+static ssize_t store_setpoint(struct kobject *a, struct attribute *b,
+			const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.setpoint = input;
+	snb_reset_all_pid();
+
+	return count;
+}
+
+static ssize_t store_p_gain_pct(struct kobject *a, struct attribute *b,
+				const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+	ret = sscanf(buf, "%u", &input);
+	if (ret != 1)
+		return -EINVAL;
+	default_policy.p_gain_pct = input;
+	snb_reset_all_pid();
+
+	return count;
+}
+
+show_one(sample_rate_ms, sample_rate_ms);
+show_one(d_gain_pct, d_gain_pct);
+show_one(i_gain_pct, i_gain_pct);
+show_one(deadband, deadband);
+show_one(setpoint, setpoint);
+show_one(p_gain_pct, p_gain_pct);
+
+
+define_one_global_rw(sample_rate_ms);
+define_one_global_rw(d_gain_pct);
+define_one_global_rw(i_gain_pct);
+define_one_global_rw(deadband);
+define_one_global_rw(setpoint);
+define_one_global_rw(p_gain_pct);
+
+
+static struct attribute *snb_attributes[] = {
+	&sample_rate_ms.attr,
+	&d_gain_pct.attr,
+	&i_gain_pct.attr,
+	&deadband.attr,
+	&setpoint.attr,
+	&p_gain_pct.attr,
+	NULL
+};
+
+static struct attribute_group snb_attr_group = {
+	.attrs = snb_attributes,
+	.name = "snb",
+};
+
+/************************** sysfs end ************************/
+
+static int snb_get_min_freq(void)
+{
+	u64 value;
+	rdmsrl(0xCE, value);
+	return (value >> 40) & 0xFF;
+}
+
+static int snb_get_max_freq(void)
+{
+	u64 value;
+	rdmsrl(0xCE, value);
+	return (value >> 8) & 0xFF;
+}
+
+static int snb_get_turbo_freq(void)
+{
+	u64 value;
+	int nont, ret;
+	rdmsrl(0x1AD, value);
+	nont = snb_get_max_freq();
+	ret = ((value) & 255);
+	if (ret <= nont)
+		ret = nont;
+	return ret;
+}
+
+
+static void snb_set_freq(struct cpudata *cpu, int clock)
+{
+	clock = clamp_t(int, clock, cpu->clock.min_freq, cpu->clock.turbo_freq);
+
+	if (clock == cpu->clock.current_freq)
+		return;
+
+#ifndef MODULE
+	trace_cpu_frequency(clock * 100000, cpu->cpu);
+	trace_power_frequency(POWER_PSTATE, clock * 100000, cpu->cpu);
+#endif
+
+	cpu->clock.current_freq = clock;
+	wrmsrl(MSR_IA32_PERF_CTL, clock << 8);
+}
+
+static inline void snb_freq_increase(struct cpudata *cpu, int steps)
+{
+	int target;
+	target = cpu->clock.current_freq + steps;
+
+	snb_set_freq(cpu, target);
+}
+
+static inline void snb_freq_decrease(struct cpudata *cpu, int steps)
+{
+	int target;
+	target = cpu->clock.current_freq - steps;
+	snb_set_freq(cpu, target);
+}
+
+static void snb_get_cpu_freqs(struct cpudata *cpu)
+{
+	sprintf(cpu->name, "Intel 2nd generation core");
+
+	cpu->clock.min_freq = snb_get_min_freq();
+	cpu->clock.max_freq = snb_get_max_freq();
+	cpu->clock.turbo_freq = snb_get_turbo_freq();
+
+	/* goto max clock so we don't slow up boot if we are built-in
+	   if we are a module we will take care of it during normal
+	   operation
+	*/
+	snb_set_freq(cpu, cpu->clock.max_freq);
+}
+
+
+static inline void snb_calc_busy(struct cpudata *cpu, struct sample *sample)
+{
+	u64 core_pct;
+
+	sample->freq_pct_busy = 100 - div64_u64(
+					sample->idletime_us * 100,
+					sample->duration_us);
+	core_pct = div64_u64(sample->aperf * 100, sample->mperf);
+	sample->core_pct_busy = sample->freq_pct_busy * core_pct / 100;
+}
+
+static inline int snb_sample(struct cpudata *cpu)
+{
+	ktime_t now;
+	u64 idle_time_us;
+	u64 aperf, mperf;
+
+	now = ktime_get();
+	idle_time_us = get_cpu_idle_time_us(cpu->cpu, NULL);
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+	/* for the first sample, don't actually record a sample, just
+	 * set the baseline */
+	if (cpu->prev_idle_time_us > 0) {
+		cpu->sample_ptr = (cpu->sample_ptr + 1) % SAMPLE_COUNT;
+		cpu->samples[cpu->sample_ptr].start_time = cpu->prev_sample;
+		cpu->samples[cpu->sample_ptr].end_time = now;
+		cpu->samples[cpu->sample_ptr].duration_us =
+			ktime_us_delta(now, cpu->prev_sample);
+		cpu->samples[cpu->sample_ptr].idletime_us =
+			idle_time_us - cpu->prev_idle_time_us;
+
+		cpu->samples[cpu->sample_ptr].aperf = aperf;
+		cpu->samples[cpu->sample_ptr].mperf = mperf;
+		cpu->samples[cpu->sample_ptr].aperf -= cpu->prev_aperf;
+		cpu->samples[cpu->sample_ptr].mperf -= cpu->prev_mperf;
+
+		snb_calc_busy(cpu, &cpu->samples[cpu->sample_ptr]);
+	}
+
+	cpu->prev_sample = now;
+	cpu->prev_idle_time_us = idle_time_us;
+	cpu->prev_aperf = aperf;
+	cpu->prev_mperf = mperf;
+	return cpu->sample_ptr;
+}
+
+static inline void snb_set_sample_time(struct cpudata *cpu)
+{
+	int sample_time;
+	int delay;
+
+	sample_time = cpu->freq_policy->sample_rate_ms;
+	delay = msecs_to_jiffies(sample_time);
+	delay -= jiffies % delay;
+	mod_timer(&cpu->timer, jiffies + delay);
+}
+
+static inline void snb_idle_mode(struct cpudata *cpu)
+{
+	cpu->sampling_state.idle_mode = 1;
+}
+
+static inline void snb_normal_mode(struct cpudata *cpu)
+{
+	cpu->sampling_state.idle_mode = 0;
+}
+
+static inline int snb_get_scaled_busy(struct cpudata *cpu)
+{
+	int32_t busy_scaled;
+	int32_t core_busy, turbo_freq, current_freq;
+
+	core_busy    = int_tofp(cpu->samples[cpu->sample_ptr].core_pct_busy);
+	turbo_freq   = int_tofp(cpu->clock.turbo_freq);
+	current_freq = int_tofp(cpu->clock.current_freq);
+	busy_scaled = mul_fp(core_busy, div_fp(turbo_freq, current_freq));
+
+	return fp_toint(busy_scaled);
+}
+
+static inline void snb_adjust_busy_freq(struct cpudata *cpu)
+{
+	int busy_scaled;
+	struct _pid *pid;
+	int ctl = 0;
+	int steps;
+
+	pid = &cpu->pid;
+
+	busy_scaled =  snb_get_scaled_busy(cpu);
+
+	ctl = pid_calc(pid, busy_scaled);
+
+	steps = abs(ctl);
+	if (ctl < 0)
+		snb_freq_increase(cpu, steps);
+	else
+		snb_freq_decrease(cpu, steps);
+}
+
+static inline void snb_adjust_idle_freq(struct cpudata *cpu)
+{
+	int busy_scaled;
+	struct _pid *pid;
+	int ctl = 0;
+	int steps;
+
+	pid = &cpu->idle_pid;
+
+	busy_scaled =  snb_get_scaled_busy(cpu);
+
+	ctl = pid_calc(pid, 100 - busy_scaled);
+
+	steps = abs(ctl);
+	if (ctl < 0)
+		snb_freq_decrease(cpu, steps);
+	else
+		snb_freq_increase(cpu, steps);
+
+	if (cpu->clock.current_freq == cpu->clock.min_freq)
+		snb_normal_mode(cpu);
+}
+static inline int snb_valid_sample(struct cpudata *cpu, int idx)
+{
+	struct sample *sample = &cpu->samples[idx];
+
+	return sample->duration_us <
+		(cpu->freq_policy->sample_rate_ms * USEC_PER_MSEC  * 2);
+}
+
+static void snb_timer_func(unsigned long __data)
+{
+	struct cpudata *cpu = (struct cpudata *) __data;
+	struct freq_adjust_policy *policy;
+	int idx;
+
+	policy = cpu->freq_policy;
+
+	idx = snb_sample(cpu);
+
+	if (snb_valid_sample(cpu, idx)) {
+		if (!cpu->sampling_state.idle_mode)
+			snb_adjust_busy_freq(cpu);
+		else
+			snb_adjust_idle_freq(cpu);
+	}
+
+#if defined(XPERF_FIX)
+	if (cpu->clock.current_freq == cpu->clock.min_freq) {
+		cpu->min_freq_count++;
+		if (!(cpu->min_freq_count % 5)) {
+			snb_set_freq(cpu, cpu->clock.max_freq);
+			snb_idle_mode(cpu);
+		}
+	} else
+		cpu->min_freq_count = 0;
+#endif
+	snb_set_sample_time(cpu);
+}
+
+static void snb_exit(unsigned int cpu)
+{
+	if (!all_cpu_data)
+		return;
+	pr_info("snb: disabling %d\n", cpu);
+	if (all_cpu_data[cpu]) {
+		del_timer_sync(&all_cpu_data[cpu]->timer);
+		kfree(all_cpu_data[cpu]);
+	}
+}
+
+#define ICPU(model, policy) \
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&policy }
+
+static const struct x86_cpu_id intel_cpufreq_ids[] = {
+	ICPU(0x2a, default_policy),
+	ICPU(0x2d, default_policy),
+	{}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_cpufreq_ids);
+
+static int snb_init(unsigned int cpu)
+{
+	int rc;
+	const struct x86_cpu_id *id;
+
+	id = x86_match_cpu(intel_cpufreq_ids);
+	if (!id)
+		return -ENODEV;
+
+	all_cpu_data[cpu] = kzalloc(sizeof(struct cpudata), GFP_KERNEL);
+	if (!all_cpu_data[cpu]) {
+		rc = -ENOMEM;
+		goto unwind;
+	}
+
+	snb_get_cpu_freqs(all_cpu_data[cpu]);
+
+	all_cpu_data[cpu]->cpu = cpu;
+	all_cpu_data[cpu]->freq_policy =
+		(struct freq_adjust_policy *)id->driver_data;
+	init_timer_deferrable(&all_cpu_data[cpu]->timer);
+	all_cpu_data[cpu]->timer.function = snb_timer_func;
+	all_cpu_data[cpu]->timer.data =
+		(unsigned long)all_cpu_data[cpu];
+	all_cpu_data[cpu]->timer.expires = jiffies + HZ/100;
+	snb_busy_pid_reset(all_cpu_data[cpu]);
+	snb_idle_pid_reset(all_cpu_data[cpu]);
+	pr_info("snb: enabling %d\n", cpu);
+	add_timer_on(&all_cpu_data[cpu]->timer, cpu);
+	return 0;
+
+unwind:
+	snb_exit(cpu);
+	return -ENODEV;
+}
+
+/**
+ * cpufreq_set - set the CPU frequency
+ * @policy: pointer to policy struct where freq is being set
+ * @freq: target frequency in kHz
+ *
+ * Sets the CPU frequency to freq.
+ */
+static int cpufreq_snb_set(struct cpufreq_policy *policy, unsigned int freq)
+{
+	int ret = 0;
+	return ret;
+}
+
+
+static ssize_t cpufreq_snb_show_speed(struct cpufreq_policy *policy, char *buf)
+{
+	return 0;
+}
+
+static int cpufreq_snb(struct cpufreq_policy *policy,
+				   unsigned int event)
+{
+	unsigned int cpu = policy->cpu;
+	int rc = 0;
+
+	switch (event) {
+	case CPUFREQ_GOV_START:
+		if (!cpu_online(cpu))
+			return -EINVAL;
+		mutex_lock(&snb_mutex);
+		snb_usage++;
+		rc = snb_init(cpu);
+
+		if (snb_usage == 1)
+			rc = sysfs_create_group(cpufreq_global_kobject,
+						&snb_attr_group);
+
+		mutex_unlock(&snb_mutex);
+		break;
+	case CPUFREQ_GOV_STOP:
+		mutex_lock(&snb_mutex);
+		snb_usage--;
+		snb_exit(cpu);
+		if (!snb_usage)
+			sysfs_remove_group(cpufreq_global_kobject,
+					&snb_attr_group);
+
+		mutex_unlock(&snb_mutex);
+		break;
+	case CPUFREQ_GOV_LIMITS:
+		mutex_lock(&snb_mutex);
+		mutex_unlock(&snb_mutex);
+		break;
+	}
+	return rc;
+}
+
+static struct cpufreq_governor cpufreq_gov_snb = {
+	.name		= "snb",
+	.governor	= cpufreq_snb,
+	.store_setspeed	= cpufreq_snb_set,
+	.show_setspeed	= cpufreq_snb_show_speed,
+	.owner		= THIS_MODULE,
+};
+
+static int __init cpufreq_gov_snb_init(void)
+{
+	pr_info("Sandybridge frequency driver initializing.\n");
+
+	all_cpu_data = vmalloc(sizeof(void *) * num_possible_cpus());
+	if (!all_cpu_data)
+		return -ENOMEM;
+	memset(all_cpu_data, 0, sizeof(void *) * num_possible_cpus());
+
+
+	return cpufreq_register_governor(&cpufreq_gov_snb);
+}
+
+static void __exit cpufreq_gov_snb_exit(void)
+{
+	vfree(all_cpu_data);
+	all_cpu_data = NULL;
+	cpufreq_unregister_governor(&cpufreq_gov_snb);
+}
+
+
+MODULE_AUTHOR("Dirk Brandewie <dirk.j.brandewie@intel.com>");
+MODULE_DESCRIPTION("'cpufreq_snb' - cpufreq governor for Sandy Bridge");
+MODULE_LICENSE("GPL");
+
+
+fs_initcall(cpufreq_gov_snb_init);
+module_exit(cpufreq_gov_snb_exit);
-- 
1.7.7.6


^ permalink raw reply related

* [PATCH RFC 0/1] cpufreq/x86: Add P-state driver for sandy bridge. 
From: dirk.brandewie @ 2012-12-05 19:01 UTC (permalink / raw)
  To: linux-kernel, cpufreq, linux-pm; +Cc: rjw, deneen.t.dock, arjan, Dirk Brandewie

From: Dirk Brandewie <dirk.brandewie@gmail.com>

This driver provides a P state driver for Sandybridge and Ivybridge
processors.

Motivation:
The goal of this driver is to improve the power efficiency of
Sandybridge/Ivybridge based systems.  As the investigation into how to
achieve this goal progressed it became apparent (to me) that some of the
design assumptions of the cpufreq subsystem are no longer valid and
that a micro-architecure specific P state driver would be less complex
and potentially more effiecent.  As Intel continues to innovate in the
area of freqency/power control this will become more true IMHO.

General info:
The driver uses a PID controller to adjust the core frequency based on
the presented load. The driver exposes the tuning parameters for the
controller in the /sys/devices/system/cpu/cpufreq/snb directory.  The
controller code is being used in PI mode with the default tuning
parmeters.

Tuning parmeters:
   setpoint - load in percent on the core will attempt to maintain.	
   sample_rate_ms - rate at which the driver will sample the load on the core. 
   deadband  - percent ± around the setpoint the controller will
               consider zero error.
   p_gain_pct - Proportional gain in percent. 
   i_gain_pct - Integral gain in percent. 
   d_gain_pct - Derivative gain in percent

To use the driver as root run the following shell script:
   #!/bin/sh
   for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
   do 
      echo snb > $file
   done

Limitations:

ATM this driver will only run on SandyBridge systems testing on
Ivybridge systems is not complete.

Open Questions:

What is the correct way to integrate this driver into the system?  The
current implementation registers as a cpufreq frequency governor, this
was done to streamline testing using cpufreq to load/unload governors.

What tuning parameters should be exposed via sysfs (if any)?  ATM all
the PID parameters are exposed to enable tuning of the driver.

Performance information:

 --- Kernel build ---
The following is data collected for a bzImage kernel build.  The
commands used were:
	 make -j8 clean
	 sysctl -w vm.drop_caches=3
	 /usr/bin/time -f "%E %c" make -j8 bzImage

Time and context switches measured with /usr/bin/time -f "%E %c"

Energy measured with package energy status MSR described in section
14.7 in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual Volume 3.
       http://download.intel.com/products/processor/manual/325384.pdf

Average watts calculated with energy/time in seconds

           time		ctx sw	energy	avg watts
perf      02:24.49	116721	6660	46.09
snb       02:27.03	114940	6591	44.83
ondemand  02:26.83	190948	6683	45.51

A graph of the power usage during the kernel build for each governor
is available here:
    http://git.fenrus.org/dirk/kernel.png

 --- Power benchmark ---
I used industry standard power bench suite to compare the performance and
ondemand governors against the Sandybridge governor.

  Governor    | ssj_ops/watt
 -----------------------------
  performance |   1855
  ondemand    |   1839
  snb         |   2016

A graph of the power usage for each governor is avavailable here:
    http://git.fenrus.org/dirk/power_benchmark.png

A graph showing the results of cpufreq-bench tool shipped with the
kernel Collected with
   cpufreq-bench -l 6000 -s 6000 -x 2000 -y 2000 -c 0 \
               -g {ondemand | snb} -n 40 -r 40 
is available here:
    http://git.fenrus.org/dirk/cpufreq-bench.png

Dirk Brandewie (1):
  cpufreq/x86: Add P-state driver for sandy bridge.

 drivers/cpufreq/Kconfig.x86   |    8 +
 drivers/cpufreq/Makefile      |    1 +
 drivers/cpufreq/cpufreq_snb.c |  727 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 736 insertions(+), 0 deletions(-)
 create mode 100644 drivers/cpufreq/cpufreq_snb.c

-- 
1.7.7.6

^ permalink raw reply

* Re: [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 19:01 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184313.3750.17752.stgit@srivatsabhat.in.ibm.com>

Replaying what Tejun wrote:

On 12/06/2012 12:13 AM, Srivatsa S. Bhat wrote:
> Some of the atomic hotplug readers cannot tolerate CPUs going offline while
> they are in their critical section. That is, they can't get away with just
> synchronizing with the updates to the cpu_online_mask; they really need to
> synchronize with the entire CPU tear-down sequence, because they are very
> much involved in the hotplug related code paths.
> 
> Such "full" atomic hotplug readers need a way to *actually* and *truly*
> prevent CPUs from going offline while they are active.
> 

I don't think this is a good idea.  You really should just need
get/put_online_cpus() and get/put_online_cpus_atomic().  The former
the same as they are.  The latter replacing what
preempt_disable/enable() was protecting.  Let's please not go
overboard unless we know they're necessary.  I strongly suspect that
breaking up reader side from preempt_disable and making writer side a
bit lighter should be enough.  Conceptually, it really should be a
simple conversion - convert preempt_disable/enable() pairs protecting
CPU on/offlining w/ get/put_cpu_online_atomic() and wrap the
stop_machine() section with the matching write lock.

Thanks.

-- tejun 

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:59 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF98F7.3030600@linux.vnet.ibm.com>

Replaying what Tejun wrote:

Hello, Oleg.

> Replaying what Oleg wrote:
>> Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
>> is about replacing preempt_disable with something finer grained and
>> less heavy on the writer side
> 
> If only I understood why preempt_disable() is bad ;-)
> 
> OK, I guess "less heavy on the writer side" is the hint, and in the
> previous email you mentioned that "stop_machine() itself is extremely
> heavy".
> 
> Looks like, you are going to remove stop_machine() from cpu_down ???
>

Yeah, that's what Srivatsa is trying to do.  The problem seems to be
that cpu up/down is very frequent on certain mobile platforms for
power management and as currently implemented cpu hotplug is too heavy
and latency-inducing.

>> The problem seems that we don't have percpu_rwlock yet.  It shouldn't
>> be too difficult to implement, right?
>>
> 
> Oh, I am not sure... unless you simply copy-and-paste the lglock code
> and replace spinlock_t with rwlock_t.
> 

Ah... right, so that's where brlock ended up.  So, lglock is the new
thing and brlock is a wrapper around it.

> We probably want something more efficient, but I bet we can't avoid
> the barriers on the read side.
> 
> And somehow we should avoid the livelocks. Say, we can't simply add
> the per_cpu_reader_counter, _read_lock should spin if the writer is
> active. But at the same time _read_lock should be recursive.
> 

I think we should just go with lglock.  It does involve local atomic
ops but atomic ops themselves aren't that expensive and it's not like
we can avoid memory barriers.  Also, that's the non-sleeping
counterpart of percpu_rwsem.  If it's not good enough for some reason,
we should improve it rather than introducing something else.

Thanks.

-- tejun

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:56 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF982D.7090704@linux.vnet.ibm.com>

Replaying what Oleg wrote:

(add lkml)

> Replaying what Tejun wrote:

> Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
> is about replacing preempt_disable with something finer grained and
> less heavy on the writer side

If only I understood why preempt_disable() is bad ;-)

OK, I guess "less heavy on the writer side" is the hint, and in the
previous email you mentioned that "stop_machine() itself is extremely
heavy".

Looks like, you are going to remove stop_machine() from cpu_down ???

> The problem seems that we don't have percpu_rwlock yet.  It shouldn't
> be too difficult to implement, right?
> 

Oh, I am not sure... unless you simply copy-and-paste the lglock code
and replace spinlock_t with rwlock_t.

We probably want something more efficient, but I bet we can't avoid
the barriers on the read side.

And somehow we should avoid the livelocks. Say, we can't simply add
the per_cpu_reader_counter, _read_lock should spin if the writer is
active. But at the same time _read_lock should be recursive.

Tejun, could you please send me mbox with this thread offlist?

[That should now be unnecessary, since the discussion can continue
on-list on this thread].

Oleg.

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:53 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF979A.50304@linux.vnet.ibm.com>

Replaying what Tejun wrote:

Hello, Oleg.

> Replaying what Oleg wrote:
> 
> Hi,
> 
> Sorry I don't understand the context and I can't find this thread
> anywhere, so I am not sure I understand...
> 

Weird, lkml cc is missing.  Srivatsa?

[Now fixed. This thread has lkml CC]

>> Replaying what Tejun wrote:
>> So, we basically need percpu_rwlock.  We already have percpu_rwsem.
> 
> Yes, and with -mm patches it becomes reader-friendly. In particular
> see http://marc.info/?l=linux-mm-commits&m=135240650828875
> 
>> Oleg, it seems
>> CPU hotplug needs big-reader rwlock, ideas on how to proceed?
>>
> 
> I am going to convert get_online_cpus() to use percpu_down_read(),
> this looks simple.
> 
> We already discussed this with Paul, see
> 
> 	http://marc.info/?l=linux-kernel&m=135248463226031
> 
> and the whole thread.
> 
> In short, all we need is percpu_down_write_recursive_readers() and
> afaics the only complication is lockdep, we need down_read_no_lockdep()
> which (like __up_read) doesn't do rwsem_acquire_read().
> 

So, it's a different thing.  There are two mechanism protecting
against cpu hotplug - get_online_cpus() and preempt_disable().  The
former can be used by ones which can sleep and need to protect against
the whole up/down process (DOWN_PREPARE and so on).  The latter
protects the last step and can be used when the caller can't sleep.
Replacing get_online_cpus() w/ percpu_rwsem is great but this thread
is about replacing preempt_disable with something finer grained and
less heavy on the writer side - IOW, percpu_rwlock as opposed to
percpu_rwsem, so, I think the end result would be that CPU hotplug
will be protected by percpu_rwsem for the whole part and by
percpu_rwlock for the last commit stage.

The problem seems that we don't have percpu_rwlock yet.  It shouldn't
be too difficult to implement, right?

Thanks.

-- tejun 

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:51 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <50BF96DF.3000500@linux.vnet.ibm.com>

Replaying what Oleg wrote:

Hi,

Sorry I don't understand the context and I can't find this thread
anywhere, so I am not sure I understand...

> Replaying what Tejun wrote:
> So, we basically need percpu_rwlock.  We already have percpu_rwsem.

Yes, and with -mm patches it becomes reader-friendly. In particular
see http://marc.info/?l=linux-mm-commits&m=135240650828875

> Oleg, it seems
> CPU hotplug needs big-reader rwlock, ideas on how to proceed?
> 

I am going to convert get_online_cpus() to use percpu_down_read(),
this looks simple.

We already discussed this with Paul, see

	http://marc.info/?l=linux-kernel&m=135248463226031

and the whole thread.

In short, all we need is percpu_down_write_recursive_readers() and
afaics the only complication is lockdep, we need down_read_no_lockdep()
which (like __up_read) doesn't do rwsem_acquire_read().

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6 v8] cpufreq, highbank: add support for highbank cpufreq
From: Mike Turquette @ 2012-12-05 18:49 UTC (permalink / raw)
  To: Mark Langsdorf
  Cc: linux-kernel, cpufreq, linux-pm@vger.kernel.org, linux-arm-kernel,
	Shawn Guo
In-Reply-To: <1354726121-17190-7-git-send-email-mark.langsdorf@calxeda.com>

On Wed, Dec 5, 2012 at 8:48 AM, Mark Langsdorf
<mark.langsdorf@calxeda.com> wrote:
> diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
> new file mode 100644
> index 0000000..1f28fa6
> --- /dev/null
> +++ b/drivers/cpufreq/highbank-cpufreq.c
> @@ -0,0 +1,102 @@

Looks pretty good to me.  Some tedious nitpicks and discussion below.
<snip>

> +static int hb_voltage_change(unsigned int freq)
> +{
> +       int i;
> +       u32 msg[7];
> +
> +       msg[0] = HB_CPUFREQ_CHANGE_NOTE;
> +       msg[1] = freq / 1000000;
> +       for (i = 2; i < 7; i++)
> +               msg[i] = 0;
> +
> +       return pl320_ipc_transmit(msg);
> +}
> +
> +static int hb_cpufreq_clk_notify(struct notifier_block *nb,
> +                               unsigned long action, void *hclk)
> +{
> +       struct clk_notifier_data *clk_data = hclk;
> +       int i = 0;
> +
> +       if (action == PRE_RATE_CHANGE) {
> +               if (clk_data->new_rate > clk_data->old_rate)
> +                       while (hb_voltage_change(clk_data->new_rate))
> +                               if (i++ > 15)

There are a few magic numbers here.  How about something like:

#define HB_VOLT_CHANGE_MAX_TRIES 15

Maybe do the same for the i2c message length?

> +                                       return NOTIFY_STOP;

How about NOTIFY_BAD?  It more clearly signals that an error has occurred.

You could also return notifier_from_errno(-ETIMEDOUT) here if you
prefer but that would only be for the sake of readability.
clk_set_rate doesn't actually return the notifier error code in the
event of a notifier abort.

> +       } else if (action == POST_RATE_CHANGE) {
> +               if (clk_data->new_rate < clk_data->old_rate)
> +                       while (hb_voltage_change(clk_data->new_rate))
> +                               if (i++ > 15)
> +                                       break;

Same as above.  It is true that the clock framework does nothing with
post-rate change notifier aborts but that might change in the future.

> +       }
> +
> +       return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block hb_cpufreq_clk_nb = {
> +       .notifier_call = hb_cpufreq_clk_notify,
> +};
> +

Do you have any plans to convert your voltage change routine over to
the regulator framework?  Likewise do you plan to use the OPP library
in the future?  I can understand if you do not do that since your
regulator/dvfs programming model makes things very simple for you.

The reason I bring this up is that I did float a patch a while back
for a generalized dvfs notifier handler.  The prereqs for using it are
1) ccf, 2) regulator fwk, 3) opp definitions.  Here is the patch:
https://github.com/mturquette/linux/commit/05a280bbc0819a6858d73088a632666f0c7f68a4

And an example usage in the OMAP CPUfreq driver:
https://github.com/mturquette/linux/commit/958f10bb98a293aa912e7eb9cd6edbdc51c1c04a

I understand if this approach incurs too much software overhead for
you but I wanted to throw it out there.  It might working nicely in
the cpufreq-cpu0 driver or some other "generic" CPUfreq driver for
implementing DVFS.

Regards,
Mike

^ permalink raw reply

* Re: [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:47 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg, sbw, amit.kucheria, rostedt, rjw,
	wangyun, xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184258.3750.31879.stgit@srivatsabhat.in.ibm.com>

Replaying what Tejun wrote:

(cc'ing Oleg)

Hello, Srivatsa.

On 12/06/2012 12:13 AM, Srivatsa S. Bhat wrote:
> Also, since we don't use per-cpu locks (because rwlocks themselves are quite
> scalable for readers), we don't end up in any lock ordering problems that can
> occur if we try to use per-cpu locks.
> 

Read-lock really isn't that scalable when you compare it to
preempt_disable/enable().  When used on hot paths, it's gonna generate
a lot of cacheline pingpongs.  This patch is essentially creating a
new big lock which has potential for being very hot.

preempt_disable/enable() + stop_machine() essentially works as percpu
rwlock with very heavy penalty on the writer side.  Because the reader
side doesn't even implement spinning while writer is in progress, the
writer side has to preempt the readers before entering critical
section and that's what the "stopping machine" is about.

Note that the resolution on the reader side is very low.  Any section
w/ preemption disabled is protected against stop_machine().  Also, the
stop_machine() itself is extremely heavy involving essentially locking
up the machine until all CPUs can reach the same condition via
scheduling the stop_machine tasks.  So, I *think* all you need to do
here is making cpu online locking finer grained (separated from
preemption) and lighten the writer side a bit.  I'm quite doubtful
that you would need to go hunting donw all get_online_cpus().  They
aren't used that often anyway.

Anyways, so, separating out cpu hotplug locking from preemption is the
right thing to do but I think rwlock is likely to be too heavy on the
reader side.  I think percpu reader accounting + reader spinning while
writer in progress should be a good combination.  It's a bit heavier
than preempt_disable() - it'll have an extra conditional jump on the
hot path, but there won't be any cacheline bouncing.  The writer side
would need to synchronize against all CPUs but only against the ones
actually read locking cpu hotplug.  As long as reader side critical
sections don't go crazy, it should be okay.

So, we basically need percpu_rwlock.  We already have percpu_rwsem.
We used to have some different variants of writer-heavy locks.  Dunno
what happened to them.  Maybe we still have it somewhere.  Oleg has
been working on the area lately and should know more.  Oleg, it seems
CPU hotplug needs big-reader rwlock, ideas on how to proceed?

Thanks.

-- tejun 

^ permalink raw reply

* [RFC PATCH v2 08/10] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
From: Srivatsa S. Bhat @ 2012-12-05 18:44 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on local_irq_save() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic_light() APIs to prevent changes to the
cpu_online_mask, while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84a8579..1ef595a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4312,6 +4312,7 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	unsigned long flags;
 	bool yielded = 0;
 
+	get_online_cpus_atomic_light();
 	local_irq_save(flags);
 	rq = this_rq();
 
@@ -4339,13 +4340,14 @@ again:
 		 * Make p's CPU reschedule; pick_next_entity takes care of
 		 * fairness.
 		 */
-		if (preempt && rq != p_rq)
+		if (preempt && rq != p_rq && cpu_online(task_cpu(p)))
 			resched_task(p_rq->curr);
 	}
 
 out:
 	double_rq_unlock(rq, p_rq);
 	local_irq_restore(flags);
+	put_online_cpus_atomic_light();
 
 	if (yielded)
 		schedule();


^ permalink raw reply related

* [RFC PATCH v2 10/10] cpu: No more __stop_machine() in _cpu_down()
From: Srivatsa S. Bhat @ 2012-12-05 18:45 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

From: Paul E. McKenney <paul.mckenney@linaro.org>

The _cpu_down() function invoked as part of the CPU-hotplug offlining
process currently invokes __stop_machine(), which is slow and inflicts
substantial real-time latencies on the entire system.  This patch
substitutes stop_cpus() for __stop_machine() in order to improve
both performance and real-time latency.

This is currently unsafe, because there are a number of uses of
preempt_disable() that are intended to block CPU-hotplug offlining.
These will be fixed by using get/put_online_cpus_atomic_light(), or
get/put_online_cpus_atomic_full(), but in the meantime, this commit is one
way to help locate them.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ Srivatsa: Refer to the new sync primitives for readers, in the changelog ]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/cpu.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index c71c723..00a1edc 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -418,7 +418,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	}
 	smpboot_park_threads(cpu);

-	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+	err = stop_cpus(cpumask_of(cpu), take_cpu_down, &tcd_param);
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
 		smpboot_unpark_threads(cpu);

^ permalink raw reply related

* [RFC PATCH v2 03/10] CPU hotplug: Convert preprocessor macros to static inline functions
From: Srivatsa S. Bhat @ 2012-12-05 18:43 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

On 12/05/2012 06:10 AM, Andrew Morton wrote:
"static inline C functions would be preferred if possible.  Feel free to
fix up the wrong crufty surrounding code as well ;-)"

Convert the macros in the CPU hotplug code to static inline C functions.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index e2a9c49..599b376 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -200,12 +200,12 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #else		/* CONFIG_HOTPLUG_CPU */
 
-#define get_online_cpus()	do { } while (0)
-#define put_online_cpus()	do { } while (0)
-#define get_online_cpus_atomic_light()	do { } while (0)
-#define put_online_cpus_atomic_light()	do { } while (0)
-#define get_online_cpus_atomic_full()	do { } while (0)
-#define put_online_cpus_atomic_full()	do { } while (0)
+static inline void get_online_cpus(void) {}
+static inline void put_online_cpus(void) {}
+static inline void get_online_cpus_atomic_light(void) {}
+static inline void put_online_cpus_atomic_light(void) {}
+static inline void get_online_cpus_atomic_full(void) {}
+static inline void put_online_cpus_atomic_full(void) {}
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })


^ permalink raw reply related

* [RFC PATCH v2 02/10] CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:43 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

Some of the atomic hotplug readers cannot tolerate CPUs going offline while
they are in their critical section. That is, they can't get away with just
synchronizing with the updates to the cpu_online_mask; they really need to
synchronize with the entire CPU tear-down sequence, because they are very
much involved in the hotplug related code paths.

Such "full" atomic hotplug readers need a way to *actually* and *truly*
prevent CPUs from going offline while they are active.

The intent of this patch is to provide synchronization APIs for such "full"
atomic hotplug readers. [ get/put_online_cpus_atomic_full()]

Some important design requirements and considerations:
-----------------------------------------------------

1. Scalable synchronization

Any synchronization at the atomic hotplug readers side must be highly
scalable - avoid global single-holder locks/counters etc. Because, these
paths currently use the extremely fast preempt_disable(); our replacement
to preempt_disable() should not become ridiculously costly and also should
not serialize the readers among themselves needlessly.

2. Should not have ABBA deadlock possibilities between the 2 types of atomic
readers ("light" vs "full")

Atomic readers who can get away with a stable online mask ("light" readers)
and atomic readers who need full synchronization with CPU offline ("full"
readers) must not end up in situations leading to ABBA deadlocks because of
the APIs they use respectively.

Also, we should not impose any ordering restrictions on how the 2 types of
readers can nest. They should be allowed to nest freely in any way they want,
and we should provide the guarantee that they won't deadlock.
(preempt_disable() posed no ordering restrictions before. Neither should we).

3. preempt_disable() was recursive. The replacement should also be recursive.

Implementation of the design:
----------------------------

Basically, we use another reader-writer lock for synchronizing the "full"
hotplug readers with the writer. But since we want to avoid ABBA deadlock
possibilities, we need to be careful as well as clever while designing this
"full" reader APIs.

Simplification: All "full" readers are also "light" readers
-----------------------------------------------------------

This simplification helps us get rid of ABBA deadlock possibilites, because
the lock ordering remains consistent to both types of readers, and looks
something like this:

Light reader:
------------
    Take light-lock for read

        /* Critical section */

    Release the light-lock

Full reader:
-----------
    Take light-lock for read

        Take full-lock for read

            /* Critical section */

        Release the full-lock

    Release the light-lock

But then, the writer path should be cleverly designed in such a way that
after the update to cpu_online_mask, only the light-readers can continue, but
the full-readers continue to spin until entire CPU offline operation is
complete.

So the lock ordering in the writer should look like this:

Writer:
------

    Take light-lock for write

        Take the full-lock for write

            Update cpu_online_mask (flip the bit)

    /*
     * Now allow only the light-readers to continue, while keeping the
     * full-readers spinning (ie., release the light-lock alone).
     */
    Release the light-lock

    /* Continue CPU tear-down, calling CPU_DYING notifiers */

    /* Finally, allow the full-readers to continue */
    Release the full-lock

It can be verified that, with this scheme, there is no possibility of any
ABBA deadlocks, and that the 2 types of atomic readers can nest in any
way they want, without fear.

We expect that the atomic hotplug readers who need full synchronization
with CPU offline (and cannot just get away with a stable online mask), be
rare. Otherwise, we could end up creating a similar effect as
stop_machine() without even using stop_machine()! [That is, if too many
readers are of this kind, everybody will wait for the entire CPU offline to
finish, which is almost like having stop_machine() itself.]

So we hope that most atomic hotplug readers are of the "light" type.
That would keeps things fast and scalable and make CPU offline operations
painless.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 ++++
 kernel/cpu.c        |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index dd0a3ee..e2a9c49 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -177,6 +177,8 @@ extern void get_online_cpus(void);
 extern void put_online_cpus(void);
 extern void get_online_cpus_atomic_light(void);
 extern void put_online_cpus_atomic_light(void);
+extern void get_online_cpus_atomic_full(void);
+extern void put_online_cpus_atomic_full(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -202,6 +204,8 @@ static inline void cpu_hotplug_driver_unlock(void)
 #define put_online_cpus()	do { } while (0)
 #define get_online_cpus_atomic_light()	do { } while (0)
 #define put_online_cpus_atomic_light()	do { } while (0)
+#define get_online_cpus_atomic_full()	do { } while (0)
+#define put_online_cpus_atomic_full()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 381593c..c71c723 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -112,6 +112,46 @@ void put_online_cpus_atomic_light(void)
 }
 EXPORT_SYMBOL_GPL(put_online_cpus_atomic_light);

+/*
+ * Reader-writer lock to synchronize between "full/heavy" atomic hotplug
+ * readers and the hotplug writer while doing CPU offline operation.
+ * "Full/heavy" atomic hotplug readers are those who need to synchronize
+ * with the full CPU take-down sequence, and not just the bit flip in the
+ * cpu_online_mask.
+ */
+static DEFINE_RWLOCK(full_hotplug_rwlock);
+
+/*
+ * Some atomic hotplug readers need to synchronize with the entire CPU
+ * tear-down sequence, and not just with the update of the cpu_online_mask.
+ * Such readers are called "full" atomic hotplug readers.
+ *
+ * The following APIs help them synchronize fully with the CPU offline
+ * operation.
+ *
+ * You can call this function recursively.
+ *
+ * Also, you can mix and match (nest) "full" and "light" atomic hotplug
+ * readers in any way you want (without worrying about their ordering).
+ * The respective APIs have been designed in such a way as to provide
+ * the guarantee that you won't end up in a deadlock.
+ */
+void get_online_cpus_atomic_full(void)
+{
+	preempt_disable();
+	read_lock(&light_hotplug_rwlock);
+	read_lock(&full_hotplug_rwlock);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic_full);
+
+void put_online_cpus_atomic_full(void)
+{
+	read_unlock(&full_hotplug_rwlock);
+	read_unlock(&light_hotplug_rwlock);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic_full);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -318,9 +358,13 @@ static int __ref take_cpu_down(void *_param)
 	 */
 	write_lock_irqsave(&light_hotplug_rwlock, flags);

+	/* Disable the atomic hotplug readers who need full synchronization */
+	write_lock(&full_hotplug_rwlock);
+
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
 	if (err < 0) {
+		write_unlock(&full_hotplug_rwlock);
 		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
 		return err;
 	}
@@ -338,6 +382,9 @@ static int __ref take_cpu_down(void *_param)

 	cpu_notify(CPU_DYING | param->mod, param->hcpu);

+	/* Enable the atomic hotplug readers who need full synchronization */
+	write_unlock(&full_hotplug_rwlock);
+
 	local_irq_restore(flags);
 	return 0;
 }

^ permalink raw reply related

* [RFC PATCH v2 09/10] kvm, vmx: Add full atomic synchronization with CPU Hotplug
From: Srivatsa S. Bhat @ 2012-12-05 18:44 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

preempt_disable() will no longer help prevent CPUs from going offline, once
stop_machine() gets removed from the CPU offline path. So use
get/put_online_cpus_atomic_full() in vmx_vcpu_load() to prevent CPUs from
going offline while clearing vmcs. Here we truly need full-synchronization
with CPU hotplug (and not just an unchanging cpu_online_mask), because we
want to prevent race with the CPU_DYING callback from kvm.

Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Debugged-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 arch/x86/kvm/vmx.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..23c1063 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1519,10 +1519,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
 
-	if (!vmm_exclusive)
+	if (!vmm_exclusive) {
 		kvm_cpu_vmxon(phys_addr);
-	else if (vmx->loaded_vmcs->cpu != cpu)
+	} else if (vmx->loaded_vmcs->cpu != cpu) {
+		/* Prevent any CPU from going offline */
+		get_online_cpus_atomic_full();
 		loaded_vmcs_clear(vmx->loaded_vmcs);
+		put_online_cpus_atomic_full();
+	}
 
 	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;

^ permalink raw reply related

* [RFC PATCH v2 01/10] CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
From: Srivatsa S. Bhat @ 2012-12-05 18:43 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

There are places where preempt_disable() is used to prevent any CPU from
going offline during the critical section. Let us call them as "atomic
hotplug readers" (atomic because they run in atomic contexts).

Often, these atomic hotplug readers have a simple need : they want the cpu
online mask that they work with (inside their critical section), to be
stable, i.e., it should be guaranteed that CPUs in that mask won't go
offline during the critical section. The important point here is that
they don't really need to synchronize with the actual CPU tear-down
sequence. All they need is synchronization with the updates to the
cpu_online_mask. (Hence the term "light", for light-weight).

The intent of this patch is to provide synchronization APIs for such
"light" atomic hotplug readers. [ get/put_online_cpus_atomic_light() ]

Fundamental idea behind the design:
-----------------------------------

Simply put, in the hotplug writer path, have appropriate locking around the
update to the cpu_online_mask in the CPU tear-down sequence. And once the
update is done, release the lock and allow the "light" atomic hotplug readers
to go ahead. Meanwhile, the hotplug writer can safely continue the actual
CPU tear-down sequence (running CPU_DYING notifiers etc) since the "light"
atomic readers don't really care about those operations (and hence don't
need to synchronize with them).

Also, once the hotplug writer completes taking the CPU offline, it should
not start any new cpu_down() operations until all existing "light" atomic
hotplug readers have completed.

Some important design requirements and considerations:
-----------------------------------------------------

1. The "light" atomic hotplug readers should ideally *never* have to wait
   for the hotplug writer (cpu_down()) for too long (like entire duration
   of CPU offline, for example). Because, these atomic hotplug readers can be
   in very hot-paths like interrupt handling/IPI and hence, if they have to
   wait for an ongoing cpu_down() to complete, it would pretty much
   introduce the same performance/latency problems as stop_machine().

2. Any synchronization at the atomic hotplug readers side must be highly
   scalable - avoid global single-holder locks/counters etc. Because, these
   paths currently use the extremely fast preempt_disable(); our replacement
   to preempt_disable() should not become ridiculously costly and also should
   not serialize the readers among themselves needlessly.

3. preempt_disable() was recursive. The replacement should also be recursive.

Implementation of the design:
----------------------------

At the core, we use a reader-writer lock to synchronize the update to the
cpu_online_mask. That way, multiple "light" atomic hotplug readers can co-exist
and the writer can acquire the lock only when all the readers have completed.

Once acquired, the writer holds the "light" lock only during the duration of
the update to the cpu_online_mask. That way, the readers don't have to spin
for too long (ie., the write-hold-time for the "light" lock is tiny), which
keeps the readers in good shape.

Reader-writer lock are recursive, so they can be used in a nested fashion in
the reader-path. Together, these satisfy all the 3 requirements mentioned above.

Also, since we don't use per-cpu locks (because rwlocks themselves are quite
scalable for readers), we don't end up in any lock ordering problems that can
occur if we try to use per-cpu locks.

I'm indebted to Michael Wang and Xiao Guangrong for their numerous thoughtful
suggestions and ideas, which inspired and influenced many of the decisions in
this as well as previous designs. Thanks a lot Michael and Xiao!

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/cpu.h |    4 ++
 kernel/cpu.c        |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..dd0a3ee 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;

 extern void get_online_cpus(void);
 extern void put_online_cpus(void);
+extern void get_online_cpus_atomic_light(void);
+extern void put_online_cpus_atomic_light(void);
 #define hotcpu_notifier(fn, pri)	cpu_notifier(fn, pri)
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
@@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)

 #define get_online_cpus()	do { } while (0)
 #define put_online_cpus()	do { } while (0)
+#define get_online_cpus_atomic_light()	do { } while (0)
+#define put_online_cpus_atomic_light()	do { } while (0)
 #define hotcpu_notifier(fn, pri)	do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 42bd331..381593c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,6 +49,69 @@ static int cpu_hotplug_disabled;

 #ifdef CONFIG_HOTPLUG_CPU

+/*
+ * Reader-writer lock to synchronize between "light" atomic hotplug readers
+ * and the hotplug writer while updating cpu_online_mask.
+ * "Light" atomic hotplug readers are those who don't really need to
+ * synchronize with the actual CPU bring-up/take-down sequence, but only
+ * need to synchronize with the updates to the cpu_online_mask.
+ */
+static DEFINE_RWLOCK(light_hotplug_rwlock);
+
+/*
+ * Hotplug readers (those that want to prevent CPUs from coming online or
+ * going offline ) sometimes run from atomic contexts, and hence can't use
+ * get/put_online_cpus() because they can sleep. And often-times, all
+ * they really want is that the cpu_online_mask remain unchanged while
+ * they are executing in their critical section. They also don't really
+ * need to synchronize with the actual CPU tear-down sequence. Such atomic
+ * hotplug readers are called "light" readers (light for light-weight).
+ *
+ * These "light" atomic hotplug readers can use the APIs
+ * get/put_online_atomic_light() around their critical sections to
+ * ensure that the cpu_online_mask remains unaltered throughout that
+ * critical section.
+ *
+ * Caution!: While the readers are in their critical section, a CPU offline
+ * operation can actually happen under the covers; its just that the bit flip
+ * in the cpu_online_mask will be synchronized properly if you use these APIs.
+ * If you really want full synchronization with the entire CPU tear-down
+ * sequence, then you are not a "light" hotplug reader. So don't use these
+ * APIs!
+ *
+ * Eg:
+ *
+ * "Light" atomic hotplug read-side critical section:
+ * --------------------------------------------------
+ *
+ * get_online_cpus_atomic_light();
+ *
+ * for_each_online_cpu(cpu) {
+ *         ... Do something...
+ * }
+ *    ...
+ *
+ * if (cpu_online(other_cpu))
+ *         do_something();
+ *
+ * put_online_cpus_atomic_light();
+ *
+ * You can call this function recursively.
+ */
+void get_online_cpus_atomic_light(void)
+{
+	preempt_disable();
+	read_lock(&light_hotplug_rwlock);
+}
+EXPORT_SYMBOL_GPL(get_online_cpus_atomic_light);
+
+void put_online_cpus_atomic_light(void)
+{
+	read_unlock(&light_hotplug_rwlock);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(put_online_cpus_atomic_light);
+
 static struct {
 	struct task_struct *active_writer;
 	struct mutex lock; /* Synchronizes accesses to refcount, */
@@ -246,14 +309,36 @@ struct take_cpu_down_param {
 static int __ref take_cpu_down(void *_param)
 {
 	struct take_cpu_down_param *param = _param;
+	unsigned long flags;
 	int err;

+	/*
+	 *  __cpu_disable() is the step where the CPU is removed from the
+	 *  cpu_online_mask. Protect it with the light-lock held for write.
+	 */
+	write_lock_irqsave(&light_hotplug_rwlock, flags);
+
 	/* Ensure this CPU doesn't handle any more interrupts. */
 	err = __cpu_disable();
-	if (err < 0)
+	if (err < 0) {
+		write_unlock_irqrestore(&light_hotplug_rwlock, flags);
 		return err;
+	}
+
+	/*
+	 * We have successfully removed the CPU from the cpu_online_mask.
+	 * So release the light-lock, so that the light-weight atomic readers
+	 * (who care only about the cpu_online_mask updates, and not really
+	 * about the actual cpu-take-down operation) can continue.
+	 *
+	 * But don't enable interrupts yet, because we still have work left to
+	 * do, to actually bring the CPU down.
+	 */
+	write_unlock(&light_hotplug_rwlock);

 	cpu_notify(CPU_DYING | param->mod, param->hcpu);
+
+	local_irq_restore(flags);
 	return 0;
 }

^ permalink raw reply related

* [RFC PATCH v2 00/10][RESEND] CPU hotplug: stop_machine()-free CPU hotplug
From: Srivatsa S. Bhat @ 2012-12-05 18:42 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel

[Resending because I forgot to CC lkml last time. Please send your replies to
this thread instead of the previous (non-public) one. Sorry for the trouble!]

Hi,

This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
offline path and provides an alternative (set of APIs) to preempt_disable() to
prevent CPUs from going offline, which can be invoked from atomic context.

This is an RFC patchset with only a few call-sites of preempt_disable()
converted to the new APIs for now, and the main goal is to get feedback on the
design of the new atomic APIs and see if it serves as a viable replacement for
stop_machine()-free CPU hotplug.

Overview of the patches:
-----------------------

Patch 1 introduces the new APIs that can be used from atomic context, to
prevent CPUs from going offline. This set of APIs is for "light" atomic
hotplug readers (the term "light" is explained in the changelog/comments).

Patch 2 introduces the new APIs for "full" atomic hotplug readers.

Patch 3 is a cleanup; it converts preprocessor macros to static inline
functions.

Patches 4 to 9 convert various call-sites to use the new APIs.
(Patches 4 to 8 deal with "light" readers, patch 9 deals with a "full"
reader. Together, they demonstrate how to use these APIs effectively).

Patch 10 is the one which actually removes stop_machine() from the CPU
offline path.

Changes in v2:
-------------
* Completely redesigned the synchronization scheme to avoid using any extra
  cpumasks.

* Provided APIs for 2 types of atomic hotplug readers: "light" (for
  light-weight) and "full". We wish to have more "light" readers than
  the "full" ones, to avoid indirectly inducing the "stop_machine effect"
  without even actually using stop_machine().

  And the patches show that it _is_ generally true: 5 patches deal with
  "light" readers, whereas only 1 patch deals with a "full" reader.

  Also, the "light" readers happen to be in very hot paths. So it makes a
  lot of sense to have such a distinction and a corresponding light-weight
  API.

v1: https://lkml.org/lkml/2012/12/4/88

Comments and suggestions welcome!

--
 Paul E. McKenney (1):
      cpu: No more __stop_machine() in _cpu_down()

Srivatsa S. Bhat (9):
      CPU hotplug: Provide APIs for "light" atomic readers to prevent CPU offline
      CPU hotplug: Provide APIs for "full" atomic readers to prevent CPU offline
      CPU hotplug: Convert preprocessor macros to static inline functions
      smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
      smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
      sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
      kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
      yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
      kvm, vmx: Add full atomic synchronization with CPU Hotplug

  arch/x86/kvm/vmx.c  |    8 ++-
 include/linux/cpu.h |   12 ++++-
 kernel/cpu.c        |  136 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c |   22 +++++++-
 kernel/smp.c        |   64 +++++++++++++++---------
 5 files changed, 210 insertions(+), 32 deletions(-)

Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center

^ permalink raw reply

* [RFC PATCH v2 07/10] kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
From: Srivatsa S. Bhat @ 2012-12-05 18:44 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic_light() APIs to prevent changes to the
cpu_online_mask, while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c9a331e..84a8579 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1091,11 +1091,11 @@ void kick_process(struct task_struct *p)
 {
 	int cpu;
 
-	preempt_disable();
+	get_online_cpus_atomic_light();
 	cpu = task_cpu(p);
 	if ((cpu != smp_processor_id()) && task_curr(p))
 		smp_send_reschedule(cpu);
-	preempt_enable();
+	put_online_cpus_atomic_light();
 }
 EXPORT_SYMBOL_GPL(kick_process);
 #endif /* CONFIG_SMP */

^ permalink raw reply related

* [RFC PATCH v2 06/10] sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & select_task_rq()
From: Srivatsa S. Bhat @ 2012-12-05 18:44 UTC (permalink / raw)
  To: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
	vincent.guittot, tj, oleg
  Cc: sbw, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
	xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121205184041.3750.64945.stgit@srivatsabhat.in.ibm.com>

Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic_light() APIs to prevent changes to the
cpu_online_mask, while invoking from atomic context.

Scheduler functions such as try_to_wake_up() and select_task_rq() (and even
select_fallback_rq()) deal with picking new CPUs to run tasks. They care only
about the cpu_online_mask and not really about the actual CPU tear-down. So
they qualify as "light" atomic readers. So use the _light() variant of the APIs.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 kernel/sched/core.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..c9a331e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1103,6 +1103,10 @@ EXPORT_SYMBOL_GPL(kick_process);
 #ifdef CONFIG_SMP
 /*
  * ->cpus_allowed is protected by both rq->lock and p->pi_lock
+ *
+ *  Must be called under get/put_online_cpus_atomic_light() or
+ *  equivalent, to avoid CPUs from going offline from underneath
+ *  us.
  */
 static int select_fallback_rq(int cpu, struct task_struct *p)
 {
@@ -1166,6 +1170,9 @@ out:
 
 /*
  * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
+ *
+ * Must be called under get/put_online_cpus_atomic_light(), to prevent
+ * CPUs from going offline from underneath us.
  */
 static inline
 int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
@@ -1406,6 +1413,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	int cpu, success = 0;
 
 	smp_wmb();
+	get_online_cpus_atomic_light();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	if (!(p->state & state))
 		goto out;
@@ -1446,6 +1454,7 @@ stat:
 	ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_atomic_light();
 
 	return success;
 }
@@ -1624,6 +1633,7 @@ void wake_up_new_task(struct task_struct *p)
 	unsigned long flags;
 	struct rq *rq;
 
+	get_online_cpus_atomic_light();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 #ifdef CONFIG_SMP
 	/*
@@ -1644,6 +1654,7 @@ void wake_up_new_task(struct task_struct *p)
 		p->sched_class->task_woken(rq, p);
 #endif
 	task_rq_unlock(rq, p, &flags);
+	put_online_cpus_atomic_light();
 }
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
@@ -2541,6 +2552,7 @@ void sched_exec(void)
 	unsigned long flags;
 	int dest_cpu;
 
+	get_online_cpus_atomic_light();
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
 	dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
 	if (dest_cpu == smp_processor_id())
@@ -2550,11 +2562,13 @@ void sched_exec(void)
 		struct migration_arg arg = { p, dest_cpu };
 
 		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		put_online_cpus_atomic_light();
 		stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
 		return;
 	}
 unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	put_online_cpus_atomic_light();
 }
 
 #endif

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox