public inbox for linux-rt-users@vger.kernel.org
* [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties
@ 2012-03-01 18:55 Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 1/9][RFC] [PATCH 1/9] timer: Fix hotplug for -rt Steven Rostedt
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

On both 3.0-rt and 3.2-rt I ran a stress test on CPU hotplug, and both
failed miserably. These bugs also show themselves on suspend.

After spending 5 long days debugging this and finding lots of bugs,
I have a patch series that makes CPU hotplug rather stable.

The first few patches fix other broken RT features that were discovered
while debugging hotplug.

Patch 7 is a rework of how RT handles taking down a CPU. There were
several corner cases that the original approach failed on, and this
rework seems to cover them now.

Patch 8 and 9 revert the rework that was done on workqueue for RT.
Yes, letting workqueues play games with THREAD_BOUND is bad, but
the fixes introduced several more races that were very hard to fix.
Every fix seemed to produce a new race; it was a bad game of
cat and mouse crossed with a wild goose chase and whack-a-mole!

Since the reworked CPU down code no longer has issues with
workqueues messing with THREAD_BOUND, reverting the patches that
reworked workqueues fixed the bugs that I was hitting.

Let's fix CPU hotplug today, and leave the redesign of workqueues
for another day.

Comments welcome,

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH RT 1/9][RFC] [PATCH 1/9] timer: Fix hotplug for -rt
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 2/9][RFC] [PATCH 2/9] futex/rt: Fix possible lockup when taking pi_lock in proxy handler Steven Rostedt
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0001-timer-Fix-hotplug-for-rt.patch --]
[-- Type: text/plain, Size: 1690 bytes --]

Revert the RT patch:
    Author: Ingo Molnar <mingo@elte.hu>
    Date:   Fri Jul 3 08:30:32 2009 -0500
    timers: fix timer hotplug on -rt

    Here we are in the CPU_DEAD notifier, and we must not sleep nor
    enable interrupts.

There's no problem with sleeping in this notifier.

But the get_cpu_var() does need to be converted to get_local_var().

Replace the previous fix with the get_local_var() conversion.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/timer.c |   16 +++++-----------
 1 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index a791a43..c8c88d1 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1743,21 +1743,17 @@ static void __cpuinit migrate_timers(int cpu)
 {
 	struct tvec_base *old_base;
 	struct tvec_base *new_base;
-	unsigned long flags;
 	int i;
 
 	BUG_ON(cpu_online(cpu));
 	old_base = per_cpu(tvec_bases, cpu);
-	new_base = get_cpu_var(tvec_bases);
+	new_base = get_local_var(tvec_bases);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
 	 */
-	local_irq_save(flags);
-	while (!spin_trylock(&new_base->lock))
-		cpu_relax();
-	while (!spin_trylock(&old_base->lock))
-		cpu_relax();
+	spin_lock_irq(&new_base->lock);
+	spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
 	BUG_ON(old_base->running_timer);
 
@@ -1771,10 +1767,8 @@ static void __cpuinit migrate_timers(int cpu)
 	}
 
 	spin_unlock(&old_base->lock);
-	spin_unlock(&new_base->lock);
-	local_irq_restore(flags);
-
-	put_cpu_var(tvec_bases);
+	spin_unlock_irq(&new_base->lock);
+	put_local_var(tvec_bases);
 }
 #endif /* CONFIG_HOTPLUG_CPU */
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 2/9][RFC] [PATCH 2/9] futex/rt: Fix possible lockup when taking pi_lock in proxy handler
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 1/9][RFC] [PATCH 1/9] timer: Fix hotplug for -rt Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code Steven Rostedt
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0002-futex-rt-Fix-possible-lockup-when-taking-pi_lock-in-.patch --]
[-- Type: text/plain, Size: 1090 bytes --]

When taking the pi_lock, we must disable interrupts because the
pi_lock can also be taken in an interrupt handler.

Use raw_spin_lock_irq() instead of raw_spin_lock().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/rtmutex.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 9850dc0..b525158 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -1373,14 +1373,14 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	 * PI_REQUEUE_INPROGRESS, so that if the task is waking up
 	 * it will know that we are in the process of requeuing it.
 	 */
-	raw_spin_lock(&task->pi_lock);
+	raw_spin_lock_irq(&task->pi_lock);
 	if (task->pi_blocked_on) {
-		raw_spin_unlock(&task->pi_lock);
+		raw_spin_unlock_irq(&task->pi_lock);
 		raw_spin_unlock(&lock->wait_lock);
 		return -EAGAIN;
 	}
 	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
-	raw_spin_unlock(&task->pi_lock);
+	raw_spin_unlock_irq(&task->pi_lock);
 #endif
 
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
-- 
1.7.3.4



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 1/9][RFC] [PATCH 1/9] timer: Fix hotplug for -rt Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 2/9][RFC] [PATCH 2/9] futex/rt: Fix possible lockup when taking pi_lock in proxy handler Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-02  7:25   ` Srivatsa S. Bhat
  2012-03-01 18:55 ` [PATCH RT 4/9][RFC] [PATCH 4/9] rtmutex: Add new mutex_lock_savestate() API Steven Rostedt
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0003-lglock-rt-Use-non-rt-for_each_cpu-in-rt-code.patch --]
[-- Type: text/plain, Size: 3570 bytes --]

Currently the RT version of lglocks does a for_each_online_cpu()
in the name##_global_lock_online() functions. Non-RT uses its own
mask for this, and for good reason.

A task may grab a *_global_lock_online(), and in the meantime one
of the CPUs goes offline. When that task then does a *_global_unlock_online(),
it releases all the locks *except* the one for the CPU that went offline.

Now if that CPU comes back online, its lock is owned by a
task that never released it when it should have.

This causes all sorts of fun errors, like owners of a lock no longer
existing, or tasks sleeping on IO, waiting to be woken up by a task that
happens to be blocked on the lock that was never released.

Convert the RT versions to use lglock-specific cpumasks. The mask is
updated under the name##_cpu_lock spinlock, which is held across the
global lock/unlock pair, so the set of per-CPU locks released always
matches the set that was taken, even if a CPU goes offline in between.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/lglock.h |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 52b289f..cdfcef3 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -203,9 +203,31 @@
 #else /* !PREEMPT_RT_FULL */
 #define DEFINE_LGLOCK(name)						\
 									\
- DEFINE_PER_CPU(struct rt_mutex, name##_lock);					\
+ DEFINE_PER_CPU(struct rt_mutex, name##_lock);				\
+ DEFINE_SPINLOCK(name##_cpu_lock);					\
+ cpumask_t name##_cpus __read_mostly;					\
  DEFINE_LGLOCK_LOCKDEP(name);						\
 									\
+ static int								\
+ name##_lg_cpu_callback(struct notifier_block *nb,			\
+				unsigned long action, void *hcpu)	\
+ {									\
+	switch (action & ~CPU_TASKS_FROZEN) {				\
+	case CPU_UP_PREPARE:						\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_set((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+		break;							\
+	case CPU_UP_CANCELED: case CPU_DEAD:				\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_clear((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+	}								\
+	return NOTIFY_OK;						\
+ }									\
+ static struct notifier_block name##_lg_cpu_notifier = {		\
+	.notifier_call = name##_lg_cpu_callback,			\
+ };									\
  void name##_lock_init(void) {						\
 	int i;								\
 	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
@@ -214,6 +236,11 @@
 		lock = &per_cpu(name##_lock, i);			\
 		rt_mutex_init(lock);					\
 	}								\
+	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
+	get_online_cpus();						\
+	for_each_online_cpu(i)						\
+		cpu_set(i, name##_cpus);				\
+	put_online_cpus();						\
  }									\
  EXPORT_SYMBOL(name##_lock_init);					\
 									\
@@ -254,7 +281,8 @@
  void name##_global_lock_online(void) {					\
 	int i;								\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	spin_lock(&name##_cpu_lock);					\
+	for_each_cpu(i, &name##_cpus) {					\
 		struct rt_mutex *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		__rt_spin_lock(lock);					\
@@ -265,11 +293,12 @@
  void name##_global_unlock_online(void) {				\
 	int i;								\
 	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, &name##_cpus) {					\
 		struct rt_mutex *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		__rt_spin_unlock(lock);					\
 	}								\
+	spin_unlock(&name##_cpu_lock);					\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\
 									\
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 4/9][RFC] [PATCH 4/9] rtmutex: Add new mutex_lock_savestate() API
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (2 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 5/9][RFC] [PATCH 5/9] ring-buffer/rt: Check for irqs disabled before grabbing reader lock Steven Rostedt
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0004-rtmutex-Add-new-mutex_lock_savestate-API.patch --]
[-- Type: text/plain, Size: 4387 bytes --]

To handle the race between CPU hotplug and migrate_disable(), the
cpu_hotplug.lock mutex is taken to keep tasks from being pinned to a
CPU that is going down. This mutex is taken by all locations that do
migrate_disable(), which includes spin_locks() that are converted to
mutexes for PREEMPT_RT.

The problem is that grabbing the cpu_hotplug.lock mutex does not save the
task's state (TASK_INTERRUPTIBLE, etc.), and may cause a task to not sleep
when it is expected to. This causes lots of unexpected, hard-to-debug errors.

Create a new mutex_lock_savestate() API that lets the migrate_disable()
code for RT take the cpu_hotplug.lock mutex and still maintain the task's
state when the lock is released.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/mutex.h    |    2 ++
 include/linux/mutex_rt.h |    2 ++
 include/linux/rtmutex.h  |    1 +
 kernel/rt.c              |    6 ++++++
 kernel/rtmutex.c         |   12 ++++++++++++
 5 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index bdf1da2..910696b 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -163,6 +163,8 @@ extern int __must_check mutex_lock_killable(struct mutex *lock);
 # define mutex_lock_nest_lock(lock, nest_lock) mutex_lock(lock)
 #endif
 
+#define mutex_lock_savestate(lock) mutex_lock(lock)
+
 /*
  * NOTE: mutex_trylock() follows the spin_trylock() convention,
  *       not the down_trylock() convention!
diff --git a/include/linux/mutex_rt.h b/include/linux/mutex_rt.h
index c38a44b..fbdef29 100644
--- a/include/linux/mutex_rt.h
+++ b/include/linux/mutex_rt.h
@@ -30,6 +30,7 @@ extern void __mutex_do_init(struct mutex *lock, const char *name, struct lock_cl
 extern void __lockfunc _mutex_lock(struct mutex *lock);
 extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock);
 extern int __lockfunc _mutex_lock_killable(struct mutex *lock);
+extern void __lockfunc _mutex_lock_savestate(struct mutex *lock);
 extern void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass);
 extern void __lockfunc _mutex_lock_nest_lock(struct mutex *lock, struct lockdep_map *nest_lock);
 extern int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass);
@@ -41,6 +42,7 @@ extern void __lockfunc _mutex_unlock(struct mutex *lock);
 #define mutex_lock(l)			_mutex_lock(l)
 #define mutex_lock_interruptible(l)	_mutex_lock_interruptible(l)
 #define mutex_lock_killable(l)		_mutex_lock_killable(l)
+#define mutex_lock_savestate(l)		_mutex_lock_savestate(l)
 #define mutex_trylock(l)		_mutex_trylock(l)
 #define mutex_unlock(l)			_mutex_unlock(l)
 #define mutex_destroy(l)		rt_mutex_destroy(&(l)->lock)
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 5ebd0bb..27a4b62 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -111,6 +111,7 @@ extern void rt_mutex_destroy(struct rt_mutex *lock);
 extern void rt_mutex_lock(struct rt_mutex *lock);
 extern int rt_mutex_lock_interruptible(struct rt_mutex *lock,
 						int detect_deadlock);
+extern void rt_mutex_lock_savestate(struct rt_mutex *lock);
 extern int rt_mutex_lock_killable(struct rt_mutex *lock, int detect_deadlock);
 extern int rt_mutex_timed_lock(struct rt_mutex *lock,
 					struct hrtimer_sleeper *timeout,
diff --git a/kernel/rt.c b/kernel/rt.c
index 092d6b3..39641d0 100644
--- a/kernel/rt.c
+++ b/kernel/rt.c
@@ -117,6 +117,12 @@ int __lockfunc _mutex_lock_killable(struct mutex *lock)
 }
 EXPORT_SYMBOL(_mutex_lock_killable);
 
+void __lockfunc _mutex_lock_savestate(struct mutex *lock)
+{
+	mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	rt_mutex_lock_savestate(&lock->lock);
+}
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass)
 {
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b525158..a22c299 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -927,6 +927,18 @@ __rt_spin_lock_init(spinlock_t *lock, char *name, struct lock_class_key *key)
 }
 EXPORT_SYMBOL(__rt_spin_lock_init);
 
+/*
+ * In some special cases in PREEMPT_RT, we need to grab a mutex
+ * but also keep the interruptible state like the rt_spinlocks
+ * do.
+ */
+void __lockfunc rt_mutex_lock_savestate(struct rt_mutex *lock)
+{
+	might_sleep();
+
+	rt_spin_lock_fastlock(lock, rt_spin_lock_slowlock);
+}
+
 #endif /* PREEMPT_RT_FULL */
 
 /**
-- 
1.7.3.4



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 5/9][RFC] [PATCH 5/9] ring-buffer/rt: Check for irqs disabled before grabbing reader lock
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (3 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 4/9][RFC] [PATCH 4/9] rtmutex: Add new mutex_lock_savestate() API Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 6/9][RFC] [PATCH 6/9] sched/rt: Fix wait_task_interactive() to test rt_spin_lock state Steven Rostedt
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0005-ring-buffer-rt-Check-for-irqs-disabled-before-grabbi.patch --]
[-- Type: text/plain, Size: 714 bytes --]

In RT the reader lock is a mutex, and we cannot grab it when preemption
is disabled. The existing in_atomic() check does not cover the case where
only interrupts are disabled. Add that check as well.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 354017f..c060f04 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1054,7 +1054,7 @@ static inline int ok_to_lock(void)
 	if (in_nmi())
 		return 0;
 #ifdef CONFIG_PREEMPT_RT_FULL
-	if (in_atomic())
+	if (in_atomic() || irqs_disabled())
 		return 0;
 #endif
 	return 1;
-- 
1.7.3.4



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 6/9][RFC] [PATCH 6/9] sched/rt: Fix wait_task_interactive() to test rt_spin_lock state
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (4 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 5/9][RFC] [PATCH 5/9] ring-buffer/rt: Check for irqs disabled before grabbing reader lock Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT Steven Rostedt
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0006-sched-rt-Fix-wait_task_interactive-to-test-rt_spin_l.patch --]
[-- Type: text/plain, Size: 1419 bytes --]

wait_task_inactive() has a task sleep while waiting for another
task to reach a certain state. But it ignores the rt_spin_lock state
and can return an incorrect result if the task it is waiting
for is blocked on an rt_spin_lock() and is waking up.

The rt_spin_locks save the task's state in the saved_state field,
and wait_task_inactive() must also test that state.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 81b340d..1cc706d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2450,7 +2450,8 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 		 * is actually now running somewhere else!
 		 */
 		while (task_running(rq, p)) {
-			if (match_state && unlikely(p->state != match_state))
+			if (match_state && unlikely(p->state != match_state)
+			    && unlikely(p->saved_state != match_state))
 				return 0;
 			cpu_relax();
 		}
@@ -2465,7 +2466,8 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 		running = task_running(rq, p);
 		on_rq = p->on_rq;
 		ncsw = 0;
-		if (!match_state || p->state == match_state)
+		if (!match_state || p->state == match_state
+		    || p->saved_state == match_state)
 			ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
 		task_rq_unlock(rq, p, &flags);
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (5 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 6/9][RFC] [PATCH 6/9] sched/rt: Fix wait_task_interactive() to test rt_spin_lock state Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-02 14:52   ` Thomas Gleixner
  2012-03-01 18:55 ` [PATCH RT 8/9][RFC] [PATCH 8/9] workqueue: Revert workqueue: Fix PF_THREAD_BOUND abuse Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 9/9][RFC] [PATCH 9/9] workqueue: Revert workqueue: Fix cpuhotplug trainwreck Steven Rostedt
  8 siblings, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0007-cpu-rt-Rework-cpu-down-for-PREEMPT_RT.patch --]
[-- Type: text/plain, Size: 13375 bytes --]

Bringing a CPU down is a pain with the PREEMPT_RT kernel because
tasks can be preempted in many more places than in non-RT. In
order to handle per_cpu variables, tasks may be pinned to a CPU
for a while, and even sleep. But these tasks need to be off the CPU
if that CPU is going down.

Several synchronization methods have been tried, but when stressed
they failed. This is a new approach.

A sync_tsk thread is still created, and tasks may still block on a
lock when the CPU is going down, but how that works is a bit different.
When cpu_down() starts, it creates the sync_tsk and waits for it
to report that the tasks currently pinned to the CPU are no longer
pinned. New tasks that are about to be pinned are still allowed
to do so at this time.

Then the notifiers are called. Several notifiers will bring down tasks
that enter these pinned sections. Some of these tasks will take locks
held by other tasks that are on the CPU. If we don't let those other
tasks continue, but instead make them block until CPU down is done, the
tasks the notifiers are waiting on will never complete, as they are
waiting for locks held by the blocked tasks.

Thus we still let tasks pin the CPU until the notifiers are done.
After the notifiers run, we then make new tasks entering the pinned
CPU sections grab a mutex and wait. This mutex is now a per-CPU mutex
in the hotplug_pcp descriptor.

To help things along, a new function called migrate_me() is added to
the scheduler code. This function tries to migrate the current task
off the CPU that is going down if possible. When the sync_tsk is created,
all tasks will then try to migrate off the CPU going down. There are
several cases where this won't work, but it helps in most cases.

After the notifiers are called, if a task can't migrate off but enters
the pinned CPU sections, it will be forced to wait on the hotplug_pcp
mutex until the CPU down is complete. Then the scheduler will force the
migration anyway.

Also, I found that THREAD_BOUND tasks need to be accounted for in the
pinned CPU count as well, and migrate_disable() no longer treats them
specially. This helps fix issues with ksoftirqd and workqueue threads
that unbind on CPU down.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/sched.h |    7 ++
 kernel/cpu.c          |  193 ++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched.c        |   82 ++++++++++++++++++++-
 3 files changed, 262 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f6b11a..3453856 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1933,6 +1933,10 @@ extern void do_set_cpus_allowed(struct task_struct *p,
 
 extern int set_cpus_allowed_ptr(struct task_struct *p,
 				const struct cpumask *new_mask);
+int migrate_me(void);
+void tell_sched_cpu_down_begin(int cpu);
+void tell_sched_cpu_down_done(int cpu);
+
 #else
 static inline void do_set_cpus_allowed(struct task_struct *p,
 				      const struct cpumask *new_mask)
@@ -1945,6 +1949,9 @@ static inline int set_cpus_allowed_ptr(struct task_struct *p,
 		return -EINVAL;
 	return 0;
 }
+static inline int migrate_me(void) { return 0; }
+static inline void tell_sched_cpu_down_begin(int cpu) { }
+static inline void tell_sched_cpu_down_done(int cpu) { }
 #endif
 
 #ifndef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa40834..1a74ef5 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -58,10 +58,28 @@ static struct {
 	.refcount = 0,
 };
 
+/**
+ * hotplug_pcp - per cpu hotplug descriptor
+ * @unplug:	set when pin_current_cpu() needs to sync tasks
+ * @sync_tsk:	the task that waits for tasks to finish pinned sections
+ * @refcount:	counter of tasks in pinned sections
+ * @grab_lock:	set when the tasks entering pinned sections should wait
+ * @synced:	notifier for @sync_tsk to tell cpu_down it's finished
+ * @mutex:	the mutex to make tasks wait (used when @grab_lock is true)
+ * @mutex_init:	zero if the mutex hasn't been initialized yet.
+ *
+ * Although @unplug and @sync_tsk may point to the same task, the @unplug
+ * is used as a flag and still exists after @sync_tsk has exited and
+ * @sync_tsk set to NULL.
+ */
 struct hotplug_pcp {
 	struct task_struct *unplug;
+	struct task_struct *sync_tsk;
 	int refcount;
+	int grab_lock;
 	struct completion synced;
+	struct mutex mutex;
+	int mutex_init;
 };
 
 static DEFINE_PER_CPU(struct hotplug_pcp, hotplug_pcp);
@@ -77,18 +95,40 @@ static DEFINE_PER_CPU(struct hotplug_pcp, hotplug_pcp);
 void pin_current_cpu(void)
 {
 	struct hotplug_pcp *hp;
+	int force = 0;
 
 retry:
 	hp = &__get_cpu_var(hotplug_pcp);
 
-	if (!hp->unplug || hp->refcount || preempt_count() > 1 ||
+	if (!hp->unplug || hp->refcount || force || preempt_count() > 1 ||
 	    hp->unplug == current || (current->flags & PF_STOMPER)) {
 		hp->refcount++;
 		return;
 	}
-	preempt_enable();
-	mutex_lock(&cpu_hotplug.lock);
-	mutex_unlock(&cpu_hotplug.lock);
+
+	if (hp->grab_lock) {
+		preempt_enable();
+		mutex_lock_savestate(&hp->mutex);
+		mutex_unlock(&hp->mutex);
+	} else {
+		preempt_enable();
+		/*
+		 * Try to push this task off of this CPU.
+		 */
+		if (!migrate_me()) {
+			preempt_disable();
+			hp = &__get_cpu_var(hotplug_pcp);
+			if (!hp->grab_lock) {
+				/*
+				 * Just let it continue; it's already pinned
+				 * or about to sleep.
+				 */
+				force = 1;
+				goto retry;
+			}
+			preempt_enable();
+		}
+	}
 	preempt_disable();
 	goto retry;
 }
@@ -110,26 +150,84 @@ void unpin_current_cpu(void)
 		wake_up_process(hp->unplug);
 }
 
-/*
- * FIXME: Is this really correct under all circumstances ?
- */
+static void wait_for_pinned_cpus(struct hotplug_pcp *hp)
+{
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	while (hp->refcount) {
+		schedule_preempt_disabled();
+		set_current_state(TASK_UNINTERRUPTIBLE);
+	}
+}
+
 static int sync_unplug_thread(void *data)
 {
 	struct hotplug_pcp *hp = data;
 
 	preempt_disable();
 	hp->unplug = current;
+	wait_for_pinned_cpus(hp);
+
+	/*
+	 * This thread will synchronize the cpu_down() with threads
+	 * that have pinned the CPU. When the pinned CPU count reaches
+	 * zero, we inform the cpu_down code to continue to the next step.
+	 */
 	set_current_state(TASK_UNINTERRUPTIBLE);
-	while (hp->refcount) {
-		schedule_preempt_disabled();
+	preempt_enable();
+	complete(&hp->synced);
+
+	/*
+	 * If all succeeds, the next step will need tasks to wait till
+	 * the CPU is offline before continuing. To do this, the grab_lock
+	 * is set and tasks going into pin_current_cpu() will block on the
+	 * mutex. But we still need to wait for those that are already in
+	 * pinned CPU sections. If the cpu_down() failed, the kthread_should_stop()
+	 * will kick this thread out.
+	 */
+	while (!hp->grab_lock && !kthread_should_stop()) {
+		schedule();
+		set_current_state(TASK_UNINTERRUPTIBLE);
+	}
+
+	/* Make sure grab_lock is seen before we see a stale completion */
+	smp_mb();
+
+	/*
+	 * Now just before cpu_down() enters stop machine, we need to make
+	 * sure all tasks that are in pinned CPU sections are out, and new
+	 * tasks will now grab the lock, keeping them from entering pinned
+	 * CPU sections.
+	 */
+	if (!kthread_should_stop()) {
+		preempt_disable();
+		wait_for_pinned_cpus(hp);
+		preempt_enable();
+		complete(&hp->synced);
+	}
+
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	while (!kthread_should_stop()) {
+		schedule();
 		set_current_state(TASK_UNINTERRUPTIBLE);
 	}
 	set_current_state(TASK_RUNNING);
-	preempt_enable();
-	complete(&hp->synced);
+
+	/*
+	 * Force this thread off this CPU as it's going down and
+	 * we don't want any more work on this CPU.
+	 */
+	current->flags &= ~PF_THREAD_BOUND;
+	do_set_cpus_allowed(current, cpu_present_mask);
+	migrate_me();
 	return 0;
 }
 
+static void __cpu_unplug_sync(struct hotplug_pcp *hp)
+{
+	wake_up_process(hp->sync_tsk);
+	wait_for_completion(&hp->synced);
+}
+
 /*
  * Start the sync_unplug_thread on the target cpu and wait for it to
  * complete.
@@ -137,23 +235,79 @@ static int sync_unplug_thread(void *data)
 static int cpu_unplug_begin(unsigned int cpu)
 {
 	struct hotplug_pcp *hp = &per_cpu(hotplug_pcp, cpu);
-	struct task_struct *tsk;
+	int err;
+
+	/* Protected by cpu_hotplug.lock */
+	if (!hp->mutex_init) {
+		mutex_init(&hp->mutex);
+		hp->mutex_init = 1;
+	}
+
+	/* Inform the scheduler to migrate tasks off this CPU */
+	tell_sched_cpu_down_begin(cpu);
 
 	init_completion(&hp->synced);
-	tsk = kthread_create(sync_unplug_thread, hp, "sync_unplug/%d", cpu);
-	if (IS_ERR(tsk))
-		return (PTR_ERR(tsk));
-	kthread_bind(tsk, cpu);
-	wake_up_process(tsk);
-	wait_for_completion(&hp->synced);
+
+	hp->sync_tsk = kthread_create(sync_unplug_thread, hp, "sync_unplug/%d", cpu);
+	if (IS_ERR(hp->sync_tsk)) {
+		err = PTR_ERR(hp->sync_tsk);
+		hp->sync_tsk = NULL;
+		return err;
+	}
+	kthread_bind(hp->sync_tsk, cpu);
+
+	/*
+	 * Wait for tasks to get out of the pinned sections,
+	 * it's still OK if new tasks enter. Some CPU notifiers will
+	 * wait for tasks that are going to enter these sections and
+	 * we must not have them block.
+	 */
+	__cpu_unplug_sync(hp);
+
 	return 0;
 }
 
+static void cpu_unplug_sync(unsigned int cpu)
+{
+	struct hotplug_pcp *hp = &per_cpu(hotplug_pcp, cpu);
+
+	init_completion(&hp->synced);
+	/* The completion needs to be initialized before setting grab_lock */
+	smp_wmb();
+
+	/* Grab the mutex before setting grab_lock */
+	mutex_lock(&hp->mutex);
+	hp->grab_lock = 1;
+
+	/*
+	 * The CPU notifiers have been completed.
+	 * Wait for tasks to get out of pinned CPU sections and have new
+	 * tasks block until the CPU is completely down.
+	 */
+	__cpu_unplug_sync(hp);
+
+	/* All done with the sync thread */
+	kthread_stop(hp->sync_tsk);
+	hp->sync_tsk = NULL;
+}
+
 static void cpu_unplug_done(unsigned int cpu)
 {
 	struct hotplug_pcp *hp = &per_cpu(hotplug_pcp, cpu);
 
 	hp->unplug = NULL;
+	/* Let all tasks know cpu unplug is finished before cleaning up */
+	smp_wmb();
+
+	if (hp->sync_tsk)
+		kthread_stop(hp->sync_tsk);
+
+	if (hp->grab_lock) {
+		mutex_unlock(&hp->mutex);
+		/* protected by cpu_hotplug.lock */
+		hp->grab_lock = 0;
+	}
+	tell_sched_cpu_down_done(cpu);
 }
 
 void get_online_cpus(void)
@@ -354,6 +508,9 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 		goto out_release;
 	}
 
+	/* Notifiers are done. Don't let any more tasks pin this CPU. */
+	cpu_unplug_sync(cpu);
+
 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
diff --git a/kernel/sched.c b/kernel/sched.c
index 1cc706d..2e31cad 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4404,7 +4404,7 @@ void migrate_disable(void)
 {
 	struct task_struct *p = current;
 
-	if (in_atomic() || p->flags & PF_THREAD_BOUND) {
+	if (in_atomic()) {
 #ifdef CONFIG_SCHED_DEBUG
 		p->migrate_disable_atomic++;
 #endif
@@ -4435,7 +4435,7 @@ void migrate_enable(void)
 	unsigned long flags;
 	struct rq *rq;
 
-	if (in_atomic() || p->flags & PF_THREAD_BOUND) {
+	if (in_atomic()) {
 #ifdef CONFIG_SCHED_DEBUG
 		p->migrate_disable_atomic--;
 #endif
@@ -6361,6 +6361,84 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 	cpumask_copy(&p->cpus_allowed, new_mask);
 }
 
+static DEFINE_PER_CPU(struct cpumask, sched_cpumasks);
+static DEFINE_MUTEX(sched_down_mutex);
+static cpumask_t sched_down_cpumask;
+
+void tell_sched_cpu_down_begin(int cpu)
+{
+	mutex_lock(&sched_down_mutex);
+	cpumask_set_cpu(cpu, &sched_down_cpumask);
+	mutex_unlock(&sched_down_mutex);
+}
+
+void tell_sched_cpu_down_done(int cpu)
+{
+	mutex_lock(&sched_down_mutex);
+	cpumask_clear_cpu(cpu, &sched_down_cpumask);
+	mutex_unlock(&sched_down_mutex);
+}
+
+/**
+ * migrate_me - try to move the current task off this cpu
+ *
+ * Used by the pin_current_cpu() code to try to get tasks
+ * to move off the current CPU as it is going down.
+ * It will only move the task if the task isn't pinned to
+ * the CPU (with migrate_disable, affinity or THREAD_BOUND)
+ * and the task is in a RUNNING state. Otherwise moving the
+ * task would wake it up (change its state to running) when
+ * the task did not expect it.
+ *
+ * Returns 1 if it succeeded in moving the current task
+ *         0 otherwise.
+ */
+int migrate_me(void)
+{
+	struct task_struct *p = current;
+	struct migration_arg arg;
+	struct cpumask *cpumask;
+	struct cpumask *mask;
+	unsigned long flags;
+	unsigned int dest_cpu;
+	struct rq *rq;
+
+	/*
+	 * We cannot migrate tasks bound to a CPU or tasks that are
+	 * not running, as moving the task would wake it up.
+	 */
+	if (p->flags & PF_THREAD_BOUND || p->state)
+		return 0;
+
+	mutex_lock(&sched_down_mutex);
+	rq = task_rq_lock(p, &flags);
+
+	cpumask = &__get_cpu_var(sched_cpumasks);
+	mask = &p->cpus_allowed;
+
+	cpumask_andnot(cpumask, mask, &sched_down_cpumask);
+
+	if (!cpumask_weight(cpumask)) {
+		/* It's only on this CPU? */
+		task_rq_unlock(rq, p, &flags);
+		mutex_unlock(&sched_down_mutex);
+		return 0;
+	}
+
+	dest_cpu = cpumask_any_and(cpu_active_mask, cpumask);
+
+	arg.task = p;
+	arg.dest_cpu = dest_cpu;
+
+	task_rq_unlock(rq, p, &flags);
+
+	stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+	tlb_migrate_finish(p->mm);
+	mutex_unlock(&sched_down_mutex);
+
+	return 1;
+}
+
 /*
  * This is how migration works:
  *
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 8/9][RFC] [PATCH 8/9] workqueue: Revert workqueue: Fix PF_THREAD_BOUND abuse
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (6 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  2012-03-01 18:55 ` [PATCH RT 9/9][RFC] [PATCH 9/9] workqueue: Revert workqueue: Fix cpuhotplug trainwreck Steven Rostedt
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0008-workqueue-Revert-workqueue-Fix-PF_THREAD_BOUND-abuse.patch --]
[-- Type: text/plain, Size: 2315 bytes --]

Revert commit

    Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Date:   Mon Oct 3 12:43:25 2011 +0200
    workqueue: Fix PF_THREAD_BOUND abuse

PF_THREAD_BOUND no longer affects CPU down, and this code introduced
a lot of races when taking down a CPU.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/workqueue.c |   29 +++++++++--------------------
 1 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e7f3e79..3b53135 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1286,14 +1286,8 @@ __acquires(&gcwq->lock)
 			return false;
 		if (task_cpu(task) == gcwq->cpu &&
 		    cpumask_equal(&current->cpus_allowed,
-				  get_cpu_mask(gcwq->cpu))) {
-			/*
-			 * Since we're binding to a particular cpu and need to
-			 * stay there for correctness, mark us PF_THREAD_BOUND.
-			 */
-			task->flags |= PF_THREAD_BOUND;
+				  get_cpu_mask(gcwq->cpu)))
 			return true;
-		}
 		spin_unlock_irq(&gcwq->lock);
 
 		/*
@@ -1307,18 +1301,6 @@ __acquires(&gcwq->lock)
 	}
 }
 
-static void worker_unbind_and_unlock(struct worker *worker)
-{
-	struct global_cwq *gcwq = worker->gcwq;
-	struct task_struct *task = worker->task;
-
-	/*
-	 * Its no longer required we're PF_THREAD_BOUND, the work is done.
-	 */
-	task->flags &= ~PF_THREAD_BOUND;
-	spin_unlock_irq(&gcwq->lock);
-}
-
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1381,9 +1363,15 @@ static struct worker *create_worker(struct global_cwq *gcwq, bool bind)
 	if (IS_ERR(worker->task))
 		goto fail;
 
+	/*
+	 * A rogue worker will become a regular one if CPU comes
+	 * online later on.  Make sure every worker has
+	 * PF_THREAD_BOUND set.
+	 */
 	if (bind && !on_unbound_cpu)
 		kthread_bind(worker->task, gcwq->cpu);
 	else {
+		worker->task->flags |= PF_THREAD_BOUND;
 		if (on_unbound_cpu)
 			worker->flags |= WORKER_UNBOUND;
 	}
@@ -2060,7 +2048,7 @@ repeat:
 		if (keep_working(gcwq))
 			wake_up_worker(gcwq);
 
-		worker_unbind_and_unlock(rescuer);
+		spin_unlock_irq(&gcwq->lock);
 	}
 
 	schedule();
@@ -3009,6 +2997,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *name,
 		if (IS_ERR(rescuer->task))
 			goto err;
 
+		rescuer->task->flags |= PF_THREAD_BOUND;
 		wake_up_process(rescuer->task);
 	}
 
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH RT 9/9][RFC] [PATCH 9/9] workqueue: Revert workqueue: Fix cpuhotplug trainwreck
  2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
                   ` (7 preceding siblings ...)
  2012-03-01 18:55 ` [PATCH RT 8/9][RFC] [PATCH 8/9] workqueue: Revert workqueue: Fix PF_THREAD_BOUND abuse Steven Rostedt
@ 2012-03-01 18:55 ` Steven Rostedt
  8 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-01 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-rt-users
  Cc: Thomas Gleixner, Carsten Emde, John Kacur, Peter Zijlstra,
	Clark Williams

[-- Attachment #1: 0009-workqueue-Revert-workqueue-Fix-cpuhotplug-trainwreck.patch --]
[-- Type: text/plain, Size: 24904 bytes --]

Revert

    Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Date:   Fri Sep 30 11:57:58 2011 +0200
    workqueue: Fix cpuhotplug trainwreck

PF_THREAD_BOUND no longer affects CPU down, and this code introduced
a lot of races when taking down a CPU.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/cpu.h       |    6 +-
 include/linux/workqueue.h |    5 +-
 kernel/workqueue.c        |  561 +++++++++++++++++++++++++++++++++------------
 3 files changed, 415 insertions(+), 157 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 72e90bb..c46ec3e 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -66,10 +66,8 @@ enum {
 	/* migration should happen before other stuff but after perf */
 	CPU_PRI_PERF		= 20,
 	CPU_PRI_MIGRATION	= 10,
-
-	CPU_PRI_WORKQUEUE_ACTIVE	= 5,  /* prepare workqueues for others */
-	CPU_PRI_NORMAL			= 0,
-	CPU_PRI_WORKQUEUE_INACTIVE	= -5, /* flush workqueues after others */
+	/* prepare workqueues for other notifiers */
+	CPU_PRI_WORKQUEUE	= 5,
 };
 
 #define CPU_ONLINE		0x0002 /* CPU (unsigned)v is up */
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d68ead8..0d556de 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -254,10 +254,9 @@ enum {
 	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
 	WQ_HIGHPRI		= 1 << 4, /* high priority */
 	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu instensive workqueue */
-	WQ_NON_AFFINE		= 1 << 6, /* free to move works around cpus */
 
-	WQ_DRAINING		= 1 << 7, /* internal: workqueue is draining */
-	WQ_RESCUER		= 1 << 8, /* internal: workqueue has rescuer */
+	WQ_DRAINING		= 1 << 6, /* internal: workqueue is draining */
+	WQ_RESCUER		= 1 << 7, /* internal: workqueue has rescuer */
 
 	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
 	WQ_MAX_UNBOUND_PER_CPU	= 4,	  /* 4 * #cpus for unbound wq */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3b53135..e8f4cce 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -41,7 +41,6 @@
 #include <linux/debug_locks.h>
 #include <linux/lockdep.h>
 #include <linux/idr.h>
-#include <linux/delay.h>
 
 #include "workqueue_sched.h"
 
@@ -58,10 +57,20 @@ enum {
 	WORKER_DIE		= 1 << 1,	/* die die die */
 	WORKER_IDLE		= 1 << 2,	/* is idle */
 	WORKER_PREP		= 1 << 3,	/* preparing to run works */
-	WORKER_CPU_INTENSIVE	= 1 << 4,	/* cpu intensive */
-	WORKER_UNBOUND		= 1 << 5,	/* worker is unbound */
+	WORKER_ROGUE		= 1 << 4,	/* not bound to any cpu */
+	WORKER_REBIND		= 1 << 5,	/* mom is home, come back */
+	WORKER_CPU_INTENSIVE	= 1 << 6,	/* cpu intensive */
+	WORKER_UNBOUND		= 1 << 7,	/* worker is unbound */
 
-	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
+	WORKER_NOT_RUNNING	= WORKER_PREP | WORKER_ROGUE | WORKER_REBIND |
+				  WORKER_CPU_INTENSIVE | WORKER_UNBOUND,
+
+	/* gcwq->trustee_state */
+	TRUSTEE_START		= 0,		/* start */
+	TRUSTEE_IN_CHARGE	= 1,		/* trustee in charge of gcwq */
+	TRUSTEE_BUTCHER		= 2,		/* butcher workers */
+	TRUSTEE_RELEASE		= 3,		/* release workers */
+	TRUSTEE_DONE		= 4,		/* trustee is done */
 
 	BUSY_WORKER_HASH_ORDER	= 6,		/* 64 pointers */
 	BUSY_WORKER_HASH_SIZE	= 1 << BUSY_WORKER_HASH_ORDER,
@@ -75,6 +84,7 @@ enum {
 						   (min two ticks) */
 	MAYDAY_INTERVAL		= HZ / 10,	/* and then every 100ms */
 	CREATE_COOLDOWN		= HZ,		/* time to breath after fail */
+	TRUSTEE_COOLDOWN	= HZ / 10,	/* for trustee draining */
 
 	/*
 	 * Rescue workers are used only on emergencies and shared by
@@ -126,6 +136,7 @@ struct worker {
 	unsigned long		last_active;	/* L: last active timestamp */
 	unsigned int		flags;		/* X: flags */
 	int			id;		/* I: worker id */
+	struct work_struct	rebind_work;	/* L: rebind worker to cpu */
 	int			sleeping;	/* None */
 };
 
@@ -153,8 +164,10 @@ struct global_cwq {
 
 	struct ida		worker_ida;	/* L: for worker IDs */
 
+	struct task_struct	*trustee;	/* L: for gcwq shutdown */
+	unsigned int		trustee_state;	/* L: trustee state */
+	wait_queue_head_t	trustee_wait;	/* trustee wait */
 	struct worker		*first_idle;	/* L: first idle worker */
-	wait_queue_head_t	idle_wait;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -959,38 +972,13 @@ static bool is_chained_work(struct workqueue_struct *wq)
 	return false;
 }
 
-static void ___queue_work(struct workqueue_struct *wq, struct global_cwq *gcwq,
-			  struct work_struct *work)
-{
-	struct cpu_workqueue_struct *cwq;
-	struct list_head *worklist;
-	unsigned int work_flags;
-
-	/* gcwq determined, get cwq and queue */
-	cwq = get_cwq(gcwq->cpu, wq);
-	trace_workqueue_queue_work(gcwq->cpu, cwq, work);
-
-	BUG_ON(!list_empty(&work->entry));
-
-	cwq->nr_in_flight[cwq->work_color]++;
-	work_flags = work_color_to_flags(cwq->work_color);
-
-	if (likely(cwq->nr_active < cwq->max_active)) {
-		trace_workqueue_activate_work(work);
-		cwq->nr_active++;
-		worklist = gcwq_determine_ins_pos(gcwq, cwq);
-	} else {
-		work_flags |= WORK_STRUCT_DELAYED;
-		worklist = &cwq->delayed_works;
-	}
-
-	insert_work(cwq, work, worklist, work_flags);
-}
-
 static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
 	struct global_cwq *gcwq;
+	struct cpu_workqueue_struct *cwq;
+	struct list_head *worklist;
+	unsigned int work_flags;
 	unsigned long flags;
 
 	debug_work_activate(work);
@@ -1036,32 +1024,27 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 		spin_lock_irqsave(&gcwq->lock, flags);
 	}
 
-	___queue_work(wq, gcwq, work);
+	/* gcwq determined, get cwq and queue */
+	cwq = get_cwq(gcwq->cpu, wq);
+	trace_workqueue_queue_work(cpu, cwq, work);
 
-	spin_unlock_irqrestore(&gcwq->lock, flags);
-}
+	BUG_ON(!list_empty(&work->entry));
 
-/**
- * queue_work_on - queue work on specific cpu
- * @cpu: CPU number to execute work on
- * @wq: workqueue to use
- * @work: work to queue
- *
- * Returns 0 if @work was already on a queue, non-zero otherwise.
- *
- * We queue the work to a specific CPU, the caller must ensure it
- * can't go away.
- */
-static int
-__queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
-{
-	int ret = 0;
+	cwq->nr_in_flight[cwq->work_color]++;
+	work_flags = work_color_to_flags(cwq->work_color);
 
-	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
-		__queue_work(cpu, wq, work);
-		ret = 1;
+	if (likely(cwq->nr_active < cwq->max_active)) {
+		trace_workqueue_activate_work(work);
+		cwq->nr_active++;
+		worklist = gcwq_determine_ins_pos(gcwq, cwq);
+	} else {
+		work_flags |= WORK_STRUCT_DELAYED;
+		worklist = &cwq->delayed_works;
 	}
-	return ret;
+
+	insert_work(cwq, work, worklist, work_flags);
+
+	spin_unlock_irqrestore(&gcwq->lock, flags);
 }
 
 /**
@@ -1078,19 +1061,34 @@ int queue_work(struct workqueue_struct *wq, struct work_struct *work)
 {
 	int ret;
 
-	ret = __queue_work_on(get_cpu_light(), wq, work);
+	ret = queue_work_on(get_cpu_light(), wq, work);
 	put_cpu_light();
 
 	return ret;
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
+/**
+ * queue_work_on - queue work on specific cpu
+ * @cpu: CPU number to execute work on
+ * @wq: workqueue to use
+ * @work: work to queue
+ *
+ * Returns 0 if @work was already on a queue, non-zero otherwise.
+ *
+ * We queue the work to a specific CPU, the caller must ensure it
+ * can't go away.
+ */
 int
 queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work)
 {
-	WARN_ON(wq->flags & WQ_NON_AFFINE);
+	int ret = 0;
 
-	return __queue_work_on(cpu, wq, work);
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		__queue_work(cpu, wq, work);
+		ret = 1;
+	}
+	return ret;
 }
 EXPORT_SYMBOL_GPL(queue_work_on);
 
@@ -1136,8 +1134,6 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	struct timer_list *timer = &dwork->timer;
 	struct work_struct *work = &dwork->work;
 
-	WARN_ON((wq->flags & WQ_NON_AFFINE) && cpu != -1);
-
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		unsigned int lcpu;
 
@@ -1203,13 +1199,12 @@ static void worker_enter_idle(struct worker *worker)
 	/* idle_list is LIFO */
 	list_add(&worker->entry, &gcwq->idle_list);
 
-	if (gcwq->nr_idle == gcwq->nr_workers)
-		wake_up_all(&gcwq->idle_wait);
-
-	if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer)) {
-		mod_timer(&gcwq->idle_timer,
-				jiffies + IDLE_WORKER_TIMEOUT);
-	}
+	if (likely(!(worker->flags & WORKER_ROGUE))) {
+		if (too_many_workers(gcwq) && !timer_pending(&gcwq->idle_timer))
+			mod_timer(&gcwq->idle_timer,
+				  jiffies + IDLE_WORKER_TIMEOUT);
+	} else
+		wake_up_all(&gcwq->trustee_wait);
 
 	/* sanity check nr_running */
 	WARN_ON_ONCE(gcwq->nr_workers == gcwq->nr_idle &&
@@ -1301,6 +1296,23 @@ __acquires(&gcwq->lock)
 	}
 }
 
+/*
+ * Function for worker->rebind_work used to rebind rogue busy workers
+ * to the associated cpu which is coming back online.  This is
+ * scheduled by cpu up but can race with other cpu hotplug operations
+ * and may be executed twice without intervening cpu down.
+ */
+static void worker_rebind_fn(struct work_struct *work)
+{
+	struct worker *worker = container_of(work, struct worker, rebind_work);
+	struct global_cwq *gcwq = worker->gcwq;
+
+	if (worker_maybe_bind_and_lock(worker))
+		worker_clr_flags(worker, WORKER_REBIND);
+
+	spin_unlock_irq(&gcwq->lock);
+}
+
 static struct worker *alloc_worker(void)
 {
 	struct worker *worker;
@@ -1309,6 +1321,7 @@ static struct worker *alloc_worker(void)
 	if (worker) {
 		INIT_LIST_HEAD(&worker->entry);
 		INIT_LIST_HEAD(&worker->scheduled);
+		INIT_WORK(&worker->rebind_work, worker_rebind_fn);
 		/* on creation a worker is in !idle && prep state */
 		worker->flags = WORKER_PREP;
 	}
@@ -1648,6 +1661,13 @@ static bool manage_workers(struct worker *worker)
 
 	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
 
+	/*
+	 * The trustee might be waiting to take over the manager
+	 * position, tell it we're done.
+	 */
+	if (unlikely(gcwq->trustee))
+		wake_up_all(&gcwq->trustee_wait);
+
 	return ret;
 }
 
@@ -3187,76 +3207,171 @@ EXPORT_SYMBOL_GPL(work_busy);
  * gcwqs serve mix of short, long and very long running works making
  * blocked draining impractical.
  *
+ * This is solved by allowing a gcwq to be detached from CPU, running
+ * it with unbound (rogue) workers and allowing it to be reattached
+ * later if the cpu comes back online.  A separate thread is created
+ * to govern a gcwq in such state and is called the trustee of the
+ * gcwq.
+ *
+ * Trustee states and their descriptions.
+ *
+ * START	Command state used on startup.  On CPU_DOWN_PREPARE, a
+ *		new trustee is started with this state.
+ *
+ * IN_CHARGE	Once started, trustee will enter this state after
+ *		assuming the manager role and making all existing
+ *		workers rogue.  DOWN_PREPARE waits for trustee to
+ *		enter this state.  After reaching IN_CHARGE, trustee
+ *		tries to execute the pending worklist until it's empty
+ *		and the state is set to BUTCHER, or the state is set
+ *		to RELEASE.
+ *
+ * BUTCHER	Command state which is set by the cpu callback after
+ *		the cpu has gone down.  Once this state is set the trustee
+ *		knows that there will be no new works on the worklist
+ *		and once the worklist is empty it can proceed to
+ *		killing idle workers.
+ *
+ * RELEASE	Command state which is set by the cpu callback if the
+ *		cpu down has been canceled or it has come online
+ *		again.  After recognizing this state, trustee stops
+ *		trying to drain or butcher and clears ROGUE, rebinds
+ *		all remaining workers back to the cpu and releases
+ *		manager role.
+ *
+ * DONE		Trustee will enter this state after BUTCHER or RELEASE
+ *		is complete.
+ *
+ *          trustee                 CPU                draining
+ *         took over                down               complete
+ * START -----------> IN_CHARGE -----------> BUTCHER -----------> DONE
+ *                        |                     |                  ^
+ *                        | CPU is back online  v   return workers |
+ *                         ----------------> RELEASE --------------
  */
 
-static int __devinit workqueue_cpu_up_callback(struct notifier_block *nfb,
-						unsigned long action,
-						void *hcpu)
-{
-	unsigned int cpu = (unsigned long)hcpu;
-	struct global_cwq *gcwq = get_gcwq(cpu);
-	struct worker *uninitialized_var(new_worker);
-	unsigned long flags;
+/**
+ * trustee_wait_event_timeout - timed event wait for trustee
+ * @cond: condition to wait for
+ * @timeout: timeout in jiffies
+ *
+ * wait_event_timeout() for trustee to use.  Handles locking and
+ * checks for RELEASE request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * Positive indicating left time if @cond is satisfied, 0 if timed
+ * out, -1 if canceled.
+ */
+#define trustee_wait_event_timeout(cond, timeout) ({			\
+	long __ret = (timeout);						\
+	while (!((cond) || (gcwq->trustee_state == TRUSTEE_RELEASE)) &&	\
+	       __ret) {							\
+		spin_unlock_irq(&gcwq->lock);				\
+		__wait_event_timeout(gcwq->trustee_wait, (cond) ||	\
+			(gcwq->trustee_state == TRUSTEE_RELEASE),	\
+			__ret);						\
+		spin_lock_irq(&gcwq->lock);				\
+	}								\
+	gcwq->trustee_state == TRUSTEE_RELEASE ? -1 : (__ret);		\
+})
 
-	action &= ~CPU_TASKS_FROZEN;
+/**
+ * trustee_wait_event - event wait for trustee
+ * @cond: condition to wait for
+ *
+ * wait_event() for trustee to use.  Automatically handles locking and
+ * checks for CANCEL request.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by trustee.
+ *
+ * RETURNS:
+ * 0 if @cond is satisfied, -1 if canceled.
+ */
+#define trustee_wait_event(cond) ({					\
+	long __ret1;							\
+	__ret1 = trustee_wait_event_timeout(cond, MAX_SCHEDULE_TIMEOUT);\
+	__ret1 < 0 ? -1 : 0;						\
+})
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		new_worker = create_worker(gcwq, false);
-		if (!new_worker)
-			return NOTIFY_BAD;
-	case CPU_UP_CANCELED:
-	case CPU_ONLINE:
-		break;
-	default:
-		return notifier_from_errno(0);
-	}
+static int __cpuinit trustee_thread(void *__gcwq)
+{
+	struct global_cwq *gcwq = __gcwq;
+	struct worker *worker;
+	struct work_struct *work;
+	struct hlist_node *pos;
+	long rc;
+	int i;
 
-	/* some are called w/ irq disabled, don't disturb irq status */
-	spin_lock_irqsave(&gcwq->lock, flags);
+	BUG_ON(gcwq->cpu != smp_processor_id());
 
-	switch (action) {
-	case CPU_UP_PREPARE:
-		BUG_ON(gcwq->first_idle);
-		gcwq->first_idle = new_worker;
-		break;
+	spin_lock_irq(&gcwq->lock);
+	/*
+	 * Claim the manager position and make all workers rogue.
+	 * Trustee must be bound to the target cpu and can't be
+	 * cancelled.
+	 */
+	BUG_ON(gcwq->cpu != smp_processor_id());
+	rc = trustee_wait_event(!(gcwq->flags & GCWQ_MANAGING_WORKERS));
+	BUG_ON(rc < 0);
 
-	case CPU_UP_CANCELED:
-		destroy_worker(gcwq->first_idle);
-		gcwq->first_idle = NULL;
-		break;
+	gcwq->flags |= GCWQ_MANAGING_WORKERS;
 
-	case CPU_ONLINE:
-		spin_unlock_irq(&gcwq->lock);
-		kthread_bind(gcwq->first_idle->task, cpu);
-		spin_lock_irq(&gcwq->lock);
-		gcwq->flags |= GCWQ_MANAGE_WORKERS;
-		start_worker(gcwq->first_idle);
-		gcwq->first_idle = NULL;
-		break;
-	}
+	list_for_each_entry(worker, &gcwq->idle_list, entry)
+		worker->flags |= WORKER_ROGUE;
 
-	spin_unlock_irqrestore(&gcwq->lock, flags);
+	for_each_busy_worker(worker, i, pos, gcwq)
+		worker->flags |= WORKER_ROGUE;
 
-	return notifier_from_errno(0);
-}
+	/*
+	 * Call schedule() so that we cross rq->lock and thus can
+	 * guarantee sched callbacks see the rogue flag.  This is
+	 * necessary as scheduler callbacks may be invoked from other
+	 * cpus.
+	 */
+	spin_unlock_irq(&gcwq->lock);
+	schedule();
+	spin_lock_irq(&gcwq->lock);
 
-static void flush_gcwq(struct global_cwq *gcwq)
-{
-	struct work_struct *work, *nw;
-	struct worker *worker, *n;
-	LIST_HEAD(non_affine_works);
+	/*
+	 * Sched callbacks are disabled now.  Zap nr_running.  After
+	 * this, nr_running stays zero and need_more_worker() and
+	 * keep_working() are always true as long as the worklist is
+	 * not empty.
+	 */
+	atomic_set(get_gcwq_nr_running(gcwq->cpu), 0);
 
+	spin_unlock_irq(&gcwq->lock);
+	del_timer_sync(&gcwq->idle_timer);
 	spin_lock_irq(&gcwq->lock);
-	list_for_each_entry_safe(work, nw, &gcwq->worklist, entry) {
-		struct workqueue_struct *wq = get_work_cwq(work)->wq;
 
-		if (wq->flags & WQ_NON_AFFINE)
-			list_move(&work->entry, &non_affine_works);
-	}
+	/*
+	 * We're now in charge.  Notify and proceed to drain.  We need
+	 * to keep the gcwq running during the whole CPU down
+	 * procedure as other cpu hotunplug callbacks may need to
+	 * flush currently running tasks.
+	 */
+	gcwq->trustee_state = TRUSTEE_IN_CHARGE;
+	wake_up_all(&gcwq->trustee_wait);
 
-	while (!list_empty(&gcwq->worklist)) {
+	/*
+	 * The original cpu is in the process of dying and may go away
+	 * anytime now.  When that happens, we and all workers would
+	 * be migrated to other cpus.  Try draining any left work.  We
+	 * want to get it over with ASAP - spam rescuers, wake up as
+	 * many idlers as necessary and create new ones till the
+	 * worklist is empty.  Note that if the gcwq is frozen, there
+	 * may be frozen works in freezable cwqs.  Don't declare
+	 * completion while frozen.
+	 */
+	while (gcwq->nr_workers != gcwq->nr_idle ||
+	       gcwq->flags & GCWQ_FREEZING ||
+	       gcwq->trustee_state == TRUSTEE_IN_CHARGE) {
 		int nr_works = 0;
 
 		list_for_each_entry(work, &gcwq->worklist, entry) {
@@ -3270,55 +3385,200 @@ static void flush_gcwq(struct global_cwq *gcwq)
 			wake_up_process(worker->task);
 		}
 
-		spin_unlock_irq(&gcwq->lock);
-
 		if (need_to_create_worker(gcwq)) {
-			worker = create_worker(gcwq, true);
-			if (worker)
+			spin_unlock_irq(&gcwq->lock);
+			worker = create_worker(gcwq, false);
+			spin_lock_irq(&gcwq->lock);
+			if (worker) {
+				worker->flags |= WORKER_ROGUE;
 				start_worker(worker);
+			}
 		}
 
-		wait_event_timeout(gcwq->idle_wait,
-				gcwq->nr_idle == gcwq->nr_workers, HZ/10);
-
-		spin_lock_irq(&gcwq->lock);
+		/* give a breather */
+		if (trustee_wait_event_timeout(false, TRUSTEE_COOLDOWN) < 0)
+			break;
 	}
 
-	WARN_ON(gcwq->nr_workers != gcwq->nr_idle);
+	/*
+	 * Either all works have been scheduled and cpu is down, or
+	 * cpu down has already been canceled.  Wait for and butcher
+	 * all workers till we're canceled.
+	 */
+	do {
+		rc = trustee_wait_event(!list_empty(&gcwq->idle_list));
+		while (!list_empty(&gcwq->idle_list))
+			destroy_worker(list_first_entry(&gcwq->idle_list,
+							struct worker, entry));
+	} while (gcwq->nr_workers && rc >= 0);
 
-	list_for_each_entry_safe(worker, n, &gcwq->idle_list, entry)
-		destroy_worker(worker);
+	/*
+	 * At this point, either draining has completed and no worker
+	 * is left, or cpu down has been canceled or the cpu is being
+	 * brought back up.  There shouldn't be any idle one left.
+	 * Tell the remaining busy ones to rebind once it finishes the
+	 * currently scheduled works by scheduling the rebind_work.
+	 */
+	WARN_ON(!list_empty(&gcwq->idle_list));
 
-	WARN_ON(gcwq->nr_workers || gcwq->nr_idle);
+	for_each_busy_worker(worker, i, pos, gcwq) {
+		struct work_struct *rebind_work = &worker->rebind_work;
 
-	spin_unlock_irq(&gcwq->lock);
+		/*
+		 * Rebind_work may race with future cpu hotplug
+		 * operations.  Use a separate flag to mark that
+		 * rebinding is scheduled.
+		 */
+		worker->flags |= WORKER_REBIND;
+		worker->flags &= ~WORKER_ROGUE;
 
-	gcwq = get_gcwq(get_cpu_light());
-	spin_lock_irq(&gcwq->lock);
-	list_for_each_entry_safe(work, nw, &non_affine_works, entry) {
-		list_del_init(&work->entry);
-		___queue_work(get_work_cwq(work)->wq, gcwq, work);
+		/* queue rebind_work, wq doesn't matter, use the default one */
+		if (test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+				     work_data_bits(rebind_work)))
+			continue;
+
+		debug_work_activate(rebind_work);
+		insert_work(get_cwq(gcwq->cpu, system_wq), rebind_work,
+			    worker->scheduled.next,
+			    work_color_to_flags(WORK_NO_COLOR));
 	}
+
+	/* relinquish manager role */
+	gcwq->flags &= ~GCWQ_MANAGING_WORKERS;
+
+	/* notify completion */
+	gcwq->trustee = NULL;
+	gcwq->trustee_state = TRUSTEE_DONE;
+	wake_up_all(&gcwq->trustee_wait);
 	spin_unlock_irq(&gcwq->lock);
-	put_cpu_light();
+	return 0;
 }
 
-static int __devinit workqueue_cpu_down_callback(struct notifier_block *nfb,
+/**
+ * wait_trustee_state - wait for trustee to enter the specified state
+ * @gcwq: gcwq the trustee of interest belongs to
+ * @state: target state to wait for
+ *
+ * Wait for the trustee to reach @state.  DONE is already matched.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock) which may be released and regrabbed
+ * multiple times.  To be used by cpu_callback.
+ */
+static void __cpuinit wait_trustee_state(struct global_cwq *gcwq, int state)
+__releases(&gcwq->lock)
+__acquires(&gcwq->lock)
+{
+	if (!(gcwq->trustee_state == state ||
+	      gcwq->trustee_state == TRUSTEE_DONE)) {
+		spin_unlock_irq(&gcwq->lock);
+		__wait_event(gcwq->trustee_wait,
+			     gcwq->trustee_state == state ||
+			     gcwq->trustee_state == TRUSTEE_DONE);
+		spin_lock_irq(&gcwq->lock);
+	}
+}
+
+static int __devinit workqueue_cpu_callback(struct notifier_block *nfb,
 						unsigned long action,
 						void *hcpu)
 {
 	unsigned int cpu = (unsigned long)hcpu;
 	struct global_cwq *gcwq = get_gcwq(cpu);
+	struct task_struct *new_trustee = NULL;
+	struct worker *uninitialized_var(new_worker);
+	unsigned long flags;
 
 	action &= ~CPU_TASKS_FROZEN;
 
-        switch (action) {
-        case CPU_DOWN_PREPARE:
-                flush_gcwq(gcwq);
-                break;
-        }
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		new_trustee = kthread_create(trustee_thread, gcwq,
+					     "workqueue_trustee/%d\n", cpu);
+		if (IS_ERR(new_trustee))
+			return notifier_from_errno(PTR_ERR(new_trustee));
+		kthread_bind(new_trustee, cpu);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		new_worker = create_worker(gcwq, false);
+		if (!new_worker) {
+			if (new_trustee)
+				kthread_stop(new_trustee);
+			return NOTIFY_BAD;
+		}
+		break;
+	case CPU_POST_DEAD:
+	case CPU_UP_CANCELED:
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		break;
+	case CPU_DYING:
+		/*
+		 * We access this lockless. We are on the dying CPU
+		 * and called from stomp machine.
+		 *
+		 * Before this, the trustee and all workers except for
+		 * the ones which are still executing works from
+		 * before the last CPU down must be on the cpu.  After
+		 * this, they'll all be diasporas.
+		 */
+		gcwq->flags |= GCWQ_DISASSOCIATED;
+	default:
+		goto out;
+	}
+
+	/* some are called w/ irq disabled, don't disturb irq status */
+	spin_lock_irqsave(&gcwq->lock, flags);
+
+	switch (action) {
+	case CPU_DOWN_PREPARE:
+		/* initialize trustee and tell it to acquire the gcwq */
+		BUG_ON(gcwq->trustee || gcwq->trustee_state != TRUSTEE_DONE);
+		gcwq->trustee = new_trustee;
+		gcwq->trustee_state = TRUSTEE_START;
+		wake_up_process(gcwq->trustee);
+		wait_trustee_state(gcwq, TRUSTEE_IN_CHARGE);
+		/* fall through */
+	case CPU_UP_PREPARE:
+		BUG_ON(gcwq->first_idle);
+		gcwq->first_idle = new_worker;
+		break;
+
+	case CPU_POST_DEAD:
+		gcwq->trustee_state = TRUSTEE_BUTCHER;
+		/* fall through */
+	case CPU_UP_CANCELED:
+		destroy_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
 
+	case CPU_DOWN_FAILED:
+	case CPU_ONLINE:
+		gcwq->flags &= ~GCWQ_DISASSOCIATED;
+		if (gcwq->trustee_state != TRUSTEE_DONE) {
+			gcwq->trustee_state = TRUSTEE_RELEASE;
+			wake_up_process(gcwq->trustee);
+			wait_trustee_state(gcwq, TRUSTEE_DONE);
+		}
+
+		/*
+		 * Trustee is done and there might be no worker left.
+		 * Put the first_idle in and request a real manager to
+		 * take a look.
+		 */
+		spin_unlock_irq(&gcwq->lock);
+		kthread_bind(gcwq->first_idle->task, cpu);
+		spin_lock_irq(&gcwq->lock);
+		gcwq->flags |= GCWQ_MANAGE_WORKERS;
+		start_worker(gcwq->first_idle);
+		gcwq->first_idle = NULL;
+		break;
+	}
+
+	spin_unlock_irqrestore(&gcwq->lock, flags);
 
+out:
 	return notifier_from_errno(0);
 }
 
@@ -3515,8 +3775,7 @@ static int __init init_workqueues(void)
 	unsigned int cpu;
 	int i;
 
-	cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_ACTIVE);
- 	hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_INACTIVE);
+	cpu_notifier(workqueue_cpu_callback, CPU_PRI_WORKQUEUE);
 
 	/* initialize gcwqs */
 	for_each_gcwq_cpu(cpu) {
@@ -3539,7 +3798,9 @@ static int __init init_workqueues(void)
 			    (unsigned long)gcwq);
 
 		ida_init(&gcwq->worker_ida);
-		init_waitqueue_head(&gcwq->idle_wait);
+
+		gcwq->trustee_state = TRUSTEE_DONE;
+		init_waitqueue_head(&gcwq->trustee_wait);
 	}
 
 	/* create the initial worker */
-- 
1.7.3.4



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code
  2012-03-01 18:55 ` [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code Steven Rostedt
@ 2012-03-02  7:25   ` Srivatsa S. Bhat
  2012-03-02 14:20     ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread
From: Srivatsa S. Bhat @ 2012-03-02  7:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
	John Kacur, Peter Zijlstra, Clark Williams, mingo@elte.hu,
	Andi Kleen, akpm@linux-foundation.org, rusty@rustcorp.com.au

On 03/02/2012 12:25 AM, Steven Rostedt wrote:

> Currently the RT version of the lglocks() does a for_each_online_cpu()
> in the name##_global_lock_online() functions. Non-rt uses its own
> mask for this, and for good reason.
> 
> A task may grab a *_global_lock_online(), and in the mean time, one
> of the CPUs goes offline. Now when that task does a *_global_unlock_online()
> it releases all the locks *except* the one that went offline.
> 
> Now if that CPU were to come back on line, its lock is now owned by a
> task that never released it when it should have.
> 
> This causes all sorts of fun errors. Like owners of a lock no longer
> existing, or sleeping on IO, waiting to be woken up by a task that
> happens to be blocked on the lock it never released.
> 
> Convert the RT versions to use the lglock specific cpumasks. As once
> a CPU comes on line, the mask is set, and never cleared even when the
> CPU goes offline. The locks for that CPU will still be taken and released.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  include/linux/lglock.h |   35 ++++++++++++++++++++++++++++++++---
>  1 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
> index 52b289f..cdfcef3 100644
> --- a/include/linux/lglock.h
> +++ b/include/linux/lglock.h
> @@ -203,9 +203,31 @@
>  #else /* !PREEMPT_RT_FULL */
>  #define DEFINE_LGLOCK(name)						\
>  									\
> - DEFINE_PER_CPU(struct rt_mutex, name##_lock);					\
> + DEFINE_PER_CPU(struct rt_mutex, name##_lock);				\
> + DEFINE_SPINLOCK(name##_cpu_lock);					\
> + cpumask_t name##_cpus __read_mostly;					\
>   DEFINE_LGLOCK_LOCKDEP(name);						\
>  									\
> + static int								\
> + name##_lg_cpu_callback(struct notifier_block *nb,			\
> +				unsigned long action, void *hcpu)	\
> + {									\
> +	switch (action & ~CPU_TASKS_FROZEN) {				\
> +	case CPU_UP_PREPARE:						\
> +		spin_lock(&name##_cpu_lock);				\
> +		cpu_set((unsigned long)hcpu, name##_cpus);		\
> +		spin_unlock(&name##_cpu_lock);				\
> +		break;							\
> +	case CPU_UP_CANCELED: case CPU_DEAD:				\
> +		spin_lock(&name##_cpu_lock);				\
> +		cpu_clear((unsigned long)hcpu, name##_cpus);		\
> +		spin_unlock(&name##_cpu_lock);				\
> +	}								\
> +	return NOTIFY_OK;						\
> + }									\
> + static struct notifier_block name##_lg_cpu_notifier = {		\
> +	.notifier_call = name##_lg_cpu_callback,			\
> + };									\
>   void name##_lock_init(void) {						\
>  	int i;								\
>  	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
> @@ -214,6 +236,11 @@
>  		lock = &per_cpu(name##_lock, i);			\
>  		rt_mutex_init(lock);					\
>  	}								\
> +	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
> +	get_online_cpus();						\
> +	for_each_online_cpu(i)						\
> +		cpu_set(i, name##_cpus);				\


This can be further improved. We don't really need this loop. We can replace
it with:

cpumask_copy(&name##_cpus, cpu_online_mask);

(as pointed out by Ingo. See: https://lkml.org/lkml/2012/2/29/93 and
https://lkml.org/lkml/2012/2/29/153).

I will try sending a patch for this to non-RT after the numerous patches
currently flying around this code (in non-RT) settle down..


> +	put_online_cpus();						\
>   }									\
>   EXPORT_SYMBOL(name##_lock_init);					\
>  									\
> @@ -254,7 +281,8 @@
>   void name##_global_lock_online(void) {					\
>  	int i;								\
>  	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> -	for_each_online_cpu(i) {					\
> +	spin_lock(&name##_cpu_lock);					\
> +	for_each_cpu(i, &name##_cpus) {					\
>  		struct rt_mutex *lock;					\
>  		lock = &per_cpu(name##_lock, i);			\
>  		__rt_spin_lock(lock);					\
> @@ -265,11 +293,12 @@
>   void name##_global_unlock_online(void) {				\
>  	int i;								\
>  	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
> -	for_each_online_cpu(i) {					\
> +	for_each_cpu(i, &name##_cpus) {					\
>  		struct rt_mutex *lock;					\
>  		lock = &per_cpu(name##_lock, i);			\
>  		__rt_spin_unlock(lock);					\
>  	}								\
> +	spin_unlock(&name##_cpu_lock);					\
>   }									\
>   EXPORT_SYMBOL(name##_global_unlock_online);				\
>  									\


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center



* Re: [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code
  2012-03-02  7:25   ` Srivatsa S. Bhat
@ 2012-03-02 14:20     ` Steven Rostedt
  0 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 14:20 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-kernel, linux-rt-users, Thomas Gleixner, Carsten Emde,
	John Kacur, Peter Zijlstra, Clark Williams, mingo@elte.hu,
	Andi Kleen, akpm@linux-foundation.org, rusty@rustcorp.com.au

On Fri, 2012-03-02 at 12:55 +0530, Srivatsa S. Bhat wrote:
> On 03/02/2012 12:25 AM, Steven Rostedt wrote:
>  	}								\
> > +	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
> > +	get_online_cpus();						\
> > +	for_each_online_cpu(i)						\
> > +		cpu_set(i, name##_cpus);				\
> 
> 
> This can be further improved. We don't really need this loop. We can replace
> it with:
> 
> cpumask_copy(&name##_cpus, cpu_online_mask);
> 
> (as pointed out by Ingo. See: https://lkml.org/lkml/2012/2/29/93 and
> https://lkml.org/lkml/2012/2/29/153).
> 
> I will try sending a patch for this to non-RT after the numerous patches
> currently flying around this code (in non-RT) settle down..
> 
> 

Yeah, I thought that was funny too, but I wanted RT to stay close to
mainline. My thoughts were to fix the broken hotplug code at the time,
and I wasn't thinking too much about improving mainline ;-)

-- Steve


* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-01 18:55 ` [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT Steven Rostedt
@ 2012-03-02 14:52   ` Thomas Gleixner
  2012-03-02 15:11     ` Steven Rostedt
  2012-03-02 15:15     ` Steven Rostedt
  0 siblings, 2 replies; 20+ messages in thread
From: Thomas Gleixner @ 2012-03-02 14:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Thu, 1 Mar 2012, Steven Rostedt wrote:
> Bringing a CPU down is a pain with the PREEMPT_RT kernel because
> tasks can be preempted in many more places than in non-RT. In
> order to handle per_cpu variables, tasks may be pinned to a CPU
> for a while, and even sleep. But these tasks need to be off the CPU
> if that CPU is going down.
> 
> Several synchronization methods have been tried, but when stressed
> they failed. This is a new approach.

OMG! That hotplug stuff has been ugly as hell already, but you managed
to make it exponentially worse. That's really an achievement.
 
Instead of adding more mess, we should simply fix hotplug. 

We can migrate away all tasks _before_ we run stomp-machine and
prevent that any new tasks go on the cpu which is about to be taken
down. Once all migratable tasks are gone, we only have to deal with
the cpu bound ones, which is not causing such headaches.

So no, I rather keep the current problem than applying that insanity.

Thanks,

	tglx


* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 14:52   ` Thomas Gleixner
@ 2012-03-02 15:11     ` Steven Rostedt
  2012-03-02 15:15     ` Steven Rostedt
  1 sibling, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 15:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2012-03-02 at 15:52 +0100, Thomas Gleixner wrote:
> On Thu, 1 Mar 2012, Steven Rostedt wrote:
> > Bringing a CPU down is a pain with the PREEMPT_RT kernel because
> > tasks can be preempted in many more places than in non-RT. In
> > order to handle per_cpu variables, tasks may be pinned to a CPU
> > for a while, and even sleep. But these tasks need to be off the CPU
> > if that CPU is going down.
> > 
> > Several synchronization methods have been tried, but when stressed
> > they failed. This is a new approach.
> 
> OMG! That hotplug stuff has been ugly as hell already, but you managed
> to make it exponentially worse. That's really an achievement.

*Blush*  Oh no, I don't deserve the credit for that. Adding the
migrate_disable() and friends in -rt did most of the work for me.


>  
> Instead of adding more mess, we should simply fix hotplug. 
> 
> We can migrate away all tasks _before_ we run stomp-machine and

Well, that's what this patch pretty much tried to do. But I agree, not
so nicely.

Should we call into the scheduler and tell it to migrate all tasks off
and accept no new ones? But note, the migration is done with
stomp_machine too (per cpu, and on the CPU that's going down). Then we
need to let the scheduler allow stomp_machine to enter the CPU, which
shouldn't be that hard.


> prevent that any new tasks go on the cpu which is about to be taken
> down. Once all migratable tasks are gone, we only have to deal with
> the cpu bound ones, which is not causing such headaches.

Also, the headaches came mostly from the tasks that were bound to the
CPU. Moving everything off after the notifiers is what's best. I may be
able to have that happen. We need a way to call into the scheduler and
tell it to migrate everything off then.


> 
> So no, I rather keep the current problem than applying that insanity.

Note, the sync daemon was already added in -rt; I just modified it to be
a little different. What is currently there introduced lots of races
(especially that grabbing of the mutex), and is completely broken.

I'm willing to take another shot at it, but I'm burnt out from this
hotplug crap and I'm way behind in my other duties because of it. For
now we have a fix (yes, it's crappy); those that need hotplug working
can use this mess.

I'm going to go off and catch up on my other work and when I get time
again I'll take another shot at it. Unless someone else beats me to it
(I can only hope so).

-- Steve






* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 14:52   ` Thomas Gleixner
  2012-03-02 15:11     ` Steven Rostedt
@ 2012-03-02 15:15     ` Steven Rostedt
  2012-03-02 15:20       ` Thomas Gleixner
  1 sibling, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 15:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2012-03-02 at 15:52 +0100, Thomas Gleixner wrote:

> So no, I rather keep the current problem than applying that insanity.

Note, patches 1,2,3,5 and 6, were bugs I found while debugging hotplug.
Please add them, they are not insanity fixes but real bug fixes.

And can you mark them as stable-rt fixes too.

Thanks,

-- Steve




* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 15:15     ` Steven Rostedt
@ 2012-03-02 15:20       ` Thomas Gleixner
  2012-03-02 15:36         ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Gleixner @ 2012-03-02 15:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2 Mar 2012, Steven Rostedt wrote:

> On Fri, 2012-03-02 at 15:52 +0100, Thomas Gleixner wrote:
> 
> > So no, I rather keep the current problem than applying that insanity.
> 
> Note, patches 1,2,3,5 and 6, were bugs I found while debugging hotplug.
> Please add them, they are not insanity fixes but real bug fixes.

Already queued. I'd say #4 is a bug fix as well, though I
fundamentally hate the ass backwards semantics of that new function.
 
Thanks,

	tglx


* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 15:20       ` Thomas Gleixner
@ 2012-03-02 15:36         ` Steven Rostedt
  2012-03-02 15:39           ` Steven Rostedt
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 15:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2012-03-02 at 16:20 +0100, Thomas Gleixner wrote:

> Already queued. I'd say #4 is a bug fix as well, though I
> fundamentally hate the ass backwards semantics of that new function.

It is a bug fix, but I don't like it either. Note, that the patch didn't
update the pin_current_cpu() code, which would have to be done too (to
be a full fix).

But I was also thinking that as a work around, as we plan on changing
this code anyway. Instead of adding a new API which is ass backwards,
just encompass the cpu_hotplug.lock instead. Something like this:

Not tested nor complied.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Yes this is ugly, but at least we don't introduce an ass backwards
semantic.

diff --git a/kernel/cpu.c b/kernel/cpu.c
index fa40834..c25b5ff 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -46,7 +46,12 @@ static int cpu_hotplug_disabled;
 
 static struct {
 	struct task_struct *active_writer;
+#ifdef CONFIG_PREEMPT_RT_FULL
+	/* Makes the lock keep the task's state */
+	spinlock_t lock;
+#else
 	struct mutex lock; /* Synchronizes accesses to refcount, */
+#endif
 	/*
 	 * Also blocks the new readers during
 	 * an ongoing cpu hotplug operation.
@@ -58,6 +63,14 @@ static struct {
 	.refcount = 0,
 };
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+# define hotplug_lock() spin_lock(&cpu_hotplug.lock)
+# define hotplug_unlock() spin_unlock(&cpu_hotplug.lock)
+#else
+# define hotplug_lock() mutex_lock(&cpu_hotplug.lock)
+# define hotplug_lock() mutex_unlock(&cpu_hotplug.lock)
+#endif
+
 struct hotplug_pcp {
 	struct task_struct *unplug;
 	int refcount;
@@ -87,8 +100,8 @@ retry:
 		return;
 	}
 	preempt_enable();
-	mutex_lock(&cpu_hotplug.lock);
-	mutex_unlock(&cpu_hotplug.lock);
+	hotplug_lock();
+	hotplug_unlock();
 	preempt_disable();
 	goto retry;
 }
@@ -161,9 +174,9 @@ void get_online_cpus(void)
 	might_sleep();
 	if (cpu_hotplug.active_writer == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
+	hotplug_lock();
 	cpu_hotplug.refcount++;
-	mutex_unlock(&cpu_hotplug.lock);
+	hotplug_unlock();
 
 }
 EXPORT_SYMBOL_GPL(get_online_cpus);
@@ -172,10 +185,10 @@ void put_online_cpus(void)
 {
 	if (cpu_hotplug.active_writer == current)
 		return;
-	mutex_lock(&cpu_hotplug.lock);
+	hotplug_lock();
 	if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
 		wake_up_process(cpu_hotplug.active_writer);
-	mutex_unlock(&cpu_hotplug.lock);
+	hotplug_unlock();
 
 }
 EXPORT_SYMBOL_GPL(put_online_cpus);
@@ -207,11 +220,11 @@ static void cpu_hotplug_begin(void)
 	cpu_hotplug.active_writer = current;
 
 	for (;;) {
-		mutex_lock(&cpu_hotplug.lock);
+		hotplug_lock();
 		if (likely(!cpu_hotplug.refcount))
 			break;
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-		mutex_unlock(&cpu_hotplug.lock);
+		hotplug_unlock();
 		schedule();
 	}
 }
@@ -219,7 +232,7 @@ static void cpu_hotplug_begin(void)
 static void cpu_hotplug_done(void)
 {
 	cpu_hotplug.active_writer = NULL;
-	mutex_unlock(&cpu_hotplug.lock);
+	hotplug_unlock();
 }
 
 #else /* #if CONFIG_HOTPLUG_CPU */


* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 15:36         ` Steven Rostedt
@ 2012-03-02 15:39           ` Steven Rostedt
  2012-03-02 15:40           ` Steven Rostedt
  2012-03-02 15:51           ` Thomas Gleixner
  2 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 15:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2012-03-02 at 10:36 -0500, Steven Rostedt wrote:

> Not tested nor complied.
> 

I meant compiled, but I guess this patch doesn't comply either ;-)

-- Steve


* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 15:36         ` Steven Rostedt
  2012-03-02 15:39           ` Steven Rostedt
@ 2012-03-02 15:40           ` Steven Rostedt
  2012-03-02 15:51           ` Thomas Gleixner
  2 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2012-03-02 15:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2012-03-02 at 10:36 -0500, Steven Rostedt wrote:
>  
> +#ifdef CONFIG_PREEMPT_RT_FULL
> +# define hotplug_lock() spin_lock(&cpu_hotplug.lock)
> +# define hotplug_unlock() spin_unlock(&cpu_hotplug.lock)
> +#else
> +# define hotplug_lock() mutex_lock(&cpu_hotplug.lock)
> +# define hotplug_lock() mutex_unlock(&cpu_hotplug.lock)

# define hotplug_unlock() mutex_unlock(&cpu_hotplug.lock)

Proof it doesn't comply.

-- Steve

> +#endif





* Re: [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT
  2012-03-02 15:36         ` Steven Rostedt
  2012-03-02 15:39           ` Steven Rostedt
  2012-03-02 15:40           ` Steven Rostedt
@ 2012-03-02 15:51           ` Thomas Gleixner
  2 siblings, 0 replies; 20+ messages in thread
From: Thomas Gleixner @ 2012-03-02 15:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-rt-users, Carsten Emde, John Kacur,
	Peter Zijlstra, Clark Williams

On Fri, 2 Mar 2012, Steven Rostedt wrote:

> On Fri, 2012-03-02 at 16:20 +0100, Thomas Gleixner wrote:
> 
> > Already queued. I'd say #4 is a bug fix as well, though I
> > fundamentally hate the ass backwards semantics of that new function.
> 
> It is a bug fix, but I don't like it either. Note, that the patch didn't
> update the pin_current_cpu() code, which would have to be done too (to
> be a full fix).

I know.
 
> But I was also thinking that as a work around, as we plan on changing
> this code anyway. Instead of adding a new API which is ass backwards,
> just encompass the cpu_hotplug.lock instead. Something like this:

Thought about that already. It's way less fugly.
 
Thanks,

	tglx


end of thread, other threads:[~2012-03-02 15:51 UTC | newest]

Thread overview: 20+ messages
2012-03-01 18:55 [PATCH RT 0/9][RFC] rt: Fix hotplugging and other nasties Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 1/9][RFC] [PATCH 1/9] timer: Fix hotplug for -rt Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 2/9][RFC] [PATCH 2/9] futex/rt: Fix possible lockup when taking pi_lock in proxy handler Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 3/9][RFC] [PATCH 3/9] lglock/rt: Use non-rt for_each_cpu() in -rt code Steven Rostedt
2012-03-02  7:25   ` Srivatsa S. Bhat
2012-03-02 14:20     ` Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 4/9][RFC] [PATCH 4/9] rtmutex: Add new mutex_lock_savestate() API Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 5/9][RFC] [PATCH 5/9] ring-buffer/rt: Check for irqs disabled before grabbing reader lock Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 6/9][RFC] [PATCH 6/9] sched/rt: Fix wait_task_interactive() to test rt_spin_lock state Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 7/9][RFC] [PATCH 7/9] cpu/rt: Rework cpu down for PREEMPT_RT Steven Rostedt
2012-03-02 14:52   ` Thomas Gleixner
2012-03-02 15:11     ` Steven Rostedt
2012-03-02 15:15     ` Steven Rostedt
2012-03-02 15:20       ` Thomas Gleixner
2012-03-02 15:36         ` Steven Rostedt
2012-03-02 15:39           ` Steven Rostedt
2012-03-02 15:40           ` Steven Rostedt
2012-03-02 15:51           ` Thomas Gleixner
2012-03-01 18:55 ` [PATCH RT 8/9][RFC] [PATCH 8/9] workqueue: Revert workqueue: Fix PF_THREAD_BOUND abuse Steven Rostedt
2012-03-01 18:55 ` [PATCH RT 9/9][RFC] [PATCH 9/9] workqueue: Revert workqueue: Fix cpuhotplug trainwreck Steven Rostedt
