[PATCH/RFC] timer: fix deadlock on cpu hotplug

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH/RFC] timer: fix deadlock on cpu hotplug
@ 2010-09-21 14:20 Heiko Carstens
  2010-09-21 15:36 ` Tejun Heo
  2010-09-21 15:39 ` [PATCH/RFC] timer: fix deadlock on cpu hotplug Thomas Gleixner
  0 siblings, 2 replies; 12+ messages in thread
From: Heiko Carstens @ 2010-09-21 14:20 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Tejun Heo, Rusty Russell
  Cc: linux-kernel

From: Heiko Carstens <heiko.carstens@de.ibm.com>

I've seen the following deadlock on cpu hotplug stress test:

On cpu down the process that triggered offlining of a cpu waits for
stop_machine() to finish:

PID: 56033  TASK: e001540           CPU: 2   COMMAND: "cpu_all_off"
 #0 [37aa7990] schedule at 559194
 #1 [37aa7a40] schedule_timeout at 559de0
 #2 [37aa7b18] wait_for_common at 558bfa
 #3 [37aa7b90] __stop_cpus at 1a876e
 #4 [37aa7c68] stop_cpus at 1a8a3a
 #5 [37aa7c98] __stop_machine at 1a8adc
 #6 [37aa7cf8] _cpu_down at 55007a
 #7 [37aa7d78] cpu_down at 550280
 #8 [37aa7d98] store_online at 551d48
 #9 [37aa7dc0] sysfs_write_file at 2a3fa2
 #10 [37aa7e18] vfs_write at 229b3c
 #11 [37aa7e78] sys_write at 229d38
 #12 [37aa7eb8] sysc_noemu at 1146de

All cpus actually have been synchronized and cpu 0 got offlined. However,
the migration thread on cpu 5 got preempted just between preempt_enable()
and cpu_stop_signal_done() within cpu_stopper_thread():

PID: 55622  TASK: 31a00a40          CPU: 5   COMMAND: "migration/5"
 #0 [30f8bc80] schedule at 559194
 #1 [30f8bd30] preempt_schedule at 559b54
 #2 [30f8bd50] cpu_stopper_thread at 1a81dc
 #3 [30f8be28] kthread at 163224
 #4 [30f8beb8] kernel_thread_starter at 106c1a

For some reason the scheduler decided to throttle RT tasks on the runqueue
of cpu 5 (rt_throttled = 1). So as long as rt_throttled == 1 we won't see the
migration thread coming back to execution.
The only thing that would unthrottle the runqueue would be the rt_period_timer.
The timer is indeed scheduled, however in the dump I have it has been expired
for more than four hours.
The reason is simply that the timer is pending on the offlined cpu 0 and
therefore would never fire before it gets migrated to an online cpu. Before
the cpu hotplug mechanisms (cpu hotplug notifier with state CPU_DEAD) would
migrate the timer to an online cpu stop_machine() must complete ---> deadlock.

The fix _seems_ to be simple: just migrate timers after __cpu_disable() has
been called and use the CPU_DYING state. The subtle difference is of course
that the migration code now gets executed on the cpu that actually just is
going to disable itself instead of an arbitrary cpu that stays online.

This patch moves the migration of pending timers to an earlier time
(CPU_DYING), so that the deadlock described cannot happen anymore.

Up to now the hrtimer migration code called __hrtimer_peek_ahead_timers()
after migrating timers to the _current_ cpu. Now pending timers are moved
to a remote cpu and calling that function isn't possible anymore.
To solve that I introduced the function raise_remote_softirq() which gets
used to raise the HRTIMER_SOFTIRQ on the cpu where the timers have been
migrated to. Which will lead to execution of hrtimer_peek_ahead_timers()
as soon as softirq are executed on the remote cpu.

The proper place for such a generic function should be softirq.c, but this
is just an RFC and I would like to check if people are ok with the general
approach.
Or maybe it's possible to fix this in a better way?

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---

 kernel/hrtimer.c |   30 +++++++++++++++++++++---------
 kernel/timer.c   |   14 ++++++++------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 1decafb..a912585 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1662,17 +1662,32 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 	}
 }

+#ifdef CONFIG_HIGH_RES_TIMERS
+static void raise_remote_softirq_handler(void *nr)
+{
+	raise_softirq_irqoff((unsigned int)(long)nr);
+}
+
+static void raise_remote_softirq(int cpu, unsigned int nr)
+{
+	smp_call_function_single(cpu, raise_remote_softirq_handler,
+				 (void *)(long) nr, 0);
+}
+#endif
+
 static void migrate_hrtimers(int scpu)
 {
 	struct hrtimer_cpu_base *old_base, *new_base;
+	int dcpu;
 	int i;

 	BUG_ON(cpu_online(scpu));
+	BUG_ON(!irqs_disabled());
 	tick_cancel_sched_timer(scpu);

-	local_irq_disable();
+	dcpu = any_online_cpu(cpu_online_map);
 	old_base = &per_cpu(hrtimer_bases, scpu);
-	new_base = &__get_cpu_var(hrtimer_bases);
+	new_base = &per_cpu(hrtimer_bases, dcpu);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
@@ -1687,10 +1702,9 @@ static void migrate_hrtimers(int scpu)

 	raw_spin_unlock(&old_base->lock);
 	raw_spin_unlock(&new_base->lock);
-
-	/* Check, if we got expired work to do */
-	__hrtimer_peek_ahead_timers();
-	local_irq_enable();
+#ifdef CONFIG_HIGH_RES_TIMERS
+	raise_remote_softirq(dcpu, HRTIMER_SOFTIRQ);
+#endif
 }

 #endif /* CONFIG_HOTPLUG_CPU */
@@ -1711,14 +1725,12 @@ static int __cpuinit hrtimer_cpu_notify(struct notifier_block *self,
 	case CPU_DYING:
 	case CPU_DYING_FROZEN:
 		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DYING, &scpu);
+		migrate_hrtimers(scpu);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-	{
 		clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
-		migrate_hrtimers(scpu);
 		break;
-	}
 #endif

 	default:
diff --git a/kernel/timer.c b/kernel/timer.c
index 97bf05b..c9e8679 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1665,16 +1665,19 @@ static void __cpuinit migrate_timers(int cpu)
 {
 	struct tvec_base *old_base;
 	struct tvec_base *new_base;
+	int dcpu;
 	int i;

 	BUG_ON(cpu_online(cpu));
+	BUG_ON(!irqs_disabled());
+	dcpu = any_online_cpu(cpu_online_map);
 	old_base = per_cpu(tvec_bases, cpu);
-	new_base = get_cpu_var(tvec_bases);
+	new_base = per_cpu(tvec_bases, dcpu);
 	/*
 	 * The caller is globally serialized and nobody else
 	 * takes two locks at once, deadlock is not possible.
 	 */
-	spin_lock_irq(&new_base->lock);
+	spin_lock(&new_base->lock);
 	spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);

 	BUG_ON(old_base->running_timer);
@@ -1689,8 +1692,7 @@ static void __cpuinit migrate_timers(int cpu)
 	}

 	spin_unlock(&old_base->lock);
-	spin_unlock_irq(&new_base->lock);
-	put_cpu_var(tvec_bases);
+	spin_unlock(&new_base->lock);
 }
 #endif /* CONFIG_HOTPLUG_CPU */

@@ -1708,8 +1710,8 @@ static int __cpuinit timer_cpu_notify(struct notifier_block *self,
 			return notifier_from_errno(err);
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
+	case CPU_DYING:
+	case CPU_DYING_FROZEN:
 		migrate_timers(cpu);
 		break;
 #endif

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-21 14:20 [PATCH/RFC] timer: fix deadlock on cpu hotplug Heiko Carstens
@ 2010-09-21 15:36 ` Tejun Heo
  2010-09-21 15:40   ` Peter Zijlstra
  2010-09-21 15:39 ` [PATCH/RFC] timer: fix deadlock on cpu hotplug Thomas Gleixner
  1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2010-09-21 15:36 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel

Hello,

On 09/21/2010 04:20 PM, Heiko Carstens wrote:
> For some reason the scheduler decided to throttle RT tasks on the runqueue
> of cpu 5 (rt_throttled = 1). So as long as rt_throttled == 1 we won't see the
> migration thread coming back to execution.
> The only thing that would unthrottle the runqueue would be the rt_period_timer.
> The timer is indeed scheduled, however in the dump I have it has been expired
> for more than four hours.
> The reason is simply that the timer is pending on the offlined cpu 0 and
> therefore would never fire before it gets migrated to an online cpu. Before
> the cpu hotplug mechanisms (cpu hotplug notifier with state CPU_DEAD) would
> migrate the timer to an online cpu stop_machine() must complete ---> deadlock.
> 
> The fix _seems_ to be simple: just migrate timers after __cpu_disable() has
> been called and use the CPU_DYING state. The subtle difference is of course
> that the migration code now gets executed on the cpu that actually just is
> going to disable itself instead of an arbitrary cpu that stays online.

I think this is the second time we're seeing deadlock during cpu down
due to RT throttling and timer problem.  The rather delicate
dependency there makes me somewhat nervous.  If possible, I think it
would be better if we can simply turn the RT throttling off when
cpu_stop kicks in.  It's intended to be a mechanism to monopolize all
CPU cycles to begin with.  Would that be difficult?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-21 15:36 ` Tejun Heo
@ 2010-09-21 15:40   ` Peter Zijlstra
  2010-09-22  8:37     ` Heiko Carstens
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-21 15:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Heiko Carstens, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel

On Tue, 2010-09-21 at 17:36 +0200, Tejun Heo wrote:
> Hello,
> 
> On 09/21/2010 04:20 PM, Heiko Carstens wrote:
> > For some reason the scheduler decided to throttle RT tasks on the runqueue
> > of cpu 5 (rt_throttled = 1). So as long as rt_throttled == 1 we won't see the
> > migration thread coming back to execution.
> > The only thing that would unthrottle the runqueue would be the rt_period_timer.
> > The timer is indeed scheduled, however in the dump I have it has been expired
> > for more than four hours.
> > The reason is simply that the timer is pending on the offlined cpu 0 and
> > therefore would never fire before it gets migrated to an online cpu. Before
> > the cpu hotplug mechanisms (cpu hotplug notifier with state CPU_DEAD) would
> > migrate the timer to an online cpu stop_machine() must complete ---> deadlock.
> > 
> > The fix _seems_ to be simple: just migrate timers after __cpu_disable() has
> > been called and use the CPU_DYING state. The subtle difference is of course
> > that the migration code now gets executed on the cpu that actually just is
> > going to disable itself instead of an arbitrary cpu that stays online.
> 
> I think this is the second time we're seeing deadlock during cpu down
> due to RT throttling and timer problem.  The rather delicate
> dependency there makes me somewhat nervous.  If possible, I think it
> would be better if we can simply turn the RT throttling off when
> cpu_stop kicks in.  It's intended to be a mechanism to monopolize all
> CPU cycles to begin with.  Would that be difficult?

I've wanted to pull the whole migration thread out from SCHED_FIFO for a
while. Doing that is probably the easiest thing.

Still would be nice to also cure this problem differently.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-21 15:40   ` Peter Zijlstra
@ 2010-09-22  8:37     ` Heiko Carstens
  2010-09-22  9:22       ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Heiko Carstens @ 2010-09-22  8:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel

On Tue, Sep 21, 2010 at 05:40:18PM +0200, Peter Zijlstra wrote:
> On Tue, 2010-09-21 at 17:36 +0200, Tejun Heo wrote:
> > I think this is the second time we're seeing deadlock during cpu down
> > due to RT throttling and timer problem.  The rather delicate
> > dependency there makes me somewhat nervous.  If possible, I think it
> > would be better if we can simply turn the RT throttling off when
> > cpu_stop kicks in.  It's intended to be a mechanism to monopolize all
> > CPU cycles to begin with.  Would that be difficult?
> 
> I've wanted to pull the whole migration thread out from SCHED_FIFO for a
> while. Doing that is probably the easiest thing.

Something like this?

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 4372ccb..854fd57 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -291,7 +291,6 @@ repeat:
 static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 					   unsigned long action, void *hcpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	unsigned int cpu = (unsigned long)hcpu;
 	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 	struct task_struct *p;
@@ -304,7 +303,6 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 				   cpu);
 		if (IS_ERR(p))
 			return NOTIFY_BAD;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
 		get_task_struct(p);
 		stopper->thread = p;
 		break;

...gets stuck nearly immediatly on cpu hotplug stress if the machine is
doing anything but idling around.
I was too lazy to figure out why it got stuck. I'm afraid that with such
a change a new class of bugs will appear.

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-22  8:37     ` Heiko Carstens
@ 2010-09-22  9:22       ` Peter Zijlstra
  2010-09-22 14:29         ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-22  9:22 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel

On Wed, 2010-09-22 at 10:37 +0200, Heiko Carstens wrote:
> > I've wanted to pull the whole migration thread out from SCHED_FIFO for a
> > while. Doing that is probably the easiest thing.
> 
> Something like th 

No that makes it a SCHED_OTHER task, and will indeed result in a wedged
system very quickly.

The idea was to move it to a class of its own above SCHED_FIFO.

I'll try and get something done, but I'm heading out to LinuxCon.JP
soon.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-22  9:22       ` Peter Zijlstra
@ 2010-09-22 14:29         ` Peter Zijlstra
  2010-09-23 13:31           ` Heiko Carstens
  2010-10-18 19:16           ` [tip:sched/core] sched: Create special class for stop/migrate work tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-22 14:29 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel

On Wed, 2010-09-22 at 11:22 +0200, Peter Zijlstra wrote:

> The idea was to move it to a class of its own above SCHED_FIFO.
> 
> I'll try and get something done, but I'm heading out to LinuxCon.JP
> soon.

Something like the below, it seems to boot, build a kernel and hotplug.

Note that it also fixes a curious corner case, current mainline allows
userspace to change the scheduling priority/class of the migration
threads... something which isn't really advisable for the general health
of your system.

---
Subject: sched: Create special class for stop/migrate work
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Wed Sep 22 13:53:15 CEST 2010

In order to separate the stop/migrate work thread from the SCHED_FIFO
implementation, create a special class for it that is of higher priority
than SCHED_FIFO itself.

This currently solves a problem where cpu-hotplug consumes so much cpu-time
that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
timer pending on the now dead cpu.

It is also required for when we add the planned deadline scheduling class
above SCHED_FIFO, as the stop/migrate thread still needs to transcent those
tasks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/sched.c          |   54 ++++++++++++++++++++----
 kernel/sched_stoptask.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/stop_machine.c   |    6 +-
 3 files changed, 156 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -486,7 +486,7 @@ struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
-	struct task_struct *curr, *idle;
+	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 
@@ -1837,7 +1837,7 @@ static inline void __set_task_cpu(struct
 
 static const struct sched_class rt_sched_class;
 
-#define sched_class_highest (&rt_sched_class)
+#define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
 
@@ -1917,10 +1917,41 @@ static void deactivate_task(struct rq *r
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_stoptask.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
 #endif
 
+void sched_set_stop_task(int cpu, struct task_struct *stop)
+{
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
+	struct task_struct *old_stop = cpu_rq(cpu)->stop;
+
+	if (stop) {
+		/*
+		 * Make it appear like a SCHED_FIFO task, its something
+		 * userspace knows about and won't get confused about.
+		 *
+		 * Also, it will make PI more or less work without too
+		 * much confusion -- but then, stop work should not
+		 * rely on PI working anyway.
+		 */
+		sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
+
+		stop->sched_class = &stop_sched_class;
+	}
+
+	cpu_rq(cpu)->stop = stop;
+
+	if (old_stop) {
+		/*
+		 * Reset it back to a normal scheduling class so that
+		 * it can die in pieces :-)
+		 */
+		old_stop->sched_class = &rt_sched_class;
+	}
+}
+
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
@@ -3720,17 +3751,13 @@ pick_next_task(struct rq *rq)
 			return p;
 	}
 
-	class = sched_class_highest;
-	for ( ; ; ) {
+	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
 			return p;
-		/*
-		 * Will never be NULL as the idle class always
-		 * returns a non-NULL p:
-		 */
-		class = class->next;
 	}
+
+	BUG(); /* the idle class will always have a runnable task */
 }
 
 /*
@@ -4659,6 +4686,15 @@ static int __sched_setscheduler(struct t
 	 */
 	rq = __task_rq_lock(p);
 
+	/*
+	 * Changing the policy of the stop threads is a very bad idea
+	 */
+	if (p == rq->stop) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EINVAL;
+	}
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
 		/*
Index: linux-2.6/kernel/stop_machine.c
===================================================================
--- linux-2.6.orig/kernel/stop_machine.c
+++ linux-2.6/kernel/stop_machine.c
@@ -287,11 +287,12 @@ static int cpu_stopper_thread(void *data
 	goto repeat;
 }
 
+extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+
 /* manage stopper for a cpu, mostly lifted from sched migration thread mgmt */
 static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 					   unsigned long action, void *hcpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	unsigned int cpu = (unsigned long)hcpu;
 	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 	struct task_struct *p;
@@ -304,8 +305,8 @@ static int __cpuinit cpu_stop_cpu_callba
 				   cpu);
 		if (IS_ERR(p))
 			return NOTIFY_BAD;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
 		get_task_struct(p);
+		sched_set_stop_task(cpu, p);
 		stopper->thread = p;
 		break;
 
@@ -325,6 +326,7 @@ static int __cpuinit cpu_stop_cpu_callba
 	{
 		struct cpu_stop_work *work;
 
+		sched_set_stop_task(cpu, NULL);
 		/* kill the stopper */
 		kthread_stop(stopper->thread);
 		/* drain remaining works */
Index: linux-2.6/kernel/sched_stoptask.c
===================================================================
--- /dev/null
+++ linux-2.6/kernel/sched_stoptask.c
@@ -0,0 +1,107 @@
+/*
+ * stop-task scheduling class.
+ *
+ * The stop task is the highest priority task in the system, it preempts
+ * everything and will be preempted by nothing.
+ *
+ * See kernel/stop_machine.c
+ */
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_stop(struct rq *rq, struct task_struct *p, int sd_flag, int flags)
+{
+	return task_cpu(p); /* stop tasks as never migrate */
+}
+#endif /* CONFIG_SMP */
+
+static void
+check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+	resched_task(rq->curr); /* we preempt everything */
+}
+
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *stop = rq->stop;
+
+	if (stop && stop->state == TASK_RUNNING)
+		return stop;
+
+	return NULL;
+}
+
+static void
+enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+}
+
+static void
+dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+}
+
+static void yield_task_stop(struct rq *rq)
+{
+	BUG(); /* the stop task should never yield, its pointless. */
+}
+
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
+{
+}
+
+static void set_curr_task_stop(struct rq *rq)
+{
+}
+
+static void switched_to_stop(struct rq *rq, struct task_struct *p,
+			     int running)
+{
+	BUG(); /* its impossible to change to/from this class */
+}
+
+static void prio_changed_stop(struct rq *rq, struct task_struct *p,
+			      int oldprio, int running)
+{
+	BUG(); /* how!?, what priority? */
+}
+
+static unsigned int
+get_rr_interval_stop(struct rq *rq, struct task_struct *task)
+{
+	return 0;
+}
+
+/*
+ * Simple, special scheduling class for the per-CPU stop tasks:
+ */
+static const struct sched_class stop_sched_class = {
+	.next			= &rt_sched_class,
+
+	.enqueue_task		= enqueue_task_stop,
+	.dequeue_task		= dequeue_task_stop,
+	.yield_task		= yield_task_stop,
+
+	.check_preempt_curr	= check_preempt_curr_stop,
+
+	.pick_next_task		= pick_next_task_stop,
+	.put_prev_task		= put_prev_task_stop,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_stop,
+#endif
+
+	.set_curr_task          = set_curr_task_stop,
+	.task_tick		= task_tick_stop,
+
+	.get_rr_interval	= get_rr_interval_stop,
+
+	.prio_changed		= prio_changed_stop,
+	.switched_to		= switched_to_stop,
+
+	/* no .task_new for stop tasks */
+};


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-22 14:29         ` Peter Zijlstra
@ 2010-09-23 13:31           ` Heiko Carstens
  2010-09-25  0:19             ` Peter Zijlstra
  2010-10-11 12:31             ` Peter Zijlstra
  2010-10-18 19:16           ` [tip:sched/core] sched: Create special class for stop/migrate work tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 12+ messages in thread
From: Heiko Carstens @ 2010-09-23 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel, Arnd Bergmann

On Wed, Sep 22, 2010 at 04:29:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-09-22 at 11:22 +0200, Peter Zijlstra wrote:
> 
> > The idea was to move it to a class of its own above SCHED_FIFO.
> > 
> > I'll try and get something done, but I'm heading out to LinuxCon.JP
> > soon.
> 
> Something like the below, it seems to boot, build a kernel and hotplug.

Thanks for the fast patch. Yes, it works. Sort of:

------------[ cut here ]------------
WARNING: at kernel/kthread.c:182
Modules linked in:
Modules linked in:
CPU: 0 Not tainted 2.6.36-rc5-00034-g8b15575-dirty #21
Process cpukill2.sh (pid: 2630, task: 000000003ac05238, ksp: 000000003d8bf8d8)
Krnl PSW : 0704000180000000 000000000016adb6 (kthread_bind+0x8a/0x98)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 00000000026afd00 0000000000000000 0000000000000002
           000000000016ad60 000000000050e980 0000000000000000 00000000026b9150
           0000000000000002 0000000000000002 0000000000000002 000000003fbc0138
           0000000000000002 000000000050eb08 000000000016ad60 000000003d8bfbb0
Krnl Code: 000000000016ada6: f0b80004ebbf       srp     4(12,%r0),3007(%r14),8
           000000000016adac: f0a0000407f4       srp     4(11,%r0),2036,0
           000000000016adb2: a7f40001           brc     15,16adb4
          >000000000016adb6: e340f0b80004       lg      %r4,184(%r15)
           000000000016adbc: ebbff0a00004       lmg     %r11,%r15,160(%r15)
           000000000016adc2: 07f4               bcr     15,%r4
           000000000016adc4: eb5ff0400024       stmg    %r5,%r15,64(%r15)
           000000000016adca: c0d0001d1ea3       larl    %r13,50eb10
Call Trace:
([<000000000016ad60>] kthread_bind+0x34/0x98)
 [<00000000004f5270>] cpu_stop_cpu_callback+0xf0/0x1e4
 [<00000000004ff806>] notifier_call_chain+0x8e/0xdc
 [<00000000001721f2>] __raw_notifier_call_chain+0x22/0x30
 [<0000000000148c68>] __cpu_notify+0x44/0x80
 [<00000000004f3be0>] _cpu_up+0x110/0x150
 [<00000000004f3cea>] cpu_up+0xca/0xec
 [<00000000004f0416>] store_online+0x92/0xcc
 [<000000000026880a>] sysfs_write_file+0xf6/0x1a8
 [<00000000001f8c6c>] vfs_write+0xb0/0x158
 [<00000000001f8f6c>] SyS_write+0x58/0xa8
 [<0000000000117cf6>] sysc_noemu+0x10/0x16
 [<0000020000158fcc>] 0x20000158fcc
4 locks held by cpukill2.sh/2630:
 #0:  (&buffer->mutex){+.+.+.}, at: [<000000000026875e>] sysfs_write_file+0x4a/0x1a8
 #1:  (s_active#66){.+.+.+}, at: [<00000000002687e6>] sysfs_write_file+0xd2/0x1a80
 #2:  (cpu_add_remove_lock){+.+.+.}, at: [<00000000004f3cce>] cpu_up+0xae/0xec
 #3:  (cpu_hotplug.lock){+.+.+.}, at: [<0000000000148d42>] cpu_hotplug_begin+0x3e/0x74
Last Breaking-Event-Address:
 [<000000000016adb2>] kthread_bind+0x86/0x98

..and crashes afterwards ;)

Btw. I also tried to boot 2.6.35.5 with your patch and it hangs just at
the beginning. I also had to apply 5e3d20a68f63fc5a310687d81956c3b96e488b84
"init: Remove the BKL from startup code" to make the machine boot again,
which was a bit surprising.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-23 13:31           ` Heiko Carstens
@ 2010-09-25  0:19             ` Peter Zijlstra
  2010-10-11 12:31             ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-25  0:19 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel, Arnd Bergmann

On Thu, 2010-09-23 at 15:31 +0200, Heiko Carstens wrote:
> On Wed, Sep 22, 2010 at 04:29:36PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-09-22 at 11:22 +0200, Peter Zijlstra wrote:
> > 
> > > The idea was to move it to a class of its own above SCHED_FIFO.
> > > 
> > > I'll try and get something done, but I'm heading out to LinuxCon.JP
> > > soon.
> > 
> > Something like the below, it seems to boot, build a kernel and hotplug.
> 
> Thanks for the fast patch. Yes, it works. Sort of:
> 
> ------------[ cut here ]------------

<SPLAT>

> 
> ..and crashes afterwards ;)

Yeah, I can imagine,. weird though, kthread_create() should leave the
new thread in uninterruptible context. The above warning implies this is
not so.. 

> Btw. I also tried to boot 2.6.35.5 with your patch and it hangs just at
> the beginning. I also had to apply 5e3d20a68f63fc5a310687d81956c3b96e488b84
> "init: Remove the BKL from startup code" to make the machine boot again,
> which was a bit surprising.

Most curious and unexpected,.. I'll try and have a look, but it might
have to wait until after LinuxCon.JP..



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-23 13:31           ` Heiko Carstens
  2010-09-25  0:19             ` Peter Zijlstra
@ 2010-10-11 12:31             ` Peter Zijlstra
  2010-10-11 13:51               ` Heiko Carstens
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2010-10-11 12:31 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel, Arnd Bergmann

On Thu, 2010-09-23 at 15:31 +0200, Heiko Carstens wrote:
> On Wed, Sep 22, 2010 at 04:29:36PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-09-22 at 11:22 +0200, Peter Zijlstra wrote:
> > 
> > > The idea was to move it to a class of its own above SCHED_FIFO.
> > > 
> > > I'll try and get something done, but I'm heading out to LinuxCon.JP
> > > soon.
> > 
> > Something like the below, it seems to boot, build a kernel and hotplug.
> 
> Thanks for the fast patch. Yes, it works. Sort of:
> 
> ------------[ cut here ]------------
> WARNING: at kernel/kthread.c:182
> Modules linked in:
> Modules linked in:
> CPU: 0 Not tainted 2.6.36-rc5-00034-g8b15575-dirty #21
> Process cpukill2.sh (pid: 2630, task: 000000003ac05238, ksp: 000000003d8bf8d8)
> Krnl PSW : 0704000180000000 000000000016adb6 (kthread_bind+0x8a/0x98)
>            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 EA:3
> Krnl GPRS: 0000000000000000 00000000026afd00 0000000000000000 0000000000000002
>            000000000016ad60 000000000050e980 0000000000000000 00000000026b9150
>            0000000000000002 0000000000000002 0000000000000002 000000003fbc0138
>            0000000000000002 000000000050eb08 000000000016ad60 000000003d8bfbb0
> Krnl Code: 000000000016ada6: f0b80004ebbf       srp     4(12,%r0),3007(%r14),8
>            000000000016adac: f0a0000407f4       srp     4(11,%r0),2036,0
>            000000000016adb2: a7f40001           brc     15,16adb4
>           >000000000016adb6: e340f0b80004       lg      %r4,184(%r15)
>            000000000016adbc: ebbff0a00004       lmg     %r11,%r15,160(%r15)
>            000000000016adc2: 07f4               bcr     15,%r4
>            000000000016adc4: eb5ff0400024       stmg    %r5,%r15,64(%r15)
>            000000000016adca: c0d0001d1ea3       larl    %r13,50eb10
> Call Trace:
> ([<000000000016ad60>] kthread_bind+0x34/0x98)
>  [<00000000004f5270>] cpu_stop_cpu_callback+0xf0/0x1e4
>  [<00000000004ff806>] notifier_call_chain+0x8e/0xdc
>  [<00000000001721f2>] __raw_notifier_call_chain+0x22/0x30
>  [<0000000000148c68>] __cpu_notify+0x44/0x80
>  [<00000000004f3be0>] _cpu_up+0x110/0x150
>  [<00000000004f3cea>] cpu_up+0xca/0xec
>  [<00000000004f0416>] store_online+0x92/0xcc
>  [<000000000026880a>] sysfs_write_file+0xf6/0x1a8
>  [<00000000001f8c6c>] vfs_write+0xb0/0x158
>  [<00000000001f8f6c>] SyS_write+0x58/0xa8
>  [<0000000000117cf6>] sysc_noemu+0x10/0x16
>  [<0000020000158fcc>] 0x20000158fcc
> 4 locks held by cpukill2.sh/2630:
>  #0:  (&buffer->mutex){+.+.+.}, at: [<000000000026875e>] sysfs_write_file+0x4a/0x1a8
>  #1:  (s_active#66){.+.+.+}, at: [<00000000002687e6>] sysfs_write_file+0xd2/0x1a80
>  #2:  (cpu_add_remove_lock){+.+.+.}, at: [<00000000004f3cce>] cpu_up+0xae/0xec
>  #3:  (cpu_hotplug.lock){+.+.+.}, at: [<0000000000148d42>] cpu_hotplug_begin+0x3e/0x74
> Last Breaking-Event-Address:
>  [<000000000016adb2>] kthread_bind+0x86/0x98
> 
> ..and crashes afterwards ;)
> 
> Btw. I also tried to boot 2.6.35.5 with your patch and it hangs just at
> the beginning. I also had to apply 5e3d20a68f63fc5a310687d81956c3b96e488b84
> "init: Remove the BKL from startup code" to make the machine boot again,
> which was a bit surprising.


Does something like the below (on top of the previous patch) work?

Can I have your hotplug script (cpukill2.sh?) to test, a simply while(1)
loop offlining and onlining cpu1 doesn't seem to impress my machine
(with the below folded in)..

---
Index: linux-2.6/kernel/stop_machine.c
===================================================================
--- linux-2.6.orig/kernel/stop_machine.c
+++ linux-2.6/kernel/stop_machine.c
@@ -306,12 +306,12 @@ static int __cpuinit cpu_stop_cpu_callba
 		if (IS_ERR(p))
 			return NOTIFY_BAD;
 		get_task_struct(p);
+		kthread_bind(p, cpu);
 		sched_set_stop_task(cpu, p);
 		stopper->thread = p;
 		break;
 
 	case CPU_ONLINE:
-		kthread_bind(stopper->thread, cpu);
 		/* strictly unnecessary, as first user will wake it */
 		wake_up_process(stopper->thread);
 		/* mark enabled */


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-10-11 12:31             ` Peter Zijlstra
@ 2010-10-11 13:51               ` Heiko Carstens
  0 siblings, 0 replies; 12+ messages in thread
From: Heiko Carstens @ 2010-10-11 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	Rusty Russell, linux-kernel, Arnd Bergmann

On Mon, Oct 11, 2010 at 02:31:03PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-09-23 at 15:31 +0200, Heiko Carstens wrote:
> > On Wed, Sep 22, 2010 at 04:29:36PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-09-22 at 11:22 +0200, Peter Zijlstra wrote:
> > > 
> > > > The idea was to move it to a class of its own above SCHED_FIFO.
> > > > 
> > > > I'll try and get something done, but I'm heading out to LinuxCon.JP
> > > > soon.
> > > 
> > > Something like the below, it seems to boot, build a kernel and hotplug.
> > 
> > Thanks for the fast patch. Yes, it works. Sort of:
> > 
> > ------------[ cut here ]------------
> > WARNING: at kernel/kthread.c:182
> > [...]
> > ..and crashes afterwards ;)
> > 
> > Btw. I also tried to boot 2.6.35.5 with your patch and it hangs just at
> > the beginning. I also had to apply 5e3d20a68f63fc5a310687d81956c3b96e488b84
> > "init: Remove the BKL from startup code" to make the machine boot again,
> > which was a bit surprising.
> 
> Does something like the below (on top of the previous patch) work?

Yes, it works. Just survived a git clone a complete kernel build while cpu
hotplug stress was running. Thanks!

Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>

> Can I have your hotplug script (cpukill2.sh?) to test, a simply while(1)
> loop offlining and onlining cpu1 doesn't seem to impress my machine
> (with the below folded in)..

That's one of the scripts our testers are using:

#! /bin/bash
RANDOM=`date '+%s'`
let "NUMBER = $RANDOM % 2 + 0"
CPUS=`ls /sys/devices/system/cpu/ | grep cpu`
count=`echo $CPUS | wc -w`
echo You have $count no. of CPUs
echo You have the following CPUs: $CPUS
while [ 1 ] ; do
   for ((i=0 ; i<=$count ; i++)) ; do
      rnd=$[($RANDOM % $count) ]     # RND between 0..max_no_of_cpu
      let "NUMBER = $RANDOM % 2 + 0"
      echo "echo $NUMBER > /sys/devices/system/cpu/cpu$rnd/online"
      echo $NUMBER > /sys/devices/system/cpu/cpu$rnd/online
   done
done

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [tip:sched/core] sched: Create special class for stop/migrate work
  2010-09-22 14:29         ` Peter Zijlstra
  2010-09-23 13:31           ` Heiko Carstens
@ 2010-10-18 19:16           ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-10-18 19:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, heiko.carstens, tglx,
	mingo

Commit-ID:  34f971f6f7988be4d014eec3e3526bee6d007ffa
Gitweb:     http://git.kernel.org/tip/34f971f6f7988be4d014eec3e3526bee6d007ffa
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Wed, 22 Sep 2010 13:53:15 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 18 Oct 2010 18:41:58 +0200

sched: Create special class for stop/migrate work

In order to separate the stop/migrate work thread from the SCHED_FIFO
implementation, create a special class for it that is of higher priority than
SCHED_FIFO itself.

This currently solves a problem where cpu-hotplug consumes so much cpu-time
that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
timer pending on the now dead cpu.

It is also required for when we add the planned deadline scheduling class above
SCHED_FIFO, as the stop/migrate thread still needs to transcent those tasks.

Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1285165776.2275.1022.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c          |   54 +++++++++++++++++++----
 kernel/sched_stoptask.c |  108 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/stop_machine.c   |    8 ++-
 3 files changed, 158 insertions(+), 12 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7f52283..5f64fed 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -486,7 +486,7 @@ struct rq {
 	 */
 	unsigned long nr_uninterruptible;
 
-	struct task_struct *curr, *idle;
+	struct task_struct *curr, *idle, *stop;
 	unsigned long next_balance;
 	struct mm_struct *prev_mm;
 
@@ -1837,7 +1837,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 static const struct sched_class rt_sched_class;
 
-#define sched_class_highest (&rt_sched_class)
+#define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
 
@@ -1917,10 +1917,41 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
+#include "sched_stoptask.c"
 #ifdef CONFIG_SCHED_DEBUG
 # include "sched_debug.c"
 #endif
 
+void sched_set_stop_task(int cpu, struct task_struct *stop)
+{
+	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
+	struct task_struct *old_stop = cpu_rq(cpu)->stop;
+
+	if (stop) {
+		/*
+		 * Make it appear like a SCHED_FIFO task, its something
+		 * userspace knows about and won't get confused about.
+		 *
+		 * Also, it will make PI more or less work without too
+		 * much confusion -- but then, stop work should not
+		 * rely on PI working anyway.
+		 */
+		sched_setscheduler_nocheck(stop, SCHED_FIFO, &param);
+
+		stop->sched_class = &stop_sched_class;
+	}
+
+	cpu_rq(cpu)->stop = stop;
+
+	if (old_stop) {
+		/*
+		 * Reset it back to a normal scheduling class so that
+		 * it can die in pieces.
+		 */
+		old_stop->sched_class = &rt_sched_class;
+	}
+}
+
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
@@ -3720,17 +3751,13 @@ pick_next_task(struct rq *rq)
 			return p;
 	}
 
-	class = sched_class_highest;
-	for ( ; ; ) {
+	for_each_class(class) {
 		p = class->pick_next_task(rq);
 		if (p)
 			return p;
-		/*
-		 * Will never be NULL as the idle class always
-		 * returns a non-NULL p:
-		 */
-		class = class->next;
 	}
+
+	BUG(); /* the idle class will always have a runnable task */
 }
 
 /*
@@ -4659,6 +4686,15 @@ recheck:
 	 */
 	rq = __task_rq_lock(p);
 
+	/*
+	 * Changing the policy of the stop threads its a very bad idea
+	 */
+	if (p == rq->stop) {
+		__task_rq_unlock(rq);
+		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+		return -EINVAL;
+	}
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (user) {
 		/*
diff --git a/kernel/sched_stoptask.c b/kernel/sched_stoptask.c
new file mode 100644
index 0000000..45bddc0
--- /dev/null
+++ b/kernel/sched_stoptask.c
@@ -0,0 +1,108 @@
+/*
+ * stop-task scheduling class.
+ *
+ * The stop task is the highest priority task in the system, it preempts
+ * everything and will be preempted by nothing.
+ *
+ * See kernel/stop_machine.c
+ */
+
+#ifdef CONFIG_SMP
+static int
+select_task_rq_stop(struct rq *rq, struct task_struct *p,
+		    int sd_flag, int flags)
+{
+	return task_cpu(p); /* stop tasks as never migrate */
+}
+#endif /* CONFIG_SMP */
+
+static void
+check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+	resched_task(rq->curr); /* we preempt everything */
+}
+
+static struct task_struct *pick_next_task_stop(struct rq *rq)
+{
+	struct task_struct *stop = rq->stop;
+
+	if (stop && stop->state == TASK_RUNNING)
+		return stop;
+
+	return NULL;
+}
+
+static void
+enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+}
+
+static void
+dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
+{
+}
+
+static void yield_task_stop(struct rq *rq)
+{
+	BUG(); /* the stop task should never yield, its pointless. */
+}
+
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
+{
+}
+
+static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
+{
+}
+
+static void set_curr_task_stop(struct rq *rq)
+{
+}
+
+static void switched_to_stop(struct rq *rq, struct task_struct *p,
+			     int running)
+{
+	BUG(); /* its impossible to change to this class */
+}
+
+static void prio_changed_stop(struct rq *rq, struct task_struct *p,
+			      int oldprio, int running)
+{
+	BUG(); /* how!?, what priority? */
+}
+
+static unsigned int
+get_rr_interval_stop(struct rq *rq, struct task_struct *task)
+{
+	return 0;
+}
+
+/*
+ * Simple, special scheduling class for the per-CPU stop tasks:
+ */
+static const struct sched_class stop_sched_class = {
+	.next			= &rt_sched_class,
+
+	.enqueue_task		= enqueue_task_stop,
+	.dequeue_task		= dequeue_task_stop,
+	.yield_task		= yield_task_stop,
+
+	.check_preempt_curr	= check_preempt_curr_stop,
+
+	.pick_next_task		= pick_next_task_stop,
+	.put_prev_task		= put_prev_task_stop,
+
+#ifdef CONFIG_SMP
+	.select_task_rq		= select_task_rq_stop,
+#endif
+
+	.set_curr_task          = set_curr_task_stop,
+	.task_tick		= task_tick_stop,
+
+	.get_rr_interval	= get_rr_interval_stop,
+
+	.prio_changed		= prio_changed_stop,
+	.switched_to		= switched_to_stop,
+
+	/* no .task_new for stop tasks */
+};
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 4372ccb..090c288 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -287,11 +287,12 @@ repeat:
 	goto repeat;
 }
 
+extern void sched_set_stop_task(int cpu, struct task_struct *stop);
+
 /* manage stopper for a cpu, mostly lifted from sched migration thread mgmt */
 static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 					   unsigned long action, void *hcpu)
 {
-	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
 	unsigned int cpu = (unsigned long)hcpu;
 	struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 	struct task_struct *p;
@@ -304,13 +305,13 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 				   cpu);
 		if (IS_ERR(p))
 			return NOTIFY_BAD;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
 		get_task_struct(p);
+		kthread_bind(p, cpu);
+		sched_set_stop_task(cpu, p);
 		stopper->thread = p;
 		break;
 
 	case CPU_ONLINE:
-		kthread_bind(stopper->thread, cpu);
 		/* strictly unnecessary, as first user will wake it */
 		wake_up_process(stopper->thread);
 		/* mark enabled */
@@ -325,6 +326,7 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 	{
 		struct cpu_stop_work *work;
 
+		sched_set_stop_task(cpu, NULL);
 		/* kill the stopper */
 		kthread_stop(stopper->thread);
 		/* drain remaining works */

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH/RFC] timer: fix deadlock on cpu hotplug
  2010-09-21 14:20 [PATCH/RFC] timer: fix deadlock on cpu hotplug Heiko Carstens
  2010-09-21 15:36 ` Tejun Heo
@ 2010-09-21 15:39 ` Thomas Gleixner
  1 sibling, 0 replies; 12+ messages in thread
From: Thomas Gleixner @ 2010-09-21 15:39 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ingo Molnar, Andrew Morton, Tejun Heo,
	Rusty Russell, linux-kernel

On Tue, 21 Sep 2010, Heiko Carstens wrote:

> From: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> I've seen the following deadlock on cpu hotplug stress test:
> 
> On cpu down the process that triggered offlining of a cpu waits for
> stop_machine() to finish:
> 
> PID: 56033  TASK: e001540           CPU: 2   COMMAND: "cpu_all_off"
>  #0 [37aa7990] schedule at 559194
>  #1 [37aa7a40] schedule_timeout at 559de0
>  #2 [37aa7b18] wait_for_common at 558bfa
>  #3 [37aa7b90] __stop_cpus at 1a876e
>  #4 [37aa7c68] stop_cpus at 1a8a3a
>  #5 [37aa7c98] __stop_machine at 1a8adc
>  #6 [37aa7cf8] _cpu_down at 55007a
>  #7 [37aa7d78] cpu_down at 550280
>  #8 [37aa7d98] store_online at 551d48
>  #9 [37aa7dc0] sysfs_write_file at 2a3fa2
>  #10 [37aa7e18] vfs_write at 229b3c
>  #11 [37aa7e78] sys_write at 229d38
>  #12 [37aa7eb8] sysc_noemu at 1146de
> 
> All cpus actually have been synchronized and cpu 0 got offlined. However,
> the migration thread on cpu 5 got preempted just between preempt_enable()
> and cpu_stop_signal_done() within cpu_stopper_thread():
> 
> PID: 55622  TASK: 31a00a40          CPU: 5   COMMAND: "migration/5"
>  #0 [30f8bc80] schedule at 559194
>  #1 [30f8bd30] preempt_schedule at 559b54
>  #2 [30f8bd50] cpu_stopper_thread at 1a81dc
>  #3 [30f8be28] kthread at 163224
>  #4 [30f8beb8] kernel_thread_starter at 106c1a
> 
> For some reason the scheduler decided to throttle RT tasks on the runqueue
> of cpu 5 (rt_throttled = 1). So as long as rt_throttled == 1 we won't see the
> migration thread coming back to execution.
> The only thing that would unthrottle the runqueue would be the rt_period_timer.
> The timer is indeed scheduled, however in the dump I have it has been expired
> for more than four hours.
> The reason is simply that the timer is pending on the offlined cpu 0 and
> therefore would never fire before it gets migrated to an online cpu. Before
> the cpu hotplug mechanisms (cpu hotplug notifier with state CPU_DEAD) would
> migrate the timer to an online cpu stop_machine() must complete ---> deadlock.
> 
> The fix _seems_ to be simple: just migrate timers after __cpu_disable() has
> been called and use the CPU_DYING state. The subtle difference is of course
> that the migration code now gets executed on the cpu that actually just is
> going to disable itself instead of an arbitrary cpu that stays online.
> 
> This patch moves the migration of pending timers to an earlier time
> (CPU_DYING), so that the deadlock described cannot happen anymore.
> 
> Up to now the hrtimer migration code called __hrtimer_peek_ahead_timers()
> after migrating timers to the _current_ cpu. Now pending timers are moved
> to a remote cpu and calling that function isn't possible anymore.
> To solve that I introduced the function raise_remote_softirq() which gets
> used to raise the HRTIMER_SOFTIRQ on the cpu where the timers have been
> migrated to. Which will lead to execution of hrtimer_peek_ahead_timers()
> as soon as softirq are executed on the remote cpu.
> 
> The proper place for such a generic function should be softirq.c, but this
> is just an RFC and I would like to check if people are ok with the general
> approach.
> Or maybe it's possible to fix this in a better way?

Hmm, shouldnt we simply prevent the throttler to hold off the
migration thread ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-10-18 19:16 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-09-21 14:20 [PATCH/RFC] timer: fix deadlock on cpu hotplug Heiko Carstens
2010-09-21 15:36 ` Tejun Heo
2010-09-21 15:40   ` Peter Zijlstra
2010-09-22  8:37     ` Heiko Carstens
2010-09-22  9:22       ` Peter Zijlstra
2010-09-22 14:29         ` Peter Zijlstra
2010-09-23 13:31           ` Heiko Carstens
2010-09-25  0:19             ` Peter Zijlstra
2010-10-11 12:31             ` Peter Zijlstra
2010-10-11 13:51               ` Heiko Carstens
2010-10-18 19:16           ` [tip:sched/core] sched: Create special class for stop/migrate work tip-bot for Peter Zijlstra
2010-09-21 15:39 ` [PATCH/RFC] timer: fix deadlock on cpu hotplug Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox