public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>, Andrew Morton <akpm@osdl.org>
Cc: linux-kernel@vger.kernel.org
Subject: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
Date: Tue, 16 Nov 2004 12:32:09 +0100	[thread overview]
Message-ID: <20041116113209.GA1890@elte.hu> (raw)


PREEMPT_RT on SMP systems triggered weird (very high) load average
values rather easily, which turned out to be a mainline kernel
->nr_uninterruptible handling bug in try_to_wake_up().

the following code:

        if (old_state == TASK_UNINTERRUPTIBLE) {
                old_rq->nr_uninterruptible--;

potentially executes with old_rq potentially being != rq, and hence
updating ->nr_uninterruptible without the lock held. Given a
sufficiently concurrent preemption workload the count can get out of
whack and updates might get lost, permanently skewing the global count. 
Nothing except the load-average uses nr_uninterruptible() so this
condition can go unnoticed quite easily.

the fix is to update ->nr_uninterruptible always on the runqueue where
the task currently is. (this is also a tiny performance plus for
try_to_wake_up() as a stackslot gets freed up.)

while fixing this bug i found three other ->nr_uninterruptible related
bugs:

 - the update should be moved from deactivate_task() into schedule(),
   beacause e.g. setscheduler() does deactivate_task()+activate_task(),
   which in turn may result in a -1 counter-skew if setscheduler() is
   done on a task asynchronously, which task is still on the runqueue
   but has already set ->state to TASK_UNINTERRUPTIBLE. 

   sys_sched_setscheduler() is used rarely, but the bug is real. (The
   fix is also a small performance enhancement.)

   The rules for ->nr_uninterruptible updating are the following: it
   gets increased by schedule() only, when a task is moved off the
   runqueue and it has a state of TASK_UNINTERRUPTIBLE. It is decreased
   by try_to_wake_up(), by the first wakeup that materially changes the
   state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING, and moves the
   task to the runqueue.

 - on CPU-hotplug down we might zap a CPU that has a nonzero counter. 
   Due to the fuzzy nature of the global counter a CPU might hold a 
   nonzero ->nr_uninterruptible count even if it has no tasks anymore. 
   The solution is to 'migrate' the counter to another runqueue.

 - we should not return negative counter values from the
   nr_uninterruptible() function, since it accesses them without taking
   the runqueue locks, so the total sum might be slightly above or
   slightly below the real count.

i tested the attached patch on x86 SMP and it solves the load-average
problem. (I have tested CPU_HOTPLUG compilation but not functionality.) 
I think this is a must-have for 2.6.10, because there are apps that go
berzerk if load-average is too high (e.g. sendmail).

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -217,7 +217,16 @@ struct runqueue {
 	unsigned long cpu_load;
 #endif
 	unsigned long long nr_switches;
-	unsigned long expired_timestamp, nr_uninterruptible;
+
+	/*
+	 * This is part of a global counter where only the total sum
+	 * over all CPUs matters. A task can increase this counter on
+	 * one CPU and if it got migrated afterwards it may decrease
+	 * it on another CPU. Always updated under the runqueue lock:
+	 */
+	unsigned long nr_uninterruptible;
+
+	unsigned long expired_timestamp;
 	unsigned long long timestamp_last_tick;
 	task_t *curr, *idle;
 	struct mm_struct *prev_mm;
@@ -762,8 +771,6 @@ static void activate_task(task_t *p, run
 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
 	rq->nr_running--;
-	if (p->state == TASK_UNINTERRUPTIBLE)
-		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
 	p->array = NULL;
 }
@@ -983,14 +990,14 @@ static int try_to_wake_up(task_t * p, un
 	int cpu, this_cpu, success = 0;
 	unsigned long flags;
 	long old_state;
-	runqueue_t *rq, *old_rq;
+	runqueue_t *rq;
 #ifdef CONFIG_SMP
 	unsigned long load, this_load;
 	struct sched_domain *sd;
 	int new_cpu;
 #endif
 
-	old_rq = rq = task_rq_lock(p, &flags);
+	rq = task_rq_lock(p, &flags);
 	schedstat_inc(rq, ttwu_cnt);
 	old_state = p->state;
 	if (!(old_state & state))
@@ -1085,7 +1092,7 @@ out_set_cpu:
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
-		old_rq->nr_uninterruptible--;
+		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
 		 * sleep_avg beyond just interactive state.
@@ -1415,6 +1422,13 @@ unsigned long nr_uninterruptible(void)
 	for_each_cpu(i)
 		sum += cpu_rq(i)->nr_uninterruptible;
 
+	/*
+	 * Since we read the counters lockless, it might be slightly
+	 * inaccurate. Do not allow it to go below zero though:
+	 */
+	if (unlikely((long)sum < 0))
+		sum = 0;
+
 	return sum;
 }
 
@@ -2581,8 +2595,11 @@ need_resched_nonpreemptible:
 		if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
-		else
+		else {
+			if (prev->state == TASK_UNINTERRUPTIBLE)
+				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
+		}
 	}
 
 	cpu = smp_processor_id();
@@ -3914,6 +3931,26 @@ static void move_task_off_dead_cpu(int d
 	__migrate_task(tsk, dead_cpu, dest_cpu);
 }
 
+/*
+ * While a dead CPU has no uninterruptible tasks queued at this point,
+ * it might still have a nonzero ->nr_uninterruptible counter, because
+ * for performance reasons the counter is not stricly tracking tasks to
+ * their home CPUs. So we just add the counter to another CPU's counter,
+ * to keep the global sum constant after CPU-down:
+ */
+static void migrate_nr_uninterruptible(runqueue_t *rq_src)
+{
+	runqueue_t *rq_dest = cpu_rq(any_online_cpu(CPU_MASK_ALL));
+	unsigned long flags;
+
+	local_irq_save(flags);
+	double_rq_lock(rq_src, rq_dest);
+	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
+	rq_src->nr_uninterruptible = 0;
+	double_rq_unlock(rq_src, rq_dest);
+	local_irq_restore(flags);
+}
+
 /* Run through task list and migrate tasks from the dead cpu. */
 static void migrate_live_tasks(int src_cpu)
 {
@@ -4048,6 +4085,7 @@ static int migration_call(struct notifie
 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
 		migrate_dead_tasks(cpu);
 		task_rq_unlock(rq, &flags);
+		migrate_nr_uninterruptible(rq);
 		BUG_ON(rq->nr_running != 0);
 
 		/* No need to migrate the tasks: it was best-effort if

             reply	other threads:[~2004-11-16 13:01 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-11-16 11:32 Ingo Molnar [this message]
2004-11-16 22:19 ` [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs Peter Williams
2004-11-16 23:28   ` Ingo Molnar
2004-11-16 23:10     ` Linus Torvalds
2004-11-17 10:26       ` Ingo Molnar
2004-11-17 15:52         ` Linus Torvalds
2004-11-18 16:21           ` Ingo Molnar
2005-01-28  0:53       ` Paul Jackson
2005-01-28  1:06         ` Linus Torvalds
2005-01-28  2:14           ` Paul Jackson
2005-01-28  4:28         ` Ingo Molnar
2005-01-28  5:18           ` Paul Jackson
2005-01-28  6:01           ` Andi Kleen
2004-11-16 23:48     ` Peter Williams
2004-11-16 22:49 ` Nick Piggin
2004-11-16 23:03   ` Nick Piggin
2004-11-16 23:32   ` Peter Williams
2004-11-16 23:37     ` Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20041116113209.GA1890@elte.hu \
    --to=mingo@elte.hu \
    --cc=akpm@osdl.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox