From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@osdl.org>, Andrew Morton <akpm@osdl.org>
Cc: linux-kernel@vger.kernel.org
Subject: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
Date: Tue, 16 Nov 2004 12:32:09 +0100 [thread overview]
Message-ID: <20041116113209.GA1890@elte.hu> (raw)
PREEMPT_RT on SMP systems triggered weird (very high) load average
values rather easily, which turned out to be a mainline kernel
->nr_uninterruptible handling bug in try_to_wake_up().
the following code:
if (old_state == TASK_UNINTERRUPTIBLE) {
old_rq->nr_uninterruptible--;
potentially executes with old_rq potentially being != rq, and hence
updating ->nr_uninterruptible without the lock held. Given a
sufficiently concurrent preemption workload the count can get out of
whack and updates might get lost, permanently skewing the global count.
Nothing except the load-average uses nr_uninterruptible() so this
condition can go unnoticed quite easily.
the fix is to update ->nr_uninterruptible always on the runqueue where
the task currently is. (this is also a tiny performance plus for
try_to_wake_up() as a stackslot gets freed up.)
while fixing this bug i found three other ->nr_uninterruptible related
bugs:
- the update should be moved from deactivate_task() into schedule(),
beacause e.g. setscheduler() does deactivate_task()+activate_task(),
which in turn may result in a -1 counter-skew if setscheduler() is
done on a task asynchronously, which task is still on the runqueue
but has already set ->state to TASK_UNINTERRUPTIBLE.
sys_sched_setscheduler() is used rarely, but the bug is real. (The
fix is also a small performance enhancement.)
The rules for ->nr_uninterruptible updating are the following: it
gets increased by schedule() only, when a task is moved off the
runqueue and it has a state of TASK_UNINTERRUPTIBLE. It is decreased
by try_to_wake_up(), by the first wakeup that materially changes the
state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING, and moves the
task to the runqueue.
- on CPU-hotplug down we might zap a CPU that has a nonzero counter.
Due to the fuzzy nature of the global counter a CPU might hold a
nonzero ->nr_uninterruptible count even if it has no tasks anymore.
The solution is to 'migrate' the counter to another runqueue.
- we should not return negative counter values from the
nr_uninterruptible() function, since it accesses them without taking
the runqueue locks, so the total sum might be slightly above or
slightly below the real count.
i tested the attached patch on x86 SMP and it solves the load-average
problem. (I have tested CPU_HOTPLUG compilation but not functionality.)
I think this is a must-have for 2.6.10, because there are apps that go
berzerk if load-average is too high (e.g. sendmail).
Ingo
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -217,7 +217,16 @@ struct runqueue {
unsigned long cpu_load;
#endif
unsigned long long nr_switches;
- unsigned long expired_timestamp, nr_uninterruptible;
+
+ /*
+ * This is part of a global counter where only the total sum
+ * over all CPUs matters. A task can increase this counter on
+ * one CPU and if it got migrated afterwards it may decrease
+ * it on another CPU. Always updated under the runqueue lock:
+ */
+ unsigned long nr_uninterruptible;
+
+ unsigned long expired_timestamp;
unsigned long long timestamp_last_tick;
task_t *curr, *idle;
struct mm_struct *prev_mm;
@@ -762,8 +771,6 @@ static void activate_task(task_t *p, run
static void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
rq->nr_running--;
- if (p->state == TASK_UNINTERRUPTIBLE)
- rq->nr_uninterruptible++;
dequeue_task(p, p->array);
p->array = NULL;
}
@@ -983,14 +990,14 @@ static int try_to_wake_up(task_t * p, un
int cpu, this_cpu, success = 0;
unsigned long flags;
long old_state;
- runqueue_t *rq, *old_rq;
+ runqueue_t *rq;
#ifdef CONFIG_SMP
unsigned long load, this_load;
struct sched_domain *sd;
int new_cpu;
#endif
- old_rq = rq = task_rq_lock(p, &flags);
+ rq = task_rq_lock(p, &flags);
schedstat_inc(rq, ttwu_cnt);
old_state = p->state;
if (!(old_state & state))
@@ -1085,7 +1092,7 @@ out_set_cpu:
out_activate:
#endif /* CONFIG_SMP */
if (old_state == TASK_UNINTERRUPTIBLE) {
- old_rq->nr_uninterruptible--;
+ rq->nr_uninterruptible--;
/*
* Tasks on involuntary sleep don't earn
* sleep_avg beyond just interactive state.
@@ -1415,6 +1422,13 @@ unsigned long nr_uninterruptible(void)
for_each_cpu(i)
sum += cpu_rq(i)->nr_uninterruptible;
+ /*
+ * Since we read the counters lockless, it might be slightly
+ * inaccurate. Do not allow it to go below zero though:
+ */
+ if (unlikely((long)sum < 0))
+ sum = 0;
+
return sum;
}
@@ -2581,8 +2595,11 @@ need_resched_nonpreemptible:
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
- else
+ else {
+ if (prev->state == TASK_UNINTERRUPTIBLE)
+ rq->nr_uninterruptible++;
deactivate_task(prev, rq);
+ }
}
cpu = smp_processor_id();
@@ -3914,6 +3931,26 @@ static void move_task_off_dead_cpu(int d
__migrate_task(tsk, dead_cpu, dest_cpu);
}
+/*
+ * While a dead CPU has no uninterruptible tasks queued at this point,
+ * it might still have a nonzero ->nr_uninterruptible counter, because
+ * for performance reasons the counter is not stricly tracking tasks to
+ * their home CPUs. So we just add the counter to another CPU's counter,
+ * to keep the global sum constant after CPU-down:
+ */
+static void migrate_nr_uninterruptible(runqueue_t *rq_src)
+{
+ runqueue_t *rq_dest = cpu_rq(any_online_cpu(CPU_MASK_ALL));
+ unsigned long flags;
+
+ local_irq_save(flags);
+ double_rq_lock(rq_src, rq_dest);
+ rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
+ rq_src->nr_uninterruptible = 0;
+ double_rq_unlock(rq_src, rq_dest);
+ local_irq_restore(flags);
+}
+
/* Run through task list and migrate tasks from the dead cpu. */
static void migrate_live_tasks(int src_cpu)
{
@@ -4048,6 +4085,7 @@ static int migration_call(struct notifie
__setscheduler(rq->idle, SCHED_NORMAL, 0);
migrate_dead_tasks(cpu);
task_rq_unlock(rq, &flags);
+ migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
/* No need to migrate the tasks: it was best-effort if
next reply other threads:[~2004-11-16 13:01 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-11-16 11:32 Ingo Molnar [this message]
2004-11-16 22:19 ` [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs Peter Williams
2004-11-16 23:28 ` Ingo Molnar
2004-11-16 23:10 ` Linus Torvalds
2004-11-17 10:26 ` Ingo Molnar
2004-11-17 15:52 ` Linus Torvalds
2004-11-18 16:21 ` Ingo Molnar
2005-01-28 0:53 ` Paul Jackson
2005-01-28 1:06 ` Linus Torvalds
2005-01-28 2:14 ` Paul Jackson
2005-01-28 4:28 ` Ingo Molnar
2005-01-28 5:18 ` Paul Jackson
2005-01-28 6:01 ` Andi Kleen
2004-11-16 23:48 ` Peter Williams
2004-11-16 22:49 ` Nick Piggin
2004-11-16 23:03 ` Nick Piggin
2004-11-16 23:32 ` Peter Williams
2004-11-16 23:37 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20041116113209.GA1890@elte.hu \
--to=mingo@elte.hu \
--cc=akpm@osdl.org \
--cc=linux-kernel@vger.kernel.org \
--cc=torvalds@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.