public inbox for linux-kernel@vger.kernel.org
* [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
@ 2004-11-16 11:32 Ingo Molnar
  2004-11-16 22:19 ` Peter Williams
  2004-11-16 22:49 ` Nick Piggin
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-11-16 11:32 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton; +Cc: linux-kernel


PREEMPT_RT on SMP systems triggered weird (very high) load average
values rather easily, which turned out to be a mainline kernel
->nr_uninterruptible handling bug in try_to_wake_up().

the following code:

        if (old_state == TASK_UNINTERRUPTIBLE) {
                old_rq->nr_uninterruptible--;

can execute with old_rq being != rq, and hence updates
->nr_uninterruptible without the runqueue lock held. Given a
sufficiently concurrent preemption workload the count can get out of
whack and updates might get lost, permanently skewing the global count. 
Nothing except the load-average uses nr_uninterruptible() so this
condition can go unnoticed quite easily.

the fix is to update ->nr_uninterruptible always on the runqueue where
the task currently is. (this is also a tiny performance plus for
try_to_wake_up() as a stackslot gets freed up.)

while fixing this bug i found three other ->nr_uninterruptible related
bugs:

 - the update should be moved from deactivate_task() into schedule(),
   because e.g. setscheduler() does deactivate_task()+activate_task(),
   which may result in a -1 counter skew if setscheduler() is done
   asynchronously on a task that is still on the runqueue but has
   already set ->state to TASK_UNINTERRUPTIBLE. 

   sys_sched_setscheduler() is used rarely, but the bug is real. (The
   fix is also a small performance enhancement.)

   The rules for ->nr_uninterruptible updating are the following: it
   gets increased by schedule() only, when a task is moved off the
   runqueue and it has a state of TASK_UNINTERRUPTIBLE. It is decreased
   by try_to_wake_up(), by the first wakeup that materially changes the
   state from TASK_UNINTERRUPTIBLE back to TASK_RUNNING, and moves the
   task to the runqueue.

 - on CPU-hotplug down we might zap a CPU that has a nonzero counter. 
   Due to the fuzzy nature of the global counter a CPU might hold a 
   nonzero ->nr_uninterruptible count even if it has no tasks anymore. 
   The solution is to 'migrate' the counter to another runqueue.

 - we should not return negative counter values from the
   nr_uninterruptible() function: it reads the per-CPU counters without
   taking the runqueue locks, so the total sum might be slightly above
   or slightly below the real count.
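The counter protocol and the clamping can be sketched together in a
user-space toy model (illustrative only, not kernel code; all names are
made up). Each CPU has an unsigned long counter, only the global sum is
meaningful, and a per-CPU value may transiently wrap "below zero":

```c
#include <assert.h>

#define NCPUS 2

/* toy per-CPU counters; only the total over all CPUs matters */
static unsigned long toy_nr_uninterruptible_cpu[NCPUS];

/* schedule(): the task leaves the runqueue in TASK_UNINTERRUPTIBLE state */
static void toy_sleep(int cpu)
{
	toy_nr_uninterruptible_cpu[cpu]++;
}

/* try_to_wake_up(): decrement on whichever CPU the task is on *now* */
static void toy_wake(int cpu)
{
	toy_nr_uninterruptible_cpu[cpu]--;
}

/* nr_uninterruptible(): lockless sum; unsigned wraparound cancels out
 * across CPUs, and a skewed snapshot is clamped at zero. */
static unsigned long toy_nr_uninterruptible(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NCPUS; i++)
		sum += toy_nr_uninterruptible_cpu[i];

	if ((long)sum < 0)
		sum = 0;

	return sum;
}
```

If a task goes to sleep on CPU 0 but is woken on CPU 1, CPU 0 stays at
+1 and CPU 1 wraps to (unsigned long)-1, yet the modular sum is still
exact - which is exactly why only the total is meaningful and why the
cast-and-clamp at the end is sufficient.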

i tested the attached patch on x86 SMP and it solves the load-average
problem. (I have tested CPU_HOTPLUG compilation but not functionality.) 
I think this is a must-have for 2.6.10, because there are apps that go
berserk if the load-average is too high (e.g. sendmail).

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -217,7 +217,16 @@ struct runqueue {
 	unsigned long cpu_load;
 #endif
 	unsigned long long nr_switches;
-	unsigned long expired_timestamp, nr_uninterruptible;
+
+	/*
+	 * This is part of a global counter where only the total sum
+	 * over all CPUs matters. A task can increase this counter on
+	 * one CPU and if it got migrated afterwards it may decrease
+	 * it on another CPU. Always updated under the runqueue lock:
+	 */
+	unsigned long nr_uninterruptible;
+
+	unsigned long expired_timestamp;
 	unsigned long long timestamp_last_tick;
 	task_t *curr, *idle;
 	struct mm_struct *prev_mm;
@@ -762,8 +771,6 @@ static void activate_task(task_t *p, run
 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
 	rq->nr_running--;
-	if (p->state == TASK_UNINTERRUPTIBLE)
-		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
 	p->array = NULL;
 }
@@ -983,14 +990,14 @@ static int try_to_wake_up(task_t * p, un
 	int cpu, this_cpu, success = 0;
 	unsigned long flags;
 	long old_state;
-	runqueue_t *rq, *old_rq;
+	runqueue_t *rq;
 #ifdef CONFIG_SMP
 	unsigned long load, this_load;
 	struct sched_domain *sd;
 	int new_cpu;
 #endif
 
-	old_rq = rq = task_rq_lock(p, &flags);
+	rq = task_rq_lock(p, &flags);
 	schedstat_inc(rq, ttwu_cnt);
 	old_state = p->state;
 	if (!(old_state & state))
@@ -1085,7 +1092,7 @@ out_set_cpu:
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
-		old_rq->nr_uninterruptible--;
+		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
 		 * sleep_avg beyond just interactive state.
@@ -1415,6 +1422,13 @@ unsigned long nr_uninterruptible(void)
 	for_each_cpu(i)
 		sum += cpu_rq(i)->nr_uninterruptible;
 
+	/*
+	 * Since we read the counters lockless, it might be slightly
+	 * inaccurate. Do not allow it to go below zero though:
+	 */
+	if (unlikely((long)sum < 0))
+		sum = 0;
+
 	return sum;
 }
 
@@ -2581,8 +2595,11 @@ need_resched_nonpreemptible:
 		if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
 				unlikely(signal_pending(prev))))
 			prev->state = TASK_RUNNING;
-		else
+		else {
+			if (prev->state == TASK_UNINTERRUPTIBLE)
+				rq->nr_uninterruptible++;
 			deactivate_task(prev, rq);
+		}
 	}
 
 	cpu = smp_processor_id();
@@ -3914,6 +3931,26 @@ static void move_task_off_dead_cpu(int d
 	__migrate_task(tsk, dead_cpu, dest_cpu);
 }
 
+/*
+ * While a dead CPU has no uninterruptible tasks queued at this point,
+ * it might still have a nonzero ->nr_uninterruptible counter, because
+ * for performance reasons the counter is not strictly tracking tasks to
+ * their home CPUs. So we just add the counter to another CPU's counter,
+ * to keep the global sum constant after CPU-down:
+ */
+static void migrate_nr_uninterruptible(runqueue_t *rq_src)
+{
+	runqueue_t *rq_dest = cpu_rq(any_online_cpu(CPU_MASK_ALL));
+	unsigned long flags;
+
+	local_irq_save(flags);
+	double_rq_lock(rq_src, rq_dest);
+	rq_dest->nr_uninterruptible += rq_src->nr_uninterruptible;
+	rq_src->nr_uninterruptible = 0;
+	double_rq_unlock(rq_src, rq_dest);
+	local_irq_restore(flags);
+}
+
 /* Run through task list and migrate tasks from the dead cpu. */
 static void migrate_live_tasks(int src_cpu)
 {
@@ -4048,6 +4085,7 @@ static int migration_call(struct notifie
 		__setscheduler(rq->idle, SCHED_NORMAL, 0);
 		migrate_dead_tasks(cpu);
 		task_rq_unlock(rq, &flags);
+		migrate_nr_uninterruptible(rq);
 		BUG_ON(rq->nr_running != 0);
 
 		/* No need to migrate the tasks: it was best-effort if

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 11:32 [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs Ingo Molnar
@ 2004-11-16 22:19 ` Peter Williams
  2004-11-16 23:28   ` Ingo Molnar
  2004-11-16 22:49 ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Williams @ 2004-11-16 22:19 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> PREEMPT_RT on SMP systems triggered weird (very high) load average
> values rather easily, which turned out to be a mainline kernel
> ->nr_uninterruptible handling bug in try_to_wake_up().
> 
> the following code:
> 
>         if (old_state == TASK_UNINTERRUPTIBLE) {
>                 old_rq->nr_uninterruptible--;
> 
> potentially executes with old_rq potentially being != rq, and hence
> updating ->nr_uninterruptible without the lock held. Given a
> sufficiently concurrent preemption workload the count can get out of
> whack and updates might get lost, permanently skewing the global count. 
> Nothing except the load-average uses nr_uninterruptible() so this
> condition can go unnoticed quite easily.
> 
> the fix is to update ->nr_uninterruptible always on the runqueue where
> the task currently is. (this is also a tiny performance plus for
> try_to_wake_up() as a stackslot gets freed up.)

Couldn't this part of the problem have been solved by using an atomic_t 
for nr_uninterruptible as for nr_iowait?  It would also remove the need 
for migrate_nr_uninterruptible().

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 11:32 [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs Ingo Molnar
  2004-11-16 22:19 ` Peter Williams
@ 2004-11-16 22:49 ` Nick Piggin
  2004-11-16 23:03   ` Nick Piggin
  2004-11-16 23:32   ` Peter Williams
  1 sibling, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2004-11-16 22:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> PREEMPT_RT on SMP systems triggered weird (very high) load average
> values rather easily, which turned out to be a mainline kernel
> ->nr_uninterruptible handling bug in try_to_wake_up().
> 
> the following code:
> 
>         if (old_state == TASK_UNINTERRUPTIBLE) {
>                 old_rq->nr_uninterruptible--;
> 
> potentially executes with old_rq potentially being != rq, and hence
> updating ->nr_uninterruptible without the lock held. Given a
> sufficiently concurrent preemption workload the count can get out of
> whack and updates might get lost, permanently skewing the global count. 
> Nothing except the load-average uses nr_uninterruptible() so this
> condition can go unnoticed quite easily.
> 

Hi Ingo,
Yes you're right.

I have another idea. Revert back to the old code, then just transfer
the nr_uninterruptible count when migrating a task. That way, the
rq's nr_uninterruptible field always is a measure of the number of
uninterruptible tasks on it. What do you think?

Nick


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 22:49 ` Nick Piggin
@ 2004-11-16 23:03   ` Nick Piggin
  2004-11-16 23:32   ` Peter Williams
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2004-11-16 23:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

Nick Piggin wrote:
> Ingo Molnar wrote:
> 
>> PREEMPT_RT on SMP systems triggered weird (very high) load average
>> values rather easily, which turned out to be a mainline kernel
>> ->nr_uninterruptible handling bug in try_to_wake_up().
>>
>> the following code:
>>
>>         if (old_state == TASK_UNINTERRUPTIBLE) {
>>                 old_rq->nr_uninterruptible--;
>>
>> potentially executes with old_rq potentially being != rq, and hence
>> updating ->nr_uninterruptible without the lock held. Given a
>> sufficiently concurrent preemption workload the count can get out of
>> whack and updates might get lost, permanently skewing the global 
>> count. Nothing except the load-average uses nr_uninterruptible() so this
>> condition can go unnoticed quite easily.
>>
> 
> Hi Ingo,
> Yes you're right.
> 
> I have another idea. Revert back to the old code, then just transfer
> the nr_uninterruptible count when migrating a task. That way, the
> rq's nr_uninterruptible field always is a measure of the number of
> uninterruptible tasks on it. What do you think?

Something like this:

[-- Attachment #2: sched-nr_unint-fix.patch --]
[-- Type: text/x-patch, Size: 1499 bytes --]




---

 linux-2.6-npiggin/kernel/sched.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff -puN kernel/sched.c~sched-nr_unint-fix kernel/sched.c
--- linux-2.6/kernel/sched.c~sched-nr_unint-fix	2004-11-17 09:54:36.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c	2004-11-17 10:01:49.000000000 +1100
@@ -981,14 +981,14 @@ static int try_to_wake_up(task_t * p, un
 	int cpu, this_cpu, success = 0;
 	unsigned long flags;
 	long old_state;
-	runqueue_t *rq, *old_rq;
+	runqueue_t *rq;
 #ifdef CONFIG_SMP
 	unsigned long load, this_load;
 	struct sched_domain *sd;
 	int new_cpu;
 #endif
 
-	old_rq = rq = task_rq_lock(p, &flags);
+	rq = task_rq_lock(p, &flags);
 	schedstat_inc(rq, ttwu_cnt);
 	old_state = p->state;
 	if (!(old_state & state))
@@ -1083,7 +1083,7 @@ out_set_cpu:
 out_activate:
 #endif /* CONFIG_SMP */
 	if (old_state == TASK_UNINTERRUPTIBLE) {
-		old_rq->nr_uninterruptible--;
+		rq->nr_uninterruptible--;
 		/*
 		 * Tasks on involuntary sleep don't earn
 		 * sleep_avg beyond just interactive state.
@@ -1608,8 +1608,12 @@ void pull_task(runqueue_t *src_rq, prio_
 {
 	dequeue_task(p, src_array);
 	src_rq->nr_running--;
-	set_task_cpu(p, this_cpu);
 	this_rq->nr_running++;
+	if (p->state == TASK_UNINTERRUPTIBLE) {
+		src_rq->nr_uninterruptible--;
+		this_rq->nr_uninterruptible++;
+	}
+	set_task_cpu(p, this_cpu);
 	enqueue_task(p, this_array);
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;

_


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 23:28   ` Ingo Molnar
@ 2004-11-16 23:10     ` Linus Torvalds
  2004-11-17 10:26       ` Ingo Molnar
  2005-01-28  0:53       ` Paul Jackson
  2004-11-16 23:48     ` Peter Williams
  1 sibling, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2004-11-16 23:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Williams, Andrew Morton, linux-kernel



On Wed, 17 Nov 2004, Ingo Molnar wrote:
> 
> maybe, but why? Atomic ops are still a tad slower than normal ops

A "tad" slower? 

An atomic op is pretty much as expensive as a spinlock/unlock pair on x86.  
Not _quite_, but it's pretty close.

So yes, avoiding atomic ops is good if you can do it without any other 
real downsides (not having had time to check the patch yet, I won't go 
into that ;)

		Linus


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 22:19 ` Peter Williams
@ 2004-11-16 23:28   ` Ingo Molnar
  2004-11-16 23:10     ` Linus Torvalds
  2004-11-16 23:48     ` Peter Williams
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-11-16 23:28 UTC (permalink / raw)
  To: Peter Williams; +Cc: Linus Torvalds, Andrew Morton, linux-kernel


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> Couldn't this part of the problem have been solved by using an
> atomic_t for nr_uninterruptible as for nr_iowait?  It would also
> remove the need for migrate_nr_uninterruptible().

maybe, but why? Atomic ops are still a tad slower than normal ops and
every cycle counts in the wakeup path. Also, the solution is still not
correct, because it does not take other migration paths into account, so
we could end up with a sleeping task showing up on another CPU just as
well. The most robust solution is to simply not care about migration and
to care about the total count only.

	Ingo


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 22:49 ` Nick Piggin
  2004-11-16 23:03   ` Nick Piggin
@ 2004-11-16 23:32   ` Peter Williams
  2004-11-16 23:37     ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Williams @ 2004-11-16 23:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel

Nick Piggin wrote:
> Ingo Molnar wrote:
> 
>> PREEMPT_RT on SMP systems triggered weird (very high) load average
>> values rather easily, which turned out to be a mainline kernel
>> ->nr_uninterruptible handling bug in try_to_wake_up().
>>
>> the following code:
>>
>>         if (old_state == TASK_UNINTERRUPTIBLE) {
>>                 old_rq->nr_uninterruptible--;
>>
>> potentially executes with old_rq potentially being != rq, and hence
>> updating ->nr_uninterruptible without the lock held. Given a
>> sufficiently concurrent preemption workload the count can get out of
>> whack and updates might get lost, permanently skewing the global 
>> count. Nothing except the load-average uses nr_uninterruptible() so this
>> condition can go unnoticed quite easily.
>>
> 
> Hi Ingo,
> Yes you're right.
> 
> I have another idea. Revert back to the old code, then just transfer
> the nr_uninterruptible count when migrating a task. That way, the

I presume that you mean adjust rather than transfer.

> rq's nr_uninterruptible field always is a measure of the number of
> uninterruptible tasks on it. What do you think?

To make this work you need to do the adjustment every where that a task 
changes CPU while in the UNINTERRUPTIBLE state.  Are both run queue 
locks always held in these circumstances?  I don't think that they are 
in try_to_wake_up() but it may be possible to work around that.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 23:32   ` Peter Williams
@ 2004-11-16 23:37     ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2004-11-16 23:37 UTC (permalink / raw)
  To: Peter Williams; +Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel

Peter Williams wrote:
> Nick Piggin wrote:
> 
>> Ingo Molnar wrote:
>>
>>> PREEMPT_RT on SMP systems triggered weird (very high) load average
>>> values rather easily, which turned out to be a mainline kernel
>>> ->nr_uninterruptible handling bug in try_to_wake_up().
>>>
>>> the following code:
>>>
>>>         if (old_state == TASK_UNINTERRUPTIBLE) {
>>>                 old_rq->nr_uninterruptible--;
>>>
>>> potentially executes with old_rq potentially being != rq, and hence
>>> updating ->nr_uninterruptible without the lock held. Given a
>>> sufficiently concurrent preemption workload the count can get out of
>>> whack and updates might get lost, permanently skewing the global 
>>> count. Nothing except the load-average uses nr_uninterruptible() so this
>>> condition can go unnoticed quite easily.
>>>
>>
>> Hi Ingo,
>> Yes you're right.
>>
>> I have another idea. Revert back to the old code, then just transfer
>> the nr_uninterruptible count when migrating a task. That way, the
> 
> 
> I presume that you mean adjust rather than transfer.
> 
>> rq's nr_uninterruptible field always is a measure of the number of
>> uninterruptible tasks on it. What do you think?
> 
> 
> To make this work you need to do the adjustment every where that a task 
> changes CPU while in the UNINTERRUPTIBLE state.  Are both run queue 
> locks always held in these circumstances?  I don't think that they are 
> in try_to_wake_up() but it may be possible to work around that.
> 

Yeah this won't actually work of course, because a task can set itself
UNINTERRUPTIBLE and subsequently get preempted then moved CPUs before
calling schedule() itself.

And yeah I missed the original point of your fix which was due to the
task moving runqueues in try_to_wake_up. Sorry, forget about the patch :P


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 23:28   ` Ingo Molnar
  2004-11-16 23:10     ` Linus Torvalds
@ 2004-11-16 23:48     ` Peter Williams
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Williams @ 2004-11-16 23:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Andrew Morton, linux-kernel

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
> 
>>Couldn't this part of the problem have been solved by using an
>>atomic_t for nr_uninterruptible as for nr_iowait?  It would also
>>remove the need for migrate_nr_uninterruptible().
> 
> 
> maybe, but why? Atomic ops are still a tad slower than normal ops and
> every cycle counts in the wakeup path. Also, the solution is still not
> correct, because it does not take other migration paths into account, so

Oops.

> we could end up with a sleeping task showing up on another CPU just as
> well. The most robust solution is to simply not care about migration and
> to care about the total count only.

Yes and, with the new comment above its declaration, anybody who (in the 
future) wants to use the per-CPU data will be aware that they need to 
modify the code if they need it to be always accurate.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 23:10     ` Linus Torvalds
@ 2004-11-17 10:26       ` Ingo Molnar
  2004-11-17 15:52         ` Linus Torvalds
  2005-01-28  0:53       ` Paul Jackson
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2004-11-17 10:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Williams, Andrew Morton, linux-kernel


* Linus Torvalds <torvalds@osdl.org> wrote:

> On Wed, 17 Nov 2004, Ingo Molnar wrote:
> > 
> > maybe, but why? Atomic ops are still a tad slower than normal ops
> 
> A "tad" slower? 
> 
> An atomic op is pretty much as expensive as a spinlock/unlock pair on
> x86.  Not _quite_, but it's pretty close.

it really depends on the layout of the data structure. The main cost
that we typically see combined with atomic ops and spinlocks is when the
target of the atomic op is a global/shared variable, in which case the
cacheline bounce cost is prohibitive and dwarfs any micro-cost.

But in this particular rq->nr_uninterruptible case we have per-CPU
variables, so the cost of the atomic op is, in theory, quite close to
the cost of a normal op.

In practice this means it's 10 cycles more expensive on P3-style CPUs. 
(I think on P4 style CPUs it should be much closer to 0, but i haven't
been able to reliably time it there - cycle measurements show a 76
cycles cost which is way out of line.)

On a UP Athlon64 the cost of a LOCK-ed op is exactly the same as without
the LOCK prefix - but here the CPU can take shortcuts because in theory
it can skip any cache coherency considerations altogether. (albeit the
atomic op should still have relevance to DMA-able data, perhaps UP-mode
CPUs ignore that case.)  Also, i think it ought to be near zero-cost on
a Crusoe-style CPU?

	Ingo


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-17 10:26       ` Ingo Molnar
@ 2004-11-17 15:52         ` Linus Torvalds
  2004-11-18 16:21           ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2004-11-17 15:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Williams, Andrew Morton, linux-kernel



On Wed, 17 Nov 2004, Ingo Molnar wrote:
> 
> In practice this means it's 10 cycles more expensive on P3-style CPUs. 
> (I think on P4 style CPUs it should be much closer to 0, but i havent
> been able to reliably time it there - cycle measurements show a 76
> cycles cost which is way out of line.)

No, it's not way out of line. It _is_ that expensive. It seems to totally 
synchronize the trace cache or something.

It shows up very clearly in profiles. Spinlocks and atomic operations are 
very expensive even on UP, where there are zero cache bounce issues. I'm 
not kidding you - compile an SMP kernel, run it on a UP P4, and check it 
out with oprofile.

I did that just a few weeks ago out of idle curiosity about some lmbench 
numbers. 

And it's not just P4's either. It shows up on at least P-M's too, even
though that's a PPro-based core like a PIII. It's at least 22 cycles
according to my tests (certainly not 10), but it seems to have a bigger
impact than that, likely because it also ends up serializing the pipeline
around it (ie it makes the instructions _around_ it slower too, something
you don't end up seeing if you just do rdtsc which _also_ serializes).

Try lmbench some time. Just pick a UP machine, compile a UP kernel on it,
run lmbench, then compile a SMP kernel, and run it again. There's
literally a 10-20% performance hit on a lot of things. Just from a
_couple_ of locks on the paths. 

Sure, some of it is actual different paths, like the VM teardown, which 
on SMP is just fundamentally more expensive because we have to be more 
careful. So mmap latency (delayed page frees) and file delete (delayed RCU 
delete of dentry) end up having differences of 30-40%. But the 10-20% 
differences are things where the _only_ difference really is just the 
totally uncontended locking.

That's on a P-M. On a P4 it is _worse_, I think.

		Linus


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-17 15:52         ` Linus Torvalds
@ 2004-11-18 16:21           ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2004-11-18 16:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Peter Williams, Andrew Morton, linux-kernel


* Linus Torvalds <torvalds@osdl.org> wrote:

> And it's not just P4's either. It shows up on at least P-M's too, even
> though that's a PPro-based core like a PIII. It's at least 22 cycles
> according to my tests (certainly not 10), but it seems to have a
> bigger impact than that, likely because it also ends up serializing
> the pipeline around it (ie it makes the instructions _around_ it
> slower too, something you don't end up seeing if you just do rdtsc
> which _also_ serializes).
> 
> Try lmbench some time. Just pick a UP machine, compile a UP kernel on
> it, run lmbench, then compile a SMP kernel, and run it again. There's
> literally a 10-20% performance hit on a lot of things. Just from a
> _couple_ of locks on the paths. 
> 
> Sure, some of it is actual different paths, like the VM teardown,
> which on SMP is just fundamentally more expensive because we have to
> be more careful. So mmap latency (delayed page frees) and file delete
> (delayed RCU delete of dentry) ends up having differences of 30-40%.
> But the 10-20% differences are things where the _only_ difference
> really is just the totally uncontended locking.
> 
> That's on a P-M. On a P4 it is _worse_, I think.

oh well. The Athlon64 really shines doing atomic ops though, and this
fooled me into thinking that the LOCK prefix has been properly taken
care of forever. What the hell is Intel doing ... atomic ops are so
fundamental to spinlocks. Thank God we can do the movb $1,%0 trick for
spin-unlock.

	Ingo


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2004-11-16 23:10     ` Linus Torvalds
  2004-11-17 10:26       ` Ingo Molnar
@ 2005-01-28  0:53       ` Paul Jackson
  2005-01-28  1:06         ` Linus Torvalds
  2005-01-28  4:28         ` Ingo Molnar
  1 sibling, 2 replies; 18+ messages in thread
From: Paul Jackson @ 2005-01-28  0:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pwil3058, akpm, linux-kernel

A long time ago, Linus wrote:
> An atomic op is pretty much as expensive as a spinlock/unlock pair on x86.  
> Not _quite_, but it's pretty close.

Are both read and modify atomic ops relatively expensive on some CPUs,
or is it just modify atomic ops?

(Ignoring for this question the possibility that a mix of read and
modify ops could heat up a cache line on multiprocessor systems, and
focusing for the moment just on the CPU internals ...)

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2005-01-28  0:53       ` Paul Jackson
@ 2005-01-28  1:06         ` Linus Torvalds
  2005-01-28  2:14           ` Paul Jackson
  2005-01-28  4:28         ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2005-01-28  1:06 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, pwil3058, akpm, linux-kernel



On Thu, 27 Jan 2005, Paul Jackson wrote:
>
> A long time ago, Linus wrote:
> > An atomic op is pretty much as expensive as a spinlock/unlock pair on x86.  
> > Not _quite_, but it's pretty close.
> 
> Are both read and modify atomic ops relatively expensive on some CPUs,
> or is it just modify atomic ops?

Linux pretty much assumes that any naturally aligned read or write of the
basic word-size (actually "int"/"long"/pointer) are always atomic.  
That's true on all architectures that do SMP today. In fact, it's hard
seeing it ever be not true, although specialized architectures with
out-of-band locking bits might have different assumptions (and return
"invalid" for a memory location that is "in flux").

So if you just do a regular read or write, that's basically always cheap.

HOWEVER. In the absence of locking, you usually can't just do a regular
read or write. You'd need memory barriers etc, which quite easily bring
the cost up to something similar to locking (modulo cacheline bounces,
which would happen on the actual access, not the barrier).

Also, "bigger" accesses are not atomic, ie a 64-bit read on a 32-bit 
platform usually requires a lock (or equivalent data structures, like 
sequence numbers with memory barriers).

The _real_ cost of not locking also often ends up being the inevitable 
bugs. Doing clever things with memory barriers is almost always a bug 
waiting to happen. It's just _really_ hard to wrap your head around all 
the things that can happen on ten different architectures with different
memory ordering, and a single missing barrier.

		Linus


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2005-01-28  1:06         ` Linus Torvalds
@ 2005-01-28  2:14           ` Paul Jackson
  0 siblings, 0 replies; 18+ messages in thread
From: Paul Jackson @ 2005-01-28  2:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pwil3058, akpm, linux-kernel

True - thanks - including the part about the cost of locking bugs.

My question was poorly phrased - the code speaks the answer to the
real question I had:

  $ grep define.atomic_ include/asm-ia64/atomic.h | head -2
  #define atomic_read(v)          ((v)->counter)
  #define atomic_set(v,i)         (((v)->counter) = (i))

An atomic_read() of a one word counter on ia64 is just a load, and an
atomic_set() is just a store.  This is unlike the more difficult
atomic_inc, atomic_dec, atomic_add, atomic_mutilate, ... calls that
require something fancier, and I presume more painful for the CPU's
innards.

Good.  Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401


* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2005-01-28  0:53       ` Paul Jackson
  2005-01-28  1:06         ` Linus Torvalds
@ 2005-01-28  4:28         ` Ingo Molnar
  2005-01-28  5:18           ` Paul Jackson
  2005-01-28  6:01           ` Andi Kleen
  1 sibling, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2005-01-28  4:28 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Linus Torvalds, pwil3058, akpm, linux-kernel


* Paul Jackson <pj@sgi.com> wrote:

> A long time ago, Linus wrote:
> > An atomic op is pretty much as expensive as a spinlock/unlock pair on x86.  
> > Not _quite_, but it's pretty close.
> 
> Are both read and modify atomic ops relatively expensive on some CPUs,
> or is it just modify atomic ops?
> 
> (Ignoring for this question the possibility that a mix of read and
> modify ops could heat up a cache line on multiprocessor systems, and
> focusing for the moment just on the CPU internals ...)

if by 'some CPUs' you mean x86, then it's the LOCK-prefixed ops that are
expensive. I.e. all the LOCK-prefixed RMW variants of instructions:

 atomic.h:               LOCK "addl %1,%0"
 atomic.h:               LOCK "subl %1,%0"
 atomic.h:               LOCK "subl %2,%0; sete %1"
 atomic.h:               LOCK "incl %0"
 atomic.h:               LOCK "decl %0"
 atomic.h:               LOCK "decl %0; sete %1"
 atomic.h:               LOCK "incl %0; sete %1"
 atomic.h:               LOCK "addl %2,%0; sets %1"
 atomic.h:               LOCK "xaddl %0, %1;"
 atomic.h:__asm__ __volatile__(LOCK "andl %0,%1" \
 atomic.h:__asm__ __volatile__(LOCK "orl %0,%1" \

pure reads/writes are architecturally guaranteed to be atomic (so
atomic.h uses them, not some fancy instruction) and they are (/better
be) fast.

interestingly, the x86 spinlock implementation uses a LOCK-ed
instruction only on acquire - it uses a simple atomic write (and
implicit barrier assumption) on the way out:

 #define spin_unlock_string \
         "movb $1,%0" \
                 :"=m" (lock->slock) : : "memory"

no LOCK prefix. Due to this spinlocks can sometimes be _cheaper_ than
doing the same via atomic inc/dec.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2005-01-28  4:28         ` Ingo Molnar
@ 2005-01-28  5:18           ` Paul Jackson
  2005-01-28  6:01           ` Andi Kleen
  1 sibling, 0 replies; 18+ messages in thread
From: Paul Jackson @ 2005-01-28  5:18 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: torvalds, pwil3058, akpm, linux-kernel

Ingo wrote:
> if by 'some CPUs' you mean x86 then it's the LOCK prefixed ops ...

Yup - that's the one.  Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs
  2005-01-28  4:28         ` Ingo Molnar
  2005-01-28  5:18           ` Paul Jackson
@ 2005-01-28  6:01           ` Andi Kleen
  1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2005-01-28  6:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, pwil3058, akpm, linux-kernel

Ingo Molnar <mingo@elte.hu> writes:
>
> interestingly, the x86 spinlock implementation uses a LOCK-ed
> instruction only on acquire - it uses a simple atomic write (and
> implicit barrier assumption) on the way out:
>
>  #define spin_unlock_string \
>          "movb $1,%0" \
>                  :"=m" (lock->slock) : : "memory"
>
> no LOCK prefix. Due to this spinlocks can sometimes be _cheaper_ than
> doing the same via atomic inc/dec.

Unfortunately kernels are often compiled for PPro, and on those a LOCK
prefix is used anyway to work around some bugs in early steppings. This
makes spinlocks considerably slower (there are some lock-intensive,
not-even-so-micro benchmarks that show the difference clearly).

The unlock path then uses

#define spin_unlock_string \
        "xchgb %b0, %1" \
                :"=q" (oldval), "=m" (lock->lock) \
                :"0" (oldval) : "memory"


which has an implicit LOCK and is equally slow.

I looked some time ago at patching it at runtime using alternative(),
but it would have bloated the patch tables a lot. Another way would be a
CONFIG_PPRO_BUT_UP_ON_BUGGY_ONES option, but it is hard to pin down the
exact steppings that have the problem.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-01-28  6:02 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-11-16 11:32 [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling bugs Ingo Molnar
2004-11-16 22:19 ` Peter Williams
2004-11-16 23:28   ` Ingo Molnar
2004-11-16 23:10     ` Linus Torvalds
2004-11-17 10:26       ` Ingo Molnar
2004-11-17 15:52         ` Linus Torvalds
2004-11-18 16:21           ` Ingo Molnar
2005-01-28  0:53       ` Paul Jackson
2005-01-28  1:06         ` Linus Torvalds
2005-01-28  2:14           ` Paul Jackson
2005-01-28  4:28         ` Ingo Molnar
2005-01-28  5:18           ` Paul Jackson
2005-01-28  6:01           ` Andi Kleen
2004-11-16 23:48     ` Peter Williams
2004-11-16 22:49 ` Nick Piggin
2004-11-16 23:03   ` Nick Piggin
2004-11-16 23:32   ` Peter Williams
2004-11-16 23:37     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox