* [PATCH RFC] smt nice introduces significant lock contention
@ 2006-06-01 22:55 Chris Mason
2006-06-01 23:57 ` Chen, Kenneth W
0 siblings, 1 reply; 41+ messages in thread
From: Chris Mason @ 2006-06-01 22:55 UTC (permalink / raw)
To: Con Kolivas, linux-kernel
Hello everyone,
Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5. We tracked down one of the regressions to lock contention in
schedule-heavy workloads (~70,000 context switches per second).
kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks. The patch below is more of
a discussion point than a suggested fix (although it does reduce lock
contention significantly). The dependent_sleeper code looks very expensive
to me, especially its use of spinlocks to bounce control between two
siblings of the same physical cpu.
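
For readers skimming the diff, the idiom the patch introduces is: trylock every
sibling runqueue, back out completely if any of them is busy, and simply skip
the smt nice pass for that call to schedule(). A rough standalone sketch of the
same idiom (illustrative only, with a made-up name; the actual hunk below adds
trylock_smt_cpus()):

/*
 * Either every runqueue in sibling_map is locked on return (1), or none
 * of them are held (0) and the caller skips the smt nice logic this time.
 */
static int trylock_sibling_runqueues(cpumask_t sibling_map)
{
	cpumask_t locked = CPU_MASK_NONE;
	int i;

	for_each_cpu_mask(i, sibling_map) {
		if (!spin_trylock(&cpu_rq(i)->lock))
			goto backout;
		cpu_set(i, locked);
	}
	return 1;

backout:
	for_each_cpu_mask(i, locked)
		spin_unlock(&cpu_rq(i)->lock);
	return 0;
}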
--- a/kernel/sched.c Thu May 18 15:55:43 2006 -0400
+++ b/kernel/sched.c Tue May 23 21:13:52 2006 -0400
@@ -2630,6 +2630,27 @@ static inline void wakeup_busy_runqueue(
resched_task(rq->idle);
}
+static int trylock_smt_cpus(cpumask_t sibling_map)
+{
+ int ret = 1;
+ int numlocked = 0;
+ int i;
+ for_each_cpu_mask(i, sibling_map) {
+ ret = spin_trylock(&cpu_rq(i)->lock);
+ if (!ret)
+ break;
+ numlocked++;
+ }
+ if (ret || !numlocked)
+ return ret;
+ for_each_cpu_mask(i, sibling_map) {
+ spin_unlock(&cpu_rq(i)->lock);
+ if (--numlocked == 0)
+ break;
+ }
+ return 0;
+}
+
static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
{
struct sched_domain *tmp, *sd = NULL;
@@ -2643,22 +2664,16 @@ static void wake_sleeping_dependent(int
if (!sd)
return;
- /*
- * Unlock the current runqueue because we have to lock in
- * CPU order to avoid deadlocks. Caller knows that we might
- * unlock. We keep IRQs disabled.
- */
- spin_unlock(&this_rq->lock);
-
sibling_map = sd->span;
- for_each_cpu_mask(i, sibling_map)
- spin_lock(&cpu_rq(i)->lock);
/*
* We clear this CPU from the mask. This both simplifies the
* inner loop and keps this_rq locked when we exit:
*/
cpu_clear(this_cpu, sibling_map);
+
+ if (!trylock_smt_cpus(sibling_map))
+ return;
for_each_cpu_mask(i, sibling_map) {
runqueue_t *smt_rq = cpu_rq(i);
@@ -2703,11 +2718,10 @@ static int dependent_sleeper(int this_cp
* The same locking rules and details apply as for
* wake_sleeping_dependent():
*/
- spin_unlock(&this_rq->lock);
sibling_map = sd->span;
- for_each_cpu_mask(i, sibling_map)
- spin_lock(&cpu_rq(i)->lock);
cpu_clear(this_cpu, sibling_map);
+ if (!trylock_smt_cpus(sibling_map))
+ return 0;
/*
* Establish next task to be run - it might have gone away because
^ permalink raw reply [flat|nested] 41+ messages in thread* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-01 22:55 [PATCH RFC] smt nice introduces significant lock contention Chris Mason @ 2006-06-01 23:57 ` Chen, Kenneth W 2006-06-02 1:59 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-01 23:57 UTC (permalink / raw) To: 'Chris Mason', Con Kolivas, linux-kernel Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > Hello everyone, > > Recent benchmarks showed some performance regressions between 2.6.16 and > 2.6.5. We tracked down one of the regressions to lock contention in schedule > heavy workloads (~70,000 context switches per second) > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > contention, hammering on the run queue locks. The patch below is more of > a discussion point than a suggested fix (although it does reduce lock > contention significantly). The dependent_sleeper code looks very expensive > to me, especially for using a spinlock to bounce control between two different > siblings in the same cpu. Yeah, this also sort of echo a recent discovery on one of our benchmarks that schedule() is red hot in the kernel. I was just scratching my head not sure what's going on. This dependent_sleeper could be the culprit considering it is called almost at every context switch. I don't think we are hitting on lock contention, but by the large amount of code it executes. It really needs to be cut down .... - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-01 23:57 ` Chen, Kenneth W @ 2006-06-02 1:59 ` Con Kolivas 2006-06-02 2:28 ` Con Kolivas 2006-06-02 2:35 ` Chen, Kenneth W 0 siblings, 2 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 1:59 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > Hello everyone, > > > > Recent benchmarks showed some performance regressions between 2.6.16 and > > 2.6.5. We tracked down one of the regressions to lock contention in > > schedule heavy workloads (~70,000 context switches per second) > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > contention, hammering on the run queue locks. The patch below is more of > > a discussion point than a suggested fix (although it does reduce lock > > contention significantly). The dependent_sleeper code looks very > > expensive to me, especially for using a spinlock to bounce control > > between two different siblings in the same cpu. > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > that schedule() is red hot in the kernel. I was just scratching my head > not sure what's going on. This dependent_sleeper could be the culprit > considering it is called almost at every context switch. I don't think > we are hitting on lock contention, but by the large amount of code it > executes. It really needs to be cut down .... Thanks for the suggestion. How about something like this which takes your idea and further expands on it. Compiled, boot and runtime tests ok. --- It is not critical to functioning that dependent_sleeper() succeeds every time. We can significantly reduce the locking overhead and contention of dependent_sleeper by only doing trylock on the smt sibling runqueues. As we're only doing trylock it means we do not need to observe the normal locking order and we can get away without unlocking this_rq in schedule(). This provides us with an opportunity to simplify the code further. Signed-off-by: Con Kolivas <kernel@kolivas.org> --- kernel/sched.c | 58 +++++++++++++++++++++++++++++++++++---------------------- 1 files changed, 36 insertions(+), 22 deletions(-) Index: linux-2.6.17-rc5/kernel/sched.c =================================================================== --- linux-2.6.17-rc5.orig/kernel/sched.c 2006-05-25 21:33:19.000000000 +1000 +++ linux-2.6.17-rc5/kernel/sched.c 2006-06-02 11:51:13.000000000 +1000 @@ -1687,7 +1687,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -2731,6 +2731,32 @@ static void wake_sleeping_dependent(int } /* + * Try to lock all the siblings of a cpu that is already locked. As we're + * only doing trylock the order is not important. + */ +static int trylock_smt_siblings(cpumask_t sibling_map) +{ + cpumask_t locked_siblings; + int i; + + cpus_clear(locked_siblings); + for_each_cpu_mask(i, sibling_map) { + if (!spin_trylock(&cpu_rq(i)->lock)) + break; + cpu_set(i, locked_siblings); + + } + + /* Did we lock all the siblings? 
*/ + if (cpus_equal(sibling_map, locked_siblings)) + return 1; + + for_each_cpu_mask(i, locked_siblings) + spin_unlock(&cpu_rq(i)->lock); + return 0; +} + +/* * number of 'lost' timeslices this task wont be able to fully * utilize, if another task runs on a sibling. This models the * slowdown effect of other tasks running on siblings: @@ -2740,7 +2766,11 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and to have to drop this_rq's runlock we only + * trylock the sibling runqueues and give up if we can't acquire the locks. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; @@ -2755,22 +2785,14 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); + if (!trylock_smt_siblings(sibling_map)) + return 0; /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: + * Establish next task to be run. */ - if (!this_rq->nr_running) - goto out_unlock; array = this_rq->active; if (!array->nr_active) array = this_rq->expired; @@ -2835,7 +2857,7 @@ check_smt_task: wakeup_busy_runqueue(smt_rq); } } -out_unlock: + for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); return ret; @@ -2967,7 +2989,6 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; @@ -2986,13 +3007,6 @@ go_idle: next = rq->idle; goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 1:59 ` Con Kolivas @ 2006-06-02 2:28 ` Con Kolivas 2006-06-02 3:55 ` Con Kolivas 2006-06-02 2:35 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 2:28 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 11:59, Con Kolivas wrote: > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > Hello everyone, > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 > > > and 2.6.5. We tracked down one of the regressions to lock contention > > > in schedule heavy workloads (~70,000 context switches per second) > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > > contention, hammering on the run queue locks. The patch below is more > > > of a discussion point than a suggested fix (although it does reduce > > > lock contention significantly). The dependent_sleeper code looks very > > > expensive to me, especially for using a spinlock to bounce control > > > between two different siblings in the same cpu. > > > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > > that schedule() is red hot in the kernel. I was just scratching my head > > not sure what's going on. This dependent_sleeper could be the culprit > > considering it is called almost at every context switch. I don't think > > we are hitting on lock contention, but by the large amount of code it > > executes. It really needs to be cut down .... > > Thanks for the suggestion. How about something like this which takes your > idea and further expands on it. Compiled, boot and runtime tests ok. > > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. Actually looking even further, we only introduced the extra lookup of the next task when we started unlocking the runqueue in schedule(). Since we can get by without locking this_rq in schedule with this approach we can simplify dependent_sleeper even further by doing the dependent sleeper check after we have discovered what next is in schedule and avoid looking it up twice. I'll hack something up to do that soon. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 2:28 ` Con Kolivas @ 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:55 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 12:28, Con Kolivas wrote: > Actually looking even further, we only introduced the extra lookup of the > next task when we started unlocking the runqueue in schedule(). Since we > can get by without locking this_rq in schedule with this approach we can > simplify dependent_sleeper even further by doing the dependent sleeper > check after we have discovered what next is in schedule and avoid looking > it up twice. I'll hack something up to do that soon. Something like this (sorry I couldn't help but keep hacking on it). --- It is not critical to functioning that dependent_sleeper() succeeds every time. We can significantly reduce the locking overhead and contention of dependent_sleeper by only doing trylock on the smt sibling runqueues. As we're only doing trylock it means we do not need to observe the normal locking order and we can get away without unlocking this_rq in schedule(). This provides us with an opportunity to simplify the code further. Signed-off-by: Con Kolivas <kernel@kolivas.org> --- kernel/sched.c | 58 +++++++++++++++++---------------------------------------- 1 files changed, 18 insertions(+), 40 deletions(-) Index: linux-2.6.17-rc5/kernel/sched.c =================================================================== --- linux-2.6.17-rc5.orig/kernel/sched.c 2006-05-25 21:33:19.000000000 +1000 +++ linux-2.6.17-rc5/kernel/sched.c 2006-06-02 13:54:32.000000000 +1000 @@ -1687,7 +1687,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -2740,13 +2740,18 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. 
+ */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) if (tmp->flags & SD_SHARE_CPUPOWER) @@ -2755,29 +2760,12 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); - - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); - - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + for_each_cpu_mask(i, sibling_map) { + if (unlikely(!spin_trylock(&cpu_rq(i)->lock))) + cpu_clear(i, sibling_map); + } for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); @@ -2835,7 +2823,7 @@ check_smt_task: wakeup_busy_runqueue(smt_rq); } } -out_unlock: + for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); return ret; @@ -2845,7 +2833,8 @@ static inline void wake_sleeping_depende { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2967,7 +2956,6 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; @@ -2981,18 +2969,6 @@ go_idle: if (!rq->nr_running) goto switch_tasks; } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; - goto switch_tasks; - } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3030,6 +3006,8 @@ go_idle: } } next->sleep_type = SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
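
A side note on the numbers these checks are built on: smt_slice(), visible in
the hunks above, scales a task's timeslice by (100 - per_cpu_gain)/100, i.e. by
the share of throughput a task is assumed to keep while its sibling is busy.
Assuming the usual SMT sched-domain default of per_cpu_gain = 25 (an assumption
about the configuration, not something stated in this thread), the figures work
out roughly as:

/*
 *   smt_slice(p, sd) = p->time_slice * (100 - sd->per_cpu_gain) / 100
 *
 *   per_cpu_gain = 25, nice   0 task (100ms slice):  smt_slice =  75ms
 *   per_cpu_gain = 25, nice -20 task (800ms slice):  smt_slice = 600ms
 *
 * dependent_sleeper() compares smt_slice() of one task against the full
 * task_timeslice() of the task on the sibling to decide whether letting
 * both run at once is worth it.
 */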
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas @ 2006-06-02 4:18 ` Nick Piggin 2006-06-02 6:08 ` Con Kolivas 2006-06-02 8:24 ` Chen, Kenneth W 2006-06-02 8:31 ` Chen, Kenneth W 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 4:18 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar Con Kolivas wrote: > On Friday 02 June 2006 12:28, Con Kolivas wrote: > >>Actually looking even further, we only introduced the extra lookup of the >>next task when we started unlocking the runqueue in schedule(). Since we >>can get by without locking this_rq in schedule with this approach we can >>simplify dependent_sleeper even further by doing the dependent sleeper >>check after we have discovered what next is in schedule and avoid looking >>it up twice. I'll hack something up to do that soon. > > > Something like this (sorry I couldn't help but keep hacking on it). Looking pretty good. Nice to acknowledge Chris's idea for trylocks in your changelog when you submit a final patch. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 4:18 ` Nick Piggin @ 2006-06-02 6:08 ` Con Kolivas 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:24 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 6:08 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 14:18, Nick Piggin wrote: > Con Kolivas wrote: > > On Friday 02 June 2006 12:28, Con Kolivas wrote: > >>Actually looking even further, we only introduced the extra lookup of the > >>next task when we started unlocking the runqueue in schedule(). Since we > >>can get by without locking this_rq in schedule with this approach we can > >>simplify dependent_sleeper even further by doing the dependent sleeper > >>check after we have discovered what next is in schedule and avoid looking > >>it up twice. I'll hack something up to do that soon. > > > > Something like this (sorry I couldn't help but keep hacking on it). > > Looking pretty good. Thanks > Nice to acknowledge Chris's idea for > trylocks in your changelog when you submit a final patch. I absolutely would and I would ask for him to sign off on it as well, once we agreed on a final form. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 6:08 ` Con Kolivas @ 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:17 ` Con Kolivas 2006-06-02 8:38 ` Chen, Kenneth W 0 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 7:53 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 464 bytes --] Con Kolivas wrote: >>Nice to acknowledge Chris's idea for >>trylocks in your changelog when you submit a final patch. > > > I absolutely would and I would ask for him to sign off on it as well, once we > agreed on a final form. No worries, I thought you would ;) This is a small micro-optimisation / cleanup we can do after smtnice gets converted to use trylocks. Might result in a little less cacheline footprint in some cases. -- SUSE Labs, Novell Inc. [-- Attachment #2: sched-lock-opt.patch --] [-- Type: text/plain, Size: 1157 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 17:46:23.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 17:48:50.000000000 +1000 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1700,7 +1699,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1736,7 +1735,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -6104,7 +6103,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 7:53 ` Nick Piggin @ 2006-06-02 8:17 ` Con Kolivas 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:38 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 8:17 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 17:53, Nick Piggin wrote: > This is a small micro-optimisation / cleanup we can do after > smtnice gets converted to use trylocks. Might result in a little > less cacheline footprint in some cases. It's only dependent_sleeper that is being converted in these patches. The wake_sleeping_dependent component still locks all runqueues and needs to succeed in order to ensure a task doesn't keep sleeping indefinitely. That one doesn't get called from schedule() so is far less expensive. This means I don't think we can change that cpu based locking order which I believe was introduce to prevent a deadlock (?DaveJ disovered it iirc). -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
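
The locking-order requirement Con refers to is the classic two-lock (ABBA)
deadlock: if each of two sibling CPUs holds its own runqueue lock and then
blocks waiting for the other's, neither makes progress. Taking the locks in a
fixed (cpu-number) order prevents it, which is what double_rq_lock() and the
old smt nice code do; trylock sidesteps the problem entirely because nothing
ever blocks while holding the first lock. A minimal illustration with
hypothetical runqueues rq_a and rq_b:

/*
 *   CPU A, in schedule():              CPU B, in schedule():
 *     spin_lock(&rq_a->lock);            spin_lock(&rq_b->lock);
 *     spin_lock(&rq_b->lock);   <--->    spin_lock(&rq_a->lock);
 *
 *                 both spin forever: ABBA deadlock.
 *
 * Fix 1: always take the lower-numbered cpu's runqueue lock first.
 * Fix 2: never block on the second lock at all:
 *
 *     if (spin_trylock(&rq_b->lock)) {
 *             ... work needing both runqueues ...
 *             spin_unlock(&rq_b->lock);
 *     }       else: give up this pass, no ordering needed
 */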
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:17 ` Con Kolivas @ 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:34 ` Chen, Kenneth W 2006-06-02 10:19 ` Con Kolivas 0 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 8:28 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 993 bytes --] Con Kolivas wrote: > On Friday 02 June 2006 17:53, Nick Piggin wrote: > >>This is a small micro-optimisation / cleanup we can do after >>smtnice gets converted to use trylocks. Might result in a little >>less cacheline footprint in some cases. > > > It's only dependent_sleeper that is being converted in these patches. The > wake_sleeping_dependent component still locks all runqueues and needs to Oh I missed that. > succeed in order to ensure a task doesn't keep sleeping indefinitely. That Let's make it use trylocks as well. wake_priority_sleeper should ensure things don't sleep forever I think? We should be optimising for the most common case, and in many workloads, the runqueue does go idle frequently. > one doesn't get called from schedule() so is far less expensive. This means I > don't think we can change that cpu based locking order which I believe was > introduce to prevent a deadlock (?DaveJ disovered it iirc). > AntonB, I think. -- SUSE Labs, Novell Inc. [-- Attachment #2: sntnice2.patch --] [-- Type: text/plain, Size: 1843 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 18:23:18.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 18:26:40.000000000 +1000 @@ -2686,6 +2686,9 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) { struct sched_domain *tmp, *sd = NULL; @@ -2699,22 +2702,13 @@ static void wake_sleeping_dependent(int if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sibling_map) { + if (unlikely(!spin_trylock(&cpu_rq(i)->lock))) + cpu_clear(i, sibling_map); + } + for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); @@ -2724,10 +2718,6 @@ static void wake_sleeping_dependent(int for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ } /* @@ -2961,13 +2951,6 @@ need_resched_nonpreemptible: next = rq->idle; rq->expired_timestamp = 0; wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:28 ` Nick Piggin @ 2006-06-02 8:34 ` Chen, Kenneth W 2006-06-02 8:56 ` Nick Piggin 2006-06-02 10:19 ` Con Kolivas 1 sibling, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:34 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:29 AM > Con Kolivas wrote: > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > > > >>This is a small micro-optimisation / cleanup we can do after > >>smtnice gets converted to use trylocks. Might result in a little > >>less cacheline footprint in some cases. > > > > > > It's only dependent_sleeper that is being converted in these patches. The > > wake_sleeping_dependent component still locks all runqueues and needs to > > Oh I missed that. > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. That > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > things don't sleep forever I think? We should be optimising for the most > common case, and in many workloads, the runqueue does go idle frequently. > Ha, you beat me by one minute. It did cross my mind to use try lock there as well, take a look at my version, I think I have a better inner loop. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:34 ` Chen, Kenneth W @ 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W ` (3 more replies) 0 siblings, 4 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 8:56 UTC (permalink / raw) To: Chen, Kenneth W Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 477 bytes --] Chen, Kenneth W wrote: > Ha, you beat me by one minute. It did cross my mind to use try lock there as > well, take a look at my version, I think I have a better inner loop. Actually you *have* to use trylocks I think, because the current runqueue is already locked. And why do we lock all siblings in the other case, for that matter? (not that it makes much difference except on niagara today). Rolled up patch with everyone's changes attached. -- SUSE Labs, Novell Inc. [-- Attachment #2: sched-smt.patch --] [-- Type: text/plain, Size: 6272 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 18:51:25.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 18:54:38.000000000 +1000 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1687,7 +1686,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -1700,7 +1699,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1736,7 +1735,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -2686,6 +2685,9 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) { struct sched_domain *tmp, *sd = NULL; @@ -2693,41 +2695,24 @@ static void wake_sleeping_dependent(int int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; wakeup_busy_runqueue(smt_rq); - } - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ + spin_unlock(&smt_rq->lock); + } } /* @@ -2740,48 +2725,38 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sibling_map) { + runqueue_t *smt_rq; + task_t *smt_curr; - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); - - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); - task_t *smt_curr = smt_rq->curr; + smt_curr = smt_rq->curr; /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) @@ -2834,10 +2809,10 @@ check_smt_task: else wakeup_busy_runqueue(smt_rq); } + + spin_unlock(&smt_rq->lock); } -out_unlock: - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); + return ret; } #else @@ -2845,7 +2820,8 @@ static inline void wake_sleeping_depende { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2967,32 +2943,13 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; - } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3030,6 +2987,8 @@ go_idle: } } next->sleep_type = 
SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); @@ -6126,7 +6085,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin @ 2006-06-02 9:17 ` Chen, Kenneth W 2006-06-02 9:25 ` Con Kolivas 2006-06-02 9:36 ` Chen, Kenneth W ` (2 subsequent siblings) 3 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:17 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. You are absolutely correct. I forgot about the lock ordering. > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. What about the part in dependent_sleeper() being so bully and actively resched other low priority sibling tasks? I think it would be better to just let the tasks running on sibling CPU to finish its current time slice and then let the backoff logic to kick in. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:17 ` Chen, Kenneth W @ 2006-06-02 9:25 ` Con Kolivas 2006-06-02 9:31 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 9:25 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > What about the part in dependent_sleeper() being so bully and actively > resched other low priority sibling tasks? I think it would be better > to just let the tasks running on sibling CPU to finish its current time > slice and then let the backoff logic to kick in. That would defeat the purpose of smt nice if the higher priority task starts after the lower priority task is running on its sibling cpu. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:25 ` Con Kolivas @ 2006-06-02 9:31 ` Chen, Kenneth W 2006-06-02 9:34 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:31 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > What about the part in dependent_sleeper() being so bully and actively > > resched other low priority sibling tasks? I think it would be better > > to just let the tasks running on sibling CPU to finish its current time > > slice and then let the backoff logic to kick in. > > That would defeat the purpose of smt nice if the higher priority task starts > after the lower priority task is running on its sibling cpu. But only for the duration of lower priority tasks' time slice. When lower priority tasks time slice is used up, a resched is force from scheduler_tick(), isn't it? And at that time, it is delayed to run because of smt_nice. You are saying user can't tolerate that short period of time that CPU resource is shared? It's hard to believe. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:31 ` Chen, Kenneth W @ 2006-06-02 9:34 ` Con Kolivas 2006-06-02 9:53 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 9:34 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:31, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > > > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > > What about the part in dependent_sleeper() being so bully and actively > > > resched other low priority sibling tasks? I think it would be better > > > to just let the tasks running on sibling CPU to finish its current time > > > slice and then let the backoff logic to kick in. > > > > That would defeat the purpose of smt nice if the higher priority task > > starts after the lower priority task is running on its sibling cpu. > > But only for the duration of lower priority tasks' time slice. When lower > priority tasks time slice is used up, a resched is force from > scheduler_tick(), isn't it? And at that time, it is delayed to run because > of smt_nice. You are saying user can't tolerate that short period of time > that CPU resource is shared? It's hard to believe. nice -20 vs nice 0 is 800ms vs 100ms. That's a long time to me. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
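
Where the 800ms and 100ms figures come from, assuming the 2.6.16-era
task_timeslice() scaling (an assumption about the code base, not spelled out in
the thread):

/*
 * SCALE_PRIO(x, prio) = max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2),
 *                           MIN_TIMESLICE)
 * with MAX_PRIO = 140, MAX_USER_PRIO = 40, DEF_TIMESLICE = 100ms, and
 * tasks at negative nice (static_prio below 120) scaled from DEF_TIMESLICE * 4:
 *
 *   nice -20 (static_prio 100):  400ms * (140 - 100) / 20 = 800ms
 *   nice   0 (static_prio 120):  100ms * (140 - 120) / 20 = 100ms
 *   nice  19 (static_prio 139):  100ms * (140 - 139) / 20 =   5ms
 */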
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:34 ` Con Kolivas @ 2006-06-02 9:53 ` Chen, Kenneth W 2006-06-02 10:12 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:53 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 2:34 AM > On Friday 02 June 2006 19:31, Chen, Kenneth W wrote: > > Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > > > > > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > > > What about the part in dependent_sleeper() being so bully and actively > > > > resched other low priority sibling tasks? I think it would be better > > > > to just let the tasks running on sibling CPU to finish its current time > > > > slice and then let the backoff logic to kick in. > > > > > > That would defeat the purpose of smt nice if the higher priority task > > > starts after the lower priority task is running on its sibling cpu. > > > > But only for the duration of lower priority tasks' time slice. When lower > > priority tasks time slice is used up, a resched is force from > > scheduler_tick(), isn't it? And at that time, it is delayed to run because > > of smt_nice. You are saying user can't tolerate that short period of time > > that CPU resource is shared? It's hard to believe. > > nice -20 vs nice 0 is 800ms vs 100ms. That's a long time to me. Yeah, but that is the worst case though. Average would probably be a lot lower than worst case. Also, on smt it's not like the current logical cpu is getting blocked because of another task is running on its sibling CPU. The hardware still guarantees equal share of hardware resources for both logical CPUs. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:53 ` Chen, Kenneth W @ 2006-06-02 10:12 ` Con Kolivas 2006-06-02 20:53 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:12 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > Yeah, but that is the worst case though. Average would probably be a lot > lower than worst case. Also, on smt it's not like the current logical cpu > is getting blocked because of another task is running on its sibling CPU. > The hardware still guarantees equal share of hardware resources for both > logical CPUs. "Equal share of hardware resources" is exactly the problem; they shouldn't have equal share since they're sharing one physical cpu's resources. It's a relative breakage of the imposed nice support and I disagree with your conclusion. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:12 ` Con Kolivas @ 2006-06-02 20:53 ` Chen, Kenneth W 2006-06-02 22:15 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 20:53 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > Yeah, but that is the worst case though. Average would probably be a lot > > lower than worst case. Also, on smt it's not like the current logical cpu > > is getting blocked because of another task is running on its sibling CPU. > > The hardware still guarantees equal share of hardware resources for both > > logical CPUs. > > "Equal share of hardware resources" is exactly the problem; they shouldn't > have equal share since they're sharing one physical cpu's resources. It's a > relative breakage of the imposed nice support and I disagree with your > conclusion. But you keep on missing the point that this only happens in the initial stage of tasks competing for CPU resources. If this is broken, then current smt nice is equally broken with the same reasoning: once the low priority task gets scheduled, there is nothing to kick it off the CPU until its entire time slice get used up. They compete equally with a high priority task running on the sibling CPU. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 20:53 ` Chen, Kenneth W @ 2006-06-02 22:15 ` Con Kolivas 2006-06-02 22:19 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:15 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > Yeah, but that is the worst case though. Average would probably be a > > > lot lower than worst case. Also, on smt it's not like the current > > > logical cpu is getting blocked because of another task is running on > > > its sibling CPU. The hardware still guarantees equal share of hardware > > > resources for both logical CPUs. > > > > "Equal share of hardware resources" is exactly the problem; they > > shouldn't have equal share since they're sharing one physical cpu's > > resources. It's a relative breakage of the imposed nice support and I > > disagree with your conclusion. > > But you keep on missing the point that this only happens in the initial > stage of tasks competing for CPU resources. > > If this is broken, then current smt nice is equally broken with the same > reasoning: once the low priority task gets scheduled, there is nothing > to kick it off the CPU until its entire time slice get used up. They > compete equally with a high priority task running on the sibling CPU. There has to be some way of metering it out and in the absence of cpu based hardware priority support (like ppc64 has) the only useful thing we have to work with is timeslice. Yes sometimes the high priority task is at the start and sometimes at the end of the timeslice but overall it balances the proportions out reasonably fairly. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:15 ` Con Kolivas @ 2006-06-02 22:19 ` Chen, Kenneth W 2006-06-02 22:31 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:19 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:15 PM > On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > > Yeah, but that is the worst case though. Average would probably be a > > > > lot lower than worst case. Also, on smt it's not like the current > > > > logical cpu is getting blocked because of another task is running on > > > > its sibling CPU. The hardware still guarantees equal share of hardware > > > > resources for both logical CPUs. > > > > > > "Equal share of hardware resources" is exactly the problem; they > > > shouldn't have equal share since they're sharing one physical cpu's > > > resources. It's a relative breakage of the imposed nice support and I > > > disagree with your conclusion. > > > > But you keep on missing the point that this only happens in the initial > > stage of tasks competing for CPU resources. > > > > If this is broken, then current smt nice is equally broken with the same > > reasoning: once the low priority task gets scheduled, there is nothing > > to kick it off the CPU until its entire time slice get used up. They > > compete equally with a high priority task running on the sibling CPU. > > There has to be some way of metering it out and in the absence of cpu based > hardware priority support (like ppc64 has) the only useful thing we have to > work with is timeslice. Yes sometimes the high priority task is at the start > and sometimes at the end of the timeslice but overall it balances the > proportions out reasonably fairly. Good! Then why special case the initial stage? Just let task run and it will even out statistically. Everyone is happy, less code, less special case, same end result. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:19 ` Chen, Kenneth W @ 2006-06-02 22:31 ` Con Kolivas 2006-06-02 22:58 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:31 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 08:19, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 3:15 PM > > > On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > > > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > > > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > > > Yeah, but that is the worst case though. Average would probably be > > > > > a lot lower than worst case. Also, on smt it's not like the > > > > > current logical cpu is getting blocked because of another task is > > > > > running on its sibling CPU. The hardware still guarantees equal > > > > > share of hardware resources for both logical CPUs. > > > > > > > > "Equal share of hardware resources" is exactly the problem; they > > > > shouldn't have equal share since they're sharing one physical cpu's > > > > resources. It's a relative breakage of the imposed nice support and I > > > > disagree with your conclusion. > > > > > > But you keep on missing the point that this only happens in the initial > > > stage of tasks competing for CPU resources. > > > > > > If this is broken, then current smt nice is equally broken with the > > > same reasoning: once the low priority task gets scheduled, there is > > > nothing to kick it off the CPU until its entire time slice get used up. > > > They compete equally with a high priority task running on the sibling > > > CPU. > > > > There has to be some way of metering it out and in the absence of cpu > > based hardware priority support (like ppc64 has) the only useful thing we > > have to work with is timeslice. Yes sometimes the high priority task is > > at the start and sometimes at the end of the timeslice but overall it > > balances the proportions out reasonably fairly. > > Good! Then why special case the initial stage? Just let task run and it > will even out statistically. Everyone is happy, less code, less special > case, same end result. Hang on I think I missed something there. What did you conclude I conceded there? When I say "work with timeslice" I mean use percentage of timeslice the way smt nice currently does. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:31 ` Con Kolivas @ 2006-06-02 22:58 ` Chen, Kenneth W 2006-06-03 0:02 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:58 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:32 PM > > > > But you keep on missing the point that this only happens in the initial > > > > stage of tasks competing for CPU resources. > > > > > > > > If this is broken, then current smt nice is equally broken with the > > > > same reasoning: once the low priority task gets scheduled, there is > > > > nothing to kick it off the CPU until its entire time slice get used up. > > > > They compete equally with a high priority task running on the sibling > > > > CPU. > > > > > > There has to be some way of metering it out and in the absence of cpu > > > based hardware priority support (like ppc64 has) the only useful thing we > > > have to work with is timeslice. Yes sometimes the high priority task is > > > at the start and sometimes at the end of the timeslice but overall it > > > balances the proportions out reasonably fairly. > > > > Good! Then why special case the initial stage? Just let task run and it > > will even out statistically. Everyone is happy, less code, less special > > case, same end result. > > Hang on I think I missed something there. What did you conclude I conceded > there? When I say "work with timeslice" I mean use percentage of timeslice > the way smt nice currently does. You haven't answered my question either. What is the benefit of special casing the initial stage of cpu resource competition? Is it quantitatively measurable? If so, how much and with what workload? Also there are oddness in the way that the sleep decision is made: 75% of resched decision is to sleep, while 25% of the time is to run, but there are fussiness in there and all depends on when the resched and time slice expires. Considering a scenario when resched occurs on jiffies%100 = 10, and low priority task time slice of 100 jiffies, the current code always decide to run the low priority task! if (rt_task(smt_curr)) { /* * With real time tasks we run non-rt tasks only * per_cpu_gain% of the time. */ if ((jiffies % DEF_TIMESLICE) > (sd->per_cpu_gain * DEF_TIMESLICE / 100)) ret = 1; The whole point is: precise fairness can not be (or difficult to) control at milli-second range. It all will work out statistically in a long run. Which again brings back the question: what does the special casing in precision control will bring to the overall scheme? So far, I have not seen any quantitative measure at all that indicates it is a win. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
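
Spelling out the scenario Ken describes, assuming the default SMT per_cpu_gain
of 25 and DEF_TIMESLICE of 100 jiffies:

/*
 *   threshold = sd->per_cpu_gain * DEF_TIMESLICE / 100 = 25
 *
 *   jiffies % 100 in (25, 99]:  ret = 1, non-RT task is put to sleep  (~75%)
 *   jiffies % 100 in [0, 25]:   ret = 0, non-RT task runs             (~25%)
 *
 * If rescheduling happens to be phase-locked at jiffies % 100 == 10, the
 * test is "10 > 25", which is always false, so the low priority task is
 * always allowed to run next to the RT sibling -- the alignment artifact
 * Ken is pointing at.
 */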
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:58 ` Chen, Kenneth W @ 2006-06-03 0:02 ` Con Kolivas 2006-06-03 0:08 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-03 0:02 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 08:58, Chen, Kenneth W wrote: > You haven't answered my question either. What is the benefit of special > casing the initial stage of cpu resource competition? Is it quantitatively > measurable? If so, how much and with what workload? Ah you mean what the whole point of smt nice is? Yes it's simple enough to do. Take the single hyperthreaded cpu with two cpu bound workloads. Let's say I run a cpu bound task nice 0 by itself and it completes in time X. If I boot it with hyperthread disabled and run a nice 0 and nice 19 task, the nice 0 task gets 95% of the cpu and completes in time X*0.95. If I boot with hyperthread enabled and run the nice 0 and nice 19 tasks, the nice 0 task gets 100% of one sibling and the nice 19 task 100% of the other sibling. The nice 0 task completes in X*0.6. With the smt nice code added it completed in X*0.95. The ratios here are dependent on the workload but that was the average I could determine from comparing mprime workloads at differing nice and kernel compiles. There is no explicit way on the Intel smt cpus to tell it which sibling is running lower priority tasks (sprinkling mwaits around at regular intervals is not a realistic option for example). -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-03 0:02 ` Con Kolivas @ 2006-06-03 0:08 ` Chen, Kenneth W 2006-06-03 0:27 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-03 0:08 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 5:03 PM > On Saturday 03 June 2006 08:58, Chen, Kenneth W wrote: > > You haven't answered my question either. What is the benefit of special > > casing the initial stage of cpu resource competition? Is it quantitatively > > measurable? If so, how much and with what workload? > > Ah you mean what the whole point of smt nice is? Yes it's simple enough to do. > Take the single hyperthreaded cpu with two cpu bound workloads. Let's say I > run a cpu bound task nice 0 by itself and it completes in time X. If I boot > it with hyperthread disabled and run a nice 0 and nice 19 task, the nice 0 > task gets 95% of the cpu and completes in time X*0.95. If I boot with > hyperthread enabled and run the nice 0 and nice 19 tasks, the nice 0 task > gets 100% of one sibling and the nice 19 task 100% of the other sibling. The > nice 0 task completes in X*0.6. With the smt nice code added it completed in > X*0.95. The ratios here are dependent on the workload but that was the > average I could determine from comparing mprime workloads at differing nice > and kernel compiles. There is no explicit way on the Intel smt cpus to tell > it which sibling is running lower priority tasks (sprinkling mwaits around at > regular intervals is not a realistic option for example). I know what smt nice is doing, and it is fine to have. I'm simply proposing the following patch, on top of last roll up patch. diff -u ./kernel/sched.c ./kernel/sched.c --- ./kernel/sched.c 2006-06-02 16:05:13.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 17:03:55.000000000 -0700 @@ -2782,7 +2782,7 @@ /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) - goto check_smt_task; + continue; /* * If a user task with lower static priority than the @@ -2806,32 +2806,6 @@ smt_slice(smt_curr, sd) > task_timeslice(p)) ret = 1; -check_smt_task: - if ((!smt_curr->mm && smt_curr != smt_rq->idle) || - rt_task(smt_curr)) - continue; - if (!p->mm) { - wakeup_busy_runqueue(smt_rq); - continue; - } - - /* - * Reschedule a lower priority task on the SMT sibling for - * it to be put to sleep, or wake it up if it has been put to - * sleep for priority reasons to see if it should run now. - */ - if (rt_task(p)) { - if ((jiffies % DEF_TIMESLICE) > - (sd->per_cpu_gain * DEF_TIMESLICE / 100)) - resched_task(smt_curr); - } else { - if (TASK_PREEMPTS_CURR(p, smt_rq) && - smt_slice(p, sd) > task_timeslice(smt_curr)) - resched_task(smt_curr); - else - wakeup_busy_runqueue(smt_rq); - } - spin_unlock(&smt_rq->lock); } return ret; ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-03 0:08 ` Chen, Kenneth W @ 2006-06-03 0:27 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-03 0:27 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 10:08, Chen, Kenneth W wrote: > I know what smt nice is doing, and it is fine to have. I'm simply > proposing the following patch, on top of last roll up patch. It still breaks the semantics somewhat, but I'm willing to concede that the removed code only helps a very small proportion of the time (with the largest timeslices) and it sits in the fastpath. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W @ 2006-06-02 9:36 ` Chen, Kenneth W 2006-06-02 10:30 ` Con Kolivas 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:36 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. OK, it's down to nit-picking now: Remove this_rq argument from wake_sleeping_dependent() since it is not used. Nick, you had that in your earlier version, but it got lost in the woods. I don't like cpumask being declared on the stack. Here is my version to rid it out in wake_sleeping_dependent() and dependent_sleeper(). - Ken --- ./kernel/sched.c.orig 2006-06-02 03:20:40.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 03:20:59.000000000 -0700 @@ -2714,10 +2714,9 @@ static inline void wakeup_busy_runqueue( /* * Called with interrupts disabled and this_rq's runqueue locked. */ -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) @@ -2728,10 +2727,11 @@ static void wake_sleeping_dependent(int if (!sd) return; - sibling_map = sd->span; - cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq = cpu_rq(i); + + if (i == this_cpu) + continue; if (unlikely(!spin_trylock(&smt_rq->lock))) continue; @@ -2761,7 +2761,6 @@ static int dependent_sleeper(int this_cp struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int ret = 0, i; for_each_domain(this_cpu, tmp) @@ -2772,12 +2771,13 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - sibling_map = sd->span; - cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq; task_t *smt_curr; + if (i == this_cpu) + continue; + smt_rq = cpu_rq(i); if (unlikely(!spin_trylock(&smt_rq->lock))) continue; @@ -2842,7 +2842,7 @@ check_smt_task: return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } @@ -2973,7 +2973,7 @@ need_resched_nonpreemptible: if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); + wake_sleeping_dependent(cpu); goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W 2006-06-02 9:36 ` Chen, Kenneth W @ 2006-06-02 10:30 ` Con Kolivas 2006-06-02 13:16 ` Con Kolivas 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:30 UTC (permalink / raw) To: Nick Piggin Cc: Chen, Kenneth W, linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 18:56, Nick Piggin wrote: > Chen, Kenneth W wrote: > > Ha, you beat me by one minute. It did cross my mind to use try lock there > > as well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). If we spinlock (and don't trylock as you're proposing) we'd have to do a double rq lock for each sibling. I guess half the time double_rq_lock will only be locking one runqueue... with 32 runqueues we either try to lock all 32 or lock 1.5 runqueues 32 times... ugh both are ugly. > Rolled up patch with everyone's changes attached. I'm still not sure that only doing trylock is adequate. Also, wake_sleeping_dependent is only called when a runqueue falls idle in schedule(), not when it's busy, so its cost (in my mind) is far less than that of dependent_sleeper. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
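For illustration, here is a self-contained userspace sketch (pthreads, not kernel code) of the two locking strategies being weighed in this subthread: taking both queue locks in a fixed global order, as double_rq_lock() does, versus opportunistically trylocking the sibling while already holding your own lock. All of the names are invented for the sketch; only the pattern corresponds to the scheduler discussion.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t rq_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rq_b = PTHREAD_MUTEX_INITIALIZER;

/* Analogue of double_rq_lock(): always take the lower address first so two
 * threads can never deadlock on the pair. */
static void lock_both(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	if (l1 > l2) {
		pthread_mutex_t *tmp = l1;
		l1 = l2;
		l2 = tmp;
	}
	pthread_mutex_lock(l1);
	pthread_mutex_lock(l2);
}

/* Analogue of the trylock approach: we already hold "mine" and simply skip
 * the cross-queue work if the sibling lock cannot be taken right now, so no
 * ordering rule is needed and we never block while holding our own lock. */
static int try_sibling(pthread_mutex_t *mine, pthread_mutex_t *sibling)
{
	int did_work = 0;

	pthread_mutex_lock(mine);
	if (pthread_mutex_trylock(sibling) == 0) {
		did_work = 1;		/* the cross-queue work would go here */
		pthread_mutex_unlock(sibling);
	}
	pthread_mutex_unlock(mine);
	return did_work;
}

int main(void)
{
	lock_both(&rq_b, &rq_a);	/* argument order does not matter */
	pthread_mutex_unlock(&rq_a);
	pthread_mutex_unlock(&rq_b);

	printf("trylock path did work: %d\n", try_sibling(&rq_a, &rq_b));
	return 0;
}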
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:30 ` Con Kolivas @ 2006-06-02 13:16 ` Con Kolivas 2006-06-02 21:54 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 13:16 UTC (permalink / raw) To: linux-kernel Cc: Nick Piggin, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 20:30, Con Kolivas wrote: > On Friday 02 June 2006 18:56, Nick Piggin wrote: > > And why do we lock all siblings in the other case, for that matter? (not > > that it makes much difference except on niagara today). > > If we spinlock (and don't trylock as you're proposing) we'd have to do a > double rq lock for each sibling. I guess half the time double_rq_lock will > only be locking one runqueue... with 32 runqueues we either try to lock all > 32 or lock 1.5 runqueues 32 times... ugh both are ugly. Thinking some more on this, it is also clear that the concept of per_cpu_gain for smt is basically wrong once we get beyond straightforward 2-thread hyperthreading. If we have more than 2 thread units per physical core, the per cpu gain per logical core will decrease the more threads are running on it. While it's always been obvious the gain per logical core is entirely dependent on the type of workload and won't be a simple 25% increase in cpu power, it is clear that even if we assume an "overall" increase in cpu for each logical core added, there will be some non-linear function relating power increase to thread units used. :-| -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
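For context, per_cpu_gain is a single percentage per sched_domain, and (as the smt_slice() helper quoted elsewhere in the thread shows) it scales a task's timeslice by (100 - per_cpu_gain) / 100. A standalone version of that arithmetic makes the limitation concrete: whatever value is chosen, it is one constant, while the real gain per logical cpu varies with how many thread units are busy. The 100 ms slice and the 10-40 gain values below are illustrative assumptions only.

#include <stdio.h>

/* Same arithmetic as the kernel's smt_slice(): the share of a timeslice a
 * sibling is considered entitled to, for a given per-cpu gain. */
static unsigned long smt_slice(unsigned long time_slice, int per_cpu_gain)
{
	return time_slice * (100 - per_cpu_gain) / 100;
}

int main(void)
{
	unsigned long slice = 100;	/* ms, a large timeslice */
	int gain;

	/* One constant has to stand in for all of these cases. */
	for (gain = 10; gain <= 40; gain += 10)
		printf("per_cpu_gain=%2d%% -> smt_slice(%lu ms) = %lu ms\n",
		       gain, slice, smt_slice(slice, gain));
	return 0;
}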
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 13:16 ` Con Kolivas @ 2006-06-02 21:54 ` Chen, Kenneth W 2006-06-02 22:04 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 21:54 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel Cc: Nick Piggin, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 6:17 AM > On Friday 02 June 2006 20:30, Con Kolivas wrote: > > On Friday 02 June 2006 18:56, Nick Piggin wrote: > > > And why do we lock all siblings in the other case, for that matter? (not > > > that it makes much difference except on niagara today). > > > > If we spinlock (and don't trylock as you're proposing) we'd have to do a > > double rq lock for each sibling. I guess half the time double_rq_lock will > > only be locking one runqueue... with 32 runqueues we either try to lock all > > 32 or lock 1.5 runqueues 32 times... ugh both are ugly. > > Thinking some more on this, it is also clear that the concept of per_cpu_gain > for smt is basically wrong once we get beyond straightforward 2-thread > hyperthreading. If we have more than 2 thread units per physical core, the > per cpu gain per logical core will decrease the more threads are running on > it. While it's always been obvious the gain per logical core is entirely > dependent on the type of workload and won't be a simple 25% increase in cpu > power, it is clear that even if we assume an "overall" increase in cpu for > each logical core added, there will be some non-linear function relating > power increase to thread units used. :-| In the context of having more than 2 sibling CPUs in a domain, doesn't the current code also suffer from a thundering herd problem? When a high priority task goes to sleep, it will wake up *all* sibling sleepers and then they will all fight for the CPU resource, but potentially only one will win? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 21:54 ` Chen, Kenneth W @ 2006-06-02 22:04 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:04 UTC (permalink / raw) To: Chen, Kenneth W Cc: linux-kernel, Nick Piggin, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 07:54, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 6:17 AM > > Thinking some more on this, it is also clear that the concept of > > per_cpu_gain for smt is basically wrong once we get beyond straightforward > > 2-thread hyperthreading. If we have more than 2 thread units per > > physical core, the per cpu gain per logical core will decrease the more > > threads are running on it. While it's always been obvious the gain per > > logical core is entirely dependent on the type of workload and won't be a > > simple 25% increase in cpu power, it is clear that even if we assume an > > "overall" increase in cpu for each logical core added, there will be some > > non-linear function relating power increase to thread units used. :-| > > In the context of having more than 2 sibling CPUs in a domain, doesn't the > current code also suffer from a thundering herd problem? When a high > priority task goes to sleep, it will wake up *all* sibling sleepers and > then they will all fight for the CPU resource, but potentially only one will > win? Yes. The smt nice code was never designed with that many threads in mind. This is why I'm bringing it up for discussion. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
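As a generic illustration of the thundering herd pattern raised in this exchange (ordinary pthread condition-variable code, not the scheduler code under discussion): every sleeper is woken even though only one of them can use the freed resource, and the rest merely contend and go back to waiting.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define SLEEPERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int slots;	/* resources that have become available */

static void *sleeper(void *arg)
{
	long id = (long)arg;

	pthread_mutex_lock(&lock);
	while (slots == 0)
		pthread_cond_wait(&cond, &lock);	/* woken with the herd */
	slots--;					/* but only one wins */
	pthread_mutex_unlock(&lock);
	printf("sleeper %ld got the slot\n", id);
	return NULL;
}

int main(void)
{
	pthread_t tid[SLEEPERS];
	long i;

	for (i = 0; i < SLEEPERS; i++)
		pthread_create(&tid[i], NULL, sleeper, (void *)i);
	sleep(1);

	pthread_mutex_lock(&lock);
	slots = 1;
	pthread_cond_broadcast(&cond);	/* wakes all SLEEPERS for one slot */
	pthread_mutex_unlock(&lock);

	sleep(1);	/* the losers are back asleep by now; exiting kills them */
	return 0;
}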
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin ` (2 preceding siblings ...) 2006-06-02 10:30 ` Con Kolivas @ 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:14 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. Rolled up patch on top of Nick's. This version doesn't have the change that removes the bully-ness in dependent_sleeper(), which is under debate right now (I still think it should be removed). --- ./kernel/sched.c.orig 2006-06-02 15:59:42.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 16:05:13.000000000 -0700 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1728,9 +1727,6 @@ unsigned long nr_active(void) /* * double_rq_lock - safely lock two runqueues * - * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. - * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. */ @@ -1742,7 +1738,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1778,7 +1774,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -2712,48 +2708,33 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ - cpu_clear(this_cpu, sibling_map); - - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq = cpu_rq(i); + if (i == this_cpu) + continue; + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; + wakeup_busy_runqueue(smt_rq); + spin_unlock(&smt_rq->lock); } - - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ } /* @@ -2766,48 +2747,38 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sd->span) { + runqueue_t *smt_rq; + task_t *smt_curr; - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); + if (i == this_cpu) + continue; - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); - task_t *smt_curr = smt_rq->curr; + smt_curr = smt_rq->curr; /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) @@ -2860,18 +2831,18 @@ check_smt_task: else wakeup_busy_runqueue(smt_rq); } + + spin_unlock(&smt_rq->lock); } -out_unlock: - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2993,32 +2964,13 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; - } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; + wake_sleeping_dependent(cpu); goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - 
* lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3056,6 +3008,8 @@ go_idle: } } next->sleep_type = SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); @@ -6152,7 +6106,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
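To summarise the shape of the rolled-up patch above, here is a condensed sketch of the resulting schedule() flow with everything except the smt nice hooks elided. It is paraphrased from the diff rather than compilable on its own: wake_sleeping_dependent() no longer drops this_rq->lock (so the nr_running re-checks disappear), and dependent_sleeper() now runs after "next" has already been chosen, demoting it to the idle task when the sibling comparison says it should wait.

	if (unlikely(!rq->nr_running)) {
		idle_balance(cpu, rq);
		if (!rq->nr_running) {
			next = rq->idle;
			rq->expired_timestamp = 0;
			/* only trylocks the siblings; this_rq stays locked */
			wake_sleeping_dependent(cpu);
			goto switch_tasks;
		}
	}

	/* ... pick "next" from the active/expired arrays as before ... */

	/* only trylocks the siblings; on failure the check is simply skipped */
	if (dependent_sleeper(cpu, rq, next))
		next = rq->idle;

switch_tasks:
	/* ... */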
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:34 ` Chen, Kenneth W @ 2006-06-02 10:19 ` Con Kolivas 2006-06-02 20:59 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:19 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 18:28, Nick Piggin wrote: > Con Kolivas wrote: > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > >>This is a small micro-optimisation / cleanup we can do after > >>smtnice gets converted to use trylocks. Might result in a little > >>less cacheline footprint in some cases. > > > > It's only dependent_sleeper that is being converted in these patches. The > > wake_sleeping_dependent component still locks all runqueues and needs to > > Oh I missed that. > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. > > That > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > things don't sleep forever I think? We should be optimising for the most > common case, and in many workloads, the runqueue does go idle frequently. wake_priority_sleeper is only called per tick which can be 10ms at 100HZ. I don't think that's fast enough. It could even be possible for a lower priority task to always just miss the wakeup if it's (very) unlucky. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
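For concreteness, the worst-case delay being argued about here is one full timer tick, i.e. 1000/HZ milliseconds. A quick standalone calculation for the HZ settings commonly configured at the time (the three values below are assumed typical choices):

#include <stdio.h>

int main(void)
{
	int hz[] = { 100, 250, 1000 };
	int i;

	/* Tick-driven checks such as wake_priority_sleeper run once per tick,
	 * so a task parked by dependent_sleeper() may wait up to a whole tick
	 * period before it is reconsidered. */
	for (i = 0; i < 3; i++)
		printf("HZ=%4d -> up to %5.1f ms before the next check\n",
		       hz[i], 1000.0 / hz[i]);
	return 0;
}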
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:19 ` Con Kolivas @ 2006-06-02 20:59 ` Chen, Kenneth W 0 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 20:59 UTC (permalink / raw) To: 'Con Kolivas', Nick Piggin Cc: linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:19 AM > On Friday 02 June 2006 18:28, Nick Piggin wrote: > > Con Kolivas wrote: > > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > > >>This is a small micro-optimisation / cleanup we can do after > > >>smtnice gets converted to use trylocks. Might result in a little > > >>less cacheline footprint in some cases. > > > > > > It's only dependent_sleeper that is being converted in these patches. The > > > wake_sleeping_dependent component still locks all runqueues and needs to > > > > Oh I missed that. > > > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. > > > That > > > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > > things don't sleep forever I think? We should be optimising for the most > > common case, and in many workloads, the runqueue does go idle frequently. > > wake_priority_sleeper is only called per tick which can be 10ms at 100HZ. I > don't think that's fast enough. It could even be possible for a lower > priority task to always just miss the wakeup if it's (very) unlucky. I see an inconsistency here: in the dependent_sleeper() function, you argued strongly that it is important for the high priority task to bully the low priority task off the sibling CPU. Here you are arguing just as strongly for putting the delayed low priority sleeper back onto the CPU as early as possible (where it could end up competing with the other high priority task). Such opposite treatment in the two paths doesn't look correct to me. ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:17 ` Con Kolivas @ 2006-06-02 8:38 ` Chen, Kenneth W 1 sibling, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:38 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 12:53 AM > Con Kolivas wrote: > > >>Nice to acknowledge Chris's idea for > >>trylocks in your changelog when you submit a final patch. > > > > > > I absolutely would and I would ask for him to sign off on it as well, once we > > agreed on a final form. > > This is a small micro-optimisation / cleanup we can do after > smtnice gets converted to use trylocks. Might result in a little > less cacheline footprint in some cases. Just to pile on one more micro-optimization: the kernel can break out of the for_each_domain loop as soon as it hits the first SD_SHARE_CPUPOWER domain. - Ken Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-01 23:14:47.000000000 -0700 +++ ./kernel/sched.c 2006-06-01 23:18:07.000000000 -0700 @@ -2780,8 +2780,10 @@ static int dependent_sleeper(int this_cp int ret = 0, i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; + break; + } if (!sd) return 0; ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 4:18 ` Nick Piggin 2006-06-02 6:08 ` Con Kolivas @ 2006-06-02 8:24 ` Chen, Kenneth W 1 sibling, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:24 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Thursday, June 01, 2006 9:19 PM > Con Kolivas wrote: > > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > > >>Actually looking even further, we only introduced the extra lookup of the > >>next task when we started unlocking the runqueue in schedule(). Since we > >>can get by without locking this_rq in schedule with this approach we can > >>simplify dependent_sleeper even further by doing the dependent sleeper > >>check after we have discovered what next is in schedule and avoid looking > >>it up twice. I'll hack something up to do that soon. > > > > > > Something like this (sorry I couldn't help but keep hacking on it). > > Looking pretty good. Nice to acknowledge Chris's idea for > trylocks in your changelog when you submit a final patch. Yes, as far as the lock is concerned in dependent_sleeper(), it looks pretty good. I do have other comments that I will follow up in another thread. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin @ 2006-06-02 8:31 ` Chen, Kenneth W 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:31 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel; +Cc: 'Chris Mason', Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 8:55 PM > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > Actually looking even further, we only introduced the extra lookup of the > > next task when we started unlocking the runqueue in schedule(). Since we > > can get by without locking this_rq in schedule with this approach we can > > simplify dependent_sleeper even further by doing the dependent sleeper > > check after we have discovered what next is in schedule and avoid looking > > it up twice. I'll hack something up to do that soon. > > Something like this (sorry I couldn't help but keep hacking on it). > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. The code in wake_sleeping_dependent() is also quite wacky: it unlocks current runqueue, then re-acquires ALL the sibling runqueue lock, only to call wakeup_busy_runqueue() against the smt sibling runqueue other than itself. AFAICT, wakeup_busy_runqueue() does not require *ALL* sibling lock to be held. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-02 01:57:28.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 02:19:37.000000000 -0700 @@ -2712,44 +2712,32 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ - cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sd->span) { + runqueue_t *smt_rq; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); + if (i == this_cpu) + continue; + smt_rq = cpu_rq(i); + spin_lock(&smt_rq->lock); wakeup_busy_runqueue(smt_rq); + spin_unlock(&smt_rq->lock); } - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); /* * We exit with this_cpu's rq still held and IRQs * still disabled: @@ -2857,7 +2845,7 @@ check_smt_task: return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } @@ -2988,14 +2976,8 @@ need_resched_nonpreemptible: if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; + wake_sleeping_dependent(cpu); + goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin 2006-06-02 8:31 ` Chen, Kenneth W @ 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:50 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel, 'Nick Piggin' Cc: 'Chris Mason', Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 8:55 PM > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > Actually looking even further, we only introduced the extra lookup of the > > next task when we started unlocking the runqueue in schedule(). Since we > > can get by without locking this_rq in schedule with this approach we can > > simplify dependent_sleeper even further by doing the dependent sleeper > > check after we have discovered what next is in schedule and avoid looking > > it up twice. I'll hack something up to do that soon. > > Something like this (sorry I couldn't help but keep hacking on it). > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. Why does dependent_sleeper() has to be so bully that it actively kick off the poor little guy on the smt sibling? Can't it just wait for the less fortunate process to finish off its time slice and then it can have more physical resource to spare? Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-02 02:37:37.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 02:41:20.000000000 -0700 @@ -2789,7 +2789,7 @@ static int dependent_sleeper(int this_cp /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) - goto check_smt_task; + continue; /* * If a user task with lower static priority than the @@ -2812,32 +2812,6 @@ static int dependent_sleeper(int this_cp !TASK_PREEMPTS_CURR(p, smt_rq) && smt_slice(smt_curr, sd) > task_timeslice(p)) ret = 1; - -check_smt_task: - if ((!smt_curr->mm && smt_curr != smt_rq->idle) || - rt_task(smt_curr)) - continue; - if (!p->mm) { - wakeup_busy_runqueue(smt_rq); - continue; - } - - /* - * Reschedule a lower priority task on the SMT sibling for - * it to be put to sleep, or wake it up if it has been put to - * sleep for priority reasons to see if it should run now. - */ - if (rt_task(p)) { - if ((jiffies % DEF_TIMESLICE) > - (sd->per_cpu_gain * DEF_TIMESLICE / 100)) - resched_task(smt_curr); - } else { - if (TASK_PREEMPTS_CURR(p, smt_rq) && - smt_slice(p, sd) > task_timeslice(smt_curr)) - resched_task(smt_curr); - else - wakeup_busy_runqueue(smt_rq); - } } for_each_cpu_mask(i, sibling_map) ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 1:59 ` Con Kolivas 2006-06-02 2:28 ` Con Kolivas @ 2006-06-02 2:35 ` Chen, Kenneth W 2006-06-02 3:04 ` Con Kolivas 1 sibling, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 2:35 UTC (permalink / raw) To: 'Con Kolivas'; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 6:59 PM > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > Hello everyone, > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 and > > > 2.6.5. We tracked down one of the regressions to lock contention in > > > schedule heavy workloads (~70,000 context switches per second) > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > > contention, hammering on the run queue locks. The patch below is more of > > > a discussion point than a suggested fix (although it does reduce lock > > > contention significantly). The dependent_sleeper code looks very > > > expensive to me, especially for using a spinlock to bounce control > > > between two different siblings in the same cpu. > > > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > > that schedule() is red hot in the kernel. I was just scratching my head > > not sure what's going on. This dependent_sleeper could be the culprit > > considering it is called almost at every context switch. I don't think > > we are hitting on lock contention, but by the large amount of code it > > executes. It really needs to be cut down .... > > Thanks for the suggestion. How about something like this which takes your > idea and further expands on it. Compiled, boot and runtime tests ok. > > /* > + * Try to lock all the siblings of a cpu that is already locked. As we're > + * only doing trylock the order is not important. > + */ > +static int trylock_smt_siblings(cpumask_t sibling_map) > +{ > + cpumask_t locked_siblings; > + int i; > + > + cpus_clear(locked_siblings); > + for_each_cpu_mask(i, sibling_map) { > + if (!spin_trylock(&cpu_rq(i)->lock)) > + break; > + cpu_set(i, locked_siblings); > + > + } > + > + /* Did we lock all the siblings? */ > + if (cpus_equal(sibling_map, locked_siblings)) > + return 1; > + > + for_each_cpu_mask(i, locked_siblings) > + spin_unlock(&cpu_rq(i)->lock); > + return 0; > +} I like Chris's version of trylock_smt_siblings(). Why create all that local variables? sibling_map is passed by value, so the whole thing is duplicated on the stack (I think it should be pass by pointer), then there is another locked_siblings mask declared followed by a bitmask compare. The big iron people who set CONFIG_NR_CPUS=1024 won't be amused because of all that bitmask copying. Let me hack up something too, so we can compare notes etc. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
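To put numbers on the "big iron" complaint above: a cpumask_t is a bitmap of NR_CPUS bits, so passing one by value copies the whole bitmap onto the stack, and comparing two masks walks it word by word. The calculation below is a standalone back-of-the-envelope model on a 64-bit machine; it does not use the real kernel type.

#include <stdio.h>

int main(void)
{
	int nr_cpus[] = { 8, 128, 1024 };
	int i;

	for (i = 0; i < 3; i++) {
		/* bitmaps are rounded up to whole unsigned longs */
		int words = (nr_cpus[i] + 8 * (int)sizeof(long) - 1) /
			    (8 * (int)sizeof(long));
		printf("NR_CPUS=%4d: %3d-byte cpumask copied per by-value call, "
		       "%2d long words per mask compare\n",
		       nr_cpus[i], words * (int)sizeof(long), words);
	}
	return 0;
}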
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 2:35 ` Chen, Kenneth W @ 2006-06-02 3:04 ` Con Kolivas 2006-06-02 3:23 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:04 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar On Friday 02 June 2006 12:35, Chen, Kenneth W wrote: > Con Kolivas wrote on Thursday, June 01, 2006 6:59 PM > > > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > > > Hello everyone, > > > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 > > > > and 2.6.5. We tracked down one of the regressions to lock contention > > > > in schedule heavy workloads (~70,000 context switches per second) > > > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the > > > > lock contention, hammering on the run queue locks. The patch below > > > > is more of a discussion point than a suggested fix (although it does > > > > reduce lock contention significantly). The dependent_sleeper code > > > > looks very expensive to me, especially for using a spinlock to bounce > > > > control between two different siblings in the same cpu. > > > > > > Yeah, this also sort of echo a recent discovery on one of our > > > benchmarks that schedule() is red hot in the kernel. I was just > > > scratching my head not sure what's going on. This dependent_sleeper > > > could be the culprit considering it is called almost at every context > > > switch. I don't think we are hitting on lock contention, but by the > > > large amount of code it executes. It really needs to be cut down .... > > > > Thanks for the suggestion. How about something like this which takes your > > idea and further expands on it. Compiled, boot and runtime tests ok. > > > > /* > > + * Try to lock all the siblings of a cpu that is already locked. As > > we're + * only doing trylock the order is not important. > > + */ > > +static int trylock_smt_siblings(cpumask_t sibling_map) > > +{ > > + cpumask_t locked_siblings; > > + int i; > > + > > + cpus_clear(locked_siblings); > > + for_each_cpu_mask(i, sibling_map) { > > + if (!spin_trylock(&cpu_rq(i)->lock)) > > + break; > > + cpu_set(i, locked_siblings); > > + > > + } > > + > > + /* Did we lock all the siblings? */ > > + if (cpus_equal(sibling_map, locked_siblings)) > > + return 1; > > + > > + for_each_cpu_mask(i, locked_siblings) > > + spin_unlock(&cpu_rq(i)->lock); > > + return 0; > > +} > > I like Chris's version of trylock_smt_siblings(). Why create all that > local variables? Because cpumasks are so pretty. > sibling_map is passed by value, so the whole thing is > duplicated on the stack (I think it should be pass by pointer), then > there is another locked_siblings mask declared followed by a bitmask > compare. The big iron people who set CONFIG_NR_CPUS=1024 won't be > amused because of all that bitmask copying. Good point. > > Let me hack up something too, so we can compare notes etc. Sure, you have the hack lock. Actually we can just trylock and if it fails remove it from the sibling_map and go to out_unlock without a separate trylock_smt_cpus function entirely... -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:04 ` Con Kolivas @ 2006-06-02 3:23 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:23 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 13:04, Con Kolivas wrote: > Sure, you have the hack lock. Actually we can just trylock and if it fails > remove it from the sibling_map and go to out_unlock without a separate > trylock_smt_cpus function entirely... I mean go to out_unlock if the sibling_map is empty, sorry. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
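Putting the suggestion and its correction together, the inline alternative described here would look roughly like the sketch below. It is not a tested patch: cpus_empty() is assumed from the cpumask helpers of that era, while the other identifiers all appear elsewhere in the thread.

	sibling_map = sd->span;
	cpu_clear(this_cpu, sibling_map);

	for_each_cpu_mask(i, sibling_map) {
		/* drop any sibling whose lock we cannot get right now */
		if (!spin_trylock(&cpu_rq(i)->lock))
			cpu_clear(i, sibling_map);
	}
	if (cpus_empty(sibling_map))
		goto out_unlock;	/* locked nothing, nothing to do */

	for_each_cpu_mask(i, sibling_map) {
		/* ... the usual sibling comparisons against cpu_rq(i) ... */
	}

out_unlock:
	for_each_cpu_mask(i, sibling_map)
		spin_unlock(&cpu_rq(i)->lock);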
end of thread, other threads:[~2006-06-03 0:27 UTC | newest]

Thread overview: 41+ messages
2006-06-01 22:55 [PATCH RFC] smt nice introduces significant lock contention Chris Mason
2006-06-01 23:57 ` Chen, Kenneth W
2006-06-02  1:59 ` Con Kolivas
2006-06-02  2:28 ` Con Kolivas
2006-06-02  3:55 ` Con Kolivas
2006-06-02  4:18 ` Nick Piggin
2006-06-02  6:08 ` Con Kolivas
2006-06-02  7:53 ` Nick Piggin
2006-06-02  8:17 ` Con Kolivas
2006-06-02  8:28 ` Nick Piggin
2006-06-02  8:34 ` Chen, Kenneth W
2006-06-02  8:56 ` Nick Piggin
2006-06-02  9:17 ` Chen, Kenneth W
2006-06-02  9:25 ` Con Kolivas
2006-06-02  9:31 ` Chen, Kenneth W
2006-06-02  9:34 ` Con Kolivas
2006-06-02  9:53 ` Chen, Kenneth W
2006-06-02 10:12 ` Con Kolivas
2006-06-02 20:53 ` Chen, Kenneth W
2006-06-02 22:15 ` Con Kolivas
2006-06-02 22:19 ` Chen, Kenneth W
2006-06-02 22:31 ` Con Kolivas
2006-06-02 22:58 ` Chen, Kenneth W
2006-06-03  0:02 ` Con Kolivas
2006-06-03  0:08 ` Chen, Kenneth W
2006-06-03  0:27 ` Con Kolivas
2006-06-02  9:36 ` Chen, Kenneth W
2006-06-02 10:30 ` Con Kolivas
2006-06-02 13:16 ` Con Kolivas
2006-06-02 21:54 ` Chen, Kenneth W
2006-06-02 22:04 ` Con Kolivas
2006-06-02 22:14 ` Chen, Kenneth W
2006-06-02 10:19 ` Con Kolivas
2006-06-02 20:59 ` Chen, Kenneth W
2006-06-02  8:38 ` Chen, Kenneth W
2006-06-02  8:24 ` Chen, Kenneth W
2006-06-02  8:31 ` Chen, Kenneth W
2006-06-02  8:50 ` Chen, Kenneth W
2006-06-02  2:35 ` Chen, Kenneth W
2006-06-02  3:04 ` Con Kolivas
2006-06-02  3:23 ` Con Kolivas