* [PATCH RFC] smt nice introduces significant lock contention
@ 2006-06-01 22:55 Chris Mason
2006-06-01 23:57 ` Chen, Kenneth W
0 siblings, 1 reply; 41+ messages in thread
From: Chris Mason @ 2006-06-01 22:55 UTC (permalink / raw)
To: Con Kolivas, linux-kernel
Hello everyone,
Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5. We tracked down one of the regressions to lock contention in
schedule-heavy workloads (~70,000 context switches per second).
kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks. The patch below is more of
a discussion point than a suggested fix (although it does reduce lock
contention significantly). The dependent_sleeper code looks very expensive
to me, especially its use of spinlocks to bounce control between two
siblings of the same physical cpu.
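
For readers skimming the diff, the idiom the patch introduces is: trylock every
sibling runqueue, back out completely if any of them is busy, and simply skip
the smt nice pass for that call to schedule(). A rough standalone sketch of the
same idiom (illustrative only, with a made-up name; the actual hunk below adds
trylock_smt_cpus()):

/*
 * Either every runqueue in sibling_map is locked on return (1), or none
 * of them are held (0) and the caller skips the smt nice logic this time.
 */
static int trylock_sibling_runqueues(cpumask_t sibling_map)
{
	cpumask_t locked = CPU_MASK_NONE;
	int i;

	for_each_cpu_mask(i, sibling_map) {
		if (!spin_trylock(&cpu_rq(i)->lock))
			goto backout;
		cpu_set(i, locked);
	}
	return 1;

backout:
	for_each_cpu_mask(i, locked)
		spin_unlock(&cpu_rq(i)->lock);
	return 0;
}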
--- a/kernel/sched.c Thu May 18 15:55:43 2006 -0400
+++ b/kernel/sched.c Tue May 23 21:13:52 2006 -0400
@@ -2630,6 +2630,27 @@ static inline void wakeup_busy_runqueue(
resched_task(rq->idle);
}
+static int trylock_smt_cpus(cpumask_t sibling_map)
+{
+ int ret = 1;
+ int numlocked = 0;
+ int i;
+ for_each_cpu_mask(i, sibling_map) {
+ ret = spin_trylock(&cpu_rq(i)->lock);
+ if (!ret)
+ break;
+ numlocked++;
+ }
+ if (ret || !numlocked)
+ return ret;
+ for_each_cpu_mask(i, sibling_map) {
+ spin_unlock(&cpu_rq(i)->lock);
+ if (--numlocked == 0)
+ break;
+ }
+ return 0;
+}
+
static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
{
struct sched_domain *tmp, *sd = NULL;
@@ -2643,22 +2664,16 @@ static void wake_sleeping_dependent(int
if (!sd)
return;
- /*
- * Unlock the current runqueue because we have to lock in
- * CPU order to avoid deadlocks. Caller knows that we might
- * unlock. We keep IRQs disabled.
- */
- spin_unlock(&this_rq->lock);
-
sibling_map = sd->span;
- for_each_cpu_mask(i, sibling_map)
- spin_lock(&cpu_rq(i)->lock);
/*
* We clear this CPU from the mask. This both simplifies the
* inner loop and keps this_rq locked when we exit:
*/
cpu_clear(this_cpu, sibling_map);
+
+ if (!trylock_smt_cpus(sibling_map))
+ return;
for_each_cpu_mask(i, sibling_map) {
runqueue_t *smt_rq = cpu_rq(i);
@@ -2703,11 +2718,10 @@ static int dependent_sleeper(int this_cp
* The same locking rules and details apply as for
* wake_sleeping_dependent():
*/
- spin_unlock(&this_rq->lock);
sibling_map = sd->span;
- for_each_cpu_mask(i, sibling_map)
- spin_lock(&cpu_rq(i)->lock);
cpu_clear(this_cpu, sibling_map);
+ if (!trylock_smt_cpus(sibling_map))
+ return 0;
/*
* Establish next task to be run - it might have gone away because
^ permalink raw reply [flat|nested] 41+ messages in thread* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-01 22:55 [PATCH RFC] smt nice introduces significant lock contention Chris Mason @ 2006-06-01 23:57 ` Chen, Kenneth W 2006-06-02 1:59 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-01 23:57 UTC (permalink / raw) To: 'Chris Mason', Con Kolivas, linux-kernel Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > Hello everyone, > > Recent benchmarks showed some performance regressions between 2.6.16 and > 2.6.5. We tracked down one of the regressions to lock contention in schedule > heavy workloads (~70,000 context switches per second) > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > contention, hammering on the run queue locks. The patch below is more of > a discussion point than a suggested fix (although it does reduce lock > contention significantly). The dependent_sleeper code looks very expensive > to me, especially for using a spinlock to bounce control between two different > siblings in the same cpu. Yeah, this also sort of echo a recent discovery on one of our benchmarks that schedule() is red hot in the kernel. I was just scratching my head not sure what's going on. This dependent_sleeper could be the culprit considering it is called almost at every context switch. I don't think we are hitting on lock contention, but by the large amount of code it executes. It really needs to be cut down .... - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-01 23:57 ` Chen, Kenneth W @ 2006-06-02 1:59 ` Con Kolivas 2006-06-02 2:28 ` Con Kolivas 2006-06-02 2:35 ` Chen, Kenneth W 0 siblings, 2 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 1:59 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > Hello everyone, > > > > Recent benchmarks showed some performance regressions between 2.6.16 and > > 2.6.5. We tracked down one of the regressions to lock contention in > > schedule heavy workloads (~70,000 context switches per second) > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > contention, hammering on the run queue locks. The patch below is more of > > a discussion point than a suggested fix (although it does reduce lock > > contention significantly). The dependent_sleeper code looks very > > expensive to me, especially for using a spinlock to bounce control > > between two different siblings in the same cpu. > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > that schedule() is red hot in the kernel. I was just scratching my head > not sure what's going on. This dependent_sleeper could be the culprit > considering it is called almost at every context switch. I don't think > we are hitting on lock contention, but by the large amount of code it > executes. It really needs to be cut down .... Thanks for the suggestion. How about something like this which takes your idea and further expands on it. Compiled, boot and runtime tests ok. --- It is not critical to functioning that dependent_sleeper() succeeds every time. We can significantly reduce the locking overhead and contention of dependent_sleeper by only doing trylock on the smt sibling runqueues. As we're only doing trylock it means we do not need to observe the normal locking order and we can get away without unlocking this_rq in schedule(). This provides us with an opportunity to simplify the code further. Signed-off-by: Con Kolivas <kernel@kolivas.org> --- kernel/sched.c | 58 +++++++++++++++++++++++++++++++++++---------------------- 1 files changed, 36 insertions(+), 22 deletions(-) Index: linux-2.6.17-rc5/kernel/sched.c =================================================================== --- linux-2.6.17-rc5.orig/kernel/sched.c 2006-05-25 21:33:19.000000000 +1000 +++ linux-2.6.17-rc5/kernel/sched.c 2006-06-02 11:51:13.000000000 +1000 @@ -1687,7 +1687,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -2731,6 +2731,32 @@ static void wake_sleeping_dependent(int } /* + * Try to lock all the siblings of a cpu that is already locked. As we're + * only doing trylock the order is not important. + */ +static int trylock_smt_siblings(cpumask_t sibling_map) +{ + cpumask_t locked_siblings; + int i; + + cpus_clear(locked_siblings); + for_each_cpu_mask(i, sibling_map) { + if (!spin_trylock(&cpu_rq(i)->lock)) + break; + cpu_set(i, locked_siblings); + + } + + /* Did we lock all the siblings? 
*/ + if (cpus_equal(sibling_map, locked_siblings)) + return 1; + + for_each_cpu_mask(i, locked_siblings) + spin_unlock(&cpu_rq(i)->lock); + return 0; +} + +/* * number of 'lost' timeslices this task wont be able to fully * utilize, if another task runs on a sibling. This models the * slowdown effect of other tasks running on siblings: @@ -2740,7 +2766,11 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and to have to drop this_rq's runlock we only + * trylock the sibling runqueues and give up if we can't acquire the locks. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; @@ -2755,22 +2785,14 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); + if (!trylock_smt_siblings(sibling_map)) + return 0; /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: + * Establish next task to be run. */ - if (!this_rq->nr_running) - goto out_unlock; array = this_rq->active; if (!array->nr_active) array = this_rq->expired; @@ -2835,7 +2857,7 @@ check_smt_task: wakeup_busy_runqueue(smt_rq); } } -out_unlock: + for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); return ret; @@ -2967,7 +2989,6 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; @@ -2986,13 +3007,6 @@ go_idle: next = rq->idle; goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 1:59 ` Con Kolivas @ 2006-06-02 2:28 ` Con Kolivas 2006-06-02 3:55 ` Con Kolivas 2006-06-02 2:35 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 2:28 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 11:59, Con Kolivas wrote: > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > Hello everyone, > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 > > > and 2.6.5. We tracked down one of the regressions to lock contention > > > in schedule heavy workloads (~70,000 context switches per second) > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > > contention, hammering on the run queue locks. The patch below is more > > > of a discussion point than a suggested fix (although it does reduce > > > lock contention significantly). The dependent_sleeper code looks very > > > expensive to me, especially for using a spinlock to bounce control > > > between two different siblings in the same cpu. > > > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > > that schedule() is red hot in the kernel. I was just scratching my head > > not sure what's going on. This dependent_sleeper could be the culprit > > considering it is called almost at every context switch. I don't think > > we are hitting on lock contention, but by the large amount of code it > > executes. It really needs to be cut down .... > > Thanks for the suggestion. How about something like this which takes your > idea and further expands on it. Compiled, boot and runtime tests ok. > > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. Actually looking even further, we only introduced the extra lookup of the next task when we started unlocking the runqueue in schedule(). Since we can get by without locking this_rq in schedule with this approach we can simplify dependent_sleeper even further by doing the dependent sleeper check after we have discovered what next is in schedule and avoid looking it up twice. I'll hack something up to do that soon. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 2:28 ` Con Kolivas @ 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:55 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 12:28, Con Kolivas wrote: > Actually looking even further, we only introduced the extra lookup of the > next task when we started unlocking the runqueue in schedule(). Since we > can get by without locking this_rq in schedule with this approach we can > simplify dependent_sleeper even further by doing the dependent sleeper > check after we have discovered what next is in schedule and avoid looking > it up twice. I'll hack something up to do that soon. Something like this (sorry I couldn't help but keep hacking on it). --- It is not critical to functioning that dependent_sleeper() succeeds every time. We can significantly reduce the locking overhead and contention of dependent_sleeper by only doing trylock on the smt sibling runqueues. As we're only doing trylock it means we do not need to observe the normal locking order and we can get away without unlocking this_rq in schedule(). This provides us with an opportunity to simplify the code further. Signed-off-by: Con Kolivas <kernel@kolivas.org> --- kernel/sched.c | 58 +++++++++++++++++---------------------------------------- 1 files changed, 18 insertions(+), 40 deletions(-) Index: linux-2.6.17-rc5/kernel/sched.c =================================================================== --- linux-2.6.17-rc5.orig/kernel/sched.c 2006-05-25 21:33:19.000000000 +1000 +++ linux-2.6.17-rc5/kernel/sched.c 2006-06-02 13:54:32.000000000 +1000 @@ -1687,7 +1687,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -2740,13 +2740,18 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. 
+ */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) if (tmp->flags & SD_SHARE_CPUPOWER) @@ -2755,29 +2760,12 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); - - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); - - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + for_each_cpu_mask(i, sibling_map) { + if (unlikely(!spin_trylock(&cpu_rq(i)->lock))) + cpu_clear(i, sibling_map); + } for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); @@ -2835,7 +2823,7 @@ check_smt_task: wakeup_busy_runqueue(smt_rq); } } -out_unlock: + for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); return ret; @@ -2845,7 +2833,8 @@ static inline void wake_sleeping_depende { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2967,7 +2956,6 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; @@ -2981,18 +2969,6 @@ go_idle: if (!rq->nr_running) goto switch_tasks; } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; - goto switch_tasks; - } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3030,6 +3006,8 @@ go_idle: } } next->sleep_type = SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
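
A side note on the numbers these checks are built on: smt_slice(), visible in
the hunks above, scales a task's timeslice by (100 - per_cpu_gain)/100, i.e. by
the share of throughput a task is assumed to keep while its sibling is busy.
Assuming the usual SMT sched-domain default of per_cpu_gain = 25 (an assumption
about the configuration, not something stated in this thread), the figures work
out roughly as:

/*
 *   smt_slice(p, sd) = p->time_slice * (100 - sd->per_cpu_gain) / 100
 *
 *   per_cpu_gain = 25, nice   0 task (100ms slice):  smt_slice =  75ms
 *   per_cpu_gain = 25, nice -20 task (800ms slice):  smt_slice = 600ms
 *
 * dependent_sleeper() compares smt_slice() of one task against the full
 * task_timeslice() of the task on the sibling to decide whether letting
 * both run at once is worth it.
 */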
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas @ 2006-06-02 4:18 ` Nick Piggin 2006-06-02 6:08 ` Con Kolivas 2006-06-02 8:24 ` Chen, Kenneth W 2006-06-02 8:31 ` Chen, Kenneth W 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 4:18 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar Con Kolivas wrote: > On Friday 02 June 2006 12:28, Con Kolivas wrote: > >>Actually looking even further, we only introduced the extra lookup of the >>next task when we started unlocking the runqueue in schedule(). Since we >>can get by without locking this_rq in schedule with this approach we can >>simplify dependent_sleeper even further by doing the dependent sleeper >>check after we have discovered what next is in schedule and avoid looking >>it up twice. I'll hack something up to do that soon. > > > Something like this (sorry I couldn't help but keep hacking on it). Looking pretty good. Nice to acknowledge Chris's idea for trylocks in your changelog when you submit a final patch. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 4:18 ` Nick Piggin @ 2006-06-02 6:08 ` Con Kolivas 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:24 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 6:08 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 14:18, Nick Piggin wrote: > Con Kolivas wrote: > > On Friday 02 June 2006 12:28, Con Kolivas wrote: > >>Actually looking even further, we only introduced the extra lookup of the > >>next task when we started unlocking the runqueue in schedule(). Since we > >>can get by without locking this_rq in schedule with this approach we can > >>simplify dependent_sleeper even further by doing the dependent sleeper > >>check after we have discovered what next is in schedule and avoid looking > >>it up twice. I'll hack something up to do that soon. > > > > Something like this (sorry I couldn't help but keep hacking on it). > > Looking pretty good. Thanks > Nice to acknowledge Chris's idea for > trylocks in your changelog when you submit a final patch. I absolutely would and I would ask for him to sign off on it as well, once we agreed on a final form. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 6:08 ` Con Kolivas @ 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:17 ` Con Kolivas 2006-06-02 8:38 ` Chen, Kenneth W 0 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 7:53 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 464 bytes --] Con Kolivas wrote: >>Nice to acknowledge Chris's idea for >>trylocks in your changelog when you submit a final patch. > > > I absolutely would and I would ask for him to sign off on it as well, once we > agreed on a final form. No worries, I thought you would ;) This is a small micro-optimisation / cleanup we can do after smtnice gets converted to use trylocks. Might result in a little less cacheline footprint in some cases. -- SUSE Labs, Novell Inc. [-- Attachment #2: sched-lock-opt.patch --] [-- Type: text/plain, Size: 1157 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 17:46:23.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 17:48:50.000000000 +1000 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1700,7 +1699,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1736,7 +1735,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -6104,7 +6103,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 7:53 ` Nick Piggin @ 2006-06-02 8:17 ` Con Kolivas 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:38 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 8:17 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 17:53, Nick Piggin wrote: > This is a small micro-optimisation / cleanup we can do after > smtnice gets converted to use trylocks. Might result in a little > less cacheline footprint in some cases. It's only dependent_sleeper that is being converted in these patches. The wake_sleeping_dependent component still locks all runqueues and needs to succeed in order to ensure a task doesn't keep sleeping indefinitely. That one doesn't get called from schedule() so is far less expensive. This means I don't think we can change that cpu based locking order which I believe was introduce to prevent a deadlock (?DaveJ disovered it iirc). -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
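
The locking-order requirement Con refers to is the classic two-lock (ABBA)
deadlock: if each of two sibling CPUs holds its own runqueue lock and then
blocks waiting for the other's, neither makes progress. Taking the locks in a
fixed (cpu-number) order prevents it, which is what double_rq_lock() and the
old smt nice code do; trylock sidesteps the problem entirely because nothing
ever blocks while holding the first lock. A minimal illustration with
hypothetical runqueues rq_a and rq_b:

/*
 *   CPU A, in schedule():              CPU B, in schedule():
 *     spin_lock(&rq_a->lock);            spin_lock(&rq_b->lock);
 *     spin_lock(&rq_b->lock);   <--->    spin_lock(&rq_a->lock);
 *
 *                 both spin forever: ABBA deadlock.
 *
 * Fix 1: always take the lower-numbered cpu's runqueue lock first.
 * Fix 2: never block on the second lock at all:
 *
 *     if (spin_trylock(&rq_b->lock)) {
 *             ... work needing both runqueues ...
 *             spin_unlock(&rq_b->lock);
 *     }       else: give up this pass, no ordering needed
 */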
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:17 ` Con Kolivas @ 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:34 ` Chen, Kenneth W 2006-06-02 10:19 ` Con Kolivas 0 siblings, 2 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 8:28 UTC (permalink / raw) To: Con Kolivas Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 993 bytes --] Con Kolivas wrote: > On Friday 02 June 2006 17:53, Nick Piggin wrote: > >>This is a small micro-optimisation / cleanup we can do after >>smtnice gets converted to use trylocks. Might result in a little >>less cacheline footprint in some cases. > > > It's only dependent_sleeper that is being converted in these patches. The > wake_sleeping_dependent component still locks all runqueues and needs to Oh I missed that. > succeed in order to ensure a task doesn't keep sleeping indefinitely. That Let's make it use trylocks as well. wake_priority_sleeper should ensure things don't sleep forever I think? We should be optimising for the most common case, and in many workloads, the runqueue does go idle frequently. > one doesn't get called from schedule() so is far less expensive. This means I > don't think we can change that cpu based locking order which I believe was > introduce to prevent a deadlock (?DaveJ disovered it iirc). > AntonB, I think. -- SUSE Labs, Novell Inc. [-- Attachment #2: sntnice2.patch --] [-- Type: text/plain, Size: 1843 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 18:23:18.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 18:26:40.000000000 +1000 @@ -2686,6 +2686,9 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) { struct sched_domain *tmp, *sd = NULL; @@ -2699,22 +2702,13 @@ static void wake_sleeping_dependent(int if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sibling_map) { + if (unlikely(!spin_trylock(&cpu_rq(i)->lock))) + cpu_clear(i, sibling_map); + } + for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); @@ -2724,10 +2718,6 @@ static void wake_sleeping_dependent(int for_each_cpu_mask(i, sibling_map) spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ } /* @@ -2961,13 +2951,6 @@ need_resched_nonpreemptible: next = rq->idle; rq->expired_timestamp = 0; wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:28 ` Nick Piggin @ 2006-06-02 8:34 ` Chen, Kenneth W 2006-06-02 8:56 ` Nick Piggin 2006-06-02 10:19 ` Con Kolivas 1 sibling, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:34 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:29 AM > Con Kolivas wrote: > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > > > >>This is a small micro-optimisation / cleanup we can do after > >>smtnice gets converted to use trylocks. Might result in a little > >>less cacheline footprint in some cases. > > > > > > It's only dependent_sleeper that is being converted in these patches. The > > wake_sleeping_dependent component still locks all runqueues and needs to > > Oh I missed that. > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. That > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > things don't sleep forever I think? We should be optimising for the most > common case, and in many workloads, the runqueue does go idle frequently. > Ha, you beat me by one minute. It did cross my mind to use try lock there as well, take a look at my version, I think I have a better inner loop. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:34 ` Chen, Kenneth W @ 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W ` (3 more replies) 0 siblings, 4 replies; 41+ messages in thread From: Nick Piggin @ 2006-06-02 8:56 UTC (permalink / raw) To: Chen, Kenneth W Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar [-- Attachment #1: Type: text/plain, Size: 477 bytes --] Chen, Kenneth W wrote: > Ha, you beat me by one minute. It did cross my mind to use try lock there as > well, take a look at my version, I think I have a better inner loop. Actually you *have* to use trylocks I think, because the current runqueue is already locked. And why do we lock all siblings in the other case, for that matter? (not that it makes much difference except on niagara today). Rolled up patch with everyone's changes attached. -- SUSE Labs, Novell Inc. [-- Attachment #2: sched-smt.patch --] [-- Type: text/plain, Size: 6272 bytes --] Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c 2006-06-02 18:51:25.000000000 +1000 +++ linux-2.6/kernel/sched.c 2006-06-02 18:54:38.000000000 +1000 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1687,7 +1686,7 @@ unsigned long nr_active(void) * double_rq_lock - safely lock two runqueues * * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. + * wake_dependent_sleeper. * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. @@ -1700,7 +1699,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1736,7 +1735,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -2686,6 +2685,9 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) { struct sched_domain *tmp, *sd = NULL; @@ -2693,41 +2695,24 @@ static void wake_sleeping_dependent(int int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { runqueue_t *smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; wakeup_busy_runqueue(smt_rq); - } - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ + spin_unlock(&smt_rq->lock); + } } /* @@ -2740,48 +2725,38 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sibling_map) { + runqueue_t *smt_rq; + task_t *smt_curr; - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); - - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); - task_t *smt_curr = smt_rq->curr; + smt_curr = smt_rq->curr; /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) @@ -2834,10 +2809,10 @@ check_smt_task: else wakeup_busy_runqueue(smt_rq); } + + spin_unlock(&smt_rq->lock); } -out_unlock: - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); + return ret; } #else @@ -2845,7 +2820,8 @@ static inline void wake_sleeping_depende { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2967,32 +2943,13 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; - } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - * lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3030,6 +2987,8 @@ go_idle: } } next->sleep_type = 
SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); @@ -6126,7 +6085,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin @ 2006-06-02 9:17 ` Chen, Kenneth W 2006-06-02 9:25 ` Con Kolivas 2006-06-02 9:36 ` Chen, Kenneth W ` (2 subsequent siblings) 3 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:17 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. You are absolutely correct. I forgot about the lock ordering. > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. What about the part in dependent_sleeper() being so bully and actively resched other low priority sibling tasks? I think it would be better to just let the tasks running on sibling CPU to finish its current time slice and then let the backoff logic to kick in. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:17 ` Chen, Kenneth W @ 2006-06-02 9:25 ` Con Kolivas 2006-06-02 9:31 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 9:25 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > What about the part in dependent_sleeper() being so bully and actively > resched other low priority sibling tasks? I think it would be better > to just let the tasks running on sibling CPU to finish its current time > slice and then let the backoff logic to kick in. That would defeat the purpose of smt nice if the higher priority task starts after the lower priority task is running on its sibling cpu. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:25 ` Con Kolivas @ 2006-06-02 9:31 ` Chen, Kenneth W 2006-06-02 9:34 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:31 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > What about the part in dependent_sleeper() being so bully and actively > > resched other low priority sibling tasks? I think it would be better > > to just let the tasks running on sibling CPU to finish its current time > > slice and then let the backoff logic to kick in. > > That would defeat the purpose of smt nice if the higher priority task starts > after the lower priority task is running on its sibling cpu. But only for the duration of lower priority tasks' time slice. When lower priority tasks time slice is used up, a resched is force from scheduler_tick(), isn't it? And at that time, it is delayed to run because of smt_nice. You are saying user can't tolerate that short period of time that CPU resource is shared? It's hard to believe. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:31 ` Chen, Kenneth W @ 2006-06-02 9:34 ` Con Kolivas 2006-06-02 9:53 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 9:34 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:31, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > > > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > > What about the part in dependent_sleeper() being so bully and actively > > > resched other low priority sibling tasks? I think it would be better > > > to just let the tasks running on sibling CPU to finish its current time > > > slice and then let the backoff logic to kick in. > > > > That would defeat the purpose of smt nice if the higher priority task > > starts after the lower priority task is running on its sibling cpu. > > But only for the duration of lower priority tasks' time slice. When lower > priority tasks time slice is used up, a resched is force from > scheduler_tick(), isn't it? And at that time, it is delayed to run because > of smt_nice. You are saying user can't tolerate that short period of time > that CPU resource is shared? It's hard to believe. nice -20 vs nice 0 is 800ms vs 100ms. That's a long time to me. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
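
Where the 800ms and 100ms figures come from, assuming the 2.6.16-era
task_timeslice() scaling (an assumption about the code base, not spelled out in
the thread):

/*
 * SCALE_PRIO(x, prio) = max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2),
 *                           MIN_TIMESLICE)
 * with MAX_PRIO = 140, MAX_USER_PRIO = 40, DEF_TIMESLICE = 100ms, and
 * tasks at negative nice (static_prio below 120) scaled from DEF_TIMESLICE * 4:
 *
 *   nice -20 (static_prio 100):  400ms * (140 - 100) / 20 = 800ms
 *   nice   0 (static_prio 120):  100ms * (140 - 120) / 20 = 100ms
 *   nice  19 (static_prio 139):  100ms * (140 - 139) / 20 =   5ms
 */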
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:34 ` Con Kolivas @ 2006-06-02 9:53 ` Chen, Kenneth W 2006-06-02 10:12 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:53 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 2:34 AM > On Friday 02 June 2006 19:31, Chen, Kenneth W wrote: > > Con Kolivas wrote on Friday, June 02, 2006 2:25 AM > > > > > On Friday 02 June 2006 19:17, Chen, Kenneth W wrote: > > > > What about the part in dependent_sleeper() being so bully and actively > > > > resched other low priority sibling tasks? I think it would be better > > > > to just let the tasks running on sibling CPU to finish its current time > > > > slice and then let the backoff logic to kick in. > > > > > > That would defeat the purpose of smt nice if the higher priority task > > > starts after the lower priority task is running on its sibling cpu. > > > > But only for the duration of lower priority tasks' time slice. When lower > > priority tasks time slice is used up, a resched is force from > > scheduler_tick(), isn't it? And at that time, it is delayed to run because > > of smt_nice. You are saying user can't tolerate that short period of time > > that CPU resource is shared? It's hard to believe. > > nice -20 vs nice 0 is 800ms vs 100ms. That's a long time to me. Yeah, but that is the worst case though. Average would probably be a lot lower than worst case. Also, on smt it's not like the current logical cpu is getting blocked because of another task is running on its sibling CPU. The hardware still guarantees equal share of hardware resources for both logical CPUs. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 9:53 ` Chen, Kenneth W @ 2006-06-02 10:12 ` Con Kolivas 2006-06-02 20:53 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:12 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > Yeah, but that is the worst case though. Average would probably be a lot > lower than worst case. Also, on smt it's not like the current logical cpu > is getting blocked because of another task is running on its sibling CPU. > The hardware still guarantees equal share of hardware resources for both > logical CPUs. "Equal share of hardware resources" is exactly the problem; they shouldn't have equal share since they're sharing one physical cpu's resources. It's a relative breakage of the imposed nice support and I disagree with your conclusion. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:12 ` Con Kolivas @ 2006-06-02 20:53 ` Chen, Kenneth W 2006-06-02 22:15 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 20:53 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > Yeah, but that is the worst case though. Average would probably be a lot > > lower than worst case. Also, on smt it's not like the current logical cpu > > is getting blocked because of another task is running on its sibling CPU. > > The hardware still guarantees equal share of hardware resources for both > > logical CPUs. > > "Equal share of hardware resources" is exactly the problem; they shouldn't > have equal share since they're sharing one physical cpu's resources. It's a > relative breakage of the imposed nice support and I disagree with your > conclusion. But you keep on missing the point that this only happens in the initial stage of tasks competing for CPU resources. If this is broken, then current smt nice is equally broken with the same reasoning: once the low priority task gets scheduled, there is nothing to kick it off the CPU until its entire time slice get used up. They compete equally with a high priority task running on the sibling CPU. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 20:53 ` Chen, Kenneth W @ 2006-06-02 22:15 ` Con Kolivas 2006-06-02 22:19 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:15 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > Yeah, but that is the worst case though. Average would probably be a > > > lot lower than worst case. Also, on smt it's not like the current > > > logical cpu is getting blocked because of another task is running on > > > its sibling CPU. The hardware still guarantees equal share of hardware > > > resources for both logical CPUs. > > > > "Equal share of hardware resources" is exactly the problem; they > > shouldn't have equal share since they're sharing one physical cpu's > > resources. It's a relative breakage of the imposed nice support and I > > disagree with your conclusion. > > But you keep on missing the point that this only happens in the initial > stage of tasks competing for CPU resources. > > If this is broken, then current smt nice is equally broken with the same > reasoning: once the low priority task gets scheduled, there is nothing > to kick it off the CPU until its entire time slice get used up. They > compete equally with a high priority task running on the sibling CPU. There has to be some way of metering it out and in the absence of cpu based hardware priority support (like ppc64 has) the only useful thing we have to work with is timeslice. Yes sometimes the high priority task is at the start and sometimes at the end of the timeslice but overall it balances the proportions out reasonably fairly. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:15 ` Con Kolivas @ 2006-06-02 22:19 ` Chen, Kenneth W 2006-06-02 22:31 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:19 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:15 PM > On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > > Yeah, but that is the worst case though. Average would probably be a > > > > lot lower than worst case. Also, on smt it's not like the current > > > > logical cpu is getting blocked because of another task is running on > > > > its sibling CPU. The hardware still guarantees equal share of hardware > > > > resources for both logical CPUs. > > > > > > "Equal share of hardware resources" is exactly the problem; they > > > shouldn't have equal share since they're sharing one physical cpu's > > > resources. It's a relative breakage of the imposed nice support and I > > > disagree with your conclusion. > > > > But you keep on missing the point that this only happens in the initial > > stage of tasks competing for CPU resources. > > > > If this is broken, then current smt nice is equally broken with the same > > reasoning: once the low priority task gets scheduled, there is nothing > > to kick it off the CPU until its entire time slice get used up. They > > compete equally with a high priority task running on the sibling CPU. > > There has to be some way of metering it out and in the absence of cpu based > hardware priority support (like ppc64 has) the only useful thing we have to > work with is timeslice. Yes sometimes the high priority task is at the start > and sometimes at the end of the timeslice but overall it balances the > proportions out reasonably fairly. Good! Then why special case the initial stage? Just let task run and it will even out statistically. Everyone is happy, less code, less special case, same end result. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:19 ` Chen, Kenneth W @ 2006-06-02 22:31 ` Con Kolivas 2006-06-02 22:58 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:31 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 08:19, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 3:15 PM > > > On Saturday 03 June 2006 06:53, Chen, Kenneth W wrote: > > > Con Kolivas wrote on Friday, June 02, 2006 3:13 AM > > > > > > > On Friday 02 June 2006 19:53, Chen, Kenneth W wrote: > > > > > Yeah, but that is the worst case though. Average would probably be > > > > > a lot lower than worst case. Also, on smt it's not like the > > > > > current logical cpu is getting blocked because of another task is > > > > > running on its sibling CPU. The hardware still guarantees equal > > > > > share of hardware resources for both logical CPUs. > > > > > > > > "Equal share of hardware resources" is exactly the problem; they > > > > shouldn't have equal share since they're sharing one physical cpu's > > > > resources. It's a relative breakage of the imposed nice support and I > > > > disagree with your conclusion. > > > > > > But you keep on missing the point that this only happens in the initial > > > stage of tasks competing for CPU resources. > > > > > > If this is broken, then current smt nice is equally broken with the > > > same reasoning: once the low priority task gets scheduled, there is > > > nothing to kick it off the CPU until its entire time slice get used up. > > > They compete equally with a high priority task running on the sibling > > > CPU. > > > > There has to be some way of metering it out and in the absence of cpu > > based hardware priority support (like ppc64 has) the only useful thing we > > have to work with is timeslice. Yes sometimes the high priority task is > > at the start and sometimes at the end of the timeslice but overall it > > balances the proportions out reasonably fairly. > > Good! Then why special case the initial stage? Just let task run and it > will even out statistically. Everyone is happy, less code, less special > case, same end result. Hang on I think I missed something there. What did you conclude I conceded there? When I say "work with timeslice" I mean use percentage of timeslice the way smt nice currently does. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:31 ` Con Kolivas @ 2006-06-02 22:58 ` Chen, Kenneth W 2006-06-03 0:02 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:58 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:32 PM > > > > But you keep on missing the point that this only happens in the initial > > > > stage of tasks competing for CPU resources. > > > > > > > > If this is broken, then current smt nice is equally broken with the > > > > same reasoning: once the low priority task gets scheduled, there is > > > > nothing to kick it off the CPU until its entire time slice get used up. > > > > They compete equally with a high priority task running on the sibling > > > > CPU. > > > > > > There has to be some way of metering it out and in the absence of cpu > > > based hardware priority support (like ppc64 has) the only useful thing we > > > have to work with is timeslice. Yes sometimes the high priority task is > > > at the start and sometimes at the end of the timeslice but overall it > > > balances the proportions out reasonably fairly. > > > > Good! Then why special case the initial stage? Just let task run and it > > will even out statistically. Everyone is happy, less code, less special > > case, same end result. > > Hang on I think I missed something there. What did you conclude I conceded > there? When I say "work with timeslice" I mean use percentage of timeslice > the way smt nice currently does. You haven't answered my question either. What is the benefit of special casing the initial stage of cpu resource competition? Is it quantitatively measurable? If so, how much and with what workload? Also there are oddness in the way that the sleep decision is made: 75% of resched decision is to sleep, while 25% of the time is to run, but there are fussiness in there and all depends on when the resched and time slice expires. Considering a scenario when resched occurs on jiffies%100 = 10, and low priority task time slice of 100 jiffies, the current code always decide to run the low priority task! if (rt_task(smt_curr)) { /* * With real time tasks we run non-rt tasks only * per_cpu_gain% of the time. */ if ((jiffies % DEF_TIMESLICE) > (sd->per_cpu_gain * DEF_TIMESLICE / 100)) ret = 1; The whole point is: precise fairness can not be (or difficult to) control at milli-second range. It all will work out statistically in a long run. Which again brings back the question: what does the special casing in precision control will bring to the overall scheme? So far, I have not seen any quantitative measure at all that indicates it is a win. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
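
Spelling out the scenario Ken describes, assuming the default SMT per_cpu_gain
of 25 and DEF_TIMESLICE of 100 jiffies:

/*
 *   threshold = sd->per_cpu_gain * DEF_TIMESLICE / 100 = 25
 *
 *   jiffies % 100 in (25, 99]:  ret = 1, non-RT task is put to sleep  (~75%)
 *   jiffies % 100 in [0, 25]:   ret = 0, non-RT task runs             (~25%)
 *
 * If rescheduling happens to be phase-locked at jiffies % 100 == 10, the
 * test is "10 > 25", which is always false, so the low priority task is
 * always allowed to run next to the RT sibling -- the alignment artifact
 * Ken is pointing at.
 */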
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 22:58 ` Chen, Kenneth W @ 2006-06-03 0:02 ` Con Kolivas 2006-06-03 0:08 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-03 0:02 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 08:58, Chen, Kenneth W wrote: > You haven't answered my question either. What is the benefit of special > casing the initial stage of cpu resource competition? Is it quantitatively > measurable? If so, how much and with what workload? Ah you mean what the whole point of smt nice is? Yes it's simple enough to do. Take the single hyperthreaded cpu with two cpu bound workloads. Let's say I run a cpu bound task nice 0 by itself and it completes in time X. If I boot it with hyperthread disabled and run a nice 0 and nice 19 task, the nice 0 task gets 95% of the cpu and completes in time X*0.95. If I boot with hyperthread enabled and run the nice 0 and nice 19 tasks, the nice 0 task gets 100% of one sibling and the nice 19 task 100% of the other sibling. The nice 0 task completes in X*0.6. With the smt nice code added it completed in X*0.95. The ratios here are dependent on the workload but that was the average I could determine from comparing mprime workloads at differing nice and kernel compiles. There is no explicit way on the Intel smt cpus to tell it which sibling is running lower priority tasks (sprinkling mwaits around at regular intervals is not a realistic option for example). -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-03 0:02 ` Con Kolivas @ 2006-06-03 0:08 ` Chen, Kenneth W 2006-06-03 0:27 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-03 0:08 UTC (permalink / raw) To: 'Con Kolivas' Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 5:03 PM > On Saturday 03 June 2006 08:58, Chen, Kenneth W wrote: > > You haven't answered my question either. What is the benefit of special > > casing the initial stage of cpu resource competition? Is it quantitatively > > measurable? If so, how much and with what workload? > > Ah you mean what the whole point of smt nice is? Yes it's simple enough to do. > Take the single hyperthreaded cpu with two cpu bound workloads. Let's say I > run a cpu bound task nice 0 by itself and it completes in time X. If I boot > it with hyperthread disabled and run a nice 0 and nice 19 task, the nice 0 > task gets 95% of the cpu and completes in time X*0.95. If I boot with > hyperthread enabled and run the nice 0 and nice 19 tasks, the nice 0 task > gets 100% of one sibling and the nice 19 task 100% of the other sibling. The > nice 0 task completes in X*0.6. With the smt nice code added it completed in > X*0.95. The ratios here are dependent on the workload but that was the > average I could determine from comparing mprime workloads at differing nice > and kernel compiles. There is no explicit way on the Intel smt cpus to tell > it which sibling is running lower priority tasks (sprinkling mwaits around at > regular intervals is not a realistic option for example). I know what smt nice is doing, and it is fine to have. I'm simply proposing the following patch, on top of last roll up patch. diff -u ./kernel/sched.c ./kernel/sched.c --- ./kernel/sched.c 2006-06-02 16:05:13.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 17:03:55.000000000 -0700 @@ -2782,7 +2782,7 @@ /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) - goto check_smt_task; + continue; /* * If a user task with lower static priority than the @@ -2806,32 +2806,6 @@ smt_slice(smt_curr, sd) > task_timeslice(p)) ret = 1; -check_smt_task: - if ((!smt_curr->mm && smt_curr != smt_rq->idle) || - rt_task(smt_curr)) - continue; - if (!p->mm) { - wakeup_busy_runqueue(smt_rq); - continue; - } - - /* - * Reschedule a lower priority task on the SMT sibling for - * it to be put to sleep, or wake it up if it has been put to - * sleep for priority reasons to see if it should run now. - */ - if (rt_task(p)) { - if ((jiffies % DEF_TIMESLICE) > - (sd->per_cpu_gain * DEF_TIMESLICE / 100)) - resched_task(smt_curr); - } else { - if (TASK_PREEMPTS_CURR(p, smt_rq) && - smt_slice(p, sd) > task_timeslice(smt_curr)) - resched_task(smt_curr); - else - wakeup_busy_runqueue(smt_rq); - } - spin_unlock(&smt_rq->lock); } return ret; ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-03 0:08 ` Chen, Kenneth W @ 2006-06-03 0:27 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-03 0:27 UTC (permalink / raw) To: Chen, Kenneth W Cc: 'Nick Piggin', linux-kernel, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 10:08, Chen, Kenneth W wrote: > I know what smt nice is doing, and it is fine to have. I'm simply > proposing the following patch, on top of last roll up patch. It still breaks the semantics somewhat, but I'm willing to concede that the removed code only helps a very small proportion of the time (with the largest timeslices) and it sits in the fastpath. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W @ 2006-06-02 9:36 ` Chen, Kenneth W 2006-06-02 10:30 ` Con Kolivas 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 9:36 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. OK, it's down to nit-picking now: Remove this_rq argument from wake_sleeping_dependent() since it is not used. Nick, you had that in your earlier version, but it got lost in the woods. I don't like cpumask being declared on the stack. Here is my version to rid it out in wake_sleeping_dependent() and dependent_sleeper(). - Ken --- ./kernel/sched.c.orig 2006-06-02 03:20:40.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 03:20:59.000000000 -0700 @@ -2714,10 +2714,9 @@ static inline void wakeup_busy_runqueue( /* * Called with interrupts disabled and this_rq's runqueue locked. */ -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) @@ -2728,10 +2727,11 @@ static void wake_sleeping_dependent(int if (!sd) return; - sibling_map = sd->span; - cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq = cpu_rq(i); + + if (i == this_cpu) + continue; if (unlikely(!spin_trylock(&smt_rq->lock))) continue; @@ -2761,7 +2761,6 @@ static int dependent_sleeper(int this_cp struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int ret = 0, i; for_each_domain(this_cpu, tmp) @@ -2772,12 +2771,13 @@ static int dependent_sleeper(int this_cp if (!sd) return 0; - sibling_map = sd->span; - cpu_clear(this_cpu, sibling_map); - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq; task_t *smt_curr; + if (i == this_cpu) + continue; + smt_rq = cpu_rq(i); if (unlikely(!spin_trylock(&smt_rq->lock))) continue; @@ -2842,7 +2842,7 @@ check_smt_task: return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } @@ -2973,7 +2973,7 @@ need_resched_nonpreemptible: if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); + wake_sleeping_dependent(cpu); goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin 2006-06-02 9:17 ` Chen, Kenneth W 2006-06-02 9:36 ` Chen, Kenneth W @ 2006-06-02 10:30 ` Con Kolivas 2006-06-02 13:16 ` Con Kolivas 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:30 UTC (permalink / raw) To: Nick Piggin Cc: Chen, Kenneth W, linux-kernel, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 18:56, Nick Piggin wrote: > Chen, Kenneth W wrote: > > Ha, you beat me by one minute. It did cross my mind to use try lock there > > as well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). If we spinlock (and don't trylock as you're proposing) we'd have to do a double rq lock for each sibling. I guess half the time double_rq_lock will only be locking one runqueue... with 32 runqueues we either try to lock all 32 or lock 1.5 runqueues 32 times... ugh both are ugly. > Rolled up patch with everyone's changes attached. I'm still not sure that only doing trylock is adequate. Also, wake_sleeping_dependent is only called when a runqueue falls idle in schedule(), not when it's busy, so its cost (in my mind) is far less than that of dependent_sleeper. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
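For illustration, here is a self-contained userspace sketch (pthreads, not kernel code) of the two locking strategies being weighed in this subthread: taking both queue locks in a fixed global order, as double_rq_lock() does, versus opportunistically trylocking the sibling while already holding your own lock. All of the names are invented for the sketch; only the pattern corresponds to the scheduler discussion.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t rq_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rq_b = PTHREAD_MUTEX_INITIALIZER;

/* Analogue of double_rq_lock(): always take the lower address first so two
 * threads can never deadlock on the pair. */
static void lock_both(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	if (l1 > l2) {
		pthread_mutex_t *tmp = l1;
		l1 = l2;
		l2 = tmp;
	}
	pthread_mutex_lock(l1);
	pthread_mutex_lock(l2);
}

/* Analogue of the trylock approach: we already hold "mine" and simply skip
 * the cross-queue work if the sibling lock cannot be taken right now, so no
 * ordering rule is needed and we never block while holding our own lock. */
static int try_sibling(pthread_mutex_t *mine, pthread_mutex_t *sibling)
{
	int did_work = 0;

	pthread_mutex_lock(mine);
	if (pthread_mutex_trylock(sibling) == 0) {
		did_work = 1;		/* the cross-queue work would go here */
		pthread_mutex_unlock(sibling);
	}
	pthread_mutex_unlock(mine);
	return did_work;
}

int main(void)
{
	lock_both(&rq_b, &rq_a);	/* argument order does not matter */
	pthread_mutex_unlock(&rq_a);
	pthread_mutex_unlock(&rq_b);

	printf("trylock path did work: %d\n", try_sibling(&rq_a, &rq_b));
	return 0;
}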
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:30 ` Con Kolivas @ 2006-06-02 13:16 ` Con Kolivas 2006-06-02 21:54 ` Chen, Kenneth W 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 13:16 UTC (permalink / raw) To: linux-kernel Cc: Nick Piggin, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 20:30, Con Kolivas wrote: > On Friday 02 June 2006 18:56, Nick Piggin wrote: > > And why do we lock all siblings in the other case, for that matter? (not > > that it makes much difference except on niagara today). > > If we spinlock (and don't trylock as you're proposing) we'd have to do a > double rq lock for each sibling. I guess half the time double_rq_lock will > only be locking one runqueue... with 32 runqueues we either try to lock all > 32 or lock 1.5 runqueues 32 times... ugh both are ugly. Thinking some more on this, it is also clear that the concept of per_cpu_gain for smt is basically wrong once we get beyond straightforward 2-thread hyperthreading. If we have more than 2 thread units per physical core, the per cpu gain per logical core will decrease the more threads are running on it. While it's always been obvious the gain per logical core is entirely dependent on the type of workload and won't be a simple 25% increase in cpu power, it is clear that even if we assume an "overall" increase in cpu for each logical core added, there will be some non-linear function relating power increase to thread units used. :-| -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
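For context, per_cpu_gain is a single percentage per sched_domain, and (as the smt_slice() helper quoted elsewhere in the thread shows) it scales a task's timeslice by (100 - per_cpu_gain) / 100. A standalone version of that arithmetic makes the limitation concrete: whatever value is chosen, it is one constant, while the real gain per logical cpu varies with how many thread units are busy. The 100 ms slice and the 10-40 gain values below are illustrative assumptions only.

#include <stdio.h>

/* Same arithmetic as the kernel's smt_slice(): the share of a timeslice a
 * sibling is considered entitled to, for a given per-cpu gain. */
static unsigned long smt_slice(unsigned long time_slice, int per_cpu_gain)
{
	return time_slice * (100 - per_cpu_gain) / 100;
}

int main(void)
{
	unsigned long slice = 100;	/* ms, a large timeslice */
	int gain;

	/* One constant has to stand in for all of these cases. */
	for (gain = 10; gain <= 40; gain += 10)
		printf("per_cpu_gain=%2d%% -> smt_slice(%lu ms) = %lu ms\n",
		       gain, slice, smt_slice(slice, gain));
	return 0;
}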
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 13:16 ` Con Kolivas @ 2006-06-02 21:54 ` Chen, Kenneth W 2006-06-02 22:04 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 21:54 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel Cc: Nick Piggin, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 6:17 AM > On Friday 02 June 2006 20:30, Con Kolivas wrote: > > On Friday 02 June 2006 18:56, Nick Piggin wrote: > > > And why do we lock all siblings in the other case, for that matter? (not > > > that it makes much difference except on niagara today). > > > > If we spinlock (and don't trylock as you're proposing) we'd have to do a > > double rq lock for each sibling. I guess half the time double_rq_lock will > > only be locking one runqueue... with 32 runqueues we either try to lock all > > 32 or lock 1.5 runqueues 32 times... ugh both are ugly. > > Thinking some more on this, it is also clear that the concept of per_cpu_gain > for smt is basically wrong once we get beyond straightforward 2-thread > hyperthreading. If we have more than 2 thread units per physical core, the > per cpu gain per logical core will decrease the more threads are running on > it. While it's always been obvious the gain per logical core is entirely > dependent on the type of workload and won't be a simple 25% increase in cpu > power, it is clear that even if we assume an "overall" increase in cpu for > each logical core added, there will be some non-linear function relating > power increase to thread units used. :-| In the context of having more than 2 sibling CPUs in a domain, doesn't the current code also suffer from a thundering herd problem? When a high priority task goes to sleep, it will wake up *all* sibling sleepers and then they will all fight for the CPU resource, but potentially only one will win? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 21:54 ` Chen, Kenneth W @ 2006-06-02 22:04 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 22:04 UTC (permalink / raw) To: Chen, Kenneth W Cc: linux-kernel, Nick Piggin, 'Chris Mason', Ingo Molnar On Saturday 03 June 2006 07:54, Chen, Kenneth W wrote: > Con Kolivas wrote on Friday, June 02, 2006 6:17 AM > > Thinking some more on this, it is also clear that the concept of > > per_cpu_gain for smt is basically wrong once we get beyond straightforward > > 2-thread hyperthreading. If we have more than 2 thread units per > > physical core, the per cpu gain per logical core will decrease the more > > threads are running on it. While it's always been obvious the gain per > > logical core is entirely dependent on the type of workload and won't be a > > simple 25% increase in cpu power, it is clear that even if we assume an > > "overall" increase in cpu for each logical core added, there will be some > > non-linear function relating power increase to thread units used. :-| > > In the context of having more than 2 sibling CPUs in a domain, doesn't the > current code also suffer from a thundering herd problem? When a high > priority task goes to sleep, it will wake up *all* sibling sleepers and > then they will all fight for the CPU resource, but potentially only one will > win? Yes. The smt nice code was never designed with that many threads in mind. This is why I'm bringing it up for discussion. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
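As a generic illustration of the thundering herd pattern raised in this exchange (ordinary pthread condition-variable code, not the scheduler code under discussion): every sleeper is woken even though only one of them can use the freed resource, and the rest merely contend and go back to waiting.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define SLEEPERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int slots;	/* resources that have become available */

static void *sleeper(void *arg)
{
	long id = (long)arg;

	pthread_mutex_lock(&lock);
	while (slots == 0)
		pthread_cond_wait(&cond, &lock);	/* woken with the herd */
	slots--;					/* but only one wins */
	pthread_mutex_unlock(&lock);
	printf("sleeper %ld got the slot\n", id);
	return NULL;
}

int main(void)
{
	pthread_t tid[SLEEPERS];
	long i;

	for (i = 0; i < SLEEPERS; i++)
		pthread_create(&tid[i], NULL, sleeper, (void *)i);
	sleep(1);

	pthread_mutex_lock(&lock);
	slots = 1;
	pthread_cond_broadcast(&cond);	/* wakes all SLEEPERS for one slot */
	pthread_mutex_unlock(&lock);

	sleep(1);	/* the losers are back asleep by now; exiting kills them */
	return 0;
}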
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:56 ` Nick Piggin ` (2 preceding siblings ...) 2006-06-02 10:30 ` Con Kolivas @ 2006-06-02 22:14 ` Chen, Kenneth W 3 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 22:14 UTC (permalink / raw) To: 'Nick Piggin' Cc: Con Kolivas, linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 1:56 AM > Chen, Kenneth W wrote: > > > Ha, you beat me by one minute. It did cross my mind to use try lock there as > > well, take a look at my version, I think I have a better inner loop. > > Actually you *have* to use trylocks I think, because the current runqueue > is already locked. > > And why do we lock all siblings in the other case, for that matter? (not > that it makes much difference except on niagara today). > > Rolled up patch with everyone's changes attached. Rolled up patch on top of Nick's. This version doesn't have the change that removes the bully-ness in dependent_sleeper(), which is under debate right now (I still think it should be removed). --- ./kernel/sched.c.orig 2006-06-02 15:59:42.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 16:05:13.000000000 -0700 @@ -239,7 +239,6 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; - int cpu; #endif #ifdef CONFIG_SCHEDSTATS @@ -1728,9 +1727,6 @@ unsigned long nr_active(void) /* * double_rq_lock - safely lock two runqueues * - * We must take them in cpu order to match code in - * dependent_sleeper and wake_dependent_sleeper. - * * Note this does not disable interrupts like task_rq_lock, * you need to do so manually before calling. */ @@ -1742,7 +1738,7 @@ static void double_rq_lock(runqueue_t *r spin_lock(&rq1->lock); __acquire(rq2->lock); /* Fake it out ;) */ } else { - if (rq1->cpu < rq2->cpu) { + if (rq1 < rq2) { spin_lock(&rq1->lock); spin_lock(&rq2->lock); } else { @@ -1778,7 +1774,7 @@ static void double_lock_balance(runqueue __acquires(this_rq->lock) { if (unlikely(!spin_trylock(&busiest->lock))) { - if (busiest->cpu < this_rq->cpu) { + if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); @@ -2712,48 +2708,33 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +/* + * Called with interrupts disabled and this_rq's runqueue locked. + */ +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ - cpu_clear(this_cpu, sibling_map); - - for_each_cpu_mask(i, sibling_map) { + for_each_cpu_mask(i, sd->span) { runqueue_t *smt_rq = cpu_rq(i); + if (i == this_cpu) + continue; + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; + wakeup_busy_runqueue(smt_rq); + spin_unlock(&smt_rq->lock); } - - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); - /* - * We exit with this_cpu's rq still held and IRQs - * still disabled: - */ } /* @@ -2766,48 +2747,38 @@ static inline unsigned long smt_slice(ta return p->time_slice * (100 - sd->per_cpu_gain) / 100; } -static int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +/* + * To minimise lock contention and not have to drop this_rq's runlock we only + * trylock the sibling runqueues and bypass those runqueues if we fail to + * acquire their lock. As we only trylock the normal locking order does not + * need to be obeyed. + */ +static int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; - prio_array_t *array; int ret = 0, i; - task_t *p; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; - + break; + } if (!sd) return 0; - /* - * The same locking rules and details apply as for - * wake_sleeping_dependent(): - */ - spin_unlock(&this_rq->lock); - sibling_map = sd->span; - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sd->span) { + runqueue_t *smt_rq; + task_t *smt_curr; - /* - * Establish next task to be run - it might have gone away because - * we released the runqueue lock above: - */ - if (!this_rq->nr_running) - goto out_unlock; - array = this_rq->active; - if (!array->nr_active) - array = this_rq->expired; - BUG_ON(!array->nr_active); + if (i == this_cpu) + continue; - p = list_entry(array->queue[sched_find_first_bit(array->bitmap)].next, - task_t, run_list); + smt_rq = cpu_rq(i); + if (unlikely(!spin_trylock(&smt_rq->lock))) + continue; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); - task_t *smt_curr = smt_rq->curr; + smt_curr = smt_rq->curr; /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) @@ -2860,18 +2831,18 @@ check_smt_task: else wakeup_busy_runqueue(smt_rq); } + + spin_unlock(&smt_rq->lock); } -out_unlock: - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } -static inline int dependent_sleeper(int this_cpu, runqueue_t *this_rq) +static inline int dependent_sleeper(int this_cpu, struct runqueue *this_rq, + struct task_struct *p) { return 0; } @@ -2993,32 +2964,13 @@ need_resched_nonpreemptible: cpu = smp_processor_id(); if (unlikely(!rq->nr_running)) { -go_idle: idle_balance(cpu, rq); if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; - } - } else { - if (dependent_sleeper(cpu, rq)) { - next = rq->idle; + wake_sleeping_dependent(cpu); goto switch_tasks; } - /* - * dependent_sleeper() releases and reacquires the runqueue - 
* lock, hence go into the idle loop if the rq went - * empty meanwhile: - */ - if (unlikely(!rq->nr_running)) - goto go_idle; } array = rq->active; @@ -3056,6 +3008,8 @@ go_idle: } } next->sleep_type = SLEEP_NORMAL; + if (dependent_sleeper(cpu, rq, next)) + next = rq->idle; switch_tasks: if (next == rq->idle) schedstat_inc(rq, sched_goidle); @@ -6152,7 +6106,6 @@ void __init sched_init(void) rq->push_cpu = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); - rq->cpu = i; #endif atomic_set(&rq->nr_iowait, 0); ^ permalink raw reply [flat|nested] 41+ messages in thread
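To summarise the shape of the rolled-up patch above, here is a condensed sketch of the resulting schedule() flow with everything except the smt nice hooks elided. It is paraphrased from the diff rather than compilable on its own: wake_sleeping_dependent() no longer drops this_rq->lock (so the nr_running re-checks disappear), and dependent_sleeper() now runs after "next" has already been chosen, demoting it to the idle task when the sibling comparison says it should wait.

	if (unlikely(!rq->nr_running)) {
		idle_balance(cpu, rq);
		if (!rq->nr_running) {
			next = rq->idle;
			rq->expired_timestamp = 0;
			/* only trylocks the siblings; this_rq stays locked */
			wake_sleeping_dependent(cpu);
			goto switch_tasks;
		}
	}

	/* ... pick "next" from the active/expired arrays as before ... */

	/* only trylocks the siblings; on failure the check is simply skipped */
	if (dependent_sleeper(cpu, rq, next))
		next = rq->idle;

switch_tasks:
	/* ... */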
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 8:28 ` Nick Piggin 2006-06-02 8:34 ` Chen, Kenneth W @ 2006-06-02 10:19 ` Con Kolivas 2006-06-02 20:59 ` Chen, Kenneth W 1 sibling, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 10:19 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 18:28, Nick Piggin wrote: > Con Kolivas wrote: > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > >>This is a small micro-optimisation / cleanup we can do after > >>smtnice gets converted to use trylocks. Might result in a little > >>less cacheline footprint in some cases. > > > > It's only dependent_sleeper that is being converted in these patches. The > > wake_sleeping_dependent component still locks all runqueues and needs to > > Oh I missed that. > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. > > That > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > things don't sleep forever I think? We should be optimising for the most > common case, and in many workloads, the runqueue does go idle frequently. wake_priority_sleeper is only called per tick which can be 10ms at 100HZ. I don't think that's fast enough. It could even be possible for a lower priority task to always just miss the wakeup if it's (very) unlucky. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
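For concreteness, the worst-case delay being argued about here is one full timer tick, i.e. 1000/HZ milliseconds. A quick standalone calculation for the HZ settings commonly configured at the time (the three values below are assumed typical choices):

#include <stdio.h>

int main(void)
{
	int hz[] = { 100, 250, 1000 };
	int i;

	/* Tick-driven checks such as wake_priority_sleeper run once per tick,
	 * so a task parked by dependent_sleeper() may wait up to a whole tick
	 * period before it is reconsidered. */
	for (i = 0; i < 3; i++)
		printf("HZ=%4d -> up to %5.1f ms before the next check\n",
		       hz[i], 1000.0 / hz[i]);
	return 0;
}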
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 10:19 ` Con Kolivas @ 2006-06-02 20:59 ` Chen, Kenneth W 0 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 20:59 UTC (permalink / raw) To: 'Con Kolivas', Nick Piggin Cc: linux-kernel, 'Chris Mason', Ingo Molnar Con Kolivas wrote on Friday, June 02, 2006 3:19 AM > On Friday 02 June 2006 18:28, Nick Piggin wrote: > > Con Kolivas wrote: > > > On Friday 02 June 2006 17:53, Nick Piggin wrote: > > >>This is a small micro-optimisation / cleanup we can do after > > >>smtnice gets converted to use trylocks. Might result in a little > > >>less cacheline footprint in some cases. > > > > > > It's only dependent_sleeper that is being converted in these patches. The > > > wake_sleeping_dependent component still locks all runqueues and needs to > > > > Oh I missed that. > > > > > succeed in order to ensure a task doesn't keep sleeping indefinitely. > > > That > > > > Let's make it use trylocks as well. wake_priority_sleeper should ensure > > things don't sleep forever I think? We should be optimising for the most > > common case, and in many workloads, the runqueue does go idle frequently. > > wake_priority_sleeper is only called per tick which can be 10ms at 100HZ. I > don't think that's fast enough. It could even be possible for a lower > priority task to always just miss the wakeup if it's (very) unlucky. I see an inconsistency here: in the dependent_sleeper() function, you argued strongly that it is important for the high priority task to bully the low priority task off the sibling CPU. Here you are arguing just as strongly for putting the delayed low priority sleeper back onto the CPU as early as possible (where it could end up competing with the other high priority task). Such opposite treatment in the two paths doesn't look correct to me. ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 7:53 ` Nick Piggin 2006-06-02 8:17 ` Con Kolivas @ 2006-06-02 8:38 ` Chen, Kenneth W 1 sibling, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:38 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Friday, June 02, 2006 12:53 AM > Con Kolivas wrote: > > >>Nice to acknowledge Chris's idea for > >>trylocks in your changelog when you submit a final patch. > > > > > > I absolutely would and I would ask for him to sign off on it as well, once we > > agreed on a final form. > > This is a small micro-optimisation / cleanup we can do after > smtnice gets converted to use trylocks. Might result in a little > less cacheline footprint in some cases. Just to pile on one more micro-optimization: the kernel can break out of the for_each_domain loop as soon as it hits the first SD_SHARE_CPUPOWER domain. - Ken Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-01 23:14:47.000000000 -0700 +++ ./kernel/sched.c 2006-06-01 23:18:07.000000000 -0700 @@ -2780,8 +2780,10 @@ static int dependent_sleeper(int this_cp int ret = 0, i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; + break; + } if (!sd) return 0; ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 4:18 ` Nick Piggin 2006-06-02 6:08 ` Con Kolivas @ 2006-06-02 8:24 ` Chen, Kenneth W 1 sibling, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:24 UTC (permalink / raw) To: 'Nick Piggin', Con Kolivas Cc: linux-kernel, 'Chris Mason', Ingo Molnar Nick Piggin wrote on Thursday, June 01, 2006 9:19 PM > Con Kolivas wrote: > > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > > >>Actually looking even further, we only introduced the extra lookup of the > >>next task when we started unlocking the runqueue in schedule(). Since we > >>can get by without locking this_rq in schedule with this approach we can > >>simplify dependent_sleeper even further by doing the dependent sleeper > >>check after we have discovered what next is in schedule and avoid looking > >>it up twice. I'll hack something up to do that soon. > > > > > > Something like this (sorry I couldn't help but keep hacking on it). > > Looking pretty good. Nice to acknowledge Chris's idea for > trylocks in your changelog when you submit a final patch. Yes, as far as the lock is concerned in dependent_sleeper(), it looks pretty good. I do have other comments that I will follow up in another thread. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin @ 2006-06-02 8:31 ` Chen, Kenneth W 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:31 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel; +Cc: 'Chris Mason', Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 8:55 PM > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > Actually looking even further, we only introduced the extra lookup of the > > next task when we started unlocking the runqueue in schedule(). Since we > > can get by without locking this_rq in schedule with this approach we can > > simplify dependent_sleeper even further by doing the dependent sleeper > > check after we have discovered what next is in schedule and avoid looking > > it up twice. I'll hack something up to do that soon. > > Something like this (sorry I couldn't help but keep hacking on it). > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. The code in wake_sleeping_dependent() is also quite wacky: it unlocks current runqueue, then re-acquires ALL the sibling runqueue lock, only to call wakeup_busy_runqueue() against the smt sibling runqueue other than itself. AFAICT, wakeup_busy_runqueue() does not require *ALL* sibling lock to be held. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-02 01:57:28.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 02:19:37.000000000 -0700 @@ -2712,44 +2712,32 @@ static inline void wakeup_busy_runqueue( resched_task(rq->idle); } -static void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static void wake_sleeping_dependent(int this_cpu) { struct sched_domain *tmp, *sd = NULL; - cpumask_t sibling_map; int i; for_each_domain(this_cpu, tmp) - if (tmp->flags & SD_SHARE_CPUPOWER) + if (tmp->flags & SD_SHARE_CPUPOWER) { sd = tmp; + break; + } if (!sd) return; - /* - * Unlock the current runqueue because we have to lock in - * CPU order to avoid deadlocks. Caller knows that we might - * unlock. We keep IRQs disabled. - */ - spin_unlock(&this_rq->lock); - - sibling_map = sd->span; - - for_each_cpu_mask(i, sibling_map) - spin_lock(&cpu_rq(i)->lock); - /* - * We clear this CPU from the mask. 
This both simplifies the - * inner loop and keps this_rq locked when we exit: - */ - cpu_clear(this_cpu, sibling_map); + for_each_cpu_mask(i, sd->span) { + runqueue_t *smt_rq; - for_each_cpu_mask(i, sibling_map) { - runqueue_t *smt_rq = cpu_rq(i); + if (i == this_cpu) + continue; + smt_rq = cpu_rq(i); + spin_lock(&smt_rq->lock); wakeup_busy_runqueue(smt_rq); + spin_unlock(&smt_rq->lock); } - for_each_cpu_mask(i, sibling_map) - spin_unlock(&cpu_rq(i)->lock); /* * We exit with this_cpu's rq still held and IRQs * still disabled: @@ -2857,7 +2845,7 @@ check_smt_task: return ret; } #else -static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq) +static inline void wake_sleeping_dependent(int this_cpu) { } @@ -2988,14 +2976,8 @@ need_resched_nonpreemptible: if (!rq->nr_running) { next = rq->idle; rq->expired_timestamp = 0; - wake_sleeping_dependent(cpu, rq); - /* - * wake_sleeping_dependent() might have released - * the runqueue, so break out if we got new - * tasks meanwhile: - */ - if (!rq->nr_running) - goto switch_tasks; + wake_sleeping_dependent(cpu); + goto switch_tasks; } } ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:55 ` Con Kolivas 2006-06-02 4:18 ` Nick Piggin 2006-06-02 8:31 ` Chen, Kenneth W @ 2006-06-02 8:50 ` Chen, Kenneth W 2 siblings, 0 replies; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 8:50 UTC (permalink / raw) To: 'Con Kolivas', linux-kernel, 'Nick Piggin' Cc: 'Chris Mason', Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 8:55 PM > On Friday 02 June 2006 12:28, Con Kolivas wrote: > > Actually looking even further, we only introduced the extra lookup of the > > next task when we started unlocking the runqueue in schedule(). Since we > > can get by without locking this_rq in schedule with this approach we can > > simplify dependent_sleeper even further by doing the dependent sleeper > > check after we have discovered what next is in schedule and avoid looking > > it up twice. I'll hack something up to do that soon. > > Something like this (sorry I couldn't help but keep hacking on it). > --- > It is not critical to functioning that dependent_sleeper() succeeds every > time. We can significantly reduce the locking overhead and contention of > dependent_sleeper by only doing trylock on the smt sibling runqueues. As > we're only doing trylock it means we do not need to observe the normal > locking order and we can get away without unlocking this_rq in schedule(). > This provides us with an opportunity to simplify the code further. Why does dependent_sleeper() has to be so bully that it actively kick off the poor little guy on the smt sibling? Can't it just wait for the less fortunate process to finish off its time slice and then it can have more physical resource to spare? Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> --- ./kernel/sched.c.orig 2006-06-02 02:37:37.000000000 -0700 +++ ./kernel/sched.c 2006-06-02 02:41:20.000000000 -0700 @@ -2789,7 +2789,7 @@ static int dependent_sleeper(int this_cp /* Kernel threads do not participate in dependent sleeping */ if (!p->mm || !smt_curr->mm || rt_task(p)) - goto check_smt_task; + continue; /* * If a user task with lower static priority than the @@ -2812,32 +2812,6 @@ static int dependent_sleeper(int this_cp !TASK_PREEMPTS_CURR(p, smt_rq) && smt_slice(smt_curr, sd) > task_timeslice(p)) ret = 1; - -check_smt_task: - if ((!smt_curr->mm && smt_curr != smt_rq->idle) || - rt_task(smt_curr)) - continue; - if (!p->mm) { - wakeup_busy_runqueue(smt_rq); - continue; - } - - /* - * Reschedule a lower priority task on the SMT sibling for - * it to be put to sleep, or wake it up if it has been put to - * sleep for priority reasons to see if it should run now. - */ - if (rt_task(p)) { - if ((jiffies % DEF_TIMESLICE) > - (sd->per_cpu_gain * DEF_TIMESLICE / 100)) - resched_task(smt_curr); - } else { - if (TASK_PREEMPTS_CURR(p, smt_rq) && - smt_slice(p, sd) > task_timeslice(smt_curr)) - resched_task(smt_curr); - else - wakeup_busy_runqueue(smt_rq); - } } for_each_cpu_mask(i, sibling_map) ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 1:59 ` Con Kolivas 2006-06-02 2:28 ` Con Kolivas @ 2006-06-02 2:35 ` Chen, Kenneth W 2006-06-02 3:04 ` Con Kolivas 1 sibling, 1 reply; 41+ messages in thread From: Chen, Kenneth W @ 2006-06-02 2:35 UTC (permalink / raw) To: 'Con Kolivas'; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar Con Kolivas wrote on Thursday, June 01, 2006 6:59 PM > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > Hello everyone, > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 and > > > 2.6.5. We tracked down one of the regressions to lock contention in > > > schedule heavy workloads (~70,000 context switches per second) > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the lock > > > contention, hammering on the run queue locks. The patch below is more of > > > a discussion point than a suggested fix (although it does reduce lock > > > contention significantly). The dependent_sleeper code looks very > > > expensive to me, especially for using a spinlock to bounce control > > > between two different siblings in the same cpu. > > > > Yeah, this also sort of echo a recent discovery on one of our benchmarks > > that schedule() is red hot in the kernel. I was just scratching my head > > not sure what's going on. This dependent_sleeper could be the culprit > > considering it is called almost at every context switch. I don't think > > we are hitting on lock contention, but by the large amount of code it > > executes. It really needs to be cut down .... > > Thanks for the suggestion. How about something like this which takes your > idea and further expands on it. Compiled, boot and runtime tests ok. > > /* > + * Try to lock all the siblings of a cpu that is already locked. As we're > + * only doing trylock the order is not important. > + */ > +static int trylock_smt_siblings(cpumask_t sibling_map) > +{ > + cpumask_t locked_siblings; > + int i; > + > + cpus_clear(locked_siblings); > + for_each_cpu_mask(i, sibling_map) { > + if (!spin_trylock(&cpu_rq(i)->lock)) > + break; > + cpu_set(i, locked_siblings); > + > + } > + > + /* Did we lock all the siblings? */ > + if (cpus_equal(sibling_map, locked_siblings)) > + return 1; > + > + for_each_cpu_mask(i, locked_siblings) > + spin_unlock(&cpu_rq(i)->lock); > + return 0; > +} I like Chris's version of trylock_smt_siblings(). Why create all that local variables? sibling_map is passed by value, so the whole thing is duplicated on the stack (I think it should be pass by pointer), then there is another locked_siblings mask declared followed by a bitmask compare. The big iron people who set CONFIG_NR_CPUS=1024 won't be amused because of all that bitmask copying. Let me hack up something too, so we can compare notes etc. - Ken ^ permalink raw reply [flat|nested] 41+ messages in thread
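To put numbers on the "big iron" complaint above: a cpumask_t is a bitmap of NR_CPUS bits, so passing one by value copies the whole bitmap onto the stack, and comparing two masks walks it word by word. The calculation below is a standalone back-of-the-envelope model on a 64-bit machine; it does not use the real kernel type.

#include <stdio.h>

int main(void)
{
	int nr_cpus[] = { 8, 128, 1024 };
	int i;

	for (i = 0; i < 3; i++) {
		/* bitmaps are rounded up to whole unsigned longs */
		int words = (nr_cpus[i] + 8 * (int)sizeof(long) - 1) /
			    (8 * (int)sizeof(long));
		printf("NR_CPUS=%4d: %3d-byte cpumask copied per by-value call, "
		       "%2d long words per mask compare\n",
		       nr_cpus[i], words * (int)sizeof(long), words);
	}
	return 0;
}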
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 2:35 ` Chen, Kenneth W @ 2006-06-02 3:04 ` Con Kolivas 2006-06-02 3:23 ` Con Kolivas 0 siblings, 1 reply; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:04 UTC (permalink / raw) To: Chen, Kenneth W; +Cc: 'Chris Mason', linux-kernel, Ingo Molnar On Friday 02 June 2006 12:35, Chen, Kenneth W wrote: > Con Kolivas wrote on Thursday, June 01, 2006 6:59 PM > > > On Friday 02 June 2006 09:57, Chen, Kenneth W wrote: > > > Chris Mason wrote on Thursday, June 01, 2006 3:56 PM > > > > > > > Hello everyone, > > > > > > > > Recent benchmarks showed some performance regressions between 2.6.16 > > > > and 2.6.5. We tracked down one of the regressions to lock contention > > > > in schedule heavy workloads (~70,000 context switches per second) > > > > > > > > kernel/sched.c:dependent_sleeper() was responsible for most of the > > > > lock contention, hammering on the run queue locks. The patch below > > > > is more of a discussion point than a suggested fix (although it does > > > > reduce lock contention significantly). The dependent_sleeper code > > > > looks very expensive to me, especially for using a spinlock to bounce > > > > control between two different siblings in the same cpu. > > > > > > Yeah, this also sort of echo a recent discovery on one of our > > > benchmarks that schedule() is red hot in the kernel. I was just > > > scratching my head not sure what's going on. This dependent_sleeper > > > could be the culprit considering it is called almost at every context > > > switch. I don't think we are hitting on lock contention, but by the > > > large amount of code it executes. It really needs to be cut down .... > > > > Thanks for the suggestion. How about something like this which takes your > > idea and further expands on it. Compiled, boot and runtime tests ok. > > > > /* > > + * Try to lock all the siblings of a cpu that is already locked. As > > we're + * only doing trylock the order is not important. > > + */ > > +static int trylock_smt_siblings(cpumask_t sibling_map) > > +{ > > + cpumask_t locked_siblings; > > + int i; > > + > > + cpus_clear(locked_siblings); > > + for_each_cpu_mask(i, sibling_map) { > > + if (!spin_trylock(&cpu_rq(i)->lock)) > > + break; > > + cpu_set(i, locked_siblings); > > + > > + } > > + > > + /* Did we lock all the siblings? */ > > + if (cpus_equal(sibling_map, locked_siblings)) > > + return 1; > > + > > + for_each_cpu_mask(i, locked_siblings) > > + spin_unlock(&cpu_rq(i)->lock); > > + return 0; > > +} > > I like Chris's version of trylock_smt_siblings(). Why create all that > local variables? Because cpumasks are so pretty. > sibling_map is passed by value, so the whole thing is > duplicated on the stack (I think it should be pass by pointer), then > there is another locked_siblings mask declared followed by a bitmask > compare. The big iron people who set CONFIG_NR_CPUS=1024 won't be > amused because of all that bitmask copying. Good point. > > Let me hack up something too, so we can compare notes etc. Sure, you have the hack lock. Actually we can just trylock and if it fails remove it from the sibling_map and go to out_unlock without a separate trylock_smt_cpus function entirely... -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH RFC] smt nice introduces significant lock contention 2006-06-02 3:04 ` Con Kolivas @ 2006-06-02 3:23 ` Con Kolivas 0 siblings, 0 replies; 41+ messages in thread From: Con Kolivas @ 2006-06-02 3:23 UTC (permalink / raw) To: linux-kernel; +Cc: Chen, Kenneth W, 'Chris Mason', Ingo Molnar On Friday 02 June 2006 13:04, Con Kolivas wrote: > Sure, you have the hack lock. Actually we can just trylock and if it fails > remove it from the sibling_map and go to out_unlock without a separate > trylock_smt_cpus function entirely... I mean go to out_unlock if the sibling_map is empty, sorry. -- -ck ^ permalink raw reply [flat|nested] 41+ messages in thread
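Putting the suggestion and its correction together, the inline alternative described here would look roughly like the sketch below. It is not a tested patch: cpus_empty() is assumed from the cpumask helpers of that era, while the other identifiers all appear elsewhere in the thread.

	sibling_map = sd->span;
	cpu_clear(this_cpu, sibling_map);

	for_each_cpu_mask(i, sibling_map) {
		/* drop any sibling whose lock we cannot get right now */
		if (!spin_trylock(&cpu_rq(i)->lock))
			cpu_clear(i, sibling_map);
	}
	if (cpus_empty(sibling_map))
		goto out_unlock;	/* locked nothing, nothing to do */

	for_each_cpu_mask(i, sibling_map) {
		/* ... the usual sibling comparisons against cpu_rq(i) ... */
	}

out_unlock:
	for_each_cpu_mask(i, sibling_map)
		spin_unlock(&cpu_rq(i)->lock);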
end of thread, other threads:[~2006-06-03 0:27 UTC | newest]

Thread overview: 41+ messages
2006-06-01 22:55 [PATCH RFC] smt nice introduces significant lock contention Chris Mason
2006-06-01 23:57 ` Chen, Kenneth W
2006-06-02  1:59 ` Con Kolivas
2006-06-02  2:28 ` Con Kolivas
2006-06-02  3:55 ` Con Kolivas
2006-06-02  4:18 ` Nick Piggin
2006-06-02  6:08 ` Con Kolivas
2006-06-02  7:53 ` Nick Piggin
2006-06-02  8:17 ` Con Kolivas
2006-06-02  8:28 ` Nick Piggin
2006-06-02  8:34 ` Chen, Kenneth W
2006-06-02  8:56 ` Nick Piggin
2006-06-02  9:17 ` Chen, Kenneth W
2006-06-02  9:25 ` Con Kolivas
2006-06-02  9:31 ` Chen, Kenneth W
2006-06-02  9:34 ` Con Kolivas
2006-06-02  9:53 ` Chen, Kenneth W
2006-06-02 10:12 ` Con Kolivas
2006-06-02 20:53 ` Chen, Kenneth W
2006-06-02 22:15 ` Con Kolivas
2006-06-02 22:19 ` Chen, Kenneth W
2006-06-02 22:31 ` Con Kolivas
2006-06-02 22:58 ` Chen, Kenneth W
2006-06-03  0:02 ` Con Kolivas
2006-06-03  0:08 ` Chen, Kenneth W
2006-06-03  0:27 ` Con Kolivas
2006-06-02  9:36 ` Chen, Kenneth W
2006-06-02 10:30 ` Con Kolivas
2006-06-02 13:16 ` Con Kolivas
2006-06-02 21:54 ` Chen, Kenneth W
2006-06-02 22:04 ` Con Kolivas
2006-06-02 22:14 ` Chen, Kenneth W
2006-06-02 10:19 ` Con Kolivas
2006-06-02 20:59 ` Chen, Kenneth W
2006-06-02  8:38 ` Chen, Kenneth W
2006-06-02  8:24 ` Chen, Kenneth W
2006-06-02  8:31 ` Chen, Kenneth W
2006-06-02  8:50 ` Chen, Kenneth W
2006-06-02  2:35 ` Chen, Kenneth W
2006-06-02  3:04 ` Con Kolivas
2006-06-02  3:23 ` Con Kolivas