From: Frank Rowand <frank.rowand@am.sony.com>
To: Chris Mason <chris.mason@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@elte.hu>, "axboe@kernel.dk" <axboe@kernel.dk>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Mike Galbraith <efault@gmx.de>, Oleg Nesterov <oleg@redhat.com>,
tglx <tglx@linutronix.de>
Subject: Re: [PATCH RFC] reduce runqueue lock contention
Date: Wed, 1 Dec 2010 15:13:24 -0800 [thread overview]
Message-ID: <4CF6D694.4030003@am.sony.com> (raw)
In-Reply-To: <1277284257.1875.820.camel@laptop>
On 06/23/10 02:10, Peter Zijlstra wrote:
> On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote:
>> * Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> So this one boots and builds a kernel on a dual-socket nehalem.
>>>
>>> there's still quite a number of XXXs to fix, but I don't think any of the
>>> races are crashing potential, mostly wrong accounting and scheduling iffies
>>> like.
>>>
>>> But give it a go.. see what it does for you (x86 only for now).
>>>
>>> Ingo, any comments other than, eew, scary? :-)
>>
>> None, other than a question: which future kernel do you aim it for? I'd prefer
>> v2.6.50 or later ;-)
>
> Well, assuming it all works out and actually reduces runqueue lock
> contention we still need to sort out all those XXXs in there, I'd say at
> the soonest somewhere near .38/.39 or so.
>
> Its definitely not something we should rush in.
This thread was started by Chris Mason back in May:
http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html
The problem he was trying to fix is:
> Many different workloads end
> up hammering very hard on try_to_wake_up, to the point where the
> runqueue locks dominate CPU profiles.
>
> This includes benchmarking really fast IO subsystems, fast networking,
> fast pipes...well anywhere that we lots of procs on lots of cpus waiting
> for short periods on lots of different things.
Chris provided some code as a starting point for a solution.
Peter Zijlstra had some good ideas, and came up with some alternate code,
culminating with:
http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html
Building on this previous work, I have another patch to try to address
the problem. I have taken some of Peter's code (the cmpxchg() based
queueing and unqueueing, plus the cross cpu interrupt), but taken a
simpler (and hopefully less scary) approach otherwise:
If the task to be woken is on a run queue on a different cpu then use
cmpxchg() to put it onto a pending try_to_wake_up list on the different
cpu. Then send an interrupt to the different cpu to cause that cpu to
call try_to_wake_up() for each process on the try_to_wake_up list.
The result is that the initial run queue lock acquired by try_to_wake_up()
will be on the cpu we are currently executing on, not a different cpu.
This patch does not address the run queue lock contention that may occur
if try_to_wake_up() chooses to move the waking process to another cpu,
based on the result returned by select_task_rq().
The patch was created on the -top tree.
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
Chris, can you check the performance of this patch on your large system?
Limitations
x86 only
Tests
- tested on 2 cpu x86_64
- very simplistic workload
- results:
rq->lock contention count reduced by ~ 95%
rq->lock contention wait time reduced by ~ 90%
test duration reduced by ~ 0.5% - 4% (in the noise)
Review goals:
(1) performance results
(2) architectural comments
Review non-goal:
code style, etc (but will be a goal in a future review round)
Todo:
- add support for additional architectures
- polish code style
- add a schedule feature to control whether to use the new algorithm
- verify that smp_wmb() is implied by cmpxchg() on x86, so that the explicit
smp_wmb() in ttwu_queue_wake_up() can be removed.
---
arch/x86/kernel/smp.c | 1 1 + 0 - 0 !
include/linux/sched.h | 5 5 + 0 - 0 !
kernel/sched.c | 105 99 + 6 - 0 !
3 files changed, 105 insertions(+), 6 deletions(-)
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c
+++ linux-2.6/arch/x86/kernel/smp.c
@@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_
/*
* KVM uses this interrupt to force a cpu out of guest mode
*/
+ sched_ttwu_pending();
}
void smp_call_function_interrupt(struct pt_regs *regs)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1038,6 +1038,7 @@ struct sched_domain;
*/
#define WF_SYNC 0x01 /* waker goes to sleep after wakup */
#define WF_FORK 0x02 /* child wakeup after fork */
+#define WF_LOAD 0x04 /* for queued try_to_wake_up() */
#define ENQUEUE_WAKEUP 1
#define ENQUEUE_WAKING 2
@@ -1193,6 +1194,9 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifdef CONFIG_SMP
+ struct task_struct *ttwu_queue_wake_entry;
+ int ttwu_queue_load;
+ int ttwu_queue_wake_flags;
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
@@ -2017,6 +2021,7 @@ extern void release_uids(struct user_nam
extern void do_timer(unsigned long ticks);
+extern void sched_ttwu_pending(void);
extern int wake_up_state(struct task_struct *tsk, unsigned int state);
extern int wake_up_process(struct task_struct *tsk);
extern void wake_up_new_task(struct task_struct *tsk,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -515,6 +515,8 @@ struct rq {
u64 age_stamp;
u64 idle_stamp;
u64 avg_idle;
+
+ struct task_struct *wake_list;
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -2332,6 +2334,39 @@ static inline void ttwu_post_activation(
wq_worker_waking_up(p, cpu_of(rq));
}
+#ifdef CONFIG_SMP
+static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags,
+ int load)
+{
+ struct task_struct *next = NULL;
+ struct rq *rq = cpu_rq(cpu);
+
+ if (load)
+ wake_flags |= WF_LOAD;
+ p->ttwu_queue_load = load;
+ p->ttwu_queue_wake_flags = wake_flags;
+ /* xxx
+ * smp_wmb() is implied by cmpxchg()
+ * (see Documentation/memory-barriers.txt).
+ * It is the case for arm.
+ * I don't know about x86, so do it explicitly until I know for sure.
+ */
+ smp_wmb();
+
+ for (;;) {
+ struct task_struct *old = next;
+
+ p->ttwu_queue_wake_entry = next;
+ next = cmpxchg(&rq->wake_list, old, p);
+ if (next == old)
+ break;
+ }
+
+ if (!next)
+ smp_send_reschedule(cpu);
+}
+#endif
+
/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
@@ -2354,13 +2389,51 @@ static int try_to_wake_up(struct task_st
unsigned long flags;
unsigned long en_flags = ENQUEUE_WAKEUP;
struct rq *rq;
+#ifdef CONFIG_SMP
+ int load;
+#endif
this_cpu = get_cpu();
+ local_irq_save(flags);
+
+#ifdef CONFIG_SMP
+ for (;;) {
+ unsigned int task_state = p->state;
+
+ if (!(task_state & state))
+ goto out_nolock;
+
+ /*
+ * We've got to store the contributes_to_load state before
+ * modifying the task state.
+ */
+ load = task_contributes_to_load(p);
+
+ if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) {
+ if (state == TASK_WAKING)
+ load = (wake_flags & WF_LOAD) ? 1 : 0;
+ break;
+ }
+ }
+
+ /*
+ * There is a race where task_cpu could be set to
+ * this_cpu while task_state is TASK_WAKING?
+ *
+ * That's ok, the destination cpu will just send it back here when
+ * it calls try_to_wake_up() of this process.
+ */
+
+ cpu = task_cpu(p);
+ if (cpu != this_cpu) {
+ ttwu_queue_wake_up(p, cpu, wake_flags, load);
+ goto out_nolock;
+ }
+#endif
+
smp_wmb();
- rq = task_rq_lock(p, &flags);
- if (!(p->state & state))
- goto out;
+ rq = __task_rq_lock(p);
if (p->se.on_rq)
goto out_running;
@@ -2373,18 +2446,16 @@ static int try_to_wake_up(struct task_st
goto out_activate;
/*
- * In order to handle concurrent wakeups and release the rq->lock
- * we put the task in TASK_WAKING state.
- *
- * First fix up the nr_uninterruptible count:
+ * Can handle concurrent wakeups and release the rq->lock
+ * since we put the task in TASK_WAKING state.
*/
- if (task_contributes_to_load(p)) {
+
+ if (load) {
if (likely(cpu_online(orig_cpu)))
rq->nr_uninterruptible--;
else
this_rq()->nr_uninterruptible--;
}
- p->state = TASK_WAKING;
if (p->sched_class->task_waking) {
p->sched_class->task_waking(rq, p);
@@ -2430,13 +2501,32 @@ out_activate:
success = 1;
out_running:
ttwu_post_activation(p, rq, wake_flags, success);
-out:
- task_rq_unlock(rq, &flags);
+ __task_rq_unlock(rq);
+#ifdef CONFIG_SMP
+out_nolock:
+#endif
+ local_irq_restore(flags);
put_cpu();
return success;
}
+#ifdef CONFIG_SMP
+void sched_ttwu_pending(void)
+{
+ struct rq *rq = this_rq();
+ struct task_struct *p = xchg(&rq->wake_list, NULL);
+
+ if (!p)
+ return;
+
+ while (p) {
+ try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags);
+ p = p->ttwu_queue_wake_entry;
+ }
+}
+#endif
+
/**
* try_to_wake_up_local - try to wake up a local task with rq lock held
* @p: the thread to be awakened
next prev parent reply other threads:[~2010-12-01 23:13 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-05-20 20:48 [PATCH RFC] reduce runqueue lock contention Chris Mason
2010-05-20 21:09 ` Peter Zijlstra
2010-05-20 21:23 ` Peter Zijlstra
2010-05-20 22:17 ` Chris Mason
2010-05-20 22:21 ` Chris Mason
2010-06-04 10:56 ` Stijn Devriendt
2010-06-04 12:00 ` Peter Zijlstra
2010-06-05 9:37 ` Stijn Devriendt
2010-06-21 10:02 ` Peter Zijlstra
2010-06-21 10:54 ` Peter Zijlstra
2010-06-21 13:04 ` Peter Zijlstra
2010-06-22 13:33 ` Peter Zijlstra
2010-06-22 21:11 ` Ingo Molnar
2010-06-23 9:10 ` Peter Zijlstra
2010-12-01 23:13 ` Frank Rowand [this message]
2010-12-02 1:17 ` Frank Rowand
2010-12-02 7:36 ` Peter Zijlstra
2010-12-14 2:41 ` Frank Rowand
2010-12-14 3:42 ` Mike Galbraith
2010-12-14 21:42 ` Frank Rowand
2010-12-15 18:59 ` Oleg Nesterov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4CF6D694.4030003@am.sony.com \
--to=frank.rowand@am.sony.com \
--cc=axboe@kernel.dk \
--cc=chris.mason@oracle.com \
--cc=efault@gmx.de \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.