From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756612Ab0LAXN4 (ORCPT ); Wed, 1 Dec 2010 18:13:56 -0500 Received: from tx2ehsobe004.messaging.microsoft.com ([65.55.88.14]:58395 "EHLO TX2EHSOBE009.bigfish.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753581Ab0LAXNy (ORCPT ); Wed, 1 Dec 2010 18:13:54 -0500 X-SpamScore: -18 X-BigFish: VPS-18(zzbb2dK936eK1432N98dNef8K853kzz1202hzz8275bh8275dhz2fh2a8h691h637h668h61h) X-Spam-TCS-SCL: 0:0 X-Forefront-Antispam-Report: KIP:(null);UIP:(null);IPVD:NLI;H:mail7.fw-bc.sony.com;RD:mail7.fw-bc.sony.com;EFVD:NLI Message-ID: <4CF6D694.4030003@am.sony.com> Date: Wed, 1 Dec 2010 15:13:24 -0800 From: Frank Rowand Reply-To: User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090814 Fedora/3.0-2.6.b3.fc11 Thunderbird/3.0b3 MIME-Version: 1.0 To: Chris Mason CC: Peter Zijlstra , Ingo Molnar , "axboe@kernel.dk" , "linux-kernel@vger.kernel.org" , Mike Galbraith , Oleg Nesterov , tglx Subject: Re: [PATCH RFC] reduce runqueue lock contention References: <20100520204810.GA19188@think> <1277114522.1875.469.camel@laptop> <1277117647.1875.503.camel@laptop> <1277125479.1875.510.camel@laptop> <1277213586.1875.704.camel@laptop> <20100622211141.GC21149@elte.hu> <1277284257.1875.820.camel@laptop> In-Reply-To: <1277284257.1875.820.camel@laptop> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-OriginatorOrg: am.sony.com Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/23/10 02:10, Peter Zijlstra wrote: > On Tue, 2010-06-22 at 23:11 +0200, Ingo Molnar wrote: >> * Peter Zijlstra wrote: >> >>> So this one boots and builds a kernel on a dual-socket nehalem. >>> >>> there's still quite a number of XXXs to fix, but I don't think any of the >>> races are crashing potential, mostly wrong accounting and scheduling iffies >>> like. >>> >>> But give it a go.. see what it does for you (x86 only for now). >>> >>> Ingo, any comments other than, eew, scary? :-) >> >> None, other than a question: which future kernel do you aim it for? I'd prefer >> v2.6.50 or later ;-) > > Well, assuming it all works out and actually reduces runqueue lock > contention we still need to sort out all those XXXs in there, I'd say at > the soonest somewhere near .38/.39 or so. > > Its definitely not something we should rush in. This thread was started by Chris Mason back in May: http://lkml.indiana.edu/hypermail/linux/kernel/1005.2/02329.html The problem he was trying to fix is: > Many different workloads end > up hammering very hard on try_to_wake_up, to the point where the > runqueue locks dominate CPU profiles. > > This includes benchmarking really fast IO subsystems, fast networking, > fast pipes...well anywhere that we lots of procs on lots of cpus waiting > for short periods on lots of different things. Chris provided some code as a starting point for a solution. Peter Zijlstra had some good ideas, and came up with some alternate code, culminating with: http://lkml.indiana.edu/hypermail/linux/kernel/1006.2/02381.html Building on this previous work, I have another patch to try to address the problem. I have taken some of Peter's code (the cmpxchg() based queueing and unqueueing, plus the cross cpu interrupt), but taken a simpler (and hopefully less scary) approach otherwise: If the task to be woken is on a run queue on a different cpu then use cmpxchg() to put it onto a pending try_to_wake_up list on the different cpu. Then send an interrupt to the different cpu to cause that cpu to call try_to_wake_up() for each process on the try_to_wake_up list. The result is that the initial run queue lock acquired by try_to_wake_up() will be on the cpu we are currently executing on, not a different cpu. This patch does not address the run queue lock contention that may occur if try_to_wake_up() chooses to move the waking process to another cpu, based on the result returned by select_task_rq(). The patch was created on the -top tree. Signed-off-by: Frank Rowand Chris, can you check the performance of this patch on your large system? Limitations x86 only Tests - tested on 2 cpu x86_64 - very simplistic workload - results: rq->lock contention count reduced by ~ 95% rq->lock contention wait time reduced by ~ 90% test duration reduced by ~ 0.5% - 4% (in the noise) Review goals: (1) performance results (2) architectural comments Review non-goal: code style, etc (but will be a goal in a future review round) Todo: - add support for additional architectures - polish code style - add a schedule feature to control whether to use the new algorithm - verify that smp_wmb() is implied by cmpxchg() on x86, so that the explicit smp_wmb() in ttwu_queue_wake_up() can be removed. --- arch/x86/kernel/smp.c | 1 1 + 0 - 0 ! include/linux/sched.h | 5 5 + 0 - 0 ! kernel/sched.c | 105 99 + 6 - 0 ! 3 files changed, 105 insertions(+), 6 deletions(-) Index: linux-2.6/arch/x86/kernel/smp.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/smp.c +++ linux-2.6/arch/x86/kernel/smp.c @@ -205,6 +205,7 @@ void smp_reschedule_interrupt(struct pt_ /* * KVM uses this interrupt to force a cpu out of guest mode */ + sched_ttwu_pending(); } void smp_call_function_interrupt(struct pt_regs *regs) Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1038,6 +1038,7 @@ struct sched_domain; */ #define WF_SYNC 0x01 /* waker goes to sleep after wakup */ #define WF_FORK 0x02 /* child wakeup after fork */ +#define WF_LOAD 0x04 /* for queued try_to_wake_up() */ #define ENQUEUE_WAKEUP 1 #define ENQUEUE_WAKING 2 @@ -1193,6 +1194,9 @@ struct task_struct { int lock_depth; /* BKL lock depth */ #ifdef CONFIG_SMP + struct task_struct *ttwu_queue_wake_entry; + int ttwu_queue_load; + int ttwu_queue_wake_flags; #ifdef __ARCH_WANT_UNLOCKED_CTXSW int oncpu; #endif @@ -2017,6 +2021,7 @@ extern void release_uids(struct user_nam extern void do_timer(unsigned long ticks); +extern void sched_ttwu_pending(void); extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); extern void wake_up_new_task(struct task_struct *tsk, Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -515,6 +515,8 @@ struct rq { u64 age_stamp; u64 idle_stamp; u64 avg_idle; + + struct task_struct *wake_list; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING @@ -2332,6 +2334,39 @@ static inline void ttwu_post_activation( wq_worker_waking_up(p, cpu_of(rq)); } +#ifdef CONFIG_SMP +static void ttwu_queue_wake_up(struct task_struct *p, int cpu, int wake_flags, + int load) +{ + struct task_struct *next = NULL; + struct rq *rq = cpu_rq(cpu); + + if (load) + wake_flags |= WF_LOAD; + p->ttwu_queue_load = load; + p->ttwu_queue_wake_flags = wake_flags; + /* xxx + * smp_wmb() is implied by cmpxchg() + * (see Documentation/memory-barriers.txt). + * It is the case for arm. + * I don't know about x86, so do it explicitly until I know for sure. + */ + smp_wmb(); + + for (;;) { + struct task_struct *old = next; + + p->ttwu_queue_wake_entry = next; + next = cmpxchg(&rq->wake_list, old, p); + if (next == old) + break; + } + + if (!next) + smp_send_reschedule(cpu); +} +#endif + /** * try_to_wake_up - wake up a thread * @p: the thread to be awakened @@ -2354,13 +2389,51 @@ static int try_to_wake_up(struct task_st unsigned long flags; unsigned long en_flags = ENQUEUE_WAKEUP; struct rq *rq; +#ifdef CONFIG_SMP + int load; +#endif this_cpu = get_cpu(); + local_irq_save(flags); + +#ifdef CONFIG_SMP + for (;;) { + unsigned int task_state = p->state; + + if (!(task_state & state)) + goto out_nolock; + + /* + * We've got to store the contributes_to_load state before + * modifying the task state. + */ + load = task_contributes_to_load(p); + + if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state) { + if (state == TASK_WAKING) + load = (wake_flags & WF_LOAD) ? 1 : 0; + break; + } + } + + /* + * There is a race where task_cpu could be set to + * this_cpu while task_state is TASK_WAKING? + * + * That's ok, the destination cpu will just send it back here when + * it calls try_to_wake_up() of this process. + */ + + cpu = task_cpu(p); + if (cpu != this_cpu) { + ttwu_queue_wake_up(p, cpu, wake_flags, load); + goto out_nolock; + } +#endif + smp_wmb(); - rq = task_rq_lock(p, &flags); - if (!(p->state & state)) - goto out; + rq = __task_rq_lock(p); if (p->se.on_rq) goto out_running; @@ -2373,18 +2446,16 @@ static int try_to_wake_up(struct task_st goto out_activate; /* - * In order to handle concurrent wakeups and release the rq->lock - * we put the task in TASK_WAKING state. - * - * First fix up the nr_uninterruptible count: + * Can handle concurrent wakeups and release the rq->lock + * since we put the task in TASK_WAKING state. */ - if (task_contributes_to_load(p)) { + + if (load) { if (likely(cpu_online(orig_cpu))) rq->nr_uninterruptible--; else this_rq()->nr_uninterruptible--; } - p->state = TASK_WAKING; if (p->sched_class->task_waking) { p->sched_class->task_waking(rq, p); @@ -2430,13 +2501,32 @@ out_activate: success = 1; out_running: ttwu_post_activation(p, rq, wake_flags, success); -out: - task_rq_unlock(rq, &flags); + __task_rq_unlock(rq); +#ifdef CONFIG_SMP +out_nolock: +#endif + local_irq_restore(flags); put_cpu(); return success; } +#ifdef CONFIG_SMP +void sched_ttwu_pending(void) +{ + struct rq *rq = this_rq(); + struct task_struct *p = xchg(&rq->wake_list, NULL); + + if (!p) + return; + + while (p) { + try_to_wake_up(p, TASK_WAKING, p->ttwu_queue_wake_flags); + p = p->ttwu_queue_wake_entry; + } +} +#endif + /** * try_to_wake_up_local - try to wake up a local task with rq lock held * @p: the thread to be awakened