From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756776Ab2DXQ4d (ORCPT );
	Tue, 24 Apr 2012 12:56:33 -0400
Received: from e28smtp05.in.ibm.com ([122.248.162.5]:48739 "EHLO
	e28smtp05.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755721Ab2DXQ42 (ORCPT );
	Tue, 24 Apr 2012 12:56:28 -0400
Date: Tue, 24 Apr 2012 22:26:19 +0530
From: Srivatsa Vaddagiri
To: Peter Zijlstra, Ingo Molnar
Cc: Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel@vger.kernel.org
Subject: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
Message-ID: <20120424165619.GA28701@linux.vnet.ibm.com>
Reply-To: Srivatsa Vaddagiri
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
x-cbid: 12042416-8256-0000-0000-0000021FD019
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

During my investigation of a performance issue, I found that we can do a
better job of reducing latencies for a waking task by steering it towards
a cpu where it will get better sleeper credits.

Consider a system with two nodes: N0-N1, each with 4 cpus and a cgroup /a
which is of highest priority. Further, all 4 cpus in a node are in the
same llc (MC) domain.

                          N0             N1
                      (0,1,2,3)      (4,5,6,7)

rq.nr_run         ->   2 1 1 1        2 2 1 1
/a cfs_rq.nr_run  ->   0 0 0 0        0 0 0 1

Consider a task of "/a" waking up after a short (< sysctl_sched_latency)
sleep. Its prev_cpu was 7. select_idle_sibling(), failing to find an idle
core, simply wakes up the task on CPU7, where it may be unable to preempt
the currently running task (as its new vruntime is not sufficiently behind
the currently running task's vruntime, owing to the short sleep it
incurred). As a result, the woken task is unable to run immediately and
thus incurs some latency.

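To make the "not sufficiently behind" argument concrete, here is a rough
user-space sketch of wakeup placement and the preemption check. It is a
simplified model under stated assumptions, not the kernel code: the
function names place_on_wakeup() and can_preempt() are hypothetical, the
sleeper credit is assumed to be half of the latency target, and the
wakeup granularity is passed in as a plain number.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of wakeup placement: a waking entity is placed no
 * further back than (min_vruntime - credit), and it never moves backwards
 * relative to the vruntime it already had.
 * Assumption: the credit is half of the latency target.
 */
static uint64_t place_on_wakeup(uint64_t se_vruntime, uint64_t min_vruntime,
				uint64_t latency_ns)
{
	uint64_t credit = latency_ns / 2;
	uint64_t floor = min_vruntime > credit ? min_vruntime - credit : 0;

	return se_vruntime > floor ? se_vruntime : floor;
}

/*
 * Hypothetical model of the wakeup preemption test: the woken entity
 * preempts only if it is behind the running one by more than the
 * wakeup granularity.
 */
static bool can_preempt(uint64_t curr_vruntime, uint64_t woken_vruntime,
			uint64_t gran_ns)
{
	return curr_vruntime > woken_vruntime &&
	       curr_vruntime - woken_vruntime > gran_ns;
}
```

With min_vruntime at 100ms, a 6ms latency target and a 1ms granularity
(all made-up numbers), an entity that slept briefly and still carries a
vruntime of 99.5ms keeps that value and cannot preempt a task running at
100ms. An entity that was off the runqueue for long is placed at 97ms and
can preempt.
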
A better choice would be to find a cpu in cpu7's MC domain where its
cgroup has 0 tasks (thus allowing the waking task to get better sleeper
credits). The patch below implements this idea.

Some results with various benchmarks are enclosed.

Machine  : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
Kernel   : tip (HEAD at 2adb096)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
         various cgroups) are run on host. Cgroup setup is as below:

	/libvirt/qemu/VM         (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga[bcd]  (cpu.shares = 1024. hosts 4 cpu hogs each)

Mean and std. dev. (in brackets) for both tip and tip+patch cases are
provided below:

BM scenario:
                       tip              tip+patch         Remarks
                   mean (std. dev)    mean (std. dev)

volano             1 (6.5%)           0.97  (4.7%)        3% loss
sysbench [n1]      1 (0.6%)           1.004 (0.7%)        0.4% win
tbench 1 [n2]      1 (2%)             1.024 (1.6%)        2.4% win
pipe bench [n3]    1 (5.5%)           1.009 (2.5%)        0.9% win

VM scenario:

sysbench [n4]      1 (1.2%)           2.21  (1.3%)        121% win
httperf  [n5]      1 (5.7%)           1.522 (6%)          52.2% win
tbench 8 [n6]      1 (3.1%)           1.91  (6.4%)        91% win
volano             1 (4.3%)           1.06  (2.8%)        6% win
Trade              1                  1.94                94% win

Notes:

n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client.
n3. ops/sec metric from pipe bench captured. pipe bench run as:
	perf stat --repeat 10 --null perf bench sched pipe
n4. sysbench was run (inside VM) with 8 threads.
n5. httperf was run with burst-length of 100 and wsess of 100,500,0.
    Webserver was running inside VM while benchmark was run on a
    physically different host.
n6. tbench was run over network with 8 clients.

This is an improved version of the previously published patch. It
minimizes/avoids the regressions seen earlier:

	https://lkml.org/lkml/2012/3/22/220

Comments/flames welcome!

--

Steer a waking task towards a cpu where its cgroup has zero tasks (in
order to provide it better sleeper credits and hence reduce its wakeup
latency).

Signed-off-by: Srivatsa Vaddagiri

---
 kernel/sched/fair.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2459,6 +2459,32 @@ static long effective_load(struct task_g
 	return wl;
 }
 
+
+/*
+ * Look for a CPU within @target's MC domain where the task's cgroup has
+ * zero tasks in its cfs_rq.
+ */
+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+	struct cpumask tmpmask;
+	struct task_group *tg = task_group(p);
+	struct sched_domain *sd;
+	int i;
+
+	if (tg == &root_task_group)
+		return target;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	cpumask_and(&tmpmask, sched_domain_span(sd), tsk_cpus_allowed(p));
+	for_each_cpu(i, &tmpmask) {
+		if (!tg->cfs_rq[i]->nr_running)
+			return i;
+	}
+
+	return target;
+}
+
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,
@@ -2467,6 +2493,12 @@ static inline unsigned long effective_lo
 	return wl;
 }
 
+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+	return target;
+}
+
 #endif
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -2677,6 +2709,13 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	/*
+	 * Look for the next best possibility - a cpu where this task gets
+	 * (better) sleeper credits.
+	 */
+	target = select_idle_cfs_rq(p, target);
+
 done:
 	return target;
 }
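
For anyone who wants to try the selection loop outside the kernel, here
is a minimal user-space sketch of what select_idle_cfs_rq() does, fed
with the N0/N1 example from the description. Everything in it is an
illustrative assumption: plain bitmasks stand in for struct cpumask,
llc_span() stands in for the sd_llc domain span, and NR_CPUS_DEMO,
pick_idle_cfs_rq() and demo_example() are hypothetical names.

```c
#define NR_CPUS_DEMO 8	/* hypothetical: 2 nodes x 4 cpus, as in the example */

/* Hypothetical stand-in for the LLC (MC) domain span of a cpu. */
static unsigned int llc_span(int cpu)
{
	return cpu < 4 ? 0x0fu : 0xf0u;
}

/*
 * Return the first cpu in (LLC span of target) & allowed on which the
 * cgroup has no runnable tasks; fall back to target if there is none.
 */
static int pick_idle_cfs_rq(const unsigned int *tg_nr_running,
			    unsigned int allowed, int target)
{
	unsigned int mask = llc_span(target) & allowed;
	int i;

	for (i = 0; i < NR_CPUS_DEMO; i++) {
		if ((mask & (1u << i)) && !tg_nr_running[i])
			return i;
	}

	return target;
}

/* The "/a cfs_rq.nr_run" row from the example: only cpu 7 has a task. */
static int demo_example(void)
{
	static const unsigned int a_nr_running[NR_CPUS_DEMO] = {
		0, 0, 0, 0, 0, 0, 0, 1
	};

	return pick_idle_cfs_rq(a_nr_running, 0xffu, 7);
}
```

In this model demo_example() picks cpu 4, the first cpu in cpu7's MC
domain where /a has no tasks. Restricting the allowed mask to cpus 6-7
picks cpu 6, and if /a already has a task on every allowed cpu in the
domain the original target is kept.
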