From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>,
Mike Galbraith <efault@gmx.de>,
Suresh Siddha <suresh.b.siddha@intel.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
Paul Turner <pjt@google.com>
Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
Date: Thu, 22 Mar 2012 21:02:05 +0530 [thread overview]
Message-ID: <20120322153205.GA28570@linux.vnet.ibm.com> (raw)
In-Reply-To: <20120306091410.GD27238@elte.hu>
* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:
> > I did some experiments with volanomark and it does turn out to
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy
> > benchmark that I am dealing with (Trade) benefits from it.
>
> Does volanomark still do yield(), thereby invoking a random
> shuffle of thread scheduling and pretty much voluntarily
> ejecting itself from most scheduler performance considerations?
>
> If it uses a real locking primitive such as futexes then its
> performance matters more.
Some more interesting results on a more recent tip kernel.

Machine  : 2 quad-core Intel X5570 CPUs w/ H/T enabled (16 logical CPUs)
Kernel   : tip (HEAD at ee415e2)
Guest VM : 2.6.18-kernel-based enterprise Linux guest
Benchmarks are run in two scenarios:
1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
various cgroups) are run on host. Cgroup setup is as below:
/sys (cpu.shares = 1024, hosts all system tasks)
/libvirt (cpu.shares = 20000)
/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
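FWIW, the share setup above can be scripted roughly as below. This is a
dry-runnable sketch, not the exact commands used: it assumes the cgroup v1
"cpu" controller, and CG defaults to a scratch directory so it can be run
anywhere (on a real host, point CG at the cpu controller mount, commonly
/sys/fs/cgroup/cpu, where libvirt normally creates the qemu groups itself).

```shell
# Sketch of the cgroup share setup used in the VM scenario.
# CG defaults to a scratch directory for a dry run; point it at the real
# cpu-controller mount (an assumption -- often /sys/fs/cgroup/cpu) to
# apply the shares for real.
CG=${CG:-$(mktemp -d)}

mkdir -p "$CG/libvirt/qemu/VM"
echo 20000 > "$CG/libvirt/cpu.shares"          # /libvirt
echo 8192  > "$CG/libvirt/qemu/VM/cpu.shares"  # guest VM w/ 8 vcpus

# four sibling groups, each hosting 4 cpu hogs
for g in hoga hogb hogc hogd; do
	mkdir -p "$CG/libvirt/qemu/$g"
	echo 1024 > "$CG/libvirt/qemu/$g/cpu.shares"
done
```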
First, the BM (bare metal) scenario:

                  tip     tip + patch
  volano          1       0.955   (4.5% degradation)
  sysbench [n1]   1       0.9984  (0.16% degradation)
  tbench 1 [n2]   1       0.9096  (9% degradation)
Now the more interesting VM scenario:

                  tip     tip + patch
  volano          1       1.29    (29% improvement)
  sysbench [n3]   1       2       (100% improvement)
  tbench 1 [n4]   1       1.07    (7% improvement)
  tbench 8 [n5]   1       1.26    (26% improvement)
  httperf [n6]    1       1.05    (5% improvement)
  Trade           1       1.31    (31% improvement)
Notes:
n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run with a burst-length of 100 and wsess of 100,500,0
So the patch seems to be a wholesale win when VCPU threads are waking
up (in a highly contended environment). One reason could be that the
assumption of better cache hits from running (vcpu) threads on their
prev_cpu may not fully hold, as a vcpu thread can represent many
different guest threads internally.

Anyway, there are degradations as well, considering which I see several
possibilities:
1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
environment:
- Either via sched_domain flags (disable SD_WAKE_AFFINE
at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
- Or via a new sched_feat(BALANCE_WAKE) tunable
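The sched_domain route in option 2 can be exercised at runtime via the
CONFIG_SCHED_DEBUG knobs under /proc/sys/kernel/sched_domain/. A hedged
sketch follows: the flag values are taken from the 3.x-era
include/linux/sched.h (verify against the running kernel's headers), and a
scratch file stands in for the real per-domain flags file so the bit
manipulation can be dry-run.

```shell
# Flag values as in 3.x-era include/linux/sched.h -- an assumption here;
# verify against the running kernel before poking real domains.
SD_BALANCE_WAKE=0x10
SD_WAKE_AFFINE=0x20

# On a real host this would loop over
# /proc/sys/kernel/sched_domain/cpu*/domain*/flags (needs
# CONFIG_SCHED_DEBUG); a scratch file stands in for a dry run.
f=${FLAGS_FILE:-$(mktemp)}
echo $(( 0x2f )) > "$f"   # example: LB|NEWIDLE|EXEC|FORK|WAKE_AFFINE

cur=$(cat "$f")
# clear SD_WAKE_AFFINE, set SD_BALANCE_WAKE
echo $(( (cur & ~SD_WAKE_AFFINE) | SD_BALANCE_WAKE )) > "$f"
```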
Any other thoughts or suggestions for more experiments?
--
Balance threads on wakeup, letting them run on the least-loaded CPU in the
same cache domain as their prev_cpu (or cur_cpu if the wake_affine() test
obliges).
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
include/linux/topology.h | 4 ++--
kernel/sched/fair.c | 5 ++++-
2 files changed, 6 insertions(+), 3 deletions(-)
Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE		\
 				| 1*SD_BALANCE_EXEC		\
 				| 1*SD_BALANCE_FORK		\
-				| 0*SD_BALANCE_WAKE		\
+				| 1*SD_BALANCE_WAKE		\
 				| 1*SD_WAKE_AFFINE		\
 				| 1*SD_SHARE_CPUPOWER		\
 				| 0*SD_POWERSAVINGS_BALANCE	\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE		\
 				| 1*SD_BALANCE_EXEC		\
 				| 1*SD_BALANCE_FORK		\
-				| 0*SD_BALANCE_WAKE		\
+				| 1*SD_BALANCE_WAKE		\
 				| 1*SD_WAKE_AFFINE		\
 				| 0*SD_PREFER_LOCAL		\
 				| 0*SD_SHARE_CPUPOWER		\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 		prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {