Re: sched: Avoid SMT siblings in select_idle_sibling() if possible

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Mike Galbraith <efault@gmx.de>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Paul Turner <pjt@google.com>
Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
Date: Thu, 22 Mar 2012 21:02:05 +0530	[thread overview]
Message-ID: <20120322153205.GA28570@linux.vnet.ibm.com> (raw)
In-Reply-To: <20120306091410.GD27238@elte.hu>

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> > I did some experiments with volanomark and it does turn out to 
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > benchmark that I am dealing with (Trade) benefits from it.
> 
> Does volanomark still do yield(), thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Some more interesting results on more recent tip kernel.

Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
Kernel  : tip (HEAD at ee415e2)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
 	 various cgroups) are run on host. Cgroup setup is as below:

	/sys (cpu.shares = 1024, hosts all system tasks)
	/libvirt (cpu.shares = 20000)
 	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)

First BM (bare metal) scenario:

		tip	tip + patch

volano 		1	0.955   (4.5% degradation)
sysbench [n1] 	1	0.9984  (0.16% degradation)
tbench 1 [n2]	1	0.9096  (9% degradation)

Now the more interesting VM scenario:

		tip	tip + patch

volano		1	1.29   (29% improvement)
sysbench [n3]	1	2      (100% improvement)
tbench 1 [n4] 	1       1.07   (7% improvement)
tbench 8 [n5] 	1       1.26   (26% improvement)
httperf  [n6]	1 	1.05   (5% improvement)
Trade		1	1.31   (31% improvement)

Notes:
 
n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client 
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run as with burst-length of 100 and wsess of 100,500,0

So the patch seems to be a wholesome win when VCPU threads are waking
up (in a highly contended environment). One reason could be that any assumption 
of better cache hits by running (vcpu) threads on its prev_cpu may not
be fully correct as vcpu threads could represent many different threads
internally?

Anyway, there are degradations as well, considering which I see several 
possibilities:

1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
   environment:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
 	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable

Any other thoughts or suggestions for more experiments?


--

Balance threads on wakeup to let it run on least-loaded CPU in the same
cache domain as its prev_cpu (or cur_cpu if wake_affine() test obliges).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |    5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {

next prev parent reply	other threads:[~2012-03-22 15:32 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <1329764866.2293.376.camhel@twins>
2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri
2012-03-06  9:14   ` Ingo Molnar
2012-03-06 10:03     ` Srivatsa Vaddagiri
2012-03-22 15:32     ` Srivatsa Vaddagiri [this message]
2012-03-23  6:38       ` Mike Galbraith
2012-03-26  8:29       ` Peter Zijlstra
2012-03-26  8:36       ` Peter Zijlstra
2012-03-26 17:35         ` Srivatsa Vaddagiri
2012-03-26 18:06           ` Peter Zijlstra
2012-03-27 13:56             ` Mike Galbraith
2011-11-15  9:46 Peter Zijlstra
2011-11-16  1:14 ` Suresh Siddha
2011-11-16  9:24   ` Mike Galbraith
2011-11-16 18:37     ` Suresh Siddha
2011-11-17  1:59       ` Mike Galbraith
2011-11-17 15:38         ` Mike Galbraith
2011-11-17 15:56           ` Peter Zijlstra
2011-11-17 16:38             ` Mike Galbraith
2011-11-17 17:36               ` Suresh Siddha
2011-11-18 15:14                 ` Mike Galbraith
2012-02-20 14:41                   ` Peter Zijlstra
2012-02-20 15:03                     ` Srivatsa Vaddagiri
2012-02-20 18:25                       ` Mike Galbraith
2012-02-21  0:06                         ` Srivatsa Vaddagiri
2012-02-21  6:37                           ` Mike Galbraith
2012-02-21  8:09                             ` Srivatsa Vaddagiri
2012-02-20 18:14                     ` Mike Galbraith
2012-02-20 18:15                       ` Peter Zijlstra
2012-02-20 19:07                       ` Peter Zijlstra
2012-02-21  5:43                         ` Mike Galbraith
2012-02-21  8:32                           ` Srivatsa Vaddagiri
2012-02-21  9:21                             ` Mike Galbraith
2012-02-21 10:37                               ` Peter Zijlstra
2012-02-21 14:58                                 ` Srivatsa Vaddagiri
2012-02-23 10:49                       ` Srivatsa Vaddagiri
2012-02-23 11:19                         ` Ingo Molnar
2012-02-23 12:18                           ` Srivatsa Vaddagiri
2012-02-23 11:20                         ` Srivatsa Vaddagiri
2012-02-23 11:26                           ` Ingo Molnar
2012-02-23 11:32                             ` Srivatsa Vaddagiri
2012-02-23 16:17                               ` Ingo Molnar
2012-02-23 11:21                         ` Mike Galbraith
2012-02-25  6:54                           ` Srivatsa Vaddagiri
2012-02-25  8:30                             ` Mike Galbraith
2012-02-27 22:11                               ` Suresh Siddha
2012-02-28  5:05                                 ` Mike Galbraith
2011-11-17 19:08             ` Suresh Siddha
2011-11-18 15:12               ` Peter Zijlstra
2011-11-18 15:26                 ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120322153205.GA28570@linux.vnet.ibm.com \
    --to=vatsa@linux.vnet.ibm.com \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=suresh.b.siddha@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.