public inbox for linux-kernel@vger.kernel.org
* [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
@ 2012-04-24 16:56 Srivatsa Vaddagiri
  2012-04-24 16:58 ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-24 16:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Mike Galbraith, Suresh Siddha, Paul Turner, linux-kernel

During my investigation of a performance issue, I found that
we can do a better job of reducing latencies for a waking task by
steering it towards a cpu where it will get better sleeper credits.

Consider a system with two nodes, N0 and N1, each with 4 cpus, and a cgroup
/a which is of the highest priority. Further, all 4 cpus in a node are in the
same llc (MC) domain.

		  	N0		   N1
			(0,1,2,3)	(4,5,6,7)

rq.nr_run ->	 	2 1 1 1         2 2 1 1
/a cfs_rq.nr_run ->     0 0 0 0  	0 0 0 1

Consider a task of "/a" waking up after a short (< sysctl_sched_latency) sleep.
Its prev_cpu was 7. select_idle_sibling(), failing to find an idle core, simply
wakes up the task on CPU7, where it may be unable to preempt the currently
running task (as its new vruntime is not sufficiently behind the currently
running task's vruntime, owing to the short sleep it incurred). As a result,
the woken task is unable to run immediately and thus incurs some latency.
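(For reference: the "sleeper credit" here is the wakeup vruntime placement
CFS does in place_entity(). A rough userspace sketch of that behavior, with
illustrative constants and names rather than the kernel's actual ones:)

```c
#include <stdbool.h>

typedef unsigned long long u64;

/* Illustrative values; the real ones are sysctl-tunable and scale
 * with the number of cpus. */
#define SCHED_LATENCY_NS	24000000ULL	/* ~sysctl_sched_latency */
#define WAKEUP_GRAN_NS		 4000000ULL	/* ~wakeup granularity */

/*
 * On wakeup, CFS places a task at min_vruntime minus a sleeper credit
 * (half of sched_latency with GENTLE_FAIR_SLEEPERS), but never moves
 * it backwards: a task that slept only briefly keeps its old, higher
 * vruntime and thus gets no effective credit.
 */
static u64 place_on_wakeup(u64 cfs_min_vruntime, u64 task_vruntime)
{
	u64 vruntime = cfs_min_vruntime - SCHED_LATENCY_NS / 2;

	return vruntime > task_vruntime ? vruntime : task_vruntime;
}

/*
 * Wakeup preemption: the waking task must lead the current task's
 * vruntime by more than the wakeup granularity to preempt it.
 */
static bool preempts_current(u64 waking_vruntime, u64 curr_vruntime)
{
	return curr_vruntime > waking_vruntime &&
	       curr_vruntime - waking_vruntime > WAKEUP_GRAN_NS;
}
```

With these numbers, a task waking after a long sleep lands sched_latency/2
behind min_vruntime and preempts; one waking after a short sleep does not -
which is the latency being described above.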

A better choice would be to find a cpu in cpu7's MC domain where its
cgroup has 0 tasks (thus allowing the waking task to get better sleeper
credits).
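In miniature, the selection looks like this (illustrative userspace sketch;
the names and flat arrays are mine, not the kernel's):

```c
/*
 * Among the cpus sharing prev_cpu's LLC (MC) domain, prefer one where
 * the waking task's cgroup has an empty cfs_rq, so the task is placed
 * against that cfs_rq's min_vruntime and gets full sleeper credit.
 */
static int pick_empty_cfs_rq_cpu(const int *cgroup_nr_running,
				 const int *llc_cpus, int nr_llc_cpus,
				 int prev_cpu)
{
	int i;

	for (i = 0; i < nr_llc_cpus; i++) {
		if (cgroup_nr_running[llc_cpus[i]] == 0)
			return llc_cpus[i];	/* empty cfs_rq: wake here */
	}

	return prev_cpu;			/* no luck: stay put */
}
```

In the example above, with /a's cfs_rq.nr_run being 0 0 0 1 on N1, this picks
CPU4 instead of CPU7.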

Patch below implements this idea. Results with various benchmarks
are enclosed.

Machine : 2 quad-core Intel X5570 CPUs w/ H/T enabled (16 cpus)
Kernel  : tip (HEAD at 2adb096)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
        various cgroups) are run on host. Cgroup setup is as below:

	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga[bcd] (cpu.shares = 1024. hosts 4 cpu hogs each)

Mean and std. dev. (in brackets) for both the tip and tip+patch cases are
provided below:

BM scenario:
			tip 		tip+patch	Remarks
			mean(std. dev)  mean (std. dev)

volano 			1 (6.5%)	0.97 (4.7%)	3% loss
sysbench [n1]		1 (0.6%)	1.004 (0.7%)	0.4% win
tbench 1 [n2]		1 (2%)		1.024 (1.6%)	2.4% win
pipe bench [n3]		1 (5.5%)	1.009 (2.5%)	0.9% win

VM scenario:

sysbench [n4]		1 (1.2%)	2.21 (1.3%)	121% win
httperf  [n5]		1 (5.7%)	1.522 (6%)	52.2% win
tbench 8 [n6]		1 (3.1%)	1.91 (6.4%)	91% win
volano			1  (4.3%)	1.06 (2.8%)	6% win
Trade    		1		1.94 		94% win


Notes:

n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client
n3. ops/sec metric from pipe bench captured. pipe bench run as:
	perf stat --repeat 10 --null perf bench sched pipe
n4. sysbench was run (inside VM) with 8 threads.
n5. httperf was run with a burst-length of 100 and wsess of 100,500,0.
    The webserver was running inside the VM while the benchmark was run
    on a physically different host.
n6. tbench was run over network with 8 clients

This is an improved version of the patch previously published that
minimizes/avoids regressions seen earlier:

	https://lkml.org/lkml/2012/3/22/220 

Comments/flames welcome!


--

Steer a waking task towards a cpu where its cgroup has zero tasks (in
order to provide it better sleeper credits and hence reduce its wakeup
latency).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 kernel/sched/fair.c |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2459,6 +2459,32 @@ static long effective_load(struct task_g
 
 	return wl;
 }
+
+/*
+ * Look for a CPU within @target's MC domain where the task's cgroup has
+ * zero tasks in its cfs_rq.
+ */
+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+	struct cpumask tmpmask;
+	struct task_group *tg = task_group(p);
+	struct sched_domain *sd;
+	int i;
+
+	if (tg == &root_task_group)
+		return target;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	cpumask_and(&tmpmask, sched_domain_span(sd), tsk_cpus_allowed(p));
+	for_each_cpu(i, &tmpmask) {
+		if (!tg->cfs_rq[i]->nr_running)
+			return i;
+	}
+
+	return target;
+}
+
 #else
 
 static inline unsigned long effective_load(struct task_group *tg, int cpu,
@@ -2467,6 +2493,12 @@ static inline unsigned long effective_lo
 	return wl;
 }
 
+static __always_inline int
+select_idle_cfs_rq(struct task_struct *p, int target)
+{
+	return target;
+}
+
 #endif
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -2677,6 +2709,13 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	/*
+	 * Look for the next best possibility - a cpu where this task gets
+	 * (better) sleeper credits.
+	 */
+	target = select_idle_cfs_rq(p, target);
+
 done:
 	return target;
 }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 16:56 [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies Srivatsa Vaddagiri
@ 2012-04-24 16:58 ` Peter Zijlstra
  2012-04-24 17:07   ` Srivatsa Vaddagiri
  2012-04-24 17:09   ` Peter Zijlstra
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-04-24 16:58 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> Steer a waking task towards a cpu where its cgroup has zero tasks (in
> order to provide it better sleeper credits and hence reduce its wakeup
> latency). 

That's just vile.. pjt could you post your global vruntime stuff so
vatsa can have a go at that?


* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 16:58 ` Peter Zijlstra
@ 2012-04-24 17:07   ` Srivatsa Vaddagiri
  2012-04-24 17:12     ` Peter Zijlstra
  2012-04-24 17:09   ` Peter Zijlstra
  1 sibling, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-24 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2012-04-24 18:58:16]:

> On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > order to provide it better sleeper credits and hence reduce its wakeup
> > latency). 
> 
> That's just vile.. pjt could you post your global vruntime stuff so
> vatsa can have a go at that?

The workload I am up against does experience tons of wakeups (after very short
(microsecond range) bursts of sleep) and so I am skeptical that the global
vruntime would keep up with this kind of workload. I'd be happy to test patches
and give feedback!

- vatsa



* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 16:58 ` Peter Zijlstra
  2012-04-24 17:07   ` Srivatsa Vaddagiri
@ 2012-04-24 17:09   ` Peter Zijlstra
  2012-04-24 17:26     ` Srivatsa Vaddagiri
  2012-05-02 14:01     ` Srivatsa Vaddagiri
  1 sibling, 2 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-04-24 17:09 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

On Tue, 2012-04-24 at 18:58 +0200, Peter Zijlstra wrote:
> On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > order to provide it better sleeper credits and hence reduce its wakeup
> > latency). 
> 
> That's just vile.. pjt could you post your global vruntime stuff so
> vatsa can have a go at that?

That is, you're playing a deficiency we should fix, not exploit.

Also, you do a second loop over all those cpus right after we've already
iterated them..

furthermore, that 100%+ gain is still way insane, what else is broken?
Did you try those paravirt tlb-flush patches and other such weirdness?


* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 17:07   ` Srivatsa Vaddagiri
@ 2012-04-24 17:12     ` Peter Zijlstra
  2012-04-24 17:35       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2012-04-24 17:12 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

On Tue, 2012-04-24 at 22:37 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <peterz@infradead.org> [2012-04-24 18:58:16]:
> 
> > On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > > order to provide it better sleeper credits and hence reduce its wakeup
> > > latency). 
> > 
> > That's just vile.. pjt could you post your global vruntime stuff so
> > vatsa can have a go at that?
> 
> The workload I am up against does experience tons of wakeups (after very short
> (microsecond range) bursts of sleep) and so I am skeptical that the global
> vruntime would keep up with this kind of workload. I'd be happy to test patches
> and give feedback!

Thing is, global vruntime might fix that flaw/property you're exploiting
to get that preemption.




* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 17:09   ` Peter Zijlstra
@ 2012-04-24 17:26     ` Srivatsa Vaddagiri
  2012-05-02 14:01     ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-24 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2012-04-24 19:09:14]:

> On Tue, 2012-04-24 at 18:58 +0200, Peter Zijlstra wrote:
> > On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > > order to provide it better sleeper credits and hence reduce its wakeup
> > > latency). 
> > 
> > That's just vile.. pjt could you post your global vruntime stuff so
> > vatsa can have a go at that?
> 
> That is, you're playing a deficiency we should fix, not exploit.
> 
> Also, you do a second loop over all those cpus right after we've already
> iterated them..

The first loop doesn't necessarily iterate through all cpus (as it's looking
for a core that is fully idle - and hence breaks once it finds a busy
sibling).
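The scan in question, roughly (an illustrative userspace sketch of the
fully-idle-core loop in select_idle_sibling(), assuming two HT siblings
per core; the names are mine, not the kernel's):

```c
#include <stdbool.h>

/*
 * Walk each candidate core (a pair of HT siblings) and take it only if
 * every sibling is idle; a single busy sibling aborts that core early,
 * so the loop need not visit all cpus.
 */
static int find_fully_idle_core(const int *nr_running, int cores[][2],
				int ncores, int fallback)
{
	int c, s;

	for (c = 0; c < ncores; c++) {
		bool all_idle = true;

		for (s = 0; s < 2; s++) {
			if (nr_running[cores[c][s]]) {
				all_idle = false;	/* busy sibling: skip core */
				break;
			}
		}
		if (all_idle)
			return cores[c][0];
	}

	return fallback;	/* no fully idle core found */
}
```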

> furthermore, that 100%+ gain is still way insane, what else is broken?

?

I have tried most benchmarks that were recommended for this kind of
change. Let me know if you'd suggest any other benchmarks.

> Did you try those paravirt tlb-flush patches and other such weirdness?

I will try that next. But IMHO the benefit of reduced wakeup latencies
should be over and above any benefit we get from paravirtualization.

- vatsa



* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 17:12     ` Peter Zijlstra
@ 2012-04-24 17:35       ` Srivatsa Vaddagiri
  2012-04-24 18:03         ` Rakib Mullick
  0 siblings, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-04-24 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2012-04-24 19:12:27]:

> > The workload I am up against does experience tons of wakeups (after very short
> > (microsecond range) bursts of sleep) and so I am skeptical that the global
> > vruntime would keep up with this kind of workload. I'd be happy to test patches
> > and give feedback!
> 
> Thing is, global vruntime might fix that flaw/property you're exploiting
> to get that preemption.

Am curious to see how global vruntime influences task placement (esp. on
wakeup). Going back to the previous example:

                        N0                 N1
                        (0,1,2,3)	(4,5,6,7)

rq.nr_run ->            2 1 1 1         2 2 1 1
/a cfs_rq.nr_run ->     0 0 0 0         0 0 0 1


CPU7 is running a task A0 belonging to "/a". Another task A1 belonging
to "/a" is waking up and its prev_cpu was 7. Currently we let it wake on
cpu7 itself and wait behind A0. With global vruntime, are you saying
that we let A1 wake on cpu7 and it would preempt A0? That would still not be 
nice as A0 now incurs some additional latencies?

- vatsa



* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 17:35       ` Srivatsa Vaddagiri
@ 2012-04-24 18:03         ` Rakib Mullick
  0 siblings, 0 replies; 10+ messages in thread
From: Rakib Mullick @ 2012-04-24 18:03 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Suresh Siddha,
	Paul Turner, linux-kernel

On Tue, Apr 24, 2012 at 11:35 PM, Srivatsa Vaddagiri
<vatsa@linux.vnet.ibm.com> wrote:
> * Peter Zijlstra <peterz@infradead.org> [2012-04-24 19:12:27]:
>
>> > The workload I am up against does experience tons of wakeups (after very short
>> > (microsecond range) bursts of sleep) and so I am skeptical that the global
>> > vruntime would keep up with this kind of workload. I'd be happy to test patches
>> > and give feedback!
>>
>> Thing is, global vruntime might fix that flaw/property you're exploiting
>> to get that preemption.
>
> Am curious to see how global vruntime influences task placement (esp. on
> wakeup). Going back to the previous example:
>
IIUC, global vruntime doesn't influence task placement; perhaps it'll do
something related to the task's priority, so that a waking task gets picked
up by the scheduler quickly. I'm also curious to see how global vruntime
influences task placement. :-)

Thanks,
Rakib


* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-04-24 17:09   ` Peter Zijlstra
  2012-04-24 17:26     ` Srivatsa Vaddagiri
@ 2012-05-02 14:01     ` Srivatsa Vaddagiri
  2012-05-03  5:43       ` Nikunj A Dadhania
  1 sibling, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-05-02 14:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

* Peter Zijlstra <peterz@infradead.org> [2012-04-24 19:09:14]:

> On Tue, 2012-04-24 at 18:58 +0200, Peter Zijlstra wrote:
> > On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > > order to provide it better sleeper credits and hence reduce its wakeup
> > > latency). 
> > 
> > That's just vile.. pjt could you post your global vruntime stuff so
> > vatsa can have a go at that?
> 
> That is, you're playing a deficiency we should fix, not exploit.
> 
> Also, you do a second loop over all those cpus right after we've already
> iterated them..
> 
> furthermore, that 100%+ gain is still way insane, what else is broken?
> Did you try those paravirt tlb-flush patches and other such weirdness?

I got around to trying the pv-tlb-flush patches and they showed >100%
improvement for sysbench (without the balance-on-wake patch on the host).
This was what readprofile showed (when the pv-tlb-flush patches were absent
in the guest):

1135704 total                                   0.3265
636737 native_cpuid                             18192.4857
283201 __bitmap_empty                           2832.0100
137853 flush_tlb_others_ipi                     569.6405

I will try out how much they help the Trade workload (which got me started
on this originally) and report back (part of the problem in trying them out
is that the pv-tlb-flush patches are throwing weird problems - which Nikunj
is helping investigate).

In any case, we can't expect users to be able to easily upgrade their
guest VM kernels and so I am still looking for a solution that works for
older guest VM kernels. Paul, hope I can get the global vruntime patches
from you soon to test how much it helps!

- vatsa



* Re: [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies
  2012-05-02 14:01     ` Srivatsa Vaddagiri
@ 2012-05-03  5:43       ` Nikunj A Dadhania
  0 siblings, 0 replies; 10+ messages in thread
From: Nikunj A Dadhania @ 2012-05-03  5:43 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, Paul Turner,
	linux-kernel

On Wed, 2 May 2012 19:31:17 +0530, Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:
> * Peter Zijlstra <peterz@infradead.org> [2012-04-24 19:09:14]:
> 
> > On Tue, 2012-04-24 at 18:58 +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-04-24 at 22:26 +0530, Srivatsa Vaddagiri wrote:
> > > > Steer a waking task towards a cpu where its cgroup has zero tasks (in
> > > > order to provide it better sleeper credits and hence reduce its wakeup
> > > > latency). 
> > > 
> > > That's just vile.. pjt could you post your global vruntime stuff so
> > > vatsa can have a go at that?
> > 
> > That is, you're playing a deficiency we should fix, not exploit.
> > 
> > Also, you do a second loop over all those cpus right after we've already
> > iterated them..
> > 
> > furthermore, that 100%+ gain is still way insane, what else is broken?
> > Did you try those paravirt tlb-flush patches and other such weirdness?
> 
> I got around to trying the pv-tlb-flush patches and they showed >100%
> improvement for sysbench (without the balance-on-wake patch on the host).
> This was what readprofile showed (when the pv-tlb-flush patches were absent
> in the guest):
> 
> 1135704 total                                   0.3265
> 636737 native_cpuid                             18192.4857
> 283201 __bitmap_empty                           2832.0100
> 137853 flush_tlb_others_ipi                     569.6405
> 
> I will try out how much they help the Trade workload (which got me started
> on this originally) and report back (part of the problem in trying them out
> is that the pv-tlb-flush patches are throwing weird problems - which Nikunj
> is helping investigate).
> 
A bug - the patch below should cure it. kvm_mmu_flush_tlb will queue the
request, but we have already passed the phase of checking that. I will
fold this into my next version.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c42056..b114411 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1550,7 +1550,7 @@ static void kvm_set_vcpu_state(struct kvm_vcpu *vcpu)
 
        vs->state = 1;
        if (vs->flush_on_enter) {
-               kvm_mmu_flush_tlb(vcpu);
+               kvm_x86_ops->tlb_flush(vcpu);
                vs->flush_on_enter = 0;
        }
 



end of thread, other threads:[~2012-05-03  5:46 UTC | newest]

Thread overview: 10+ messages
2012-04-24 16:56 [PATCH v1] sched: steer waking task to empty cfs_rq for better latencies Srivatsa Vaddagiri
2012-04-24 16:58 ` Peter Zijlstra
2012-04-24 17:07   ` Srivatsa Vaddagiri
2012-04-24 17:12     ` Peter Zijlstra
2012-04-24 17:35       ` Srivatsa Vaddagiri
2012-04-24 18:03         ` Rakib Mullick
2012-04-24 17:09   ` Peter Zijlstra
2012-04-24 17:26     ` Srivatsa Vaddagiri
2012-05-02 14:01     ` Srivatsa Vaddagiri
2012-05-03  5:43       ` Nikunj A Dadhania
