[wake_afine fixes/improvements 0/3] Introduction

public inbox for linux-kernel@vger.kernel.org

* [wake_afine fixes/improvements 0/3] Introduction
@ 2011-01-15  1:57 Paul Turner
  2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15  1:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

I've been looking at the wake_affine path to improve the group scheduling case
(wake affine performance for fair group sched has historically lagged) as well
as tweaking performance in general.

The current series of patches is attached.  The first should probably be
considered for 2.6.38, since it fixes a bug/regression in the case of waking up
onto a previously (group) empty cpu; the others can be considered more forward
looking.

I've been benchmarking these changes with an rpc ping-pong workload that is
known to be sensitive to poor affine decisions; I'm happy to run these patches
against other workloads.  In particular, improvements on reaim have been
demonstrated, but since it's not as stable a benchmark the numbers are harder
to present in a representative fashion.  Suggestions/pet benchmarks greatly
appreciated here.

Some other things I experimented with (that didn't pan out as a performance win):
- Considering instantaneous load on prev_cpu as well as current_cpu
- Using more gentle wl/wg values to reflect that a task's contribution to
  load_contribution is likely less than its weight.

Performance:

(throughput is measured in txn/s across a 5 minute interval, with a 30 second
warmup)

tip (no group scheduling):
throughput=57798.701988 reqs/sec.
throughput=58098.876188 reqs/sec.

tip (autogroup + current shares code and associated broken effective_load):
throughput=49824.283179 reqs/sec.
throughput=48527.942386 reqs/sec.

tip (autogroup + old tg_shares code): [parity goal post]
throughput=57846.575060 reqs/sec.
throughput=57626.442034 reqs/sec.

tip (autogroup + effective_load rewrite):
throughput=58534.073595 reqs/sec.
throughput=58068.072052 reqs/sec.

tip (autogroup + effective_load + no affine moves for hot tasks):
throughput=60907.794697 reqs/sec.
throughput=61208.305629 reqs/sec.
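
To put the runs above on a common scale, the two samples per configuration can
be averaged and divided by the no-group-scheduling average.  The following is
only a minimal sketch of that arithmetic: the helper and array names are made
up for illustration and are not part of the series.  With these figures the
broken effective_load case comes out roughly 15% below the baseline, and the
final configuration roughly 5% above it.

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (reqs/sec); 0.0 for n == 0. */
double mean(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += samples[i];
	return n ? sum / (double)n : 0.0;
}

/* Mean of one configuration relative to the mean of a baseline. */
double relative_to(const double *cfg, const double *base, size_t n)
{
	return mean(cfg, n) / mean(base, n);
}

/* The two runs reported above for each configuration (names illustrative). */
const double no_group[2]      = { 57798.701988, 58098.876188 };
const double broken_shares[2] = { 49824.283179, 48527.942386 };
const double old_tg_shares[2] = { 57846.575060, 57626.442034 };
const double el_rewrite[2]    = { 58534.073595, 58068.072052 };
const double no_hot_affine[2] = { 60907.794697, 61208.305629 };
```

For example, relative_to(broken_shares, no_group, 2) gives the fraction of
baseline throughput retained with the current shares code.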

Thanks,

- Paul


* [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
  2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
@ 2011-01-15  1:57 ` Paul Turner
  2011-01-17 14:11   ` Peter Zijlstra
  2011-01-18 19:04   ` [tip:sched/urgent] sched: Update " tip-bot for Paul Turner
  2011-01-15  1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15  1:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

[-- Attachment #1: fix_wake_affine.patch --]
[-- Type: text/plain, Size: 1907 bytes --]

Previously effective_load would approximate the global load weight present on
a group taking advantage of:

entity_weight = tg->shares ( lw / global_lw ), where entity_weight was provided
by tg_shares_up.

This worked (approximately) for an 'empty' (at tg level) cpu since we would
place boost load representative of what a newly woken task would receive.

However, now that load is instantaneously updated this assumption is no longer
true and the load calculation is rather incorrect in this case.

Fix this (and improve the general case) by re-writing effective_load to take
advantage of the new shares distribution code.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1362,27 +1362,27 @@ static long effective_load(struct task_g
 		return wl;
 
 	for_each_sched_entity(se) {
-		long S, rw, s, a, b;
+		long lw, w;
 
-		S = se->my_q->tg->shares;
-		s = se->load.weight;
-		rw = se->my_q->load.weight;
+		tg = se->my_q->tg;
+		w = se->my_q->load.weight;
 
-		a = S*(rw + wl);
-		b = S*rw + s*wg;
+		/* use this cpu's instantaneous contribution */
+		lw = atomic_read(&tg->load_weight);
+		lw -= se->my_q->load_contribution;
+		lw += w + wg;
 
-		wl = s*(a-b);
+		wl += w;
 
-		if (likely(b))
-			wl /= b;
+		if (lw > 0 && wl < lw)
+			wl = (wl * tg->shares) / lw;
+		else
+			wl = tg->shares;
 
-		/*
-		 * Assume the group is already running and will
-		 * thus already be accounted for in the weight.
-		 *
-		 * That is, moving shares between CPUs, does not
-		 * alter the group weight.
-		 */
+		/* zero point is MIN_SHARES */
+		if (wl < MIN_SHARES)
+			wl = MIN_SHARES;
+		wl -= se->load.weight;
 		wg = 0;
 	}
 




* Re: [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
  2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
@ 2011-01-17 14:11   ` Peter Zijlstra
  2011-01-17 14:20     ` Peter Zijlstra
  2011-01-18 19:04   ` [tip:sched/urgent] sched: Update " tip-bot for Paul Turner
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:11 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> plain text document attachment (fix_wake_affine.patch)
> Previously effective_load would approximate the global load weight present on
> a group taking advantage of:
> 
> entity_weight = tg->shares ( lw / global_lw ), where entity_weight was provided
> by tg_shares_up.
> 
> This worked (approximately) for an 'empty' (at tg level) cpu since we would
> place boost load representative of what a newly woken task would receive.
> 
> However, now that load is instantaneously updated this assumption is no longer
> true and the load calculation is rather incorrect in this case.
> 
> Fix this (and improve the general case) by re-writing effective_load to take
> advantage of the new shares distribution code.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> ---
>  kernel/sched_fair.c |   32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> Index: tip3/kernel/sched_fair.c
> ===================================================================
> --- tip3.orig/kernel/sched_fair.c
> +++ tip3/kernel/sched_fair.c
> @@ -1362,27 +1362,27 @@ static long effective_load(struct task_g
>  		return wl;
>  
>  	for_each_sched_entity(se) {
> +		long lw, w;
>  
> +		tg = se->my_q->tg;
> +		w = se->my_q->load.weight;

weight of this cpu's part of the task-group

> +		/* use this cpu's instantaneous contribution */
> +		lw = atomic_read(&tg->load_weight);
> +		lw -= se->my_q->load_contribution;
> +		lw += w + wg;

total weight of this task_group + new load
 
> +		wl += w;

this cpu's weight + new load

> +		if (lw > 0 && wl < lw)
> +			wl = (wl * tg->shares) / lw;
> +		else
> +			wl = tg->shares;

OK, so this computes the new load for this cpu, by taking the
appropriate proportion of tg->shares, it clips on large wl, and does
something funny for !lw -- on purpose?


> +		/* zero point is MIN_SHARES */
> +		if (wl < MIN_SHARES)
> +			wl = MIN_SHARES;

*nod*

> +		wl -= se->load.weight;

Take the weight delta up to the next level..

>  		wg = 0;

And assume all further groups are already enqueued and stay enqueued.

>  	}




 


* Re: [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights
  2011-01-17 14:11   ` Peter Zijlstra
@ 2011-01-17 14:20     ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:20 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

On Mon, 2011-01-17 at 15:11 +0100, Peter Zijlstra wrote:
> 
> > +             if (lw > 0 && wl < lw)
> > +                     wl = (wl * tg->shares) / lw;
> > +             else
> > +                     wl = tg->shares;
> 
> OK, so this computes the new load for this cpu, by taking the
> appropriate proportion of tg->shares, it clips on large wl, and does
> something funny for !lw -- on purpose? 

D'0h, when !lw, the tg is empty and we don't care what happens since it
won't get scheduled anyway..

Ok, very nice, applied!
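
The per-level calculation reviewed above can be tried out in user space.  What
follows is a minimal sketch, not the kernel function: struct level_model and
effective_load_model() are hypothetical names, each array element stands in for
one iteration of for_each_sched_entity() (innermost group first), and
MIN_SHARES is assumed to be 2.  The arithmetic itself mirrors the '+' lines of
the patch.

```c
#define MIN_SHARES 2	/* assumed value, for illustration only */

/* Snapshot of the fields the patch reads at one level of the hierarchy. */
struct level_model {
	long tg_shares;		/* tg->shares */
	long tg_load_weight;	/* atomic_read(&tg->load_weight) */
	long cfs_rq_weight;	/* se->my_q->load.weight */
	long load_contribution;	/* se->my_q->load_contribution */
	long se_weight;		/* se->load.weight */
};

/*
 * wl: weight added to this cpu's part of the group
 * wg: weight added to the group as a whole
 * Returns the load delta as seen from the top of the modelled hierarchy.
 */
long effective_load_model(const struct level_model *levels, int nr_levels,
			  long wl, long wg)
{
	int i;

	for (i = 0; i < nr_levels; i++) {
		const struct level_model *lv = &levels[i];
		long w = lv->cfs_rq_weight;	/* this cpu's part of the group */
		long lw;

		/* total group weight, using this cpu's instantaneous weight */
		lw = lv->tg_load_weight - lv->load_contribution + w + wg;

		wl += w;			/* this cpu's weight + new load */

		if (lw > 0 && wl < lw)
			wl = (wl * lv->tg_shares) / lw;
		else
			wl = lv->tg_shares;	/* clip; also covers lw == 0 */

		if (wl < MIN_SHARES)		/* zero point is MIN_SHARES */
			wl = MIN_SHARES;

		wl -= lv->se_weight;		/* pass the delta up a level */
		wg = 0;				/* parents assumed enqueued */
	}

	return wl;
}
```

As an example of the empty-cpu case: with tg_shares = 1024, one weight-1024
task already running elsewhere (tg_load_weight = 1024), nothing queued locally
and se_weight = 2, waking a weight-1024 task here (wl = wg = 1024) yields
1024 * 1024 / 2048 - 2 = 510.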


* [tip:sched/urgent] sched: Update effective_load() to use global share weights
  2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
  2011-01-17 14:11   ` Peter Zijlstra
@ 2011-01-18 19:04   ` tip-bot for Paul Turner
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Paul Turner @ 2011-01-18 19:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Commit-ID:  977dda7c9b540f48b228174346d8b31542c1e99f
Gitweb:     http://git.kernel.org/tip/977dda7c9b540f48b228174346d8b31542c1e99f
Author:     Paul Turner <pjt@google.com>
AuthorDate: Fri, 14 Jan 2011 17:57:50 -0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 18 Jan 2011 15:09:38 +0100

sched: Update effective_load() to use global share weights

Previously effective_load would approximate the global load weight present on
a group taking advantage of:

entity_weight = tg->shares ( lw / global_lw ), where entity_weight was provided
by tg_shares_up.

This worked (approximately) for an 'empty' (at tg level) cpu since we would
place boost load representative of what a newly woken task would receive.

However, now that load is instantaneously updated this assumption is no longer
true and the load calculation is rather incorrect in this case.

Fix this (and improve the general case) by re-writing effective_load to take
advantage of the new shares distribution code.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110115015817.069769529@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   32 ++++++++++++++++----------------
 1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..414145c 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1362,27 +1362,27 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		return wl;
 
 	for_each_sched_entity(se) {
-		long S, rw, s, a, b;
+		long lw, w;
 
-		S = se->my_q->tg->shares;
-		s = se->load.weight;
-		rw = se->my_q->load.weight;
+		tg = se->my_q->tg;
+		w = se->my_q->load.weight;
 
-		a = S*(rw + wl);
-		b = S*rw + s*wg;
+		/* use this cpu's instantaneous contribution */
+		lw = atomic_read(&tg->load_weight);
+		lw -= se->my_q->load_contribution;
+		lw += w + wg;
 
-		wl = s*(a-b);
+		wl += w;
 
-		if (likely(b))
-			wl /= b;
+		if (lw > 0 && wl < lw)
+			wl = (wl * tg->shares) / lw;
+		else
+			wl = tg->shares;
 
-		/*
-		 * Assume the group is already running and will
-		 * thus already be accounted for in the weight.
-		 *
-		 * That is, moving shares between CPUs, does not
-		 * alter the group weight.
-		 */
+		/* zero point is MIN_SHARES */
+		if (wl < MIN_SHARES)
+			wl = MIN_SHARES;
+		wl -= se->load.weight;
 		wg = 0;
 	}
 


* [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
  2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
  2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
@ 2011-01-15  1:57 ` Paul Turner
  2011-01-17 14:14   ` Peter Zijlstra
  2011-01-15  1:57 ` [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE) Paul Turner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Paul Turner @ 2011-01-15  1:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

[-- Attachment #1: no_hot_sd.patch --]
[-- Type: text/plain, Size: 3081 bytes --]

We no longer compute per-domain migration costs or have use for task_hot()
external to the fair scheduling class.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   35 -----------------------------------
 kernel/sched_fair.c |   32 +++++++++++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 36 deletions(-)

Index: tip3/kernel/sched.c
===================================================================
--- tip3.orig/kernel/sched.c
+++ tip3/kernel/sched.c
@@ -1522,8 +1522,6 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
-static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
-
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -2061,38 +2059,6 @@ static void check_preempt_curr(struct rq
 }
 
 #ifdef CONFIG_SMP
-/*
- * Is this task likely cache-hot:
- */
-static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
-{
-	s64 delta;
-
-	if (p->sched_class != &fair_sched_class)
-		return 0;
-
-	if (unlikely(p->policy == SCHED_IDLE))
-		return 0;
-
-	/*
-	 * Buddy candidates are cache hot:
-	 */
-	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
-			(&p->se == cfs_rq_of(&p->se)->next ||
-			 &p->se == cfs_rq_of(&p->se)->last))
-		return 1;
-
-	if (sysctl_sched_migration_cost == -1)
-		return 1;
-	if (sysctl_sched_migration_cost == 0)
-		return 0;
-
-	delta = now - p->se.exec_start;
-
-	return delta < (s64)sysctl_sched_migration_cost;
-}
-
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
 #ifdef CONFIG_SCHED_DEBUG
@@ -9237,4 +9203,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1346,6 +1346,36 @@ static void task_waking_fair(struct rq *
 	se->vruntime -= cfs_rq->min_vruntime;
 }
 
+/* is this task likely cache-hot */
+static int
+task_hot(struct task_struct *p, u64 now)
+{
+	s64 delta;
+
+	if (p->sched_class != &fair_sched_class)
+		return 0;
+
+	if (unlikely(p->policy == SCHED_IDLE))
+		return 0;
+
+	/*
+	 * Buddy candidates are cache hot:
+	 */
+	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+			(&p->se == cfs_rq_of(&p->se)->next ||
+			 &p->se == cfs_rq_of(&p->se)->last))
+		return 1;
+
+	if (sysctl_sched_migration_cost == -1)
+		return 1;
+	if (sysctl_sched_migration_cost == 0)
+		return 0;
+
+	delta = now - p->se.exec_start;
+
+	return delta < (s64)sysctl_sched_migration_cost;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * effective_load() calculates the load change as seen from the root_task_group
@@ -1954,7 +1984,7 @@ int can_migrate_task(struct task_struct 
 	 * 2) too many balance attempts have failed.
 	 */
 
-	tsk_cache_hot = task_hot(p, rq->clock_task, sd);
+	tsk_cache_hot = task_hot(p, rq->clock_task);
 	if (!tsk_cache_hot ||
 		sd->nr_balance_failed > sd->cache_nice_tries) {
 #ifdef CONFIG_SCHEDSTATS




* Re: [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
  2011-01-15  1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
@ 2011-01-17 14:14   ` Peter Zijlstra
  2011-01-18 21:52     ` Paul Turner
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2011-01-17 14:14 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> plain text document attachment (no_hot_sd.patch)
> We no longer compute per-domain migration costs or have use for task_hot()
> external to the fair scheduling class.

Ok, so this is mostly a pure code move (aside from removing the unused sd
argument). I do seem to remember that various folks played around with
bringing the per sd cache refill cost back.. any conclusion on that?

(not really a big point, we can easily add the argument back when
needed)


* Re: [wake_afine fixes/improvements 2/3] sched: clean up task_hot()
  2011-01-17 14:14   ` Peter Zijlstra
@ 2011-01-18 21:52     ` Paul Turner
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-18 21:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

On Mon, Jan 17, 2011 at 6:14 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
>> plain text document attachment (no_hot_sd.patch)
>> We no longer compute per-domain migration costs or have use for task_hot()
>> external to the fair scheduling class.
>
> Ok, so this is mostly a pure code move (aside from removing the unused sd
> argument). I do seem to remember that various folks played around with
> bringing the per sd cache refill cost back.. any conclusion on that?
>
> (not really a big point, we can easily add the argument back when
> needed)
>

Yeah this one's solely housekeeping.

I think there probably is value in a relative notion of what it means
to be hot that's based on the domain distance (especially with the
slightly more exotic topologies we're starting to see), but until some
framework exists I figured I might as well clean it up while I was
there.


* [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE)
  2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
  2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
  2011-01-15  1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
@ 2011-01-15  1:57 ` Paul Turner
  2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
  2011-01-15 21:34 ` Nick Piggin
  4 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15  1:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Mike Galbraith, Nick Piggin,
	Srivatsa Vaddagiri

[-- Attachment #1: task_hot_lazy.patch --]
[-- Type: text/plain, Size: 2177 bytes --]

Re-introduce the cache-cold requirement for affine wake-up balancing.

A much more aggressive migration cost (currently 0.5ms) appears to have tilted
the needle towards favouring not performing affine migrations for cache_hot 
tasks.

Since the update_rq_clock() path is more expensive now (and the 'hot' window
so small), avoid hammering it in the common case where the (possibly slightly
stale) rq->clock_task value has already advanced enough to invalidate hot-ness.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c     |   20 +++++++++++++++++++-
 kernel/sched_features.h |    5 +++++
 2 files changed, 24 insertions(+), 1 deletion(-)

Index: tip3/kernel/sched_fair.c
===================================================================
--- tip3.orig/kernel/sched_fair.c
+++ tip3/kernel/sched_fair.c
@@ -1376,6 +1376,23 @@ task_hot(struct task_struct *p, u64 now)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+/*
+ * Since sched_migration_cost is (relatively) very small we only need to
+ * actually update the clock in the boundary case when determining whether a
+ * task is hot or not.
+ */
+static int task_hot_lazy(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+
+	if (!task_hot(p, rq->clock_task))
+		return 0;
+
+	update_rq_clock(rq);
+
+	return task_hot(p, rq->clock_task);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
  * effective_load() calculates the load change as seen from the root_task_group
@@ -1664,7 +1681,8 @@ select_task_rq_fair(struct rq *rq, struc
 	int sync = wake_flags & WF_SYNC;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, &p->cpus_allowed))
+		if (cpumask_test_cpu(cpu, &p->cpus_allowed) &&
+		   (!sched_feat(NO_HOT_AFFINE) || !task_hot_lazy(p)))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
Index: tip3/kernel/sched_features.h
===================================================================
--- tip3.orig/kernel/sched_features.h
+++ tip3/kernel/sched_features.h
@@ -64,3 +64,8 @@ SCHED_FEAT(OWNER_SPIN, 1)
  * Decrement CPU power based on irq activity
  */
 SCHED_FEAT(NONIRQ_POWER, 1)
+
+/*
+ * Don't consider cache-hot tasks for affine wakeups
+ */
+SCHED_FEAT(NO_HOT_AFFINE, 1)
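
The lazy check in task_hot_lazy() can be illustrated with a small user-space
model.  This is a sketch under stated assumptions rather than the kernel code:
struct rq_model, struct task_model and the *_model() functions are
hypothetical, the clock is assumed to only move forward, and the scheduling
class, SCHED_IDLE and buddy checks of task_hot() are left out.  The point it
shows is that a stale clock can only make a task look hotter than it is, so a
"cold" answer from the stale value needs no refresh.

```c
#include <stdint.h>

/* Hypothetical stand-in for the runqueue clock state. */
struct rq_model {
	uint64_t clock_task;	/* cached, possibly stale clock (ns) */
	uint64_t now;		/* stand-in for the real time source (ns) */
	unsigned int updates;	/* number of clock refreshes performed */
};

/* Hypothetical stand-in for the task: only the last execution start. */
struct task_model {
	uint64_t exec_start;	/* ns */
};

/* Models sysctl_sched_migration_cost; 0.5ms as quoted in the changelog. */
int64_t migration_cost_ns = 500000;

/* Models update_rq_clock(): refresh the cached clock from the time source. */
void update_clock_model(struct rq_model *rq)
{
	rq->clock_task = rq->now;
	rq->updates++;
}

/* Time-based part of task_hot(): -1 means always hot, 0 means never hot. */
int task_hot_model(const struct task_model *p, uint64_t now)
{
	if (migration_cost_ns == -1)
		return 1;
	if (migration_cost_ns == 0)
		return 0;
	return (int64_t)(now - p->exec_start) < migration_cost_ns;
}

/* Only refresh the clock when the stale value still says "hot". */
int task_hot_lazy_model(const struct task_model *p, struct rq_model *rq)
{
	if (!task_hot_model(p, rq->clock_task))
		return 0;	/* cold even by the stale clock: no refresh */

	update_clock_model(rq);

	return task_hot_model(p, rq->clock_task);
}
```

In this model a task that last ran 0.9ms before the cached clock is reported
cold without touching the clock, while one that looks hot by the cached value
triggers exactly one refresh before the final answer.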




* Re: [wake_afine fixes/improvements 0/3] Introduction
  2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
                   ` (2 preceding siblings ...)
  2011-01-15  1:57 ` [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE) Paul Turner
@ 2011-01-15 14:29 ` Mike Galbraith
  2011-01-15 19:29   ` Paul Turner
  2011-01-15 21:34 ` Nick Piggin
  4 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2011-01-15 14:29 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Nick Piggin,
	Srivatsa Vaddagiri

On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
> I've been looking at the wake_affine path to improve the group scheduling case
> (wake affine performance for fair group sched has historically lagged) as well
> as tweaking performance in general.
> 
> The current series of patches is attached, the first of which should probably be
> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
> onto a previously (group) empty cpu.  While the others can be considered more
> forwards looking.
> 
> I've been using an rpc ping-pong workload which is known to be sensitive to poor affine
> decisions to benchmark these changes, I'm happy to run these patches against
> other workloads.  In particular improvements on reaim have been demonstrated,
> but since it's not as stable a benchmark the numbers are harder to present in
> a representative fashion.  Suggestions/pet benchmarks greatly appreciated
> here.
> 
> Some other things experimented with (but didn't pan out as a performance win):
> - Considering instantaneous load on prev_cpu as well as current_cpu
> - Using more gentle wl/wg values to reflect that a task's contribution to
> load_contribution is likely less than its weight.
> 
> Performance:
> 
> (throughput is measured in txn/s across a 5 minute interval, with a 30 second
> warmup)
> 
> tip (no group scheduling):
> throughput=57798.701988 reqs/sec.
> throughput=58098.876188 reqs/sec.
> 
> tip: (autogroup + current shares code and associated broken effective_load)
> throughput=49824.283179 reqs/sec.
> throughput=48527.942386 reqs/sec.
> 
> tip (autogroup + old tg_shares code): [parity goal post]
> throughput=57846.575060 reqs/sec.
> throughput=57626.442034 reqs/sec.
> 
> tip (autogroup + effective_load rewrite):
> throughput=58534.073595 reqs/sec.
> throughput=58068.072052 reqs/sec.
> 
> tip (autogroup + effective_load + no affine moves for hot tasks):
> throughput=60907.794697 reqs/sec.
> throughput=61208.305629 reqs/sec.

The effective_load() change is a humongous improvement for mysql+oltp.
The rest is iffy looking on my box with this load.

Looks like what will happen with NO_HOT_AFFINE if say two high frequency
ping pong players are perturbed such that one lands non-affine, it will
stay that way instead of recovering, because these will always be hot.
I haven't tested that though, pure rumination ;-)

mysql+oltp numbers

unpatched v2.6.37-7185-g52cfd50

clients              1          2          4          8         16         32         64        128        256
noautogroup   11084.37   20904.39   37356.65   36855.64   35395.45   35585.32   33343.44   28259.58   21404.18
              11025.94   20870.93   37272.99   36835.54   35367.92   35448.45   33422.20   28309.88   21285.18
              11076.00   20774.98   36847.44   36881.97   35295.35   35031.19   33490.84   28254.12   21307.13
1        avg  11062.10   20850.10   37159.02   36857.71   35352.90   35354.98   33418.82   28274.52   21332.16

autogroup     10963.27   20058.34   23567.63   29361.08   29111.98   29731.23   28563.18   24151.10   18163.00
              10754.92   19713.71   22983.43   28906.34   28576.12   30809.49   28384.14   24208.99   18057.34
              10990.27   19645.70   22193.71   29247.07   28763.53   30764.55   28912.45   24143.41   18002.07
2        avg  10902.82   19805.91   22914.92   29171.49   28817.21   30435.09   28619.92   24167.83   18074.13
vs 1              .985       .949       .616       .791       .815       .860       .856       .854       .847

patched v2.6.37-7185-g52cfd50

noautogroup   11095.73   20794.49   37062.81   36611.92   35444.55   35468.36   33463.56   28236.18   21255.67
              11035.59   20649.44   37304.91   36878.34   35331.63   35248.05   33424.15   28147.17   21370.39
              11077.88   20653.92   37207.26   37047.54   35441.78   35445.02   33469.31   28050.80   21306.89
         avg  11069.73   20699.28   37191.66   36845.93   35405.98   35387.14   33452.34   28144.71   21310.98
vs 1             1.000       .992      1.000       .999      1.001      1.000      1.001       .995       .999

noautogroup   10784.89   20304.49   37482.07   37251.63   35556.21   35116.93   32187.66   27839.60   21023.17
NO_HOT_AFFINE 10627.17   19835.43   37611.04   37168.37   35609.65   35289.32   32331.95   27598.50   21366.97
              10378.76   19998.29   37018.31   36888.67   35633.45   35277.39   32300.37   27896.24   21532.09
         avg  10596.94   20046.07   37370.47   37102.89   35599.77   35227.88   32273.32   27778.11   21307.41
vs 1              .957       .961      1.005      1.006      1.006       .996       .965       .982       .998

autogroup     10452.16   19547.57   36082.97   36653.02   35251.51   34099.80   31226.18   27274.91   20927.65
              10586.36   19931.37   36928.99   36640.64   35604.17   34238.38   31528.80   27412.44   20874.03
              10472.72   20143.83   36407.91   36715.85   35481.78   34332.42   31612.57   27357.18   21018.63
3        avg  10503.74   19874.25   36473.29   36669.83   35445.82   34223.53   31455.85   27348.17   20940.10
vs 1              .949       .953       .981       .994      1.002       .967       .941       .967       .981
vs 2              .963      1.003      1.591      1.257      1.230      1.124      1.099      1.131      1.158

autogroup     10276.41   19642.90   36790.86   36575.28   35326.89   34094.66   31626.82   27185.72   21017.51
NO_HOT_AFFINE 10305.91   20027.66   37017.90   36814.35   35452.63   34268.32   31399.49   27353.71   21039.37
              11013.96   19977.08   36984.17   36661.80   35393.99   34141.05   31246.47   26960.48   20873.94
         avg  10532.09   19882.54   36930.97   36683.81   35391.17   34168.01   31424.26   27166.63   20976.94
vs 1              .952       .953       .993       .995      1.001       .966       .940       .960       .983
vs 2              .965      1.003      1.611      1.257      1.228      1.122      1.097      1.124      1.160
vs 3             1.002      1.000      1.012      1.000       .998       .998       .998       .993      1.001




* Re: [wake_afine fixes/improvements 0/3] Introduction
  2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
@ 2011-01-15 19:29   ` Paul Turner
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Turner @ 2011-01-15 19:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Nick Piggin,
	Srivatsa Vaddagiri

On Sat, Jan 15, 2011 at 6:29 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Fri, 2011-01-14 at 17:57 -0800, Paul Turner wrote:
>> I've been looking at the wake_affine path to improve the group scheduling case
>> (wake affine performance for fair group sched has historically lagged) as well
>> as tweaking performance in general.
>>
>> The current series of patches is attached, the first of which should probably be
>> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
>> onto a previously (group) empty cpu.  While the others can be considered more
>> forwards looking.
>>
>> I've been using an rpc ping-pong workload which is known to be sensitive to poor affine
>> decisions to benchmark these changes, I'm happy to run these patches against
>> other workloads.  In particular improvements on reaim have been demonstrated,
>> but since it's not as stable a benchmark the numbers are harder to present in
>> a representative fashion.  Suggestions/pet benchmarks greatly appreciated
>> here.
>>
>> Some other things experimented with (but didn't pan out as a performance win):
>> - Considering instantaneous load on prev_cpu as well as current_cpu
>> - Using more gentle wl/wg values to reflect that a task's contribution to
>> load_contribution is likely less than its weight.
>>
>> Performance:
>>
>> (throughput is measured in txn/s across a 5 minute interval, with a 30 second
>> warmup)
>>
>> tip (no group scheduling):
>> throughput=57798.701988 reqs/sec.
>> throughput=58098.876188 reqs/sec.
>>
>> tip: (autogroup + current shares code and associated broken effective_load)
>> throughput=49824.283179 reqs/sec.
>> throughput=48527.942386 reqs/sec.
>>
>> tip (autogroup + old tg_shares code): [parity goal post]
>> throughput=57846.575060 reqs/sec.
>> throughput=57626.442034 reqs/sec.
>>
>> tip (autogroup + effective_load rewrite):
>> throughput=58534.073595 reqs/sec.
>> throughput=58068.072052 reqs/sec.
>>
>> tip (autogroup + effective_load + no affine moves for hot tasks):
>> throughput=60907.794697 reqs/sec.
>> throughput=61208.305629 reqs/sec.
>
> The effective_load() change is a humongous improvement for mysql+oltp.
> The rest is iffy looking on my box with this load.
>

Yes -- this one is definitely the priority; the other is more forward
looking, since we've had some good gains with it internally.

> Looks like what will happen with NO_HOT_AFFINE if say two high frequency
> ping pong players are perturbed such that one lands non-affine, it will
> stay that way instead of recovering, because these will always be hot.
> I haven't tested that though, pure rumination ;-)
>

This is a valid concern; the improvements we've seen have been with
many clients.  Thinking about it, I suspect a better option might be to
just increase the imbalance_pct required for a hot task rather than
blocking the move entirely.  Will try this.

> mysql+oltp numbers
>
> unpatched v2.6.37-7185-g52cfd50
>
> clients              1          2          4          8         16         32         64        128        256
> noautogroup   11084.37   20904.39   37356.65   36855.64   35395.45   35585.32   33343.44   28259.58   21404.18
>               11025.94   20870.93   37272.99   36835.54   35367.92   35448.45   33422.20   28309.88   21285.18
>               11076.00   20774.98   36847.44   36881.97   35295.35   35031.19   33490.84   28254.12   21307.13
> 1        avg  11062.10   20850.10   37159.02   36857.71   35352.90   35354.98   33418.82   28274.52   21332.16
>
> autogroup     10963.27   20058.34   23567.63   29361.08   29111.98   29731.23   28563.18   24151.10   18163.00
>               10754.92   19713.71   22983.43   28906.34   28576.12   30809.49   28384.14   24208.99   18057.34
>               10990.27   19645.70   22193.71   29247.07   28763.53   30764.55   28912.45   24143.41   18002.07
> 2        avg  10902.82   19805.91   22914.92   29171.49   28817.21   30435.09   28619.92   24167.83   18074.13
> vs 1              .985       .949       .616       .791       .815       .860       .856       .854       .847
>
> patched v2.6.37-7185-g52cfd50
>
> noautogroup   11095.73   20794.49   37062.81   36611.92   35444.55   35468.36   33463.56   28236.18   21255.67
>               11035.59   20649.44   37304.91   36878.34   35331.63   35248.05   33424.15   28147.17   21370.39
>               11077.88   20653.92   37207.26   37047.54   35441.78   35445.02   33469.31   28050.80   21306.89
>          avg  11069.73   20699.28   37191.66   36845.93   35405.98   35387.14   33452.34   28144.71   21310.98
> vs 1             1.000       .992      1.000       .999      1.001      1.000      1.001       .995       .999
>
> noautogroup   10784.89   20304.49   37482.07   37251.63   35556.21   35116.93   32187.66   27839.60   21023.17
> NO_HOT_AFFINE 10627.17   19835.43   37611.04   37168.37   35609.65   35289.32   32331.95   27598.50   21366.97
>               10378.76   19998.29   37018.31   36888.67   35633.45   35277.39   32300.37   27896.24   21532.09
>          avg  10596.94   20046.07   37370.47   37102.89   35599.77   35227.88   32273.32   27778.11   21307.41
> vs 1              .957       .961      1.005      1.006      1.006       .996       .965       .982       .998
>
> autogroup     10452.16   19547.57   36082.97   36653.02   35251.51   34099.80   31226.18   27274.91   20927.65
>               10586.36   19931.37   36928.99   36640.64   35604.17   34238.38   31528.80   27412.44   20874.03
>               10472.72   20143.83   36407.91   36715.85   35481.78   34332.42   31612.57   27357.18   21018.63
> 3        avg  10503.74   19874.25   36473.29   36669.83   35445.82   34223.53   31455.85   27348.17   20940.10
> vs 1              .949       .953       .981       .994      1.002       .967       .941       .967       .981
> vs 2              .963      1.003      1.591      1.257      1.230      1.124      1.099      1.131      1.158
>
> autogroup     10276.41   19642.90   36790.86   36575.28   35326.89   34094.66   31626.82   27185.72   21017.51
> NO_HOT_AFFINE 10305.91   20027.66   37017.90   36814.35   35452.63   34268.32   31399.49   27353.71   21039.37
>              11013.96   19977.08   36984.17   36661.80   35393.99   34141.05   31246.47   26960.48   20873.94
>         avg  10532.09   19882.54   36930.97   36683.81   35391.17   34168.01   31424.26   27166.63   20976.94
> vs 1              .952       .953       .993       .995      1.001       .966       .940       .960       .983
> vs 2              .965      1.003      1.611      1.257      1.228      1.122      1.097      1.124      1.160
> vs 3             1.002      1.000      1.012      1.000       .998       .998       .998       .993      1.001
>
>
>
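
For reference, the "avg" and "vs" rows in the table above can be
reproduced as the mean of the three runs and the ratio of two such means.
A minimal sketch, using the 1-client column of the unpatched kernel;
mean3() and ratio_vs() are hypothetical helper names, not part of any
benchmark tool.  The table appears to truncate the ratios to three
decimals rather than round them.

```c
#include <stdio.h>

/* Mean of three benchmark runs (one column of one table block). */
static double mean3(const double runs[3])
{
	return (runs[0] + runs[1] + runs[2]) / 3.0;
}

/* Ratio of a candidate configuration's mean to a baseline mean. */
static double ratio_vs(const double candidate[3], const double baseline[3])
{
	return mean3(candidate) / mean3(baseline);
}

/* 1-client column, unpatched kernel, copied from the table. */
static const double noautogroup_1[3] = { 11084.37, 11025.94, 11076.00 };
static const double autogroup_1[3]   = { 10963.27, 10754.92, 10990.27 };

/* Print both averages and the autogroup/noautogroup ratio. */
static void print_summary(void)
{
	printf("avg 1 = %.2f, avg 2 = %.2f, ratio = %.4f\n",
	       mean3(noautogroup_1), mean3(autogroup_1),
	       ratio_vs(autogroup_1, noautogroup_1));
}
```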


* Re: [wake_afine fixes/improvements 0/3] Introduction
  2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
                   ` (3 preceding siblings ...)
  2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
@ 2011-01-15 21:34 ` Nick Piggin
  4 siblings, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2011-01-15 21:34 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
	Nick Piggin, Srivatsa Vaddagiri

On Sat, Jan 15, 2011 at 12:57 PM, Paul Turner <pjt@google.com> wrote:
>
> I've been looking at the wake_affine path to improve the group scheduling case
> (wake affine performance for fair group sched has historically lagged) as well
> as tweaking performance in general.
>
> The current series of patches is attached, the first of which should probably be
> considered for 2.6.38 since it fixes a bug/regression in the case of waking up
> onto a previously (group) empty cpu.  While the others can be considered more
> forwards looking.
>
> I've been using an rpc ping-pong workload which is known be sensitive to poor affine
> decisions to benchmark these changes,

Not _necessarily_ the best thing to use :) As a sanity check maybe, but it
would be nice to have at least an improvement on one workload that somebody
actually uses (and then it's a matter of getting a lot more testing to see
it does not cause regressions on others that people use).


end of thread, other threads:[~2011-01-18 21:52 UTC | newest]

Thread overview: 12+ messages
2011-01-15  1:57 [wake_afine fixes/improvements 0/3] Introduction Paul Turner
2011-01-15  1:57 ` [wake_afine fixes/improvements 1/3] sched: update effective_load() to use global share weights Paul Turner
2011-01-17 14:11   ` Peter Zijlstra
2011-01-17 14:20     ` Peter Zijlstra
2011-01-18 19:04   ` [tip:sched/urgent] sched: Update " tip-bot for Paul Turner
2011-01-15  1:57 ` [wake_afine fixes/improvements 2/3] sched: clean up task_hot() Paul Turner
2011-01-17 14:14   ` Peter Zijlstra
2011-01-18 21:52     ` Paul Turner
2011-01-15  1:57 ` [wake_afine fixes/improvements 3/3] sched: introduce sched_feat(NO_HOT_AFFINE) Paul Turner
2011-01-15 14:29 ` [wake_afine fixes/improvements 0/3] Introduction Mike Galbraith
2011-01-15 19:29   ` Paul Turner
2011-01-15 21:34 ` Nick Piggin
