[patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
@ 2010-03-08 22:19 Suresh Siddha
  2010-03-08 22:19 ` [patch v2 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
  2010-03-31 10:25 ` [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Peter Zijlstra
  0 siblings, 2 replies; 15+ messages in thread
From: Suresh Siddha @ 2010-03-08 22:19 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra, Ingo Molnar
  Cc: Arjan van de Ven, linux-kernel, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy, Suresh Siddha

[-- Attachment #1: fix_wake_affine.patch --]
[-- Type: text/plain, Size: 2481 bytes --]

On a single cpu system with SMT, in the scenario of one SMT thread being
idle with another SMT thread running a task and doing a non sync wakeup of
another task, we see (from the traces) that the woken up task ends up running
on the busy thread instead of the idle thread. Idle balancing that comes
in little bit later is fixing the scernaio.

But fixing this wake balance and running the woken up task directly on the
idle SMT thread improved the performance (phoronix 7zip compression workload)
by ~9% on an atom platform.

During the process wakeup, select_task_rq_fair() and wake_affine() makes
the decision to wakeup the task either on the previous cpu that the task
ran or the cpu that the task is currently woken up.

select_task_rq_fair() also goes through to see if there are any idle siblings
for the cpu that the task is woken up on. This is to ensure that we select
any idle sibling rather than choose a busy cpu.

In the above load scenario, it so happens that the prev_cpu (that the
task ran before) and this_cpu (where it is woken up currently) are the same. And
in this case, it looks like wake_affine() returns 0 and ultimately not selecting
the idle sibling chosen by select_idle_sibling() in select_task_rq_fair().
Further down the path of select_task_rq_fair(), we ultimately select
the currently running cpu (busy SMT thread instead of the idle SMT thread).

Check for prev_cpu == this_cpu before calling wake_affine() and no need to do
any fancy stuff(and ultimately taking wrong decisions) in this case.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
Changes from v1:
   Move the "this_cpu == prev_cpu" check before calling wake_affine()
---
 kernel/sched_fair.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1454,6 +1454,7 @@ static int select_task_rq_fair(struct ta
 	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
+	int this_cpu = cpu;

 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (sched_feat(AFFINE_WAKEUPS) &&
@@ -1545,8 +1546,10 @@ static int select_task_rq_fair(struct ta
 			update_shares(tmp);
 	}

-	if (affine_sd && wake_affine(affine_sd, p, sync))
-		return cpu;
+	if (affine_sd) {
+		if (this_cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+			return cpu;
+	}

 	while (sd) {
 		int load_idx = sd->forkexec_idx;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [patch v2 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-08 22:19 [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Suresh Siddha
@ 2010-03-08 22:19 ` Suresh Siddha
  2010-03-31 10:25 ` [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2010-03-08 22:19 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra, Ingo Molnar
  Cc: Arjan van de Ven, linux-kernel, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy, Suresh Siddha

[-- Attachment #1: fix_select_idle_sibling.patch --]
[-- Type: text/plain, Size: 4350 bytes --]

Address these issues in the current select_idle_sibling() logic.

a) Once we select the idle sibling, we use that domain (spanning the cpu that
   the task is currently woken-up and the idle sibling that we found) in our
   wake_affine() comparisons. This domain is completely different from the
   domain(we are supposed to use) that spans the cpu that the task currently
   woken-up and the cpu where the task previously ran.

b) We do select_idle_sibling() check only for the cpu that the task is
   currently woken-up on. If select_task_rq_fair() selects the previously run
   cpu for waking the task no, doing a select_idle_sibling() check
   for that cpu also helps and we don't do this currently.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
Changes from v1:
   Drop treating the current cpu with one running task as an idle cpu in the
   presence of sync wakeup.
---
 kernel/sched_fair.c |   69 +++++++++++++++++++++++++++-------------------------
 1 file changed, 36 insertions(+), 33 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1407,28 +1407,48 @@ find_idlest_cpu(struct sched_group *grou
  * Try and locate an idle CPU in the sched_domain.
  */
 static int
-select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int i;
+	struct sched_domain *sd;
 
 	/*
-	 * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
-	 * test in select_task_rq_fair) and the prev_cpu is idle then that's
-	 * always a better target than the current cpu.
+	 * If the task is going to be woken-up on this cpu and if it is
+	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+	if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
+		return cpu;
+
+	/*
+	 * If the task is going to be woken-up on the cpu where it previously
+	 * ran and if it is currently idle, then it the right target.
+	 */
+	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
 		return prev_cpu;
 
 	/*
-	 * Otherwise, iterate the domain and find an elegible idle cpu.
+	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
-		if (!cpu_rq(i)->cfs.nr_running) {
-			target = i;
+	for_each_domain(target, sd) {
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
+
+		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+			if (!cpu_rq(i)->cfs.nr_running) {
+				target = i;
+				break;
+			}
 		}
+
+		/*
+		 * Lets stop looking for an idle sibling when we reached
+		 * the domain that spans the current cpu and prev_cpu.
+		 */
+		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+			break;
 	}
 
 	return target;
@@ -1491,34 +1511,15 @@ static int select_task_rq_fair(struct ta
 				want_sd = 0;
 		}
 
-		/*
-		 * While iterating the domains looking for a spanning
-		 * WAKE_AFFINE domain, adjust the affine target to any idle cpu
-		 * in cache sharing domains along the way.
-		 */
 		if (want_affine) {
-			int target = -1;
-
 			/*
 			 * If both cpu and prev_cpu are part of this domain,
 			 * cpu is a valid SD_WAKE_AFFINE target.
 			 */
-			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
-				target = cpu;
-
-			/*
-			 * If there's an idle sibling in this domain, make that
-			 * the wake_affine target instead of the current cpu.
-			 */
-			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
-				target = select_idle_sibling(p, tmp, target);
-
-			if (target >= 0) {
-				if (tmp->flags & SD_WAKE_AFFINE) {
-					affine_sd = tmp;
-					want_affine = 0;
-				}
-				cpu = target;
+			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
+			    && (tmp->flags & SD_WAKE_AFFINE)) {
+				affine_sd = tmp;
+				want_affine = 0;
 			}
 		}
 
@@ -1548,7 +1549,9 @@ static int select_task_rq_fair(struct ta
 
 	if (affine_sd) {
 		if (this_cpu == prev_cpu || wake_affine(affine_sd, p, sync))
-			return cpu;
+			return select_idle_sibling(p, this_cpu);
+		else if (!sd)
+			return select_idle_sibling(p, prev_cpu);
 	}
 
 	while (sd) {



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-03-08 22:19 [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Suresh Siddha
  2010-03-08 22:19 ` [patch v2 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
@ 2010-03-31 10:25 ` Peter Zijlstra
  2010-03-31 23:47   ` Suresh Siddha
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-03-31 10:25 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Mike Galbraith, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Mon, 2010-03-08 at 14:19 -0800, Suresh Siddha wrote:
> plain text document attachment (fix_wake_affine.patch)
> On a single cpu system with SMT, in the scenario of one SMT thread being
> idle with another SMT thread running a task and doing a non sync wakeup of
> another task, we see (from the traces) that the woken up task ends up running
> on the busy thread instead of the idle thread. Idle balancing that comes
> in little bit later is fixing the scernaio.
> 
> But fixing this wake balance and running the woken up task directly on the
> idle SMT thread improved the performance (phoronix 7zip compression workload)
> by ~9% on an atom platform.
> 
> During the process wakeup, select_task_rq_fair() and wake_affine() makes
> the decision to wakeup the task either on the previous cpu that the task
> ran or the cpu that the task is currently woken up.
> 
> select_task_rq_fair() also goes through to see if there are any idle siblings
> for the cpu that the task is woken up on. This is to ensure that we select
> any idle sibling rather than choose a busy cpu.
> 
> In the above load scenario, it so happens that the prev_cpu (that the
> task ran before) and this_cpu (where it is woken up currently) are the same. And
> in this case, it looks like wake_affine() returns 0 and ultimately not selecting
> the idle sibling chosen by select_idle_sibling() in select_task_rq_fair().
> Further down the path of select_task_rq_fair(), we ultimately select
> the currently running cpu (busy SMT thread instead of the idle SMT thread).
> 
> Check for prev_cpu == this_cpu before calling wake_affine() and no need to do
> any fancy stuff(and ultimately taking wrong decisions) in this case.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
> Changes from v1:
>    Move the "this_cpu == prev_cpu" check before calling wake_affine()
> ---
>  kernel/sched_fair.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1454,6 +1454,7 @@ static int select_task_rq_fair(struct ta
>  	int want_affine = 0;
>  	int want_sd = 1;
>  	int sync = wake_flags & WF_SYNC;
> +	int this_cpu = cpu;
>  
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		if (sched_feat(AFFINE_WAKEUPS) &&
> @@ -1545,8 +1546,10 @@ static int select_task_rq_fair(struct ta
>  			update_shares(tmp);
>  	}
>  
> -	if (affine_sd && wake_affine(affine_sd, p, sync))
> -		return cpu;
> +	if (affine_sd) {
> +		if (this_cpu == prev_cpu || wake_affine(affine_sd, p, sync))
> +			return cpu;
> +	}
>  
>  	while (sd) {
>  		int load_idx = sd->forkexec_idx;
> 

Right, so we since merged 8b911acd, in which Mike did almost this but
not quite, the question is over: cpu == prev_cpu vs this_cpu ==
prev_cpu.

Mike seems to see some workloads regress with the this_cpu check, does
your workload work with the cpu == prev_cpu one?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-03-31 10:25 ` [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Peter Zijlstra
@ 2010-03-31 23:47   ` Suresh Siddha
  2010-04-01  5:32     ` Mike Galbraith
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Suresh Siddha @ 2010-03-31 23:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Wed, 2010-03-31 at 03:25 -0700, Peter Zijlstra wrote: 
> Right, so we since merged 8b911acd, in which Mike did almost this but
> not quite, the question is over: cpu == prev_cpu vs this_cpu ==
> prev_cpu.
> 
> Mike seems to see some workloads regress with the this_cpu check, does
> your workload work with the cpu == prev_cpu one?

Mike saw a regression with the sync check that was in the previous
version (v1). Anyways, the current code in -tip has the check that I
wanted and which addresses the netbook (2 SMT cpu's) performance issue.

But the current logic in select_task_rq_fair() is not quite correct,
especially we can wake the task on a busy core rather than on an idle
core, as the latest changes are making the wake up decisions entirely on
an idle HT sibling if there is one.

Also there are couple of more issues which I have explained in the
previous version of the patch. I have updated my patch on top of the
latest -tip, which addresses all these issues. Let me know your
thoughts. Thanks.

--- 
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: fix select_idle_sibling() logic in select_task_rq_fair()

Issues in the current select_idle_sibling() logic in select_task_rq_fair()
in the context of a task wake-up:

a) Once we select the idle sibling, we use that domain (spanning the cpu that
   the task is currently woken-up and the idle sibling that we found) in our
   wake_affine() decisions. This domain is completely different from the
   domain(we are supposed to use) that spans the cpu that the task currently
   woken-up and the cpu where the task previously ran.

b) We do select_idle_sibling() check only for the cpu that the task is
   currently woken-up on. If select_task_rq_fair() selects the previously run
   cpu for waking the task, doing a select_idle_sibling() check
   for that cpu also helps and we don't do this currently.

c) In the scenarios where the cpu that the task is woken-up is busy but
   with its HT siblings are idle, we are selecting the task be woken-up
   on the idle HT sibling instead of a core that it previously ran
   and currently completely idle. i.e., we are not taking decisions based on
   wake_affine() but directly selecting an idle sibling that can cause
   an imbalance at the SMT/MC level which will be later corrected by the
   periodic load balancer.

Fix this by first going through the load imbalance calculations using
wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
then choose a possible idle sibling for waking up the task on.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 49ad993..f905a4b 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1385,28 +1385,48 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
  * Try and locate an idle CPU in the sched_domain.
  */
 static int
-select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int i;
+	struct sched_domain *sd;
 
 	/*
-	 * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
-	 * test in select_task_rq_fair) and the prev_cpu is idle then that's
-	 * always a better target than the current cpu.
+	 * If the task is going to be woken-up on this cpu and if it is
+	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+	if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
+		return cpu;
+
+	/*
+	 * If the task is going to be woken-up on the cpu where it previously
+	 * ran and if it is currently idle, then it the right target.
+	 */
+	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
 		return prev_cpu;
 
 	/*
-	 * Otherwise, iterate the domain and find an elegible idle cpu.
+	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
-		if (!cpu_rq(i)->cfs.nr_running) {
-			target = i;
+	for_each_domain(target, sd) {
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
+
+		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+			if (!cpu_rq(i)->cfs.nr_running) {
+				target = i;
+				break;
+			}
 		}
+
+		/*
+		 * Lets stop looking for an idle sibling when we reached
+		 * the domain that spans the current cpu and prev_cpu.
+		 */
+		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+			break;
 	}
 
 	return target;
@@ -1429,7 +1449,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
-	int want_affine = 0, cpu_idle = !current->pid;
+	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
@@ -1467,36 +1487,15 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 				want_sd = 0;
 		}
 
-		/*
-		 * While iterating the domains looking for a spanning
-		 * WAKE_AFFINE domain, adjust the affine target to any idle cpu
-		 * in cache sharing domains along the way.
-		 */
 		if (want_affine) {
-			int target = -1;
-
 			/*
 			 * If both cpu and prev_cpu are part of this domain,
 			 * cpu is a valid SD_WAKE_AFFINE target.
 			 */
-			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
-				target = cpu;
-
-			/*
-			 * If there's an idle sibling in this domain, make that
-			 * the wake_affine target instead of the current cpu.
-			 */
-			if (!cpu_idle && tmp->flags & SD_SHARE_PKG_RESOURCES)
-				target = select_idle_sibling(p, tmp, target);
-
-			if (target >= 0) {
-				if (tmp->flags & SD_WAKE_AFFINE) {
-					affine_sd = tmp;
-					want_affine = 0;
-					if (target != cpu)
-						cpu_idle = 1;
-				}
-				cpu = target;
+			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
+			    && (tmp->flags & SD_WAKE_AFFINE)) {
+				affine_sd = tmp;
+				want_affine = 0;
 			}
 		}
 
@@ -1527,8 +1526,10 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 #endif
 
 	if (affine_sd) {
-		if (cpu_idle || cpu == prev_cpu || wake_affine(affine_sd, p, sync))
-			return cpu;
+		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+			return select_idle_sibling(p, cpu);
+		else
+			return select_idle_sibling(p, prev_cpu);
 	}
 
 	while (sd) {



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-03-31 23:47   ` Suresh Siddha
@ 2010-04-01  5:32     ` Mike Galbraith
  2010-04-01 21:04       ` Suresh Siddha
  2010-04-20  8:46     ` Peter Zijlstra
  2010-04-23 10:50     ` [tip:sched/core] sched: Fix select_idle_sibling() logic in select_task_rq_fair() tip-bot for Suresh Siddha
  2 siblings, 1 reply; 15+ messages in thread
From: Mike Galbraith @ 2010-04-01  5:32 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Wed, 2010-03-31 at 16:47 -0700, Suresh Siddha wrote:

> Issues in the current select_idle_sibling() logic in select_task_rq_fair()
> in the context of a task wake-up:
> 
> a) Once we select the idle sibling, we use that domain (spanning the cpu that
>    the task is currently woken-up and the idle sibling that we found) in our
>    wake_affine() decisions. This domain is completely different from the
>    domain(we are supposed to use) that spans the cpu that the task currently
>    woken-up and the cpu where the task previously ran.

Why does that matter?  If we find an idle shared cache cpu before we hit
the spanning domain, we don't use affine_sd other than maybe (unlikely)
for updating group scheduler shares.

> b) We do select_idle_sibling() check only for the cpu that the task is
>    currently woken-up on. If select_task_rq_fair() selects the previously run
>    cpu for waking the task, doing a select_idle_sibling() check
>    for that cpu also helps and we don't do this currently.

True, but that costs too.  Those idle checks aren't cheap.

> c) In the scenarios where the cpu that the task is woken-up is busy but
>    with its HT siblings are idle, we are selecting the task be woken-up
>    on the idle HT sibling instead of a core that it previously ran
>    and currently completely idle. i.e., we are not taking decisions based on
>    wake_affine() but directly selecting an idle sibling that can cause
>    an imbalance at the SMT/MC level which will be later corrected by the
>    periodic load balancer.

Yes, the pressing decision for this one wakeup is can we wake to a
shared cache and thus avoid cache misses.

IMHO, the point of the affinity decision isn't instant perfect balance,
it's cache affinity if at all possible without wrecking balance.  Load
balancing moves tasks for optimal CPU utilization, tasks waking each
other pull to a shared domain.. a tug-of-war that balances buddies over
time.  wake_affine()'s job is only to say "no, leave it where it was for
now".  I don't see any reason to ask wake_affine()'s opinion about an
idle CPU.  We paid for idle shared cache knowledge.

We certainly wouldn't want to leave the wakee on it's previous CPU only
because that CPU is idle, it would have to be idle and sharing cache.

That said, Nehalem may ramp better with select_idle_sibling() turned off
at the HT level, and ramp was it's motivation.  Maybe you could continue
checking until out of shared cache country, but that's more expensive.

The logic may not be perfect, but it really needs to become cheaper, not
more expensive.

	-Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-01  5:32     ` Mike Galbraith
@ 2010-04-01 21:04       ` Suresh Siddha
  2010-04-02  6:20         ` Mike Galbraith
  0 siblings, 1 reply; 15+ messages in thread
From: Suresh Siddha @ 2010-04-01 21:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Wed, 2010-03-31 at 22:32 -0700, Mike Galbraith wrote:
> On Wed, 2010-03-31 at 16:47 -0700, Suresh Siddha wrote:
> 
> > Issues in the current select_idle_sibling() logic in select_task_rq_fair()
> > in the context of a task wake-up:
> > 
> > a) Once we select the idle sibling, we use that domain (spanning the cpu that
> >    the task is currently woken-up and the idle sibling that we found) in our
> >    wake_affine() decisions. This domain is completely different from the
> >    domain(we are supposed to use) that spans the cpu that the task currently
> >    woken-up and the cpu where the task previously ran.
> 
> Why does that matter?  If we find an idle shared cache cpu before we hit
> the spanning domain, we don't use affine_sd other than maybe (unlikely)
> for updating group scheduler shares.

Ok. This is not a big issue with the new idle cpu change, as atleast we
don't endup calling wake_affine() with the wrong sd. I have never tried
to understand any code surrounded by CONFIG_FAIR_GROUP_SCHED so can't
comment if the using affine_sd for updating group scheduler shares is
correct or not. But please look below for the issues with selecting the
idle sibling right away.

> 
> > b) We do select_idle_sibling() check only for the cpu that the task is
> >    currently woken-up on. If select_task_rq_fair() selects the previously run
> >    cpu for waking the task, doing a select_idle_sibling() check
> >    for that cpu also helps and we don't do this currently.
> 
> True, but that costs too.  Those idle checks aren't cheap.

Just like the current code, my patch is doing the idle checks only once.
Current code is doing idle checks for the woken-up cpu and my code is
first selecting woken-up vs previously-ran and then doing idle sibling
checks . So don't expect to see much cost increase.

> 
> > c) In the scenarios where the cpu that the task is woken-up is busy but
> >    with its HT siblings are idle, we are selecting the task be woken-up
> >    on the idle HT sibling instead of a core that it previously ran
> >    and currently completely idle. i.e., we are not taking decisions based on
> >    wake_affine() but directly selecting an idle sibling that can cause
> >    an imbalance at the SMT/MC level which will be later corrected by the
> >    periodic load balancer.
> 
> Yes, the pressing decision for this one wakeup is can we wake to a
> shared cache and thus avoid cache misses.

Last level cache sharing is much more important than small L1 and mid
level caches. Also performance impact of keeping both the threads on a
core busy in the context of an idle core and then periodic balancer
coming in and correcting this is more costly.

> IMHO, the point of the affinity decision isn't instant perfect balance,
> it's cache affinity if at all possible without wrecking balance.

For not wrecking balance we should do the wake_balance() and based on
that decision, do the select_idle_sibling() for selecting an idle cpu in
that cache affinity. Current code in -tip is opposite of this.

> Load balancing moves tasks for optimal CPU utilization, tasks waking each
> other pull to a shared domain.. a tug-of-war that balances buddies over
> time.  
>
> wake_affine()'s job is only to say "no, leave it where it was for
> now".  I don't see any reason to ask wake_affine()'s opinion about an
> idle CPU.  We paid for idle shared cache knowledge.
> 
> We certainly wouldn't want to leave the wakee on it's previous CPU only
> because that CPU is idle, it would have to be idle and sharing cache.

Consider this scenario. Today we do balance on fork() and exec(). This
will cause the tasks to start far away. On systems like NHM-EP, tasks
will start on two different sockets/nodes(as each socket is a numa node)
and allocate their memory locally etc. Task A starting on Node-0 and
Task B starting on Node-1. Once task B sleeps and if Task A or something
else wakes up task B on Node-0, (with the recent change) just because
there is an idle HT sibling on node-0 we endup waking the task on
node-0. This is wrong. We should first atleast go through wake_affine()
and if wake_affine() says ok to move the task to node-0, then we can
look at the cache siblings for node-0 and select an appropriate cpu.

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-01 21:04       ` Suresh Siddha
@ 2010-04-02  6:20         ` Mike Galbraith
  2010-04-02 17:05           ` Suresh Siddha
  2010-04-14 20:45           ` Suresh Siddha
  0 siblings, 2 replies; 15+ messages in thread
From: Mike Galbraith @ 2010-04-02  6:20 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Thu, 2010-04-01 at 14:04 -0700, Suresh Siddha wrote:

> Consider this scenario. Today we do balance on fork() and exec(). This
> will cause the tasks to start far away. On systems like NHM-EP, tasks
> will start on two different sockets/nodes(as each socket is a numa node)
> and allocate their memory locally etc. Task A starting on Node-0 and
> Task B starting on Node-1. Once task B sleeps and if Task A or something
> else wakes up task B on Node-0, (with the recent change) just because
> there is an idle HT sibling on node-0 we endup waking the task on
> node-0. This is wrong. We should first atleast go through wake_affine()
> and if wake_affine() says ok to move the task to node-0, then we can
> look at the cache siblings for node-0 and select an appropriate cpu.

Yes, if task A and task B are more or less unrelated, you'd want them to
stay in separate domains, you'd not want some random event to pull.  The
other side of the coin is tasks which fork off partners that they will
talk to at high frequency.  They land just as far away, and desperately
need to move into a shared cache domain.  There's currently no
discriminator, so while always asking wake_affine() may reduce the risk
of moving a task with a large footprint, it also increases the risk of
leaving buddies jabbering cross cache.  You can tweak it in either
direction, and neither can be called "wrong", it's all compromise.

Do you have a compute load bouncing painfully which this patch cures?

I have no strong objections, and the result is certainly easier on the
eye.  If I were making the decision, I'd want to see some numbers.

	-Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-02  6:20         ` Mike Galbraith
@ 2010-04-02 17:05           ` Suresh Siddha
  2010-04-02 19:43             ` Mike Galbraith
  2010-04-14 20:45           ` Suresh Siddha
  1 sibling, 1 reply; 15+ messages in thread
From: Suresh Siddha @ 2010-04-02 17:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> Yes, if task A and task B are more or less unrelated, you'd want them to
> stay in separate domains, you'd not want some random event to pull.  The
> other side of the coin is tasks which fork off partners that they will
> talk to at high frequency.  They land just as far away, and desperately
> need to move into a shared cache domain.  There's currently no
> discriminator, so while always asking wake_affine() may reduce the risk
> of moving a task with a large footprint, it also increases the risk of
> leaving buddies jabbering cross cache.

Mike, Apart from this small tweak that you added in wake_up() path there
is no extra logic that keeps buddies together for long. As I was saying,
fork/exec balance starts apart and in the partial loaded case (i.e.,
when # of running tasks <= # of sockets or # of total cores) the default
load balancer policy also tries to distribute the load equally among
sockets/cores (for peak cache/memory controller bw etc). While the
wakeup() may keep the buddies on SMT siblings, next load balancing event
will move them far away. If we need to keep buddies together we need
more changes than this small tweak.

> Do you have a compute load bouncing painfully which this patch cures?
> 
> I have no strong objections, and the result is certainly easier on the
> eye.  If I were making the decision, I'd want to see some numbers.

All I saw in the changelog when you added this new tweak was:

commit 8b911acdf08477c059d1c36c21113ab1696c612b
Author: Mike Galbraith <efault@gmx.de>
Date:   Thu Mar 11 17:17:16 2010 +0100

    sched: Fix select_idle_sibling()
    
    Don't bother with selection when the current cpu is idle. ....

Is it me or you who need to provide the data for justification for your
new tweak that changes the current behavior ;)

I will run some workloads aswell!

thanks,
suresh


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-02 17:05           ` Suresh Siddha
@ 2010-04-02 19:43             ` Mike Galbraith
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2010-04-02 19:43 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-04-02 at 10:05 -0700, Suresh Siddha wrote:
> On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> > Yes, if task A and task B are more or less unrelated, you'd want them to
> > stay in separate domains, you'd not want some random event to pull.  The
> > other side of the coin is tasks which fork off partners that they will
> > talk to at high frequency.  They land just as far away, and desperately
> > need to move into a shared cache domain.  There's currently no
> > discriminator, so while always asking wake_affine() may reduce the risk
> > of moving a task with a large footprint, it also increases the risk of
> > leaving buddies jabbering cross cache.
> 
> Mike, Apart from this small tweak that you added in wake_up() path there
> is no extra logic that keeps buddies together for long.

The wakeup logic is the only thing that keeps buddies together.

>  As I was saying,
> fork/exec balance starts apart and in the partial loaded case (i.e.,
> when # of running tasks <= # of sockets or # of total cores) the default
> load balancer policy also tries to distribute the load equally among
> sockets/cores (for peak cache/memory controller bw etc). While the
> wakeup() may keep the buddies on SMT siblings, next load balancing event
> will move them far away. If we need to keep buddies together we need
> more changes than this small tweak.

We very definitely need to keep buddies cache affine.  Yes, maybe some
additional logic is needed, as there is a conflict between load types.
A kbuild spreading to the four winds is fine, while netperf jabbering
cross cache is horrible.

> > Do you have a compute load bouncing painfully which this patch cures?
> > 
> > I have no strong objections, and the result is certainly easier on the
> > eye.  If I were making the decision, I'd want to see some numbers.
> 
> All I saw in the changelog when you added this new tweak was:
> 
> commit 8b911acdf08477c059d1c36c21113ab1696c612b
> Author: Mike Galbraith <efault@gmx.de>
> Date:   Thu Mar 11 17:17:16 2010 +0100
> 
>     sched: Fix select_idle_sibling()
>     
>     Don't bother with selection when the current cpu is idle. ....
> 
> Is it me or you who need to provide the data for justification for your
> new tweak that changes the current behavior ;)
> 
> I will run some workloads aswell!

Whoa.  It was a simple question, no need to get defensive.  You need not
provide anything.  Forget I even asked, it's not my decision.

	-Mike


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-02  6:20         ` Mike Galbraith
  2010-04-02 17:05           ` Suresh Siddha
@ 2010-04-14 20:45           ` Suresh Siddha
  2010-04-15  5:17             ` Mike Galbraith
  1 sibling, 1 reply; 15+ messages in thread
From: Suresh Siddha @ 2010-04-14 20:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> Do you have a compute load bouncing painfully which this patch cures?
> 
> I have no strong objections, and the result is certainly easier on the
> eye.  If I were making the decision, I'd want to see some numbers.

Mike, PeterZ,

Finally got sometime to get back to this and provide some data backing
up my patch under discussion. Here are my test results:

System is a two socket quad-core NHM-EP with SMT enabled and the
workload is specjbb2005.

Warehouses    	Throughput
	     tip   tip+proposed-fix
    1	    35142   35027
    2	    73563   75977
    3	   105806  109836
    4	   133421  142490
    5	   152151  168888
    6	   164936  195392
    7	   184763  208155
    8	   192419  223846

PeterZ, I think the above clearly shows that we have a problem with the
current -tip code. Please consider the proposed patch (which can be
found at http://marc.info/?l=linux-kernel&m=127007936408754&w=2)

Mike, with the above data, can I have your Ack for the patch?

thanks,
suresh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-14 20:45           ` Suresh Siddha
@ 2010-04-15  5:17             ` Mike Galbraith
  0 siblings, 0 replies; 15+ messages in thread
From: Mike Galbraith @ 2010-04-15  5:17 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Wed, 2010-04-14 at 13:45 -0700, Suresh Siddha wrote:
> On Thu, 2010-04-01 at 23:20 -0700, Mike Galbraith wrote:
> > Do you have a compute load bouncing painfully which this patch cures?
> > 
> > I have no strong objections, and the result is certainly easier on the
> > eye.  If I were making the decision, I'd want to see some numbers.
> 
> Mike, PeterZ,
> 
> Finally got sometime to get back to this and provide some data backing
> up my patch under discussion. Here are my test results:
> 
> System is a two socket quad-core NHM-EP with SMT enabled and the
> workload is specjbb2005.
> 
> Warehouses    	Throughput
> 	     tip   tip+proposed-fix
>     1	    35142   35027
>     2	    73563   75977
>     3	   105806  109836
>     4	   133421  142490
>     5	   152151  168888
>     6	   164936  195392
>     7	   184763  208155
>     8	   192419  223846
> 
> PeterZ, I think the above clearly shows that we have a problem with the
> current -tip code. Please consider the proposed patch (which can be
> found at http://marc.info/?l=linux-kernel&m=127007936408754&w=2)
> 
> Mike, with the above data, can I have your Ack for the patch?

Sure (not that you need it).  A bit of abbreviated testing this morning
showed no big hairy differences.  A bit of loss to very fast switchers,
but OTOH it improved ramp a bit for mysql/pgsql+oltp.

tip = v2.6.34-rc4-937-gba0b2c9
tip-x = tip + your patches

netperf TCP_RR
unpinned
tip          102877.39   102860.66   103210.25   avg 102982.76   1.000
tip-x        100926.59   100380.26   100536.35   avg 100614.40    .977

pinned
tip          100181.70   100288.34    99711.16   avg 100060.40   1.000
tip-x         99347.12   100551.80    99827.22   avg  99908.71    .998

tbench 8
tip          1195.51   1194.49   1197.46   avg 1195.82   1.000
tip-x        1183.19   1188.00   1188.13   avg 1186.44    .992

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10424.24   20417.42   37151.84   36777.79   36442.06   36122.14   33803.15   30109.81   28368.75
             11196.80   20443.43   37560.39   37358.14   36822.26   36132.95   34411.88   30349.47   28909.04
             11204.02   20513.21   37550.11   37219.52   36840.44   36150.47   34445.02   30975.95   28840.59
tip avg      10941.68   20458.02   37420.78   37118.48   36701.58   36135.18   34220.01   30478.41   28706.12

tip-x        10331.62   20981.89   36768.45   36556.19   36069.56   35612.24   34100.70   30459.80   29043.76
             11101.77   21153.77   37622.83   37228.68   36845.37   36256.04   34606.37   31287.36   28635.98
             11076.66   21153.41   37703.80   37242.89   36842.93   36296.27   34733.06   30578.00   29313.10
tip-x avg    10836.68   21096.35   37365.02   37009.25   36585.95   36054.85   34480.04   30775.05   28997.61
vs tip           .990      1.031       .998       .997       .996       .997      1.007      1.009      1.010

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          14683.30   30452.18   53826.68   53793.77   52931.67   51959.47   51326.82   49205.13   46884.50
             16154.17   30811.46   54277.20   53916.07   52792.89   52031.80   50967.99   48757.04   46259.84
             16151.36   29913.89   54071.21   53499.94   52707.50   51867.73   50887.40   49310.96   46544.42
tip avg      15662.94   30392.51   54058.36   53736.59   52810.68   51953.00   51060.73   49091.04   46562.92

tip-x        14641.30   31402.11   54267.43   53835.83   53024.57   51992.08   50336.10   49083.58   46662.24
             16167.44   31427.53   54008.47   53685.56   52709.81   52131.85   50848.08   48824.32   45973.29
             16259.76   31381.42   54327.51   53791.91   52857.70   51912.42   50941.33   49028.19   45938.22
tip-x avg    15689.50   31403.68   54201.13   53771.10   52864.02   52012.11   50708.50   48978.69   46191.25
vs tip          1.001      1.033      1.002      1.000      1.001      1.001       .993       .997       .992



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-03-31 23:47   ` Suresh Siddha
  2010-04-01  5:32     ` Mike Galbraith
@ 2010-04-20  8:46     ` Peter Zijlstra
  2010-04-20  8:55       ` Peter Zijlstra
  2010-04-23 10:50     ` [tip:sched/core] sched: Fix select_idle_sibling() logic in select_task_rq_fair() tip-bot for Suresh Siddha
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-04-20  8:46 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Mike Galbraith, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Wed, 2010-03-31 at 16:47 -0700, Suresh Siddha wrote:

> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched: fix select_idle_sibling() logic in select_task_rq_fair()
> 
> Issues in the current select_idle_sibling() logic in select_task_rq_fair()
> in the context of a task wake-up:
> 
> a) Once we select the idle sibling, we use that domain (spanning the cpu that
>    the task is currently woken-up and the idle sibling that we found) in our
>    wake_affine() decisions. This domain is completely different from the
>    domain(we are supposed to use) that spans the cpu that the task currently
>    woken-up and the cpu where the task previously ran.
> 
> b) We do select_idle_sibling() check only for the cpu that the task is
>    currently woken-up on. If select_task_rq_fair() selects the previously run
>    cpu for waking the task, doing a select_idle_sibling() check
>    for that cpu also helps and we don't do this currently.
> 
> c) In the scenarios where the cpu that the task is woken-up is busy but
>    with its HT siblings are idle, we are selecting the task be woken-up
>    on the idle HT sibling instead of a core that it previously ran
>    and currently completely idle. i.e., we are not taking decisions based on
>    wake_affine() but directly selecting an idle sibling that can cause
>    an imbalance at the SMT/MC level which will be later corrected by the
>    periodic load balancer.
> 
> Fix this by first going through the load imbalance calculations using
> wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
> then choose a possible idle sibling for waking up the task on.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

OK, so I'm going to take this, but I had one concern, see below:

> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 49ad993..f905a4b 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1385,28 +1385,48 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   * Try and locate an idle CPU in the sched_domain.
>   */
>  static int
> +select_idle_sibling(struct task_struct *p, int target)
>  {
>  	int cpu = smp_processor_id();
>  	int prev_cpu = task_cpu(p);
>  	int i;
> +	struct sched_domain *sd;
>  
>  	/*
> +	 * If the task is going to be woken-up on this cpu and if it is
> +	 * already idle, then it is the right target.
>  	 */
> +	if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
> +		return cpu;
> +
> +	/*
> +	 * If the task is going to be woken-up on the cpu where it previously
> +	 * ran and if it is currently idle, then it the right target.
> +	 */
> +	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
>  		return prev_cpu;
>  
>  	/*
> +	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
> +	for_each_domain(target, sd) {
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
>  			break;
> +
> +		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> +			if (!cpu_rq(i)->cfs.nr_running) {
> +				target = i;
> +				break;
> +			}
>  		}
> +
> +		/*
> +		 * Lets stop looking for an idle sibling when we reached
> +		 * the domain that spans the current cpu and prev_cpu.
> +		 */
> +		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> +			break;
>  	}
>  
>  	return target;

So here we keep using !cfs.nr_running to mean idle, which might not at
all be true when there's real-time tasks around.

So should we be using idle_cpu(i) instead?

> @@ -1429,7 +1449,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
>  	int cpu = smp_processor_id();
>  	int prev_cpu = task_cpu(p);
>  	int new_cpu = cpu;
> +	int want_affine = 0;
>  	int want_sd = 1;
>  	int sync = wake_flags & WF_SYNC;
>  
> @@ -1467,36 +1487,15 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
>  				want_sd = 0;
>  		}
>  
>                 if (want_affine) {
>                         /*
>                          * If both cpu and prev_cpu are part of this domain,
>                          * cpu is a valid SD_WAKE_AFFINE target.
>                          */
> +			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
> +			    && (tmp->flags & SD_WAKE_AFFINE)) {
> +				affine_sd = tmp;
> +				want_affine = 0;
>  			}
>  		}
>  
> @@ -1527,8 +1526,10 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
>  #endif
>  
>  	if (affine_sd) {
> +		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
> +			return select_idle_sibling(p, cpu);
> +		else
> +			return select_idle_sibling(p, prev_cpu);
>  	}
>  
>  	while (sd) {
> 
> 




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-20  8:46     ` Peter Zijlstra
@ 2010-04-20  8:55       ` Peter Zijlstra
  2010-04-20 17:03         ` Suresh Siddha
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2010-04-20  8:55 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Mike Galbraith, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Tue, 2010-04-20 at 10:46 +0200, Peter Zijlstra wrote:
> So should we be using idle_cpu(i) instead?

something like the below..

---
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1375,26 +1375,25 @@ find_idlest_cpu(struct sched_group *grou
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int
-select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
-	int i;
 	struct sched_domain *sd;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
+	if (target == cpu && idle_cpu(cpu))
 		return cpu;
 
 	/*
 	 * If the task is going to be woken-up on the cpu where it previously
 	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+	if (target == prev_cpu && idle_cpu(prev_cpu))
 		return prev_cpu;
 
 	/*
@@ -1405,7 +1404,7 @@ select_idle_sibling(struct task_struct *
 			break;
 
 		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
-			if (!cpu_rq(i)->cfs.nr_running) {
+			if (idle_cpu(i)) {
 				target = i;
 				break;
 			}
@@ -1479,16 +1478,14 @@ select_task_rq_fair(struct rq *rq, struc
 				want_sd = 0;
 		}
 
-		if (want_affine) {
-			/*
-			 * If both cpu and prev_cpu are part of this domain,
-			 * cpu is a valid SD_WAKE_AFFINE target.
-			 */
-			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
-			    && (tmp->flags & SD_WAKE_AFFINE)) {
-				affine_sd = tmp;
-				want_affine = 0;
-			}
+		/*
+		 * If both cpu and prev_cpu are part of this domain,
+		 * cpu is a valid SD_WAKE_AFFINE target.
+		 */
+		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+			affine_sd = tmp;
+			want_affine = 0;
 		}
 
 		if (!want_sd && !want_affine)



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine()
  2010-04-20  8:55       ` Peter Zijlstra
@ 2010-04-20 17:03         ` Suresh Siddha
  0 siblings, 0 replies; 15+ messages in thread
From: Suresh Siddha @ 2010-04-20 17:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Ingo Molnar, Arjan van de Ven,
	linux-kernel@vger.kernel.org, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy

On Tue, 2010-04-20 at 01:55 -0700, Peter Zijlstra wrote:
> On Tue, 2010-04-20 at 10:46 +0200, Peter Zijlstra wrote:
> > So should we be using idle_cpu(i) instead?
> 
> something like the below..

Looks good to me.

Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>

> 
> ---
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -1375,26 +1375,25 @@ find_idlest_cpu(struct sched_group *grou
>  /*
>   * Try and locate an idle CPU in the sched_domain.
>   */
> -static int
> -select_idle_sibling(struct task_struct *p, int target)
> +static int select_idle_sibling(struct task_struct *p, int target)
>  {
>  	int cpu = smp_processor_id();
>  	int prev_cpu = task_cpu(p);
> -	int i;
>  	struct sched_domain *sd;
> +	int i;
>  
>  	/*
>  	 * If the task is going to be woken-up on this cpu and if it is
>  	 * already idle, then it is the right target.
>  	 */
> -	if (target == cpu && !cpu_rq(cpu)->cfs.nr_running)
> +	if (target == cpu && idle_cpu(cpu))
>  		return cpu;
>  
>  	/*
>  	 * If the task is going to be woken-up on the cpu where it previously
>  	 * ran and if it is currently idle, then it the right target.
>  	 */
> -	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
> +	if (target == prev_cpu && idle_cpu(prev_cpu))
>  		return prev_cpu;
>  
>  	/*
> @@ -1405,7 +1404,7 @@ select_idle_sibling(struct task_struct *
>  			break;
>  
>  		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> -			if (!cpu_rq(i)->cfs.nr_running) {
> +			if (idle_cpu(i)) {
>  				target = i;
>  				break;
>  			}
> @@ -1479,16 +1478,14 @@ select_task_rq_fair(struct rq *rq, struc
>  				want_sd = 0;
>  		}
>  
> -		if (want_affine) {
> -			/*
> -			 * If both cpu and prev_cpu are part of this domain,
> -			 * cpu is a valid SD_WAKE_AFFINE target.
> -			 */
> -			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
> -			    && (tmp->flags & SD_WAKE_AFFINE)) {
> -				affine_sd = tmp;
> -				want_affine = 0;
> -			}
> +		/*
> +		 * If both cpu and prev_cpu are part of this domain,
> +		 * cpu is a valid SD_WAKE_AFFINE target.
> +		 */
> +		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> +			affine_sd = tmp;
> +			want_affine = 0;
>  		}
>  
>  		if (!want_sd && !want_affine)
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip:sched/core] sched: Fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-31 23:47   ` Suresh Siddha
  2010-04-01  5:32     ` Mike Galbraith
  2010-04-20  8:46     ` Peter Zijlstra
@ 2010-04-23 10:50     ` tip-bot for Suresh Siddha
  2 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Suresh Siddha @ 2010-04-23 10:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, suresh.b.siddha, tglx,
	mingo

Commit-ID:  99bd5e2f245d8cd17d040c82d40becdb3efd9b69
Gitweb:     http://git.kernel.org/tip/99bd5e2f245d8cd17d040c82d40becdb3efd9b69
Author:     Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Wed, 31 Mar 2010 16:47:45 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 23 Apr 2010 11:02:02 +0200

sched: Fix select_idle_sibling() logic in select_task_rq_fair()

Issues in the current select_idle_sibling() logic in select_task_rq_fair()
in the context of a task wake-up:

a) Once we select the idle sibling, we use that domain (spanning the cpu that
   the task is currently woken-up and the idle sibling that we found) in our
   wake_affine() decisions. This domain is completely different from the
   domain(we are supposed to use) that spans the cpu that the task currently
   woken-up and the cpu where the task previously ran.

b) We do select_idle_sibling() check only for the cpu that the task is
   currently woken-up on. If select_task_rq_fair() selects the previously run
   cpu for waking the task, doing a select_idle_sibling() check
   for that cpu also helps and we don't do this currently.

c) In the scenarios where the cpu that the task is woken-up is busy but
   with its HT siblings are idle, we are selecting the task be woken-up
   on the idle HT sibling instead of a core that it previously ran
   and currently completely idle. i.e., we are not taking decisions based on
   wake_affine() but directly selecting an idle sibling that can cause
   an imbalance at the SMT/MC level which will be later corrected by the
   periodic load balancer.

Fix this by first going through the load imbalance calculations using
wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu,
then choose a possible idle sibling for waking up the task on.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   82 +++++++++++++++++++++++++--------------------------
 1 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0a413c7..cbd8b8a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1375,29 +1375,48 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int
-select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
+static int select_idle_sibling(struct task_struct *p, int target)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
+	struct sched_domain *sd;
 	int i;
 
 	/*
-	 * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
-	 * test in select_task_rq_fair) and the prev_cpu is idle then that's
-	 * always a better target than the current cpu.
+	 * If the task is going to be woken-up on this cpu and if it is
+	 * already idle, then it is the right target.
+	 */
+	if (target == cpu && idle_cpu(cpu))
+		return cpu;
+
+	/*
+	 * If the task is going to be woken-up on the cpu where it previously
+	 * ran and if it is currently idle, then it the right target.
 	 */
-	if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+	if (target == prev_cpu && idle_cpu(prev_cpu))
 		return prev_cpu;
 
 	/*
-	 * Otherwise, iterate the domain and find an elegible idle cpu.
+	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
-		if (!cpu_rq(i)->cfs.nr_running) {
-			target = i;
+	for_each_domain(target, sd) {
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
+
+		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+			if (idle_cpu(i)) {
+				target = i;
+				break;
+			}
 		}
+
+		/*
+		 * Lets stop looking for an idle sibling when we reached
+		 * the domain that spans the current cpu and prev_cpu.
+		 */
+		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+			break;
 	}
 
 	return target;
@@ -1421,7 +1440,7 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
-	int want_affine = 0, cpu_idle = !current->pid;
+	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
@@ -1460,36 +1479,13 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
 		}
 
 		/*
-		 * While iterating the domains looking for a spanning
-		 * WAKE_AFFINE domain, adjust the affine target to any idle cpu
-		 * in cache sharing domains along the way.
+		 * If both cpu and prev_cpu are part of this domain,
+		 * cpu is a valid SD_WAKE_AFFINE target.
 		 */
-		if (want_affine) {
-			int target = -1;
-
-			/*
-			 * If both cpu and prev_cpu are part of this domain,
-			 * cpu is a valid SD_WAKE_AFFINE target.
-			 */
-			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
-				target = cpu;
-
-			/*
-			 * If there's an idle sibling in this domain, make that
-			 * the wake_affine target instead of the current cpu.
-			 */
-			if (!cpu_idle && tmp->flags & SD_SHARE_PKG_RESOURCES)
-				target = select_idle_sibling(p, tmp, target);
-
-			if (target >= 0) {
-				if (tmp->flags & SD_WAKE_AFFINE) {
-					affine_sd = tmp;
-					want_affine = 0;
-					if (target != cpu)
-						cpu_idle = 1;
-				}
-				cpu = target;
-			}
+		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+			affine_sd = tmp;
+			want_affine = 0;
 		}
 
 		if (!want_sd && !want_affine)
@@ -1520,8 +1516,10 @@ select_task_rq_fair(struct rq *rq, struct task_struct *p, int sd_flag, int wake_
 #endif
 
 	if (affine_sd) {
-		if (cpu_idle || cpu == prev_cpu || wake_affine(affine_sd, p, sync))
-			return cpu;
+		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+			return select_idle_sibling(p, cpu);
+		else
+			return select_idle_sibling(p, prev_cpu);
 	}
 
 	while (sd) {

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-04-23 10:51 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-08 22:19 [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Suresh Siddha
2010-03-08 22:19 ` [patch v2 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
2010-03-31 10:25 ` [patch v2 1/2] sched: check for prev_cpu == this_cpu before calling wake_affine() Peter Zijlstra
2010-03-31 23:47   ` Suresh Siddha
2010-04-01  5:32     ` Mike Galbraith
2010-04-01 21:04       ` Suresh Siddha
2010-04-02  6:20         ` Mike Galbraith
2010-04-02 17:05           ` Suresh Siddha
2010-04-02 19:43             ` Mike Galbraith
2010-04-14 20:45           ` Suresh Siddha
2010-04-15  5:17             ` Mike Galbraith
2010-04-20  8:46     ` Peter Zijlstra
2010-04-20  8:55       ` Peter Zijlstra
2010-04-20 17:03         ` Suresh Siddha
2010-04-23 10:50     ` [tip:sched/core] sched: Fix select_idle_sibling() logic in select_task_rq_fair() tip-bot for Suresh Siddha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox