* sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
@ 2013-01-02 4:22 Preeti U Murthy
2013-01-02 8:12 ` Mike Galbraith
2013-01-07 15:48 ` Vincent Guittot
0 siblings, 2 replies; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-02 4:22 UTC (permalink / raw)
To: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett
Hi everyone,
I have been looking at how different workloads react when the per entity
load tracking metric is integrated into the load balancer, and at the
possible reasons for their behaviour.
I had posted the integration patch earlier:
https://lkml.org/lkml/2012/11/15/391
Essentially what I am doing is:
1. I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple.
2. I have replaced cfs_rq->load.weight in weighted_cpuload() with
cfs.runnable_load_avg, the active load tracking metric.
3. I have replaced se.load.weight in task_h_load() with
se.load.avg.contrib, the per entity load tracking metric.
4. The load balancer will end up using these metrics (a sketch of what
the metric tracks follows below).
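To make concrete what steps 2 and 3 substitute in, here is a minimal,
self-contained userspace sketch of the kind of decayed average the per
entity load tracking metric maintains. It is an approximation for
illustration only: the 1 ms period, the EWMA form and all names are my
assumptions, not the kernel's code.

/*
 * Sketch of per entity load tracking: a geometrically decayed average
 * of runnable time, using the kernel's y^32 = 0.5 half-life choice.
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0); /* decay factor: y^32 == 0.5 */
        double runnable_avg = 0.0;       /* tracked load, in [0, 1] */
        int period;

        /* A task runnable 40% of the time: 4 of every 10 periods. */
        for (period = 0; period < 1000; period++) {
                int runnable = (period % 10) < 4;
                runnable_avg = runnable_avg * y + (runnable ? 1.0 - y : 0.0);
        }
        /* Converges to roughly 0.4, the duty cycle, with some ripple. */
        printf("tracked load of a 40%% task: ~%.2f\n", runnable_avg);
        return 0;
}

The point is that an 80% task and a 40% task end up with different
tracked loads, which is what weighted_cpuload() and task_h_load() start
consuming after the substitution, instead of the flat load.weight.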
After conducting experiments on several workloads I found that the
performance of the workloads with the above integration would neither
improve nor deteriorate, and this observation was consistent.
Ideally the performance should have improved, considering that the
metric does better tracking of load.
Let me explain with a simple example why we should ideally see a
performance improvement: consider 2 40% tasks and 1 80% task.
With integration:
----------------

         40%
 80%     40%
 cpu1    cpu2

The above is the scenario when the tasks fork initially. This is a
perfectly balanced system, hence no more load balancing, and a proper
distribution of load across the cpus.
Without integration
-------------------

 40%                          40%
 80%     40%          80%     40%
 cpu1    cpu2   OR    cpu1    cpu2

Because the load balancer views all the tasks as having the same load,
it could ping pong tasks between these two situations.
When I performed this experiment, however, I did not see an improvement
in performance in the former case. On further observation I found that
the following was actually happening.
With integration
----------------

 Initially          40% task sleeps     40% task wakes up and
                                        select_idle_sibling()
                                        decides to wake it up on cpu1

         40%                            40%
 80%     40%        80%     40%         80%     40%
 cpu1    cpu2  ->   cpu1    cpu2  ->    cpu1    cpu2

This makes load balancing trigger a movement of the 40% task from cpu1
back to cpu2. Hence the stability that the load balancer was trying to
achieve is gone, and the culprit boils down to select_idle_sibling().
How is it the culprit, and how is it hindering the performance of the
workloads?
*What is the way ahead with the per entity load tracking metric in the
load balancer then?*
In replies to a post at https://lkml.org/lkml/2012/12/6/105, Paul
mentions the following:
"It is my intuition that the greatest carnage here is actually caused
by wake-up load-balancing getting in the way of periodic in
establishing a steady state. I suspect more mileage would result from
reducing the interference wake-up load-balancing has with steady
state."
"The whole point of using blocked load is so that you can converge on a
steady state where you don't NEED to move tasks. What disrupts this is
we naturally prefer idle cpus on wake-up balance to reduce wake-up
latency. I think the better answer is making these two processes load
balancing() and select_idle_sibling() more co-operative."
I had not realised how this would happen until I saw it happening in the
above experiment.
Based on what Paul explained above, let us use the runnable load + the
blocked load for calculating the load on a cfs runqueue, rather than
just the runnable load (which is what I am doing now), and see the
consequence.
Initially:                   40% task sleeps

         40%
 80%     40%      ->      80%     40%
 cpu1    cpu2             cpu1    cpu2

So initially the load on cpu1 is, say, 80, and on cpu2 it is also 80.
Balanced. Now when the 40% task sleeps, the total load on cpu2 =
runnable load + blocked load, which is still 80.
As a consequence, firstly, during periodic load balancing the load is
not moved from cpu1 to cpu2 when the 40% task sleeps (the balancer sees
the load on cpu2 as 80 and not as 40).
Hence the above scenario remains the same. On wake up, what happens?
Here comes the point of making load balancing and wake up
balancing (select_idle_sibling()) co-operative. How about we always
schedule the woken up task on prev_cpu? This seems more sensible
considering that load balancing counts the blocked load as part of the
load of cpu2.
If we do that, we end up scheduling the 40% task back on cpu2, back to
the scenario which load balancing intended. Hence a steady state is
maintained no matter what, unless other tasks show up.
Note that considering prev_cpu as the default cpu to run the woken up
task on is possible only because we use the blocked load for load
balancing purposes.
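To make the accounting above concrete, here is a minimal sketch of the
proposal, assuming illustrative names of my own (these are not the
kernel's interfaces):

/*
 * Sketch: a cpu's load for balancing includes the decayed load of its
 * blocked (sleeping) tasks, and a woken task defaults to prev_cpu.
 */
#include <stdio.h>

struct cpu_load {
        unsigned long runnable_load_avg; /* tasks currently on the rq */
        unsigned long blocked_load_avg;  /* tasks that went to sleep */
};

/* The balancer's view of a cpu: runnable + blocked. */
static unsigned long balance_load(const struct cpu_load *c)
{
        return c->runnable_load_avg + c->blocked_load_avg;
}

/*
 * Wake placement that co-operates with the above: since prev_cpu still
 * carries the sleeper's weight as blocked load, send the wakee back.
 */
static int wake_cpu(int prev_cpu)
{
        return prev_cpu;
}

int main(void)
{
        struct cpu_load cpu1 = { 80, 0 };  /* one 80% task */
        struct cpu_load cpu2 = { 40, 40 }; /* 40% running, 40% asleep */

        printf("cpu1=%lu cpu2=%lu -> balanced, nothing to move\n",
               balance_load(&cpu1), balance_load(&cpu2));
        printf("the woken 40%% task goes back to cpu%d\n", wake_cpu(2));
        return 0;
}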
The above steps of using the blocked load and selecting prev_cpu as the
target for the woken up task seem to me to be the next step. This could
allow load balancing with the per entity load tracking metric to behave
as it is supposed to, without anything else disrupting it. And here I
expect a performance improvement.
Please do let me know your suggestions. This will greatly help in
taking the right steps from here on in achieving the correct
integration.
Thank you
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-02 4:22 sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer Preeti U Murthy
@ 2013-01-02 8:12 ` Mike Galbraith
2013-01-03 10:38 ` Preeti U Murthy
2013-01-07 15:48 ` Vincent Guittot
1 sibling, 1 reply; 24+ messages in thread
From: Mike Galbraith @ 2013-01-02 8:12 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett
On Wed, 2013-01-02 at 09:52 +0530, Preeti U Murthy wrote:
> [...]
>
> *What is the way ahead with the per entity load tracking metric in the
> load balancer then?*
select_idle_sibling() is all about the dynamic case: lowering latency
and cranking up cores during ramp-up to boost throughput. If you want
the system to achieve a stable state with periodic balancing, you need
to turn select_idle_sibling() the heck off. Once you've gotten the box
mostly committed, it's just an overhead/bounce source anyway.
> In replies to a post at https://lkml.org/lkml/2012/12/6/105, Paul
> mentions the following:
>
> "It is my intuition that the greatest carnage here is actually caused
> by wake-up load-balancing getting in the way of periodic in
> establishing a steady state. I suspect more mileage would result from
> reducing the interference wake-up load-balancing has with steady
> state."
>
> "The whole point of using blocked load is so that you can converge on a
> steady state where you don't NEED to move tasks. What disrupts this is
> we naturally prefer idle cpus on wake-up balance to reduce wake-up
> latency. I think the better answer is making these two processes load
> balancing() and select_idle_sibling() more co-operative."
The down-side of steady state seeking via load tracking is that you
want to take N% average load tasks and stack them on top of each other,
which does nothing good for those tasks when they overlap in execution.
Long term balance looks all pretty, but if one or more of them could
have slipped into an idle shared cache, it's still a latency hit and a
utilization loss. You are at odds with select_idle_sibling()'s mission:
it cares about the here and now, while load tracking cares about fuzzy
long-term averages.
> I had not realised how this would happen until I saw it happening in the
> above experiment.
select_idle_sibling()'s job is to be annoying as hell.. and it does that
very well :)
> [...]
>
> Here comes the point of making load balancing and wake up
> balancing (select_idle_sibling()) co-operative. How about we always
> schedule the woken up task on prev_cpu? This seems more sensible
> considering that load balancing counts the blocked load as part of the
> load of cpu2.
Once committed, that works fine.. until load fluctuates. If you don't
plug the holes opportunistically, you lose throughput.
> If we do that, we end up scheduling the 40% task back on cpu2, back to
> the scenario which load balancing intended. Hence a steady state is
> maintained no matter what, unless other tasks show up.
>
> Note that considering prev_cpu as the default cpu to run the woken up
> task on is possible only because we use the blocked load for load
> balancing purposes.
>
> The above steps of using the blocked load and selecting prev_cpu as the
> target for the woken up task seem to me to be the next step. This could
> allow load balancing with the per entity load tracking metric to behave
> as it is supposed to, without anything else disrupting it. And here I
> expect a performance improvement.
>
> Please do let me know your suggestions. This will greatly help in
> taking the right steps from here on in achieving the correct
> integration.
Again, I think you want a knob to turn select_idle_sibling() off. Load
tracking's goal is to smooth; select_idle_sibling()'s goal is to peak.
-Mike
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-02 8:12 ` Mike Galbraith
@ 2013-01-03 10:38 ` Preeti U Murthy
2013-01-03 20:06 ` Mike Galbraith
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-03 10:38 UTC (permalink / raw)
To: Mike Galbraith
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
Hi Mike,
Thank you very much for your feedback. Considering your suggestions, I
have posted a proposed solution to prevent select_idle_sibling() from
being a disadvantage to normal load balancing, and to rather aid it.
**This patch is *without* the enablement of the per entity load tracking metric.**
This is with the intention of correcting the existing
select_idle_sibling() mess before going ahead.
-------------------BEGIN PATCH--------------------------------------------------------
Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
The function of select_idle_sibling() is to place the woken up task in
the vicinity of the waking cpu or on the previous cpu, depending on what
wake_affine() says, this placement being only in an idle group. If an
idle group is not found, the fallback cpu is either the waking cpu or
the previous cpu accordingly.

This results in the runqueue of the waking cpu or the previous cpu
getting overloaded when the system is committed, which is a latency hit
to these tasks.

What is required is that the newly woken up task be placed close to the
waking cpu or the previous cpu, whichever is best, so as to avoid a
latency hit and cache coldness respectively. This is achieved by having
wake_affine() decide which cache domain the task should be placed in.

Once this is decided, instead of searching for a completely idle group,
let us search for the idlest group. This will anyway return a completely
idle group if one exists, and its mechanism will fall back to what
select_idle_sibling() was doing. But if this fails, find_idlest_group()
continues the search for a relatively more idle group.

The argument could be that we wish to avoid migration of the newly woken
up task to any other group unless it is completely idle. But in this
case, to begin with, we choose a sched domain within which a migration
would be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and
MC domains to co-operate with this.

This patch is based on the tip tree without enabling per entity load
tracking, with the intention of clearing up the select_idle_sibling()
mess before introducing the metric.
---
 include/linux/topology.h |    4 ++-
 kernel/sched/fair.c      |   61 +++++-----------------------------------
 2 files changed, 9 insertions(+), 56 deletions(-)
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..eeb309e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -95,7 +95,7 @@ int arch_update_cpu_topology(void);
                                | 1*SD_BALANCE_NEWIDLE          \
                                | 1*SD_BALANCE_EXEC             \
                                | 1*SD_BALANCE_FORK             \
-                               | 0*SD_BALANCE_WAKE             \
+                               | 1*SD_BALANCE_WAKE             \
                                | 1*SD_WAKE_AFFINE              \
                                | 1*SD_SHARE_CPUPOWER           \
                                | 1*SD_SHARE_PKG_RESOURCES      \
@@ -127,7 +127,7 @@ int arch_update_cpu_topology(void);
                                | 1*SD_BALANCE_NEWIDLE          \
                                | 1*SD_BALANCE_EXEC             \
                                | 1*SD_BALANCE_FORK             \
-                               | 0*SD_BALANCE_WAKE             \
+                               | 1*SD_BALANCE_WAKE             \
                                | 1*SD_WAKE_AFFINE              \
                                | 0*SD_SHARE_CPUPOWER           \
                                | 1*SD_SHARE_PKG_RESOURCES      \
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..c33eda7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3303,58 +3303,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
        return idlest;
 }

-/*
- * Try and locate an idle CPU in the sched_domain.
- */
-static int select_idle_sibling(struct task_struct *p, int target)
-{
-       int cpu = smp_processor_id();
-       int prev_cpu = task_cpu(p);
-       struct sched_domain *sd;
-       struct sched_group *sg;
-       int i;
-
-       /*
-        * If the task is going to be woken-up on this cpu and if it is
-        * already idle, then it is the right target.
-        */
-       if (target == cpu && idle_cpu(cpu))
-               return cpu;
-
-       /*
-        * If the task is going to be woken-up on the cpu where it previously
-        * ran and if it is currently idle, then it the right target.
-        */
-       if (target == prev_cpu && idle_cpu(prev_cpu))
-               return prev_cpu;
-
-       /*
-        * Otherwise, iterate the domains and find an elegible idle cpu.
-        */
-       sd = rcu_dereference(per_cpu(sd_llc, target));
-       for_each_lower_domain(sd) {
-               sg = sd->groups;
-               do {
-                       if (!cpumask_intersects(sched_group_cpus(sg),
-                                               tsk_cpus_allowed(p)))
-                               goto next;
-
-                       for_each_cpu(i, sched_group_cpus(sg)) {
-                               if (!idle_cpu(i))
-                                       goto next;
-                       }
-
-                       target = cpumask_first_and(sched_group_cpus(sg),
-                                       tsk_cpus_allowed(p));
-                       goto done;
-next:
-                       sg = sg->next;
-               } while (sg != sd->groups);
-       }
-done:
-       return target;
-}
-
 #ifdef CONFIG_SCHED_NUMA
 static inline bool pick_numa_rand(int n)
 {
@@ -3469,8 +3417,13 @@ find_sd:
        if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
                prev_cpu = cpu;

-       new_cpu = select_idle_sibling(p, prev_cpu);
-       goto unlock;
+       if (prev_cpu == task_cpu(p) && idle_cpu(prev_cpu) ||
+           prev_cpu == smp_processor_id() && idle_cpu(prev_cpu)) {
+               new_cpu = prev_cpu;
+               goto unlock;
+       } else {
+               sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+       }
 }

 pick_idlest:
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-03 10:38 ` Preeti U Murthy
@ 2013-01-03 20:06 ` Mike Galbraith
2013-01-04 11:41 ` Mike Galbraith
2013-01-05 8:13 ` Mike Galbraith
2 siblings, 0 replies; 24+ messages in thread
From: Mike Galbraith @ 2013-01-03 20:06 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
> Hi Mike,
>
> Thank you very much for your feedback. Considering your suggestions, I
> have posted a proposed solution to prevent select_idle_sibling() from
> being a disadvantage to normal load balancing, and to rather aid it.
>
> **This patch is *without* the enablement of the per entity load tracking metric.**
>
> This is with the intention of correcting the existing
> select_idle_sibling() mess before going ahead.
Bench the traversal cost; BALANCE_WAKE used to be _way_ too expensive to
live with for fast movers. It was turned on briefly, only to be turned
off again. Maybe that's changed, cool if so.
But before you do that: picking any idle or somewhat more idle cpu in an
entire large L3 package on every wakeup rapes fast moving communicating
tasks at L2. You can't do that at high frequency and live, so any
integration has to prevent high frequency bounce.. but at the same time,
react quickly to ramp, pull communicating tasks together quickly.. and
not harm 1:N loads that currently benefit massively from full package
high frequency bounce :)
> [full patch quoted here; snipped]
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-03 10:38 ` Preeti U Murthy
2013-01-03 20:06 ` Mike Galbraith
@ 2013-01-04 11:41 ` Mike Galbraith
2013-01-05 8:13 ` Mike Galbraith
2 siblings, 0 replies; 24+ messages in thread
From: Mike Galbraith @ 2013-01-04 11:41 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
> Hi Mike,
>
> Thank you very much for your feedback. Considering your suggestions, I
> have posted a proposed solution to prevent select_idle_sibling() from
> being a disadvantage to normal load balancing, and to rather aid it.
>
> **This patch is *without* the enablement of the per entity load tracking metric.**
>
> This is with the intention of correcting the existing
> select_idle_sibling() mess before going ahead.
Well, on the bright side, this is much kinder to lightly loaded tbench
on a 4x10 core (+HT) box than the current bounce happy
select_idle_sibling(). At the heavy load end I'm losing heaping
truckloads of throughput though; likely some migration rate limiting
would fix that up.
select_idle_sibling() doesn't have much of a pretty face these days, but
its evil hag face shows more prominently than ever, so making it dead is
a great goal ;-)
-Mike
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-03 10:38 ` Preeti U Murthy
2013-01-03 20:06 ` Mike Galbraith
2013-01-04 11:41 ` Mike Galbraith
@ 2013-01-05 8:13 ` Mike Galbraith
2013-01-06 16:32 ` Mike Galbraith
2 siblings, 1 reply; 24+ messages in thread
From: Mike Galbraith @ 2013-01-05 8:13 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:
> Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
>
> [...]
What if..
Fast movers currently suffer from traversing the large package, mostly
due to the traversal order walking 1:1 buddies hand in hand across the
whole package endlessly. With only one buddy pair running, it's
horrific. Even if you change the order to be friendlier, perturbation
induces bouncing, and more spots to bounce to equals more bouncing.
Ergo, I cross coupled cpu pairs to eliminate that. If buddies are
perturbed, having one and only one buddy cpu pulls them back together,
so it can't induce a bounce fest, only correct. That worked well, but
had the down side that some loads really REALLY want maximum spread, so
they suffer when you remove migration options as I did. There's an
in_interrupt() consideration I'm not so sure of too; in that case, going
the extra mile to find an idle hole to plug _may_ be worth some extra
cost too.. dunno.
So wrt integration, what if a buddy cpu were made the FIRST choice of
generic wake balancing, vs the ONLY choice of select_idle_sibling() as I
did it? If the buddy cpu is available, cool, perturbed pairs find each
other and pair back up; if not, and you were here too recently, you stay
with prev_cpu, avoiding bounce and traversal at high frequency. All
tasks can try the cheap buddy cpu first, and all can try the full domain
as well, just not at insane rates. The heavier the short term load
average (or some such, with instant decay on longish idle a la the idle
balance throttle, so you ramp well), the longer the 'forget eating a
full balance' interval becomes, with the cutoff at some point affecting
the cheap but also not free cross coupled buddy cpu as well. Looking
for an idle cpu at hefty load is a waste of cycles at best, and plugging
micro-holes does nothing good even if you find one, so forget wake
balance entirely at some cutoff, and let periodic balancing do its thing
in peace.
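Sketched in code, that decision order might read as follows; this is a
rough illustration under my own assumptions (the buddy map, the
scan_affordable input standing in for the load-derived interval, and all
names are invented, not the actual patch):

/* Buddy cpu as FIRST choice of wake balancing, full scan rate limited. */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS 4

static int buddy_of[NR_CPUS] = { 1, 0, 3, 2 }; /* cross coupled pairs */
static bool idle[NR_CPUS];

static int full_domain_scan(int prev_cpu)
{
        for (int i = 0; i < NR_CPUS; i++)
                if (idle[i])
                        return i;
        return prev_cpu;
}

static int select_wake_cpu(int prev_cpu, bool scan_affordable)
{
        int buddy = buddy_of[prev_cpu];

        if (idle[prev_cpu])      /* cheapest: stay put */
                return prev_cpu;
        if (idle[buddy])         /* first choice: the one buddy */
                return buddy;
        if (scan_affordable)     /* only then eat a full scan */
                return full_domain_scan(prev_cpu);
        return prev_cpu;         /* too busy or too recent: don't bounce */
}

int main(void)
{
        idle[3] = true;
        printf("wake -> cpu%d\n", select_wake_cpu(0, true));  /* scans: 3 */
        printf("wake -> cpu%d\n", select_wake_cpu(0, false)); /* stays: 0 */
        return 0;
}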
Hrmph, that's too many words, but basically, I think whacking
select_idle_sibling() integration into wake balance makes loads of
sense, but needs a bit more to not end up just moving the problems to a
different spot.
I still have a 2.6-rt problem I need to find time to squabble with, but
maybe I'll soonish see if what you did plus what I did combined works
out on that 4x10 core box where current is _so_ unbelievably horrible.
Heck, it can't get any worse, and the restricted wake balance alone
kinda sorta worked.
-Mike
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-05 8:13 ` Mike Galbraith
@ 2013-01-06 16:32 ` Mike Galbraith
2013-01-07 5:29 ` Preeti U Murthy
0 siblings, 1 reply; 24+ messages in thread
From: Mike Galbraith @ 2013-01-06 16:32 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
> I still have a 2.6-rt problem I need to find time to squabble with, but
> maybe I'll soonish see if what you did plus what I did combined works
> out on that 4x10 core box where current is _so_ unbelievably horrible.
> Heck, it can't get any worse, and the restricted wake balance alone
> kinda sorta worked.
Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.
tbench, 3 runs, 30 secs/run
revert = 37407ea7 reverted
clients                     1        5       10       20       40       80
3.6.0.virgin            27.83   139.50  1488.76  4172.93  6983.71  8301.73
                        29.23   139.98  1500.22  4162.92  6907.16  8231.13
                        30.00   141.43  1500.09  3975.50  6847.24  7983.98

3.6.0+revert           281.08  1404.76  2802.44  5019.49  7080.97  8592.80
                       282.38  1375.70  2747.23  4823.95  7052.15  8508.45
                       270.69  1375.53  2736.29  5243.05  7058.75  8806.72

3.6.0+preeti            26.43   126.62  1027.23  3350.06  7004.22  7561.83
                        26.67   128.66   922.57  3341.73  7045.05  7662.18
                        25.54   129.20  1015.02  3337.60  6591.32  7634.33

3.6.0+best_combined    280.48  1382.07  2730.27  4786.20  6477.28  7980.07
                       276.88  1392.50  2708.23  4741.25  6590.99  7992.11
                       278.92  1368.55  2735.49  4614.99  6573.38  7921.75

3.0.51-0.7.9-default   286.44  1415.37  2794.41  5284.39  7282.57 13670.80
Something is either wrong with 3.6 itself, or the config I'm using, as
max throughput is nowhere near where it should be (see default). On the
bright side, integrating the two does show some promise.
-Mike
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-06 16:32 ` Mike Galbraith
@ 2013-01-07 5:29 ` Preeti U Murthy
2013-01-07 7:36 ` Mike Galbraith
0 siblings, 1 reply; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-07 5:29 UTC (permalink / raw)
To: Mike Galbraith
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
Hi Mike,
Thank you very much for your inputs. Just a few thoughts, so that we are
clear about the problems so far in scheduler scalability and the
direction we ought to move in to correct them.
1. During fork or exec, the scheduler goes through find_idlest_group()
and find_idlest_cpu() in select_task_rq_fair() by iterating through all
domains. Why then was a similar approach not followed for wake up
balancing? What was so different about wake ups (except that the woken
up task had to remain close to the prev/waking cpu) that we had to
introduce select_idle_sibling() in the first place?
2. To the best of my knowledge, the concept of the buddy cpu was
introduced in select_idle_sibling() so as to avoid the entire package
traversal and restrict it to the buddy cpus alone. But even during fork
or exec, we iterate through all the sched domains, as I have mentioned
above. Why did the buddy cpu solution not come to the rescue here as
well?
3. So the problem stands thus: either avoid iterating through the entire
package at the cost of less aggression in finding an idle cpu, or
iterate through the package with the intention of finding the idlest
cpu. To the best of my understanding the former is your approach, or
commit 37407ea7; the latter is what I tried to do. But as you have
rightly pointed out, my approach will have scaling issues. In this
light, what does your best_combined patch (below) look like?
Do you introduce a cut-off value on the loads to decide which approach
to take?
Meanwhile I will also try to run tbench and a few other benchmarks to
find out why the results below are what they are. Will update you very
soon on this.
Thank you
Regards
Preeti U Murthy
On 01/06/2013 10:02 PM, Mike Galbraith wrote:
> On Sat, 2013-01-05 at 09:13 +0100, Mike Galbraith wrote:
>
>> I still have a 2.6-rt problem I need to find time to squabble with, but
>> maybe I'll soonish see if what you did plus what I did combined works
>> out on that 4x10 core box where current is _so_ unbelievably horrible.
>> Heck, it can't get any worse, and the restricted wake balance alone
>> kinda sorta worked.
>
> Actually, I flunked copy/paste 101. Below (preeti) shows the real deal.
>
> tbench, 3 runs, 30 secs/run
> revert = 37407ea7 reverted
> clients                     1        5       10       20       40       80
> 3.6.0.virgin            27.83   139.50  1488.76  4172.93  6983.71  8301.73
>                         29.23   139.98  1500.22  4162.92  6907.16  8231.13
>                         30.00   141.43  1500.09  3975.50  6847.24  7983.98
>
> 3.6.0+revert           281.08  1404.76  2802.44  5019.49  7080.97  8592.80
>                        282.38  1375.70  2747.23  4823.95  7052.15  8508.45
>                        270.69  1375.53  2736.29  5243.05  7058.75  8806.72
>
> 3.6.0+preeti            26.43   126.62  1027.23  3350.06  7004.22  7561.83
>                         26.67   128.66   922.57  3341.73  7045.05  7662.18
>                         25.54   129.20  1015.02  3337.60  6591.32  7634.33
>
> 3.6.0+best_combined    280.48  1382.07  2730.27  4786.20  6477.28  7980.07
>                        276.88  1392.50  2708.23  4741.25  6590.99  7992.11
>                        278.92  1368.55  2735.49  4614.99  6573.38  7921.75
>
> 3.0.51-0.7.9-default   286.44  1415.37  2794.41  5284.39  7282.57 13670.80
>
> Something is either wrong with 3.6 itself, or the config I'm using, as
> max throughput is nowhere near where it should be (see default). On the
> bright side, integrating the two does show some promise.
>
> -Mike
>
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-07 5:29 ` Preeti U Murthy
@ 2013-01-07 7:36 ` Mike Galbraith
2013-01-08 8:41 ` Preeti U Murthy
0 siblings, 1 reply; 24+ messages in thread
From: Mike Galbraith @ 2013-01-07 7:36 UTC (permalink / raw)
To: Preeti U Murthy
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
On Mon, 2013-01-07 at 10:59 +0530, Preeti U Murthy wrote:
> Hi Mike,
> Thank you very much for your inputs. Just a few thoughts, so that we are
> clear about the problems so far in scheduler scalability and the
> direction we ought to move in to correct them.
>
> 1. During fork or exec, the scheduler goes through find_idlest_group()
> and find_idlest_cpu() in select_task_rq_fair() by iterating through all
> domains. Why then was a similar approach not followed for wake up
> balancing? What was so different about wake ups (except that the woken
> up task had to remain close to the prev/waking cpu) that we had to
> introduce select_idle_sibling() in the first place?
Cost. select_idle_sibling() was introduced specifically to recoup
overlap by placing the wakee in a shared L2, where there is no pain
really, it's all gain. L3 ain't the same deal, but it still does a good
job, even for fast movers (on Intel), if you're not bouncing them all
over the place. Breakeven depends on architecture, but what's critical
for fast movers on all of them is that no bounce is allowed. Wakeup is
a whole different thing than fork/exec balancing; there, you just want
to spread to seed your future, and it doesn't matter much where you send
wakees. With wakeup of fast movers, you lose on every migration, so
buddies really really need to find each other fast, and stay put, so you
really do recoup overlap instead of just paying again and again. My old
Q6600 kicks the snot out of that 10 core Westmere CPU in a common
desktop box cross core scheduling scenario. That's sick.
> 2. To the best of my knowledge, the concept of the buddy cpu was
> introduced in select_idle_sibling() so as to avoid the entire package
> traversal and restrict it to the buddy cpus alone. But even during fork
> or exec, we iterate through all the sched domains, as I have mentioned
> above. Why did the buddy cpu solution not come to the rescue here as
> well?
I'm looking (well, pondering) fork/exec. It may be a common case win
too, but I kinda doubt it'll matter. It's the 1:1 waker/wakee relation
that suffers mightily. The buddy was introduced specifically to bind
1:1 sw buddies (common) in hw as well. They are one, make it so. It
was the simplest solution I could think of to both kill the evil, and
cut cost to restore select_idle_sibling() to the optimization it was
intended to be vs the pessimization it has turned into on large L3 CPUs.
I took it back to its functional roots. Weld two modern cores together,
create a core2duo on steroids, and the optimization works again and is
cheap again.
That the current behavior has its upsides too is.. most unfortunate :)
> 3. So the problem stands thus: either avoid iterating through the entire
> package at the cost of less aggression in finding an idle cpu, or
> iterate through the package with the intention of finding the idlest
> cpu. To the best of my understanding the former is your approach, or
> commit 37407ea7; the latter is what I tried to do. But as you have
> rightly pointed out, my approach will have scaling issues. In this
> light, what does your best_combined patch (below) look like?
It doesn't exist anymore; it morphed into something ever less wonderful
as the day went on. Integration needs to fiddle with find_* to cut some
wasted cycles, and methinks reordered groups _may_ make it usable with
no idle-buddy hint. Using the idle-buddy first looked promising, but
the bottom line is that at higher load you use it for a modest fee, the
buddy was busy, so you then do the full scan, paying a bigger fee on top
of it, which of course crops the high end. Reordering should kill the
low load horror, but you've still got the problem that as you approach
saturation, any search has less and less value. That, and the logic is
not necessarily as light as it must be when targeting wakeups, where
every scheduler cycle detracts substantially from fast/light task
performance.
> Do you introduce a cut-off value on the loads to decide which approach
> to take?
I fiddled with that, but then whacked it, because the result needs to be
nice to both 1:1 and 1:N loads, which may need to spread at high
frequency, so the cutoff would be harmful. It seems likely that either
a cutoff or 1:1 vs 1:N detection or _something_ will be needed to cut
the cost when it's flat out not affordable. With 1:N pgbench (and ilk),
there's no such thing as too expensive a search, but for fast/light
network stuff there is.
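For illustration only, here is one hedged sketch of such 1:1 vs 1:N
detection, counting how often a waker switches wakees; the struct,
fields and threshold are placeholders of mine, not existing kernel code:

/*
 * Sketch: classify a waker as 1:1 or 1:N by how often it switches
 * wakees.  Purely illustrative; not kernel code.
 */
#include <stdio.h>

struct wakee_stats {
        int last_wakee;         /* id of the most recently woken task */
        unsigned int flips;     /* decaying count of wakee switches */
};

/* Nonzero for an apparent 1:N waker, where buddy/prev placement is
 * wrong and a wider (costlier) spread may pay off. */
static int waker_is_one_to_many(struct wakee_stats *ws, int wakee)
{
        if (ws->last_wakee != wakee) {
                ws->last_wakee = wakee;
                ws->flips++;
        } else if (ws->flips) {
                ws->flips--;    /* decay back towards the 1:1 case */
        }
        return ws->flips > 8;   /* threshold is an arbitrary placeholder */
}

int main(void)
{
        struct wakee_stats ws = { 0, 0 };

        /* A server waking 16 distinct clients crosses the threshold. */
        for (int id = 1; id <= 16; id++)
                printf("wake %2d -> 1:N? %d\n", id,
                       waker_is_one_to_many(&ws, id));
        return 0;
}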
-Mike
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-02 4:22 sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer Preeti U Murthy
2013-01-02 8:12 ` Mike Galbraith
@ 2013-01-07 15:48 ` Vincent Guittot
2013-01-08 6:06 ` Preeti U Murthy
1 sibling, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2013-01-07 15:48 UTC (permalink / raw)
To: Preeti U Murthy, Matthew Garrett
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney, Peter Zijlstra,
Viresh Kumar, Amit Kucheria, Morten Rasmussen, Paul McKenney,
Andrew Morton, Arjan van de Ven, Ingo Molnar, Paul Turner,
Venki Pallipadi, Robin Randhawa, Lists linaro-dev
On 2 January 2013 05:22, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> [...]
>
> Based on what Paul explained above, let us use the runnable load + the
> blocked load for calculating the load on a cfs runqueue, rather than
> just the runnable load (which is what I am doing now), and see the
> consequence.
>
> Initially:                   40% task sleeps
>
>         40%
> 80%     40%      ->      80%     40%
> cpu1    cpu2             cpu1    cpu2
>
> So initially the load on cpu1 is, say, 80, and on cpu2 it is also 80.
> Balanced. Now when the 40% task sleeps, the total load on cpu2 =
> runnable load + blocked load, which is still 80.
>
> As a consequence, firstly, during periodic load balancing the load is
> not moved from cpu1 to cpu2 when the 40% task sleeps (the balancer sees
> the load on cpu2 as 80 and not as 40).
> Hence the above scenario remains the same. On wake up, what happens?
>
> Here comes the point of making load balancing and wake up
> balancing (select_idle_sibling()) co-operative. How about we always
> schedule the woken up task on prev_cpu? This seems more sensible
> considering that load balancing counts the blocked load as part of the
> load of cpu2.
Hi Preeti,
I'm not sure that we want such a steady state at the core level, because
we take advantage of migrating waking tasks between cores that share
their cache, as Matthew demonstrated. But I agree that reaching such a
steady state at the cluster and CPU level is interesting.
IMHO, you're right that taking the blocked load into consideration
should minimize task migrations between clusters, but it should not
prevent fast task migration between cores that share their cache.
Vincent
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-07 15:48 ` Vincent Guittot
@ 2013-01-08 6:06 ` Preeti U Murthy
2013-01-08 14:04 ` Vincent Guittot
0 siblings, 1 reply; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-08 6:06 UTC (permalink / raw)
To: Vincent Guittot
Cc: Matthew Garrett, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev
On 01/07/2013 09:18 PM, Vincent Guittot wrote:
> On 2 January 2013 05:22, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> [...]
>>
>> Here comes the point of making load balancing and wake up
>> balancing (select_idle_sibling()) co-operative. How about we always
>> schedule the woken up task on prev_cpu? This seems more sensible
>> considering that load balancing counts the blocked load as part of the
>> load of cpu2.
>
> Hi Preeti,
>
> I'm not sure that we want such a steady state at the core level, because
> we take advantage of migrating waking tasks between cores that share
> their cache, as Matthew demonstrated. But I agree that reaching such a
> steady state at the cluster and CPU level is interesting.
>
> IMHO, you're right that taking the blocked load into consideration
> should minimize task migrations between clusters, but it should not
> prevent fast task migration between cores that share their cache.
True, Vincent. But I think the one disadvantage, even at the cpu or
cluster level, is that when we consider the blocked load, we might
prevent any more tasks from being scheduled on that cpu during periodic
load balance if the blocked load is too high. This is very poor cpu
utilization.
Also, we can consider steady states only if the waking tasks have a
specific waking pattern. I am not sure if we can risk hoping that the
blocked task will wake up soon, or will wake up at time 'x' and utilize
that cpu.
>
> Vincent
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-07 7:36 ` Mike Galbraith
@ 2013-01-08 8:41 ` Preeti U Murthy
2013-01-16 14:08 ` Alex Shi
0 siblings, 1 reply; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-08 8:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, Alex Shi, srikar
Hi Mike,
Thank you very much for such a clear and comprehensive explanation.
So when I put together the problem and the proposed solution pieces in
the current scheduler scalability work, the following is what I found:
1. select_idle_sibling() is needed as an agent to correctly find the
right cpu for woken up tasks to go to. "Correctly" would mean finding an
idle cpu at the lowest cost possible.
2. The cost could be lowered either by optimizing the order of searching
for an idle cpu, or by restricting the search to a few cpus alone.
3. The former has the problem that it would not prevent bouncing tasks
all over the domain sharing an L3 cache, which could potentially affect
fast moving tasks.
4. The latter has the problem that it is not aggressive enough in
finding an idle cpu.
This is a tangled problem, but I think the solution can at best be
smoothed into a flowchart.

        STEP1                        STEP2                       STEP3
 _____________________
|                     |
|See if the idle buddy|  No   ___________________   Yes   ________________
|is free at all sched |----->| Do we search the  |------>|Optimized search|
|domains              |      | sched domains     |       |________________|
|_____________________|      | for an idle cpu?  |               |
          |Yes               |___________________|              \|/
         \|/                         |No: saturated       Return target cpu
        Return                      \|/ system
      cpu buddy               Return prev_cpu
I don't know how well we can do STEP2; that seems to be the biggest
challenge. STEP1 is the reverted commit 37407ea7,
"sched: Improve scalability via 'CPU buddies', which withstand random perturbations",
whose drawbacks can hopefully be overcome by this flowchart.
I have tried to tackle STEP3. STEP3 will not prevent bouncing, but a
good STEP2 could tell us if it is worth the bounce. A sketch of the
overall flow follows, and the STEP3 patch comes after it.
STEP3 Patch is given below:
***********************START PATCH**************************************
sched: Reduce the overhead of select_idle_sibling
From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
The traversal of the sched domains to find the idlest cpu is currently a costly
operation in select_idle_sibling() due to large L3 caches.
So the traversal had better be worth the effort.
In the current approach, in select_idle_sibling(), the sched domains are
searched top to bottom, starting at the last level cache domain, and the search
is for the idlest *group*. It is less likely that you will find a completely
idle group at a higher level sched domain than at a lower level sched
domain. This makes the first few iterations fruitless most of the time.
And it is less likely to find an idle *group* than an idle *cpu*.
This patch aims at finding an idle cpu. And since the iterations are from bottom to
top, stopping at the last level cache domain, it tries to see if the task can
be scheduled on a core which shares the L2 cache with the waking cpu/prev cpu first,
before probing the cpus at the higher level domain, trying to make the cost of
migration lower than in the current approach.
Also, it tries to return an idle cpu as soon as it finds one, without querying
the status of the other cpus in the sched group. This could potentially be
much faster, unless the worst case, not finding an idle cpu at any
domain, occurs.
---
kernel/sched/fair.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29cdbf..6123bca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3310,7 +3310,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
{
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
- struct sched_domain *sd;
+ struct sched_domain *sd, *tmp;
struct sched_group *sg;
int i;
@@ -3332,24 +3332,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
sd = rcu_dereference(per_cpu(sd_llc, target));
- for_each_lower_domain(sd) {
- sg = sd->groups;
+ /*
+ * Iterate the domains bottom up, to cash in on the shared lower level
+ * cache advantages.
+ * Since this iteration is costly, return any idle cpu and don't wait
+ * for a completely idle group.
+ */
+ for_each_domain(target, tmp) {
+ sg = tmp->groups;
do {
if (!cpumask_intersects(sched_group_cpus(sg),
tsk_cpus_allowed(p)))
goto next;
-
for_each_cpu(i, sched_group_cpus(sg)) {
- if (!idle_cpu(i))
- goto next;
+ if (idle_cpu(i)) {
+ target = i;
+ goto done;
+ }
}
+next: sg = sg->next;
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
-next:
- sg = sg->next;
- } while (sg != sd->groups);
+ } while (sg != tmp->groups);
+
+ /* if you have covered the llc domain then break out; it is
+ * not worth migrating at any higher level domain
+ */
+ if (tmp == sd)
+ break;
}
done:
return target;
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-08 6:06 ` Preeti U Murthy
@ 2013-01-08 14:04 ` Vincent Guittot
2013-01-09 3:14 ` Preeti U Murthy
0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2013-01-08 14:04 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Mike Galbraith, Matthew Garrett, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev
On 8 January 2013 07:06, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 01/07/2013 09:18 PM, Vincent Guittot wrote:
>> On 2 January 2013 05:22, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi everyone,
>>> I have been looking at how different workloads react when the per entity
>>> load tracking metric is integrated into the load balancer and what are
>>> the possible reasons for it.
>>>
>>> I had posted the integration patch earlier:
>>> https://lkml.org/lkml/2012/11/15/391
>>>
>>> Essentially what I am doing is:
>>> 1.I have disabled CONFIG_FAIR_GROUP_SCHED to make the analysis simple
>>> 2.I have replaced cfs_rq->load.weight in weighted_cpuload() with
>>> cfs.runnable_load_avg,the active load tracking metric.
>>> 3.I have replaced se.load.weight in task_h_load() with
>>> se.load.avg.contrib,the per entity load tracking metric.
>>> 4.The load balancer will end up using these metrics.
>>>
>>> After conducting experiments on several workloads I found out that the
>>> performance of the workloads with the above integration would neither
>>> improve nor deteriorate.And this observation was consistent.
>>>
>>> Ideally the performance should have improved considering,that the metric
>>> does better tracking of load.
>>>
>>> Let me explain with a simple example as to why we should see a
>>> performance improvement ideally:Consider 2 80% tasks and 1 40% task.
>>>
>>> With integration:
>>> ----------------
>>>
>>> 40%
>>> 80% 40%
>>> cpu1 cpu2
>>>
>>> The above will be the scenario when the tasks fork initially.And this is
>>> a perfectly balanced system,hence no more load balancing.And proper
>>> distribution of loads on the cpu.
>>>
>>> Without integration
>>> -------------------
>>>
>>> 40% 40%
>>> 80% 40% 80% 40%
>>> cpu1 cpu2 OR cpu1 cpu2
>>>
>>> Because the view is that all the tasks as having the same load.The load
>>> balancer could ping pong tasks between these two situations.
>>>
>>> When I performed this experiment,I did not see an improvement in the
>>> performance though in the former case.On further observation I found
>>> that the following was actually happening.
>>>
>>> With integration
>>> ----------------
>>>
>>> Initially 40% task sleeps 40% task wakes up
>>> and select_idle_sibling()
>>> decides to wake it up on cpu1
>>>
>>> 40% -> -> 40%
>>> 80% 40% 80% 40% 80% 40%
>>> cpu1 cpu2 cpu1 cpu2 cpu1 cpu2
>>>
>>>
>>> This makes load balance trigger movement of 40% from cpu1 back to
>>> cpu2.Hence the stability that the load balancer was trying to achieve is
>>> gone.Hence the culprit boils down to select_idle_sibling.How is it the
>>> culprit and how is it hindering performance of the workloads?
>>>
>>> *What is the way ahead with the per entity load tracking metric in the
>>> load balancer then?*
>>>
>>> In replies to a post by Paul in https://lkml.org/lkml/2012/12/6/105,
>>> he mentions the following:
>>>
>>> "It is my intuition that the greatest carnage here is actually caused
>>> by wake-up load-balancing getting in the way of periodic in
>>> establishing a steady state. I suspect more mileage would result from
>>> reducing the interference wake-up load-balancing has with steady
>>> state."
>>>
>>> "The whole point of using blocked load is so that you can converge on a
>>> steady state where you don't NEED to move tasks. What disrupts this is
>>> we naturally prefer idle cpus on wake-up balance to reduce wake-up
>>> latency. I think the better answer is making these two processes load
>>> balancing() and select_idle_sibling() more co-operative."
>>>
>>> I had not realised how this would happen until I saw it happening in the
>>> above experiment.
>>>
>>> Based on what Paul explained above let us use the runnable load + the
>>> blocked load for calculating the load on a cfs runqueue rather than just
>>> the runnable load(which is what i am doing now) and see its consequence.
>>>
>>> Initially: 40% task sleeps
>>>
>>> 40%
>>> 80% 40% -> 80% 40%
>>> cpu1 cpu2 cpu1 cpu2
>>>
>>> So initially the load on cpu1 is say 80 and on cpu2 also it is
>>> 80.Balanced.Now when 40% task sleeps,the total load on cpu2=runnable
>>> load+blocked load.which is still 80.
>>>
>>> As a consequence,firstly,during periodic load balancing the load is not
>>> moved from cpu1 to cpu2 when the 40% task sleeps.(It sees the load on
>>> cpu2 as 80 and not as 40).
>>> Hence the above scenario remains the same.On wake up,what happens?
>>>
>>> Here comes the point of making both load balancing and wake up
>>> balance(select_idle_sibling) co operative. How about we always schedule
>>> the woken up task on the prev_cpu? This seems more sensible considering
>>> load balancing considers blocked load as being a part of the load of cpu2.
>>
>> Hi Preeti,
>>
>> I'm not sure that we want such steady state at cores level because we
>> take advantage of migrating wake up tasks between cores that share
>> their cache as Matthew demonstrated. But I agree that reaching such
>> steady state at cluster and CPU level is interesting.
>>
>> IMHO, you're right that taking the blocked load into consideration
>> should minimize tasks migration between cluster but it should no
>> prevent fast task migration between cores that share their cache
>
> True Vincent.But I think the one disadvantage even at cpu or cluster
> level is that when we consider blocked load, we might prevent any more
> tasks from being scheduled on that cpu during periodic load balance if
> the blocked load is too much.This is very poor cpu utilization
The blocked load of a cluster will be high if the blocked tasks have
run recently. The contribution of a blocked task is divided by 2
every 32ms, so a high blocked load will be made up of recently
running tasks, and long-sleeping tasks will not influence the load
balancing.
The load balance period is between 1 tick (10ms for idle load balance
on ARM) and up to 256ms (for busy load balance), so a high blocked
load should imply some tasks that have run recently; otherwise your
blocked load will be small and will not have a large influence on your
load balance.
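As a rough illustration of this decay (a sketch only; the kernel's
actual decay_load() also handles partial periods and uses precomputed
multipliers), the contribution of a task blocked for a whole number of
32ms periods is simply halved once per period:

#define LOAD_AVG_PERIOD	32	/* ms after which a contribution halves */

/*
 * Sketch: remaining contribution of a task that has been blocked for
 * ms_asleep ms (assuming the shift stays within the type width).
 */
static inline unsigned long blocked_contrib(unsigned long contrib,
					    unsigned long ms_asleep)
{
	return contrib >> (ms_asleep / LOAD_AVG_PERIOD);
}

So a task that contributed 1024 and has been blocked for 256ms retains
only 1024 >> 8 = 4: long sleepers decay to noise well within the 256ms
busy load balance period.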
>
> Also we can consider steady states if the waking tasks have a specific
> waking pattern.I am not sure if we can risk hoping that the blocked task
> would wake up soon or would wake up at time 'x' and utilize that cpu.
Ok, so you don't consider using blocked load in load balancing any more?
regards,
Vincent
>
>>
>> Vincent
>
> Regards
> Preeti U Murthy
>
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-08 14:04 ` Vincent Guittot
@ 2013-01-09 3:14 ` Preeti U Murthy
2013-01-20 15:30 ` Alex Shi
0 siblings, 1 reply; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-09 3:14 UTC (permalink / raw)
To: Vincent Guittot, Mike Galbraith
Cc: Matthew Garrett, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Alex Shi
>>>> Here comes the point of making both load balancing and wake up
>>>> balance(select_idle_sibling) co operative. How about we always schedule
>>>> the woken up task on the prev_cpu? This seems more sensible considering
>>>> load balancing considers blocked load as being a part of the load of cpu2.
>>>
>>> Hi Preeti,
>>>
>>> I'm not sure that we want such steady state at cores level because we
>>> take advantage of migrating wake up tasks between cores that share
>>> their cache as Matthew demonstrated. But I agree that reaching such
>>> steady state at cluster and CPU level is interesting.
>>>
>>> IMHO, you're right that taking the blocked load into consideration
>>> should minimize tasks migration between cluster but it should no
>>> prevent fast task migration between cores that share their cache
>>
>> True Vincent.But I think the one disadvantage even at cpu or cluster
>> level is that when we consider blocked load, we might prevent any more
>> tasks from being scheduled on that cpu during periodic load balance if
>> the blocked load is too much.This is very poor cpu utilization
>
> The blocked load of a cluster will be high if the blocked tasks have
> run recently. The contribution of a blocked task will be divided by 2
> each 32ms, so it means that a high blocked load will be made of recent
> running tasks and the long sleeping tasks will not influence the load
> balancing.
> The load balance period is between 1 tick (10ms for idle load balance
> on ARM) and up to 256 ms (for busy load balance) so a high blocked
> load should imply some tasks that have run recently otherwise your
> blocked load will be small and will not have a large influence on your
> load balance
Makes a lot of sense.
>> Also we can consider steady states if the waking tasks have a specific
>> waking pattern.I am not sure if we can risk hoping that the blocked task
>> would wake up soon or would wake up at time 'x' and utilize that cpu.
>
> Ok, so you don't consider to use blocked load in load balancing any more ?
Hmm.. this has got me thinking. I thought the existing
select_idle_sibling() problem, of bouncing tasks all over the L3 package
and taking time to find an idle buddy, could be solved in isolation with
PJT's metric. But that does not seem to be the case, considering the
suggestions from you and Mike.
Currently there are so many approaches proposed to improve the scheduler
that it is confusing as to how and which pieces fit well. Let me lay them
down. Please do help me put them together.
Jigsaw Piece 1: Use PJT's metric in load balancing, with blocked
load + runnable load as part of the cpu load while load balancing.
Jigsaw Piece 2: select_idle_sibling() choosing the cpu to wake up tasks on.
Jigsaw Piece 3: The 'cpu buddy' concept to prevent bouncing of tasks.
Considering both your and Mike's suggestions, what do you think of
the following puzzle and solution?
*Puzzle*: Waking tasks should not take too much time to find a cpu to
run on, should not keep bouncing across too many cpus all over the
package, and should try not to create too much of an imbalance
in the load distribution of the cpus.
*Solution:*
Place Jigsaw Piece 1 first: Use PJT's metric and blocked load + runnable
load as part of the cpu load while load balancing.
(As time passes, the blocked load becomes less significant on that
cpu, hence load balancing will go on as usual.)
Place Jigsaw Piece 2 next: When tasks wake up, **use
select_idle_sibling() to see only if you can migrate tasks between cores
that share their cache**,
IOW, see if a cpu at the lowest level sched domain is idle. If it is,
then schedule on it, and migrate_task_rq_fair() will remove the load from
the prev_cpu; if none is idle, then return prev_cpu, which had already
considered the blocked load as part of its overall load. Hence very
little imbalance will be created. A minimal sketch of this idea is given below.
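Here is that sketch of Jigsaw Piece 2 (hypothetical names, not a tested
patch): probe only the lowest level sched domain of prev_cpu, i.e. the
cores sharing its cache, and otherwise stay with prev_cpu, whose load
already accounts for the blocked contribution:

static int select_idle_sibling_lowest(struct task_struct *p, int prev_cpu)
{
	struct sched_domain *sd;
	int i;

	/* rq->sd is the lowest level sched domain of prev_cpu */
	sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
	if (!sd)
		return prev_cpu;

	for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p))
		if (idle_cpu(i))
			return i;

	/* no idle cache-sharing sibling: stay with prev_cpu */
	return prev_cpu;
}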
*Possible End Picture*
Waking up tasks will not take time to find a cpu, since we are probing
the cpus at only one sched domain level. The bouncing of tasks will be
restricted to the core level. An imbalance will not be created, as the
blocked load is also considered while load balancing.
*Concerns*
1. Is the wake up load balancing in this solution less aggressive, so as
to harm throughput significantly?
2. Do we need Jigsaw Piece 3 at all?
Please do let me know what you all think. Thank you very much for your
suggestions.
>
> regards,
> Vincent
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-08 8:41 ` Preeti U Murthy
@ 2013-01-16 14:08 ` Alex Shi
2013-01-17 5:17 ` Namhyung Kim
2013-01-17 8:45 ` Preeti U Murthy
0 siblings, 2 replies; 24+ messages in thread
From: Alex Shi @ 2013-01-16 14:08 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, srikar
On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
> Hi Mike,
>
> Thank you very much for such a clear and comprehensive explanation.
> So when I put together the problem and the proposed solution pieces in the current
> scheduler scalability,the following was what I found:
>
> 1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake
> up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible.
> 2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or
> restricting the search to a few cpus alone.
> 3. The former has the problem that it would not prevent bouncing tasks all over the domain
> sharing an L3 cache,which could potentially affect the fast moving tasks.
> 4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
>
> This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart.
>
> STEP1 STEP2 STEP3
> _____________________
> | |
> |See if the idle buddy|No _________________ Yes ________________
> |is free at all sched |---->| Do we search the|----> |Optimized search|
> |domains | |sched domains | |________________|
> |_____________________| |for an idle cpu | |
> |Yes |_________________| \|/
> \|/ |No: saturated Return target cpu
> Return \|/ system
> cpu buddy Return prev_cpu
>
I re-wrote the patch as follows. hackbench/aim9 doesn't show a clean performance change.
Actually we can get some profit, but it will be very slight. :)
BTW, it still needs another patch before applying this. This is just to show the logic.
===========
From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 11 Jan 2013 16:49:03 +0800
Subject: [PATCH 3/7] sched: select_idle_sibling optimization
The current logic in this function insists on waking up the task in a
totally idle group; otherwise it falls back to the previous cpu.
The new logic will try to wake up the task on any idle cpu in the same
cpu socket (in the same sd_llc), with an idle cpu in a smaller domain
having higher priority.
It should be of some help on burst wake-up benchmarks like aim7.
Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
1 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e116215..fa40e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p)
+static int select_idle_sibling(struct task_struct *p,
+ struct sched_domain *affine_sd, int sync)
{
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
struct sched_domain *sd;
struct sched_group *sg;
- int i;
/*
* If the task is going to be woken-up on this cpu and if it is
@@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
- sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
- for_each_lower_domain(sd) {
+ for_each_domain(prev_cpu, sd) {
sg = sd->groups;
do {
- if (!cpumask_intersects(sched_group_cpus(sg),
- tsk_cpus_allowed(p)))
- goto next;
-
- for_each_cpu(i, sched_group_cpus(sg)) {
- if (!idle_cpu(i))
- goto next;
- }
-
- prev_cpu = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
-next:
- sg = sg->next;
- } while (sg != sd->groups);
+ int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+ int i;
+
+ /* no idle cpu in the group */
+ if (nr_busy == sg->group_weight)
+ continue;
+ for_each_cpu_and(i, sched_group_cpus(sg),
+ tsk_cpus_allowed(p))
+ if (idle_cpu(i))
+ return i;
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* only wake up task on the same cpu socket as prev cpu */
+ if (sd == per_cpu(sd_llc, prev_cpu))
+ break;
}
-done:
return prev_cpu;
}
@@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
if (affine_sd) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+ new_cpu = select_idle_sibling(p, affine_sd, sync);
goto unlock;
}
--
1.7.5.4
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-16 14:08 ` Alex Shi
@ 2013-01-17 5:17 ` Namhyung Kim
2013-01-17 10:16 ` Preeti U Murthy
2013-01-17 13:41 ` Alex Shi
2013-01-17 8:45 ` Preeti U Murthy
1 sibling, 2 replies; 24+ messages in thread
From: Namhyung Kim @ 2013-01-17 5:17 UTC (permalink / raw)
To: Alex Shi
Cc: Preeti U Murthy, Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Vincent Guittot, Peter Zijlstra, Viresh Kumar,
Amit Kucheria, Morten Rasmussen, Paul McKenney, Andrew Morton,
Arjan van de Ven, Ingo Molnar, Paul Turner, Venki Pallipadi,
Robin Randhawa, Lists linaro-dev, Matthew Garrett, srikar
On Wed, 16 Jan 2013 22:08:21 +0800, Alex Shi wrote:
> On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
>> Hi Mike,
>>
>> Thank you very much for such a clear and comprehensive explanation.
>> So when I put together the problem and the proposed solution pieces in the current
>> scheduler scalability,the following was what I found:
>>
>> 1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake
>> up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible.
>> 2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or
>> restricting the search to a few cpus alone.
>> 3. The former has the problem that it would not prevent bouncing tasks all over the domain
>> sharing an L3 cache,which could potentially affect the fast moving tasks.
>> 4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
>>
>> This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart.
>>
>> STEP1 STEP2 STEP3
>> _____________________
>> | |
>> |See if the idle buddy|No _________________ Yes ________________
>> |is free at all sched |---->| Do we search the|----> |Optimized search|
>> |domains | |sched domains | |________________|
>> |_____________________| |for an idle cpu | |
>> |Yes |_________________| \|/
>> \|/ |No: saturated Return target cpu
>> Return \|/ system
>> cpu buddy Return prev_cpu
>>
>
>
>
> I re-written the patch as following. hackbench/aim9 doest show clean performance change.
> Actually we can get some profit. it also will be very slight. :)
> BTW, it still need another patch before apply this. Just to show the logical.
>
> ===========
>> From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Fri, 11 Jan 2013 16:49:03 +0800
> Subject: [PATCH 3/7] sched: select_idle_sibling optimization
>
> Current logical in this function will insist to wake up the task in a
> totally idle group, otherwise it would rather back to previous cpu.
Or the current cpu, depending on the result of wake_affine(), right?
>
> The new logical will try to wake up the task on any idle cpu in the same
> cpu socket (in same sd_llc), while idle cpu in the smaller domain has
> higher priority.
But what about SMT domain?
I mean it seems that the code prefers running a task on an idle cpu which
is a sibling thread in the same core rather than running it on an idle
cpu in another idle core. I guess we didn't do that before.
>
> It should has some help on burst wake up benchmarks like aim7.
>
> Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
> kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
> 1 files changed, 19 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e116215..fa40e49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> -static int select_idle_sibling(struct task_struct *p)
> +static int select_idle_sibling(struct task_struct *p,
> + struct sched_domain *affine_sd, int sync)
Where are these arguments used?
> {
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> struct sched_domain *sd;
> struct sched_group *sg;
> - int i;
>
> /*
> * If the task is going to be woken-up on this cpu and if it is
> @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
> /*
> * Otherwise, iterate the domains and find an elegible idle cpu.
> */
> - sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
> - for_each_lower_domain(sd) {
> + for_each_domain(prev_cpu, sd) {
Always start from the prev_cpu?
> sg = sd->groups;
> do {
> - if (!cpumask_intersects(sched_group_cpus(sg),
> - tsk_cpus_allowed(p)))
> - goto next;
> -
> - for_each_cpu(i, sched_group_cpus(sg)) {
> - if (!idle_cpu(i))
> - goto next;
> - }
> -
> - prev_cpu = cpumask_first_and(sched_group_cpus(sg),
> - tsk_cpus_allowed(p));
> - goto done;
> -next:
> - sg = sg->next;
> - } while (sg != sd->groups);
> + int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
> + int i;
> +
> + /* no idle cpu in the group */
> + if (nr_busy == sg->group_weight)
> + continue;
Maybe we can skip the local group: since it's a bottom-up search, we know
there's no idle cpu in the lower domain from the prior iteration.
> + for_each_cpu_and(i, sched_group_cpus(sg),
> + tsk_cpus_allowed(p))
> + if (idle_cpu(i))
> + return i;
> + } while (sg = sg->next, sg != sd->groups);
> +
> + /* only wake up task on the same cpu socket as prev cpu */
> + if (sd == per_cpu(sd_llc, prev_cpu))
> + break;
> }
> -done:
> return prev_cpu;
> }
>
> @@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> }
>
> if (affine_sd) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + new_cpu = select_idle_sibling(p, affine_sd, sync);
> goto unlock;
> }
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-16 14:08 ` Alex Shi
2013-01-17 5:17 ` Namhyung Kim
@ 2013-01-17 8:45 ` Preeti U Murthy
1 sibling, 0 replies; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-17 8:45 UTC (permalink / raw)
To: Alex Shi
Cc: Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com, Paul E. McKenney,
Vincent Guittot, Peter Zijlstra, Viresh Kumar, Amit Kucheria,
Morten Rasmussen, Paul McKenney, Andrew Morton, Arjan van de Ven,
Ingo Molnar, Paul Turner, Venki Pallipadi, Robin Randhawa,
Lists linaro-dev, Matthew Garrett, srikar
Hi Alex,
On 01/16/2013 07:38 PM, Alex Shi wrote:
> On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
>> Hi Mike,
>>
>> Thank you very much for such a clear and comprehensive explanation.
>> So when I put together the problem and the proposed solution pieces in the current
>> scheduler scalability,the following was what I found:
>>
>> 1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake
>> up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible.
>> 2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or
>> restricting the search to a few cpus alone.
>> 3. The former has the problem that it would not prevent bouncing tasks all over the domain
>> sharing an L3 cache,which could potentially affect the fast moving tasks.
>> 4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
>>
>> This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart.
>>
>> STEP1 STEP2 STEP3
>> _____________________
>> | |
>> |See if the idle buddy|No _________________ Yes ________________
>> |is free at all sched |---->| Do we search the|----> |Optimized search|
>> |domains | |sched domains | |________________|
>> |_____________________| |for an idle cpu | |
>> |Yes |_________________| \|/
>> \|/ |No: saturated Return target cpu
>> Return \|/ system
>> cpu buddy Return prev_cpu
>>
>
>
>
> I re-written the patch as following. hackbench/aim9 doest show clean performance change.
> Actually we can get some profit. it also will be very slight. :)
> BTW, it still need another patch before apply this. Just to show the logical.
>
> ===========
> From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Fri, 11 Jan 2013 16:49:03 +0800
> Subject: [PATCH 3/7] sched: select_idle_sibling optimization
>
> Current logical in this function will insist to wake up the task in a
> totally idle group, otherwise it would rather back to previous cpu.
As Namhyung pointed out, this could be the waking cpu as well.
>
> The new logical will try to wake up the task on any idle cpu in the same
> cpu socket (in same sd_llc), while idle cpu in the smaller domain has
> higher priority.
Here is where the problem of large sockets comes in. select_idle_sibling()
has its main weakness here. No doubt this patch could improve
performance over the current logic, but it will still retain the major
drawback of searching for an idle cpu across a large socket in the worst case.
>
> It should has some help on burst wake up benchmarks like aim7.
>
> Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
> kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
> 1 files changed, 19 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e116215..fa40e49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> -static int select_idle_sibling(struct task_struct *p)
> +static int select_idle_sibling(struct task_struct *p,
> + struct sched_domain *affine_sd, int sync)
As Namhyung pointed out, where is affine_sd being used?
> {
> int cpu = smp_processor_id();
> int prev_cpu = task_cpu(p);
> struct sched_domain *sd;
> struct sched_group *sg;
> - int i;
>
> /*
> * If the task is going to be woken-up on this cpu and if it is
> @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
> /*
> * Otherwise, iterate the domains and find an elegible idle cpu.
> */
> - sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
> - for_each_lower_domain(sd) {
> + for_each_domain(prev_cpu, sd) {
Why is prev_cpu being used? Ideally it should be the target cpu
(waking/prev) depending on what wake_affine() decides.
> sg = sd->groups;
> do {
> - if (!cpumask_intersects(sched_group_cpus(sg),
> - tsk_cpus_allowed(p)))
> - goto next;
> -
> - for_each_cpu(i, sched_group_cpus(sg)) {
> - if (!idle_cpu(i))
> - goto next;
> - }
> -
> - prev_cpu = cpumask_first_and(sched_group_cpus(sg),
> - tsk_cpus_allowed(p));
> - goto done;
> -next:
> - sg = sg->next;
> - } while (sg != sd->groups);
> + int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
> + int i;
> +
> + /* no idle cpu in the group */
> + if (nr_busy == sg->group_weight)
> + continue;
> + for_each_cpu_and(i, sched_group_cpus(sg),
> + tsk_cpus_allowed(p))
> + if (idle_cpu(i))
> + return i;
> + } while (sg = sg->next, sg != sd->groups);
> +
> + /* only wake up task on the same cpu socket as prev cpu */
> + if (sd == per_cpu(sd_llc, prev_cpu))
> + break;
> }
> -done:
> return prev_cpu;
The target cpu needs to be returned, right? I don't understand why prev_cpu is
considered everywhere.
> }
>
> @@ -3355,7 +3353,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> }
>
> if (affine_sd) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + new_cpu = select_idle_sibling(p, affine_sd, sync);
> goto unlock;
> }
To overcome the drawback of large sockets, I had suggested using
blocked_load + runnable_load from PJT's metric, and having
select_idle_sibling() query only the L2 cache domains.
https://lkml.org/lkml/2013/1/8/619
What do you think about this?
>
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-17 5:17 ` Namhyung Kim
@ 2013-01-17 10:16 ` Preeti U Murthy
2013-01-17 13:41 ` Alex Shi
1 sibling, 0 replies; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-17 10:16 UTC (permalink / raw)
To: Namhyung Kim
Cc: Alex Shi, Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Vincent Guittot, Peter Zijlstra, Viresh Kumar,
Amit Kucheria, Morten Rasmussen, Paul McKenney, Andrew Morton,
Arjan van de Ven, Ingo Molnar, Paul Turner, Venki Pallipadi,
Robin Randhawa, Lists linaro-dev, Matthew Garrett, srikar
Hi Namhyung,
>> I re-written the patch as following. hackbench/aim9 doest show clean performance change.
>> Actually we can get some profit. it also will be very slight. :)
>> BTW, it still need another patch before apply this. Just to show the logical.
>>
>> ===========
>>> From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@intel.com>
>> Date: Fri, 11 Jan 2013 16:49:03 +0800
>> Subject: [PATCH 3/7] sched: select_idle_sibling optimization
>>
>> Current logical in this function will insist to wake up the task in a
>> totally idle group, otherwise it would rather back to previous cpu.
>
> Or current cpu depending on result of wake_affine(), right?
>
>>
>> The new logical will try to wake up the task on any idle cpu in the same
>> cpu socket (in same sd_llc), while idle cpu in the smaller domain has
>> higher priority.
>
> But what about SMT domain?
The previous approach also descended to the SMT domain; here we start
from the SMT domain.
You can check /proc/schedstat to see which different domains the cpu is
a part of, and the SMT domain happens to be domain0. As far as I know,
for_each_lower_domain() will descend to domain0.
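For reference, the two iterators being contrasted look roughly like this
in kernel/sched/sched.h of that era (for_each_domain() walks ->parent
upwards from the base, i.e. lowest level, domain of a cpu, while
for_each_lower_domain() walks ->child downwards from wherever you start):

#define for_each_domain(cpu, __sd) \
	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
	     __sd; __sd = __sd->parent)

#define for_each_lower_domain(sd) for (; sd; sd = sd->child)

So for_each_lower_domain() starting at sd_llc ends at domain0, the SMT
domain, whereas for_each_domain() begins there.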
>
> I mean it seems that the code prefers running a task on a idle cpu which
> is a sibling thread in the same core rather than running it on an idle
> cpu in another idle core. I guess we didn't do that before.
>
>>
>> It should has some help on burst wake up benchmarks like aim7.
>>
>> Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>> kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
>> 1 files changed, 19 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e116215..fa40e49 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> /*
>> * Try and locate an idle CPU in the sched_domain.
>> */
>> -static int select_idle_sibling(struct task_struct *p)
>> +static int select_idle_sibling(struct task_struct *p,
>> + struct sched_domain *affine_sd, int sync)
>
> Where are these arguments used?
>
>
>> {
>> int cpu = smp_processor_id();
>> int prev_cpu = task_cpu(p);
>> struct sched_domain *sd;
>> struct sched_group *sg;
>> - int i;
>>
>> /*
>> * If the task is going to be woken-up on this cpu and if it is
>> @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
>> /*
>> * Otherwise, iterate the domains and find an elegible idle cpu.
>> */
>> - sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
>> - for_each_lower_domain(sd) {
>> + for_each_domain(prev_cpu, sd) {
>
> Always start from the prev_cpu?
>
>
>> sg = sd->groups;
>> do {
>> - if (!cpumask_intersects(sched_group_cpus(sg),
>> - tsk_cpus_allowed(p)))
>> - goto next;
>> -
>> - for_each_cpu(i, sched_group_cpus(sg)) {
>> - if (!idle_cpu(i))
>> - goto next;
>> - }
>> -
>> - prev_cpu = cpumask_first_and(sched_group_cpus(sg),
>> - tsk_cpus_allowed(p));
>> - goto done;
>> -next:
>> - sg = sg->next;
>> - } while (sg != sd->groups);
>> + int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
>> + int i;
>> +
>> + /* no idle cpu in the group */
>> + if (nr_busy == sg->group_weight)
>> + continue;
>
> Maybe we can skip local group since it's a bottom-up search so we know
> there's no idle cpu in the lower domain from the prior iteration.
We could have done this with the current logic, because it checks for an
*idle* group; the local group would definitely fail this test. But here we
need to check the local group also, because we are looking for an idle cpu.
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-17 5:17 ` Namhyung Kim
2013-01-17 10:16 ` Preeti U Murthy
@ 2013-01-17 13:41 ` Alex Shi
2013-01-24 3:13 ` Alex Shi
1 sibling, 1 reply; 24+ messages in thread
From: Alex Shi @ 2013-01-17 13:41 UTC (permalink / raw)
To: Namhyung Kim
Cc: Preeti U Murthy, Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Vincent Guittot, Peter Zijlstra, Viresh Kumar,
Amit Kucheria, Morten Rasmussen, Paul McKenney, Andrew Morton,
Arjan van de Ven, Ingo Molnar, Paul Turner, Venki Pallipadi,
Robin Randhawa, Lists linaro-dev, Matthew Garrett, srikar
On 01/17/2013 01:17 PM, Namhyung Kim wrote:
> On Wed, 16 Jan 2013 22:08:21 +0800, Alex Shi wrote:
>> On 01/08/2013 04:41 PM, Preeti U Murthy wrote:
>>> Hi Mike,
>>>
>>> Thank you very much for such a clear and comprehensive explanation.
>>> So when I put together the problem and the proposed solution pieces in the current
>>> scheduler scalability,the following was what I found:
>>>
>>> 1. select_idle_sibling() is needed as an agent to correctly find the right cpu for wake
>>> up tasks to go to."Correctly" would be to find an idle cpu at the lowest cost possible.
>>> 2."Cost could be lowered" either by optimizing the order of searching for an idle cpu or
>>> restricting the search to a few cpus alone.
>>> 3. The former has the problem that it would not prevent bouncing tasks all over the domain
>>> sharing an L3 cache,which could potentially affect the fast moving tasks.
>>> 4. The latter has the problem that it is not aggressive enough in finding an idle cpu.
>>>
>>> This is some tangled problem,but I think the solution at best could be smoothed to a a flowchart.
>>>
>>> STEP1 STEP2 STEP3
>>> _____________________
>>> | |
>>> |See if the idle buddy|No _________________ Yes ________________
>>> |is free at all sched |---->| Do we search the|----> |Optimized search|
>>> |domains | |sched domains | |________________|
>>> |_____________________| |for an idle cpu | |
>>> |Yes |_________________| \|/
>>> \|/ |No: saturated Return target cpu
>>> Return \|/ system
>>> cpu buddy Return prev_cpu
>>>
>>
>>
>>
>> I re-written the patch as following. hackbench/aim9 doest show clean performance change.
>> Actually we can get some profit. it also will be very slight. :)
>> BTW, it still need another patch before apply this. Just to show the logical.
>>
>> ===========
>>> From 145ff27744c8ac04eda056739fe5aa907a00877e Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@intel.com>
>> Date: Fri, 11 Jan 2013 16:49:03 +0800
>> Subject: [PATCH 3/7] sched: select_idle_sibling optimization
>>
>> Current logical in this function will insist to wake up the task in a
>> totally idle group, otherwise it would rather back to previous cpu.
>
> Or current cpu depending on result of wake_affine(), right?
>
>>
>> The new logical will try to wake up the task on any idle cpu in the same
>> cpu socket (in same sd_llc), while idle cpu in the smaller domain has
>> higher priority.
>
> But what about SMT domain?
>
> I mean it seems that the code prefers running a task on a idle cpu which
> is a sibling thread in the same core rather than running it on an idle
> cpu in another idle core. I guess we didn't do that before.
>
>>
>> It should has some help on burst wake up benchmarks like aim7.
>>
>> Original-patch-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>> kernel/sched/fair.c | 40 +++++++++++++++++++---------------------
>> 1 files changed, 19 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e116215..fa40e49 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3253,13 +3253,13 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> /*
>> * Try and locate an idle CPU in the sched_domain.
>> */
>> -static int select_idle_sibling(struct task_struct *p)
>> +static int select_idle_sibling(struct task_struct *p,
>> + struct sched_domain *affine_sd, int sync)
>
> Where are these arguments used?
>
>
>> {
>> int cpu = smp_processor_id();
>> int prev_cpu = task_cpu(p);
>> struct sched_domain *sd;
>> struct sched_group *sg;
>> - int i;
>>
>> /*
>> * If the task is going to be woken-up on this cpu and if it is
>> @@ -3281,27 +3281,25 @@ static int select_idle_sibling(struct task_struct *p)
>> /*
>> * Otherwise, iterate the domains and find an elegible idle cpu.
>> */
>> - sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
>> - for_each_lower_domain(sd) {
>> + for_each_domain(prev_cpu, sd) {
>
> Always start from the prev_cpu?
A previous patch will check wake_affine() and set prev_cpu = cpu accordingly.
>
>
>> sg = sd->groups;
>> do {
>> - if (!cpumask_intersects(sched_group_cpus(sg),
>> - tsk_cpus_allowed(p)))
>> - goto next;
>> -
>> - for_each_cpu(i, sched_group_cpus(sg)) {
>> - if (!idle_cpu(i))
>> - goto next;
>> - }
>> -
>> - prev_cpu = cpumask_first_and(sched_group_cpus(sg),
>> - tsk_cpus_allowed(p));
>> - goto done;
>> -next:
>> - sg = sg->next;
>> - } while (sg != sd->groups);
>> + int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
>> + int i;
>> +
>> + /* no idle cpu in the group */
>> + if (nr_busy == sg->group_weight)
>> + continue;
>
> Maybe we can skip local group since it's a bottom-up search so we know
> there's no idle cpu in the lower domain from the prior iteration.
>
I did this change, but the results seem worse on my machines; I guess starting to seek an idle cpu bottom-up is a bad idea.
The following is the full version with the above change.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
| 0*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
| 0*SD_SERIALIZE \
+ | 1*SD_PREFER_SIBLING \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..271b335 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3169,6 +3169,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}
+ /* bias toward prev cpu */
return 0;
}
@@ -3252,53 +3253,56 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p,
+ struct sched_domain *affine_sd, int sync)
{
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
- struct sched_domain *sd;
+ struct sched_domain *sd, *llcsd;
struct sched_group *sg;
- int i;
/*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
- if (target == cpu && idle_cpu(cpu))
+ if (idle_cpu(cpu))
return cpu;
/*
* If the task is going to be woken-up on the cpu where it previously
* ran and if it is currently idle, then it the right target.
*/
- if (target == prev_cpu && idle_cpu(prev_cpu))
+ if (cpu != prev_cpu && idle_cpu(prev_cpu))
return prev_cpu;
+ if (cpu != prev_cpu && !wake_affine(affine_sd, p, sync))
+ cpu = prev_cpu;
+
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
- sd = rcu_dereference(per_cpu(sd_llc, target));
- for_each_lower_domain(sd) {
+ llcsd = rcu_dereference(per_cpu(sd_llc, cpu));
+ for_each_domain(cpu, sd) {
sg = sd->groups;
do {
- if (!cpumask_intersects(sched_group_cpus(sg),
- tsk_cpus_allowed(p)))
- goto next;
-
- for_each_cpu(i, sched_group_cpus(sg)) {
- if (!idle_cpu(i))
- goto next;
- }
-
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
-next:
- sg = sg->next;
- } while (sg != sd->groups);
+ int nr_busy = atomic_read(&sg->sgp->nr_busy_cpus);
+ int i;
+
+ /* skip the local group, and groups with no idle cpu */
+ if (cpumask_test_cpu(cpu, sched_group_cpus(sg)) ||
+ nr_busy == sg->group_weight)
+ continue;
+ for_each_cpu_and(i, sched_group_cpus(sg),
+ tsk_cpus_allowed(p))
+ if (idle_cpu(i))
+ return i;
+ } while (sg = sg->next, sg != sd->groups);
+
+ /* only wake up task on the same cpu socket as prev cpu */
+ if (sd == llcsd)
+ break;
}
-done:
- return target;
+ return cpu;
}
/*
@@ -3351,10 +3355,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
if (affine_sd) {
- if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- new_cpu = select_idle_sibling(p, prev_cpu);
+ new_cpu = select_idle_sibling(p, affine_sd, sync);
goto unlock;
}
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-09 3:14 ` Preeti U Murthy
@ 2013-01-20 15:30 ` Alex Shi
2013-01-20 15:52 ` Alex Shi
0 siblings, 1 reply; 24+ messages in thread
From: Alex Shi @ 2013-01-20 15:30 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Vincent Guittot, Mike Galbraith, Matthew Garrett, LKML,
svaidy@linux.vnet.ibm.com, Paul E. McKenney, Peter Zijlstra,
Viresh Kumar, Amit Kucheria, Morten Rasmussen, Paul McKenney,
Andrew Morton, Arjan van de Ven, Ingo Molnar, Paul Turner,
Venki Pallipadi, Robin Randhawa, Lists linaro-dev
On 01/09/2013 11:14 AM, Preeti U Murthy wrote:
>>>>> Here comes the point of making both load balancing and wake up
>>>>> balance(select_idle_sibling) co operative. How about we always schedule
>>>>> the woken up task on the prev_cpu? This seems more sensible considering
>>>>> load balancing considers blocked load as being a part of the load of cpu2.
>>>>
>>>> Hi Preeti,
>>>>
>>>> I'm not sure that we want such steady state at cores level because we
>>>> take advantage of migrating wake up tasks between cores that share
>>>> their cache as Matthew demonstrated. But I agree that reaching such
>>>> steady state at cluster and CPU level is interesting.
>>>>
>>>> IMHO, you're right that taking the blocked load into consideration
>>>> should minimize tasks migration between cluster but it should no
>>>> prevent fast task migration between cores that share their cache
>>>
>>> True Vincent.But I think the one disadvantage even at cpu or cluster
>>> level is that when we consider blocked load, we might prevent any more
>>> tasks from being scheduled on that cpu during periodic load balance if
>>> the blocked load is too much.This is very poor cpu utilization
>>
>> The blocked load of a cluster will be high if the blocked tasks have
>> run recently. The contribution of a blocked task will be divided by 2
>> each 32ms, so it means that a high blocked load will be made of recent
>> running tasks and the long sleeping tasks will not influence the load
>> balancing.
>> The load balance period is between 1 tick (10ms for idle load balance
>> on ARM) and up to 256 ms (for busy load balance) so a high blocked
>> load should imply some tasks that have run recently otherwise your
>> blocked load will be small and will not have a large influence on your
>> load balance
Just tried using cfs's runnable_load_avg + blocked_load_avg in
weighted_cpuload() with my v3 patchset; aim9 shared workfile testing
shows the performance dropped 70% on the NHM EP machine. :(
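Presumably the change being tested looks something like the sketch below;
the actual v3 patchset is not shown in this thread, so the exact form is
a guess:

static unsigned long weighted_cpuload(const int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;

	/*
	 * Tracked runnable load plus decayed blocked load, instead of
	 * the instantaneous runqueue weight.
	 */
	return cfs_rq->runnable_load_avg + cfs_rq->blocked_load_avg;
}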
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-20 15:30 ` Alex Shi
@ 2013-01-20 15:52 ` Alex Shi
2013-01-21 2:40 ` Preeti U Murthy
0 siblings, 1 reply; 24+ messages in thread
From: Alex Shi @ 2013-01-20 15:52 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Vincent Guittot, Mike Galbraith, Matthew Garrett, LKML,
svaidy@linux.vnet.ibm.com, Paul E. McKenney, Peter Zijlstra,
Viresh Kumar, Amit Kucheria, Morten Rasmussen, Paul McKenney,
Andrew Morton, Arjan van de Ven, Ingo Molnar, Paul Turner,
Venki Pallipadi, Robin Randhawa, Lists linaro-dev
>>> The blocked load of a cluster will be high if the blocked tasks have
>>> run recently. The contribution of a blocked task will be divided by 2
>>> each 32ms, so it means that a high blocked load will be made of recent
>>> running tasks and the long sleeping tasks will not influence the load
>>> balancing.
>>> The load balance period is between 1 tick (10ms for idle load balance
>>> on ARM) and up to 256 ms (for busy load balance) so a high blocked
>>> load should imply some tasks that have run recently otherwise your
>>> blocked load will be small and will not have a large influence on your
>>> load balance
>
> Just tried using cfs's runnable_load_avg + blocked_load_avg in
> weighted_cpuload() with my v3 patchset, aim9 shared workfile testing
> show the performance dropped 70% more on the NHM EP machine. :(
>
Oops, the performance is still worse than just counting runnable_load_avg.
But the drop is not so big: it dropped 30%, not 70%.
--
Thanks
Alex
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-20 15:52 ` Alex Shi
@ 2013-01-21 2:40 ` Preeti U Murthy
2013-01-21 3:26 ` Alex Shi
0 siblings, 1 reply; 24+ messages in thread
From: Preeti U Murthy @ 2013-01-21 2:40 UTC (permalink / raw)
To: Alex Shi
Cc: Vincent Guittot, Mike Galbraith, Matthew Garrett, LKML,
svaidy@linux.vnet.ibm.com, Paul E. McKenney, Peter Zijlstra,
Viresh Kumar, Amit Kucheria, Morten Rasmussen, Paul McKenney,
Andrew Morton, Arjan van de Ven, Ingo Molnar, Paul Turner,
Venki Pallipadi, Robin Randhawa, Lists linaro-dev
Hi Alex,
Thank you very much for running the benchmark below on
blocked_load + runnable_load :) Just a few queries.
How did you do the wake up balancing? Did you iterate over the L3
package looking for an idle cpu? Or did you just query the L2 package
for an idle cpu?
I think when you are using blocked_load + runnable_load, it would be better
if we just query the L2 package, as Vincent had pointed out, because the
rationale behind using blocked_load + runnable_load is to keep a steady
state across cpus, unless we can reap the advantage of moving the
blocked load to a sibling core when it wakes up.
And the drop in performance is relative to what?
1.Your v3 patchset with runnable_load_avg in weighted_cpu_load().
2.Your v3 patchset with runnable_load_avg+blocked_load_avg in
weighted_cpu_load().
Are the above two what you are comparing? And in the above two versions
have you included your [PATCH] sched: use instant load weight in burst
regular load balance?
On 01/20/2013 09:22 PM, Alex Shi wrote:
>>>> The blocked load of a cluster will be high if the blocked tasks have
>>>> run recently. The contribution of a blocked task will be divided by 2
>>>> each 32ms, so it means that a high blocked load will be made of recent
>>>> running tasks and the long sleeping tasks will not influence the load
>>>> balancing.
>>>> The load balance period is between 1 tick (10ms for idle load balance
>>>> on ARM) and up to 256 ms (for busy load balance) so a high blocked
>>>> load should imply some tasks that have run recently otherwise your
>>>> blocked load will be small and will not have a large influence on your
>>>> load balance
>>
>> Just tried using cfs's runnable_load_avg + blocked_load_avg in
>> weighted_cpuload() with my v3 patchset, aim9 shared workfile testing
>> show the performance dropped 70% more on the NHM EP machine. :(
>>
>
> Ops, the performance is still worse than just count runnable_load_avg.
> But dropping is not so big, it dropped 30%, not 70%.
>
Thank you
Regards
Preeti U Murthy
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-21 2:40 ` Preeti U Murthy
@ 2013-01-21 3:26 ` Alex Shi
0 siblings, 0 replies; 24+ messages in thread
From: Alex Shi @ 2013-01-21 3:26 UTC (permalink / raw)
To: Preeti U Murthy
Cc: Vincent Guittot, Mike Galbraith, Matthew Garrett, LKML,
svaidy@linux.vnet.ibm.com, Paul E. McKenney, Peter Zijlstra,
Viresh Kumar, Amit Kucheria, Morten Rasmussen, Paul McKenney,
Andrew Morton, Arjan van de Ven, Ingo Molnar, Paul Turner,
Venki Pallipadi, Robin Randhawa, Lists linaro-dev
On 01/21/2013 10:40 AM, Preeti U Murthy wrote:
> Hi Alex,
> Thank you very much for running the below benchmark on
> blocked_load+runnable_load:) Just a few queries.
>
> How did you do the wake up balancing? Did you iterate over the L3
> package looking for an idle cpu? Or did you just query the L2 package
> for an idle cpu?
>
I just used the current select_idle_sibling() function, so it searches in the L3
package.
> I think when you are using blocked_load+runnable_load it would be better
> if we just query the L2 package as Vincent had pointed out because the
> fundamental behind using blocked_load+runnable_load is to keep a steady
> state across cpus unless we could reap the advantage of moving the
> blocked load to a sibling core when it wakes up.
>
> And the drop of performance is relative to what?
It is 2 vs 3.8-rc3.
> 1.Your v3 patchset with runnable_load_avg in weighted_cpu_load().
> 2.Your v3 patchset with runnable_load_avg+blocked_load_avg in
> weighted_cpu_load().
>
> Are the above two what you are comparing? And in the above two versions
> have you included your [PATCH] sched: use instant load weight in burst
> regular load balance?
No, that patch is not included.
>
> On 01/20/2013 09:22 PM, Alex Shi wrote:
>>>>> The blocked load of a cluster will be high if the blocked tasks have
>>>>> run recently. The contribution of a blocked task will be divided by 2
>>>>> each 32ms, so it means that a high blocked load will be made of recent
>>>>> running tasks and the long sleeping tasks will not influence the load
>>>>> balancing.
>>>>> The load balance period is between 1 tick (10ms for idle load balance
>>>>> on ARM) and up to 256 ms (for busy load balance) so a high blocked
>>>>> load should imply some tasks that have run recently otherwise your
>>>>> blocked load will be small and will not have a large influence on your
>>>>> load balance
>>>
>>> Just tried using cfs's runnable_load_avg + blocked_load_avg in
>>> weighted_cpuload() with my v3 patchset, aim9 shared workfile testing
>>> show the performance dropped 70% more on the NHM EP machine. :(
>>>
>>
>> Ops, the performance is still worse than just count runnable_load_avg.
>> But dropping is not so big, it dropped 30%, not 70%.
>>
>
> Thank you
>
> Regards
> Preeti U Murthy
>
--
Thanks
Alex
* Re: sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer
2013-01-17 13:41 ` Alex Shi
@ 2013-01-24 3:13 ` Alex Shi
0 siblings, 0 replies; 24+ messages in thread
From: Alex Shi @ 2013-01-24 3:13 UTC (permalink / raw)
To: Namhyung Kim
Cc: Preeti U Murthy, Mike Galbraith, LKML, svaidy@linux.vnet.ibm.com,
Paul E. McKenney, Vincent Guittot, Peter Zijlstra, Viresh Kumar,
Amit Kucheria, Morten Rasmussen, Paul McKenney, Andrew Morton,
Arjan van de Ven, Ingo Molnar, Paul Turner, Venki Pallipadi,
Robin Randhawa, Lists linaro-dev, Matthew Garrett, srikar
>>
>> Maybe we can skip local group since it's a bottom-up search so we know
>> there's no idle cpu in the lower domain from the prior iteration.
>>
>
> I did this change but seems results are worse on my machines, guess start seeking idle cpu bottom up is a bad idea.
> The following is full version with above change.
>
I also tried keeping the top-down seeking mode, and returning any idle cpu
instead of the first cpu in an idle group. But the result doesn't show
any improvement on the hackbench benchmark.
===
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..fb85094 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3169,6 +3169,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
return 1;
}
+ /* bias toward prev cpu */
return 0;
}
@@ -3252,7 +3253,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p,
+ struct sched_domain *affine_sd, int sync)
{
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
@@ -3264,20 +3266,23 @@ static int select_idle_sibling(struct task_struct *p, int target)
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
- if (target == cpu && idle_cpu(cpu))
+ if (idle_cpu(cpu))
return cpu;
/*
* If the task is going to be woken-up on the cpu where it previously
* ran and if it is currently idle, then it the right target.
*/
- if (target == prev_cpu && idle_cpu(prev_cpu))
+ if (cpu != prev_cpu && idle_cpu(prev_cpu))
return prev_cpu;
+ if (cpu != prev_cpu && !wake_affine(affine_sd, p, sync))
+ cpu = prev_cpu;
+
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
- sd = rcu_dereference(per_cpu(sd_llc, target));
+ sd = rcu_dereference(per_cpu(sd_llc, cpu));
for_each_lower_domain(sd) {
sg = sd->groups;
do {
@@ -3290,7 +3295,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;
}
- target = cpumask_first_and(sched_group_cpus(sg),
+ cpu = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
goto done;
next:
@@ -3298,7 +3303,7 @@ next:
} while (sg != sd->groups);
}
done:
- return target;
+ return cpu;
}
/*
@@ -3351,10 +3356,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
}
if (affine_sd) {
- if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- new_cpu = select_idle_sibling(p, prev_cpu);
+ new_cpu = select_idle_sibling(p, affine_sd, sync);
goto unlock;
}
Thread overview: 24+ messages
2013-01-02 4:22 sched: Consequences of integrating the Per Entity Load Tracking Metric into the Load Balancer Preeti U Murthy
2013-01-02 8:12 ` Mike Galbraith
2013-01-03 10:38 ` Preeti U Murthy
2013-01-03 20:06 ` Mike Galbraith
2013-01-04 11:41 ` Mike Galbraith
2013-01-05 8:13 ` Mike Galbraith
2013-01-06 16:32 ` Mike Galbraith
2013-01-07 5:29 ` Preeti U Murthy
2013-01-07 7:36 ` Mike Galbraith
2013-01-08 8:41 ` Preeti U Murthy
2013-01-16 14:08 ` Alex Shi
2013-01-17 5:17 ` Namhyung Kim
2013-01-17 10:16 ` Preeti U Murthy
2013-01-17 13:41 ` Alex Shi
2013-01-24 3:13 ` Alex Shi
2013-01-17 8:45 ` Preeti U Murthy
2013-01-07 15:48 ` Vincent Guittot
2013-01-08 6:06 ` Preeti U Murthy
2013-01-08 14:04 ` Vincent Guittot
2013-01-09 3:14 ` Preeti U Murthy
2013-01-20 15:30 ` Alex Shi
2013-01-20 15:52 ` Alex Shi
2013-01-21 2:40 ` Preeti U Murthy
2013-01-21 3:26 ` Alex Shi