From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>, Alex Shi <alex.shi@intel.com>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
	tglx@linutronix.de, akpm@linux-foundation.org,
	arjan@linux.intel.com, bp@alien8.de, pjt@google.com,
	namhyung@kernel.org, efault@gmx.de, vincent.guittot@linaro.org,
	gregkh@linuxfoundation.org, viresh.kumar@linaro.org,
	linux-kernel@vger.kernel.org
Subject: Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
Date: Thu, 14 Feb 2013 13:42:04 +0530
Message-ID: <511C9C54.4090508@linux.vnet.ibm.com>
In-Reply-To: <1360664565.4485.13.camel@laptop>

Hi everyone,

On 02/12/2013 03:52 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> >> Searching for a cpu from the bottom up in the domain tree comes from
> >> commit 3dbd5342074a1e ("sched: multilevel sbe sbf"); the purpose is
> >> to balance tasks over domains at all levels.
> >>
> >> This balancing costs too much if there are many domains/groups in a
> >> large system.
> >>
> >> If we remove this code, we get quicker fork/exec/wake balancing with
> >> a similar balancing result across the whole system.
> >>
> >> This patch improves hackbench performance by 10+% on my 4-socket SNB
> >> machines and by about 3% on 2-socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
> 
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
> 
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.

Hi Peter, Alex,

If the eligible cpus happen to be all the cpus, then iterating over all of
them to find the idlest one would be much worse than iterating over the
sched domains, right? (A rough sketch of such a flat search is included
below for reference.)
I am also wondering how important it is to bias the placement of a
forked/woken-up task onto the idlest cpu at every iteration.
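
To make Peter's suggestion concrete, a flat min search over the eligible cpus
would look roughly like the sketch below. This is untested and only meant as a
reference point for the comparison; the function name is made up, and it uses
only helpers that already appear in the code being patched
(for_each_cpu_and, sched_domain_span, tsk_cpus_allowed, weighted_cpuload):

static int find_min_load_cpu(struct sched_domain *sd, struct task_struct *p)
{
	u64 load, min_load = ~0ULL;
	int i, idlest = -1;

	/* One pass over every allowed cpu in the domain span, no group walk. */
	for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
		load = weighted_cpuload(i);
		if (load < min_load) {
			min_load = load;
			idlest = i;
		}
	}

	return idlest;
}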

If biasing towards the idlest cpu at every iteration is not really the
criterion, then we could cut down on the iterations in fork/exec/wake
balancing. The question then boils down to whether we bias our search towards
the idlest cpu or the idlest group. If we are not really concerned about
balancing load across groups, but only about finding the idlest cpu to run
the task on, then Alex's patch already covers that.

However, if the concern is to distribute the load uniformly across groups,
then I have the following patch, which might reduce the overhead of the
search for an eligible cpu for a forked/exec'd/woken-up task.

Alex, if your patch does not show favourable behavioural changes, you could
try the one below and check the same.

**************************START PATCH*************************************

sched: Improve balancing for fork/exec/wake tasks

As I see it, currently we first find the idlest group, then the idlest cpu
within it. However, the current code does not seem to trust that the cpu
selected in the current iteration is in the right child group, and so it
iterates over all the groups yet again,
*taking the selected cpu as the reference to point to the next-level sched
domain.*

Why then find the idlest cpu at every iteration, if the concern at each
iteration is primarily the idlest group? Why not find the idlest group in
every iteration and the idlest cpu only in the final iteration? As a result:

1. We save the time spent in find_idlest_cpu() going over all the cpus of a
sched group at every iteration.
2. Functionality remains the same: we find the idlest group at every level
and use a cpu of that group as the reference from which to get the next-lower
sched domain, so as to find the idlest group there. However, instead of
taking the idlest cpu as the reference, we take the first cpu of the group.
The next-level sched domain we end up with is the same either way.

*However, this completely removes the bias towards the idlest cpu at every
level.*

This patch therefore biases its search towards the right group to put the
task in, rather than towards the idlest cpu. A simplified sketch of the
resulting select path follows; the actual diff is below it.
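
For clarity only, the loop in select_task_rq_fair() after this patch looks
roughly like this (a simplified sketch, with the descent into the child
domain abbreviated as a comment; the real change is in the diff that
follows):

	while (sd) {
		if (!(sd->flags & sd_flag)) {
			sd = sd->child;
			continue;
		}

		/* One pass over the groups of this domain. */
		group = find_idlest_group(sd, p, cpu);

		/*
		 * Use the first cpu of the idlest group as the reference for
		 * descending; previously find_idlest_cpu() rescanned every
		 * cpu of that group here, at every level.
		 */
		new_cpu = group_first_cpu(group);
		cpu = new_cpu;

		/* ... pick the child domain of new_cpu that has sd_flag set ... */
	}
	/* Scan the cpus of a group only once, at the lowest level. */
	new_cpu = find_idlest_cpu(group, p);
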
---
 kernel/sched/fair.c |   53 ++++++++++++++-------------------------------------
 1 file changed, 15 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..90855dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3190,16 +3190,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	struct sched_group *this_group = NULL;
-	u64 min_load = ~0ULL, this_load = 0;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	struct sched_group;
+	u64 min_load = ~0ULL;
 
 	do {
 		u64 load, avg_load;
-		int local_group;
 		int i;
 
 		/* Skip over this group if it has no CPUs allowed */
@@ -3207,18 +3205,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 					tsk_cpus_allowed(p)))
 			continue;
 
-		local_group = cpumask_test_cpu(this_cpu,
-					       sched_group_cpus(group));
-
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
-			/* Bias balancing toward cpus of our domain */
-			if (local_group)
-				load = source_load(i, load_idx);
-			else
-				load = target_load(i, load_idx);
+			load = weighted_cpuload(i);
 
 			avg_load += load;
 		}
@@ -3227,20 +3218,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		avg_load = (avg_load * SCHED_POWER_SCALE);
 		do_div(avg_load, group->sgp->power);
 
-		if (local_group) {
-			this_load = avg_load;
-			this_group = group;
-		} else if (avg_load < min_load) {
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (this_group && idlest!= this_group) {
-		/* Bias towards our group again */
-		if (!idlest || 100*this_load < imbalance*min_load)
-			return this_group;
-	}
 	return idlest;
 }
 
@@ -3248,7 +3231,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  * find_idlest_cpu - find the idlest cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
 {
 	u64 load, min_load = ~0ULL;
 	int idlest = -1;
@@ -3258,7 +3241,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		load = weighted_cpuload(i);
 
-		if (load < min_load || (load == min_load && i == this_cpu)) {
+		if (load < min_load) {
 			min_load = load;
 			idlest = i;
 		}
@@ -3325,6 +3308,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
+	struct sched_group *group;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
@@ -3367,8 +3351,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -3376,15 +3358,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}
 
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			goto unlock;
-		}
+		group = find_idlest_group(sd, p, cpu);
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = group_first_cpu(group);
 
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
@@ -3398,6 +3374,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 		/* while loop will break here if sd == NULL */
 	}
+	new_cpu = find_idlest_cpu(group, p);
 unlock:
 	rcu_read_unlock();
 
@@ -4280,15 +4257,15 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	if (sgs->sum_nr_running + 1 >  sgs->group_capacity)
 		return;
 	if (sgs->group_util > sds->leader_util ||
-		sgs->group_util == sds->leader_util &&
-		group_first_cpu(group) < group_first_cpu(sds->group_leader)) {
+		(sgs->group_util == sds->leader_util &&
+		group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
 		sds->group_leader = group;
 		sds->leader_util = sgs->group_util;
 	}
 	/* Calculate the group which is almost idle */
 	if (sgs->group_util < sds->min_util ||
-		sgs->group_util == sds->min_util &&
-		group_first_cpu(group) > group_first_cpu(sds->group_leader)) {
+		(sgs->group_util == sds->min_util &&
+		group_first_cpu(group) > group_first_cpu(sds->group_leader))) {
 		sds->group_min = group;
 		sds->min_util = sgs->group_util;
 		sds->min_load_per_task = sgs->sum_weighted_load;

> 
> 
Regards
Preeti U Murthy

