public inbox for linux-kernel@vger.kernel.org
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Ingo Molnar <mingo@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Yuyang Du <yuyang.du@intel.com>
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
Date: Wed, 28 Sep 2016 06:13:09 -0700	[thread overview]
Message-ID: <20160928131309.GA5483@vingu-laptop> (raw)
In-Reply-To: <CAKfTPtCOEc58C3E_k30fSR4OVqjFJq6ncK0BHyqNzsogVf5tgw@mail.gmail.com>

On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
> On 28 September 2016 at 04:31, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> > On 28/09/16 12:19, Peter Zijlstra wrote:
> >> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
> >>> On 28/09/16 11:14, Peter Zijlstra wrote:
> >>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
> >
> > [...]
> >
> >>> Not sure what you mean by 'after fixing' but the se is initialized with
> >>> a possibly stale 'now' value in post_init_entity_util_avg()->
> >>> attach_entity_load_avg() before the clock is updated in
> >>> activate_task()->enqueue_task().
> >>
> >> I meant that after I fix the above issue of calling post_init with a
> >> stale clock. So the + update_rq_clock(rq) in the patch.
> >
> > OK.
> >
> > [...]
> >
> >>>> While staring at this, I don't think we can still hit
> >>>> vruntime_normalized() with a new task, so I _think_ we can remove that
> >>>> !se->sum_exec_runtime clause there (and rejoice), no?
> >>>
> >>> I'm afraid that with accurate timing we will get the same situation,
> >>> where we add and subtract the same amount of load (probably 1024 now
> >>> and not 1002 (or less)) to/from cfs_rq->runnable_load_avg for the
> >>> initial (fork) hackbench run.
> >>> After all, it's 'runnable' based.
> >>
> >> The idea was that since we now update rq clock before post_init and then
> >> leave it be, both post_init and enqueue see the exact same timestamp,
> >> and the delta is 0, resulting in no aging.
> >>
> >> Or did I fail to make that happen?
> >
> > No, but IMHO what Matt wants is ageing for the hackbench tasks at the end
> > of their fork phase, so that a tiny amount of cfs_rq->runnable_load_avg
> > is left on cpuX after the fork-related dequeue and the (load-based)
> > fork-balancer chooses cpuY for the next hackbench task.
> > That's why he wanted to avoid the __update_load_avg(se) on enqueue (thus
> > adding 1024 to cfs_rq->runnable_load_avg) and do the ageing only on
> > dequeue (removing <1024 from cfs_rq->runnable_load_avg).
> 
> Wanting cfs_rq->runnable_load_avg to be non-null when nothing is
> runnable on the cfs_rq seems a bit odd.
> We should rather take cfs_rq->avg.load_avg or cfs_rq->avg.util_avg into
> account in find_idlest_group() in this case.

IIUC the problem raised by Matt: he sees a regression because we now remove
at dequeue the exact same load that we added at enqueue, so
cfs_rq->runnable_load_avg is null and we may select a cfs_rq that already
has a lot of blocked hackbench threads.
The fact that runnable_load_avg is null when the cfs_rq doesn't have a
runnable task is quite correct and we should keep it. But when we look for
the idlest group, we have to take the blocked threads into account.

That's what I have tried to do below:


---
 kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b3c47..702915e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,7 +5353,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_load = 0;
+	unsigned long min_avg_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5361,7 +5362,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5375,6 +5376,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5383,21 +5385,35 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_load = runnable_load;
+		} else if (runnable_load < min_runnable_load) {
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load == min_runnable_load) && (avg_load < min_avg_load)) {
+			/*
+			 * In case we have the same runnable load (especially
+			 * a null runnable load), select the group with the
+			 * smallest blocked load.
+			 */
+			min_avg_load = avg_load;
+			min_runnable_load = runnable_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_load < imbalance*min_runnable_load)
 		return NULL;
 	return idlest;
 }


