public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Rabin Vincent <rabin.vincent@axis.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Paul Turner <pjt@google.com>, Ben Segall <bsegall@google.com>,
	yuyang.du@intel.com, Morten Rasmussen <morten.rasmussen@arm.com>
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Date: Wed, 1 Jul 2015 22:44:04 +0200	[thread overview]
Message-ID: <20150701204404.GH25159@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20150701145551.GA15690@axis.com>

On Wed, Jul 01, 2015 at 04:55:51PM +0200, Rabin Vincent wrote:
>  PID: 413    TASK: 8edda408  CPU: 1   COMMAND: "rngd"
>   task_h_load():     0 [ = (load_avg_contrib {    0} * cfs_rq->h_load {    0}) / (cfs_rq->runnable_load_avg {    0} + 1) ]
>   SE: 8edda450 load_avg_contrib:     0 load.weight:  1024 PARENT: 8fffbd00 GROUPNAME: (null)
>   SE: 8fffbd00 load_avg_contrib:     0 load.weight:     2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
>   SE: 8f531f80 load_avg_contrib:     0 load.weight:  1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
>   SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT: 00000000 GROUPNAME: system.slice

So there are two problems there... the first we can (and should) fix;
the second I'm not sure there's anything we can do about.

Firstly, a group (parent) load_avg_contrib should never be less than
that of its constituent parts; therefore the top 3 SEs in the dump
should have at least 118 too.

Now it's been a while since I looked at the per-entity load tracking
stuff, so some of the details have left me, but it looks like while we
add the se->avg.load_avg_contrib to its cfs_rq->runnable_load_avg, we
do not propagate that into the corresponding (group) se.

This means the se->avg.load_avg_contrib is accounted per cpu without
migration benefits. So if our task just got migrated onto a cpu that
hasn't run the group in a while, the group will not have accumulated
runtime there.

A quick fix would be something like the below, although I think we want
to do something else, like maybe propagate the load_avg_contrib up the
hierarchy. But I need to think more about that.

The second problem is that your second SE has a weight of 2; that'll
give task_h_load() a factor of 1/512, which will flatten pretty much
anything down to almost nothing. This comes from the configuration, so
there's really nothing we can or should do about it.

Untested, uncompiled hackery following, mostly for discussion.

---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d57cc0ca0a6..95d0ba249c8b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,7 +6082,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 	unsigned long now = jiffies;
-	unsigned long load;
+	unsigned long load, load_avg_contrib = 0;
 
 	if (cfs_rq->last_h_load_update == now)
 		return;
@@ -6090,6 +6090,8 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 	cfs_rq->h_load_next = NULL;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_load_avg_contrib = load_avg_contrib =
+			max(load_avg_contrib, se->avg.load_avg_contrib);
 		cfs_rq->h_load_next = se;
 		if (cfs_rq->last_h_load_update == now)
 			break;
@@ -6102,7 +6104,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 	while ((se = cfs_rq->h_load_next) != NULL) {
 		load = cfs_rq->h_load;
-		load = div64_ul(load * se->avg.load_avg_contrib,
+		load = div64_ul(load * cfs_rq->h_load_avg_contrib,
 				cfs_rq->runnable_load_avg + 1);
 		cfs_rq = group_cfs_rq(se);
 		cfs_rq->h_load = load;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 885889190a1f..7738e3b301b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -394,6 +394,7 @@ struct cfs_rq {
 	 * this group.
 	 */
 	unsigned long h_load;
+	unsigned long h_load_avg_contrib;
 	u64 last_h_load_update;
 	struct sched_entity *h_load_next;
 #endif /* CONFIG_FAIR_GROUP_SCHED */


Thread overview: 31+ messages
2015-06-30 14:30 [PATCH?] Livelock in pick_next_task_fair() / idle_balance() Rabin Vincent
2015-07-01  5:36 ` Mike Galbraith
2015-07-01 14:55   ` Rabin Vincent
2015-07-01 15:47     ` Mike Galbraith
2015-07-01 20:44     ` Peter Zijlstra [this message]
2015-07-01 23:25       ` Yuyang Du
2015-07-02  8:05         ` Mike Galbraith
2015-07-02  1:05           ` Yuyang Du
2015-07-02 10:25             ` Mike Galbraith
2015-07-02 11:40             ` Morten Rasmussen
2015-07-02 19:37               ` Yuyang Du
2015-07-03  9:34                 ` Morten Rasmussen
2015-07-03 16:38                   ` Peter Zijlstra
2015-07-05 22:31                     ` Yuyang Du
2015-07-09 14:32                       ` Morten Rasmussen
2015-07-09 23:24                         ` Yuyang Du
2015-07-05 20:12                   ` Yuyang Du
2015-07-06 17:36                     ` Dietmar Eggemann
2015-07-07 11:17                       ` Rabin Vincent
2015-07-13 17:43                         ` Dietmar Eggemann
2015-07-09 13:53                     ` Morten Rasmussen
2015-07-09 22:34                       ` Yuyang Du
2015-07-02 10:53         ` Peter Zijlstra
2015-07-02 11:44           ` Morten Rasmussen
2015-07-02 18:42             ` Yuyang Du
2015-07-03  4:42               ` Mike Galbraith
2015-07-03 16:39         ` Peter Zijlstra
2015-07-05 22:11           ` Yuyang Du
2015-07-09  6:15             ` Stefan Ekenberg
2015-07-26 18:57             ` Yuyang Du
2015-08-03 17:05             ` [tip:sched/core] sched/fair: Avoid pulling all tasks in idle balancing tip-bot for Yuyang Du
