From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751764AbbJLLr2 (ORCPT <rfc822;w@1wt.eu>);
	Mon, 12 Oct 2015 07:47:28 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:50413 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751337AbbJLLr1 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 12 Oct 2015 07:47:27 -0400
Date: Mon, 12 Oct 2015 13:47:23 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Yuyang Du <yuyang.du@intel.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>, linux-kernel@vger.kernel.org
Subject: Re: 4.3 group scheduling regression
Message-ID: <20151012114723.GL3816@twins.programming.kicks-ass.net>
References: <1444483369.2804.9.camel@gmail.com>
 <20151010170142.GI3816@twins.programming.kicks-ass.net>
 <1444530318.3363.40.camel@gmail.com>
 <1444585321.4169.18.camel@gmail.com>
 <20151012072344.GM3604@twins.programming.kicks-ass.net>
 <1444635897.3425.19.camel@gmail.com>
 <20151012080407.GJ3816@twins.programming.kicks-ass.net>
 <20151012005351.GJ11102@intel.com>
 <20151012091206.GK3816@twins.programming.kicks-ass.net>
 <20151012021230.GK11102@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151012021230.GK11102@intel.com>
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 12, 2015 at 10:12:31AM +0800, Yuyang Du wrote:
> On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote:

> > So in the old code we had 'magic' to deal with the case where a cgroup
> > was consuming less than 1 cpu's worth of runtime. For example, a single
> > task running in the group.
> > 
> > In that scenario it might be possible that the group entity weight:
> > 
> > 	se->weight = (tg->shares * cfs_rq->weight) / tg->weight;
> > 
> > Strongly deviates from the tg->shares; you want the single task reflect
> > the full group shares to the next level; due to the whole distributed
> > approximation stuff.
> 
> Yeah, I thought so.
>  
> > I see you've deleted all that code; see the former
> > __update_group_entity_contrib().
>  
> Probably not there, it actually was an icky way to adjust things.

Yeah, no argument there.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..b184da0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
>  	 */
>  	tg_weight = atomic_long_read(&tg->load_avg);
>  	tg_weight -= cfs_rq->tg_load_avg_contrib;
> -	tg_weight += cfs_rq_load_avg(cfs_rq);
> +	tg_weight += cfs_rq->load.weight;
>  
>  	return tg_weight;
>  }
> @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
>  	long tg_weight, load, shares;
>  
>  	tg_weight = calc_tg_weight(tg, cfs_rq);
> -	load = cfs_rq_load_avg(cfs_rq);
> +	load = cfs_rq->load.weight;
>  
>  	shares = (tg->shares * load);
>  	if (tg_weight)

Aah, yes very much so. I completely overlooked that :-(

When calculating shares we very much want the current load, not the load
average.

Also, should we do the below? At this point se->on_rq is still 0 so
reweight_entity() will not update (dequeue/enqueue) the accounting, but
we'll have just accounted the 'old' load.weight.

Doing it this way around we'll first update the weight and then account
it, which seems more accurate.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700eb548315f..d2efef565aed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	enqueue_entity_load_avg(cfs_rq, se);
-	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
+	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);