From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751764AbbJLLr2 (ORCPT ); Mon, 12 Oct 2015 07:47:28 -0400 Received: from bombadil.infradead.org ([198.137.202.9]:50413 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751337AbbJLLr1 (ORCPT ); Mon, 12 Oct 2015 07:47:27 -0400 Date: Mon, 12 Oct 2015 13:47:23 +0200 From: Peter Zijlstra To: Yuyang Du Cc: Mike Galbraith , linux-kernel@vger.kernel.org Subject: Re: 4.3 group scheduling regression Message-ID: <20151012114723.GL3816@twins.programming.kicks-ass.net> References: <1444483369.2804.9.camel@gmail.com> <20151010170142.GI3816@twins.programming.kicks-ass.net> <1444530318.3363.40.camel@gmail.com> <1444585321.4169.18.camel@gmail.com> <20151012072344.GM3604@twins.programming.kicks-ass.net> <1444635897.3425.19.camel@gmail.com> <20151012080407.GJ3816@twins.programming.kicks-ass.net> <20151012005351.GJ11102@intel.com> <20151012091206.GK3816@twins.programming.kicks-ass.net> <20151012021230.GK11102@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151012021230.GK11102@intel.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 12, 2015 at 10:12:31AM +0800, Yuyang Du wrote: > On Mon, Oct 12, 2015 at 11:12:06AM +0200, Peter Zijlstra wrote: > > So in the old code we had 'magic' to deal with the case where a cgroup > > was consuming less than 1 cpu's worth of runtime. For example, a single > > task running in the group. > > > > In that scenario it might be possible that the group entity weight: > > > > se->weight = (tg->shares * cfs_rq->weight) / tg->weight; > > > > Strongly deviates from the tg->shares; you want the single task reflect > > the full group shares to the next level; due to the whole distributed > > approximation stuff. > > Yeah, I thought so. > > > I see you've deleted all that code; see the former > > __update_group_entity_contrib(). > > Probably not there, it actually was an icky way to adjust things. Yeah, no argument there. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 4df37a4..b184da0 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2370,7 +2370,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq) > */ > tg_weight = atomic_long_read(&tg->load_avg); > tg_weight -= cfs_rq->tg_load_avg_contrib; > - tg_weight += cfs_rq_load_avg(cfs_rq); > + tg_weight += cfs_rq->load.weight; > > return tg_weight; > } > @@ -2380,7 +2380,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg) > long tg_weight, load, shares; > > tg_weight = calc_tg_weight(tg, cfs_rq); > - load = cfs_rq_load_avg(cfs_rq); > + load = cfs_rq->load.weight; > > shares = (tg->shares * load); > if (tg_weight) Aah, yes very much so. I completely overlooked that :-( When calculating shares we very much want the current load, not the load average. Also, should we do the below? At this point se->on_rq is still 0 so reweight_entity() will not update (dequeue/enqueue) the accounting, but we'll have just accounted the 'old' load.weight. Doing it this way around we'll first update the weight and then account it, which seems more accurate. --- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 700eb548315f..d2efef565aed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3009,8 +3009,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) */ update_curr(cfs_rq); enqueue_entity_load_avg(cfs_rq, se); - account_entity_enqueue(cfs_rq, se); update_cfs_shares(cfs_rq); + account_entity_enqueue(cfs_rq, se); if (flags & ENQUEUE_WAKEUP) { place_entity(cfs_rq, se, 0);