From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754628AbcFPRdY (ORCPT ); Thu, 16 Jun 2016 13:33:24 -0400 Received: from mail-pf0-f182.google.com ([209.85.192.182]:34527 "EHLO mail-pf0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754020AbcFPRdW (ORCPT ); Thu, 16 Jun 2016 13:33:22 -0400 From: bsegall@google.com To: Konstantin Khlebnikov , pjt@google.com Cc: Peter Zijlstra , Ingo Molnar , linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: Re: [PATCH] sched/fair: initialize throttle_count for new task-groups lazily References: <146608182119.21870.8439834428248129633.stgit@buzz> <5762E090.4000706@yandex-team.ru> Date: Thu, 16 Jun 2016 10:33:19 -0700 In-Reply-To: <5762E090.4000706@yandex-team.ru> (Konstantin Khlebnikov's message of "Thu, 16 Jun 2016 20:23:28 +0300") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Konstantin Khlebnikov writes: > On 16.06.2016 20:03, bsegall@google.com wrote: >> Konstantin Khlebnikov writes: >> >>> Cgroup created inside throttled group must inherit current throttle_count. >>> Broken throttle_count allows to nominate throttled entries as a next buddy, >>> later this leads to null pointer dereference in pick_next_task_fair(). >>> >>> This patch initialize cfs_rq->throttle_count at first enqueue: laziness >>> allows to skip locking all rq at group creation. Lazy approach also allows >>> to skip full sub-tree scan at throttling hierarchy (not in this patch). >>> >>> Signed-off-by: Konstantin Khlebnikov >>> Cc: Stable # v3.2+ >>> --- >>> kernel/sched/fair.c | 19 +++++++++++++++++++ >>> kernel/sched/sched.h | 2 +- >>> 2 files changed, 20 insertions(+), 1 deletion(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 218f8e83db73..fe809fe169d2 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq) >>> if (!cfs_bandwidth_used()) >>> return; >>> >>> + /* synchronize hierarchical throttle counter */ >>> + if (unlikely(!cfs_rq->throttle_uptodate)) { >>> + struct rq *rq = rq_of(cfs_rq); >>> + struct cfs_rq *pcfs_rq; >>> + struct task_group *tg; >>> + >>> + cfs_rq->throttle_uptodate = 1; >>> + /* get closest uptodate node because leaves goes first */ >>> + for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) { >>> + pcfs_rq = tg->cfs_rq[cpu_of(rq)]; >>> + if (pcfs_rq->throttle_uptodate) >>> + break; >>> + } >>> + if (tg) { >>> + cfs_rq->throttle_count = pcfs_rq->throttle_count; >>> + cfs_rq->throttled_clock_task = rq_clock_task(rq); >>> + } >>> + } >>> + >> >> Checking just in enqueue is not sufficient - throttled_lb_pair can check >> against a cfs_rq that has never been enqueued (and possibly other >> paths). > > Looks like this is minor problem: in worst case load-balancer will migrate > task into throttled hierarchy. And this could happens only once for each > newly created group. > >> >> It might also make sense to go ahead and initialize all the cfs_rqs we >> skipped over to avoid some n^2 pathological behavior. You could also use >> throttle_count == -1 or < 0. (We had our own version of this patch that >> I guess we forgot to push?) > > n^2 shouldn't be a problem while this happens only once for each > group. Yeah it's shouldn't be a problem, it's more a why not. > > throttle_count == -1 could be overwritten when parent throttles/unthrottles > before initialization. We could set it to INT_MIN/2 and check <0 but this > will hide possible bugs here. One more int in the same cacheline shouldn't > add noticeable overhead. Yeah, you would have to check in tg_throttle_up/tg_unthrottle_up as well to do this. > > I've also added this into our kernel to catch such problems without crash. > Probably it's worth to add into upstream because stale buddy is a real pain) > > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se) > if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) > return; > > - for_each_sched_entity(se) > + for_each_sched_entity(se) { > + if (WARN_ON_ONCE(!se->on_rq)) > + return; > cfs_rq_of(se)->last = se; > + } > } > > static void set_next_buddy(struct sched_entity *se) > @@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se) > if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE)) > return; > > - for_each_sched_entity(se) > + for_each_sched_entity(se) { > + if (WARN_ON_ONCE(!se->on_rq)) > + return; > cfs_rq_of(se)->next = se; > + } > } > > static void set_skip_buddy(struct sched_entity *se) > > >> >> >>> /* an active group must be handled by the update_curr()->put() path */ >>> if (!cfs_rq->runtime_enabled || cfs_rq->curr) >>> return; >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index 72f1f3087b04..7cbeb92a1cb9 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -437,7 +437,7 @@ struct cfs_rq { >>> >>> u64 throttled_clock, throttled_clock_task; >>> u64 throttled_clock_task_time; >>> - int throttled, throttle_count; >>> + int throttled, throttle_count, throttle_uptodate; >>> struct list_head throttled_list; >>> #endif /* CONFIG_CFS_BANDWIDTH */ >>> #endif /* CONFIG_FAIR_GROUP_SCHED */