From: Benjamin Segall
To: Chengming Zhou
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, mgorman@suse.de, bristot@redhat.com,
	linux-kernel@vger.kernel.org, duanxiongchun@bytedance.com,
	songmuchun@bytedance.com
Subject: Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
References: <20220328110751.39987-1-zhouchengming@bytedance.com>
Date: Mon, 28 Mar 2022 12:05:20 -0700
In-Reply-To: <20220328110751.39987-1-zhouchengming@bytedance.com>
	(Chengming Zhou's message of "Mon, 28 Mar 2022 19:07:51 +0800")

Chengming Zhou writes:

> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
> when only one CFS task left on rq.
>
> scheduler_tick()
>   task_tick_fair()
>     entity_tick()
>       update_curr()
>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>
> So that running task can't account its runtime periodically, but
> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
> periodically. Later in one period, the task would account very
> big delta_exec, which cause the cfs_rq to be throttled for a
> long time.
>
> There are two solutions for the problem, the first is that we
> can check in sched_can_stop_tick() if current task's cfs_rq
> have runtime_enabled, in which case we don't stop tick. But
> it will make nohz_full almost useless in cloud environment
> that every container has the cpu bandwidth control setting.
>
> The other is what this patch implemented, cfs_bandwidth hrtimer
> would sync unaccounted runtime from all running cfs_rqs with
> tick stopped, just before __refill_cfs_bandwidth_runtime().
> Also do the same thing in tg_set_cfs_bandwidth().
>
> A testcase to reproduce:
> ```
> cd /sys/fs/cgroup
> echo "+cpu" > cgroup.subtree_control
>
> mkdir test
> echo "105000 100000" > test/cpu.max

The other half of this problem, I think, is that it also won't throttle
when it /is/ out of bandwidth. That is, 'echo "50000 100000" >
test/cpu.max' would not stop the task after 50ms of runtime is spent;
it would only stop (with this patch) after 100ms. It would then
correctly need to repay that debt, so the (very) long-run ratio should
be correct, but a misbehaving task could use practically arbitrarily
more than its supposed quota over the ~100s-of-milliseconds timescales
cfsb is generally supposed to work on.

However, I don't really see a way for cfsb's current design to avoid
that on nohz_full (without vetoing sched_can_stop_tick). It might be
possible to try to guess when quota will run out by tracking the number
of running or runnable threads, but that would be a nontrivial
undertaking and involve a bunch of hrtimer reprogramming (particularly
in cases when some cgroups actually have tasks that wake/sleep
frequently).
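(For concreteness, the veto alternative the changelog mentions would
presumably boil down to a check along these lines; the helper name is
made up and this is only an untested sketch, not something I'm
proposing:)

	/*
	 * Untested sketch only (not from the patch): the "don't stop the
	 * tick" alternative from the changelog. The helper name is made up;
	 * runtime_enabled, cfs_rq_of() and for_each_sched_entity() are the
	 * fair.c ones, so this would have to live in fair.c under
	 * CONFIG_CFS_BANDWIDTH.
	 */
	static bool task_cfs_bandwidth_active(struct task_struct *p)
	{
		struct sched_entity *se = &p->se;

		if (p->sched_class != &fair_sched_class)
			return false;

		/* Any level of the hierarchy with a quota needs the tick. */
		for_each_sched_entity(se) {
			if (cfs_rq_of(se)->runtime_enabled)
				return true;
		}

		return false;
	}

sched_can_stop_tick() would then refuse to stop the tick whenever this
returns true for rq->curr, which is exactly why the changelog rejects
it: with per-container cpu.max set everywhere it would keep the tick on
basically everywhere.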
>
> echo $$ > test/cgroup.procs
> taskset -c 1 bash -c "while true; do let i++; done"
> ```
> Ctrl-C and cat test/cpu.stat to see if nr_throttled > 0.
>
> The above testcase uses period 100ms and quota 105ms, would
> only see nr_throttled > 0 on nohz_full system. The problem
> is gone in test with this patch.
>
> Signed-off-by: Chengming Zhou
> ---
>  kernel/sched/core.c  |  4 ++++
>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 37 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d575b4914925..17b5e3d27401 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	 */
>  	if (runtime_enabled && !runtime_was_enabled)
>  		cfs_bandwidth_usage_inc();
> +
> +	if (runtime_was_enabled)
> +		sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irq(&cfs_b->lock);
>  	cfs_b->period = ns_to_ktime(period);
>  	cfs_b->quota = quota;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee0664c9d291..ebda70a0e3a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>
> +#ifdef CONFIG_NO_HZ_FULL
> +void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> +{
> +	unsigned int cpu;
> +	struct rq *rq;
> +	struct rq_flags rf;
> +	struct cfs_rq *cfs_rq;
> +	struct task_group *tg;
> +
> +	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
> +
> +	for_each_online_cpu(cpu) {
> +		if (!tick_nohz_tick_stopped_cpu(cpu))
> +			continue;
> +
> +		rq = cpu_rq(cpu);
> +		cfs_rq = tg->cfs_rq[cpu];
> +
> +		rq_lock_irqsave(rq, &rf);

A for-every-nohz-cpu rq lock is definitely not great when the cgroup is
likely to be running on far fewer cpus. I'm not sure what the current
opinion would be on doing some sort of optimistic locking like

	if (!READ_ONCE(cfs_rq->curr))
		continue;
	rq_lock_irqsave(...);
	if (cfs_rq->curr) {
		...
	}

(a fuller sketch of what I mean is at the end of this mail), but
solutions to the other problem I mentioned earlier might provide help
here anyway.

> +		if (cfs_rq->curr) {
> +			update_rq_clock(rq);
> +			update_curr(cfs_rq);
> +		}
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +}
> +#endif
> +
>  extern const u64 max_cfs_quota_period;
>
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> @@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  	int idle = 0;
>  	int count = 0;
>
> +	sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 58263f90c559..57f9da9c50c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>  	else
>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>  }
> +
> +extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  #else
>  static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
> +static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
>  #endif
>
>  static inline void add_nr_running(struct rq *rq, unsigned count)
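
To spell out the optimistic check I mentioned above at the
rq_lock_irqsave() line: the function might end up looking roughly like
the below. This is only an untested sketch of that suggestion, reusing
the patch's own update path; whether the unlocked read of cfs_rq->curr
is actually acceptable here is exactly the open question.

	void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
	{
		unsigned int cpu;
		struct rq *rq;
		struct rq_flags rf;
		struct cfs_rq *cfs_rq;
		struct task_group *tg;

		tg = container_of(cfs_b, struct task_group, cfs_bandwidth);

		for_each_online_cpu(cpu) {
			if (!tick_nohz_tick_stopped_cpu(cpu))
				continue;

			rq = cpu_rq(cpu);
			cfs_rq = tg->cfs_rq[cpu];

			/* Cheap unlocked check: skip cpus this group isn't running on. */
			if (!READ_ONCE(cfs_rq->curr))
				continue;

			rq_lock_irqsave(rq, &rf);
			/* Re-check under the rq lock; curr may have changed meanwhile. */
			if (cfs_rq->curr) {
				update_rq_clock(rq);
				update_curr(cfs_rq);
			}
			rq_unlock_irqrestore(rq, &rf);
		}
	}

That would at least avoid taking every nohz cpu's rq lock for cgroups
that only run on a few cpus, though it does nothing for the
throttling-latency half of the problem.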