From: Benjamin Segall <bsegall@google.com>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: "Valentin Schneider" <vschneid@redhat.com>,
"K Prateek Nayak" <kprateek.nayak@amd.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Hao Jia" <jiahao.kernel@gmail.com>,
"Chengming Zhou" <chengming.zhou@linux.dev>,
"Josh Don" <joshdon@google.com>, "Ingo Molnar" <mingo@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Xi Wang" <xii@google.com>,
linux-kernel@vger.kernel.org,
"Juri Lelli" <juri.lelli@redhat.com>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Mel Gorman" <mgorman@suse.de>,
"Chuyi Zhou" <zhouchuyi@bytedance.com>,
"Jan Kiszka" <jan.kiszka@siemens.com>,
"Florian Bezdeka" <florian.bezdeka@siemens.com>,
"Songtang Liu" <liusongtang@bytedance.com>,
"Chen Yu" <yu.c.chen@intel.com>,
"Matteo Martelli" <matteo.martelli@codethink.co.uk>,
"Michal Koutný" <mkoutny@suse.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Date: Mon, 27 Oct 2025 15:33:19 -0700 [thread overview]
Message-ID: <xm26ms5cug9c.fsf@google.com> (raw)
In-Reply-To: <20251023085604.244-1-ziqianlu@bytedance.com> (Aaron Lu's message of "Thu, 23 Oct 2025 16:56:04 +0800")
Aaron Lu <ziqianlu@bytedance.com> writes:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
>
> When running a test with the following hierarchy:
>
> root
> / \
> A* ...
> / | \ ...
> B
> / \
> C*
>
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
>
> Debug showed it happened like this:
> Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are directly placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
>
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
>
> I think it's a chaos to trigger throttle on unthrottle path, the status
> of a being unthrottled cfs_rq can be in a mixed state at the end, so fix
> this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth() immediately
> after enabling bandwidth and setting runtime_remaining = 0. This ensures
> cfs_rq_c is throttled upfront and cannot enter tg_unthrottle_up() with
> zero runtime_remaining.
>
> Also, update outdated comments in tg_throttle_down() since
> unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
>
> While at it, remove a redundant assignment to se in tg_throttle_down().
>
> Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle model")
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> v2: add update_rq_clock() before throttle_cfs_rq() as reported by Hao
> Jia, or a warn on outdated rq clock is trigged in tg_throttle_down().
> This can happen when user specified a tiny quota.
>
> Note that Hao Jia also proposed another solution by using a special flag
> when doing enqueue_task_fair() in unthrottle path to avoid doing
> check_enqueue_throttle() [0]. I think that approach is fine too and it
> also has the benefit of not needing to worry about any other potential
> cases where a cfs_rq is unthrottled with <=0 runtime_remaining. Thoughts
> on which approach to go is welcome, thanks.
> [0]: https://lore.kernel.org/lkml/c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com/
>
> kernel/sched/core.c | 11 ++++++++++-
> kernel/sched/fair.c | 16 +++++++---------
> kernel/sched/sched.h | 1 +
> 3 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f1ebf67b48e21..58185ec5b8efd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9608,7 +9608,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
> cfs_rq->runtime_enabled = runtime_enabled;
> cfs_rq->runtime_remaining = 0;
>
> - if (cfs_rq->throttled)
> + /*
> + * Throttle cfs_rq now or it can be unthrottled with zero
> + * runtime_remaining and gets throttled on its unthrottle path.
> + */
> + if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
> + update_rq_clock(rq);
> + throttle_cfs_rq(cfs_rq);
> + }
> +
> + if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
> unthrottle_cfs_rq(cfs_rq);
> }
>
So if this is the only case it can come up, and it only occurs becasue
we set runtime_remaining = 0 and check in unthrottle with <= 0, then I
think we should just set runtime_remaining = 1 here.
That seems simpler than either throttling immediately (despite likely
having plenty of cfs_b->runtime) or adding an enqueue flag. Adding NR_CPUs
nanoseconds worth of quota on configure seems fine.
unthrottle_cfs_rq/tg_unthrottle_up itself doesn't drop rq lock, so we
shouldn't be able to see cfs_rq->runtime_remaining being consumed during
it, even if it's running on a remote cpu so that threads in the cfs_rq
can be running. They should wind up stuck waiting for rq lock in order
to update runtime_remaining.
Is there anything you see missing from that approach? I think it doing =
0 in particular here is just an artifact, and while the extra check for
runtime_remaining in unthrottle isn't unreasonable, the conflict with
tg_set_cfs_bandwidth isn't a fundamental issue.
next prev parent reply other threads:[~2025-10-27 22:33 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-23 8:56 [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
2025-10-27 22:33 ` Benjamin Segall [this message]
2025-10-28 6:36 ` Aaron Lu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xm26ms5cug9c.fsf@google.com \
--to=bsegall@google.com \
--cc=bigeasy@linutronix.de \
--cc=chengming.zhou@linux.dev \
--cc=dietmar.eggemann@arm.com \
--cc=florian.bezdeka@siemens.com \
--cc=jan.kiszka@siemens.com \
--cc=jiahao.kernel@gmail.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=liusongtang@bytedance.com \
--cc=matteo.martelli@codethink.co.uk \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=mkoutny@suse.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=xii@google.com \
--cc=yu.c.chen@intel.com \
--cc=zhouchuyi@bytedance.com \
--cc=ziqianlu@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.