From: Benjamin Segall <bsegall@google.com>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: "K Prateek Nayak" <kprateek.nayak@amd.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Hao Jia" <jiahao.kernel@gmail.com>,
"Valentin Schneider" <vschneid@redhat.com>,
"Chengming Zhou" <chengming.zhou@linux.dev>,
"Josh Don" <joshdon@google.com>, "Ingo Molnar" <mingo@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Xi Wang" <xii@google.com>,
linux-kernel@vger.kernel.org,
"Juri Lelli" <juri.lelli@redhat.com>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Mel Gorman" <mgorman@suse.de>,
"Chuyi Zhou" <zhouchuyi@bytedance.com>,
"Jan Kiszka" <jan.kiszka@siemens.com>,
"Florian Bezdeka" <florian.bezdeka@siemens.com>,
"Songtang Liu" <liusongtang@bytedance.com>,
"Chen Yu" <yu.c.chen@intel.com>,
"Matteo Martelli" <matteo.martelli@codethink.co.uk>,
"Michal Koutný" <mkoutny@suse.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Date: Wed, 05 Nov 2025 13:37:28 -0800
Message-ID: <xm26346sjh47.fsf@google.com>
In-Reply-To: <20251030032755.560-1-ziqianlu@bytedance.com> (Aaron Lu's message of "Thu, 30 Oct 2025 11:27:55 +0800")

Aaron Lu <ziqianlu@bytedance.com> writes:

> When a cfs_rq is about to be throttled, its limbo list should be empty,
> which is why tg_throttle_down() warns on a non-empty
> cfs_rq->throttled_limbo_list.
>
> When running a test with the following hierarchy:
>
>           root
>         /      \
>        A*      ...
>      / | \     ...
>        B
>       /  \
>      C*
>
> where both A and C have quota settings, the warning on a non-empty limbo
> list is triggered for a cfs_rq of C. Call it cfs_rq_c (ignoring the cpu
> part of the cfs_rq for the sake of simpler representation).
>
> Debugging showed it happened like this:
> Task group C is created and its quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are directly placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
>
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
>
> Triggering a throttle on the unthrottle path is chaotic: the cfs_rq being
> unthrottled can end up in a mixed state. Fix this by granting 1ns of
> runtime to the cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c
> has a positive runtime_remaining when initialized as unthrottled and
> cannot enter tg_unthrottle_up() with zero runtime_remaining.
>
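
For readers following the sequence above, here is a rough user-space
sketch of it. This is not kernel code: struct toy_cfs_rq and the toy_*
helpers are hypothetical stand-ins that keep only the fields mentioned in
the changelog, and the sketch assumes no extra runtime can be obtained
from the bandwidth pool at enqueue time.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	bool runtime_enabled;
	bool throttled;
	long long runtime_remaining;	/* in ns */
	int limbo_tasks;		/* tasks still on the limbo list */
	int queued_tasks;		/* tasks already enqueued */
	bool warned;			/* set where the real code would warn */
};

/* Models the throttle path: a non-empty limbo list is the warned-on state. */
static void toy_throttle(struct toy_cfs_rq *cfs_rq)
{
	if (cfs_rq->limbo_tasks > 0)
		cfs_rq->warned = true;
	cfs_rq->throttled = true;
}

/* Models the enqueue-time check. Assumption: no runtime can be refilled. */
static void toy_check_enqueue_throttle(struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->runtime_enabled || cfs_rq->throttled)
		return;
	if (cfs_rq->runtime_remaining <= 0)
		toy_throttle(cfs_rq);
}

/* Models the unthrottle path moving limbo tasks back one at a time. */
static void toy_unthrottle_up(struct toy_cfs_rq *cfs_rq)
{
	while (cfs_rq->limbo_tasks > 0 && !cfs_rq->throttled) {
		cfs_rq->limbo_tasks--;
		cfs_rq->queued_tasks++;
		/* Only the first enqueue runs the throttle check. */
		if (cfs_rq->queued_tasks == 1)
			toy_check_enqueue_throttle(cfs_rq);
	}
}

/*
 * Returns true if unthrottling a freshly configured, unthrottled cfs_rq
 * with the given initial runtime and limbo tasks reaches the warned state.
 */
static bool toy_unthrottle_hits_warning(long long initial_runtime,
					int limbo_tasks)
{
	struct toy_cfs_rq cfs_rq = {
		.runtime_enabled = true,
		.throttled = false,
		.runtime_remaining = initial_runtime,
		.limbo_tasks = limbo_tasks,
	};

	toy_unthrottle_up(&cfs_rq);
	return cfs_rq.warned;
}
```

In this toy model, toy_unthrottle_hits_warning(0, 2) reaches the warned
state because the first enqueue throttles while one task is still in
limbo, whereas toy_unthrottle_hits_warning(1, 2) does not.
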
> Also, update outdated comments in unthrottle_cfs_rq() since it is no
> longer called with zero runtime_remaining. While at it, remove a
> redundant assignment to se in unthrottle_cfs_rq().
>
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Suggested-by: Benjamin Segall <bsegall@google.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
> v3: grant cfs_rq 1ns runtime on quota set as suggested by Ben, thanks!
>
> kernel/sched/core.c | 2 +-
> kernel/sched/fair.c | 15 ++++++---------
> 2 files changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f1ebf67b48e21..f754a60de8484 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
>  
>  		guard(rq_lock_irq)(rq);
>  		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		cfs_rq->runtime_remaining = 1;
>  
>  		if (cfs_rq->throttled)
>  			unthrottle_cfs_rq(cfs_rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 25970dbbb2795..5b752324270b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
>  
>  	/*
> -	 * It's possible we are called with !runtime_remaining due to things
> -	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
> -	 * unthrottled us with a positive runtime_remaining but other still
> -	 * running entities consumed those runtime before we reached here.
> +	 * It's possible we are called with runtime_remaining < 0 due to things
> +	 * like async unthrottled us with a positive runtime_remaining but other
> +	 * still running entities consumed those runtime before we reached here.
>  	 *
> -	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
> -	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
> -	 * throttle, which is not supposed to happen on unthrottle path.
> +	 * We can't unthrottle this cfs_rq without any runtime remaining because
> +	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
> +	 * which is not supposed to happen on unthrottle path.
>  	 */
>  	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>  		return;
>  
> -	se = cfs_rq->tg->se[cpu_of(rq)];
> -
>  	cfs_rq->throttled = 0;
>  
>  	update_rq_clock(rq);
Reviewed-By: Benjamin Segall <bsegall@google.com>