From: Aaron Lu <ziqianlu@bytedance.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Matteo Martelli <matteo.martelli@codethink.co.uk>,
linux-kernel@vger.kernel.org,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Date: Fri, 26 Sep 2025 17:38:01 +0800 [thread overview]
Message-ID: <20250926093801.GE120@bytedance> (raw)
In-Reply-To: <20250926081918.30488-1-kprateek.nayak@amd.com>
On Fri, Sep 26, 2025 at 08:19:17AM +0000, K Prateek Nayak wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
> root(#)
> / \
> A(#) B(*)
> |
> C <--- new cgroup
> |
> D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
> propagate_entity_cfs_rq(). (Same can happen if a throttled task is
> moved to cgroup C and enqueue_task_fair() returns early.)
>
> propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
> "rq->tmp_alone_branch" since its PELT clock is not marked throttled
> and cfs_rq of cgroup B is not on the list.
>
> cfs_rq of cgroup B is skipped since its PELT is throttled.
>
> root cfs_rq already exists on cfs_rq leading to
> list_add_leaf_cfs_rq() returning early.
>
> The cfs_rq of cgroup C is left dangling on the
> "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
> on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
> without any modifications to "rq->tmp_alone_branch".
>
> The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
> dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
> !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on other subtree on the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
Sigh...
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
Makes sense.
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Should be Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle
model")? "Task based throttle time accounting" doesn't touch pelt bits.
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Thanks for the fix.
BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
ancestor cfs_rq's pelt clock throttled status but only the first level
cfs_rq's, because the purpose is to have the first level cfs_rq to stay
on the leaf list and all those list_add_leaf_cfs_rq() for its ancestors
are just to make sure the list is fully connected. I mean something like
this:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75c615f5ed640..6a6d9200ab93c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);
/*
* If a task gets attached to this cfs_rq and before being queued,
@@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
* change, make sure this cfs_rq stays on leaf cfs_rq list to have
* that removed load decayed or it can cause faireness problem.
*/
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (add)
list_add_leaf_cfs_rq(cfs_rq);
/* Start to propagate at parent */
@@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (add)
list_add_leaf_cfs_rq(cfs_rq);
}
But this is a different thing and can be taken care of if necessary
later. Current logic doesn't have a problem, it's just not as clear as
the above diff to me.
next prev parent reply other threads:[~2025-09-26 9:38 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 12:36 ` Chengming Zhou
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
2025-09-24 11:33 ` Aaron Lu
2025-09-25 8:17 ` K Prateek Nayak
2025-09-25 9:29 ` Aaron Lu
2025-09-25 11:22 ` K Prateek Nayak
2025-09-25 12:05 ` Aaron Lu
2025-09-25 13:33 ` Matteo Martelli
2025-09-26 4:32 ` K Prateek Nayak
2025-09-26 5:53 ` Aaron Lu
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
2025-09-26 9:38 ` Aaron Lu [this message]
2025-09-26 10:11 ` K Prateek Nayak
2025-10-01 20:37 ` Benjamin Segall
2025-09-26 14:48 ` Matteo Martelli
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
2025-10-21 10:10 ` Peter Zijlstra
2025-10-22 13:28 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak
2025-09-29 7:51 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
2025-09-11 2:03 ` kernel test robot
2025-09-12 3:44 ` [PATCH update " Aaron Lu
2025-09-12 3:56 ` K Prateek Nayak
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
2025-09-11 12:16 ` Aaron Lu
2025-09-15 21:54 ` Benjamin Segall
2025-09-19 14:37 ` Valentin Schneider
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250926093801.GE120@bytedance \
--to=ziqianlu@bytedance.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=matteo.martelli@codethink.co.uk \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.