From: Matteo Martelli <matteo.martelli@codethink.co.uk>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Aaron Lu <ziqianlu@bytedance.com>,
linux-kernel@vger.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Matteo Martelli <matteo.martelli@codethink.co.uk>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Date: Fri, 26 Sep 2025 16:48:57 +0200
Message-ID: <83bb46158288dfb314fdf07918b074ae@codethink.co.uk>
In-Reply-To: <20250926081918.30488-1-kprateek.nayak@amd.com>
Hi Prateek,
On Fri, 26 Sep 2025 08:19:17 +0000, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
>           root(#)
>          /    \
>       A(#)   B(*)
>               |
>               C <--- new cgroup
>               |
>               D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
>   1. online_fair_sched_group() for cgroup D will call
>      propagate_entity_cfs_rq(). (Same can happen if a throttled task is
>      moved to cgroup C and enqueue_task_fair() returns early.)
>
>      propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
>      "rq->tmp_alone_branch" since its PELT clock is not marked throttled
>      and cfs_rq of cgroup B is not on the list.
>
>      cfs_rq of cgroup B is skipped since its PELT is throttled.
>
>      root cfs_rq already exists on cfs_rq leading to
>      list_add_leaf_cfs_rq() returning early.
>
>      The cfs_rq of cgroup C is left dangling on the
>      "rq->tmp_alone_branch".
>
>   2. A new task wakes up on cgroup A. Since the whole hierarchy is already
>      on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
>      without any modifications to "rq->tmp_alone_branch".
>
>      The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
>      dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
>      !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on other subtree on the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Stress test included running sched-messaging in nested hierarchy with
> various quota set alongside a continuous loop of cgroup creation and
> deletion, as well as another loop of continuous movement of a busy loop
> between cgroups.
>
> No splats have been observed yet with this patch.
>
> Aaron, Matteo,
>
> I've not added any "Tested-by" tags since the final diff is slightly
> different from the diff shared previously. ...
I applied this patch on top of commit 45b7f780739a ("sched: Fix some
typos in include/linux/preempt.h") from the sched/core branch of the
tip tree, and tested it with exactly the same setup I described in my
previous email[1]. With the patch applied, I couldn't reproduce the
warning in 5 hours of testing, whereas before the patch the issue was
systematically reproducible, with the warning triggering at least once
per minute.
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> ...
[1]: https://lore.kernel.org/all/e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk/
Thanks to you and Aaron for addressing this!
Best regards,
Matteo Martelli