All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matteo Martelli <matteo.martelli@codethink.co.uk>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Aaron Lu <ziqianlu@bytedance.com>,
	linux-kernel@vger.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Matteo Martelli <matteo.martelli@codethink.co.uk>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Date: Fri, 26 Sep 2025 16:48:57 +0200	[thread overview]
Message-ID: <83bb46158288dfb314fdf07918b074ae@codethink.co.uk> (raw)
In-Reply-To: <20250926081918.30488-1-kprateek.nayak@amd.com>

Hi Prateek,

On Fri, 26 Sep 2025 08:19:17 +0000, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
> 
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
> 
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
> 
>     root(#)
>    /    \
>   A(#)   B(*)
>          |
>          C <--- new cgroup
>          |
>          D <--- new cgroup
> 
>   # - Already on leaf cfs_rq list
>   * - Throttled with PELT frozen
> 
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
> 
> 1. online_fair_sched_group() for cgroup D will call
>    propagate_entity_cfs_rq(). (Same can happen if a throttled task is
>    moved to cgroup C and enqueue_task_fair() returns early.)
> 
>    propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
>    "rq->tmp_alone_branch" since its PELT clock is not marked throttled
>    and cfs_rq of cgroup B is not on the list.
> 
>    cfs_rq of cgroup B is skipped since its PELT is throttled.
> 
>    root cfs_rq already exists on cfs_rq leading to
>    list_add_leaf_cfs_rq() returning early.
> 
>    The cfs_rq of cgroup C is left dangling on the
>    "rq->tmp_alone_branch".
> 
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
>    on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
>    without any modifications to "rq->tmp_alone_branch".
> 
>    The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
>    dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
> 
>    !!! Splat !!!
> 
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on other subtree on the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
> 
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
> 
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
> 
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Stress test included running sched-messaging in nested hierarchy with
> various quota set alongside a continuous loop of cgroup creation and
> deletion, as well as another loop of continuous movement of a busy loop
> between cgroups.
> 
> No splats have been observed yet with this patch.
> 
> Aaron, Matteo,
> 
> I've not added any "Tested-by" tags since the final diff is slightly
> different from the diff shared previously. ...

I applied this patch on top of commit 45b7f780739a ("sched: Fix some
typos in include/linux/preempt.h") from sched/core branch of tip tree,
and tested it with exactly the same setup I described in my previous
email[1]. With the patch applied, I couldn't reproduce the warning in 5
hours of testing, while before the patch the issue was systematically
reprodicible and the warning was being triggered at least once per
minute.

Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>

> ...

[1]: https://lore.kernel.org/all/e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk/

Thanks to you and Aaron for addressing this!

Best regards,
Matteo Martelli

  parent reply	other threads:[~2025-09-26 14:49 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-10  9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
2025-09-10  9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 12:36   ` Chengming Zhou
2025-09-16 11:43   ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-23 13:05   ` [PATCH 1/4] " Matteo Martelli
2025-09-24 11:33     ` Aaron Lu
2025-09-25  8:17       ` K Prateek Nayak
2025-09-25  9:29         ` Aaron Lu
2025-09-25 11:22           ` K Prateek Nayak
2025-09-25 12:05             ` Aaron Lu
2025-09-25 13:33               ` Matteo Martelli
2025-09-26  4:32                 ` K Prateek Nayak
2025-09-26  5:53                   ` Aaron Lu
2025-09-26  8:19                 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
2025-09-26  9:38                   ` Aaron Lu
2025-09-26 10:11                     ` K Prateek Nayak
2025-10-01 20:37                     ` Benjamin Segall
2025-09-26 14:48                   ` Matteo Martelli [this message]
2025-10-21  5:35                 ` [PATCH v2] " K Prateek Nayak
2025-10-21 10:10                   ` Peter Zijlstra
2025-10-22 13:28                   ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak
2025-09-29  7:51     ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10  9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
2025-09-16 11:43   ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10  9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
2025-09-16 11:43   ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10  9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
2025-09-11  2:03   ` kernel test robot
2025-09-12  3:44     ` [PATCH update " Aaron Lu
2025-09-12  3:56       ` K Prateek Nayak
2025-09-16 11:43       ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
2025-09-11 12:16   ` Aaron Lu
2025-09-15 21:54 ` Benjamin Segall
2025-09-19 14:37 ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83bb46158288dfb314fdf07918b074ae@codethink.co.uk \
    --to=matteo.martelli@codethink.co.uk \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.