From: Aaron Lu <ziqianlu@bytedance.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: "Matteo Martelli" <matteo.martelli@codethink.co.uk>,
"Valentin Schneider" <vschneid@redhat.com>,
"Ben Segall" <bsegall@google.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Chengming Zhou" <chengming.zhou@linux.dev>,
"Josh Don" <joshdon@google.com>, "Ingo Molnar" <mingo@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Xi Wang" <xii@google.com>,
linux-kernel@vger.kernel.org,
"Juri Lelli" <juri.lelli@redhat.com>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Mel Gorman" <mgorman@suse.de>,
"Chuyi Zhou" <zhouchuyi@bytedance.com>,
"Jan Kiszka" <jan.kiszka@siemens.com>,
"Florian Bezdeka" <florian.bezdeka@siemens.com>,
"Songtang Liu" <liusongtang@bytedance.com>,
"Chen Yu" <yu.c.chen@intel.com>,
"Michal Koutný" <mkoutny@suse.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
Date: Thu, 25 Sep 2025 20:05:04 +0800
Message-ID: <20250925120504.GC120@bytedance>
In-Reply-To: <72706108-f1c3-4719-a65c-c7c5d76f9b1e@amd.com>
On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>
>
> On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > Hi Prateek,
> >
> > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron, Matteo,
> >>
> >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >>>
> >>> I stared at the code and haven't been able to figure out when
> >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> >>
> >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> >> couldn't trigger it :(
> >>
> >> But I'm wondering if all we are missing is:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f993de30e146..5f9e7b4df391 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >>
> >>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
> >>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> >> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> >>  }
> >>
> >> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> >> ---
> >>
> >> This is the only way we can currently have a break in
> >> cfs_rq_pelt_clock_throttled() hierarchy.
> >>
> >
> > Great finding! Yes, that is missed.
> >
> > According to this info, I'm able to trigger the assert in
> > enqueue_task_fair(). The stack is different from Matteo's: his stack is
> > from ttwu path while mine is from exit. Anyway, let me do more analysis
> > and get back to you:
> >
> > [ 67.041905] ------------[ cut here ]------------
> > [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> > [ 67.043227] Modules linked in:
> > [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> > [ 67.044694] Tainted: [W]=WARN
> > [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> > [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> > [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> > [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> > [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> > [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> > [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> > [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> > [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> > [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> > [ 67.054083] Call Trace:
> > [ 67.054334] <TASK>
> > [ 67.054546] enqueue_task+0x35/0xd0
> > [ 67.054885] sched_move_task+0x291/0x370
> > [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> > [ 67.055669] do_exit+0x204/0x4f0
> > [ 67.055984] ? lock_release+0x10a/0x170
> > [ 67.056356] do_group_exit+0x36/0xa0
> > [ 67.056714] __x64_sys_exit_group+0x18/0x20
> > [ 67.057121] x64_sys_call+0x14fa/0x1720
> > [ 67.057502] do_syscall_64+0x6a/0x2d0
> > [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Great! I'll try stressing this path too.

I now see other paths leading to enqueue_task_fair() as well, so I
think this is the same problem Matteo saw.

> P.S. Are you seeing this with sync_throttle() fix too?

Nope, your finding fixed it for me :)

I added some trace prints, but there are so many traces that the
critical ones keep getting lost.

Anyway, I think I've figured out how it happened: during
online_fair_sched_group() -> sync_throttle(), the newly onlined cfs_rq
didn't get pelt_clock_throttled synced from its parent. Suppose the
parent's pelt clock is throttled; then in propagate_entity_cfs_rq(), the
newly onlined cfs_rq is added to the leaf list but its parent is not.
Now rq->tmp_alone_branch points to the newly onlined cfs_rq, waiting for
its parent to be added (which never happens).

Then another task wakes up and gets enqueued on the same cpu. All its
ancestor cfs_rqs are already on the list, so list_add_leaf_cfs_rq()
doesn't touch rq->tmp_alone_branch, and the assert fires at the end of
the enqueue.

I'm thinking we should add an assert_list_leaf_cfs_rq() at the end of
propagate_entity_cfs_rq() to catch other potential problems.

Hi Matteo,

Can you test the diff above that Prateek sent in his last email? Thanks.