From: Aaron Lu <ziqianlu@bytedance.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: "Matteo Martelli" <matteo.martelli@codethink.co.uk>,
"Valentin Schneider" <vschneid@redhat.com>,
"Ben Segall" <bsegall@google.com>,
"Peter Zijlstra" <peterz@infradead.org>,
"Chengming Zhou" <chengming.zhou@linux.dev>,
"Josh Don" <joshdon@google.com>, "Ingo Molnar" <mingo@redhat.com>,
"Vincent Guittot" <vincent.guittot@linaro.org>,
"Xi Wang" <xii@google.com>,
linux-kernel@vger.kernel.org,
"Juri Lelli" <juri.lelli@redhat.com>,
"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Mel Gorman" <mgorman@suse.de>,
"Chuyi Zhou" <zhouchuyi@bytedance.com>,
"Jan Kiszka" <jan.kiszka@siemens.com>,
"Florian Bezdeka" <florian.bezdeka@siemens.com>,
"Songtang Liu" <liusongtang@bytedance.com>,
"Chen Yu" <yu.c.chen@intel.com>,
"Michal Koutný" <mkoutny@suse.com>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
Date: Thu, 25 Sep 2025 20:05:04 +0800
Message-ID: <20250925120504.GC120@bytedance>
In-Reply-To: <72706108-f1c3-4719-a65c-c7c5d76f9b1e@amd.com>
On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>
>
> On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > Hi Prateek,
> >
> > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron, Matteo,
> >>
> >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >>>
> >>> I stared at the code and haven't been able to figure out when
> >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> >>
> >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> >> couldn't trigger it :(
> >>
> >> But I'm wondering if all we are missing is:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f993de30e146..5f9e7b4df391 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >>
> >>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
> >>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> >> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> >>  }
> >>
> >>  /* conditionally throttle active cfs_rq's from put_prev_entity() */
> >> ---
> >>
> >> This is the only way we can currently have a break in
> >> cfs_rq_pelt_clock_throttled() hierarchy.
> >>
> >
> > Great finding! Yes, that is missed.
> >
> > According to this info, I'm able to trigger the assert in
> > enqueue_task_fair(). The stack is different from Matteo's: his stack is
> > from ttwu path while mine is from exit. Anyway, let me do more analysis
> > and get back to you:
> >
> > [ 67.041905] ------------[ cut here ]------------
> > [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> > [ 67.043227] Modules linked in:
> > [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> > [ 67.044694] Tainted: [W]=WARN
> > [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> > [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> > [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> > [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> > [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> > [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> > [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> > [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> > [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> > [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> > [ 67.054083] Call Trace:
> > [ 67.054334] <TASK>
> > [ 67.054546] enqueue_task+0x35/0xd0
> > [ 67.054885] sched_move_task+0x291/0x370
> > [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> > [ 67.055669] do_exit+0x204/0x4f0
> > [ 67.055984] ? lock_release+0x10a/0x170
> > [ 67.056356] do_group_exit+0x36/0xa0
> > [ 67.056714] __x64_sys_exit_group+0x18/0x20
> > [ 67.057121] x64_sys_call+0x14fa/0x1720
> > [ 67.057502] do_syscall_64+0x6a/0x2d0
> > [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Great! I'll try stressing this path too.

I now see other paths leading to enqueue_task_fair() as well, so I
think this is the same problem Matteo saw.

> P.S. Are you seeing this with sync_throttle() fix too?

Nope, your finding fixed it for me :)

I added some trace prints, but there are so many traces that the
critical ones keep getting lost.

Anyway, I think I've figured out how it happened: during
online_fair_sched_group() -> sync_throttle(), the newly onlined cfs_rq
doesn't get pelt_clock_throttled synced from its parent. Suppose the
parent's PELT clock is throttled; then in propagate_entity_cfs_rq(), the
newly onlined cfs_rq is added to the leaf list but its parent is not.
Now rq->tmp_alone_branch points to this newly onlined cfs_rq, waiting
for its parent to be added (which never happens).

Then another task wakes up and gets enqueued on the same CPU. All its
ancestor cfs_rqs are already on the list, so list_add_leaf_cfs_rq()
doesn't touch rq->tmp_alone_branch, and the assert fires at the end of
the enqueue.

I'm thinking we should add an assert_list_leaf_cfs_rq() at the end of
propagate_entity_cfs_rq() to catch other potential problems.

Hi Matteo,

Could you test the diff above that Prateek sent in his last email? Thanks.