* [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
@ 2026-04-22 10:09 Richard Cheng
2026-04-22 10:46 ` Andrea Righi
2026-04-22 10:51 ` Cheng-Yang Chou
0 siblings, 2 replies; 6+ messages in thread
From: Richard Cheng @ 2026-04-22 10:09 UTC (permalink / raw)
To: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid
Cc: sched-ext, linux-kernel, newtonl, kristinc, kaihengf, kobak,
Richard Cheng
When unregistering my self-written scx scheduler, the following panic
occurs [1].
The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.
The expected ordering during teardown is:

  bitmap_zero(sch->has_op) + synchronize_rcu()
    -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page
bpf_scx_unreg() is supposed to enforce this ordering, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, so
kthread_flush_work() can become a noop. The caller then drops the
struct_ops map too early, and the JIT page is poisoned with
AARCH64_BREAK_FAULT before disable_workfn ever executes.
The subsequent dequeue_task() therefore still sees SCX_HAS_OP(sch,
quiescent) as true and calls ops.quiescent(), which hits the poisoned
page and triggers a BRK panic.
Fix it by syncing disable_irq_work first, so disable_work is guaranteed
to be queued before waiting for it.
Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
[1]:
[ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
Best regards,
Richard Cheng.
---
kernel/sched/ext.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 012ca8bd70fb..065660382a0c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
scx_disable(sch, SCX_EXIT_UNREG);
+ /*
+ * sch->disable_work might still not queued, causing kthread_flush_work()
+ * as a noop. Syncing the irq_work first is required to guarantee the
+ * kthread work has been queued before waiting for it.
+ */
+ irq_work_sync(&sch->disable_irq_work);
kthread_flush_work(&sch->disable_work);
RCU_INIT_POINTER(ops->priv, NULL);
kobject_put(&sch->kobj);
--
2.43.0
^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
@ 2026-04-22 10:46 ` Andrea Righi
  2026-04-22 10:51 ` Cheng-Yang Chou
  1 sibling, 0 replies; 6+ messages in thread

From: Andrea Righi @ 2026-04-22 10:46 UTC (permalink / raw)
  To: Richard Cheng
  Cc: mingo, peterz, tj, void, changwoo, juri.lelli, vincent.guittot,
      dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, sched-ext,
      linux-kernel, newtonl, kristinc, kaihengf, kobak

Hi Richard,

On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> When unregistered my self-written scx scheduler, the following panic
> occurs [1].
>
> The root cause is that the JIT page backing ops->quiescent() is freed
> before all callers of that function have stopped.
>
> The expected ordering during teardown is:
> bitmap_zero(sch->has_op) + synchronize_rcu()
> -> guarantees no CPU will ever call sch->ops.* again
> -> only THEN free the BPF struct_ops JIT page
>
> bpf_scx_unreg() is supposed to enforce the order, but after
> commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> irq_work"), disable_work is no longer queued directly, causing
> kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> map too early and poisoned with AARCH64_BREAK_FAULT before
> disable_workfn ever execute.
>
> So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> as true and calls ops.quiescent, which hit on the poisoned page and BRK
> panic.
>
> Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> to be queued before waiting for it.
>
> Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> Signed-off-by: Richard Cheng <icheng@nvidia.com>
> ---
> [1]:
> [ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
> [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
> [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
> [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
> [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
> [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
> [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
> [ 230.116843] pc : 0xffff80009bc2c1f8
> [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
> [ 230.217749] Call trace:
> [ 230.228515] 0xffff80009bc2c1f8 (P)
> [ 230.232077] dequeue_task+0x84/0x188
> [ 230.235728] sched_change_begin+0x1dc/0x250
> [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
> [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
> [ 230.249701] ___migrate_enable+0x4c/0xa0
> [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
> [ 230.258246] process_one_work+0x184/0x540
> [ 230.262342] worker_thread+0x19c/0x348
> [ 230.266170] kthread+0x13c/0x150
> [ 230.269465] ret_from_fork+0x10/0x20
> [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> [ 230.287621] ---[ end trace 0000000000000000 ]---
> [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
>
> Best regards,
> Richard Cheng.
> ---
>  kernel/sched/ext.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..065660382a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
>  	scx_disable(sch, SCX_EXIT_UNREG);
> +	/*
> +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> +	 * as a noop. Syncing the irq_work first is required to guarantee the

nit, maybe rephrase:

  sch->disable_work might not have been queued yet, causing
  kthread_flush_work() to be a no-op.

> +	 * kthread work has been queued before waiting for it.
> +	 */
> +	irq_work_sync(&sch->disable_irq_work);
>  	kthread_flush_work(&sch->disable_work);
>  	RCU_INIT_POINTER(ops->priv, NULL);
>  	kobject_put(&sch->kobj);
> --
> 2.43.0
>

I can't reproduce it locally, but from a logical perspective it makes
sense. Nice catch!

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
  2026-04-22 10:46 ` Andrea Righi
@ 2026-04-22 10:51 ` Cheng-Yang Chou
  2026-04-22 17:24   ` Tejun Heo
  2026-04-24  3:29   ` Richard Cheng
  1 sibling, 2 replies; 6+ messages in thread

From: Cheng-Yang Chou @ 2026-04-22 10:51 UTC (permalink / raw)
  To: Richard Cheng
  Cc: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

Hi Richard,

On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> When unregistered my self-written scx scheduler, the following panic
> occurs [1].

Nit: you've placed the panic log [1] below the --- separator. Content
below this line will not be preserved in the commit msg. Could you
please move it into the commit msg body and send a v2 patch?

>
> The root cause is that the JIT page backing ops->quiescent() is freed
> before all callers of that function have stopped.
>
> The expected ordering during teardown is:
> bitmap_zero(sch->has_op) + synchronize_rcu()
> -> guarantees no CPU will ever call sch->ops.* again
> -> only THEN free the BPF struct_ops JIT page
>
> bpf_scx_unreg() is supposed to enforce the order, but after
> commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> irq_work"), disable_work is no longer queued directly, causing
> kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> map too early and poisoned with AARCH64_BREAK_FAULT before
> disable_workfn ever execute.
>
> So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> as true and calls ops.quiescent, which hit on the poisoned page and BRK
> panic.
>
> Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> to be queued before waiting for it.
>
> Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> Signed-off-by: Richard Cheng <icheng@nvidia.com>

Thanks for the fix, and the logic looks correct to me.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

Also, scx_root_enable_workfn() has the same pattern in its error path:

	scx_error(sch, "scx_root_enable() failed (%d)", ret);
	kthread_flush_work(&sch->disable_work);

The comment above indicates that this flush is meant to "ensure that
error is reported before init completion". However, because scx_error()
goes through scx_vexit() -> irq_work_queue(), the flush can be a no-op
here as well. The same applies to the sub-scheduler enable error path.

Should those be fixed in the same patch? Tejun, Andrea, wdyt? Thanks

[...]

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..065660382a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
>  	scx_disable(sch, SCX_EXIT_UNREG);
> +	/*
> +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> +	 * as a noop. Syncing the irq_work first is required to guarantee the

Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^

> +	 * kthread work has been queued before waiting for it.
> +	 */
> +	irq_work_sync(&sch->disable_irq_work);
>  	kthread_flush_work(&sch->disable_work);
>  	RCU_INIT_POINTER(ops->priv, NULL);
>  	kobject_put(&sch->kobj);
> --
> 2.43.0
>

--
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:51 ` Cheng-Yang Chou
@ 2026-04-22 17:24   ` Tejun Heo
  2026-04-24  3:31     ` Richard Cheng
  1 sibling, 1 reply; 6+ messages in thread

From: Tejun Heo @ 2026-04-22 17:24 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: Richard Cheng, arighi, mingo, peterz, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

Hello,

On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> Also, scx_root_enable_workfn() has the same pattern in its error path:
>
> 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> 	kthread_flush_work(&sch->disable_work);

Yeah, can you please add a helper - e.g. flush_disable_work() - to
package sync and flush and use that in both places?

> > +	/*
> > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > +	 * as a noop. Syncing the irq_work first is required to guarantee the
>
> Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^

noop is used widely in the kernel. In general, I don't think we need
this level of language policing.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread
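[Editorial note: a helper along the lines Tejun suggests might look as
follows. This is a sketch only, not the actual v2 patch —
flush_disable_work() is merely the name floated above, and the final
shape is up to the v2 submission:]

	/* Sketch: package the irq_work sync + kthread work flush. */
	static void flush_disable_work(struct scx_sched *sch)
	{
		/*
		 * scx_disable()/scx_error() only queue disable_irq_work;
		 * it is the irq_work handler that queues disable_work.
		 * Sync the irq_work first so the kthread work is
		 * guaranteed to be queued before we wait for it.
		 */
		irq_work_sync(&sch->disable_irq_work);
		kthread_flush_work(&sch->disable_work);
	}

Both bpf_scx_unreg() and the scx_root_enable_workfn() error path would
then call this helper instead of a bare kthread_flush_work().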
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 17:24 ` Tejun Heo
@ 2026-04-24  3:31   ` Richard Cheng
  0 siblings, 0 replies; 6+ messages in thread

From: Richard Cheng @ 2026-04-24 3:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cheng-Yang Chou, arighi, mingo, peterz, void, changwoo,
      juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
      mgorman, vschneid, sched-ext, linux-kernel, newtonl, kristinc,
      kaihengf, kobak, Ching-Chun Huang, Chia-Ping Tsai

On Wed, Apr 22, 2026 at 07:24:07AM +0800, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> > Also, scx_root_enable_workfn() has the same pattern in its error path:
> >
> > 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> > 	kthread_flush_work(&sch->disable_work);
>
> Yeah, can you please add a helper - e.g. flush_disable_work() - to
> package sync and flush and use that in both places?

Hello Tejun,

No problem, I'll add this part in v2.

> > > +	/*
> > > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > > +	 * as a noop. Syncing the irq_work first is required to guarantee the
> >
> > Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^
>
> noop is used widely in the kernel. In general, I don't think we need
> this level of language policing.
>
> Thanks.
>
> --
> tejun

Thanks.

- Richard

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:51 ` Cheng-Yang Chou
  2026-04-22 17:24 ` Tejun Heo
@ 2026-04-24  3:29   ` Richard Cheng
  1 sibling, 0 replies; 6+ messages in thread

From: Richard Cheng @ 2026-04-24 3:29 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> Hi Richard,
>
> On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> > When unregistered my self-written scx scheduler, the following panic
> > occurs [1].
>
> Nit: you've placed the panic log [1] below the --- separator. Content
> below this line will not be preserved in the commit msg.
> Could you please move it into the commit msg body and send a v2 patch?

Hi Cheng-Yang,

Sure thing, thanks.

> >
> > The root cause is that the JIT page backing ops->quiescent() is freed
> > before all callers of that function have stopped.
> >
> > The expected ordering during teardown is:
> > bitmap_zero(sch->has_op) + synchronize_rcu()
> > -> guarantees no CPU will ever call sch->ops.* again
> > -> only THEN free the BPF struct_ops JIT page
> >
> > bpf_scx_unreg() is supposed to enforce the order, but after
> > commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> > irq_work"), disable_work is no longer queued directly, causing
> > kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> > map too early and poisoned with AARCH64_BREAK_FAULT before
> > disable_workfn ever execute.
> >
> > So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> > as true and calls ops.quiescent, which hit on the poisoned page and BRK
> > panic.
> >
> > Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> > to be queued before waiting for it.
> >
> > Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> > Signed-off-by: Richard Cheng <icheng@nvidia.com>
>
> Thanks for the fix, and the logic looks correct to me.
>
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
>
> Also, scx_root_enable_workfn() has the same pattern in its error path:
>
> 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> 	kthread_flush_work(&sch->disable_work);
>
> The comment above indicates that this flush is meant to "ensure that
> error is reported before init completion". However, because scx_error()
> goes through scx_vexit() -> irq_work_queue(), the flush can be a no-op
> here as well. The same applies to the sub-scheduler enable error path.
>
> Should those be fixed in the same patch? Tejun, Andrea, wdyt? Thanks

Sure, I'll amend that part in v2.

> [...]
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 012ca8bd70fb..065660382a0c 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> >  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
> >
> >  	scx_disable(sch, SCX_EXIT_UNREG);
> > +	/*
> > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > +	 * as a noop. Syncing the irq_work first is required to guarantee the
>
> Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^
>
> > +	 * kthread work has been queued before waiting for it.
> > +	 */
> > +	irq_work_sync(&sch->disable_irq_work);
> >  	kthread_flush_work(&sch->disable_work);
> >  	RCU_INIT_POINTER(ops->priv, NULL);
> >  	kobject_put(&sch->kobj);
> > --
> > 2.43.0
> >
>
> --
> Cheers,
> Cheng-Yang

- Richard

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-24  3:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
2026-04-22 10:46 ` Andrea Righi
2026-04-22 10:51 ` Cheng-Yang Chou
2026-04-22 17:24   ` Tejun Heo
2026-04-24  3:31     ` Richard Cheng
2026-04-24  3:29   ` Richard Cheng