* [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg()
@ 2026-04-24 10:02 Richard Cheng
2026-04-24 10:21 ` Cheng-Yang Chou
2026-04-24 17:29 ` Tejun Heo
0 siblings, 2 replies; 3+ messages in thread
From: Richard Cheng @ 2026-04-24 10:02 UTC (permalink / raw)
To: tj
Cc: void, arighi, changwoo, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, kprateek.nayak, sched-ext, linux-kernel, yphbchou0911,
newtonl, kristinc, kaihengf, kobak, jserv, chia7712,
Richard Cheng
When unregistering my self-written scx scheduler, the following panic
occurs.
[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.
The expected ordering during teardown is:
bitmap_zero(sch->has_op) + synchronize_rcu()
-> guarantees no CPU will ever call sch->ops.* again
-> only THEN free the BPF struct_ops JIT page
bpf_scx_unreg() is supposed to enforce this ordering, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, so
kthread_flush_work() can return as a noop. The caller then drops the
struct_ops map too early, and the JIT page is poisoned with
AARCH64_BREAK_FAULT before disable_workfn ever executes.
A subsequent dequeue_task() thus still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent(), which hits the poisoned page and
triggers the BRK panic.
Add a scx_flush_disable_work() helper so future call sites that need to
flush disable_work can use it.
Also amend the calls in scx_root_enable_workfn() and
scx_sub_enable_workfn(), which have a similar pattern in their error
paths.
Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
Changelog:
v1 -> v2:
- Add scx_flush_disable_work() helper
- Amend error path in scx_root_enable_workfn() and
scx_sub_enable_workfn()
Best regards,
Richard Cheng.
---
kernel/sched/ext.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 012ca8bd70fb..ff42ac197bfd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5921,6 +5921,20 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
irq_work_queue(&sch->disable_irq_work);
}
+/**
+ * scx_flush_disable_work - flush the disable work and wait for it to finish
+ * @sch: the scheduler
+ *
+ * sch->disable_work might not be queued yet, which would make
+ * kthread_flush_work() a noop. Sync the irq_work first to guarantee the
+ * kthread work has been queued before waiting for it.
+ */
+static void scx_flush_disable_work(struct scx_sched *sch)
+{
+ irq_work_sync(&sch->disable_irq_work);
+ kthread_flush_work(&sch->disable_work);
+}
+
static void dump_newline(struct seq_buf *s)
{
trace_sched_ext_dump("");
@@ -6821,7 +6835,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
* completion. sch's base reference will be put by bpf_scx_unreg().
*/
scx_error(sch, "scx_root_enable() failed (%d)", ret);
- kthread_flush_work(&sch->disable_work);
+ scx_flush_disable_work(sch);
cmd->ret = 0;
}
@@ -7088,7 +7102,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
percpu_up_write(&scx_fork_rwsem);
err_disable:
mutex_unlock(&scx_enable_mutex);
- kthread_flush_work(&sch->disable_work);
+ scx_flush_disable_work(sch);
cmd->ret = 0;
}
@@ -7349,7 +7363,7 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
scx_disable(sch, SCX_EXIT_UNREG);
- kthread_flush_work(&sch->disable_work);
+ scx_flush_disable_work(sch);
RCU_INIT_POINTER(ops->priv, NULL);
kobject_put(&sch->kobj);
}
--
2.43.0
* Re: [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg()
2026-04-24 10:02 [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
@ 2026-04-24 10:21 ` Cheng-Yang Chou
2026-04-24 17:29 ` Tejun Heo
1 sibling, 0 replies; 3+ messages in thread
From: Cheng-Yang Chou @ 2026-04-24 10:21 UTC (permalink / raw)
To: Richard Cheng
Cc: tj, void, arighi, changwoo, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, kprateek.nayak, sched-ext, linux-kernel, newtonl,
kristinc, kaihengf, kobak, jserv, chia7712
Hi Richard,
On Fri, Apr 24, 2026 at 06:02:21PM +0800, Richard Cheng wrote:
[...]
> ---
> kernel/sched/ext.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..ff42ac197bfd 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5921,6 +5921,20 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
> irq_work_queue(&sch->disable_irq_work);
> }
>
> +/**
> + * scx_flush_disable_work - flush the disable work and wait for it to finish
> + * @sch: the scheduler
Not to be picky here, but I think '@sch: the scheduler' is a bit vague.
Since nearly every 'sch' in ext.c is a scheduler instance,
maybe '@sch: sched to be flushed' would be more descriptive?
You can check more specific comments for 'sch' elsewhere in ext.c as well.
WDYT, thanks!
> + *
> + * sch->disable_work might not be queued yet, which would make
> + * kthread_flush_work() a noop. Sync the irq_work first to guarantee the
> + * kthread work has been queued before waiting for it.
> + */
> +static void scx_flush_disable_work(struct scx_sched *sch)
> +{
> + irq_work_sync(&sch->disable_irq_work);
> + kthread_flush_work(&sch->disable_work);
> +}
> +
> static void dump_newline(struct seq_buf *s)
> {
> trace_sched_ext_dump("");
> @@ -6821,7 +6835,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
> * completion. sch's base reference will be put by bpf_scx_unreg().
> */
> scx_error(sch, "scx_root_enable() failed (%d)", ret);
> - kthread_flush_work(&sch->disable_work);
> + scx_flush_disable_work(sch);
> cmd->ret = 0;
> }
>
> @@ -7088,7 +7102,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
> percpu_up_write(&scx_fork_rwsem);
> err_disable:
> mutex_unlock(&scx_enable_mutex);
> - kthread_flush_work(&sch->disable_work);
> + scx_flush_disable_work(sch);
> cmd->ret = 0;
> }
>
> @@ -7349,7 +7363,7 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
> scx_disable(sch, SCX_EXIT_UNREG);
> - kthread_flush_work(&sch->disable_work);
> + scx_flush_disable_work(sch);
> RCU_INIT_POINTER(ops->priv, NULL);
> kobject_put(&sch->kobj);
> }
> --
> 2.43.0
>
--
Cheers,
Cheng-Yang
* Re: [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg()
2026-04-24 10:02 [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
2026-04-24 10:21 ` Cheng-Yang Chou
@ 2026-04-24 17:29 ` Tejun Heo
1 sibling, 0 replies; 3+ messages in thread
From: Tejun Heo @ 2026-04-24 17:29 UTC (permalink / raw)
To: Richard Cheng
Cc: void, arighi, changwoo, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, kprateek.nayak, sched-ext, linux-kernel, yphbchou0911,
newtonl, kristinc, kaihengf, kobak, jserv, chia7712,
Emil Tsalapatis
Hello,
On Fri, Apr 24, 2026 at 06:02:21PM +0800, Richard Cheng wrote:
> [PATCH v2] sched_ext: sync disable_irq_work in bpf_scx_unreg()
Applied to sched_ext/for-7.1-fixes.
Thanks.
--
tejun