* [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
@ 2026-04-22 10:09 Richard Cheng
2026-04-22 10:46 ` Andrea Righi
2026-04-22 10:51 ` Cheng-Yang Chou
0 siblings, 2 replies; 6+ messages in thread
From: Richard Cheng @ 2026-04-22 10:09 UTC (permalink / raw)
To: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid
Cc: sched-ext, linux-kernel, newtonl, kristinc, kaihengf, kobak,
Richard Cheng
When unregistering my self-written scx scheduler, the following panic
occurs [1].
The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.
The expected ordering during teardown is:

  bitmap_zero(sch->has_op) + synchronize_rcu()
    -> guarantees no CPU will ever call sch->ops.* again
    -> only THEN free the BPF struct_ops JIT page
bpf_scx_unreg() is supposed to enforce this ordering, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, so
kthread_flush_work() can become a noop. The caller then drops the
struct_ops map too early, and the JIT page is poisoned with
AARCH64_BREAK_FAULT before disable_workfn ever executes.
The subsequent dequeue_task() therefore still sees SCX_HAS_OP(sch,
quiescent) as true and calls ops.quiescent(), which hits the poisoned
page and triggers a BRK panic.
Fix it by syncing disable_irq_work first, so disable_work is guaranteed
to be queued before waiting for it.
Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
[1]:
[ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
Best regards,
Richard Cheng.
---
kernel/sched/ext.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 012ca8bd70fb..065660382a0c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
scx_disable(sch, SCX_EXIT_UNREG);
+ /*
+ * sch->disable_work might still not queued, causing kthread_flush_work()
+ * as a noop. Syncing the irq_work first is required to guarantee the
+ * kthread work has been queued before waiting for it.
+ */
+ irq_work_sync(&sch->disable_irq_work);
kthread_flush_work(&sch->disable_work);
RCU_INIT_POINTER(ops->priv, NULL);
kobject_put(&sch->kobj);
--
2.43.0
^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
@ 2026-04-22 10:46 ` Andrea Righi
  2026-04-22 10:51 ` Cheng-Yang Chou
  1 sibling, 0 replies; 6+ messages in thread

From: Andrea Righi @ 2026-04-22 10:46 UTC (permalink / raw)
  To: Richard Cheng
  Cc: mingo, peterz, tj, void, changwoo, juri.lelli, vincent.guittot,
      dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, sched-ext,
      linux-kernel, newtonl, kristinc, kaihengf, kobak

Hi Richard,

On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> When unregistered my self-written scx scheduler, the following panic
> occurs [1].
>
> The root cause is that the JIT page backing ops->quiescent() is freed
> before all callers of that function have stopped.
>
> The expected ordering during teardown is:
> bitmap_zero(sch->has_op) + synchronize_rcu()
> -> guarantees no CPU will ever call sch->ops.* again
> -> only THEN free the BPF struct_ops JIT page
>
> bpf_scx_unreg() is supposed to enforce the order, but after
> commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> irq_work"), disable_work is no longer queued directly, causing
> kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> map too early and poisoned with AARCH64_BREAK_FAULT before
> disable_workfn ever execute.
>
> So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> as true and calls ops.quiescent, which hit on the poisoned page and BRK
> panic.
>
> Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> to be queued before waiting for it.
>
> Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> Signed-off-by: Richard Cheng <icheng@nvidia.com>
> ---
> [1]:
> [ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
> [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
> [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
> [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
> [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
> [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
> [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
> [ 230.116843] pc : 0xffff80009bc2c1f8
> [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
> [ 230.217749] Call trace:
> [ 230.228515] 0xffff80009bc2c1f8 (P)
> [ 230.232077] dequeue_task+0x84/0x188
> [ 230.235728] sched_change_begin+0x1dc/0x250
> [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
> [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
> [ 230.249701] ___migrate_enable+0x4c/0xa0
> [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
> [ 230.258246] process_one_work+0x184/0x540
> [ 230.262342] worker_thread+0x19c/0x348
> [ 230.266170] kthread+0x13c/0x150
> [ 230.269465] ret_from_fork+0x10/0x20
> [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> [ 230.287621] ---[ end trace 0000000000000000 ]---
> [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
>
> Best regards,
> Richard Cheng.
> ---
>  kernel/sched/ext.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..065660382a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
>  	scx_disable(sch, SCX_EXIT_UNREG);
> +	/*
> +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> +	 * as a noop. Syncing the irq_work first is required to guarantee the

nit, maybe rephrase:

  sch->disable_work might not have been queued yet, causing
  kthread_flush_work() to be a no-op.

> +	 * kthread work has been queued before waiting for it.
> +	 */
> +	irq_work_sync(&sch->disable_irq_work);
>  	kthread_flush_work(&sch->disable_work);
>  	RCU_INIT_POINTER(ops->priv, NULL);
>  	kobject_put(&sch->kobj);
> --
> 2.43.0
>

I can't reproduce it locally, but from a logical perspective it makes
sense. Nice catch!

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
  2026-04-22 10:46 ` Andrea Righi
@ 2026-04-22 10:51 ` Cheng-Yang Chou
  2026-04-22 17:24   ` Tejun Heo
  2026-04-24  3:29   ` Richard Cheng
  1 sibling, 2 replies; 6+ messages in thread

From: Cheng-Yang Chou @ 2026-04-22 10:51 UTC (permalink / raw)
  To: Richard Cheng
  Cc: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

Hi Richard,

On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> When unregistered my self-written scx scheduler, the following panic
> occurs [1].

Nit: you've placed the panic log [1] below the --- separator. Content
below this line will not be preserved in the commit msg. Could you
please move it into the commit msg body and send a v2 patch?

>
> The root cause is that the JIT page backing ops->quiescent() is freed
> before all callers of that function have stopped.
>
> The expected ordering during teardown is:
> bitmap_zero(sch->has_op) + synchronize_rcu()
> -> guarantees no CPU will ever call sch->ops.* again
> -> only THEN free the BPF struct_ops JIT page
>
> bpf_scx_unreg() is supposed to enforce the order, but after
> commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> irq_work"), disable_work is no longer queued directly, causing
> kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> map too early and poisoned with AARCH64_BREAK_FAULT before
> disable_workfn ever execute.
>
> So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> as true and calls ops.quiescent, which hit on the poisoned page and BRK
> panic.
>
> Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> to be queued before waiting for it.
>
> Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> Signed-off-by: Richard Cheng <icheng@nvidia.com>

Thanks for the fix, and the logic looks correct to me.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

Also, scx_root_enable_workfn() has the same pattern in its error path:

	scx_error(sch, "scx_root_enable() failed (%d)", ret);
	kthread_flush_work(&sch->disable_work);

The comment above indicates that this flush is meant to "ensure that
error is reported before init completion". However, because scx_error()
goes through scx_vexit() -> irq_work_queue(), the flush can be a no-op
here as well. The same applies to the sub-scheduler enable error path.

Should those be fixed in the same patch? Tejun, Andrea, wdyt? Thanks

[...]

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 012ca8bd70fb..065660382a0c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
>  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
>
>  	scx_disable(sch, SCX_EXIT_UNREG);
> +	/*
> +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> +	 * as a noop. Syncing the irq_work first is required to guarantee the

Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^

> +	 * kthread work has been queued before waiting for it.
> +	 */
> +	irq_work_sync(&sch->disable_irq_work);
>  	kthread_flush_work(&sch->disable_work);
>  	RCU_INIT_POINTER(ops->priv, NULL);
>  	kobject_put(&sch->kobj);
> --
> 2.43.0
>

--
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:51 ` Cheng-Yang Chou
@ 2026-04-22 17:24   ` Tejun Heo
  2026-04-24  3:31     ` Richard Cheng
  1 sibling, 1 reply; 6+ messages in thread

From: Tejun Heo @ 2026-04-22 17:24 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: Richard Cheng, arighi, mingo, peterz, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

Hello,

On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> Also, scx_root_enable_workfn() has the same pattern in its error path:
>
> 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> 	kthread_flush_work(&sch->disable_work);

Yeah, can you please add a helper - e.g. flush_disable_work() - to
package sync and flush and use that in both places?

> > +	/*
> > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > +	 * as a noop. Syncing the irq_work first is required to guarantee the
>
> Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^

noop is used widely in the kernel. In general, I don't think we need
this level of language policing.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread
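[Editorial note: a helper along the lines Tejun suggests might look as
follows. This is a sketch only, not the actual v2 patch —
flush_disable_work() is merely the name floated above, and the final
shape is up to the v2 submission:]

	/* Sketch: package the irq_work sync + kthread work flush. */
	static void flush_disable_work(struct scx_sched *sch)
	{
		/*
		 * scx_disable()/scx_error() only queue disable_irq_work;
		 * it is the irq_work handler that queues disable_work.
		 * Sync the irq_work first so the kthread work is
		 * guaranteed to be queued before we wait for it.
		 */
		irq_work_sync(&sch->disable_irq_work);
		kthread_flush_work(&sch->disable_work);
	}

Both bpf_scx_unreg() and the scx_root_enable_workfn() error path would
then call this helper instead of a bare kthread_flush_work().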
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 17:24 ` Tejun Heo
@ 2026-04-24  3:31   ` Richard Cheng
  0 siblings, 0 replies; 6+ messages in thread

From: Richard Cheng @ 2026-04-24 3:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cheng-Yang Chou, arighi, mingo, peterz, void, changwoo,
      juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
      mgorman, vschneid, sched-ext, linux-kernel, newtonl, kristinc,
      kaihengf, kobak, Ching-Chun Huang, Chia-Ping Tsai

On Wed, Apr 22, 2026 at 07:24:07AM +0800, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> > Also, scx_root_enable_workfn() has the same pattern in its error path:
> >
> > 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> > 	kthread_flush_work(&sch->disable_work);
>
> Yeah, can you please add a helper - e.g. flush_disable_work() - to
> package sync and flush and use that in both places?

Hello Tejun,

No problem, I'll add this part in v2.

> > > +	/*
> > > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > > +	 * as a noop. Syncing the irq_work first is required to guarantee the
> >
> > Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^
>
> noop is used widely in the kernel. In general, I don't think we need
> this level of language policing.
>
> Thanks.
>
> --
> tejun

Thanks.

- Richard

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
  2026-04-22 10:51 ` Cheng-Yang Chou
  2026-04-22 17:24 ` Tejun Heo
@ 2026-04-24  3:29   ` Richard Cheng
  1 sibling, 0 replies; 6+ messages in thread

From: Richard Cheng @ 2026-04-24 3:29 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: arighi, mingo, peterz, tj, void, changwoo, juri.lelli,
      vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
      vschneid, sched-ext, linux-kernel, newtonl, kristinc, kaihengf,
      kobak, Ching-Chun Huang, Chia-Ping Tsai

On Wed, Apr 22, 2026 at 06:51:13PM +0800, Cheng-Yang Chou wrote:
> Hi Richard,
>
> On Wed, Apr 22, 2026 at 06:09:38PM +0800, Richard Cheng wrote:
> > When unregistered my self-written scx scheduler, the following panic
> > occurs [1].
>
> Nit: you've placed the panic log [1] below the --- separator. Content
> below this line will not be preserved in the commit msg.
> Could you please move it into the commit msg body and send a v2 patch?

Hi Cheng-Yang,

Sure thing, thanks.

> >
> > The root cause is that the JIT page backing ops->quiescent() is freed
> > before all callers of that function have stopped.
> >
> > The expected ordering during teardown is:
> > bitmap_zero(sch->has_op) + synchronize_rcu()
> > -> guarantees no CPU will ever call sch->ops.* again
> > -> only THEN free the BPF struct_ops JIT page
> >
> > bpf_scx_unreg() is supposed to enforce the order, but after
> > commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
> > irq_work"), disable_work is no longer queued directly, causing
> > kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
> > map too early and poisoned with AARCH64_BREAK_FAULT before
> > disable_workfn ever execute.
> >
> > So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
> > as true and calls ops.quiescent, which hit on the poisoned page and BRK
> > panic.
> >
> > Fix it by syncing disable_irq_work first, so disable_work is guaranteed
> > to be queued before waiting for it.
> >
> > Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
> > Signed-off-by: Richard Cheng <icheng@nvidia.com>
>
> Thanks for the fix, and the logic looks correct to me.
>
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
>
> Also, scx_root_enable_workfn() has the same pattern in its error path:
>
> 	scx_error(sch, "scx_root_enable() failed (%d)", ret);
> 	kthread_flush_work(&sch->disable_work);
>
> The comment above indicates that this flush is meant to "ensure that
> error is reported before init completion". However, because scx_error()
> goes through scx_vexit() -> irq_work_queue(), the flush can be a no-op
> here as well. The same applies to the sub-scheduler enable error path.
>
> Should those be fixed in the same patch? Tejun, Andrea, wdyt? Thanks

Sure, I'll amend that part in v2.

> [...]
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 012ca8bd70fb..065660382a0c 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> >  	struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
> >
> >  	scx_disable(sch, SCX_EXIT_UNREG);
> > +	/*
> > +	 * sch->disable_work might still not queued, causing kthread_flush_work()
> > +	 * as a noop. Syncing the irq_work first is required to guarantee the
>
> Perhaps s/noop/no-op/? Though it's just a matter of taste. ^_^
>
> > +	 * kthread work has been queued before waiting for it.
> > +	 */
> > +	irq_work_sync(&sch->disable_irq_work);
> >  	kthread_flush_work(&sch->disable_work);
> >  	RCU_INIT_POINTER(ops->priv, NULL);
> >  	kobject_put(&sch->kobj);
> > --
> > 2.43.0
> >
>
> --
> Cheers,
> Cheng-Yang

- Richard

^ permalink raw reply	[flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-24  3:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-04-22 10:09 [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Richard Cheng
2026-04-22 10:46 ` Andrea Righi
2026-04-22 10:51 ` Cheng-Yang Chou
2026-04-22 17:24   ` Tejun Heo
2026-04-24  3:31     ` Richard Cheng
2026-04-24  3:29   ` Richard Cheng