From: Richard Cheng <icheng@nvidia.com>
To: arighi@nvidia.com, mingo@redhat.com, peterz@infradead.org,
tj@kernel.org, void@manifault.com, changwoo@igalia.com,
juri.lelli@redhat.com, vincent.guittot@linaro.org,
dietmar.eggemann@arm.com, rostedt@goodmis.org,
bsegall@google.com, mgorman@suse.de, vschneid@redhat.com
Cc: sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
newtonl@nvidia.com, kristinc@nvidia.com, kaihengf@nvidia.com,
kobak@nvidia.com, Richard Cheng <icheng@nvidia.com>
Subject: [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg()
Date: Wed, 22 Apr 2026 18:09:38 +0800 [thread overview]
Message-ID: <20260422100938.35781-1-icheng@nvidia.com> (raw)
When unregistered my self-written scx scheduler, the following panic
occurs [1].
The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.
The expected ordering during teardown is:
bitmap_zero(sch->has_op) + synchronize_rcu()
-> guarantees no CPU will ever call sch->ops.* again
-> only THEN free the BPF struct_ops JIT page
bpf_scx_unreg() is supposed to enforce the order, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, causing
kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops
map too early and poisoned with AARCH64_BREAK_FAULT before
disable_workfn ever execute.
So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent)
as true and calls ops.quiescent, which hit on the poisoned page and BRK
panic.
Fix it by syncing disable_irq_work first, so disable_work is guaranteed
to be queued before waiting for it.
Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
---
[1]:
[ 188.572805] sched_ext: BPF scheduler "invariant_0.1.0_aarch64_unknown_linux_gnu_debug" enabled
[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
Best regards,
Richard Cheng.
---
kernel/sched/ext.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 012ca8bd70fb..065660382a0c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7349,6 +7349,12 @@ static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
struct scx_sched *sch = rcu_dereference_protected(ops->priv, true);
scx_disable(sch, SCX_EXIT_UNREG);
+ /*
+ * sch->disable_work might still not queued, causing kthread_flush_work()
+ * as a noop. Syncing the irq_work first is required to guarantee the
+ * kthread work has been queued before waiting for it.
+ */
+ irq_work_sync(&sch->disable_irq_work);
kthread_flush_work(&sch->disable_work);
RCU_INIT_POINTER(ops->priv, NULL);
kobject_put(&sch->kobj);
--
2.43.0
next reply other threads:[~2026-04-22 10:09 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-22 10:09 Richard Cheng [this message]
2026-04-22 10:46 ` [PATCH] sched_ext: sync disable_irq_work in bpf_scx_unreg() Andrea Righi
2026-04-22 10:51 ` Cheng-Yang Chou
2026-04-22 17:24 ` Tejun Heo
2026-04-24 3:31 ` Richard Cheng
2026-04-24 3:29 ` Richard Cheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260422100938.35781-1-icheng@nvidia.com \
--to=icheng@nvidia.com \
--cc=arighi@nvidia.com \
--cc=bsegall@google.com \
--cc=changwoo@igalia.com \
--cc=dietmar.eggemann@arm.com \
--cc=juri.lelli@redhat.com \
--cc=kaihengf@nvidia.com \
--cc=kobak@nvidia.com \
--cc=kristinc@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=newtonl@nvidia.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=vincent.guittot@linaro.org \
--cc=void@manifault.com \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.