* [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes
@ 2026-04-24 20:44 Tejun Heo
2026-04-24 20:44 ` [PATCH 01/13] sched_ext: Unregister sub_kset on scheduler disable Tejun Heo
` (15 more replies)
0 siblings, 16 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
Hello,
This patchset collects fixes for issues surfaced by Chris Mason's
AI-assisted review of sched_ext. The bugs span use-after-free, leak,
lock/state inconsistency, rq-lock AA deadlock, and cross-task kfunc
misuse paths. Each patch stands on its own.
Based on sched_ext/for-7.1-fixes (510a27055446).
1: sched_ext: Unregister sub_kset on scheduler disable
2: sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
3: sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
4: sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
5: sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
6: sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
7: sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
8: sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
9: sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
10: sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
11: sched_ext: Make bypass LB cpumasks per-scheduler
12: sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
13: sched_ext: Refuse cross-task select_cpu_from_kfunc calls
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git fix-slop-review
kernel/sched/ext.c | 238 +++++++++++++++++++++++++++++---------------
kernel/sched/ext_idle.c | 19 +++-
kernel/sched/ext_internal.h | 2 +
3 files changed, 174 insertions(+), 85 deletions(-)
Thanks.
--
tejun
* [PATCH 01/13] sched_ext: Unregister sub_kset on scheduler disable
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 02/13] sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new Tejun Heo
` (14 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the kset's reference outlives
the final kobject_put() in bpf_scx_unreg(): scx_kobj_release() never runs and
the whole struct scx_sched leaks on every load/unload cycle.
Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a018034dd81c..0c435a4612dc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5700,6 +5700,8 @@ static void scx_sub_disable(struct scx_sched *sch)
if (sch->ops.exit)
SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
+ if (sch->sub_kset)
+ kset_unregister(sch->sub_kset);
kobject_del(&sch->kobj);
}
#else /* CONFIG_EXT_SUB_SCHED */
@@ -5831,6 +5833,10 @@ static void scx_root_disable(struct scx_sched *sch)
* could observe an object of the same name still in the hierarchy when
* the next scheduler is loaded.
*/
+#ifdef CONFIG_EXT_SUB_SCHED
+ if (sch->sub_kset)
+ kset_unregister(sch->sub_kset);
+#endif
kobject_del(&sch->kobj);
free_kick_syncs();
--
2.53.0
* [PATCH 02/13] sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
2026-04-24 20:44 ` [PATCH 01/13] sched_ext: Unregister sub_kset on scheduler disable Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 03/13] sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu() Tejun Heo
` (13 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.
Return false if kit->dsq is NULL.
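For illustration, a minimal trigger sketch (hypothetical scheduler callback;
BAD_DSQ_ID stands for any id that makes iter_new fail):

  void BPF_STRUCT_OPS(buggy_dispatch, s32 cpu, struct task_struct *prev)
  {
          struct bpf_iter_scx_dsq it;
          struct task_struct *p = prev;   /* any held task pointer */

          if (!p)
                  return;

          /* fails for, e.g., a nonexistent DSQ and leaves it.dsq NULL */
          bpf_iter_scx_dsq_new(&it, BAD_DSQ_ID, 0);

          /* verifier-legal: the iterator slot counts as initialized
           * regardless of iter_new's return; before this fix,
           * scx_dsq_move() dereferenced the NULL dsq immediately */
          scx_bpf_dsq_move(&it, p, SCX_DSQ_LOCAL, 0);

          bpf_iter_scx_dsq_destroy(&it);
  }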
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0c435a4612dc..89170a0e5779 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8055,12 +8055,22 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
struct task_struct *p, u64 dsq_id, u64 enq_flags)
{
struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq;
- struct scx_sched *sch = src_dsq->sched;
+ struct scx_sched *sch;
struct rq *this_rq, *src_rq, *locked_rq;
bool dispatched = false;
bool in_balance;
unsigned long flags;
+ /*
+ * The verifier considers an iterator slot initialized on any
+ * KF_ITER_NEW return, so a BPF program may legally reach here after
+ * bpf_iter_scx_dsq_new() failed and left @kit->dsq NULL.
+ */
+ if (unlikely(!src_dsq))
+ return false;
+
+ sch = src_dsq->sched;
+
if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags))
return false;
--
2.53.0
* [PATCH 03/13] sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
2026-04-24 20:44 ` [PATCH 01/13] sched_ext: Unregister sub_kset on scheduler disable Tejun Heo
2026-04-24 20:44 ` [PATCH 02/13] sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 04/13] sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path Tejun Heo
` (12 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
that consumption and @p's new DSQ is itself picked as a donor, @p is still on
the previous CPU and task_rq(@p) != donor_rq. @p can't be moved without its
own rq locked.
Skip such tasks.
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 89170a0e5779..62b4139a4cc8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5002,6 +5002,15 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
if (cpumask_empty(donee_mask))
break;
+ /*
+ * If an earlier pass placed @p on @donor_dsq from a different
+ * CPU and the donee hasn't consumed it yet, @p is still on the
+ * previous CPU and task_rq(@p) != @donor_rq. @p can't be moved
+ * without its rq locked. Skip.
+ */
+ if (task_rq(p) != donor_rq)
+ continue;
+
donee = cpumask_any_and_distribute(donee_mask, p->cpus_ptr);
if (donee >= nr_cpu_ids)
continue;
--
2.53.0
* [PATCH 04/13] sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (2 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 03/13] sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu() Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 05/13] sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters Tejun Heo
` (11 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.
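For example, a sub-scheduler using the common task-storage pattern (sketch;
map, struct, and callback names are illustrative) would tear down state that
was never set up:

  struct task_ctx {
          u64 flags;                              /* illustrative payload */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct task_ctx);
  } task_ctxs SEC(".maps");

  void BPF_STRUCT_OPS(sub_enable, struct task_struct *p)
  {
          /* create per-task state on enable */
          bpf_task_storage_get(&task_ctxs, p, NULL,
                               BPF_LOCAL_STORAGE_GET_F_CREATE);
  }

  void BPF_STRUCT_OPS(sub_disable, struct task_struct *p)
  {
          /* assumes a matching enable ran; unpaired, this deletes
           * state that was never created */
          bpf_task_storage_delete(&task_ctxs, p);
  }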
The dying-task branch in scx_disable_and_exit_task() had the same problem.
Worse, scx_enabling_sub_sched was cleared before the abort cleanup loop ran,
so a task exiting during cleanup tripped the WARN and skipped both
ops.exit_task and the SCX_TASK_SUB_INIT clear, leaking per-task resources and
leaving the task stuck with the stale SUB_INIT mark.
Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.
Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 62b4139a4cc8..f7cca6f07a58 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3633,6 +3633,22 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch,
SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args);
}
+/*
+ * Undo a completed __scx_init_task(sch, p, false) when scx_enable_task() never
+ * ran. The task state has not been transitioned, so this mirrors the
+ * SCX_TASK_INIT branch in __scx_disable_and_exit_task().
+ */
+static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
+{
+ struct scx_exit_task_args args = { .cancelled = true };
+
+ lockdep_assert_held(&p->pi_lock);
+ lockdep_assert_rq_held(task_rq(p));
+
+ if (SCX_HAS_OP(sch, exit_task))
+ SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args);
+}
+
static void scx_disable_and_exit_task(struct scx_sched *sch,
struct task_struct *p)
{
@@ -3641,11 +3657,12 @@ static void scx_disable_and_exit_task(struct scx_sched *sch,
/*
* If set, @p exited between __scx_init_task() and scx_enable_task() in
* scx_sub_enable() and is initialized for both the associated sched and
- * its parent. Disable and exit for the child too.
+ * its parent. Exit for the child too - scx_enable_task() never ran for
+ * it, so undo only init_task.
*/
- if ((p->scx.flags & SCX_TASK_SUB_INIT) &&
- !WARN_ON_ONCE(!scx_enabling_sub_sched)) {
- __scx_disable_and_exit_task(scx_enabling_sub_sched, p);
+ if (p->scx.flags & SCX_TASK_SUB_INIT) {
+ if (!WARN_ON_ONCE(!scx_enabling_sub_sched))
+ scx_sub_init_cancel_task(scx_enabling_sub_sched, p);
p->scx.flags &= ~SCX_TASK_SUB_INIT;
}
@@ -7103,16 +7120,23 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
abort:
put_task_struct(p);
scx_task_iter_stop(&sti);
- scx_enabling_sub_sched = NULL;
+ /*
+ * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
+ * ran for @sch on them, so calling scx_disable_task() here would invoke
+ * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
+ * must stay set until SUB_INIT is cleared from every marked task -
+ * scx_disable_and_exit_task() reads it when a task exits concurrently.
+ */
scx_task_iter_start(&sti, sch->cgrp);
while ((p = scx_task_iter_next_locked(&sti))) {
if (p->scx.flags & SCX_TASK_SUB_INIT) {
- __scx_disable_and_exit_task(sch, p);
+ scx_sub_init_cancel_task(sch, p);
p->scx.flags &= ~SCX_TASK_SUB_INIT;
}
}
scx_task_iter_stop(&sti);
+ scx_enabling_sub_sched = NULL;
err_unlock_and_disable:
/* we'll soon enter disable path, keep bypass on */
scx_cgroup_unlock();
--
2.53.0
* [PATCH 05/13] sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (3 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 04/13] sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 06/13] sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued() Tejun Heo
` (10 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).
scx_cgroup_enabled is toggled only with scx_cgroup_ops_rwsem write-held
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
guarantees that @sch is the scheduler the observed scx_cgroup_enabled value
belongs to.
Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f7cca6f07a58..59445e95d2f2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4343,9 +4343,10 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
void scx_group_set_weight(struct task_group *tg, unsigned long weight)
{
- struct scx_sched *sch = scx_root;
+ struct scx_sched *sch;
percpu_down_read(&scx_cgroup_ops_rwsem);
+ sch = scx_root;
if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) &&
tg->scx.weight != weight)
@@ -4358,9 +4359,10 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight)
void scx_group_set_idle(struct task_group *tg, bool idle)
{
- struct scx_sched *sch = scx_root;
+ struct scx_sched *sch;
percpu_down_read(&scx_cgroup_ops_rwsem);
+ sch = scx_root;
if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle))
SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle);
@@ -4374,9 +4376,10 @@ void scx_group_set_idle(struct task_group *tg, bool idle)
void scx_group_set_bandwidth(struct task_group *tg,
u64 period_us, u64 quota_us, u64 burst_us)
{
- struct scx_sched *sch = scx_root;
+ struct scx_sched *sch;
percpu_down_read(&scx_cgroup_ops_rwsem);
+ sch = scx_root;
if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_bandwidth) &&
(tg->scx.bw_period_us != period_us ||
--
2.53.0
* [PATCH 06/13] sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (4 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 05/13] sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 07/13] sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail Tejun Heo
` (9 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").
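For instance, a sub-scheduler pairing the two calls in its init/exit
callbacks (sketch; MY_DSQ is an arbitrary id) creates the DSQ in its own
hash but, before this fix, destroyed in the root's:

  s32 BPF_STRUCT_OPS_SLEEPABLE(sub_init)
  {
          /* resolves the calling scheduler, inserts into its dsq_hash */
          return scx_bpf_create_dsq(MY_DSQ, -1);
  }

  void BPF_STRUCT_OPS(sub_exit, struct scx_exit_info *ei)
  {
          /* used to resolve scx_root: with a same-id root DSQ present,
           * this destroyed the root's, not the one created above */
          scx_bpf_destroy_dsq(MY_DSQ);
  }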
Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 59445e95d2f2..4bd1fcba50c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8701,11 +8701,12 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
/**
* scx_bpf_dsq_nr_queued - Return the number of queued tasks
* @dsq_id: id of the DSQ
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
*
* Return the number of tasks in the DSQ matching @dsq_id. If not found,
* -%ENOENT is returned.
*/
-__bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
+__bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id, const struct bpf_prog_aux *aux)
{
struct scx_sched *sch;
struct scx_dispatch_q *dsq;
@@ -8713,7 +8714,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
preempt_disable();
- sch = rcu_dereference_sched(scx_root);
+ sch = scx_prog_sched(aux);
if (unlikely(!sch)) {
ret = -ENODEV;
goto out;
@@ -8745,21 +8746,21 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
/**
* scx_bpf_destroy_dsq - Destroy a custom DSQ
* @dsq_id: DSQ to destroy
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
*
* Destroy the custom DSQ identified by @dsq_id. Only DSQs created with
* scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is
* empty and no further tasks are dispatched to it. Ignored if called on a DSQ
* which doesn't exist. Can be called from any online scx_ops operations.
*/
-__bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id)
+__bpf_kfunc void scx_bpf_destroy_dsq(u64 dsq_id, const struct bpf_prog_aux *aux)
{
struct scx_sched *sch;
- rcu_read_lock();
- sch = rcu_dereference(scx_root);
+ guard(rcu)();
+ sch = scx_prog_sched(aux);
if (sch)
destroy_dsq(sch, dsq_id);
- rcu_read_unlock();
}
/**
@@ -9513,8 +9514,8 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU);
BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU);
BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
-BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
-BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
+BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_destroy_dsq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL)
BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS)
--
2.53.0
* [PATCH 07/13] sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (5 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 06/13] sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued() Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 08/13] sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP Tejun Heo
` (8 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.
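A cursor is parked whenever a DSQ iterator is live, e.g. (sketch; MY_DSQ and
the loop body are illustrative):

  struct task_struct *p;

  /* while the loop body runs, the iterator's cursor node sits on
   * dsq->list even if no real task remains queued */
  bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
          if (p->scx.dsq_vtime > deadline)        /* illustrative */
                  break;
  }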
Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.
Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ryan Newton <newton@meta.com>
---
kernel/sched/ext.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4bd1fcba50c5..045b4c914768 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1495,11 +1495,13 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN))
rcu_assign_pointer(dsq->first_task, p);
} else {
- bool was_empty;
-
- was_empty = list_empty(&dsq->list);
+ /*
+ * dsq->list can contain parked BPF iterator cursors, so
+ * list_empty() here isn't a reliable proxy for "no real
+ * task in the DSQ". Test dsq->first_task directly.
+ */
list_add_tail(&p->scx.dsq_list.node, &dsq->list);
- if (was_empty && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+ if (!dsq->first_task && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
rcu_assign_pointer(dsq->first_task, p);
}
}
--
2.53.0
* [PATCH 08/13] sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (6 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 07/13] sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 09/13] sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task Tejun Heo
` (7 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. That's correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the unconditional NULL write clobbers the outer invocation's
state. If the parent's BPF code then calls a kfunc like scx_bpf_cpuperf_set()
which takes the rq lock when scx_locked_rq() is NULL, it re-acquires the
already-held rq - an AA deadlock.
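Sketch of the recursion (callback name illustrative; scx_bpf_sub_dispatch()
arguments elided):

  void BPF_STRUCT_OPS(parent_dispatch, s32 cpu, struct task_struct *prev)
  {
          /* nested dispatch into the child scheduler:
           *
           *     scx_bpf_sub_dispatch(...);
           *
           * the inner SCX_CALL_OP cleared scx_locked_rq_state to NULL
           * on exit */

          /* before this fix: scx_locked_rq() == NULL here, so the
           * kfunc took the rq lock this CPU already holds */
          scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE);
  }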
Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.
Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 49 ++++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 045b4c914768..608d5dc4c8bc 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -470,24 +470,35 @@ static inline void update_locked_rq(struct rq *rq)
__this_cpu_write(scx_locked_rq_state, rq);
}
-#define SCX_CALL_OP(sch, op, rq, args...) \
+/*
+ * SCX ops can recurse via scx_bpf_sub_dispatch() - the inner call must not
+ * clobber the outer's scx_locked_rq_state. Save it on entry, restore on exit.
+ */
+#define SCX_CALL_OP(sch, op, locked_rq, args...) \
do { \
- if (rq) \
- update_locked_rq(rq); \
+ struct rq *__prev_locked_rq; \
+ \
+ if (locked_rq) { \
+ __prev_locked_rq = scx_locked_rq(); \
+ update_locked_rq(locked_rq); \
+ } \
(sch)->ops.op(args); \
- if (rq) \
- update_locked_rq(NULL); \
+ if (locked_rq) \
+ update_locked_rq(__prev_locked_rq); \
} while (0)
-#define SCX_CALL_OP_RET(sch, op, rq, args...) \
+#define SCX_CALL_OP_RET(sch, op, locked_rq, args...) \
({ \
+ struct rq *__prev_locked_rq; \
__typeof__((sch)->ops.op(args)) __ret; \
\
- if (rq) \
- update_locked_rq(rq); \
+ if (locked_rq) { \
+ __prev_locked_rq = scx_locked_rq(); \
+ update_locked_rq(locked_rq); \
+ } \
__ret = (sch)->ops.op(args); \
- if (rq) \
- update_locked_rq(NULL); \
+ if (locked_rq) \
+ update_locked_rq(__prev_locked_rq); \
__ret; \
})
@@ -499,39 +510,39 @@ do { \
* those subject tasks.
*
* Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held -
- * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock
- * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if
- * kf_tasks[] is set, @p's scheduler-protected fields are stable.
+ * either via the @locked_rq argument here, or (for ops.select_cpu()) via @p's
+ * pi_lock held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu.
+ * So if kf_tasks[] is set, @p's scheduler-protected fields are stable.
*
* kf_tasks[] can not stack, so task-based SCX ops must not nest. The
* WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
* while a previous one is still in progress.
*/
-#define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \
+#define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...) \
do { \
WARN_ON_ONCE(current->scx.kf_tasks[0]); \
current->scx.kf_tasks[0] = task; \
- SCX_CALL_OP((sch), op, rq, task, ##args); \
+ SCX_CALL_OP((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = NULL; \
} while (0)
-#define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \
+#define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) \
({ \
__typeof__((sch)->ops.op(task, ##args)) __ret; \
WARN_ON_ONCE(current->scx.kf_tasks[0]); \
current->scx.kf_tasks[0] = task; \
- __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \
+ __ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = NULL; \
__ret; \
})
-#define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \
+#define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...) \
({ \
__typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \
WARN_ON_ONCE(current->scx.kf_tasks[0]); \
current->scx.kf_tasks[0] = task0; \
current->scx.kf_tasks[1] = task1; \
- __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \
+ __ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \
current->scx.kf_tasks[0] = NULL; \
current->scx.kf_tasks[1] = NULL; \
__ret; \
--
2.53.0
* [PATCH 09/13] sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (7 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 08/13] sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 10/13] sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before Tejun Heo
` (6 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that takes the
rq lock when scx_locked_rq() is NULL - e.g. scx_bpf_cpuperf_set(cpu) - it
re-acquires the already-held rq, an AA deadlock.
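E.g. a dump callback like this (sketch; callback name illustrative,
arguments as invoked by scx_dump_state()):

  void BPF_STRUCT_OPS(mysched_dump_cpu, struct scx_dump_ctx *dctx,
                      s32 cpu, bool idle)
  {
          /* scx_dump_state() already holds @cpu's rq lock here; with
           * scx_locked_rq() NULL this re-acquired the same rq */
          scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE);
  }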
Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.
Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 608d5dc4c8bc..90008be04c7c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6096,9 +6096,8 @@ static void ops_dump_exit(void)
scx_dump_data.cpu = -1;
}
-static void scx_dump_task(struct scx_sched *sch,
- struct seq_buf *s, struct scx_dump_ctx *dctx,
- struct task_struct *p, char marker)
+static void scx_dump_task(struct scx_sched *sch, struct seq_buf *s, struct scx_dump_ctx *dctx,
+ struct rq *rq, struct task_struct *p, char marker)
{
static unsigned long bt[SCX_EXIT_BT_LEN];
struct scx_sched *task_sch = scx_task_sched(p);
@@ -6139,7 +6138,7 @@ static void scx_dump_task(struct scx_sched *sch,
if (SCX_HAS_OP(sch, dump_task)) {
ops_dump_init(s, " ");
- SCX_CALL_OP(sch, dump_task, NULL, dctx, p);
+ SCX_CALL_OP(sch, dump_task, rq, dctx, p);
ops_dump_exit();
}
@@ -6263,8 +6262,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
used = seq_buf_used(&ns);
if (SCX_HAS_OP(sch, dump_cpu)) {
ops_dump_init(&ns, " ");
- SCX_CALL_OP(sch, dump_cpu, NULL,
- &dctx, cpu, idle);
+ SCX_CALL_OP(sch, dump_cpu, rq, &dctx, cpu, idle);
ops_dump_exit();
}
@@ -6287,11 +6285,11 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
if (rq->curr->sched_class == &ext_sched_class &&
(dump_all_tasks || scx_task_on_sched(sch, rq->curr)))
- scx_dump_task(sch, &s, &dctx, rq->curr, '*');
+ scx_dump_task(sch, &s, &dctx, rq, rq->curr, '*');
list_for_each_entry(p, &rq->scx.runnable_list, scx.runnable_node)
if (dump_all_tasks || scx_task_on_sched(sch, p))
- scx_dump_task(sch, &s, &dctx, p, ' ');
+ scx_dump_task(sch, &s, &dctx, rq, p, ' ');
next:
rq_unlock_irqrestore(rq, &rf);
}
--
2.53.0
* [PATCH 10/13] sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (8 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 09/13] sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 11/13] sched_ext: Make bypass LB cpumasks per-scheduler Tejun Heo
` (5 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that takes the
rq lock when scx_locked_rq() is NULL - e.g. scx_bpf_cpuperf_set(cpu) - it
re-acquires the already-held rq, an AA deadlock.
Pass task_rq(a).
Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 90008be04c7c..da188af21b3d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3198,7 +3198,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) &&
!scx_bypassing(sch_a, task_cpu(a)))
return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before,
- NULL,
+ task_rq(a),
(struct task_struct *)a,
(struct task_struct *)b);
else
--
2.53.0
* [PATCH 11/13] sched_ext: Make bypass LB cpumasks per-scheduler
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (9 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 10/13] sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 12/13] sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED Tejun Heo
` (4 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo, stable
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.
Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 33 +++++++++++++++++++--------------
kernel/sched/ext_internal.h | 2 ++
2 files changed, 21 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index da188af21b3d..0980531c0e6c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -53,8 +53,6 @@ DEFINE_STATIC_KEY_FALSE(__scx_enabled);
DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
static DEFINE_RAW_SPINLOCK(scx_bypass_lock);
-static cpumask_var_t scx_bypass_lb_donee_cpumask;
-static cpumask_var_t scx_bypass_lb_resched_cpumask;
static bool scx_init_task_enabled;
static bool scx_switching_all;
DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
@@ -4747,6 +4745,8 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
irq_work_sync(&sch->disable_irq_work);
kthread_destroy_worker(sch->helper);
timer_shutdown_sync(&sch->bypass_lb_timer);
+ free_cpumask_var(sch->bypass_lb_donee_cpumask);
+ free_cpumask_var(sch->bypass_lb_resched_cpumask);
#ifdef CONFIG_EXT_SUB_SCHED
kfree(sch->cgrp_path);
@@ -5102,8 +5102,8 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
static void bypass_lb_node(struct scx_sched *sch, int node)
{
const struct cpumask *node_mask = cpumask_of_node(node);
- struct cpumask *donee_mask = scx_bypass_lb_donee_cpumask;
- struct cpumask *resched_mask = scx_bypass_lb_resched_cpumask;
+ struct cpumask *donee_mask = sch->bypass_lb_donee_cpumask;
+ struct cpumask *resched_mask = sch->bypass_lb_resched_cpumask;
u32 nr_tasks = 0, nr_cpus = 0, nr_balanced = 0;
u32 nr_target, nr_donor_target;
u32 before_min = U32_MAX, before_max = 0;
@@ -6499,6 +6499,15 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn);
kthread_init_work(&sch->disable_work, scx_disable_workfn);
timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0);
+
+ if (!alloc_cpumask_var(&sch->bypass_lb_donee_cpumask, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto err_stop_helper;
+ }
+ if (!alloc_cpumask_var(&sch->bypass_lb_resched_cpumask, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto err_free_lb_cpumask;
+ }
sch->ops = *ops;
rcu_assign_pointer(ops->priv, sch);
@@ -6508,14 +6517,14 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
char *buf = kzalloc(PATH_MAX, GFP_KERNEL);
if (!buf) {
ret = -ENOMEM;
- goto err_stop_helper;
+ goto err_free_lb_resched;
}
cgroup_path(cgrp, buf, PATH_MAX);
sch->cgrp_path = kstrdup(buf, GFP_KERNEL);
kfree(buf);
if (!sch->cgrp_path) {
ret = -ENOMEM;
- goto err_stop_helper;
+ goto err_free_lb_resched;
}
sch->cgrp = cgrp;
@@ -6550,10 +6559,12 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
#endif /* CONFIG_EXT_SUB_SCHED */
return sch;
-#ifdef CONFIG_EXT_SUB_SCHED
+err_free_lb_resched:
+ free_cpumask_var(sch->bypass_lb_resched_cpumask);
+err_free_lb_cpumask:
+ free_cpumask_var(sch->bypass_lb_donee_cpumask);
err_stop_helper:
kthread_destroy_worker(sch->helper);
-#endif
err_free_pcpu:
for_each_possible_cpu(cpu) {
if (cpu == bypass_fail_cpu)
@@ -9740,12 +9751,6 @@ static int __init scx_init(void)
return ret;
}
- if (!alloc_cpumask_var(&scx_bypass_lb_donee_cpumask, GFP_KERNEL) ||
- !alloc_cpumask_var(&scx_bypass_lb_resched_cpumask, GFP_KERNEL)) {
- pr_err("sched_ext: Failed to allocate cpumasks\n");
- return -ENOMEM;
- }
-
return 0;
}
__initcall(scx_init);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 62ce4eaf6a3f..a075732d4430 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1075,6 +1075,8 @@ struct scx_sched {
struct irq_work disable_irq_work;
struct kthread_work disable_work;
struct timer_list bypass_lb_timer;
+ cpumask_var_t bypass_lb_donee_cpumask;
+ cpumask_var_t bypass_lb_resched_cpumask;
struct rcu_work rcu_work;
/* all ancestors including self */
--
2.53.0
* [PATCH 12/13] sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (10 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 11/13] sched_ext: Make bypass LB cpumasks per-scheduler Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 20:44 ` [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls Tejun Heo
` (3 subsequent siblings)
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:
- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
reference on every root-sched enable.
- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
compile.
GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:
- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 41 ++++++++++++++++++++++-------------------
1 file changed, 22 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0980531c0e6c..9a31b8f064e9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4413,21 +4413,6 @@ static struct cgroup *root_cgroup(void)
return &cgrp_dfl_root.cgrp;
}
-static struct cgroup *sch_cgroup(struct scx_sched *sch)
-{
- return sch->cgrp;
-}
-
-/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
-static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
-{
- struct cgroup *pos;
- struct cgroup_subsys_state *css;
-
- cgroup_for_each_live_descendant_pre(pos, css, cgrp)
- rcu_assign_pointer(pos->scx_sched, sch);
-}
-
static void scx_cgroup_lock(void)
{
#ifdef CONFIG_EXT_GROUP_SCHED
@@ -4445,12 +4430,30 @@ static void scx_cgroup_unlock(void)
}
#else /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */
static struct cgroup *root_cgroup(void) { return NULL; }
-static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
-static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
static void scx_cgroup_lock(void) {}
static void scx_cgroup_unlock(void) {}
#endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */
+#ifdef CONFIG_EXT_SUB_SCHED
+static struct cgroup *sch_cgroup(struct scx_sched *sch)
+{
+ return sch->cgrp;
+}
+
+/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
+static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
+{
+ struct cgroup *pos;
+ struct cgroup_subsys_state *css;
+
+ cgroup_for_each_live_descendant_pre(pos, css, cgrp)
+ rcu_assign_pointer(pos->scx_sched, sch);
+}
+#else /* CONFIG_EXT_SUB_SCHED */
+static struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
+static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
+#endif /* CONFIG_EXT_SUB_SCHED */
+
/*
* Omitted operations:
*
@@ -6583,7 +6586,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
err_free_sch:
kfree(sch);
err_put_cgrp:
-#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
+#ifdef CONFIG_EXT_SUB_SCHED
cgroup_put(cgrp);
#endif
return ERR_PTR(ret);
@@ -6674,7 +6677,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
if (ret)
goto err_unlock;
-#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
+#ifdef CONFIG_EXT_SUB_SCHED
cgroup_get(cgrp);
#endif
sch = scx_alloc_and_add_sched(ops, cgrp, NULL);
--
2.53.0
* [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (11 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 12/13] sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED Tejun Heo
@ 2026-04-24 20:44 ` Tejun Heo
2026-04-24 21:46 ` Andrea Righi
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
2026-04-24 21:08 ` [PATCH 14/13] sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime Tejun Heo
` (2 subsequent siblings)
15 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
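Sketch of such a misuse (peer_of() is a hypothetical helper returning some
other KF_RCU task):

  s32 BPF_STRUCT_OPS(buggy_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
          struct task_struct *p2 = peer_of(p);    /* hypothetical */
          bool is_idle;

          /* try_to_wake_up() holds @p's pi_lock, not @p2's - reading
           * p2->cpus_ptr inside the kfunc races with
           * set_cpus_allowed_ptr() elsewhere */
          return scx_bpf_select_cpu_dfl(p2, prev_cpu, wake_flags,
                                        &is_idle);
  }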
Abort the scheduler on cross-task calls in both branches: check @p against
direct_dispatch_task (the task currently being selected) for
ops.select_cpu(), and task_rq(p) against scx_locked_rq() for other rq-locked
SCX ops.
Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index c43d62d90e40..ff4d1b97437d 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -927,14 +927,24 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
* Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq
* lock or @p's pi_lock. Three cases:
*
- * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock.
+ * - inside ops.select_cpu(): try_to_wake_up() holds the wake-up
+ * task's pi_lock (stashed in direct_dispatch_task;
+ * mark_direct_dispatch() invalidates it post-dispatch).
* - other rq-locked SCX op: scx_locked_rq() points at the held rq.
* - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops):
* nothing held, take pi_lock ourselves.
+ *
+ * In the first two cases, BPF schedulers may pass an arbitrary task
+ * that the held lock doesn't cover. Refuse those.
*/
if (this_rq()->scx.in_select_cpu) {
+ if (p != __this_cpu_read(direct_dispatch_task))
+ goto cross_task;
lockdep_assert_held(&p->pi_lock);
- } else if (!scx_locked_rq()) {
+ } else if (scx_locked_rq()) {
+ if (task_rq(p) != scx_locked_rq())
+ goto cross_task;
+ } else {
raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
we_locked = true;
}
@@ -960,6 +970,11 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
return cpu;
+
+cross_task:
+ scx_error(sch, "select_cpu kfunc called cross-task on %s[%d]",
+ p->comm, p->pid);
+ return -EINVAL;
}
/**
--
2.53.0
* [PATCH 14/13] sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (12 preceding siblings ...)
2026-04-24 20:44 ` [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls Tejun Heo
@ 2026-04-24 21:08 ` Tejun Heo
2026-04-24 21:08 ` [PATCH 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable Tejun Heo
2026-04-24 22:10 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Andrea Righi
2026-04-25 0:39 ` Tejun Heo
15 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 21:08 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.
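A sketch of such a caller (hypothetical syscall prog; the implicit aux
argument is supplied by the verifier and isn't written at the call site):

  struct poke_args { int pid; };                  /* illustrative */

  SEC("syscall")
  int poke_slice(struct poke_args *ctx)
  {
          struct task_struct *p = bpf_task_from_pid(ctx->pid);

          if (!p)
                  return 0;

          /* @p is a plain CFS task: p->scx.sched == NULL, and with
           * sch also NULL the old gate let this through */
          scx_bpf_task_set_slice(p, 0);
          bpf_task_release(p);
          return 0;
  }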
Reject !sch up front so the gate only admits callers with a resolved
scheduler.
Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9a31b8f064e9..52b63266e647 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8619,7 +8619,7 @@ __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice,
guard(rcu)();
sch = scx_prog_sched(aux);
- if (unlikely(!scx_task_on_sched(sch, p)))
+ if (unlikely(!sch || !scx_task_on_sched(sch, p)))
return false;
p->scx.slice = slice;
@@ -8642,7 +8642,7 @@ __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime,
guard(rcu)();
sch = scx_prog_sched(aux);
- if (unlikely(!scx_task_on_sched(sch, p)))
+ if (unlikely(!sch || !scx_task_on_sched(sch, p)))
return false;
p->scx.dsq_vtime = vtime;
--
2.53.0
* [PATCH 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
2026-04-24 21:08 ` [PATCH 14/13] sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime Tejun Heo
@ 2026-04-24 21:08 ` Tejun Heo
2026-04-24 22:00 ` Andrea Righi
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
0 siblings, 2 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-24 21:08 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.
scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.
Drop cpus_read_lock() before goto err_disable.
Fixes: 0128c850513a ("sched_ext: Exit early on hotplug events during attach")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 52b63266e647..d374a5a16bf9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6715,8 +6715,10 @@ static void scx_root_enable_workfn(struct kthread_work *work)
rcu_assign_pointer(scx_root, sch);
ret = scx_link_sched(sch);
- if (ret)
+ if (ret) {
+ cpus_read_unlock();
goto err_disable;
+ }
scx_idle_enable(ops);
--
2.53.0
* Re: [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls
2026-04-24 20:44 ` [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls Tejun Heo
@ 2026-04-24 21:46 ` Andrea Righi
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
1 sibling, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-04-24 21:46 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, linux-kernel,
Emil Tsalapatis, Chris Mason, Ryan Newton
Hi Tejun,
On Fri, Apr 24, 2026 at 10:44:18AM -1000, Tejun Heo wrote:
> select_cpu_from_kfunc() skipped pi_lock for @p when called from
> ops.select_cpu() or another rq-locked SCX op, assuming the held lock
> protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
> arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
> ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
> not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
> set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
>
> Abort the scheduler on cross-task calls in both branches: check @p against
> direct_dispatch_task (the task currently being selected) for
> ops.select_cpu(), and task_rq(p) against scx_locked_rq() for other rq-locked
> SCX ops.
>
> Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext_idle.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index c43d62d90e40..ff4d1b97437d 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -927,14 +927,24 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
> * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq
> * lock or @p's pi_lock. Three cases:
> *
> - * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock.
> + * - inside ops.select_cpu(): try_to_wake_up() holds the wake-up
> + * task's pi_lock (stashed in direct_dispatch_task;
> + * mark_direct_dispatch() invalidates it post-dispatch).
> * - other rq-locked SCX op: scx_locked_rq() points at the held rq.
> * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops):
> * nothing held, take pi_lock ourselves.
> + *
> + * In the first two cases, BPF schedulers may pass an arbitrary task
> + * that the held lock doesn't cover. Refuse those.
> */
> if (this_rq()->scx.in_select_cpu) {
> + if (p != __this_cpu_read(direct_dispatch_task))
> + goto cross_task;
I'm wondering what happens if, in ops.select_cpu(), the BPF scheduler calls
scx_bpf_dsq_insert() first and then calls scx_bpf_select_cpu_and() (or
scx_bpf_select_cpu_dfl()). This check doesn't look valid in that case,
because mark_direct_dispatch() would have set direct_dispatch_task to
ERR_PTR(-ESRCH).
Can we just check scx_kf_arg_task_ok(sch, p) here?
> lockdep_assert_held(&p->pi_lock);
> - } else if (!scx_locked_rq()) {
> + } else if (scx_locked_rq()) {
> + if (task_rq(p) != scx_locked_rq())
> + goto cross_task;
> + } else {
> raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
> we_locked = true;
> }
> @@ -960,6 +970,11 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
> raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
>
> return cpu;
> +
> +cross_task:
> + scx_error(sch, "select_cpu kfunc called cross-task on %s[%d]",
> + p->comm, p->pid);
> + return -EINVAL;
> }
Thanks,
-Andrea
* Re: [PATCH 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
2026-04-24 21:08 ` [PATCH 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable Tejun Heo
@ 2026-04-24 22:00 ` Andrea Righi
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
1 sibling, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-04-24 22:00 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, linux-kernel,
Emil Tsalapatis, Chris Mason, Ryan Newton
Hi Tejun,
On Fri, Apr 24, 2026 at 11:08:02AM -1000, Tejun Heo wrote:
> scx_root_enable_workfn() takes cpus_read_lock() before
> scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
> skips the matching cpus_read_unlock() - all other err_disable gotos
> along this path drop the lock first.
>
> scx_link_sched() only returns non-zero on the sub-sched path
> (parent != NULL), so the leak path is unreachable via the root
> caller today. Still, the unwind is out of line with the surrounding
> paths.
>
> Drop cpus_read_lock() before goto err_disable.
>
> Fixes: 0128c850513a ("sched_ext: Exit early on hotplug events during attach")
I think we're fixing:
25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Thanks,
-Andrea
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 52b63266e647..d374a5a16bf9 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6715,8 +6715,10 @@ static void scx_root_enable_workfn(struct kthread_work *work)
> rcu_assign_pointer(scx_root, sch);
>
> ret = scx_link_sched(sch);
> - if (ret)
> + if (ret) {
> + cpus_read_unlock();
> goto err_disable;
> + }
>
> scx_idle_enable(ops);
>
> --
> 2.53.0
>
* Re: [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (13 preceding siblings ...)
2026-04-24 21:08 ` [PATCH 14/13] sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime Tejun Heo
@ 2026-04-24 22:10 ` Andrea Righi
2026-04-25 0:39 ` Tejun Heo
15 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-04-24 22:10 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, linux-kernel,
Emil Tsalapatis, Chris Mason, Ryan Newton
On Fri, Apr 24, 2026 at 10:44:05AM -1000, Tejun Heo wrote:
> Hello,
>
> This patchset collects fixes for issues surfaced by Chris Mason's
> AI-assisted review of sched_ext. The bugs span use-after-free, leak,
> lock/state inconsistency, rq-lock AA deadlock, and cross-task kfunc
> misuse paths. Each patch stands on its own.
>
> Based on sched_ext/for-7.1-fixes (510a27055446).
Sent a couple of comments about patch 13 and patch 15.
Everything else looks good to me, feel free to add:
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
>
> 1: sched_ext: Unregister sub_kset on scheduler disable
> 2: sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
> 3: sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
> 4: sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
> 5: sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
> 6: sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
> 7: sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
> 8: sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
> 9: sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
> 10: sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
> 11: sched_ext: Make bypass LB cpumasks per-scheduler
> 12: sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
> 13: sched_ext: Refuse cross-task select_cpu_from_kfunc calls
>
> Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git fix-slop-review
>
> kernel/sched/ext.c | 238 +++++++++++++++++++++++++++++---------------
> kernel/sched/ext_idle.c | 19 +++-
> kernel/sched/ext_internal.h | 2 +
> 3 files changed, 174 insertions(+), 85 deletions(-)
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v2 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls
2026-04-24 20:44 ` [PATCH 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls Tejun Heo
2026-04-24 21:46 ` Andrea Righi
@ 2026-04-25 0:19 ` Tejun Heo
2026-04-25 6:50 ` Andrea Righi
1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2026-04-25 0:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
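For illustration, a buggy scheduler could trigger this from
ops.enqueue() with something like the following (hypothetical snippet;
the pid lookup is arbitrary, any task other than @p would do):
void BPF_STRUCT_OPS(buggy_enqueue, struct task_struct *p, u64 enq_flags)
{
	struct task_struct *p2;
	bool is_idle = false;
	/* @p2 is unrelated to @p; the rq lock held here is @p's */
	p2 = bpf_task_from_pid(1);
	if (p2) {
		/* reads p2->cpus_ptr without p2's rq or pi lock */
		scx_bpf_select_cpu_dfl(p2, 0, 0, &is_idle);
		bpf_task_release(p2);
	}
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}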
Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().
v2: Per Andrea Righi: switch the in_select_cpu cross-task check from
direct_dispatch_task comparison to scx_kf_arg_task_ok(). The former
spuriously rejects when ops.select_cpu() calls scx_bpf_dsq_insert()
first (mark_direct_dispatch() stamps direct_dispatch_task =
ERR_PTR(-ESRCH)), then calls scx_bpf_select_cpu_*() on the same task.
Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext_idle.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index c43d62d90e40..7468560a6d80 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -927,14 +927,24 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
* Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq
* lock or @p's pi_lock. Three cases:
*
- * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock.
+ * - inside ops.select_cpu(): try_to_wake_up() holds the wake-up
+ * task's pi_lock; the wake-up task is recorded in kf_tasks[0]
+ * by SCX_CALL_OP_TASK_RET().
* - other rq-locked SCX op: scx_locked_rq() points at the held rq.
* - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops):
* nothing held, take pi_lock ourselves.
+ *
+ * In the first two cases, BPF schedulers may pass an arbitrary task
+ * that the held lock doesn't cover. Refuse those.
*/
if (this_rq()->scx.in_select_cpu) {
+ if (!scx_kf_arg_task_ok(sch, p))
+ return -EINVAL;
lockdep_assert_held(&p->pi_lock);
- } else if (!scx_locked_rq()) {
+ } else if (scx_locked_rq()) {
+ if (task_rq(p) != scx_locked_rq())
+ goto cross_task;
+ } else {
raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
we_locked = true;
}
@@ -960,6 +970,11 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
return cpu;
+
+cross_task:
+ scx_error(sch, "select_cpu kfunc called cross-task on %s[%d]",
+ p->comm, p->pid);
+ return -EINVAL;
}
/**
--
2.53.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v2 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
2026-04-24 21:08 ` [PATCH 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable Tejun Heo
2026-04-24 22:00 ` Andrea Righi
@ 2026-04-25 0:19 ` Tejun Heo
2026-04-25 6:51 ` Andrea Righi
1 sibling, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2026-04-25 0:19 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton, Tejun Heo
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.
scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.
Drop cpus_read_lock() before goto err_disable.
v2: Correct Fixes: tag per Andrea Righi - the missing
cpus_read_unlock() became a (latent) bug when scx_link_sched() was
made fallible, not when err_disable was reorganized for hotplug.
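In sketch form (function name and error-path body abbreviated and made
up; only the locking shape matters):
static void enable_unwind_sketch(struct scx_sched *sch)
{
	int ret;
	cpus_read_lock();
	ret = scx_link_sched(sch);
	if (ret) {
		cpus_read_unlock();	/* the line this patch adds */
		goto err_disable;
	}
	/* rest of enable runs and drops the lock on success */
	cpus_read_unlock();
	return;
err_disable:
	/* every path landing here has already dropped the lock */
	scx_root_disable(sch);
}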
Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f333fd0cb83f..9eda20e5fdb8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6736,8 +6736,10 @@ static void scx_root_enable_workfn(struct kthread_work *work)
rcu_assign_pointer(scx_root, sch);
ret = scx_link_sched(sch);
- if (ret)
+ if (ret) {
+ cpus_read_unlock();
goto err_disable;
+ }
scx_idle_enable(ops);
--
2.53.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes
2026-04-24 20:44 [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Tejun Heo
` (14 preceding siblings ...)
2026-04-24 22:10 ` [PATCHSET sched_ext/for-7.1-fixes] sched_ext: Assorted fixes Andrea Righi
@ 2026-04-25 0:39 ` Tejun Heo
15 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2026-04-25 0:39 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, linux-kernel, Emil Tsalapatis, Chris Mason,
Ryan Newton
Hello,
Applied 1-13 plus extras 14/13 and 15/13 to sched_ext/for-7.1-fixes.
Patches 13/13 and 15/13 applied as v2; Andrea's Reviewed-by added to
the other twelve plus 14/13.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 13/13] sched_ext: Refuse cross-task select_cpu_from_kfunc calls
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
@ 2026-04-25 6:50 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-04-25 6:50 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, linux-kernel,
Emil Tsalapatis, Chris Mason, Ryan Newton
Hi Tejun,
On Fri, Apr 24, 2026 at 02:19:42PM -1000, Tejun Heo wrote:
> select_cpu_from_kfunc() skipped pi_lock for @p when called from
> ops.select_cpu() or another rq-locked SCX op, assuming the held lock
> protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
> arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
> ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
> not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
> set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
>
> Abort the scheduler on cross-task calls in both branches: for
> ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
> task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
> for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().
>
> v2: Per Andrea Righi: switch the in_select_cpu cross-task check from
> direct_dispatch_task comparison to scx_kf_arg_task_ok(). The former
> spuriously rejects when ops.select_cpu() calls scx_bpf_dsq_insert()
> first (mark_direct_dispatch() stamps direct_dispatch_task =
> ERR_PTR(-ESRCH)), then calls scx_bpf_select_cpu_*() on the same task.
>
> Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Andrea Righi <arighi@nvidia.com>
Looks good!
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> ---
> kernel/sched/ext_idle.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index c43d62d90e40..7468560a6d80 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -927,14 +927,24 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
> * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq
> * lock or @p's pi_lock. Three cases:
> *
> - * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock.
> + * - inside ops.select_cpu(): try_to_wake_up() holds the wake-up
> + * task's pi_lock; the wake-up task is recorded in kf_tasks[0]
> + * by SCX_CALL_OP_TASK_RET().
> * - other rq-locked SCX op: scx_locked_rq() points at the held rq.
> * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops):
> * nothing held, take pi_lock ourselves.
> + *
> + * In the first two cases, BPF schedulers may pass an arbitrary task
> + * that the held lock doesn't cover. Refuse those.
> */
> if (this_rq()->scx.in_select_cpu) {
> + if (!scx_kf_arg_task_ok(sch, p))
> + return -EINVAL;
> lockdep_assert_held(&p->pi_lock);
> - } else if (!scx_locked_rq()) {
> + } else if (scx_locked_rq()) {
> + if (task_rq(p) != scx_locked_rq())
> + goto cross_task;
> + } else {
> raw_spin_lock_irqsave(&p->pi_lock, irq_flags);
> we_locked = true;
> }
> @@ -960,6 +970,11 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
> raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags);
>
> return cpu;
> +
> +cross_task:
> + scx_error(sch, "select_cpu kfunc called cross-task on %s[%d]",
> + p->comm, p->pid);
> + return -EINVAL;
> }
>
> /**
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 15/13] sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
2026-04-25 0:19 ` [PATCH v2 " Tejun Heo
@ 2026-04-25 6:51 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-04-25 6:51 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, linux-kernel,
Emil Tsalapatis, Chris Mason, Ryan Newton
On Fri, Apr 24, 2026 at 02:19:51PM -1000, Tejun Heo wrote:
> scx_root_enable_workfn() takes cpus_read_lock() before
> scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
> skips the matching cpus_read_unlock() - all other err_disable gotos
> along this path drop the lock first.
>
> scx_link_sched() only returns non-zero on the sub-sched path
> (parent != NULL), so the leak path is unreachable via the root
> caller today. Still, the unwind is out of line with the surrounding
> paths.
>
> Drop cpus_read_lock() before goto err_disable.
>
> v2: Correct Fixes: tag per Andrea Righi - the missing
> cpus_read_unlock() became a (latent) bug when scx_link_sched() was
> made fallible, not when err_disable was reorganized for hotplug.
>
> Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
> Reported-by: Chris Mason <clm@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> ---
> kernel/sched/ext.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f333fd0cb83f..9eda20e5fdb8 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6736,8 +6736,10 @@ static void scx_root_enable_workfn(struct kthread_work *work)
> rcu_assign_pointer(scx_root, sch);
>
> ret = scx_link_sched(sch);
> - if (ret)
> + if (ret) {
> + cpus_read_unlock();
> goto err_disable;
> + }
>
> scx_idle_enable(ops);
>
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread