* [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter
@ 2026-04-10 6:30 Tejun Heo
2026-04-10 6:30 ` [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs Tejun Heo
` (10 more replies)
0 siblings, 11 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw)
To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
Emil Tsalapatis, linux-kernel, Tejun Heo
Hello,
This moves enforcement of SCX context-sensitive kfunc restrictions from
runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
struct_ops context information.
This is based on work by Juntong Deng and Cheng-Yang Chou:
https://lore.kernel.org/r/20260406154834.1920962-1-yphbchou0911@gmail.com
I ended up redoing the series. The number of changes needed and the
difficulty of validating each one made iterating through review emails
impractical:
- Pre-existing call-site bugs needed fixing first. ops.cgroup_move() was
mislabeled as SCX_KF_UNLOCKED when sched_move_task() actually holds the
rq lock, and set_cpus_allowed_scx() passed rq=NULL to SCX_CALL_OP_TASK
despite holding the rq lock. These had to be sorted out before the
runtime-to-verifier conversion could be validated.
- The macro-based kfunc ID deduplication (SCX_KFUNCS_*) made it hard to
verify that the new code produced the same accept/reject verdicts as
the old.
- No systematic validation of the full (kfunc, caller) verdict matrix
existed, so it wasn't clear whether the conversion was correct.
This series takes a different approach: first fix the call-site bugs that
made the conversion harder than it needed to be, then do the conversion in
small isolated steps, and verify the full verdict matrix at each stage.
The series:
01/10 Drop TRACING access to select_cpu kfuncs
02/10 Add select_cpu kfuncs to scx_kfunc_ids_unlocked
03/10 Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
04/10 Fix ops.cgroup_move() invocation kf_mask and rq tracking
05/10 Decouple kfunc unlocked-context check from kf_mask
06/10 Drop redundant rq-locked check from scx_bpf_task_cgroup()
07/10 Add verifier-time kfunc context filter
08/10 Remove runtime kfunc mask enforcement
09/10 Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
10/10 Warn on task-based SCX op recursion
Patches 1-2 are extracted from the original patchset. Patches 3-4 fix
pre-existing call-site bugs where SCX_CALL_OP_TASK passed rq=NULL despite
the kernel holding the rq lock. Patch 5 converts select_cpu_from_kfunc and
scx_dsq_move to explicit locked-state tests. Patch 6 drops the now-
redundant kf_mask check from scx_kf_allowed_on_arg_tasks. Patch 7 adds the
verifier-time filter. Patch 8 removes the runtime kf_mask machinery. Patches
9-10 are post-removal cleanup.
The full verdict matrix was verified by writing BPF test programs covering
every kfunc group from every relevant caller context, testing both baseline
and patched kernels. All in-tree example schedulers and most scx-repo
schedulers pass smoke testing on the patched kernel.
Based on sched_ext/for-7.1 (ff1befcb1683).
include/linux/sched/ext.h | 28 ---
kernel/sched/ext.c | 415 ++++++++++++++++++++---------------------
kernel/sched/ext_idle.c | 69 ++++---
kernel/sched/ext_idle.h | 2 +
kernel/sched/ext_internal.h | 8 +-
kernel/sched/sched.h | 1 +
6 files changed, 253 insertions(+), 270 deletions(-)
Git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-kf-allowed-filter
--
tejun
^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs
  2026-04-10  6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
@ 2026-04-10  6:30 ` Tejun Heo
  2026-04-10 16:04   ` Andrea Righi
  2026-04-10  6:30 ` [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked Tejun Heo
  ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2026-04-10  6:30 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel, Tejun Heo

The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and()
and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing
them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary
tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's
pi_lock state unknown.

Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu
set registered only for STRUCT_OPS and SYSCALL.

Extracted from a larger verifier-time kfunc context filter patch
originally written by Juntong Deng.

Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext_idle.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index ecf7e09b54ae..cd88aee47bd8 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -1469,9 +1469,6 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU)
-BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_idle)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
@@ -1479,13 +1476,33 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
 	.set = &scx_kfunc_ids_idle,
 };
 
+/*
+ * The select_cpu kfuncs internally call task_rq_lock() when invoked from an
+ * rq-unlocked context, and thus cannot be safely called from arbitrary tracing
+ * contexts where @p's pi_lock state is unknown. Keep them out of
+ * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed
+ * only to STRUCT_OPS and SYSCALL programs.
+ */
+BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
+	.owner = THIS_MODULE,
+	.set = &scx_kfunc_ids_select_cpu,
+};
+
 int scx_idle_init(void)
 {
 	int ret;
 
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) ||
 	      register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) ||
-	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle);
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle) ||
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) ||
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_select_cpu);
 
 	return ret;
 }
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs
  2026-04-10  6:30 ` [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs Tejun Heo
@ 2026-04-10 16:04   ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 16:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
	Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
	linux-kernel

Hi Tejun,

On Thu, Apr 09, 2026 at 08:30:37PM -1000, Tejun Heo wrote:
> The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and()
> and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing
> them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary
> tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's
> pi_lock state unknown.
> 
> Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu
> set registered only for STRUCT_OPS and SYSCALL.

In addition to being unsafe, it also makes sense from a logical
perspective not to "allocate" idle CPUs from a BPF_PROG_TYPE_TRACING
context.

> 
> Extracted from a larger verifier-time kfunc context filter patch
> originally written by Juntong Deng.
> 
> Original-patch-by: Juntong Deng <juntong.deng@outlook.com>
> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
> kernel/sched/ext_idle.c | 25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index ecf7e09b54ae..cd88aee47bd8 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -1469,9 +1469,6 @@ BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU)
> -BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
> -BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> -BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_idle)
> 
> static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> @@ -1479,13 +1476,33 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> .set = &scx_kfunc_ids_idle,
> };
> 
> +/*
> + * The select_cpu kfuncs internally call task_rq_lock() when invoked from an
> + * rq-unlocked context, and thus cannot be safely called from arbitrary tracing
> + * contexts where @p's pi_lock state is unknown. Keep them out of
> + * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed
> + * only to STRUCT_OPS and SYSCALL programs.
> + */
> +BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> +BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
> +BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> +	.owner = THIS_MODULE,
> +	.set = &scx_kfunc_ids_select_cpu,
> +};
> +
> int scx_idle_init(void)
> {
> 	int ret;
> 
> 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) ||
> 	      register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) ||
> -	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle);
> +	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle) ||
> +	      register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) ||
> +	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_select_cpu);
> 
> 	return ret;
> }
> --
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  2026-04-10  6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
  2026-04-10  6:30 ` [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs Tejun Heo
@ 2026-04-10  6:30 ` Tejun Heo
  2026-04-10 16:07   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  2026-04-10  6:30 ` [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask Tejun Heo
  ` (8 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10  6:30 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel, Tejun Heo

select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch
that accepts calls from unlocked contexts and takes task_rq_lock() itself
- a "callable from unlocked" property encoded in the kfunc body rather
than in set membership. That's fine while the runtime check is the
authoritative gate, but the upcoming verifier-time filter uses set
membership as the source of truth and needs it to reflect every context
the kfunc may be called from.

Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full
set of callable contexts is captured by set membership. This follows the
existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and
scx_bpf_dsq_move_set_{slice,vtime}, which are members of both
scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked.

While at it, add brief comments on each duplicate BTF_ID_FLAGS block
(including the pre-existing dsq_move ones) explaining the dual
membership.

No runtime behavior change: the runtime check in select_cpu_from_kfunc()
remains the authoritative gate until it is removed along with the rest
of the scx_kf_mask enforcement in a follow-up.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c      | 6 ++++++
 kernel/sched/ext_idle.c | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b757b853b42b..cf441fb4b1ad 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8497,6 +8497,7 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS)
+/* also in scx_kfunc_ids_unlocked: also callable from unlocked contexts */
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
@@ -8612,10 +8613,15 @@ __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
 BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
+/* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
+/* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */
+BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index cd88aee47bd8..8c31fb65477c 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -1482,6 +1482,10 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
  * contexts where @p's pi_lock state is unknown. Keep them out of
  * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed
  * only to STRUCT_OPS and SYSCALL programs.
+ *
+ * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because
+ * they're callable from unlocked contexts in addition to ops.select_cpu() and
+ * ops.enqueue().
  */
 BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
 BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  2026-04-10  6:30 ` [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked Tejun Heo
@ 2026-04-10 16:07   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 16:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
	Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
	linux-kernel

On Thu, Apr 09, 2026 at 08:30:38PM -1000, Tejun Heo wrote:
> select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch
> that accepts calls from unlocked contexts and takes task_rq_lock() itself
> - a "callable from unlocked" property encoded in the kfunc body rather
> than in set membership. That's fine while the runtime check is the
> authoritative gate, but the upcoming verifier-time filter uses set
> membership as the source of truth and needs it to reflect every context
> the kfunc may be called from.
> 
> Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full
> set of callable contexts is captured by set membership. This follows the
> existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and
> scx_bpf_dsq_move_set_{slice,vtime}, which are members of both
> scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked.
> 
> While at it, add brief comments on each duplicate BTF_ID_FLAGS block
> (including the pre-existing dsq_move ones) explaining the dual
> membership.
> 
> No runtime behavior change: the runtime check in select_cpu_from_kfunc()
> remains the authoritative gate until it is removed along with the rest
> of the scx_kf_mask enforcement in a follow-up.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Minor nit below.

> ---
> kernel/sched/ext.c      | 6 ++++++
> kernel/sched/ext_idle.c | 4 ++++
> 2 files changed, 10 insertions(+)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index b757b853b42b..cf441fb4b1ad 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -8497,6 +8497,7 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS)
> +/* also in scx_kfunc_ids_unlocked: also callable from unlocked contexts */
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)

Nit: we don't see it in the patch context, but below there's
scx_bpf_sub_dispatch(), which with this comment also seems to be callable
from unlocked contexts. Maybe move scx_bpf_sub_dispatch() before this
comment?

> @@ -8612,10 +8613,15 @@ __bpf_kfunc_end_defs();
> 
> BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
> BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
> +/* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
> +/* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */
> +BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
> 
> static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index cd88aee47bd8..8c31fb65477c 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -1482,6 +1482,10 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
> * contexts where @p's pi_lock state is unknown. Keep them out of
> * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed
> * only to STRUCT_OPS and SYSCALL programs.
> + *
> + * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because
> + * they're callable from unlocked contexts in addition to ops.select_cpu() and
> + * ops.enqueue().
> */
> BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
> --
> 2.53.0
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  2026-04-10  6:30 ` [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked Tejun Heo
  2026-04-10 16:07   ` Andrea Righi
@ 2026-04-10 17:51   ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10 17:51 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel

select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch
that accepts calls from unlocked contexts and takes task_rq_lock() itself
- a "callable from unlocked" property encoded in the kfunc body rather
than in set membership. That's fine while the runtime check is the
authoritative gate, but the upcoming verifier-time filter uses set
membership as the source of truth and needs it to reflect every context
the kfunc may be called from.

Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full
set of callable contexts is captured by set membership. This follows the
existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and
scx_bpf_dsq_move_set_{slice,vtime}, which are members of both
scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked.

While at it, add brief comments on each duplicate BTF_ID_FLAGS block
(including the pre-existing dsq_move ones) explaining the dual
membership.

No runtime behavior change: the runtime check in select_cpu_from_kfunc()
remains the authoritative gate until it is removed along with the rest
of the scx_kf_mask enforcement in a follow-up.

v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly
    so it doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c      | 6 ++++++
 kernel/sched/ext_idle.c | 4 ++++
 2 files changed, 10 insertions(+)

--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8497,6 +8497,7 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_s
 BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local___v2, KF_IMPLICIT_ARGS)
+/* scx_bpf_dsq_move*() also in scx_kfunc_ids_unlocked: callable from unlocked contexts */
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
@@ -8612,10 +8613,15 @@ __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
 BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
+/* also in scx_kfunc_ids_dispatch: also callable from ops.dispatch() */
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
+/* also in scx_kfunc_ids_select_cpu: also callable from ops.select_cpu()/ops.enqueue() */
+BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -1482,6 +1482,10 @@ static const struct btf_kfunc_id_set scx
  * contexts where @p's pi_lock state is unknown. Keep them out of
  * BPF_PROG_TYPE_TRACING by registering them in their own set which is exposed
  * only to STRUCT_OPS and SYSCALL programs.
+ *
+ * These kfuncs are also members of scx_kfunc_ids_unlocked (see ext.c) because
+ * they're callable from unlocked contexts in addition to ops.select_cpu() and
+ * ops.enqueue().
  */
 BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
 BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  2026-04-10  6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
  2026-04-10  6:30 ` [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs Tejun Heo
  2026-04-10  6:30 ` [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked Tejun Heo
@ 2026-04-10  6:30 ` Tejun Heo
  2026-04-10 16:12   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  2026-04-10  6:30 ` [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking Tejun Heo
  ` (7 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10  6:30 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel, Tejun Heo

The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving
scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq()
reflects reality.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index cf441fb4b1ad..6ca0085903e0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3360,7 +3360,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 	 * designation pointless. Cast it away when calling the operation.
 	 */
 	if (SCX_HAS_OP(sch, set_cpumask))
-		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL,
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, task_rq(p),
 				 p, (struct cpumask *)p->cpus_ptr);
 }
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  2026-04-10  6:30 ` [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask Tejun Heo
@ 2026-04-10 16:12   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 16:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
	Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
	linux-kernel

On Thu, Apr 09, 2026 at 08:30:39PM -1000, Tejun Heo wrote:
> The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving
> scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq()
> reflects reality.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Maybe add:

Fixes: 18853ba782be ("sched_ext: Track currently locked rq")

Apart from that, good catch.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
> kernel/sched/ext.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index cf441fb4b1ad..6ca0085903e0 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3360,7 +3360,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> 	 * designation pointless. Cast it away when calling the operation.
> 	 */
> 	if (SCX_HAS_OP(sch, set_cpumask))
> -		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL,
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, task_rq(p),
> 				 p, (struct cpumask *)p->cpus_ptr);
> }
> 
> --
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  2026-04-10  6:30 ` [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask Tejun Heo
  2026-04-10 16:12   ` Andrea Righi
@ 2026-04-10 17:51   ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10 17:51 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel

The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving
scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq()
reflects reality.

v2: Add Fixes: tag (Andrea Righi).

Fixes: 18853ba782be ("sched_ext: Track currently locked rq")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3360,7 +3360,7 @@ static void set_cpus_allowed_scx(struct
 	 * designation pointless. Cast it away when calling the operation.
 	 */
 	if (SCX_HAS_OP(sch, set_cpumask))
-		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, NULL,
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, task_rq(p),
 				 p, (struct cpumask *)p->cpus_ptr);
 }

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  2026-04-10  6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
  ` (2 preceding siblings ...)
  2026-04-10  6:30 ` [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask Tejun Heo
@ 2026-04-10  6:30 ` Tejun Heo
  2026-04-10 16:16   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  2026-04-10  6:30 ` [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask Tejun Heo
  ` (6 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2026-04-10  6:30 UTC (permalink / raw)
  To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
  Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
	Emil Tsalapatis, linux-kernel, Tejun Heo

sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so
@p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this:

- kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held.

- rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq()
  returns NULL.

Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask()
from set_cpus_allowed_scx().

Three effects:

- scx_bpf_task_cgroup() becomes callable (was rejected by
  scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held.

- scx_bpf_dsq_move() is now rejected (was allowed via the unlocked
  branch). Calling it while holding an unrelated task's rq lock is
  risky; rejection is correct.

- scx_bpf_select_cpu_*() previously took the unlocked branch in
  select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which
  would deadlock against the already-held pi_lock. Now it takes the
  locked-rq branch and is rejected with -EPERM via the existing
  kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent
  deadlock fix.

No in-tree scheduler is known to call any of these from ops.cgroup_move().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6ca0085903e0..f7db8822a544 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4397,7 +4397,7 @@ void scx_cgroup_move_task(struct task_struct *p)
 	 */
 	if (SCX_HAS_OP(sch, cgroup_move) &&
 	    !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
-		SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL,
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, cgroup_move, task_rq(p),
 				 p, p->scx.cgrp_moving_from,
 				 tg_cgrp(task_group(p)));
 	p->scx.cgrp_moving_from = NULL;
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  2026-04-10  6:30 ` [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking Tejun Heo
@ 2026-04-10 16:16   ` Andrea Righi
  2026-04-10 17:51   ` [PATCH v2 " Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
	Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
	linux-kernel

On Thu, Apr 09, 2026 at 08:30:40PM -1000, Tejun Heo wrote:
> sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so
> @p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this:
> 
> - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held.
> 
> - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq()
>   returns NULL.
> 
> Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask()
> from set_cpus_allowed_scx().
> 
> Three effects:
> 
> - scx_bpf_task_cgroup() becomes callable (was rejected by
>   scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held.
> 
> - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked
>   branch). Calling it while holding an unrelated task's rq lock is
>   risky; rejection is correct.
> 
> - scx_bpf_select_cpu_*() previously took the unlocked branch in
>   select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which
>   would deadlock against the already-held pi_lock. Now it takes the
>   locked-rq branch and is rejected with -EPERM via the existing
>   kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent
>   deadlock fix.
> 
> No in-tree scheduler is known to call any of these from ops.cgroup_move().

Similarly to the ops.set_cpumask() fix, maybe add:

Fixes: 18853ba782be ("sched_ext: Track currently locked rq")

With that:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6ca0085903e0..f7db8822a544 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4397,7 +4397,7 @@ void scx_cgroup_move_task(struct task_struct *p)
> 	 */
> 	if (SCX_HAS_OP(sch, cgroup_move) &&
> 	    !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
> -		SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL,
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, cgroup_move, task_rq(p),
> 				 p, p->scx.cgrp_moving_from,
> 				 tg_cgrp(task_group(p)));
> 	p->scx.cgrp_moving_from = NULL;
> --
> 2.53.0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking 2026-04-10 6:30 ` [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking Tejun Heo 2026-04-10 16:16 ` Andrea Righi @ 2026-04-10 17:51 ` Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Tejun Heo @ 2026-04-10 17:51 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel sched_move_task() invokes ops.cgroup_move() inside task_rq_lock(tsk), so @p's rq lock is held. The SCX_CALL_OP_TASK invocation mislabels this: - kf_mask = SCX_KF_UNLOCKED (== 0), claiming no lock is held. - rq = NULL, so update_locked_rq() doesn't run and scx_locked_rq() returns NULL. Switch to SCX_KF_REST and pass task_rq(p), matching ops.set_cpumask() from set_cpus_allowed_scx(). Three effects: - scx_bpf_task_cgroup() becomes callable (was rejected by scx_kf_allowed(__SCX_KF_RQ_LOCKED)). Safe; rq lock is held. - scx_bpf_dsq_move() is now rejected (was allowed via the unlocked branch). Calling it while holding an unrelated task's rq lock is risky; rejection is correct. - scx_bpf_select_cpu_*() previously took the unlocked branch in select_cpu_from_kfunc() and called task_rq_lock(p, &rf), which would deadlock against the already-held pi_lock. Now it takes the locked-rq branch and is rejected with -EPERM via the existing kf_allowed(SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE) check. Latent deadlock fix. No in-tree scheduler is known to call any of these from ops.cgroup_move(). v2: Add Fixes: tag (Andrea Righi). 
Fixes: 18853ba782be ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4397,7 +4397,7 @@ void scx_cgroup_move_task(struct task_st */ if (SCX_HAS_OP(sch, cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(sch, SCX_KF_UNLOCKED, cgroup_move, NULL, + SCX_CALL_OP_TASK(sch, SCX_KF_REST, cgroup_move, task_rq(p), p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask 2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo ` (3 preceding siblings ...) 2026-04-10 6:30 ` [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking Tejun Heo @ 2026-04-10 6:30 ` Tejun Heo 2026-04-10 16:34 ` Andrea Righi 2026-04-10 17:51 ` [PATCH v2 " Tejun Heo 2026-04-10 6:30 ` [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() Tejun Heo ` (5 subsequent siblings) 10 siblings, 2 replies; 26+ messages in thread From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel, Tejun Heo scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis. Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET that invokes ops.select_cpu(), to capture the one case where SCX itself holds no lock but try_to_wake_up() holds @p's pi_lock. Together with scx_locked_rq(), it expresses the same accepted-context set. select_cpu_from_kfunc() needs a runtime test because it has to take different locking paths depending on context. Open-code as a three-way branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock) directly - pi_lock alone is enough for the fields the kfunc reads, and is lighter than task_rq_lock(). scx_dsq_move() doesn't really need a runtime test - its accepted contexts could be enforced at verifier load time. But since the runtime state is already there and using it keeps the upcoming load-time filter simpler, just write it the same way: (scx_locked_rq() || in_select_cpu) && !kf_allowed(DISPATCH). 
scx_kf_allowed_if_unlocked() is deleted with the conversions. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/sched/ext.c | 4 +++- kernel/sched/ext_idle.c | 39 ++++++++++++++++--------------------- kernel/sched/ext_internal.h | 5 ----- kernel/sched/sched.h | 1 + 4 files changed, 21 insertions(+), 28 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index f7db8822a544..a0bcdc805273 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3308,10 +3308,12 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; + this_rq()->scx.in_select_cpu = true; cpu = SCX_CALL_OP_TASK_RET(sch, SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, select_cpu, NULL, p, prev_cpu, wake_flags); + this_rq()->scx.in_select_cpu = false; p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) @@ -8144,7 +8146,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, bool in_balance; unsigned long flags; - if (!scx_kf_allowed_if_unlocked() && + if ((scx_locked_rq() || this_rq()->scx.in_select_cpu) && !scx_kf_allowed(sch, SCX_KF_DISPATCH)) return false; diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index 8c31fb65477c..f99ceeba2e56 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -913,8 +913,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, s32 prev_cpu, u64 wake_flags, const struct cpumask *allowed, u64 flags) { - struct rq *rq; - struct rq_flags rf; + unsigned long irq_flags; + bool we_locked = false; s32 cpu; if (!ops_cpu_valid(sch, prev_cpu, NULL)) @@ -924,27 +924,22 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, return -EBUSY; /* - * If called from an unlocked context, acquire the task's rq lock, - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. 
+ * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq + * lock or @p's pi_lock. Three cases: * - * Otherwise, allow to use this kfunc only from ops.select_cpu() - * and ops.select_enqueue(). + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. + * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): + * nothing held, take pi_lock ourselves. */ - if (scx_kf_allowed_if_unlocked()) { - rq = task_rq_lock(p, &rf); - } else { - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) - return -EPERM; - rq = scx_locked_rq(); - } - - /* - * Validate locking correctness to access p->cpus_ptr and - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; - * otherwise, assert that p->pi_lock is held. - */ - if (!rq) + if (this_rq()->scx.in_select_cpu) { lockdep_assert_held(&p->pi_lock); + } else if (!scx_locked_rq()) { + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); + we_locked = true; + } else if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE)) { + return -EPERM; + } /* * This may also be called from ops.enqueue(), so we need to handle @@ -963,8 +958,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, allowed ?: p->cpus_ptr, flags); } - if (scx_kf_allowed_if_unlocked()) - task_rq_unlock(rq, p, &rf); + if (we_locked) + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); return cpu; } diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index b4f36d8b9c1d..54da08a223b7 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1372,11 +1372,6 @@ static inline struct rq *scx_locked_rq(void) return __this_cpu_read(scx_locked_rq_state); } -static inline bool scx_kf_allowed_if_unlocked(void) -{ - return !current->scx.kf_mask; -} - static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) { return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h 
index ae0783e27c1e..0b6a177fd597 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -798,6 +798,7 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ + bool in_select_cpu; bool cpu_released; u32 flags; u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ -- 2.53.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask 2026-04-10 6:30 ` [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask Tejun Heo @ 2026-04-10 16:34 ` Andrea Righi 2026-04-10 17:51 ` [PATCH v2 " Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Andrea Righi @ 2026-04-10 16:34 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel On Thu, Apr 09, 2026 at 08:30:41PM -1000, Tejun Heo wrote: > scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no > SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two > callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis. > > Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET > that invokes ops.select_cpu(), to capture the one case where SCX itself > holds no lock but try_to_wake_up() holds @p's pi_lock. Together with > scx_locked_rq(), it expresses the same accepted-context set. > > select_cpu_from_kfunc() needs a runtime test because it has to take > different locking paths depending on context. Open-code as a three-way > branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock) > directly - pi_lock alone is enough for the fields the kfunc reads, and is > lighter than task_rq_lock(). > > scx_dsq_move() doesn't really need a runtime test - its accepted contexts > could be enforced at verifier load time. But since the runtime state is > already there and using it keeps the upcoming load-time filter simpler, just > write it the same way: (scx_locked_rq() || in_select_cpu) && > !kf_allowed(DISPATCH). > > scx_kf_allowed_if_unlocked() is deleted with the conversions. > > No functional change. Makes sense. 
Nit: it's more of "no semantic change" rather than "no functional change", because we acquire pi_lock in the unlocked context scenario, instead of the more expensive taks_rq_lock(). Apart than that looks good. Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > > Signed-off-by: Tejun Heo <tj@kernel.org> > --- > kernel/sched/ext.c | 4 +++- > kernel/sched/ext_idle.c | 39 ++++++++++++++++--------------------- > kernel/sched/ext_internal.h | 5 ----- > kernel/sched/sched.h | 1 + > 4 files changed, 21 insertions(+), 28 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index f7db8822a544..a0bcdc805273 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -3308,10 +3308,12 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag > WARN_ON_ONCE(*ddsp_taskp); > *ddsp_taskp = p; > > + this_rq()->scx.in_select_cpu = true; > cpu = SCX_CALL_OP_TASK_RET(sch, > SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, > select_cpu, NULL, p, prev_cpu, > wake_flags); > + this_rq()->scx.in_select_cpu = false; > p->scx.selected_cpu = cpu; > *ddsp_taskp = NULL; > if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) > @@ -8144,7 +8146,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, > bool in_balance; > unsigned long flags; > > - if (!scx_kf_allowed_if_unlocked() && > + if ((scx_locked_rq() || this_rq()->scx.in_select_cpu) && > !scx_kf_allowed(sch, SCX_KF_DISPATCH)) > return false; > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > index 8c31fb65477c..f99ceeba2e56 100644 > --- a/kernel/sched/ext_idle.c > +++ b/kernel/sched/ext_idle.c > @@ -913,8 +913,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, > s32 prev_cpu, u64 wake_flags, > const struct cpumask *allowed, u64 flags) > { > - struct rq *rq; > - struct rq_flags rf; > + unsigned long irq_flags; > + bool we_locked = false; > s32 cpu; > > if (!ops_cpu_valid(sch, prev_cpu, NULL)) > @@ -924,27 +924,22 @@ static s32 
select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, > return -EBUSY; > > /* > - * If called from an unlocked context, acquire the task's rq lock, > - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. > + * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq > + * lock or @p's pi_lock. Three cases: > * > - * Otherwise, allow to use this kfunc only from ops.select_cpu() > - * and ops.select_enqueue(). > + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. > + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. > + * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): > + * nothing held, take pi_lock ourselves. > */ > - if (scx_kf_allowed_if_unlocked()) { > - rq = task_rq_lock(p, &rf); > - } else { > - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) > - return -EPERM; > - rq = scx_locked_rq(); > - } > - > - /* > - * Validate locking correctness to access p->cpus_ptr and > - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; > - * otherwise, assert that p->pi_lock is held. 
> - */ > - if (!rq) > + if (this_rq()->scx.in_select_cpu) { > lockdep_assert_held(&p->pi_lock); > + } else if (!scx_locked_rq()) { > + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); > + we_locked = true; > + } else if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE)) { > + return -EPERM; > + } > > /* > * This may also be called from ops.enqueue(), so we need to handle > @@ -963,8 +958,8 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, > allowed ?: p->cpus_ptr, flags); > } > > - if (scx_kf_allowed_if_unlocked()) > - task_rq_unlock(rq, p, &rf); > + if (we_locked) > + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); > > return cpu; > } > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index b4f36d8b9c1d..54da08a223b7 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1372,11 +1372,6 @@ static inline struct rq *scx_locked_rq(void) > return __this_cpu_read(scx_locked_rq_state); > } > > -static inline bool scx_kf_allowed_if_unlocked(void) > -{ > - return !current->scx.kf_mask; > -} > - > static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) > { > return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index ae0783e27c1e..0b6a177fd597 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -798,6 +798,7 @@ struct scx_rq { > u64 extra_enq_flags; /* see move_task_to_local_dsq() */ > u32 nr_running; > u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ > + bool in_select_cpu; > bool cpu_released; > u32 flags; > u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ > -- > 2.53.0 > ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask 2026-04-10 6:30 ` [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask Tejun Heo 2026-04-10 16:34 ` Andrea Righi @ 2026-04-10 17:51 ` Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Tejun Heo @ 2026-04-10 17:51 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel scx_kf_allowed_if_unlocked() uses !current->scx.kf_mask as a proxy for "no SCX-tracked lock held". kf_mask is removed in a follow-up patch, so its two callers - select_cpu_from_kfunc() and scx_dsq_move() - need another basis. Add a new bool scx_rq.in_select_cpu, set across the SCX_CALL_OP_TASK_RET that invokes ops.select_cpu(), to capture the one case where SCX itself holds no lock but try_to_wake_up() holds @p's pi_lock. Together with scx_locked_rq(), it expresses the same accepted-context set. select_cpu_from_kfunc() needs a runtime test because it has to take different locking paths depending on context. Open-code as a three-way branch. The unlocked branch takes raw_spin_lock_irqsave(&p->pi_lock) directly - pi_lock alone is enough for the fields the kfunc reads, and is lighter than task_rq_lock(). scx_dsq_move() doesn't really need a runtime test - its accepted contexts could be enforced at verifier load time. But since the runtime state is already there and using it keeps the upcoming load-time filter simpler, just write it the same way: (scx_locked_rq() || in_select_cpu) && !kf_allowed(DISPATCH). scx_kf_allowed_if_unlocked() is deleted with the conversions. No semantic change. v2: s/No functional change/No semantic change/ - the unlocked path now acquires pi_lock instead of the heavier task_rq_lock() (Andrea Righi). 
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 4 +++- kernel/sched/ext_idle.c | 39 +++++++++++++++++---------------------- kernel/sched/ext_internal.h | 5 ----- kernel/sched/sched.h | 1 + 4 files changed, 21 insertions(+), 28 deletions(-) --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3308,10 +3308,12 @@ static int select_task_rq_scx(struct tas WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; + this_rq()->scx.in_select_cpu = true; cpu = SCX_CALL_OP_TASK_RET(sch, SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, select_cpu, NULL, p, prev_cpu, wake_flags); + this_rq()->scx.in_select_cpu = false; p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()")) @@ -8144,7 +8146,7 @@ static bool scx_dsq_move(struct bpf_iter bool in_balance; unsigned long flags; - if (!scx_kf_allowed_if_unlocked() && + if ((scx_locked_rq() || this_rq()->scx.in_select_cpu) && !scx_kf_allowed(sch, SCX_KF_DISPATCH)) return false; --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -913,8 +913,8 @@ static s32 select_cpu_from_kfunc(struct s32 prev_cpu, u64 wake_flags, const struct cpumask *allowed, u64 flags) { - struct rq *rq; - struct rq_flags rf; + unsigned long irq_flags; + bool we_locked = false; s32 cpu; if (!ops_cpu_valid(sch, prev_cpu, NULL)) @@ -924,27 +924,22 @@ static s32 select_cpu_from_kfunc(struct return -EBUSY; /* - * If called from an unlocked context, acquire the task's rq lock, - * so that we can safely access p->cpus_ptr and p->nr_cpus_allowed. + * Accessing p->cpus_ptr / p->nr_cpus_allowed needs either @p's rq + * lock or @p's pi_lock. Three cases: * - * Otherwise, allow to use this kfunc only from ops.select_cpu() - * and ops.select_enqueue(). + * - inside ops.select_cpu(): try_to_wake_up() holds @p's pi_lock. + * - other rq-locked SCX op: scx_locked_rq() points at the held rq. 
+ * - truly unlocked (UNLOCKED ops, SYSCALL, non-SCX struct_ops): + * nothing held, take pi_lock ourselves. */ - if (scx_kf_allowed_if_unlocked()) { - rq = task_rq_lock(p, &rf); - } else { - if (!scx_kf_allowed(sch, SCX_KF_SELECT_CPU | SCX_KF_ENQUEUE)) - return -EPERM; - rq = scx_locked_rq(); - } - - /* - * Validate locking correctness to access p->cpus_ptr and - * p->nr_cpus_allowed: if we're holding an rq lock, we're safe; - * otherwise, assert that p->pi_lock is held. - */ - if (!rq) + if (this_rq()->scx.in_select_cpu) { lockdep_assert_held(&p->pi_lock); + } else if (!scx_locked_rq()) { + raw_spin_lock_irqsave(&p->pi_lock, irq_flags); + we_locked = true; + } else if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE)) { + return -EPERM; + } /* * This may also be called from ops.enqueue(), so we need to handle @@ -963,8 +958,8 @@ static s32 select_cpu_from_kfunc(struct allowed ?: p->cpus_ptr, flags); } - if (scx_kf_allowed_if_unlocked()) - task_rq_unlock(rq, p, &rf); + if (we_locked) + raw_spin_unlock_irqrestore(&p->pi_lock, irq_flags); return cpu; } --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1372,11 +1372,6 @@ static inline struct rq *scx_locked_rq(v return __this_cpu_read(scx_locked_rq_state); } -static inline bool scx_kf_allowed_if_unlocked(void) -{ - return !current->scx.kf_mask; -} - static inline bool scx_bypassing(struct scx_sched *sch, s32 cpu) { return unlikely(per_cpu_ptr(sch->pcpu, cpu)->flags & --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -798,6 +798,7 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ + bool in_select_cpu; bool cpu_released; u32 flags; u32 nr_immed; /* ENQ_IMMED tasks on local_dsq */ ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() 2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo ` (4 preceding siblings ...) 2026-04-10 6:30 ` [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask Tejun Heo @ 2026-04-10 6:30 ` Tejun Heo 2026-04-10 16:36 ` Andrea Righi 2026-04-10 6:30 ` [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter Tejun Heo ` (4 subsequent siblings) 10 siblings, 1 reply; 26+ messages in thread From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel, Tejun Heo scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED) mask check and a kf_tasks[] check. After the preceding call-site fixes, every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED non-zero, so the mask check is redundant whenever the kf_tasks[] check passes. Drop it and simplify the helper to take only @sch and @p. Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which scx_bpf_task_cgroup() now points to. No functional change. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/sched/ext.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index a0bcdc805273..6d7c5c2605c7 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -540,15 +540,18 @@ do { \ }) /* - * Some kfuncs are allowed only on the tasks that are subjects of the - * in-progress scx_ops operation for, e.g., locking guarantees. 
To enforce such - * restrictions, the following SCX_CALL_OP_*() variants should be used when - * invoking scx_ops operations that take task arguments. These can only be used - * for non-nesting operations due to the way the tasks are tracked. - * - * kfuncs which can only operate on such tasks can in turn use - * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on - * the specific task. + * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments + * and records them in current->scx.kf_tasks[] for the duration of the call. A + * kfunc invoked from inside such an op can then use + * scx_kf_allowed_on_arg_tasks() to verify that its task argument is one of + * those subject tasks. + * + * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - + * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock + * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if + * kf_tasks[] is set, @p's scheduler-protected fields are stable. + * + * These macros only work for non-nesting ops since kf_tasks[] is not stacked. */ #define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ do { \ @@ -613,12 +616,8 @@ static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) /* see SCX_CALL_OP_TASK() */ static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, - u32 mask, struct task_struct *p) { - if (!scx_kf_allowed(sch, mask)) - return false; - if (unlikely((p != current->scx.kf_tasks[0] && p != current->scx.kf_tasks[1]))) { scx_error(sch, "called on a task not being operated on"); @@ -9535,9 +9534,8 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events, * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with * from the scheduler's POV. 
SCX operations should use this function to * determine @p's current cgroup as, unlike following @p->cgroups, - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all - * rq-locked operations. Can be called on the parameter tasks of rq-locked - * operations. The restriction guarantees that @p's rq is locked by the caller. + * @p->sched_task_group is stable for the duration of the SCX op. See + * SCX_CALL_OP_TASK() for details. */ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, const struct bpf_prog_aux *aux) @@ -9552,7 +9550,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, if (unlikely(!sch)) goto out; - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p)) + if (!scx_kf_allowed_on_arg_tasks(sch, p)) goto out; cgrp = tg_cgrp(tg); -- 2.53.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() 2026-04-10 6:30 ` [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() Tejun Heo @ 2026-04-10 16:36 ` Andrea Righi 0 siblings, 0 replies; 26+ messages in thread From: Andrea Righi @ 2026-04-10 16:36 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel On Thu, Apr 09, 2026 at 08:30:42PM -1000, Tejun Heo wrote: > scx_kf_allowed_on_arg_tasks() runs both an scx_kf_allowed(__SCX_KF_RQ_LOCKED) > mask check and a kf_tasks[] check. After the preceding call-site fixes, > every SCX_CALL_OP_TASK*() invocation has kf_mask & __SCX_KF_RQ_LOCKED > non-zero, so the mask check is redundant whenever the kf_tasks[] check > passes. Drop it and simplify the helper to take only @sch and @p. > > Fold the locking guarantee into the SCX_CALL_OP_TASK() comment block, which > scx_bpf_task_cgroup() now points to. > > No functional change. > > Extracted from a larger verifier-time kfunc context filter patch > originally written by Juntong Deng. > > Original-patch-by: Juntong Deng <juntong.deng@outlook.com> > Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> > Signed-off-by: Tejun Heo <tj@kernel.org> Looks good. Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > --- > kernel/sched/ext.c | 32 +++++++++++++++----------------- > 1 file changed, 15 insertions(+), 17 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index a0bcdc805273..6d7c5c2605c7 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -540,15 +540,18 @@ do { \ > }) > > /* > - * Some kfuncs are allowed only on the tasks that are subjects of the > - * in-progress scx_ops operation for, e.g., locking guarantees. 
To enforce such > - * restrictions, the following SCX_CALL_OP_*() variants should be used when > - * invoking scx_ops operations that take task arguments. These can only be used > - * for non-nesting operations due to the way the tasks are tracked. > - * > - * kfuncs which can only operate on such tasks can in turn use > - * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on > - * the specific task. > + * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments > + * and records them in current->scx.kf_tasks[] for the duration of the call. A > + * kfunc invoked from inside such an op can then use > + * scx_kf_allowed_on_arg_tasks() to verify that its task argument is one of > + * those subject tasks. > + * > + * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held - > + * either via the @rq argument here, or (for ops.select_cpu()) via @p's pi_lock > + * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if > + * kf_tasks[] is set, @p's scheduler-protected fields are stable. > + * > + * These macros only work for non-nesting ops since kf_tasks[] is not stacked. > */ > #define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ > do { \ > @@ -613,12 +616,8 @@ static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) > > /* see SCX_CALL_OP_TASK() */ > static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, > - u32 mask, > struct task_struct *p) > { > - if (!scx_kf_allowed(sch, mask)) > - return false; > - > if (unlikely((p != current->scx.kf_tasks[0] && > p != current->scx.kf_tasks[1]))) { > scx_error(sch, "called on a task not being operated on"); > @@ -9535,9 +9534,8 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events, > * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with > * from the scheduler's POV. 
SCX operations should use this function to > * determine @p's current cgroup as, unlike following @p->cgroups, > - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all > - * rq-locked operations. Can be called on the parameter tasks of rq-locked > - * operations. The restriction guarantees that @p's rq is locked by the caller. > + * @p->sched_task_group is stable for the duration of the SCX op. See > + * SCX_CALL_OP_TASK() for details. > */ > __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, > const struct bpf_prog_aux *aux) > @@ -9552,7 +9550,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p, > if (unlikely(!sch)) > goto out; > > - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p)) > + if (!scx_kf_allowed_on_arg_tasks(sch, p)) > goto out; > > cgrp = tg_cgrp(tg); > -- > 2.53.0 > ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter 2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo ` (5 preceding siblings ...) 2026-04-10 6:30 ` [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() Tejun Heo @ 2026-04-10 6:30 ` Tejun Heo 2026-04-10 16:49 ` Andrea Righi 2026-04-10 6:30 ` [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement Tejun Heo ` (3 subsequent siblings) 10 siblings, 1 reply; 26+ messages in thread From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel, Tejun Heo Move enforcement of SCX context-sensitive kfunc restrictions from per-task runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's struct_ops context information. A shared .filter callback is attached to each context-sensitive BTF set and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX ops member offset. Disallowed calls are now rejected at program load time instead of at runtime. The old model split reachability across two places: each SCX_CALL_OP*() set bits naming its op context, and each kfunc's scx_kf_allowed() check OR'd together the bits it accepted. A kfunc was callable when those two masks overlapped. The new model transposes the result to the caller side - each op's allow flags directly list the kfunc groups it may call. 
The old bit assignments were: Call-site bits: ops.select_cpu = ENQUEUE | SELECT_CPU ops.enqueue = ENQUEUE ops.dispatch = DISPATCH ops.cpu_release = CPU_RELEASE Kfunc-group accepted bits: enqueue group = ENQUEUE | DISPATCH select_cpu group = SELECT_CPU | ENQUEUE dispatch group = DISPATCH cpu_release group = CPU_RELEASE Intersecting them yields the reachability now expressed directly by scx_kf_allow_flags[]: ops.select_cpu -> SELECT_CPU | ENQUEUE ops.enqueue -> SELECT_CPU | ENQUEUE ops.dispatch -> ENQUEUE | DISPATCH ops.cpu_release -> CPU_RELEASE Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs; that maps directly to UNLOCKED in the new table. Equivalence was checked by walking every (op, kfunc-group) combination across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old scx_kf_allowed() runtime checks. With two intended exceptions (see below), all combinations reach the same verdict; disallowed calls are now caught at load time instead of firing scx_error() at runtime. scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are exceptions: they have no runtime check at all, but the new filter rejects them from ops outside dispatch/unlocked. The affected cases are nonsensical - the values these setters store are only read by scx_bpf_dsq_move{,_vtime}(), which is itself restricted to dispatch/unlocked, so a setter call from anywhere else was already dead code. Runtime scx_kf_mask enforcement is left in place by this patch and removed in a follow-up. 
Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/sched/ext.c | 124 ++++++++++++++++++++++++++++++++++-- kernel/sched/ext_idle.c | 1 + kernel/sched/ext_idle.h | 2 + kernel/sched/ext_internal.h | 3 + 4 files changed, 125 insertions(+), 5 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6d7c5c2605c7..81a4fea4c6b6 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -8133,6 +8133,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_enqueue_dispatch, + .filter = scx_kfunc_context_filter, }; static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, @@ -8511,6 +8512,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_dispatch) static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_dispatch, + .filter = scx_kfunc_context_filter, }; __bpf_kfunc_start_defs(); @@ -8551,6 +8553,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_cpu_release, + .filter = scx_kfunc_context_filter, }; __bpf_kfunc_start_defs(); @@ -8628,6 +8631,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_unlocked) static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_unlocked, + .filter = scx_kfunc_context_filter, }; __bpf_kfunc_start_defs(); @@ -9603,6 +9607,115 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = { .set = &scx_kfunc_ids_any, }; +/* + * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc + * group; an op may permit zero or more groups, with the union expressed in + * scx_kf_allow_flags[]. 
The verifier-time filter (scx_kfunc_context_filter()) + * consults this table to decide whether a context-sensitive kfunc is callable + * from a given SCX op. + */ +enum scx_kf_allow_flags { + SCX_KF_ALLOW_UNLOCKED = 1 << 0, + SCX_KF_ALLOW_CPU_RELEASE = 1 << 1, + SCX_KF_ALLOW_DISPATCH = 1 << 2, + SCX_KF_ALLOW_ENQUEUE = 1 << 3, + SCX_KF_ALLOW_SELECT_CPU = 1 << 4, +}; + +/* + * Map each SCX op to the union of kfunc groups it permits, indexed by + * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not + * context-sensitive. + */ +static const u32 scx_kf_allow_flags[] = { + [SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, + [SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, + [SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH, + [SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE, + [SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED, +#ifdef CONFIG_EXT_GROUP_SCHED + [SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED, +#endif /* CONFIG_EXT_GROUP_SCHED */ + [SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED, + [SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED, +}; + +/* + * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the + * .filter field on each per-group btf_kfunc_id_set. 
The BPF core invokes this + * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or + * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. + * scx_kfunc_ids_any) by falling through to "allow" when none of the + * context-sensitive sets contain the kfunc. + */ +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) +{ + bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id); + bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id); + bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); + bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); + bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); + u32 moff, flags; + + /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) + return 0; + + /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ + if (prog->type == BPF_PROG_TYPE_SYSCALL) + return (in_unlocked || in_select_cpu) ? 0 : -EACCES; + + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) + return -EACCES; + + /* + * add_subprog_and_kfunc() collects all kfunc calls, including dead code + * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets + * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set; + * do_check_main() re-runs the filter with st_ops set and enforces the + * actual restrictions. + */ + if (!prog->aux->st_ops) + return 0; + + /* + * Non-SCX struct_ops: only unlocked kfuncs are safe. The other + * context-sensitive kfuncs assume the rq lock is held by the SCX + * dispatch path, which doesn't apply to other struct_ops users. + */ + if (prog->aux->st_ops != &bpf_sched_ext_ops) + return in_unlocked ? 
0 : -EACCES; + + /* SCX struct_ops: check the per-op allow list. */ + moff = prog->aux->attach_st_ops_member_off; + flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; + + if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked) + return 0; + if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release) + return 0; + if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch) + return 0; + if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue) + return 0; + if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu) + return 0; + + return -EACCES; +} + static int __init scx_init(void) { int ret; @@ -9612,11 +9725,12 @@ static int __init scx_init(void) * register_btf_kfunc_id_set() needs most of the system to be up. * * Some kfuncs are context-sensitive and can only be called from - * specific SCX ops. They are grouped into BTF sets accordingly. - * Unfortunately, BPF currently doesn't have a way of enforcing such - * restrictions. Eventually, the verifier should be able to enforce - * them. For now, register them the same and make each kfunc explicitly - * check using scx_kf_allowed(). + * specific SCX ops. They are grouped into per-context BTF sets, each + * registered with scx_kfunc_context_filter as its .filter callback. The + * BPF core dedups identical filter pointers per hook + * (btf_populate_kfunc_set()), so the filter is invoked exactly once per + * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op + * restrictions at verify time. 
*/ if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_enqueue_dispatch)) || diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index f99ceeba2e56..ec49e0c9892e 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -1491,6 +1491,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { .owner = THIS_MODULE, .set = &scx_kfunc_ids_select_cpu, + .filter = scx_kfunc_context_filter, }; int scx_idle_init(void) diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h index fa583f141f35..dc35f850481e 100644 --- a/kernel/sched/ext_idle.h +++ b/kernel/sched/ext_idle.h @@ -12,6 +12,8 @@ struct sched_ext_ops; +extern struct btf_id_set8 scx_kfunc_ids_select_cpu; + void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); void scx_idle_init_masks(void); diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 54da08a223b7..62ce4eaf6a3f 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -6,6 +6,7 @@ * Copyright (c) 2025 Tejun Heo <tj@kernel.org> */ #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) +#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) enum scx_consts { SCX_DSP_DFL_MAX_BATCH = 32, @@ -1363,6 +1364,8 @@ enum scx_ops_state { extern struct scx_sched __rcu *scx_root; DECLARE_PER_CPU(struct rq *, scx_locked_rq_state); +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id); + /* * Return the rq currently locked from an scx callback, or NULL if no rq is * locked. -- 2.53.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter 2026-04-10 6:30 ` [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter Tejun Heo @ 2026-04-10 16:49 ` Andrea Righi 0 siblings, 0 replies; 26+ messages in thread From: Andrea Righi @ 2026-04-10 16:49 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel On Thu, Apr 09, 2026 at 08:30:43PM -1000, Tejun Heo wrote: > Move enforcement of SCX context-sensitive kfunc restrictions from per-task > runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's > struct_ops context information. > > A shared .filter callback is attached to each context-sensitive BTF set > and consults a per-op allow table (scx_kf_allow_flags[]) indexed by SCX > ops member offset. Disallowed calls are now rejected at program load time > instead of at runtime. > > The old model split reachability across two places: each SCX_CALL_OP*() > set bits naming its op context, and each kfunc's scx_kf_allowed() check > OR'd together the bits it accepted. A kfunc was callable when those two > masks overlapped. The new model transposes the result to the caller side - > each op's allow flags directly list the kfunc groups it may call. 
The old > bit assignments were: > > Call-site bits: > ops.select_cpu = ENQUEUE | SELECT_CPU > ops.enqueue = ENQUEUE > ops.dispatch = DISPATCH > ops.cpu_release = CPU_RELEASE > > Kfunc-group accepted bits: > enqueue group = ENQUEUE | DISPATCH > select_cpu group = SELECT_CPU | ENQUEUE > dispatch group = DISPATCH > cpu_release group = CPU_RELEASE > > Intersecting them yields the reachability now expressed directly by > scx_kf_allow_flags[]: > > ops.select_cpu -> SELECT_CPU | ENQUEUE > ops.enqueue -> SELECT_CPU | ENQUEUE > ops.dispatch -> ENQUEUE | DISPATCH > ops.cpu_release -> CPU_RELEASE > > Unlocked ops carried no kf_mask bits and reached only unlocked kfuncs; > that maps directly to UNLOCKED in the new table. > > Equivalence was checked by walking every (op, kfunc-group) combination > across SCX ops, SYSCALL, and non-SCX struct_ops callers against the old > scx_kf_allowed() runtime checks. With two intended exceptions (see below), > all combinations reach the same verdict; disallowed calls are now caught at > load time instead of firing scx_error() at runtime. > > scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() are > exceptions: they have no runtime check at all, but the new filter rejects > them from ops outside dispatch/unlocked. The affected cases are nonsensical > - the values these setters store are only read by > scx_bpf_dsq_move{,_vtime}(), which is itself restricted to > dispatch/unlocked, so a setter call from anywhere else was already dead > code. > > Runtime scx_kf_mask enforcement is left in place by this patch and removed > in a follow-up. > > Original-patch-by: Juntong Deng <juntong.deng@outlook.com> > Original-patch-by: Cheng-Yang Chou <yphbchou0911@gmail.com> > Signed-off-by: Tejun Heo <tj@kernel.org> Looks good. 
Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > --- > kernel/sched/ext.c | 124 ++++++++++++++++++++++++++++++++++-- > kernel/sched/ext_idle.c | 1 + > kernel/sched/ext_idle.h | 2 + > kernel/sched/ext_internal.h | 3 + > 4 files changed, 125 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 6d7c5c2605c7..81a4fea4c6b6 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -8133,6 +8133,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch) > static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = { > .owner = THIS_MODULE, > .set = &scx_kfunc_ids_enqueue_dispatch, > + .filter = scx_kfunc_context_filter, > }; > > static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, > @@ -8511,6 +8512,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_dispatch) > static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = { > .owner = THIS_MODULE, > .set = &scx_kfunc_ids_dispatch, > + .filter = scx_kfunc_context_filter, > }; > > __bpf_kfunc_start_defs(); > @@ -8551,6 +8553,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_cpu_release) > static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = { > .owner = THIS_MODULE, > .set = &scx_kfunc_ids_cpu_release, > + .filter = scx_kfunc_context_filter, > }; > > __bpf_kfunc_start_defs(); > @@ -8628,6 +8631,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_unlocked) > static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = { > .owner = THIS_MODULE, > .set = &scx_kfunc_ids_unlocked, > + .filter = scx_kfunc_context_filter, > }; > > __bpf_kfunc_start_defs(); > @@ -9603,6 +9607,115 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = { > .set = &scx_kfunc_ids_any, > }; > > +/* > + * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc > + * group; an op may permit zero or more groups, with the union expressed in > + * scx_kf_allow_flags[]. 
The verifier-time filter (scx_kfunc_context_filter()) > + * consults this table to decide whether a context-sensitive kfunc is callable > + * from a given SCX op. > + */ > +enum scx_kf_allow_flags { > + SCX_KF_ALLOW_UNLOCKED = 1 << 0, > + SCX_KF_ALLOW_CPU_RELEASE = 1 << 1, > + SCX_KF_ALLOW_DISPATCH = 1 << 2, > + SCX_KF_ALLOW_ENQUEUE = 1 << 3, > + SCX_KF_ALLOW_SELECT_CPU = 1 << 4, > +}; > + > +/* > + * Map each SCX op to the union of kfunc groups it permits, indexed by > + * SCX_OP_IDX(op). Ops not listed only permit kfuncs that are not > + * context-sensitive. > + */ > +static const u32 scx_kf_allow_flags[] = { > + [SCX_OP_IDX(select_cpu)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, > + [SCX_OP_IDX(enqueue)] = SCX_KF_ALLOW_SELECT_CPU | SCX_KF_ALLOW_ENQUEUE, > + [SCX_OP_IDX(dispatch)] = SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH, > + [SCX_OP_IDX(cpu_release)] = SCX_KF_ALLOW_CPU_RELEASE, > + [SCX_OP_IDX(init_task)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(dump)] = SCX_KF_ALLOW_UNLOCKED, > +#ifdef CONFIG_EXT_GROUP_SCHED > + [SCX_OP_IDX(cgroup_init)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_exit)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_prep_move)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_cancel_move)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_set_weight)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_set_bandwidth)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cgroup_set_idle)] = SCX_KF_ALLOW_UNLOCKED, > +#endif /* CONFIG_EXT_GROUP_SCHED */ > + [SCX_OP_IDX(sub_attach)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED, > + [SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED, > +}; > + > +/* > + * Verifier-time filter for context-sensitive SCX kfuncs. Registered via the > + * .filter field on each per-group btf_kfunc_id_set. 
The BPF core invokes this > + * for every kfunc call in the registered hook (BPF_PROG_TYPE_STRUCT_OPS or > + * BPF_PROG_TYPE_SYSCALL), regardless of which set originally introduced the > + * kfunc - so the filter must short-circuit on kfuncs it doesn't govern (e.g. > + * scx_kfunc_ids_any) by falling through to "allow" when none of the > + * context-sensitive sets contain the kfunc. > + */ > +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id) > +{ > + bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id); > + bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id); > + bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id); > + bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id); > + bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id); > + u32 moff, flags; > + > + /* Not a context-sensitive kfunc (e.g. from scx_kfunc_ids_any) - allow. */ > + if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch || in_cpu_release)) > + return 0; > + > + /* SYSCALL progs (e.g. BPF test_run()) may call unlocked and select_cpu kfuncs. */ > + if (prog->type == BPF_PROG_TYPE_SYSCALL) > + return (in_unlocked || in_select_cpu) ? 0 : -EACCES; > + > + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS) > + return -EACCES; > + > + /* > + * add_subprog_and_kfunc() collects all kfunc calls, including dead code > + * guarded by bpf_ksym_exists(), before check_attach_btf_id() sets > + * prog->aux->st_ops. Allow all kfuncs when st_ops is not yet set; > + * do_check_main() re-runs the filter with st_ops set and enforces the > + * actual restrictions. > + */ > + if (!prog->aux->st_ops) > + return 0; > + > + /* > + * Non-SCX struct_ops: only unlocked kfuncs are safe. The other > + * context-sensitive kfuncs assume the rq lock is held by the SCX > + * dispatch path, which doesn't apply to other struct_ops users. 
> + */ > + if (prog->aux->st_ops != &bpf_sched_ext_ops) > + return in_unlocked ? 0 : -EACCES; > + > + /* SCX struct_ops: check the per-op allow list. */ > + moff = prog->aux->attach_st_ops_member_off; > + flags = scx_kf_allow_flags[SCX_MOFF_IDX(moff)]; > + > + if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked) > + return 0; > + if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release) > + return 0; > + if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch) > + return 0; > + if ((flags & SCX_KF_ALLOW_ENQUEUE) && in_enqueue) > + return 0; > + if ((flags & SCX_KF_ALLOW_SELECT_CPU) && in_select_cpu) > + return 0; > + > + return -EACCES; > +} > + > static int __init scx_init(void) > { > int ret; > @@ -9612,11 +9725,12 @@ static int __init scx_init(void) > * register_btf_kfunc_id_set() needs most of the system to be up. > * > * Some kfuncs are context-sensitive and can only be called from > - * specific SCX ops. They are grouped into BTF sets accordingly. > - * Unfortunately, BPF currently doesn't have a way of enforcing such > - * restrictions. Eventually, the verifier should be able to enforce > - * them. For now, register them the same and make each kfunc explicitly > - * check using scx_kf_allowed(). > + * specific SCX ops. They are grouped into per-context BTF sets, each > + * registered with scx_kfunc_context_filter as its .filter callback. The > + * BPF core dedups identical filter pointers per hook > + * (btf_populate_kfunc_set()), so the filter is invoked exactly once per > + * kfunc lookup; it consults scx_kf_allow_flags[] to enforce per-op > + * restrictions at verify time. 
> */ > if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, > &scx_kfunc_set_enqueue_dispatch)) || > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > index f99ceeba2e56..ec49e0c9892e 100644 > --- a/kernel/sched/ext_idle.c > +++ b/kernel/sched/ext_idle.c > @@ -1491,6 +1491,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_select_cpu) > static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = { > .owner = THIS_MODULE, > .set = &scx_kfunc_ids_select_cpu, > + .filter = scx_kfunc_context_filter, > }; > > int scx_idle_init(void) > diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h > index fa583f141f35..dc35f850481e 100644 > --- a/kernel/sched/ext_idle.h > +++ b/kernel/sched/ext_idle.h > @@ -12,6 +12,8 @@ > > struct sched_ext_ops; > > +extern struct btf_id_set8 scx_kfunc_ids_select_cpu; > + > void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops); > void scx_idle_init_masks(void); > > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 54da08a223b7..62ce4eaf6a3f 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -6,6 +6,7 @@ > * Copyright (c) 2025 Tejun Heo <tj@kernel.org> > */ > #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) > +#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) > > enum scx_consts { > SCX_DSP_DFL_MAX_BATCH = 32, > @@ -1363,6 +1364,8 @@ enum scx_ops_state { > extern struct scx_sched __rcu *scx_root; > DECLARE_PER_CPU(struct rq *, scx_locked_rq_state); > > +int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id); > + > /* > * Return the rq currently locked from an scx callback, or NULL if no rq is > * locked. > -- > 2.53.0 > ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement 2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo ` (6 preceding siblings ...) 2026-04-10 6:30 ` [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter Tejun Heo @ 2026-04-10 6:30 ` Tejun Heo 2026-04-10 16:50 ` Andrea Righi 2026-04-10 6:30 ` [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() Tejun Heo ` (2 subsequent siblings) 10 siblings, 1 reply; 26+ messages in thread From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw) To: sched-ext, David Vernet, Andrea Righi, Changwoo Min Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel, Tejun Heo From: Cheng-Yang Chou <yphbchou0911@gmail.com> Now that scx_kfunc_context_filter enforces context-sensitive kfunc restrictions at BPF load time, the per-task runtime enforcement via scx_kf_mask is redundant. Remove it entirely: - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along with the higher_bits()/highest_bit() helpers they used. - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET macros and update every call site. Reflow call sites that were wrapped only to fit the old 5-arg form and now collapse onto a single line under ~100 cols. - Remove the in-kfunc scx_kf_allowed() runtime checks from scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call guard inside select_cpu_from_kfunc(). scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already cleaned up in the "drop redundant rq-locked check" patch. scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" patch. 
No further changes to those helpers here. Co-developed-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/sched/ext.h | 28 ----- kernel/sched/ext.c | 244 +++++++++----------------------------- kernel/sched/ext_idle.c | 4 +- 3 files changed, 58 insertions(+), 218 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index 602dc83cab36..1a3af2ea2a79 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -147,33 +147,6 @@ enum scx_ent_dsq_flags { SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ }; -/* - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from - * everywhere and the following bits track which kfunc sets are currently - * allowed for %current. This simple per-task tracking works because SCX ops - * nest in a limited way. BPF will likely implement a way to allow and disallow - * kfuncs depending on the calling context which will replace this manual - * mechanism. See scx_kf_allow(). - */ -enum scx_kf_mask { - SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ - /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ - SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ - /* - * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and - * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be - * nested inside DISPATCH. 
- */ - SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ - SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ - SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ - SCX_KF_REST = 1 << 4, /* other rq-locked operations */ - - __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, - __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, -}; - enum scx_dsq_lnode_flags { SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0, @@ -221,7 +194,6 @@ struct sched_ext_entity { s32 sticky_cpu; s32 holding_cpu; s32 selected_cpu; - u32 kf_mask; /* see scx_kf_mask above */ struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ struct list_head runnable_node; /* rq->scx.runnable_list */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 81a4fea4c6b6..70e9434a9a0d 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -229,19 +229,6 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now) return -(long)jiffies_to_msecs(now - at); } -/* if the highest set bit is N, return a mask with bits [N+1, 31] set */ -static u32 higher_bits(u32 flags) -{ - return ~((1 << fls(flags)) - 1); -} - -/* return the mask with only the highest bit set */ -static u32 highest_bit(u32 flags) -{ - int bit = fls(flags); - return ((u64)1 << bit) >> 1; -} - static bool u32_before(u32 a, u32 b) { return (s32)(a - b) < 0; @@ -462,30 +449,6 @@ static bool rq_is_open(struct rq *rq, u64 enq_flags) return false; } -/* - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check - * whether it's running from an allowed context. - * - * @mask is constant, always inline to cull the mask calculations. 
- */ -static __always_inline void scx_kf_allow(u32 mask) -{ - /* nesting is allowed only in increasing scx_kf_mask order */ - WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, - "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", - current->scx.kf_mask, mask); - current->scx.kf_mask |= mask; - barrier(); -} - -static void scx_kf_disallow(u32 mask) -{ - barrier(); - current->scx.kf_mask &= ~mask; -} - /* * Track the rq currently locked. * @@ -506,34 +469,22 @@ static inline void update_locked_rq(struct rq *rq) __this_cpu_write(scx_locked_rq_state, rq); } -#define SCX_CALL_OP(sch, mask, op, rq, args...) \ +#define SCX_CALL_OP(sch, op, rq, args...) \ do { \ if (rq) \ update_locked_rq(rq); \ - if (mask) { \ - scx_kf_allow(mask); \ - (sch)->ops.op(args); \ - scx_kf_disallow(mask); \ - } else { \ - (sch)->ops.op(args); \ - } \ + (sch)->ops.op(args); \ if (rq) \ update_locked_rq(NULL); \ } while (0) -#define SCX_CALL_OP_RET(sch, mask, op, rq, args...) \ +#define SCX_CALL_OP_RET(sch, op, rq, args...) \ ({ \ __typeof__((sch)->ops.op(args)) __ret; \ \ if (rq) \ update_locked_rq(rq); \ - if (mask) { \ - scx_kf_allow(mask); \ - __ret = (sch)->ops.op(args); \ - scx_kf_disallow(mask); \ - } else { \ - __ret = (sch)->ops.op(args); \ - } \ + __ret = (sch)->ops.op(args); \ if (rq) \ update_locked_rq(NULL); \ __ret; \ @@ -553,67 +504,33 @@ do { \ * * These macros only work for non-nesting ops since kf_tasks[] is not stacked. */ -#define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ +#define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \ do { \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - SCX_CALL_OP((sch), mask, op, rq, task, ##args); \ + SCX_CALL_OP((sch), op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ } while (0) -#define SCX_CALL_OP_TASK_RET(sch, mask, op, rq, task, args...) \ +#define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) 
\ ({ \ __typeof__((sch)->ops.op(task, ##args)) __ret; \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task, ##args); \ + __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ __ret; \ }) -#define SCX_CALL_OP_2TASKS_RET(sch, mask, op, rq, task0, task1, args...) \ +#define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \ ({ \ __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task0; \ current->scx.kf_tasks[1] = task1; \ - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task0, task1, ##args); \ + __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ current->scx.kf_tasks[0] = NULL; \ current->scx.kf_tasks[1] = NULL; \ __ret; \ }) -/* @mask is constant, always inline to cull unnecessary branches */ -static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) -{ - if (unlikely(!(current->scx.kf_mask & mask))) { - scx_error(sch, "kfunc with mask 0x%x called from an operation only allowing 0x%x", - mask, current->scx.kf_mask); - return false; - } - - /* - * Enforce nesting boundaries. e.g. A kfunc which can be called from - * DISPATCH must not be called if we're running DEQUEUE which is nested - * inside ops.dispatch(). We don't need to check boundaries for any - * blocking kfuncs as the verifier ensures they're only called from - * sleepable progs. 
- */ - if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && - (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { - scx_error(sch, "cpu_release kfunc called from a nested operation"); - return false; - } - - if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && - (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { - scx_error(sch, "dispatch kfunc called from a nested operation"); - return false; - } - - return true; -} - /* see SCX_CALL_OP_TASK() */ static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, struct task_struct *p) @@ -1461,7 +1378,7 @@ static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, return; if (SCX_HAS_OP(sch, dequeue)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags); + SCX_CALL_OP_TASK(sch, dequeue, rq, p, deq_flags); p->scx.flags &= ~SCX_TASK_IN_CUSTODY; } @@ -1920,7 +1837,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); + SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -2024,7 +1941,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_ add_nr_running(rq, 1); if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags); + SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p); @@ -2141,11 +2058,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_ */ if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false); + SCX_CALL_OP_TASK(sch, stopping, rq, p, false); } if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags); + SCX_CALL_OP_TASK(sch, 
quiescent, rq, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -2167,7 +2084,7 @@ static void yield_task_scx(struct rq *rq) struct scx_sched *sch = scx_task_sched(p); if (SCX_HAS_OP(sch, yield)) - SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL); + SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL); else p->scx.slice = 0; } @@ -2178,8 +2095,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) struct scx_sched *sch = scx_task_sched(from); if (SCX_HAS_OP(sch, yield) && sch == scx_task_sched(to)) - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, - from, to); + return SCX_CALL_OP_2TASKS_RET(sch, yield, rq, from, to); else return false; } @@ -2799,20 +2715,11 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, dspc->nr_tasks = 0; if (nested) { - /* - * If nested, don't update kf_mask as the originating - * invocation would already have set it up. - */ - SCX_CALL_OP(sch, 0, dispatch, rq, cpu, - prev_on_sch ? prev : NULL); + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); } else { - /* - * If not nested, stash @prev so that nested invocations - * can access it. - */ + /* stash @prev so that nested invocations can access it */ rq->scx.sub_dispatch_prev = prev; - SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq, cpu, - prev_on_sch ? prev : NULL); + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); rq->scx.sub_dispatch_prev = NULL; } @@ -2871,7 +2778,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * emitted in switch_class(). 
*/ if (SCX_HAS_OP(sch, cpu_acquire)) - SCX_CALL_OP(sch, SCX_KF_REST, cpu_acquire, rq, cpu, NULL); + SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL); rq->scx.cpu_released = false; } @@ -2950,7 +2857,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, running, rq, p); + SCX_CALL_OP_TASK(sch, running, rq, p); clr_task_runnable(p, true); @@ -3022,8 +2929,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) .task = next, }; - SCX_CALL_OP(sch, SCX_KF_CPU_RELEASE, cpu_release, rq, - cpu_of(rq), &args); + SCX_CALL_OP(sch, cpu_release, rq, cpu_of(rq), &args); } rq->scx.cpu_released = true; } @@ -3041,7 +2947,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, true); + SCX_CALL_OP_TASK(sch, stopping, rq, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); @@ -3271,7 +3177,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, */ if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && !scx_bypassing(sch_a, task_cpu(a))) - return SCX_CALL_OP_2TASKS_RET(sch_a, SCX_KF_REST, core_sched_before, + return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, NULL, (struct task_struct *)a, (struct task_struct *)b); @@ -3308,10 +3214,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag *ddsp_taskp = p; this_rq()->scx.in_select_cpu = true; - cpu = SCX_CALL_OP_TASK_RET(sch, - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, - select_cpu, NULL, p, prev_cpu, - wake_flags); + cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags); this_rq()->scx.in_select_cpu = false; p->scx.selected_cpu = cpu; 
*ddsp_taskp = NULL; @@ -3361,8 +3264,7 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. */ if (SCX_HAS_OP(sch, set_cpumask)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, task_rq(p), - p, (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr); } static void handle_hotplug(struct rq *rq, bool online) @@ -3384,9 +3286,9 @@ static void handle_hotplug(struct rq *rq, bool online) scx_idle_update_selcpu_topology(&sch->ops); if (online && SCX_HAS_OP(sch, cpu_online)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); + SCX_CALL_OP(sch, cpu_online, NULL, cpu); else if (!online && SCX_HAS_OP(sch, cpu_offline)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_offline, NULL, cpu); + SCX_CALL_OP(sch, cpu_offline, NULL, cpu); else scx_exit(sch, SCX_EXIT_UNREG_KERN, SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, @@ -3504,7 +3406,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(sch, tick)) { - SCX_CALL_OP_TASK(sch, SCX_KF_REST, tick, rq, curr); + SCX_CALL_OP_TASK(sch, tick, rq, curr); } if (!curr->scx.slice) @@ -3580,8 +3482,7 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo .fork = fork, }; - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init_task, NULL, - p, &args); + ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err(sch, "init_task", ret); return ret; @@ -3662,11 +3563,10 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p) p->scx.weight = sched_weight_to_cgroup(weight); if (SCX_HAS_OP(sch, enable)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, enable, rq, p); + SCX_CALL_OP_TASK(sch, enable, rq, p); if (SCX_HAS_OP(sch, set_weight)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, - p, p->scx.weight); + SCX_CALL_OP_TASK(sch, 
set_weight, rq, p, p->scx.weight); } static void scx_enable_task(struct scx_sched *sch, struct task_struct *p) @@ -3685,7 +3585,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p) clear_direct_dispatch(p); if (SCX_HAS_OP(sch, disable)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); + SCX_CALL_OP_TASK(sch, disable, rq, p); scx_set_task_state(p, SCX_TASK_READY); /* @@ -3723,8 +3623,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch, } if (SCX_HAS_OP(sch, exit_task)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, exit_task, task_rq(p), - p, &args); + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); } static void scx_disable_and_exit_task(struct scx_sched *sch, @@ -3903,8 +3802,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(sch, set_weight)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, - p, p->scx.weight); + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) @@ -3925,8 +3823,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) * different scheduler class. Keep the BPF scheduler up-to-date. 
*/ if (SCX_HAS_OP(sch, set_cpumask)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, rq, - p, (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr); } static void switched_from_scx(struct rq *rq, struct task_struct *p) @@ -4309,7 +4206,7 @@ int scx_tg_online(struct task_group *tg) .bw_quota_us = tg->scx.bw_quota_us, .bw_burst_us = tg->scx.bw_burst_us }; - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, tg->css.cgroup, &args); if (ret) ret = ops_sanitize_err(sch, "cgroup_init", ret); @@ -4331,8 +4228,7 @@ void scx_tg_offline(struct task_group *tg) if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_exit) && (tg->scx.flags & SCX_TG_INITED)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, - tg->css.cgroup); + SCX_CALL_OP(sch, cgroup_exit, NULL, tg->css.cgroup); tg->scx.flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); } @@ -4361,8 +4257,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) continue; if (SCX_HAS_OP(sch, cgroup_prep_move)) { - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, - cgroup_prep_move, NULL, + ret = SCX_CALL_OP_RET(sch, cgroup_prep_move, NULL, p, from, css->cgroup); if (ret) goto err; @@ -4377,7 +4272,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(sch, cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -4398,7 +4293,7 @@ void scx_cgroup_move_task(struct task_struct *p) */ if (SCX_HAS_OP(sch, cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, cgroup_move, task_rq(p), + SCX_CALL_OP_TASK(sch, cgroup_move, task_rq(p), p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; @@ -4416,7 +4311,7 @@ void 
scx_cgroup_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(sch, cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -4430,8 +4325,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && tg->scx.weight != weight) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_weight, NULL, - tg_cgrp(tg), weight); + SCX_CALL_OP(sch, cgroup_set_weight, NULL, tg_cgrp(tg), weight); tg->scx.weight = weight; @@ -4445,8 +4339,7 @@ void scx_group_set_idle(struct task_group *tg, bool idle) percpu_down_read(&scx_cgroup_ops_rwsem); if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_idle, NULL, - tg_cgrp(tg), idle); + SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); /* Update the task group's idle state */ tg->scx.idle = idle; @@ -4465,7 +4358,7 @@ void scx_group_set_bandwidth(struct task_group *tg, (tg->scx.bw_period_us != period_us || tg->scx.bw_quota_us != quota_us || tg->scx.bw_burst_us != burst_us)) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_bandwidth, NULL, + SCX_CALL_OP(sch, cgroup_set_bandwidth, NULL, tg_cgrp(tg), period_us, quota_us, burst_us); tg->scx.bw_period_us = period_us; @@ -4690,8 +4583,7 @@ static void scx_cgroup_exit(struct scx_sched *sch) if (!sch->ops.cgroup_exit) continue; - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, - css->cgroup); + SCX_CALL_OP(sch, cgroup_exit, NULL, css->cgroup); } } @@ -4722,7 +4614,7 @@ static int scx_cgroup_init(struct scx_sched *sch) continue; } - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, NULL, + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, css->cgroup, &args); if (ret) { scx_error(sch, "ops.cgroup_init() failed (%d)", ret); @@ -5795,12 
+5687,12 @@ static void scx_sub_disable(struct scx_sched *sch) .ops = &sch->ops, .cgroup_path = sch->cgrp_path, }; - SCX_CALL_OP(parent, SCX_KF_UNLOCKED, sub_detach, NULL, + SCX_CALL_OP(parent, sub_detach, NULL, &sub_detach_args); } if (sch->ops.exit) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, sch->exit_info); + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); kobject_del(&sch->kobj); } #else /* CONFIG_EXT_SUB_SCHED */ @@ -5915,7 +5807,7 @@ static void scx_root_disable(struct scx_sched *sch) } if (sch->ops.exit) - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, ei); + SCX_CALL_OP(sch, exit, NULL, ei); scx_unlink_sched(sch); @@ -6178,7 +6070,7 @@ static void scx_dump_task(struct scx_sched *sch, if (SCX_HAS_OP(sch, dump_task)) { ops_dump_init(s, " "); - SCX_CALL_OP(sch, SCX_KF_REST, dump_task, NULL, dctx, p); + SCX_CALL_OP(sch, dump_task, NULL, dctx, p); ops_dump_exit(); } @@ -6242,7 +6134,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, if (SCX_HAS_OP(sch, dump)) { ops_dump_init(&s, ""); - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, dump, NULL, &dctx); + SCX_CALL_OP(sch, dump, NULL, &dctx); ops_dump_exit(); } @@ -6302,7 +6194,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, used = seq_buf_used(&ns); if (SCX_HAS_OP(sch, dump_cpu)) { ops_dump_init(&ns, " "); - SCX_CALL_OP(sch, SCX_KF_REST, dump_cpu, NULL, + SCX_CALL_OP(sch, dump_cpu, NULL, &dctx, cpu, idle); ops_dump_exit(); } @@ -6748,7 +6640,7 @@ static void scx_root_enable_workfn(struct kthread_work *work) scx_idle_enable(ops); if (sch->ops.init) { - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); + ret = SCX_CALL_OP_RET(sch, init, NULL); if (ret) { ret = ops_sanitize_err(sch, "init", ret); cpus_read_unlock(); @@ -7020,7 +6912,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work) } if (sch->ops.init) { - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); + ret = SCX_CALL_OP_RET(sch, init, NULL); if (ret) { ret = ops_sanitize_err(sch, 
"init", ret); scx_error(sch, "ops.init() failed (%d)", ret); @@ -7037,7 +6929,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work) .cgroup_path = sch->cgrp_path, }; - ret = SCX_CALL_OP_RET(parent, SCX_KF_UNLOCKED, sub_attach, NULL, + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, &sub_attach_args); if (ret) { ret = ops_sanitize_err(sch, "sub_attach", ret); @@ -7891,9 +7783,6 @@ static bool scx_vet_enq_flags(struct scx_sched *sch, u64 dsq_id, u64 *enq_flags) static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, u64 dsq_id, u64 *enq_flags) { - if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) - return false; - lockdep_assert_irqs_disabled(); if (unlikely(!p)) { @@ -8146,10 +8035,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, bool in_balance; unsigned long flags; - if ((scx_locked_rq() || this_rq()->scx.in_select_cpu) && - !scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return false; - if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) return false; @@ -8244,9 +8129,6 @@ __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(const struct bpf_prog_aux *aux) if (unlikely(!sch)) return 0; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return 0; - return sch->dsp_max_batch - __this_cpu_read(sch->pcpu->dsp_ctx.cursor); } @@ -8268,9 +8150,6 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux) if (unlikely(!sch)) return; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return; - dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; if (dspc->cursor > 0) @@ -8317,9 +8196,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags, if (unlikely(!sch)) return false; - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) - return false; - if (!scx_vet_enq_flags(sch, SCX_DSQ_LOCAL, &enq_flags)) return false; @@ -8473,9 +8349,6 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux * if (unlikely(!parent)) return false; - if (!scx_kf_allowed(parent, SCX_KF_DISPATCH)) - return false; - 
child = scx_find_sub_sched(cgroup_id); if (unlikely(!child)) @@ -8535,9 +8408,6 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux) if (unlikely(!sch)) return 0; - if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE)) - return 0; - rq = cpu_rq(smp_processor_id()); lockdep_assert_rq_held(rq); diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index ec49e0c9892e..443d12a3df67 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -789,7 +789,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) */ if (SCX_HAS_OP(sch, update_idle) && do_notify && !scx_bypassing(sch, cpu_of(rq))) - SCX_CALL_OP(sch, SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); + SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle); } static void reset_idle_masks(struct sched_ext_ops *ops) @@ -937,8 +937,6 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, } else if (!scx_locked_rq()) { raw_spin_lock_irqsave(&p->pi_lock, irq_flags); we_locked = true; - } else if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE)) { - return -EPERM; } /* -- 2.53.0
* Re: [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement 2026-04-10 6:30 ` [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement Tejun Heo @ 2026-04-10 16:50 ` Andrea Righi 0 siblings, 0 replies; 26+ messages in thread From: Andrea Righi @ 2026-04-10 16:50 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis, linux-kernel On Thu, Apr 09, 2026 at 08:30:44PM -1000, Tejun Heo wrote: > From: Cheng-Yang Chou <yphbchou0911@gmail.com> > > Now that scx_kfunc_context_filter enforces context-sensitive kfunc > restrictions at BPF load time, the per-task runtime enforcement via > scx_kf_mask is redundant. Remove it entirely: > > - Delete enum scx_kf_mask, the kf_mask field on sched_ext_entity, and > the scx_kf_allow()/scx_kf_disallow()/scx_kf_allowed() helpers along > with the higher_bits()/highest_bit() helpers they used. > - Strip the @mask parameter (and the BUILD_BUG_ON checks) from the > SCX_CALL_OP[_RET]/SCX_CALL_OP_TASK[_RET]/SCX_CALL_OP_2TASKS_RET > macros and update every call site. Reflow call sites that were > wrapped only to fit the old 5-arg form and now collapse onto a single > line under ~100 cols. > - Remove the in-kfunc scx_kf_allowed() runtime checks from > scx_dsq_insert_preamble(), scx_dsq_move(), scx_bpf_dispatch_nr_slots(), > scx_bpf_dispatch_cancel(), scx_bpf_dsq_move_to_local___v2(), > scx_bpf_sub_dispatch(), scx_bpf_reenqueue_local(), and the per-call > guard inside select_cpu_from_kfunc(). > > scx_bpf_task_cgroup() and scx_kf_allowed_on_arg_tasks() were already > cleaned up in the "drop redundant rq-locked check" patch. > scx_kf_allowed_if_unlocked() was rewritten in the preceding "decouple" > patch. No further changes to those helpers here. 
> > Co-developed-by: Juntong Deng <juntong.deng@outlook.com> > Signed-off-by: Juntong Deng <juntong.deng@outlook.com> > Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> > Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > --- > include/linux/sched/ext.h | 28 ----- > kernel/sched/ext.c | 244 +++++++++----------------------------- > kernel/sched/ext_idle.c | 4 +- > 3 files changed, 58 insertions(+), 218 deletions(-) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index 602dc83cab36..1a3af2ea2a79 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -147,33 +147,6 @@ enum scx_ent_dsq_flags { > SCX_TASK_DSQ_ON_PRIQ = 1 << 0, /* task is queued on the priority queue of a dsq */ > }; > > -/* > - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from > - * everywhere and the following bits track which kfunc sets are currently > - * allowed for %current. This simple per-task tracking works because SCX ops > - * nest in a limited way. BPF will likely implement a way to allow and disallow > - * kfuncs depending on the calling context which will replace this manual > - * mechanism. See scx_kf_allow(). > - */ > -enum scx_kf_mask { > - SCX_KF_UNLOCKED = 0, /* sleepable and not rq locked */ > - /* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */ > - SCX_KF_CPU_RELEASE = 1 << 0, /* ops.cpu_release() */ > - /* > - * ops.dispatch() may release rq lock temporarily and thus ENQUEUE and > - * SELECT_CPU may be nested inside. ops.dequeue (in REST) may also be > - * nested inside DISPATCH. 
> - */ > - SCX_KF_DISPATCH = 1 << 1, /* ops.dispatch() */ > - SCX_KF_ENQUEUE = 1 << 2, /* ops.enqueue() and ops.select_cpu() */ > - SCX_KF_SELECT_CPU = 1 << 3, /* ops.select_cpu() */ > - SCX_KF_REST = 1 << 4, /* other rq-locked operations */ > - > - __SCX_KF_RQ_LOCKED = SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH | > - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, > - __SCX_KF_TERMINAL = SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST, > -}; > - > enum scx_dsq_lnode_flags { > SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0, > > @@ -221,7 +194,6 @@ struct sched_ext_entity { > s32 sticky_cpu; > s32 holding_cpu; > s32 selected_cpu; > - u32 kf_mask; /* see scx_kf_mask above */ > struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ > > struct list_head runnable_node; /* rq->scx.runnable_list */ > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 81a4fea4c6b6..70e9434a9a0d 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -229,19 +229,6 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now) > return -(long)jiffies_to_msecs(now - at); > } > > -/* if the highest set bit is N, return a mask with bits [N+1, 31] set */ > -static u32 higher_bits(u32 flags) > -{ > - return ~((1 << fls(flags)) - 1); > -} > - > -/* return the mask with only the highest bit set */ > -static u32 highest_bit(u32 flags) > -{ > - int bit = fls(flags); > - return ((u64)1 << bit) >> 1; > -} > - > static bool u32_before(u32 a, u32 b) > { > return (s32)(a - b) < 0; > @@ -462,30 +449,6 @@ static bool rq_is_open(struct rq *rq, u64 enq_flags) > return false; > } > > -/* > - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX > - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate > - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check > - * whether it's running from an allowed context. > - * > - * @mask is constant, always inline to cull the mask calculations. 
> - */ > -static __always_inline void scx_kf_allow(u32 mask) > -{ > - /* nesting is allowed only in increasing scx_kf_mask order */ > - WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask, > - "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n", > - current->scx.kf_mask, mask); > - current->scx.kf_mask |= mask; > - barrier(); > -} > - > -static void scx_kf_disallow(u32 mask) > -{ > - barrier(); > - current->scx.kf_mask &= ~mask; > -} > - > /* > * Track the rq currently locked. > * > @@ -506,34 +469,22 @@ static inline void update_locked_rq(struct rq *rq) > __this_cpu_write(scx_locked_rq_state, rq); > } > > -#define SCX_CALL_OP(sch, mask, op, rq, args...) \ > +#define SCX_CALL_OP(sch, op, rq, args...) \ > do { \ > if (rq) \ > update_locked_rq(rq); \ > - if (mask) { \ > - scx_kf_allow(mask); \ > - (sch)->ops.op(args); \ > - scx_kf_disallow(mask); \ > - } else { \ > - (sch)->ops.op(args); \ > - } \ > + (sch)->ops.op(args); \ > if (rq) \ > update_locked_rq(NULL); \ > } while (0) > > -#define SCX_CALL_OP_RET(sch, mask, op, rq, args...) \ > +#define SCX_CALL_OP_RET(sch, op, rq, args...) \ > ({ \ > __typeof__((sch)->ops.op(args)) __ret; \ > \ > if (rq) \ > update_locked_rq(rq); \ > - if (mask) { \ > - scx_kf_allow(mask); \ > - __ret = (sch)->ops.op(args); \ > - scx_kf_disallow(mask); \ > - } else { \ > - __ret = (sch)->ops.op(args); \ > - } \ > + __ret = (sch)->ops.op(args); \ > if (rq) \ > update_locked_rq(NULL); \ > __ret; \ > @@ -553,67 +504,33 @@ do { \ > * > * These macros only work for non-nesting ops since kf_tasks[] is not stacked. > */ > -#define SCX_CALL_OP_TASK(sch, mask, op, rq, task, args...) \ > +#define SCX_CALL_OP_TASK(sch, op, rq, task, args...) 
\ > do { \ > - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ > current->scx.kf_tasks[0] = task; \ > - SCX_CALL_OP((sch), mask, op, rq, task, ##args); \ > + SCX_CALL_OP((sch), op, rq, task, ##args); \ > current->scx.kf_tasks[0] = NULL; \ > } while (0) > > -#define SCX_CALL_OP_TASK_RET(sch, mask, op, rq, task, args...) \ > +#define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \ > ({ \ > __typeof__((sch)->ops.op(task, ##args)) __ret; \ > - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ > current->scx.kf_tasks[0] = task; \ > - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task, ##args); \ > + __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \ > current->scx.kf_tasks[0] = NULL; \ > __ret; \ > }) > > -#define SCX_CALL_OP_2TASKS_RET(sch, mask, op, rq, task0, task1, args...) \ > +#define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \ > ({ \ > __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \ > - BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ > current->scx.kf_tasks[0] = task0; \ > current->scx.kf_tasks[1] = task1; \ > - __ret = SCX_CALL_OP_RET((sch), mask, op, rq, task0, task1, ##args); \ > + __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \ > current->scx.kf_tasks[0] = NULL; \ > current->scx.kf_tasks[1] = NULL; \ > __ret; \ > }) > > -/* @mask is constant, always inline to cull unnecessary branches */ > -static __always_inline bool scx_kf_allowed(struct scx_sched *sch, u32 mask) > -{ > - if (unlikely(!(current->scx.kf_mask & mask))) { > - scx_error(sch, "kfunc with mask 0x%x called from an operation only allowing 0x%x", > - mask, current->scx.kf_mask); > - return false; > - } > - > - /* > - * Enforce nesting boundaries. e.g. A kfunc which can be called from > - * DISPATCH must not be called if we're running DEQUEUE which is nested > - * inside ops.dispatch(). We don't need to check boundaries for any > - * blocking kfuncs as the verifier ensures they're only called from > - * sleepable progs. 
> - */ > - if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE && > - (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) { > - scx_error(sch, "cpu_release kfunc called from a nested operation"); > - return false; > - } > - > - if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH && > - (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) { > - scx_error(sch, "dispatch kfunc called from a nested operation"); > - return false; > - } > - > - return true; > -} > - > /* see SCX_CALL_OP_TASK() */ > static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch, > struct task_struct *p) > @@ -1461,7 +1378,7 @@ static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, > return; > > if (SCX_HAS_OP(sch, dequeue)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags); > + SCX_CALL_OP_TASK(sch, dequeue, rq, p, deq_flags); > > p->scx.flags &= ~SCX_TASK_IN_CUSTODY; > } > @@ -1920,7 +1837,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > WARN_ON_ONCE(*ddsp_taskp); > *ddsp_taskp = p; > > - SCX_CALL_OP_TASK(sch, SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); > + SCX_CALL_OP_TASK(sch, enqueue, rq, p, enq_flags); > > *ddsp_taskp = NULL; > if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) > @@ -2024,7 +1941,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_ > add_nr_running(rq, 1); > > if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, runnable, rq, p, enq_flags); > + SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags); > > if (enq_flags & SCX_ENQ_WAKEUP) > touch_core_sched(rq, p); > @@ -2141,11 +2058,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_ > */ > if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) { > update_curr_scx(rq); > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, false); > + SCX_CALL_OP_TASK(sch, stopping, rq, p, false); > } > > if (SCX_HAS_OP(sch, quiescent) && 
!task_on_rq_migrating(p)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, quiescent, rq, p, deq_flags); > + SCX_CALL_OP_TASK(sch, quiescent, rq, p, deq_flags); > > if (deq_flags & SCX_DEQ_SLEEP) > p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; > @@ -2167,7 +2084,7 @@ static void yield_task_scx(struct rq *rq) > struct scx_sched *sch = scx_task_sched(p); > > if (SCX_HAS_OP(sch, yield)) > - SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, p, NULL); > + SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL); > else > p->scx.slice = 0; > } > @@ -2178,8 +2095,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) > struct scx_sched *sch = scx_task_sched(from); > > if (SCX_HAS_OP(sch, yield) && sch == scx_task_sched(to)) > - return SCX_CALL_OP_2TASKS_RET(sch, SCX_KF_REST, yield, rq, > - from, to); > + return SCX_CALL_OP_2TASKS_RET(sch, yield, rq, from, to); > else > return false; > } > @@ -2799,20 +2715,11 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, > dspc->nr_tasks = 0; > > if (nested) { > - /* > - * If nested, don't update kf_mask as the originating > - * invocation would already have set it up. > - */ > - SCX_CALL_OP(sch, 0, dispatch, rq, cpu, > - prev_on_sch ? prev : NULL); > + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); > } else { > - /* > - * If not nested, stash @prev so that nested invocations > - * can access it. > - */ > + /* stash @prev so that nested invocations can access it */ > rq->scx.sub_dispatch_prev = prev; > - SCX_CALL_OP(sch, SCX_KF_DISPATCH, dispatch, rq, cpu, > - prev_on_sch ? prev : NULL); > + SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL); > rq->scx.sub_dispatch_prev = NULL; > } > > @@ -2871,7 +2778,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) > * emitted in switch_class(). 
> */ > if (SCX_HAS_OP(sch, cpu_acquire)) > - SCX_CALL_OP(sch, SCX_KF_REST, cpu_acquire, rq, cpu, NULL); > + SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL); > rq->scx.cpu_released = false; > } > > @@ -2950,7 +2857,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) > > /* see dequeue_task_scx() on why we skip when !QUEUED */ > if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, running, rq, p); > + SCX_CALL_OP_TASK(sch, running, rq, p); > > clr_task_runnable(p, true); > > @@ -3022,8 +2929,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) > .task = next, > }; > > - SCX_CALL_OP(sch, SCX_KF_CPU_RELEASE, cpu_release, rq, > - cpu_of(rq), &args); > + SCX_CALL_OP(sch, cpu_release, rq, cpu_of(rq), &args); > } > rq->scx.cpu_released = true; > } > @@ -3041,7 +2947,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, > > /* see dequeue_task_scx() on why we skip when !QUEUED */ > if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, stopping, rq, p, true); > + SCX_CALL_OP_TASK(sch, stopping, rq, p, true); > > if (p->scx.flags & SCX_TASK_QUEUED) { > set_task_runnable(rq, p); > @@ -3271,7 +3177,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, > */ > if (sch_a == sch_b && SCX_HAS_OP(sch_a, core_sched_before) && > !scx_bypassing(sch_a, task_cpu(a))) > - return SCX_CALL_OP_2TASKS_RET(sch_a, SCX_KF_REST, core_sched_before, > + return SCX_CALL_OP_2TASKS_RET(sch_a, core_sched_before, > NULL, > (struct task_struct *)a, > (struct task_struct *)b); > @@ -3308,10 +3214,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag > *ddsp_taskp = p; > > this_rq()->scx.in_select_cpu = true; > - cpu = SCX_CALL_OP_TASK_RET(sch, > - SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, > - select_cpu, NULL, p, prev_cpu, > - wake_flags); > + cpu = SCX_CALL_OP_TASK_RET(sch, 
select_cpu, NULL, p, prev_cpu, wake_flags); > this_rq()->scx.in_select_cpu = false; > p->scx.selected_cpu = cpu; > *ddsp_taskp = NULL; > @@ -3361,8 +3264,7 @@ static void set_cpus_allowed_scx(struct task_struct *p, > * designation pointless. Cast it away when calling the operation. > */ > if (SCX_HAS_OP(sch, set_cpumask)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, task_rq(p), > - p, (struct cpumask *)p->cpus_ptr); > + SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr); > } > > static void handle_hotplug(struct rq *rq, bool online) > @@ -3384,9 +3286,9 @@ static void handle_hotplug(struct rq *rq, bool online) > scx_idle_update_selcpu_topology(&sch->ops); > > if (online && SCX_HAS_OP(sch, cpu_online)) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_online, NULL, cpu); > + SCX_CALL_OP(sch, cpu_online, NULL, cpu); > else if (!online && SCX_HAS_OP(sch, cpu_offline)) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cpu_offline, NULL, cpu); > + SCX_CALL_OP(sch, cpu_offline, NULL, cpu); > else > scx_exit(sch, SCX_EXIT_UNREG_KERN, > SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, > @@ -3504,7 +3406,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) > curr->scx.slice = 0; > touch_core_sched(rq, curr); > } else if (SCX_HAS_OP(sch, tick)) { > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, tick, rq, curr); > + SCX_CALL_OP_TASK(sch, tick, rq, curr); > } > > if (!curr->scx.slice) > @@ -3580,8 +3482,7 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo > .fork = fork, > }; > > - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init_task, NULL, > - p, &args); > + ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args); > if (unlikely(ret)) { > ret = ops_sanitize_err(sch, "init_task", ret); > return ret; > @@ -3662,11 +3563,10 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p) > p->scx.weight = sched_weight_to_cgroup(weight); > > if (SCX_HAS_OP(sch, enable)) > - SCX_CALL_OP_TASK(sch, 
SCX_KF_REST, enable, rq, p); > + SCX_CALL_OP_TASK(sch, enable, rq, p); > > if (SCX_HAS_OP(sch, set_weight)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, > - p, p->scx.weight); > + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); > } > > static void scx_enable_task(struct scx_sched *sch, struct task_struct *p) > @@ -3685,7 +3585,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p) > clear_direct_dispatch(p); > > if (SCX_HAS_OP(sch, disable)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); > + SCX_CALL_OP_TASK(sch, disable, rq, p); > scx_set_task_state(p, SCX_TASK_READY); > > /* > @@ -3723,8 +3623,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch, > } > > if (SCX_HAS_OP(sch, exit_task)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, exit_task, task_rq(p), > - p, &args); > + SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args); > } > > static void scx_disable_and_exit_task(struct scx_sched *sch, > @@ -3903,8 +3802,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, > > p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); > if (SCX_HAS_OP(sch, set_weight)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_weight, rq, > - p, p->scx.weight); > + SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight); > } > > static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio) > @@ -3925,8 +3823,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) > * different scheduler class. Keep the BPF scheduler up-to-date. 
> */ > if (SCX_HAS_OP(sch, set_cpumask)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, set_cpumask, rq, > - p, (struct cpumask *)p->cpus_ptr); > + SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr); > } > > static void switched_from_scx(struct rq *rq, struct task_struct *p) > @@ -4309,7 +4206,7 @@ int scx_tg_online(struct task_group *tg) > .bw_quota_us = tg->scx.bw_quota_us, > .bw_burst_us = tg->scx.bw_burst_us }; > > - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, > + ret = SCX_CALL_OP_RET(sch, cgroup_init, > NULL, tg->css.cgroup, &args); > if (ret) > ret = ops_sanitize_err(sch, "cgroup_init", ret); > @@ -4331,8 +4228,7 @@ void scx_tg_offline(struct task_group *tg) > > if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_exit) && > (tg->scx.flags & SCX_TG_INITED)) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, > - tg->css.cgroup); > + SCX_CALL_OP(sch, cgroup_exit, NULL, tg->css.cgroup); > tg->scx.flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); > } > > @@ -4361,8 +4257,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) > continue; > > if (SCX_HAS_OP(sch, cgroup_prep_move)) { > - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, > - cgroup_prep_move, NULL, > + ret = SCX_CALL_OP_RET(sch, cgroup_prep_move, NULL, > p, from, css->cgroup); > if (ret) > goto err; > @@ -4377,7 +4272,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) > cgroup_taskset_for_each(p, css, tset) { > if (SCX_HAS_OP(sch, cgroup_cancel_move) && > p->scx.cgrp_moving_from) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, > + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, > p, p->scx.cgrp_moving_from, css->cgroup); > p->scx.cgrp_moving_from = NULL; > } > @@ -4398,7 +4293,7 @@ void scx_cgroup_move_task(struct task_struct *p) > */ > if (SCX_HAS_OP(sch, cgroup_move) && > !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, cgroup_move, task_rq(p), > + SCX_CALL_OP_TASK(sch, cgroup_move, task_rq(p), > p, p->scx.cgrp_moving_from, > 
tg_cgrp(task_group(p))); > p->scx.cgrp_moving_from = NULL; > @@ -4416,7 +4311,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) > cgroup_taskset_for_each(p, css, tset) { > if (SCX_HAS_OP(sch, cgroup_cancel_move) && > p->scx.cgrp_moving_from) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, > + SCX_CALL_OP(sch, cgroup_cancel_move, NULL, > p, p->scx.cgrp_moving_from, css->cgroup); > p->scx.cgrp_moving_from = NULL; > } > @@ -4430,8 +4325,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) > > if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_weight) && > tg->scx.weight != weight) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_weight, NULL, > - tg_cgrp(tg), weight); > + SCX_CALL_OP(sch, cgroup_set_weight, NULL, tg_cgrp(tg), weight); > > tg->scx.weight = weight; > > @@ -4445,8 +4339,7 @@ void scx_group_set_idle(struct task_group *tg, bool idle) > percpu_down_read(&scx_cgroup_ops_rwsem); > > if (scx_cgroup_enabled && SCX_HAS_OP(sch, cgroup_set_idle)) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_idle, NULL, > - tg_cgrp(tg), idle); > + SCX_CALL_OP(sch, cgroup_set_idle, NULL, tg_cgrp(tg), idle); > > /* Update the task group's idle state */ > tg->scx.idle = idle; > @@ -4465,7 +4358,7 @@ void scx_group_set_bandwidth(struct task_group *tg, > (tg->scx.bw_period_us != period_us || > tg->scx.bw_quota_us != quota_us || > tg->scx.bw_burst_us != burst_us)) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_set_bandwidth, NULL, > + SCX_CALL_OP(sch, cgroup_set_bandwidth, NULL, > tg_cgrp(tg), period_us, quota_us, burst_us); > > tg->scx.bw_period_us = period_us; > @@ -4690,8 +4583,7 @@ static void scx_cgroup_exit(struct scx_sched *sch) > if (!sch->ops.cgroup_exit) > continue; > > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, cgroup_exit, NULL, > - css->cgroup); > + SCX_CALL_OP(sch, cgroup_exit, NULL, css->cgroup); > } > } > > @@ -4722,7 +4614,7 @@ static int scx_cgroup_init(struct scx_sched *sch) > continue; > } > > - ret = 
SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, cgroup_init, NULL, > + ret = SCX_CALL_OP_RET(sch, cgroup_init, NULL, > css->cgroup, &args); > if (ret) { > scx_error(sch, "ops.cgroup_init() failed (%d)", ret); > @@ -5795,12 +5687,12 @@ static void scx_sub_disable(struct scx_sched *sch) > .ops = &sch->ops, > .cgroup_path = sch->cgrp_path, > }; > - SCX_CALL_OP(parent, SCX_KF_UNLOCKED, sub_detach, NULL, > + SCX_CALL_OP(parent, sub_detach, NULL, > &sub_detach_args); > } > > if (sch->ops.exit) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, sch->exit_info); > + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); > kobject_del(&sch->kobj); > } > #else /* CONFIG_EXT_SUB_SCHED */ > @@ -5915,7 +5807,7 @@ static void scx_root_disable(struct scx_sched *sch) > } > > if (sch->ops.exit) > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, exit, NULL, ei); > + SCX_CALL_OP(sch, exit, NULL, ei); > > scx_unlink_sched(sch); > > @@ -6178,7 +6070,7 @@ static void scx_dump_task(struct scx_sched *sch, > > if (SCX_HAS_OP(sch, dump_task)) { > ops_dump_init(s, " "); > - SCX_CALL_OP(sch, SCX_KF_REST, dump_task, NULL, dctx, p); > + SCX_CALL_OP(sch, dump_task, NULL, dctx, p); > ops_dump_exit(); > } > > @@ -6242,7 +6134,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > > if (SCX_HAS_OP(sch, dump)) { > ops_dump_init(&s, ""); > - SCX_CALL_OP(sch, SCX_KF_UNLOCKED, dump, NULL, &dctx); > + SCX_CALL_OP(sch, dump, NULL, &dctx); > ops_dump_exit(); > } > > @@ -6302,7 +6194,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei, > used = seq_buf_used(&ns); > if (SCX_HAS_OP(sch, dump_cpu)) { > ops_dump_init(&ns, " "); > - SCX_CALL_OP(sch, SCX_KF_REST, dump_cpu, NULL, > + SCX_CALL_OP(sch, dump_cpu, NULL, > &dctx, cpu, idle); > ops_dump_exit(); > } > @@ -6748,7 +6640,7 @@ static void scx_root_enable_workfn(struct kthread_work *work) > scx_idle_enable(ops); > > if (sch->ops.init) { > - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); > + ret = SCX_CALL_OP_RET(sch, init, 
NULL); > if (ret) { > ret = ops_sanitize_err(sch, "init", ret); > cpus_read_unlock(); > @@ -7020,7 +6912,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work) > } > > if (sch->ops.init) { > - ret = SCX_CALL_OP_RET(sch, SCX_KF_UNLOCKED, init, NULL); > + ret = SCX_CALL_OP_RET(sch, init, NULL); > if (ret) { > ret = ops_sanitize_err(sch, "init", ret); > scx_error(sch, "ops.init() failed (%d)", ret); > @@ -7037,7 +6929,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work) > .cgroup_path = sch->cgrp_path, > }; > > - ret = SCX_CALL_OP_RET(parent, SCX_KF_UNLOCKED, sub_attach, NULL, > + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, > &sub_attach_args); > if (ret) { > ret = ops_sanitize_err(sch, "sub_attach", ret); > @@ -7891,9 +7783,6 @@ static bool scx_vet_enq_flags(struct scx_sched *sch, u64 dsq_id, u64 *enq_flags) > static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p, > u64 dsq_id, u64 *enq_flags) > { > - if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE | SCX_KF_DISPATCH)) > - return false; > - > lockdep_assert_irqs_disabled(); > > if (unlikely(!p)) { > @@ -8146,10 +8035,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, > bool in_balance; > unsigned long flags; > > - if ((scx_locked_rq() || this_rq()->scx.in_select_cpu) && > - !scx_kf_allowed(sch, SCX_KF_DISPATCH)) > - return false; > - > if (!scx_vet_enq_flags(sch, dsq_id, &enq_flags)) > return false; > > @@ -8244,9 +8129,6 @@ __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(const struct bpf_prog_aux *aux) > if (unlikely(!sch)) > return 0; > > - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) > - return 0; > - > return sch->dsp_max_batch - __this_cpu_read(sch->pcpu->dsp_ctx.cursor); > } > > @@ -8268,9 +8150,6 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(const struct bpf_prog_aux *aux) > if (unlikely(!sch)) > return; > > - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) > - return; > - > dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; > > if (dspc->cursor > 0) > @@ -8317,9 +8196,6 @@ 
__bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags, > if (unlikely(!sch)) > return false; > > - if (!scx_kf_allowed(sch, SCX_KF_DISPATCH)) > - return false; > - > if (!scx_vet_enq_flags(sch, SCX_DSQ_LOCAL, &enq_flags)) > return false; > > @@ -8473,9 +8349,6 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux * > if (unlikely(!parent)) > return false; > > - if (!scx_kf_allowed(parent, SCX_KF_DISPATCH)) > - return false; > - > child = scx_find_sub_sched(cgroup_id); > > if (unlikely(!child)) > @@ -8535,9 +8408,6 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux) > if (unlikely(!sch)) > return 0; > > - if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE)) > - return 0; > - > rq = cpu_rq(smp_processor_id()); > lockdep_assert_rq_held(rq); > > diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c > index ec49e0c9892e..443d12a3df67 100644 > --- a/kernel/sched/ext_idle.c > +++ b/kernel/sched/ext_idle.c > @@ -789,7 +789,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) > */ > if (SCX_HAS_OP(sch, update_idle) && do_notify && > !scx_bypassing(sch, cpu_of(rq))) > - SCX_CALL_OP(sch, SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); > + SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle); > } > > static void reset_idle_masks(struct sched_ext_ops *ops) > @@ -937,8 +937,6 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p, > } else if (!scx_locked_rq()) { > raw_spin_lock_irqsave(&p->pi_lock, irq_flags); > we_locked = true; > - } else if (!scx_kf_allowed(sch, SCX_KF_ENQUEUE)) { > - return -EPERM; > } > > /* > -- > 2.53.0 > ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
                  ` (7 preceding siblings ...)
  2026-04-10 6:30 ` [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement Tejun Heo
@ 2026-04-10 6:30 ` Tejun Heo
  2026-04-10 16:55  ` Andrea Righi
  2026-04-10 6:30 ` [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion Tejun Heo
  2026-04-10 17:45 ` [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Andrea Righi

  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw)
To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
    Emil Tsalapatis, linux-kernel, Tejun Heo

The "kf_allowed" framing on this helper comes from the old runtime
scx_kf_allowed() gate, which has been removed. Rename it to describe what it
actually does in the new model.

Pure rename, no functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 70e9434a9a0d..27091ae075a3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -494,7 +494,7 @@ do { \
  * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments
  * and records them in current->scx.kf_tasks[] for the duration of the call. A
  * kfunc invoked from inside such an op can then use
- * scx_kf_allowed_on_arg_tasks() to verify that its task argument is one of
+ * scx_kf_arg_task_ok() to verify that its task argument is one of
  * those subject tasks.
  *
  * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held -
@@ -532,7 +532,7 @@ do { \
 })

 /* see SCX_CALL_OP_TASK() */
-static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch,
+static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch,
 					struct task_struct *p)
 {
 	if (unlikely((p != current->scx.kf_tasks[0] &&
@@ -9424,7 +9424,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
 	if (unlikely(!sch))
 		goto out;

-	if (!scx_kf_allowed_on_arg_tasks(sch, p))
+	if (!scx_kf_arg_task_ok(sch, p))
 		goto out;

 	cgrp = tg_cgrp(tg);
--
2.53.0
* Re: [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  2026-04-10 6:30 ` [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() Tejun Heo
@ 2026-04-10 16:55 ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 16:55 UTC (permalink / raw)
To: Tejun Heo
Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
    Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
    linux-kernel

On Thu, Apr 09, 2026 at 08:30:45PM -1000, Tejun Heo wrote:
> The "kf_allowed" framing on this helper comes from the old runtime
> scx_kf_allowed() gate, which has been removed. Rename it to describe what it
> actually does in the new model.
>
> Pure rename, no functional change.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
> kernel/sched/ext.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 70e9434a9a0d..27091ae075a3 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -494,7 +494,7 @@ do { \
> * SCX_CALL_OP_TASK*() invokes an SCX op that takes one or two task arguments
> * and records them in current->scx.kf_tasks[] for the duration of the call. A
> * kfunc invoked from inside such an op can then use
> - * scx_kf_allowed_on_arg_tasks() to verify that its task argument is one of
> + * scx_kf_arg_task_ok() to verify that its task argument is one of
> * those subject tasks.
> *
> * Every SCX_CALL_OP_TASK*() call site invokes its op with @p's rq lock held -
> @@ -532,7 +532,7 @@ do { \
> })
>
> /* see SCX_CALL_OP_TASK() */
> -static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch,
> +static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch,
> struct task_struct *p)
> {
> if (unlikely((p != current->scx.kf_tasks[0] &&
> @@ -9424,7 +9424,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
> if (unlikely(!sch))
> goto out;
>
> - if (!scx_kf_allowed_on_arg_tasks(sch, p))
> + if (!scx_kf_arg_task_ok(sch, p))
> goto out;
>
> cgrp = tg_cgrp(tg);
> --
> 2.53.0
* [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion
  2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
                  ` (8 preceding siblings ...)
  2026-04-10 6:30 ` [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() Tejun Heo
@ 2026-04-10 6:30 ` Tejun Heo
  2026-04-10 17:38  ` Andrea Righi
  2026-04-10 17:45 ` [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Andrea Righi

  10 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2026-04-10 6:30 UTC (permalink / raw)
To: sched-ext, David Vernet, Andrea Righi, Changwoo Min
Cc: Cheng-Yang Chou, Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai,
    Emil Tsalapatis, linux-kernel, Tejun Heo

The kf_tasks[] design assumes task-based SCX ops don't nest - if they
did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE
caught invalid nesting via kf_mask, but that machinery is gone now.

Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each
SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all
three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET,
SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at
entry to any of the three means re-entry from somewhere in the family.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 27091ae075a3..99760d1fbbd4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -502,10 +502,13 @@ do { \
  * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if
  * kf_tasks[] is set, @p's scheduler-protected fields are stable.
  *
- * These macros only work for non-nesting ops since kf_tasks[] is not stacked.
+ * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
+ * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
+ * while a previous one is still in progress.
  */
 #define SCX_CALL_OP_TASK(sch, op, rq, task, args...)			\
 do {									\
+	WARN_ON_ONCE(current->scx.kf_tasks[0]);				\
 	current->scx.kf_tasks[0] = task;				\
 	SCX_CALL_OP((sch), op, rq, task, ##args);			\
 	current->scx.kf_tasks[0] = NULL;				\
@@ -514,6 +517,7 @@ do { \
 #define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...)		\
 ({									\
 	__typeof__((sch)->ops.op(task, ##args)) __ret;			\
+	WARN_ON_ONCE(current->scx.kf_tasks[0]);				\
 	current->scx.kf_tasks[0] = task;				\
 	__ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args);		\
 	current->scx.kf_tasks[0] = NULL;				\
@@ -523,6 +527,7 @@ do { \
 #define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...)	\
 ({									\
 	__typeof__((sch)->ops.op(task0, task1, ##args)) __ret;		\
+	WARN_ON_ONCE(current->scx.kf_tasks[0]);				\
 	current->scx.kf_tasks[0] = task0;				\
 	current->scx.kf_tasks[1] = task1;				\
 	__ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args);	\
--
2.53.0
* Re: [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion
  2026-04-10 6:30 ` [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion Tejun Heo
@ 2026-04-10 17:38 ` Andrea Righi
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 17:38 UTC (permalink / raw)
To: Tejun Heo
Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
    Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
    linux-kernel

On Thu, Apr 09, 2026 at 08:30:46PM -1000, Tejun Heo wrote:
> The kf_tasks[] design assumes task-based SCX ops don't nest - if they
> did, kf_tasks[0] would get clobbered. The old scx_kf_allow() WARN_ONCE
> caught invalid nesting via kf_mask, but that machinery is gone now.
>
> Add a WARN_ON_ONCE(current->scx.kf_tasks[0]) at the top of each
> SCX_CALL_OP_TASK*() macro. Checking kf_tasks[0] alone is sufficient: all
> three variants (SCX_CALL_OP_TASK, SCX_CALL_OP_TASK_RET,
> SCX_CALL_OP_2TASKS_RET) write to kf_tasks[0], so a non-NULL value at
> entry to any of the three means re-entry from somewhere in the family.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
> kernel/sched/ext.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 27091ae075a3..99760d1fbbd4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -502,10 +502,13 @@ do { \
> * held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu. So if
> * kf_tasks[] is set, @p's scheduler-protected fields are stable.
> *
> - * These macros only work for non-nesting ops since kf_tasks[] is not stacked.
> + * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
> + * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
> + * while a previous one is still in progress.
> */
> #define SCX_CALL_OP_TASK(sch, op, rq, task, args...) \
> do { \
> + WARN_ON_ONCE(current->scx.kf_tasks[0]); \
> current->scx.kf_tasks[0] = task; \
> SCX_CALL_OP((sch), op, rq, task, ##args); \
> current->scx.kf_tasks[0] = NULL; \
> @@ -514,6 +517,7 @@ do { \
> #define SCX_CALL_OP_TASK_RET(sch, op, rq, task, args...) \
> ({ \
> __typeof__((sch)->ops.op(task, ##args)) __ret; \
> + WARN_ON_ONCE(current->scx.kf_tasks[0]); \
> current->scx.kf_tasks[0] = task; \
> __ret = SCX_CALL_OP_RET((sch), op, rq, task, ##args); \
> current->scx.kf_tasks[0] = NULL; \
> @@ -523,6 +527,7 @@ do { \
> #define SCX_CALL_OP_2TASKS_RET(sch, op, rq, task0, task1, args...) \
> ({ \
> __typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \
> + WARN_ON_ONCE(current->scx.kf_tasks[0]); \
> current->scx.kf_tasks[0] = task0; \
> current->scx.kf_tasks[1] = task1; \
> __ret = SCX_CALL_OP_RET((sch), op, rq, task0, task1, ##args); \
> --
> 2.53.0
* Re: [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter
  2026-04-10 6:30 [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
                  ` (9 preceding siblings ...)
  2026-04-10 6:30 ` [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion Tejun Heo
@ 2026-04-10 17:45 ` Andrea Righi
  10 siblings, 0 replies; 26+ messages in thread
From: Andrea Righi @ 2026-04-10 17:45 UTC (permalink / raw)
To: Tejun Heo
Cc: sched-ext, David Vernet, Changwoo Min, Cheng-Yang Chou,
    Juntong Deng, Ching-Chun Huang, Chia-Ping Tsai, Emil Tsalapatis,
    linux-kernel

On Thu, Apr 09, 2026 at 08:30:36PM -1000, Tejun Heo wrote:
> Hello,
>
> This moves enforcement of SCX context-sensitive kfunc restrictions from
> runtime kf_mask checks to BPF verifier-time filtering, using the BPF core's
> struct_ops context information.
>
> This is based on work by Juntong Deng and Cheng-Yang Chou:
>
>   https://lore.kernel.org/r/20260406154834.1920962-1-yphbchou0911@gmail.com
>
> I ended up redoing the series. The number of changes needed and the
> difficulty of validating each one made iterating through review emails
> impractical:
>
> - Pre-existing call-site bugs needed fixing first. ops.cgroup_move() was
>   mislabeled as SCX_KF_UNLOCKED when sched_move_task() actually holds the
>   rq lock, and set_cpus_allowed_scx() passed rq=NULL to SCX_CALL_OP_TASK
>   despite holding the rq lock. These had to be sorted out before the
>   runtime-to-verifier conversion could be validated.
>
> - The macro-based kfunc ID deduplication (SCX_KFUNCS_*) made it hard to
>   verify that the new code produced the same accept/reject verdicts as
>   the old.
>
> - No systematic validation of the full (kfunc, caller) verdict matrix
>   existed, so it wasn't clear whether the conversion was correct.
>
> This series takes a different approach: first fix the call-site bugs that
> made the conversion harder than it needed to be, then do the conversion in
> small isolated steps, and verify the full verdict matrix at each stage.

Thanks Tejun, Juntong and Cheng-Yang for working on this!

I've done some basic smoke tests with this and everything seems to work
fine so far.

I'm planning to run more extensive performance tests. Last time that I
tried to brutally comment out scx_kf_allowed() I was getting some small
but consistent performance improvements, so I'm expecting to notice
something similar with this one. Will keep you informed.

Thanks,
-Andrea
end of thread, other threads:[~2026-04-10 17:51 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
2026-04-10 6:30  [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Tejun Heo
2026-04-10 6:30  ` [PATCH 01/10] sched_ext: Drop TRACING access to select_cpu kfuncs Tejun Heo
2026-04-10 16:04   ` Andrea Righi
2026-04-10 6:30  ` [PATCH 02/10] sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked Tejun Heo
2026-04-10 16:07   ` Andrea Righi
2026-04-10 17:51     ` [PATCH v2 " Tejun Heo
2026-04-10 6:30  ` [PATCH 03/10] sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask Tejun Heo
2026-04-10 16:12   ` Andrea Righi
2026-04-10 17:51     ` [PATCH v2 " Tejun Heo
2026-04-10 6:30  ` [PATCH 04/10] sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking Tejun Heo
2026-04-10 16:16   ` Andrea Righi
2026-04-10 17:51     ` [PATCH v2 " Tejun Heo
2026-04-10 6:30  ` [PATCH 05/10] sched_ext: Decouple kfunc unlocked-context check from kf_mask Tejun Heo
2026-04-10 16:34   ` Andrea Righi
2026-04-10 17:51     ` [PATCH v2 " Tejun Heo
2026-04-10 6:30  ` [PATCH 06/10] sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup() Tejun Heo
2026-04-10 16:36   ` Andrea Righi
2026-04-10 6:30  ` [PATCH 07/10] sched_ext: Add verifier-time kfunc context filter Tejun Heo
2026-04-10 16:49   ` Andrea Righi
2026-04-10 6:30  ` [PATCH 08/10] sched_ext: Remove runtime kfunc mask enforcement Tejun Heo
2026-04-10 16:50   ` Andrea Righi
2026-04-10 6:30  ` [PATCH 09/10] sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok() Tejun Heo
2026-04-10 16:55   ` Andrea Righi
2026-04-10 6:30  ` [PATCH 10/10] sched_ext: Warn on task-based SCX op recursion Tejun Heo
2026-04-10 17:38   ` Andrea Righi
2026-04-10 17:45 ` [PATCHSET sched_ext/for-7.1] sched_ext: Add verifier-time kfunc context filter Andrea Righi