[PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct

bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead
@ 2025-02-26 19:24 Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 1/5] bpf: Add struct_ops context information to struct bpf_prog_aux Juntong Deng
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:24 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

This patch series makes kfunc filters support the use of struct_ops
context information to reduce SCX runtime overhead.

After improving kfunc filters, SCX no longer needs the mask-based
runtime kfuncs call restriction, so this patch removes the mask-based
runtime restriction and updates the corresponding test case.

I added *st_ops as part of the context information to avoid kfuncs being
incorrectly blocked when used in non-SCX scenarios where the member
offsets would have a different meaning.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
v2 -> v3:
* Change scx_kfunc_ids_ops_context to the more descriptive
  scx_kfunc_ids_ops_context_sensitive
  
* Change the target to sched_ext/for-6.15

* Remove RFC tag

* v2 link: https://lore.kernel.org/bpf/AM6PR03MB5080855B90C3FE9B6C4243B099FE2@AM6PR03MB5080.eurprd03.prod.outlook.com/T/#u

v1 -> v2:
* Use completely new design in filtering

* v1 link: https://lore.kernel.org/bpf/AM6PR03MB5080261D024B49D26F3FFF0099F72@AM6PR03MB5080.eurprd03.prod.outlook.com/T/#u

Juntong Deng (5):
  bpf: Add struct_ops context information to struct bpf_prog_aux
  sched_ext: Declare context-sensitive kfunc groups that can be used by
    different SCX operations
  sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified
    filtering of context-sensitive SCX kfuncs
  sched_ext: Removed mask-based runtime restrictions on calling kfuncs
    in different contexts
  selftests/sched_ext: Update enq_select_cpu_fails to adapt to
    struct_ops context filter

 include/linux/bpf.h                           |   2 +
 include/linux/sched/ext.h                     |  24 --
 kernel/bpf/verifier.c                         |   8 +-
 kernel/sched/ext.c                            | 385 ++++++++----------
 kernel/sched/ext_idle.c                       |  13 +-
 .../sched_ext/enq_select_cpu_fails.c          |  37 +-
 6 files changed, 194 insertions(+), 275 deletions(-)

-- 
2.39.5


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH sched_ext/for-6.15 v3 1/5] bpf: Add struct_ops context information to struct bpf_prog_aux
  2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
@ 2025-02-26 19:28 ` Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations Juntong Deng
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:28 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

This patch adds struct_ops context information to struct bpf_prog_aux.

This context information will be used in the kfunc filter.

Currently the added context information includes struct_ops member
offset and a pointer to struct bpf_struct_ops.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h   | 2 ++
 kernel/bpf/verifier.c | 8 ++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f3f50e29d639..e06348a59dcf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1503,6 +1503,7 @@ struct bpf_prog_aux {
 	u32 real_func_cnt; /* includes hidden progs, only used for JIT and freeing progs */
 	u32 func_idx; /* 0 for non-func prog, the index in func array for func prog */
 	u32 attach_btf_id; /* in-kernel BTF type id to attach to */
+	u32 attach_st_ops_member_off;
 	u32 ctx_arg_info_size;
 	u32 max_rdonly_access;
 	u32 max_rdwr_access;
@@ -1547,6 +1548,7 @@ struct bpf_prog_aux {
 #endif
 	struct bpf_ksym ksym;
 	const struct bpf_prog_ops *ops;
+	const struct bpf_struct_ops *st_ops;
 	struct bpf_map **used_maps;
 	struct mutex used_maps_mutex; /* mutex for used_maps and used_map_cnt */
 	struct btf_mod_pair *used_btfs;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9971c03adfd5..2dee3fd190a4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -22384,7 +22384,7 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
 	const struct bpf_struct_ops *st_ops;
 	const struct btf_member *member;
 	struct bpf_prog *prog = env->prog;
-	u32 btf_id, member_idx;
+	u32 btf_id, member_idx, member_off;
 	struct btf *btf;
 	const char *mname;
 	int err;
@@ -22435,7 +22435,8 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
 		return -EINVAL;
 	}
 
-	err = bpf_struct_ops_supported(st_ops, __btf_member_bit_offset(t, member) / 8);
+	member_off = __btf_member_bit_offset(t, member) / 8;
+	err = bpf_struct_ops_supported(st_ops, member_off);
 	if (err) {
 		verbose(env, "attach to unsupported member %s of struct %s\n",
 			mname, st_ops->name);
@@ -22463,6 +22464,9 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
 	prog->aux->ctx_arg_info_size =
 		st_ops_desc->arg_info[member_idx].cnt;
 
+	prog->aux->st_ops = st_ops;
+	prog->aux->attach_st_ops_member_off = member_off;
+
 	prog->aux->attach_func_proto = func_proto;
 	prog->aux->attach_func_name = mname;
 	env->ops = st_ops->verifier_ops;
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 1/5] bpf: Add struct_ops context information to struct bpf_prog_aux Juntong Deng
@ 2025-02-26 19:28 ` Juntong Deng
  2025-02-27 19:19   ` Tejun Heo
  2025-02-28 21:57   ` Tejun Heo
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs Juntong Deng
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:28 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

This patch declare context-sensitive kfunc groups that can be used by
different SCX operations.

In SCX, some kfuncs are context-sensitive and can only be used in
specific SCX operations.

Currently context-sensitive kfuncs can be grouped into UNLOCKED,
CPU_RELEASE, DISPATCH, ENQUEUE, SELECT_CPU.

In this patch enum scx_ops_kf_flags was added to represent these groups,
which is based on scx_kf_mask.

SCX_OPS_KF_ANY is a special value that indicates kfuncs can be used in
any context.

scx_ops_context_flags is used to declare the groups of kfuncs that can
be used by each SCX operation. An SCX operation can use multiple groups
of kfuncs.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
 kernel/sched/ext.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 74b247c36b3d..15fac968629e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -730,6 +730,54 @@ struct sched_ext_ops {
 	char name[SCX_OPS_NAME_LEN];
 };
 
+/* Each flag corresponds to a btf kfunc id set */
+enum scx_ops_kf_flags {
+	SCX_OPS_KF_ANY			= 0,
+	SCX_OPS_KF_UNLOCKED		= 1 << 1,
+	SCX_OPS_KF_CPU_RELEASE		= 1 << 2,
+	SCX_OPS_KF_DISPATCH		= 1 << 3,
+	SCX_OPS_KF_ENQUEUE		= 1 << 4,
+	SCX_OPS_KF_SELECT_CPU		= 1 << 5,
+};
+
+static const u32 scx_ops_context_flags[] = {
+	[SCX_OP_IDX(select_cpu)]		= SCX_OPS_KF_SELECT_CPU | SCX_OPS_KF_ENQUEUE,
+	[SCX_OP_IDX(enqueue)]			= SCX_OPS_KF_ENQUEUE,
+	[SCX_OP_IDX(dequeue)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(dispatch)]			= SCX_OPS_KF_DISPATCH | SCX_OPS_KF_ENQUEUE,
+	[SCX_OP_IDX(tick)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(runnable)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(running)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(stopping)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(quiescent)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(yield)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(core_sched_before)]		= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(set_weight)]		= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(set_cpumask)]		= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(update_idle)]		= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(cpu_acquire)]		= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(cpu_release)]		= SCX_OPS_KF_CPU_RELEASE,
+	[SCX_OP_IDX(init_task)]			= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(exit_task)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(enable)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(disable)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(dump)]			= SCX_OPS_KF_DISPATCH,
+	[SCX_OP_IDX(dump_cpu)]			= SCX_OPS_KF_ANY,
+	[SCX_OP_IDX(dump_task)]			= SCX_OPS_KF_ANY,
+#ifdef CONFIG_EXT_GROUP_SCHED
+	[SCX_OP_IDX(cgroup_init)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cgroup_exit)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cgroup_prep_move)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cgroup_move)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cgroup_cancel_move)]	= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cgroup_set_weight)]		= SCX_OPS_KF_UNLOCKED,
+#endif	/* CONFIG_EXT_GROUP_SCHED */
+	[SCX_OP_IDX(cpu_online)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(cpu_offline)]		= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(init)]			= SCX_OPS_KF_UNLOCKED,
+	[SCX_OP_IDX(exit)]			= SCX_OPS_KF_UNLOCKED,
+};
+
 enum scx_opi {
 	SCX_OPI_BEGIN			= 0,
 	SCX_OPI_NORMAL_BEGIN		= 0,
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 1/5] bpf: Add struct_ops context information to struct bpf_prog_aux Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations Juntong Deng
@ 2025-02-26 19:28 ` Juntong Deng
  2025-02-27 20:25   ` Tejun Heo
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts Juntong Deng
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 5/5] selftests/sched_ext: Update enq_select_cpu_fails to adapt to struct_ops context filter Juntong Deng
  4 siblings, 1 reply; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:28 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

This patch adds scx_kfunc_ids_ops_context_sensitive for unified
filtering of context-sensitive SCX kfuncs.

Currently we need to rely on kfunc id sets to group context-sensitive
SCX kfuncs.

If we add filters to each group kfunc id set separately, it will be
cumbersome. A better approach would be to use different kfunc id sets
for grouping purposes and filtering purposes.

scx_kfunc_ids_ops_context_sensitive is a kfunc id set for filtering
purposes, which contains all context-sensitive SCX kfuncs and implements
filtering rules for different contexts in the filter (by searching the
kfunc id sets used for grouping purposes).

Now we only need to register scx_kfunc_ids_ops_context_sensitive, no
longer need to register multiple context-sensitive kfunc id sets.

In addition, this patch adds the SCX_MOFF_IDX macro to facilitate the
calculation of idx based on moff.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
 kernel/sched/ext.c      | 110 ++++++++++++++++++++++++++++++----------
 kernel/sched/ext_idle.c |   8 +--
 2 files changed, 83 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 15fac968629e..c337f6206ae5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -10,6 +10,7 @@
 #include "ext_idle.h"
 
 #define SCX_OP_IDX(op)		(offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
+#define SCX_MOFF_IDX(moff)	(moff / sizeof(void (*)(void)))
 
 enum scx_consts {
 	SCX_DSP_DFL_MAX_BATCH		= 32,
@@ -6300,11 +6301,6 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
-	.owner			= THIS_MODULE,
-	.set			= &scx_kfunc_ids_enqueue_dispatch,
-};
-
 static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 			 struct task_struct *p, u64 dsq_id, u64 enq_flags)
 {
@@ -6620,11 +6616,6 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_dispatch)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
-	.owner			= THIS_MODULE,
-	.set			= &scx_kfunc_ids_dispatch,
-};
-
 __bpf_kfunc_start_defs();
 
 /**
@@ -6687,11 +6678,6 @@ BTF_KFUNCS_START(scx_kfunc_ids_cpu_release)
 BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
 BTF_KFUNCS_END(scx_kfunc_ids_cpu_release)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
-	.owner			= THIS_MODULE,
-	.set			= &scx_kfunc_ids_cpu_release,
-};
-
 __bpf_kfunc_start_defs();
 
 /**
@@ -6724,11 +6710,6 @@ BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
-	.owner			= THIS_MODULE,
-	.set			= &scx_kfunc_ids_unlocked,
-};
-
 __bpf_kfunc_start_defs();
 
 /**
@@ -7370,6 +7351,85 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events,
 
 __bpf_kfunc_end_defs();
 
+BTF_KFUNCS_START(scx_kfunc_ids_ops_context_sensitive)
+/* scx_kfunc_ids_select_cpu */
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+/* scx_kfunc_ids_enqueue_dispatch */
+BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
+/* scx_kfunc_ids_dispatch */
+BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
+BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local)
+BTF_ID_FLAGS(func, scx_bpf_consume)
+/* scx_kfunc_ids_cpu_release */
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
+/* scx_kfunc_ids_unlocked */
+BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+/* Intersection of scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked */
+BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice)
+BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime)
+BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_ops_context_sensitive)
+
+extern struct btf_id_set8 scx_kfunc_ids_select_cpu;
+
+static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	u32 moff, flags;
+
+	if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
+		return 0;
+
+	if (prog->type == BPF_PROG_TYPE_SYSCALL &&
+	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
+		return 0;
+
+	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
+	    prog->aux->st_ops != &bpf_sched_ext_ops)
+		return 0;
+
+	/* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/
+
+	moff = prog->aux->attach_st_ops_member_off;
+	flags = scx_ops_context_flags[SCX_MOFF_IDX(moff)];
+
+	if ((flags & SCX_OPS_KF_UNLOCKED) &&
+	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
+		return 0;
+
+	if ((flags & SCX_OPS_KF_CPU_RELEASE) &&
+	    btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id))
+		return 0;
+
+	if ((flags & SCX_OPS_KF_DISPATCH) &&
+	    btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id))
+		return 0;
+
+	if ((flags & SCX_OPS_KF_ENQUEUE) &&
+	    btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id))
+		return 0;
+
+	if ((flags & SCX_OPS_KF_SELECT_CPU) &&
+	    btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id))
+		return 0;
+
+	return -EACCES;
+}
+
+static const struct btf_kfunc_id_set scx_kfunc_set_ops_context_sensitive = {
+	.owner			= THIS_MODULE,
+	.set			= &scx_kfunc_ids_ops_context_sensitive,
+	.filter			= scx_kfunc_ids_ops_context_sensitive_filter,
+};
+
 BTF_KFUNCS_START(scx_kfunc_ids_any)
 BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
@@ -7425,15 +7485,9 @@ static int __init scx_init(void)
 	 * check using scx_kf_allowed().
 	 */
 	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
-					     &scx_kfunc_set_enqueue_dispatch)) ||
-	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
-					     &scx_kfunc_set_dispatch)) ||
-	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
-					     &scx_kfunc_set_cpu_release)) ||
-	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
-					     &scx_kfunc_set_unlocked)) ||
+					     &scx_kfunc_set_ops_context_sensitive)) ||
 	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
-					     &scx_kfunc_set_unlocked)) ||
+					     &scx_kfunc_set_ops_context_sensitive)) ||
 	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					     &scx_kfunc_set_any)) ||
 	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index dc40e0baf77c..efb6077810d8 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -1125,17 +1125,11 @@ BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
 BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
 BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
-	.owner			= THIS_MODULE,
-	.set			= &scx_kfunc_ids_select_cpu,
-};
-
 int scx_idle_init(void)
 {
 	int ret;
 
-	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) ||
-	      register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) ||
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) ||
 	      register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) ||
 	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts
  2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
                   ` (2 preceding siblings ...)
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs Juntong Deng
@ 2025-02-26 19:28 ` Juntong Deng
  2025-02-27 16:24   ` Andrea Righi
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 5/5] selftests/sched_ext: Update enq_select_cpu_fails to adapt to struct_ops context filter Juntong Deng
  4 siblings, 1 reply; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:28 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

Currently, kfunc filters already support filtering based on struct_ops
context information.

The BPF verifier can check context-sensitive kfuncs before the SCX
program is run, avoiding runtime overhead.

Therefore we no longer need mask-based runtime restrictions.

This patch removes the mask-based runtime restrictions.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
 include/linux/sched/ext.h |  24 ----
 kernel/sched/ext.c        | 227 ++++++++------------------------------
 kernel/sched/ext_idle.c   |   5 +-
 3 files changed, 50 insertions(+), 206 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index f7545430a548..9980d6b55c84 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -96,29 +96,6 @@ enum scx_ent_dsq_flags {
 	SCX_TASK_DSQ_ON_PRIQ	= 1 << 0, /* task is queued on the priority queue of a dsq */
 };
 
-/*
- * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
- * everywhere and the following bits track which kfunc sets are currently
- * allowed for %current. This simple per-task tracking works because SCX ops
- * nest in a limited way. BPF will likely implement a way to allow and disallow
- * kfuncs depending on the calling context which will replace this manual
- * mechanism. See scx_kf_allow().
- */
-enum scx_kf_mask {
-	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
-	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
-	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
-	/* ops.dequeue (in REST) may be nested inside DISPATCH */
-	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
-	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
-	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
-	SCX_KF_REST		= 1 << 4, /* other rq-locked operations */
-
-	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
-				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
-	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
-};
-
 enum scx_dsq_lnode_flags {
 	SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,
 
@@ -147,7 +124,6 @@ struct sched_ext_entity {
 	s32			sticky_cpu;
 	s32			holding_cpu;
 	s32			selected_cpu;
-	u32			kf_mask;	/* see scx_kf_mask above */
 	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
 	atomic_long_t		ops_state;
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c337f6206ae5..7dc5f11be66b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1115,19 +1115,6 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now)
 		return -(long)jiffies_to_msecs(now - at);
 }
 
-/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
-static u32 higher_bits(u32 flags)
-{
-	return ~((1 << fls(flags)) - 1);
-}
-
-/* return the mask with only the highest bit set */
-static u32 highest_bit(u32 flags)
-{
-	int bit = fls(flags);
-	return ((u64)1 << bit) >> 1;
-}
-
 static bool u32_before(u32 a, u32 b)
 {
 	return (s32)(a - b) < 0;
@@ -1143,51 +1130,12 @@ static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
 	return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
 }
 
-/*
- * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
- * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
- * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
- * whether it's running from an allowed context.
- *
- * @mask is constant, always inline to cull the mask calculations.
- */
-static __always_inline void scx_kf_allow(u32 mask)
-{
-	/* nesting is allowed only in increasing scx_kf_mask order */
-	WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask,
-		  "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n",
-		  current->scx.kf_mask, mask);
-	current->scx.kf_mask |= mask;
-	barrier();
-}
-
-static void scx_kf_disallow(u32 mask)
-{
-	barrier();
-	current->scx.kf_mask &= ~mask;
-}
-
-#define SCX_CALL_OP(mask, op, args...)						\
-do {										\
-	if (mask) {								\
-		scx_kf_allow(mask);						\
-		scx_ops.op(args);						\
-		scx_kf_disallow(mask);						\
-	} else {								\
-		scx_ops.op(args);						\
-	}									\
-} while (0)
+#define SCX_CALL_OP(op, args...)	scx_ops.op(args)
 
-#define SCX_CALL_OP_RET(mask, op, args...)					\
+#define SCX_CALL_OP_RET(op, args...)						\
 ({										\
 	__typeof__(scx_ops.op(args)) __ret;					\
-	if (mask) {								\
-		scx_kf_allow(mask);						\
-		__ret = scx_ops.op(args);					\
-		scx_kf_disallow(mask);						\
-	} else {								\
-		__ret = scx_ops.op(args);					\
-	}									\
+	__ret = scx_ops.op(args);						\
 	__ret;									\
 })
 
@@ -1202,74 +1150,36 @@ do {										\
  * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on
  * the specific task.
  */
-#define SCX_CALL_OP_TASK(mask, op, task, args...)				\
+#define SCX_CALL_OP_TASK(op, task, args...)					\
 do {										\
-	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
 	current->scx.kf_tasks[0] = task;					\
-	SCX_CALL_OP(mask, op, task, ##args);					\
+	SCX_CALL_OP(op, task, ##args);						\
 	current->scx.kf_tasks[0] = NULL;					\
 } while (0)
 
-#define SCX_CALL_OP_TASK_RET(mask, op, task, args...)				\
+#define SCX_CALL_OP_TASK_RET(op, task, args...)					\
 ({										\
 	__typeof__(scx_ops.op(task, ##args)) __ret;				\
-	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
 	current->scx.kf_tasks[0] = task;					\
-	__ret = SCX_CALL_OP_RET(mask, op, task, ##args);			\
+	__ret = SCX_CALL_OP_RET(op, task, ##args);				\
 	current->scx.kf_tasks[0] = NULL;					\
 	__ret;									\
 })
 
-#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...)			\
+#define SCX_CALL_OP_2TASKS_RET(op, task0, task1, args...)			\
 ({										\
 	__typeof__(scx_ops.op(task0, task1, ##args)) __ret;			\
-	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
 	current->scx.kf_tasks[0] = task0;					\
 	current->scx.kf_tasks[1] = task1;					\
-	__ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args);		\
+	__ret = SCX_CALL_OP_RET(op, task0, task1, ##args);			\
 	current->scx.kf_tasks[0] = NULL;					\
 	current->scx.kf_tasks[1] = NULL;					\
 	__ret;									\
 })
 
-/* @mask is constant, always inline to cull unnecessary branches */
-static __always_inline bool scx_kf_allowed(u32 mask)
-{
-	if (unlikely(!(current->scx.kf_mask & mask))) {
-		scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
-			      mask, current->scx.kf_mask);
-		return false;
-	}
-
-	/*
-	 * Enforce nesting boundaries. e.g. A kfunc which can be called from
-	 * DISPATCH must not be called if we're running DEQUEUE which is nested
-	 * inside ops.dispatch(). We don't need to check boundaries for any
-	 * blocking kfuncs as the verifier ensures they're only called from
-	 * sleepable progs.
-	 */
-	if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE &&
-		     (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) {
-		scx_ops_error("cpu_release kfunc called from a nested operation");
-		return false;
-	}
-
-	if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
-		     (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
-		scx_ops_error("dispatch kfunc called from a nested operation");
-		return false;
-	}
-
-	return true;
-}
-
 /* see SCX_CALL_OP_TASK() */
-static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
-							struct task_struct *p)
+static __always_inline bool scx_kf_allowed_on_arg_tasks(struct task_struct *p)
 {
-	if (!scx_kf_allowed(mask))
-		return false;
-
 	if (unlikely((p != current->scx.kf_tasks[0] &&
 		      p != current->scx.kf_tasks[1]))) {
 		scx_ops_error("called on a task not being operated on");
@@ -1279,11 +1189,6 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
 	return true;
 }
 
-static bool scx_kf_allowed_if_unlocked(void)
-{
-	return !current->scx.kf_mask;
-}
-
 /**
  * nldsq_next_task - Iterate to the next task in a non-local DSQ
  * @dsq: user dsq being iterated
@@ -2219,7 +2124,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
 
-	SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
+	SCX_CALL_OP_TASK(enqueue, p, enq_flags);
 
 	*ddsp_taskp = NULL;
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
@@ -2316,7 +2221,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 	add_nr_running(rq, 1);
 
 	if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p))
-		SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);
+		SCX_CALL_OP_TASK(runnable, p, enq_flags);
 
 	if (enq_flags & SCX_ENQ_WAKEUP)
 		touch_core_sched(rq, p);
@@ -2351,7 +2256,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
 		BUG();
 	case SCX_OPSS_QUEUED:
 		if (SCX_HAS_OP(dequeue))
-			SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags);
+			SCX_CALL_OP_TASK(dequeue, p, deq_flags);
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -2400,11 +2305,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
 	 */
 	if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
 		update_curr_scx(rq);
-		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false);
+		SCX_CALL_OP_TASK(stopping, p, false);
 	}
 
 	if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p))
-		SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags);
+		SCX_CALL_OP_TASK(quiescent, p, deq_flags);
 
 	if (deq_flags & SCX_DEQ_SLEEP)
 		p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
@@ -2424,7 +2329,7 @@ static void yield_task_scx(struct rq *rq)
 	struct task_struct *p = rq->curr;
 
 	if (SCX_HAS_OP(yield))
-		SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL);
+		SCX_CALL_OP_2TASKS_RET(yield, p, NULL);
 	else
 		p->scx.slice = 0;
 }
@@ -2434,7 +2339,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
 	struct task_struct *from = rq->curr;
 
 	if (SCX_HAS_OP(yield))
-		return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to);
+		return SCX_CALL_OP_2TASKS_RET(yield, from, to);
 	else
 		return false;
 }
@@ -2992,7 +2897,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 		 * emitted in switch_class().
 		 */
 		if (SCX_HAS_OP(cpu_acquire))
-			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL);
+			SCX_CALL_OP(cpu_acquire, cpu_of(rq), NULL);
 		rq->scx.cpu_released = false;
 	}
 
@@ -3037,8 +2942,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	do {
 		dspc->nr_tasks = 0;
 
-		SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
-			    prev_on_scx ? prev : NULL);
+		SCX_CALL_OP(dispatch, cpu_of(rq), prev_on_scx ? prev : NULL);
 
 		flush_dispatch_buf(rq);
 
@@ -3159,7 +3063,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
 	if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
-		SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
+		SCX_CALL_OP_TASK(running, p);
 
 	clr_task_runnable(p, true);
 
@@ -3240,8 +3144,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 				.task = next,
 			};
 
-			SCX_CALL_OP(SCX_KF_CPU_RELEASE,
-				    cpu_release, cpu_of(rq), &args);
+			SCX_CALL_OP(cpu_release, cpu_of(rq), &args);
 		}
 		rq->scx.cpu_released = true;
 	}
@@ -3254,7 +3157,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
 	if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
-		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);
+		SCX_CALL_OP_TASK(stopping, p, true);
 
 	if (p->scx.flags & SCX_TASK_QUEUED) {
 		set_task_runnable(rq, p);
@@ -3428,8 +3331,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		WARN_ON_ONCE(*ddsp_taskp);
 		*ddsp_taskp = p;
 
-		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
-					   select_cpu, p, prev_cpu, wake_flags);
+		cpu = SCX_CALL_OP_TASK_RET(select_cpu, p, prev_cpu, wake_flags);
 		p->scx.selected_cpu = cpu;
 		*ddsp_taskp = NULL;
 		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
@@ -3473,8 +3375,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 	 * designation pointless. Cast it away when calling the operation.
 	 */
 	if (SCX_HAS_OP(set_cpumask))
-		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
-				 (struct cpumask *)p->cpus_ptr);
+		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
 }
 
 static void handle_hotplug(struct rq *rq, bool online)
@@ -3487,9 +3388,9 @@ static void handle_hotplug(struct rq *rq, bool online)
 		scx_idle_update_selcpu_topology(&scx_ops);
 
 	if (online && SCX_HAS_OP(cpu_online))
-		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
+		SCX_CALL_OP(cpu_online, cpu);
 	else if (!online && SCX_HAS_OP(cpu_offline))
-		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu);
+		SCX_CALL_OP(cpu_offline, cpu);
 	else
 		scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
 			     "cpu %d going %s, exiting scheduler", cpu,
@@ -3593,7 +3494,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
 		curr->scx.slice = 0;
 		touch_core_sched(rq, curr);
 	} else if (SCX_HAS_OP(tick)) {
-		SCX_CALL_OP(SCX_KF_REST, tick, curr);
+		SCX_CALL_OP(tick, curr);
 	}
 
 	if (!curr->scx.slice)
@@ -3670,7 +3571,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool
 			.fork = fork,
 		};
 
-		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args);
+		ret = SCX_CALL_OP_RET(init_task, p, &args);
 		if (unlikely(ret)) {
 			ret = ops_sanitize_err("init_task", ret);
 			return ret;
@@ -3727,11 +3628,11 @@ static void scx_ops_enable_task(struct task_struct *p)
 	p->scx.weight = sched_weight_to_cgroup(weight);
 
 	if (SCX_HAS_OP(enable))
-		SCX_CALL_OP_TASK(SCX_KF_REST, enable, p);
+		SCX_CALL_OP_TASK(enable, p);
 	scx_set_task_state(p, SCX_TASK_ENABLED);
 
 	if (SCX_HAS_OP(set_weight))
-		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
+		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
 }
 
 static void scx_ops_disable_task(struct task_struct *p)
@@ -3740,7 +3641,7 @@ static void scx_ops_disable_task(struct task_struct *p)
 	WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED);
 
 	if (SCX_HAS_OP(disable))
-		SCX_CALL_OP(SCX_KF_REST, disable, p);
+		SCX_CALL_OP(disable, p);
 	scx_set_task_state(p, SCX_TASK_READY);
 }
 
@@ -3769,7 +3670,7 @@ static void scx_ops_exit_task(struct task_struct *p)
 	}
 
 	if (SCX_HAS_OP(exit_task))
-		SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args);
+		SCX_CALL_OP(exit_task, p, &args);
 	scx_set_task_state(p, SCX_TASK_NONE);
 }
 
@@ -3878,7 +3779,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p,
 
 	p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight));
 	if (SCX_HAS_OP(set_weight))
-		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
+		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
 }
 
 static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
@@ -3894,8 +3795,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 	 * different scheduler class. Keep the BPF scheduler up-to-date.
 	 */
 	if (SCX_HAS_OP(set_cpumask))
-		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
-				 (struct cpumask *)p->cpus_ptr);
+		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
 }
 
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
@@ -3987,8 +3887,7 @@ int scx_tg_online(struct task_group *tg)
 			struct scx_cgroup_init_args args =
 				{ .weight = tg->scx_weight };
 
-			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
-					      tg->css.cgroup, &args);
+			ret = SCX_CALL_OP_RET(cgroup_init, tg->css.cgroup, &args);
 			if (ret)
 				ret = ops_sanitize_err("cgroup_init", ret);
 		}
@@ -4009,7 +3908,7 @@ void scx_tg_offline(struct task_group *tg)
 	percpu_down_read(&scx_cgroup_rwsem);
 
 	if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
-		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup);
+		SCX_CALL_OP(cgroup_exit, tg->css.cgroup);
 	tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
 
 	percpu_up_read(&scx_cgroup_rwsem);
@@ -4042,8 +3941,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
 			continue;
 
 		if (SCX_HAS_OP(cgroup_prep_move)) {
-			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move,
-					      p, from, css->cgroup);
+			ret = SCX_CALL_OP_RET(cgroup_prep_move, p, from, css->cgroup);
 			if (ret)
 				goto err;
 		}
@@ -4056,8 +3954,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
 err:
 	cgroup_taskset_for_each(p, css, tset) {
 		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
-			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
-				    p->scx.cgrp_moving_from, css->cgroup);
+			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
 		p->scx.cgrp_moving_from = NULL;
 	}
 
@@ -4075,8 +3972,7 @@ void scx_cgroup_move_task(struct task_struct *p)
 	 * cgrp_moving_from set.
 	 */
 	if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
-		SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p,
-			p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
+		SCX_CALL_OP_TASK(cgroup_move, p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
 	p->scx.cgrp_moving_from = NULL;
 }
 
@@ -4095,8 +3991,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_for_each(p, css, tset) {
 		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
-			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
-				    p->scx.cgrp_moving_from, css->cgroup);
+			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
 		p->scx.cgrp_moving_from = NULL;
 	}
 out_unlock:
@@ -4109,8 +4004,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight)
 
 	if (scx_cgroup_enabled && tg->scx_weight != weight) {
 		if (SCX_HAS_OP(cgroup_set_weight))
-			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight,
-				    tg_cgrp(tg), weight);
+			SCX_CALL_OP(cgroup_set_weight, tg_cgrp(tg), weight);
 		tg->scx_weight = weight;
 	}
 
@@ -4300,7 +4194,7 @@ static void scx_cgroup_exit(void)
 			continue;
 		rcu_read_unlock();
 
-		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup);
+		SCX_CALL_OP(cgroup_exit, css->cgroup);
 
 		rcu_read_lock();
 		css_put(css);
@@ -4343,8 +4237,7 @@ static int scx_cgroup_init(void)
 			continue;
 		rcu_read_unlock();
 
-		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
-				      css->cgroup, &args);
+		ret = SCX_CALL_OP_RET(cgroup_init, css->cgroup, &args);
 		if (ret) {
 			css_put(css);
 			scx_ops_error("ops.cgroup_init() failed (%d)", ret);
@@ -4840,7 +4733,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
 	}
 
 	if (scx_ops.exit)
-		SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);
+		SCX_CALL_OP(exit, ei);
 
 	cancel_delayed_work_sync(&scx_watchdog_work);
 
@@ -5047,7 +4940,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx,
 
 	if (SCX_HAS_OP(dump_task)) {
 		ops_dump_init(s, "    ");
-		SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p);
+		SCX_CALL_OP(dump_task, dctx, p);
 		ops_dump_exit();
 	}
 
@@ -5094,7 +4987,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 
 	if (SCX_HAS_OP(dump)) {
 		ops_dump_init(&s, "");
-		SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx);
+		SCX_CALL_OP(dump, &dctx);
 		ops_dump_exit();
 	}
 
@@ -5151,7 +5044,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		used = seq_buf_used(&ns);
 		if (SCX_HAS_OP(dump_cpu)) {
 			ops_dump_init(&ns, "  ");
-			SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle);
+			SCX_CALL_OP(dump_cpu, &dctx, cpu, idle);
 			ops_dump_exit();
 		}
 
@@ -5405,7 +5298,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	cpus_read_lock();
 
 	if (scx_ops.init) {
-		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init);
+		ret = SCX_CALL_OP_RET(init);
 		if (ret) {
 			ret = ops_sanitize_err("init", ret);
 			cpus_read_unlock();
@@ -6146,9 +6039,6 @@ void __init init_sched_ext_class(void)
  */
 static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
 {
-	if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
-		return false;
-
 	lockdep_assert_irqs_disabled();
 
 	if (unlikely(!p)) {
@@ -6310,9 +6200,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 	bool in_balance;
 	unsigned long flags;
 
-	if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
-		return false;
-
 	/*
 	 * Can be called from either ops.dispatch() locking this_rq() or any
 	 * context where no rq lock is held. If latter, lock @p's task_rq which
@@ -6395,9 +6282,6 @@ __bpf_kfunc_start_defs();
  */
 __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void)
 {
-	if (!scx_kf_allowed(SCX_KF_DISPATCH))
-		return 0;
-
 	return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor);
 }
 
@@ -6411,9 +6295,6 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(void)
 {
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
 
-	if (!scx_kf_allowed(SCX_KF_DISPATCH))
-		return;
-
 	if (dspc->cursor > 0)
 		dspc->cursor--;
 	else
@@ -6439,9 +6320,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id)
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
 	struct scx_dispatch_q *dsq;
 
-	if (!scx_kf_allowed(SCX_KF_DISPATCH))
-		return false;
-
 	flush_dispatch_buf(dspc->rq);
 
 	dsq = find_user_dsq(dsq_id);
@@ -6632,9 +6510,6 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 	struct rq *rq;
 	struct task_struct *p, *n;
 
-	if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
-		return 0;
-
 	rq = cpu_rq(smp_processor_id());
 	lockdep_assert_rq_held(rq);
 
@@ -7239,7 +7114,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
 	struct task_group *tg = p->sched_task_group;
 	struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
 
-	if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
+	if (!scx_kf_allowed_on_arg_tasks(p))
 		goto out;
 
 	cgrp = tg_cgrp(tg);
@@ -7479,10 +7354,6 @@ static int __init scx_init(void)
 	 *
 	 * Some kfuncs are context-sensitive and can only be called from
 	 * specific SCX ops. They are grouped into BTF sets accordingly.
-	 * Unfortunately, BPF currently doesn't have a way of enforcing such
-	 * restrictions. Eventually, the verifier should be able to enforce
-	 * them. For now, register them the same and make each kfunc explicitly
-	 * check using scx_kf_allowed().
 	 */
 	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					     &scx_kfunc_set_ops_context_sensitive)) ||
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index efb6077810d8..e241935021eb 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -658,7 +658,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 	 * managed by put_prev_task_idle()/set_next_task_idle().
 	 */
 	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
-		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+		SCX_CALL_OP(update_idle, cpu_of(rq), idle);
 
 	/*
 	 * Update the idle masks:
@@ -803,9 +803,6 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	if (!check_builtin_idle_enabled())
 		goto prev_cpu;
 
-	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
-		goto prev_cpu;
-
 #ifdef CONFIG_SMP
 	return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
 #endif
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH sched_ext/for-6.15 v3 5/5] selftests/sched_ext: Update enq_select_cpu_fails to adapt to struct_ops context filter
  2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
                   ` (3 preceding siblings ...)
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts Juntong Deng
@ 2025-02-26 19:28 ` Juntong Deng
  4 siblings, 0 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-26 19:28 UTC (permalink / raw)
  To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	arighi, changwoo
  Cc: bpf, linux-kernel

After kfunc filters support struct_ops context information, SCX programs
that call incorrect context-sensitive kfuncs will fail to load.

This patch updates the enq_select_cpu_fails test case to adapt to the
failed load situation.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
 .../sched_ext/enq_select_cpu_fails.c          | 37 +++----------------
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
index a80e3a3b3698..a04ad9a48a8f 100644
--- a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
+++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
@@ -11,51 +11,24 @@
 #include "enq_select_cpu_fails.bpf.skel.h"
 #include "scx_test.h"
 
-static enum scx_test_status setup(void **ctx)
-{
-	struct enq_select_cpu_fails *skel;
-
-	skel = enq_select_cpu_fails__open();
-	SCX_FAIL_IF(!skel, "Failed to open");
-	SCX_ENUM_INIT(skel);
-	SCX_FAIL_IF(enq_select_cpu_fails__load(skel), "Failed to load skel");
-
-	*ctx = skel;
-
-	return SCX_TEST_PASS;
-}
-
 static enum scx_test_status run(void *ctx)
 {
-	struct enq_select_cpu_fails *skel = ctx;
-	struct bpf_link *link;
+	struct enq_select_cpu_fails *skel;
 
-	link = bpf_map__attach_struct_ops(skel->maps.enq_select_cpu_fails_ops);
-	if (!link) {
-		SCX_ERR("Failed to attach scheduler");
+	skel = enq_select_cpu_fails__open_and_load();
+	if (skel) {
+		enq_select_cpu_fails__destroy(skel);
+		SCX_ERR("This program should fail to load");
 		return SCX_TEST_FAIL;
 	}
 
-	sleep(1);
-
-	bpf_link__destroy(link);
-
 	return SCX_TEST_PASS;
 }
 
-static void cleanup(void *ctx)
-{
-	struct enq_select_cpu_fails *skel = ctx;
-
-	enq_select_cpu_fails__destroy(skel);
-}
-
 struct scx_test enq_select_cpu_fails = {
 	.name = "enq_select_cpu_fails",
 	.description = "Verify we fail to call scx_bpf_select_cpu_dfl() "
 		       "from ops.enqueue()",
-	.setup = setup,
 	.run = run,
-	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&enq_select_cpu_fails)
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts Juntong Deng
@ 2025-02-27 16:24   ` Andrea Righi
  2025-02-27 16:49     ` Juntong Deng
  0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2025-02-27 16:24 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	changwoo, bpf, linux-kernel

On Wed, Feb 26, 2025 at 07:28:19PM +0000, Juntong Deng wrote:
> Currently, kfunc filters already support filtering based on struct_ops
> context information.
> 
> The BPF verifier can check context-sensitive kfuncs before the SCX
> program is run, avoiding runtime overhead.
> 
> Therefore we no longer need mask-based runtime restrictions.
> 
> This patch removes the mask-based runtime restrictions.

You may have missed scx_prio_less(), that is still using SCX_KF_REST:

kernel/sched/ext.c: In function ‘scx_prio_less’:
kernel/sched/ext.c:1171:27: error: ‘struct sched_ext_ops’ has no member named ‘SCX_KF_REST’
 1171 |         __typeof__(scx_ops.op(task0, task1, ##args)) __ret;                     \
      |                           ^

I think you just need to add:

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4a4713c3af67b..51c13b8c27743 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3302,7 +3302,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 	 * verifier.
 	 */
 	if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a)))
-		return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before,
+		return SCX_CALL_OP_2TASKS_RET(core_sched_before,
 					      (struct task_struct *)a,
 					      (struct task_struct *)b);
 	else

-Andrea

> 
> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
> ---
>  include/linux/sched/ext.h |  24 ----
>  kernel/sched/ext.c        | 227 ++++++++------------------------------
>  kernel/sched/ext_idle.c   |   5 +-
>  3 files changed, 50 insertions(+), 206 deletions(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index f7545430a548..9980d6b55c84 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -96,29 +96,6 @@ enum scx_ent_dsq_flags {
>  	SCX_TASK_DSQ_ON_PRIQ	= 1 << 0, /* task is queued on the priority queue of a dsq */
>  };
>  
> -/*
> - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
> - * everywhere and the following bits track which kfunc sets are currently
> - * allowed for %current. This simple per-task tracking works because SCX ops
> - * nest in a limited way. BPF will likely implement a way to allow and disallow
> - * kfuncs depending on the calling context which will replace this manual
> - * mechanism. See scx_kf_allow().
> - */
> -enum scx_kf_mask {
> -	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
> -	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
> -	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
> -	/* ops.dequeue (in REST) may be nested inside DISPATCH */
> -	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
> -	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
> -	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
> -	SCX_KF_REST		= 1 << 4, /* other rq-locked operations */
> -
> -	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
> -				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
> -	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
> -};
> -
>  enum scx_dsq_lnode_flags {
>  	SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,
>  
> @@ -147,7 +124,6 @@ struct sched_ext_entity {
>  	s32			sticky_cpu;
>  	s32			holding_cpu;
>  	s32			selected_cpu;
> -	u32			kf_mask;	/* see scx_kf_mask above */
>  	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
>  	atomic_long_t		ops_state;
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index c337f6206ae5..7dc5f11be66b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1115,19 +1115,6 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now)
>  		return -(long)jiffies_to_msecs(now - at);
>  }
>  
> -/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
> -static u32 higher_bits(u32 flags)
> -{
> -	return ~((1 << fls(flags)) - 1);
> -}
> -
> -/* return the mask with only the highest bit set */
> -static u32 highest_bit(u32 flags)
> -{
> -	int bit = fls(flags);
> -	return ((u64)1 << bit) >> 1;
> -}
> -
>  static bool u32_before(u32 a, u32 b)
>  {
>  	return (s32)(a - b) < 0;
> @@ -1143,51 +1130,12 @@ static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
>  	return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
>  }
>  
> -/*
> - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
> - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
> - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
> - * whether it's running from an allowed context.
> - *
> - * @mask is constant, always inline to cull the mask calculations.
> - */
> -static __always_inline void scx_kf_allow(u32 mask)
> -{
> -	/* nesting is allowed only in increasing scx_kf_mask order */
> -	WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask,
> -		  "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n",
> -		  current->scx.kf_mask, mask);
> -	current->scx.kf_mask |= mask;
> -	barrier();
> -}
> -
> -static void scx_kf_disallow(u32 mask)
> -{
> -	barrier();
> -	current->scx.kf_mask &= ~mask;
> -}
> -
> -#define SCX_CALL_OP(mask, op, args...)						\
> -do {										\
> -	if (mask) {								\
> -		scx_kf_allow(mask);						\
> -		scx_ops.op(args);						\
> -		scx_kf_disallow(mask);						\
> -	} else {								\
> -		scx_ops.op(args);						\
> -	}									\
> -} while (0)
> +#define SCX_CALL_OP(op, args...)	scx_ops.op(args)
>  
> -#define SCX_CALL_OP_RET(mask, op, args...)					\
> +#define SCX_CALL_OP_RET(op, args...)						\
>  ({										\
>  	__typeof__(scx_ops.op(args)) __ret;					\
> -	if (mask) {								\
> -		scx_kf_allow(mask);						\
> -		__ret = scx_ops.op(args);					\
> -		scx_kf_disallow(mask);						\
> -	} else {								\
> -		__ret = scx_ops.op(args);					\
> -	}									\
> +	__ret = scx_ops.op(args);						\
>  	__ret;									\
>  })
>  
> @@ -1202,74 +1150,36 @@ do {										\
>   * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on
>   * the specific task.
>   */
> -#define SCX_CALL_OP_TASK(mask, op, task, args...)				\
> +#define SCX_CALL_OP_TASK(op, task, args...)					\
>  do {										\
> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>  	current->scx.kf_tasks[0] = task;					\
> -	SCX_CALL_OP(mask, op, task, ##args);					\
> +	SCX_CALL_OP(op, task, ##args);						\
>  	current->scx.kf_tasks[0] = NULL;					\
>  } while (0)
>  
> -#define SCX_CALL_OP_TASK_RET(mask, op, task, args...)				\
> +#define SCX_CALL_OP_TASK_RET(op, task, args...)					\
>  ({										\
>  	__typeof__(scx_ops.op(task, ##args)) __ret;				\
> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>  	current->scx.kf_tasks[0] = task;					\
> -	__ret = SCX_CALL_OP_RET(mask, op, task, ##args);			\
> +	__ret = SCX_CALL_OP_RET(op, task, ##args);				\
>  	current->scx.kf_tasks[0] = NULL;					\
>  	__ret;									\
>  })
>  
> -#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...)			\
> +#define SCX_CALL_OP_2TASKS_RET(op, task0, task1, args...)			\
>  ({										\
>  	__typeof__(scx_ops.op(task0, task1, ##args)) __ret;			\
> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>  	current->scx.kf_tasks[0] = task0;					\
>  	current->scx.kf_tasks[1] = task1;					\
> -	__ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args);		\
> +	__ret = SCX_CALL_OP_RET(op, task0, task1, ##args);			\
>  	current->scx.kf_tasks[0] = NULL;					\
>  	current->scx.kf_tasks[1] = NULL;					\
>  	__ret;									\
>  })
>  
> -/* @mask is constant, always inline to cull unnecessary branches */
> -static __always_inline bool scx_kf_allowed(u32 mask)
> -{
> -	if (unlikely(!(current->scx.kf_mask & mask))) {
> -		scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
> -			      mask, current->scx.kf_mask);
> -		return false;
> -	}
> -
> -	/*
> -	 * Enforce nesting boundaries. e.g. A kfunc which can be called from
> -	 * DISPATCH must not be called if we're running DEQUEUE which is nested
> -	 * inside ops.dispatch(). We don't need to check boundaries for any
> -	 * blocking kfuncs as the verifier ensures they're only called from
> -	 * sleepable progs.
> -	 */
> -	if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE &&
> -		     (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) {
> -		scx_ops_error("cpu_release kfunc called from a nested operation");
> -		return false;
> -	}
> -
> -	if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
> -		     (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
> -		scx_ops_error("dispatch kfunc called from a nested operation");
> -		return false;
> -	}
> -
> -	return true;
> -}
> -
>  /* see SCX_CALL_OP_TASK() */
> -static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
> -							struct task_struct *p)
> +static __always_inline bool scx_kf_allowed_on_arg_tasks(struct task_struct *p)
>  {
> -	if (!scx_kf_allowed(mask))
> -		return false;
> -
>  	if (unlikely((p != current->scx.kf_tasks[0] &&
>  		      p != current->scx.kf_tasks[1]))) {
>  		scx_ops_error("called on a task not being operated on");
> @@ -1279,11 +1189,6 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
>  	return true;
>  }
>  
> -static bool scx_kf_allowed_if_unlocked(void)
> -{
> -	return !current->scx.kf_mask;
> -}
> -
>  /**
>   * nldsq_next_task - Iterate to the next task in a non-local DSQ
>   * @dsq: user dsq being iterated
> @@ -2219,7 +2124,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
>  
> -	SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
> +	SCX_CALL_OP_TASK(enqueue, p, enq_flags);
>  
>  	*ddsp_taskp = NULL;
>  	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
> @@ -2316,7 +2221,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
>  	add_nr_running(rq, 1);
>  
>  	if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);
> +		SCX_CALL_OP_TASK(runnable, p, enq_flags);
>  
>  	if (enq_flags & SCX_ENQ_WAKEUP)
>  		touch_core_sched(rq, p);
> @@ -2351,7 +2256,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
>  		BUG();
>  	case SCX_OPSS_QUEUED:
>  		if (SCX_HAS_OP(dequeue))
> -			SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags);
> +			SCX_CALL_OP_TASK(dequeue, p, deq_flags);
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))
> @@ -2400,11 +2305,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
>  	 */
>  	if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
>  		update_curr_scx(rq);
> -		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false);
> +		SCX_CALL_OP_TASK(stopping, p, false);
>  	}
>  
>  	if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags);
> +		SCX_CALL_OP_TASK(quiescent, p, deq_flags);
>  
>  	if (deq_flags & SCX_DEQ_SLEEP)
>  		p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
> @@ -2424,7 +2329,7 @@ static void yield_task_scx(struct rq *rq)
>  	struct task_struct *p = rq->curr;
>  
>  	if (SCX_HAS_OP(yield))
> -		SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL);
> +		SCX_CALL_OP_2TASKS_RET(yield, p, NULL);
>  	else
>  		p->scx.slice = 0;
>  }
> @@ -2434,7 +2339,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
>  	struct task_struct *from = rq->curr;
>  
>  	if (SCX_HAS_OP(yield))
> -		return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to);
> +		return SCX_CALL_OP_2TASKS_RET(yield, from, to);
>  	else
>  		return false;
>  }
> @@ -2992,7 +2897,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  		 * emitted in switch_class().
>  		 */
>  		if (SCX_HAS_OP(cpu_acquire))
> -			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL);
> +			SCX_CALL_OP(cpu_acquire, cpu_of(rq), NULL);
>  		rq->scx.cpu_released = false;
>  	}
>  
> @@ -3037,8 +2942,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  	do {
>  		dspc->nr_tasks = 0;
>  
> -		SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
> -			    prev_on_scx ? prev : NULL);
> +		SCX_CALL_OP(dispatch, cpu_of(rq), prev_on_scx ? prev : NULL);
>  
>  		flush_dispatch_buf(rq);
>  
> @@ -3159,7 +3063,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
>  
>  	/* see dequeue_task_scx() on why we skip when !QUEUED */
>  	if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
> +		SCX_CALL_OP_TASK(running, p);
>  
>  	clr_task_runnable(p, true);
>  
> @@ -3240,8 +3144,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  				.task = next,
>  			};
>  
> -			SCX_CALL_OP(SCX_KF_CPU_RELEASE,
> -				    cpu_release, cpu_of(rq), &args);
> +			SCX_CALL_OP(cpu_release, cpu_of(rq), &args);
>  		}
>  		rq->scx.cpu_released = true;
>  	}
> @@ -3254,7 +3157,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  
>  	/* see dequeue_task_scx() on why we skip when !QUEUED */
>  	if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);
> +		SCX_CALL_OP_TASK(stopping, p, true);
>  
>  	if (p->scx.flags & SCX_TASK_QUEUED) {
>  		set_task_runnable(rq, p);
> @@ -3428,8 +3331,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>  		WARN_ON_ONCE(*ddsp_taskp);
>  		*ddsp_taskp = p;
>  
> -		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
> -					   select_cpu, p, prev_cpu, wake_flags);
> +		cpu = SCX_CALL_OP_TASK_RET(select_cpu, p, prev_cpu, wake_flags);
>  		p->scx.selected_cpu = cpu;
>  		*ddsp_taskp = NULL;
>  		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
> @@ -3473,8 +3375,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>  	 * designation pointless. Cast it away when calling the operation.
>  	 */
>  	if (SCX_HAS_OP(set_cpumask))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
> -				 (struct cpumask *)p->cpus_ptr);
> +		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
>  }
>  
>  static void handle_hotplug(struct rq *rq, bool online)
> @@ -3487,9 +3388,9 @@ static void handle_hotplug(struct rq *rq, bool online)
>  		scx_idle_update_selcpu_topology(&scx_ops);
>  
>  	if (online && SCX_HAS_OP(cpu_online))
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
> +		SCX_CALL_OP(cpu_online, cpu);
>  	else if (!online && SCX_HAS_OP(cpu_offline))
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu);
> +		SCX_CALL_OP(cpu_offline, cpu);
>  	else
>  		scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
>  			     "cpu %d going %s, exiting scheduler", cpu,
> @@ -3593,7 +3494,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
>  		curr->scx.slice = 0;
>  		touch_core_sched(rq, curr);
>  	} else if (SCX_HAS_OP(tick)) {
> -		SCX_CALL_OP(SCX_KF_REST, tick, curr);
> +		SCX_CALL_OP(tick, curr);
>  	}
>  
>  	if (!curr->scx.slice)
> @@ -3670,7 +3571,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool
>  			.fork = fork,
>  		};
>  
> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args);
> +		ret = SCX_CALL_OP_RET(init_task, p, &args);
>  		if (unlikely(ret)) {
>  			ret = ops_sanitize_err("init_task", ret);
>  			return ret;
> @@ -3727,11 +3628,11 @@ static void scx_ops_enable_task(struct task_struct *p)
>  	p->scx.weight = sched_weight_to_cgroup(weight);
>  
>  	if (SCX_HAS_OP(enable))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, enable, p);
> +		SCX_CALL_OP_TASK(enable, p);
>  	scx_set_task_state(p, SCX_TASK_ENABLED);
>  
>  	if (SCX_HAS_OP(set_weight))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
> +		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
>  }
>  
>  static void scx_ops_disable_task(struct task_struct *p)
> @@ -3740,7 +3641,7 @@ static void scx_ops_disable_task(struct task_struct *p)
>  	WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED);
>  
>  	if (SCX_HAS_OP(disable))
> -		SCX_CALL_OP(SCX_KF_REST, disable, p);
> +		SCX_CALL_OP(disable, p);
>  	scx_set_task_state(p, SCX_TASK_READY);
>  }
>  
> @@ -3769,7 +3670,7 @@ static void scx_ops_exit_task(struct task_struct *p)
>  	}
>  
>  	if (SCX_HAS_OP(exit_task))
> -		SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args);
> +		SCX_CALL_OP(exit_task, p, &args);
>  	scx_set_task_state(p, SCX_TASK_NONE);
>  }
>  
> @@ -3878,7 +3779,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p,
>  
>  	p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight));
>  	if (SCX_HAS_OP(set_weight))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
> +		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
>  }
>  
>  static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
> @@ -3894,8 +3795,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
>  	 * different scheduler class. Keep the BPF scheduler up-to-date.
>  	 */
>  	if (SCX_HAS_OP(set_cpumask))
> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
> -				 (struct cpumask *)p->cpus_ptr);
> +		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
>  }
>  
>  static void switched_from_scx(struct rq *rq, struct task_struct *p)
> @@ -3987,8 +3887,7 @@ int scx_tg_online(struct task_group *tg)
>  			struct scx_cgroup_init_args args =
>  				{ .weight = tg->scx_weight };
>  
> -			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
> -					      tg->css.cgroup, &args);
> +			ret = SCX_CALL_OP_RET(cgroup_init, tg->css.cgroup, &args);
>  			if (ret)
>  				ret = ops_sanitize_err("cgroup_init", ret);
>  		}
> @@ -4009,7 +3908,7 @@ void scx_tg_offline(struct task_group *tg)
>  	percpu_down_read(&scx_cgroup_rwsem);
>  
>  	if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup);
> +		SCX_CALL_OP(cgroup_exit, tg->css.cgroup);
>  	tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
>  
>  	percpu_up_read(&scx_cgroup_rwsem);
> @@ -4042,8 +3941,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
>  			continue;
>  
>  		if (SCX_HAS_OP(cgroup_prep_move)) {
> -			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move,
> -					      p, from, css->cgroup);
> +			ret = SCX_CALL_OP_RET(cgroup_prep_move, p, from, css->cgroup);
>  			if (ret)
>  				goto err;
>  		}
> @@ -4056,8 +3954,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
>  err:
>  	cgroup_taskset_for_each(p, css, tset) {
>  		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
> -				    p->scx.cgrp_moving_from, css->cgroup);
> +			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
>  		p->scx.cgrp_moving_from = NULL;
>  	}
>  
> @@ -4075,8 +3972,7 @@ void scx_cgroup_move_task(struct task_struct *p)
>  	 * cgrp_moving_from set.
>  	 */
>  	if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
> -		SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p,
> -			p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
> +		SCX_CALL_OP_TASK(cgroup_move, p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
>  	p->scx.cgrp_moving_from = NULL;
>  }
>  
> @@ -4095,8 +3991,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
>  
>  	cgroup_taskset_for_each(p, css, tset) {
>  		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
> -				    p->scx.cgrp_moving_from, css->cgroup);
> +			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
>  		p->scx.cgrp_moving_from = NULL;
>  	}
>  out_unlock:
> @@ -4109,8 +4004,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight)
>  
>  	if (scx_cgroup_enabled && tg->scx_weight != weight) {
>  		if (SCX_HAS_OP(cgroup_set_weight))
> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight,
> -				    tg_cgrp(tg), weight);
> +			SCX_CALL_OP(cgroup_set_weight, tg_cgrp(tg), weight);
>  		tg->scx_weight = weight;
>  	}
>  
> @@ -4300,7 +4194,7 @@ static void scx_cgroup_exit(void)
>  			continue;
>  		rcu_read_unlock();
>  
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup);
> +		SCX_CALL_OP(cgroup_exit, css->cgroup);
>  
>  		rcu_read_lock();
>  		css_put(css);
> @@ -4343,8 +4237,7 @@ static int scx_cgroup_init(void)
>  			continue;
>  		rcu_read_unlock();
>  
> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
> -				      css->cgroup, &args);
> +		ret = SCX_CALL_OP_RET(cgroup_init, css->cgroup, &args);
>  		if (ret) {
>  			css_put(css);
>  			scx_ops_error("ops.cgroup_init() failed (%d)", ret);
> @@ -4840,7 +4733,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>  	}
>  
>  	if (scx_ops.exit)
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);
> +		SCX_CALL_OP(exit, ei);
>  
>  	cancel_delayed_work_sync(&scx_watchdog_work);
>  
> @@ -5047,7 +4940,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx,
>  
>  	if (SCX_HAS_OP(dump_task)) {
>  		ops_dump_init(s, "    ");
> -		SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p);
> +		SCX_CALL_OP(dump_task, dctx, p);
>  		ops_dump_exit();
>  	}
>  
> @@ -5094,7 +4987,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>  
>  	if (SCX_HAS_OP(dump)) {
>  		ops_dump_init(&s, "");
> -		SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx);
> +		SCX_CALL_OP(dump, &dctx);
>  		ops_dump_exit();
>  	}
>  
> @@ -5151,7 +5044,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>  		used = seq_buf_used(&ns);
>  		if (SCX_HAS_OP(dump_cpu)) {
>  			ops_dump_init(&ns, "  ");
> -			SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle);
> +			SCX_CALL_OP(dump_cpu, &dctx, cpu, idle);
>  			ops_dump_exit();
>  		}
>  
> @@ -5405,7 +5298,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	cpus_read_lock();
>  
>  	if (scx_ops.init) {
> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init);
> +		ret = SCX_CALL_OP_RET(init);
>  		if (ret) {
>  			ret = ops_sanitize_err("init", ret);
>  			cpus_read_unlock();
> @@ -6146,9 +6039,6 @@ void __init init_sched_ext_class(void)
>   */
>  static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
>  {
> -	if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
> -		return false;
> -
>  	lockdep_assert_irqs_disabled();
>  
>  	if (unlikely(!p)) {
> @@ -6310,9 +6200,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  	bool in_balance;
>  	unsigned long flags;
>  
> -	if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
> -		return false;
> -
>  	/*
>  	 * Can be called from either ops.dispatch() locking this_rq() or any
>  	 * context where no rq lock is held. If latter, lock @p's task_rq which
> @@ -6395,9 +6282,6 @@ __bpf_kfunc_start_defs();
>   */
>  __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void)
>  {
> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
> -		return 0;
> -
>  	return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor);
>  }
>  
> @@ -6411,9 +6295,6 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(void)
>  {
>  	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
>  
> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
> -		return;
> -
>  	if (dspc->cursor > 0)
>  		dspc->cursor--;
>  	else
> @@ -6439,9 +6320,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id)
>  	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
>  	struct scx_dispatch_q *dsq;
>  
> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
> -		return false;
> -
>  	flush_dispatch_buf(dspc->rq);
>  
>  	dsq = find_user_dsq(dsq_id);
> @@ -6632,9 +6510,6 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
>  	struct rq *rq;
>  	struct task_struct *p, *n;
>  
> -	if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
> -		return 0;
> -
>  	rq = cpu_rq(smp_processor_id());
>  	lockdep_assert_rq_held(rq);
>  
> @@ -7239,7 +7114,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
>  	struct task_group *tg = p->sched_task_group;
>  	struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
>  
> -	if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
> +	if (!scx_kf_allowed_on_arg_tasks(p))
>  		goto out;
>  
>  	cgrp = tg_cgrp(tg);
> @@ -7479,10 +7354,6 @@ static int __init scx_init(void)
>  	 *
>  	 * Some kfuncs are context-sensitive and can only be called from
>  	 * specific SCX ops. They are grouped into BTF sets accordingly.
> -	 * Unfortunately, BPF currently doesn't have a way of enforcing such
> -	 * restrictions. Eventually, the verifier should be able to enforce
> -	 * them. For now, register them the same and make each kfunc explicitly
> -	 * check using scx_kf_allowed().
>  	 */
>  	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
>  					     &scx_kfunc_set_ops_context_sensitive)) ||
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index efb6077810d8..e241935021eb 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -658,7 +658,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
>  	 * managed by put_prev_task_idle()/set_next_task_idle().
>  	 */
>  	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
> -		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> +		SCX_CALL_OP(update_idle, cpu_of(rq), idle);
>  
>  	/*
>  	 * Update the idle masks:
> @@ -803,9 +803,6 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  	if (!check_builtin_idle_enabled())
>  		goto prev_cpu;
>  
> -	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> -		goto prev_cpu;
> -
>  #ifdef CONFIG_SMP
>  	return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
>  #endif
> -- 
> 2.39.5
> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts
  2025-02-27 16:24   ` Andrea Righi
@ 2025-02-27 16:49     ` Juntong Deng
  0 siblings, 0 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-27 16:49 UTC (permalink / raw)
  To: Andrea Righi
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void,
	changwoo, bpf, linux-kernel

On 2025/2/27 16:24, Andrea Righi wrote:
> On Wed, Feb 26, 2025 at 07:28:19PM +0000, Juntong Deng wrote:
>> Currently, kfunc filters already support filtering based on struct_ops
>> context information.
>>
>> The BPF verifier can check context-sensitive kfuncs before the SCX
>> program is run, avoiding runtime overhead.
>>
>> Therefore we no longer need mask-based runtime restrictions.
>>
>> This patch removes the mask-based runtime restrictions.
> 
> You may have missed scx_prio_less(), that is still using SCX_KF_REST:
> 
> kernel/sched/ext.c: In function ‘scx_prio_less’:
> kernel/sched/ext.c:1171:27: error: ‘struct sched_ext_ops’ has no member named ‘SCX_KF_REST’
>   1171 |         __typeof__(scx_ops.op(task0, task1, ##args)) __ret;                     \
>        |                           ^
> 
> I think you just need to add:
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 4a4713c3af67b..51c13b8c27743 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3302,7 +3302,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>   	 * verifier.
>   	 */
>   	if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a)))
> -		return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before,
> +		return SCX_CALL_OP_2TASKS_RET(core_sched_before,
>   					      (struct task_struct *)a,
>   					      (struct task_struct *)b);
>   	else
> 
> -Andrea
> 

Thanks for pointing this out.

I made the mistake of not enabling CONFIG_SCHED_CORE, which led to not
noticing this issue.

If you find any other issues, please let me know.

I will fix these issues in the next version.

>>
>> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
>> ---
>>   include/linux/sched/ext.h |  24 ----
>>   kernel/sched/ext.c        | 227 ++++++++------------------------------
>>   kernel/sched/ext_idle.c   |   5 +-
>>   3 files changed, 50 insertions(+), 206 deletions(-)
>>
>> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
>> index f7545430a548..9980d6b55c84 100644
>> --- a/include/linux/sched/ext.h
>> +++ b/include/linux/sched/ext.h
>> @@ -96,29 +96,6 @@ enum scx_ent_dsq_flags {
>>   	SCX_TASK_DSQ_ON_PRIQ	= 1 << 0, /* task is queued on the priority queue of a dsq */
>>   };
>>   
>> -/*
>> - * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
>> - * everywhere and the following bits track which kfunc sets are currently
>> - * allowed for %current. This simple per-task tracking works because SCX ops
>> - * nest in a limited way. BPF will likely implement a way to allow and disallow
>> - * kfuncs depending on the calling context which will replace this manual
>> - * mechanism. See scx_kf_allow().
>> - */
>> -enum scx_kf_mask {
>> -	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
>> -	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
>> -	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
>> -	/* ops.dequeue (in REST) may be nested inside DISPATCH */
>> -	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
>> -	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
>> -	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
>> -	SCX_KF_REST		= 1 << 4, /* other rq-locked operations */
>> -
>> -	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
>> -				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
>> -	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
>> -};
>> -
>>   enum scx_dsq_lnode_flags {
>>   	SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,
>>   
>> @@ -147,7 +124,6 @@ struct sched_ext_entity {
>>   	s32			sticky_cpu;
>>   	s32			holding_cpu;
>>   	s32			selected_cpu;
>> -	u32			kf_mask;	/* see scx_kf_mask above */
>>   	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
>>   	atomic_long_t		ops_state;
>>   
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index c337f6206ae5..7dc5f11be66b 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -1115,19 +1115,6 @@ static long jiffies_delta_msecs(unsigned long at, unsigned long now)
>>   		return -(long)jiffies_to_msecs(now - at);
>>   }
>>   
>> -/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
>> -static u32 higher_bits(u32 flags)
>> -{
>> -	return ~((1 << fls(flags)) - 1);
>> -}
>> -
>> -/* return the mask with only the highest bit set */
>> -static u32 highest_bit(u32 flags)
>> -{
>> -	int bit = fls(flags);
>> -	return ((u64)1 << bit) >> 1;
>> -}
>> -
>>   static bool u32_before(u32 a, u32 b)
>>   {
>>   	return (s32)(a - b) < 0;
>> @@ -1143,51 +1130,12 @@ static struct scx_dispatch_q *find_user_dsq(u64 dsq_id)
>>   	return rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
>>   }
>>   
>> -/*
>> - * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
>> - * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
>> - * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
>> - * whether it's running from an allowed context.
>> - *
>> - * @mask is constant, always inline to cull the mask calculations.
>> - */
>> -static __always_inline void scx_kf_allow(u32 mask)
>> -{
>> -	/* nesting is allowed only in increasing scx_kf_mask order */
>> -	WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask,
>> -		  "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n",
>> -		  current->scx.kf_mask, mask);
>> -	current->scx.kf_mask |= mask;
>> -	barrier();
>> -}
>> -
>> -static void scx_kf_disallow(u32 mask)
>> -{
>> -	barrier();
>> -	current->scx.kf_mask &= ~mask;
>> -}
>> -
>> -#define SCX_CALL_OP(mask, op, args...)						\
>> -do {										\
>> -	if (mask) {								\
>> -		scx_kf_allow(mask);						\
>> -		scx_ops.op(args);						\
>> -		scx_kf_disallow(mask);						\
>> -	} else {								\
>> -		scx_ops.op(args);						\
>> -	}									\
>> -} while (0)
>> +#define SCX_CALL_OP(op, args...)	scx_ops.op(args)
>>   
>> -#define SCX_CALL_OP_RET(mask, op, args...)					\
>> +#define SCX_CALL_OP_RET(op, args...)						\
>>   ({										\
>>   	__typeof__(scx_ops.op(args)) __ret;					\
>> -	if (mask) {								\
>> -		scx_kf_allow(mask);						\
>> -		__ret = scx_ops.op(args);					\
>> -		scx_kf_disallow(mask);						\
>> -	} else {								\
>> -		__ret = scx_ops.op(args);					\
>> -	}									\
>> +	__ret = scx_ops.op(args);						\
>>   	__ret;									\
>>   })
>>   
>> @@ -1202,74 +1150,36 @@ do {										\
>>    * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on
>>    * the specific task.
>>    */
>> -#define SCX_CALL_OP_TASK(mask, op, task, args...)				\
>> +#define SCX_CALL_OP_TASK(op, task, args...)					\
>>   do {										\
>> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>>   	current->scx.kf_tasks[0] = task;					\
>> -	SCX_CALL_OP(mask, op, task, ##args);					\
>> +	SCX_CALL_OP(op, task, ##args);						\
>>   	current->scx.kf_tasks[0] = NULL;					\
>>   } while (0)
>>   
>> -#define SCX_CALL_OP_TASK_RET(mask, op, task, args...)				\
>> +#define SCX_CALL_OP_TASK_RET(op, task, args...)					\
>>   ({										\
>>   	__typeof__(scx_ops.op(task, ##args)) __ret;				\
>> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>>   	current->scx.kf_tasks[0] = task;					\
>> -	__ret = SCX_CALL_OP_RET(mask, op, task, ##args);			\
>> +	__ret = SCX_CALL_OP_RET(op, task, ##args);				\
>>   	current->scx.kf_tasks[0] = NULL;					\
>>   	__ret;									\
>>   })
>>   
>> -#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...)			\
>> +#define SCX_CALL_OP_2TASKS_RET(op, task0, task1, args...)			\
>>   ({										\
>>   	__typeof__(scx_ops.op(task0, task1, ##args)) __ret;			\
>> -	BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL);				\
>>   	current->scx.kf_tasks[0] = task0;					\
>>   	current->scx.kf_tasks[1] = task1;					\
>> -	__ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args);		\
>> +	__ret = SCX_CALL_OP_RET(op, task0, task1, ##args);			\
>>   	current->scx.kf_tasks[0] = NULL;					\
>>   	current->scx.kf_tasks[1] = NULL;					\
>>   	__ret;									\
>>   })
>>   
>> -/* @mask is constant, always inline to cull unnecessary branches */
>> -static __always_inline bool scx_kf_allowed(u32 mask)
>> -{
>> -	if (unlikely(!(current->scx.kf_mask & mask))) {
>> -		scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
>> -			      mask, current->scx.kf_mask);
>> -		return false;
>> -	}
>> -
>> -	/*
>> -	 * Enforce nesting boundaries. e.g. A kfunc which can be called from
>> -	 * DISPATCH must not be called if we're running DEQUEUE which is nested
>> -	 * inside ops.dispatch(). We don't need to check boundaries for any
>> -	 * blocking kfuncs as the verifier ensures they're only called from
>> -	 * sleepable progs.
>> -	 */
>> -	if (unlikely(highest_bit(mask) == SCX_KF_CPU_RELEASE &&
>> -		     (current->scx.kf_mask & higher_bits(SCX_KF_CPU_RELEASE)))) {
>> -		scx_ops_error("cpu_release kfunc called from a nested operation");
>> -		return false;
>> -	}
>> -
>> -	if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
>> -		     (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
>> -		scx_ops_error("dispatch kfunc called from a nested operation");
>> -		return false;
>> -	}
>> -
>> -	return true;
>> -}
>> -
>>   /* see SCX_CALL_OP_TASK() */
>> -static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
>> -							struct task_struct *p)
>> +static __always_inline bool scx_kf_allowed_on_arg_tasks(struct task_struct *p)
>>   {
>> -	if (!scx_kf_allowed(mask))
>> -		return false;
>> -
>>   	if (unlikely((p != current->scx.kf_tasks[0] &&
>>   		      p != current->scx.kf_tasks[1]))) {
>>   		scx_ops_error("called on a task not being operated on");
>> @@ -1279,11 +1189,6 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
>>   	return true;
>>   }
>>   
>> -static bool scx_kf_allowed_if_unlocked(void)
>> -{
>> -	return !current->scx.kf_mask;
>> -}
>> -
>>   /**
>>    * nldsq_next_task - Iterate to the next task in a non-local DSQ
>>    * @dsq: user dsq being iterated
>> @@ -2219,7 +2124,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>>   	WARN_ON_ONCE(*ddsp_taskp);
>>   	*ddsp_taskp = p;
>>   
>> -	SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
>> +	SCX_CALL_OP_TASK(enqueue, p, enq_flags);
>>   
>>   	*ddsp_taskp = NULL;
>>   	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
>> @@ -2316,7 +2221,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
>>   	add_nr_running(rq, 1);
>>   
>>   	if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags);
>> +		SCX_CALL_OP_TASK(runnable, p, enq_flags);
>>   
>>   	if (enq_flags & SCX_ENQ_WAKEUP)
>>   		touch_core_sched(rq, p);
>> @@ -2351,7 +2256,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags)
>>   		BUG();
>>   	case SCX_OPSS_QUEUED:
>>   		if (SCX_HAS_OP(dequeue))
>> -			SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags);
>> +			SCX_CALL_OP_TASK(dequeue, p, deq_flags);
>>   
>>   		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>>   					    SCX_OPSS_NONE))
>> @@ -2400,11 +2305,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags
>>   	 */
>>   	if (SCX_HAS_OP(stopping) && task_current(rq, p)) {
>>   		update_curr_scx(rq);
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false);
>> +		SCX_CALL_OP_TASK(stopping, p, false);
>>   	}
>>   
>>   	if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags);
>> +		SCX_CALL_OP_TASK(quiescent, p, deq_flags);
>>   
>>   	if (deq_flags & SCX_DEQ_SLEEP)
>>   		p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
>> @@ -2424,7 +2329,7 @@ static void yield_task_scx(struct rq *rq)
>>   	struct task_struct *p = rq->curr;
>>   
>>   	if (SCX_HAS_OP(yield))
>> -		SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL);
>> +		SCX_CALL_OP_2TASKS_RET(yield, p, NULL);
>>   	else
>>   		p->scx.slice = 0;
>>   }
>> @@ -2434,7 +2339,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
>>   	struct task_struct *from = rq->curr;
>>   
>>   	if (SCX_HAS_OP(yield))
>> -		return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to);
>> +		return SCX_CALL_OP_2TASKS_RET(yield, from, to);
>>   	else
>>   		return false;
>>   }
>> @@ -2992,7 +2897,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>>   		 * emitted in switch_class().
>>   		 */
>>   		if (SCX_HAS_OP(cpu_acquire))
>> -			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL);
>> +			SCX_CALL_OP(cpu_acquire, cpu_of(rq), NULL);
>>   		rq->scx.cpu_released = false;
>>   	}
>>   
>> @@ -3037,8 +2942,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>>   	do {
>>   		dspc->nr_tasks = 0;
>>   
>> -		SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
>> -			    prev_on_scx ? prev : NULL);
>> +		SCX_CALL_OP(dispatch, cpu_of(rq), prev_on_scx ? prev : NULL);
>>   
>>   		flush_dispatch_buf(rq);
>>   
>> @@ -3159,7 +3063,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
>>   
>>   	/* see dequeue_task_scx() on why we skip when !QUEUED */
>>   	if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
>> +		SCX_CALL_OP_TASK(running, p);
>>   
>>   	clr_task_runnable(p, true);
>>   
>> @@ -3240,8 +3144,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>>   				.task = next,
>>   			};
>>   
>> -			SCX_CALL_OP(SCX_KF_CPU_RELEASE,
>> -				    cpu_release, cpu_of(rq), &args);
>> +			SCX_CALL_OP(cpu_release, cpu_of(rq), &args);
>>   		}
>>   		rq->scx.cpu_released = true;
>>   	}
>> @@ -3254,7 +3157,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>>   
>>   	/* see dequeue_task_scx() on why we skip when !QUEUED */
>>   	if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true);
>> +		SCX_CALL_OP_TASK(stopping, p, true);
>>   
>>   	if (p->scx.flags & SCX_TASK_QUEUED) {
>>   		set_task_runnable(rq, p);
>> @@ -3428,8 +3331,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>>   		WARN_ON_ONCE(*ddsp_taskp);
>>   		*ddsp_taskp = p;
>>   
>> -		cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU,
>> -					   select_cpu, p, prev_cpu, wake_flags);
>> +		cpu = SCX_CALL_OP_TASK_RET(select_cpu, p, prev_cpu, wake_flags);
>>   		p->scx.selected_cpu = cpu;
>>   		*ddsp_taskp = NULL;
>>   		if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
>> @@ -3473,8 +3375,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>>   	 * designation pointless. Cast it away when calling the operation.
>>   	 */
>>   	if (SCX_HAS_OP(set_cpumask))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
>> -				 (struct cpumask *)p->cpus_ptr);
>> +		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
>>   }
>>   
>>   static void handle_hotplug(struct rq *rq, bool online)
>> @@ -3487,9 +3388,9 @@ static void handle_hotplug(struct rq *rq, bool online)
>>   		scx_idle_update_selcpu_topology(&scx_ops);
>>   
>>   	if (online && SCX_HAS_OP(cpu_online))
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
>> +		SCX_CALL_OP(cpu_online, cpu);
>>   	else if (!online && SCX_HAS_OP(cpu_offline))
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu);
>> +		SCX_CALL_OP(cpu_offline, cpu);
>>   	else
>>   		scx_ops_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
>>   			     "cpu %d going %s, exiting scheduler", cpu,
>> @@ -3593,7 +3494,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
>>   		curr->scx.slice = 0;
>>   		touch_core_sched(rq, curr);
>>   	} else if (SCX_HAS_OP(tick)) {
>> -		SCX_CALL_OP(SCX_KF_REST, tick, curr);
>> +		SCX_CALL_OP(tick, curr);
>>   	}
>>   
>>   	if (!curr->scx.slice)
>> @@ -3670,7 +3571,7 @@ static int scx_ops_init_task(struct task_struct *p, struct task_group *tg, bool
>>   			.fork = fork,
>>   		};
>>   
>> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args);
>> +		ret = SCX_CALL_OP_RET(init_task, p, &args);
>>   		if (unlikely(ret)) {
>>   			ret = ops_sanitize_err("init_task", ret);
>>   			return ret;
>> @@ -3727,11 +3628,11 @@ static void scx_ops_enable_task(struct task_struct *p)
>>   	p->scx.weight = sched_weight_to_cgroup(weight);
>>   
>>   	if (SCX_HAS_OP(enable))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, enable, p);
>> +		SCX_CALL_OP_TASK(enable, p);
>>   	scx_set_task_state(p, SCX_TASK_ENABLED);
>>   
>>   	if (SCX_HAS_OP(set_weight))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
>> +		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
>>   }
>>   
>>   static void scx_ops_disable_task(struct task_struct *p)
>> @@ -3740,7 +3641,7 @@ static void scx_ops_disable_task(struct task_struct *p)
>>   	WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED);
>>   
>>   	if (SCX_HAS_OP(disable))
>> -		SCX_CALL_OP(SCX_KF_REST, disable, p);
>> +		SCX_CALL_OP(disable, p);
>>   	scx_set_task_state(p, SCX_TASK_READY);
>>   }
>>   
>> @@ -3769,7 +3670,7 @@ static void scx_ops_exit_task(struct task_struct *p)
>>   	}
>>   
>>   	if (SCX_HAS_OP(exit_task))
>> -		SCX_CALL_OP(SCX_KF_REST, exit_task, p, &args);
>> +		SCX_CALL_OP(exit_task, p, &args);
>>   	scx_set_task_state(p, SCX_TASK_NONE);
>>   }
>>   
>> @@ -3878,7 +3779,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p,
>>   
>>   	p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight));
>>   	if (SCX_HAS_OP(set_weight))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight);
>> +		SCX_CALL_OP_TASK(set_weight, p, p->scx.weight);
>>   }
>>   
>>   static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
>> @@ -3894,8 +3795,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
>>   	 * different scheduler class. Keep the BPF scheduler up-to-date.
>>   	 */
>>   	if (SCX_HAS_OP(set_cpumask))
>> -		SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p,
>> -				 (struct cpumask *)p->cpus_ptr);
>> +		SCX_CALL_OP_TASK(set_cpumask, p, (struct cpumask *)p->cpus_ptr);
>>   }
>>   
>>   static void switched_from_scx(struct rq *rq, struct task_struct *p)
>> @@ -3987,8 +3887,7 @@ int scx_tg_online(struct task_group *tg)
>>   			struct scx_cgroup_init_args args =
>>   				{ .weight = tg->scx_weight };
>>   
>> -			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
>> -					      tg->css.cgroup, &args);
>> +			ret = SCX_CALL_OP_RET(cgroup_init, tg->css.cgroup, &args);
>>   			if (ret)
>>   				ret = ops_sanitize_err("cgroup_init", ret);
>>   		}
>> @@ -4009,7 +3908,7 @@ void scx_tg_offline(struct task_group *tg)
>>   	percpu_down_read(&scx_cgroup_rwsem);
>>   
>>   	if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED))
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup);
>> +		SCX_CALL_OP(cgroup_exit, tg->css.cgroup);
>>   	tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED);
>>   
>>   	percpu_up_read(&scx_cgroup_rwsem);
>> @@ -4042,8 +3941,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
>>   			continue;
>>   
>>   		if (SCX_HAS_OP(cgroup_prep_move)) {
>> -			ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move,
>> -					      p, from, css->cgroup);
>> +			ret = SCX_CALL_OP_RET(cgroup_prep_move, p, from, css->cgroup);
>>   			if (ret)
>>   				goto err;
>>   		}
>> @@ -4056,8 +3954,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
>>   err:
>>   	cgroup_taskset_for_each(p, css, tset) {
>>   		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
>> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
>> -				    p->scx.cgrp_moving_from, css->cgroup);
>> +			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
>>   		p->scx.cgrp_moving_from = NULL;
>>   	}
>>   
>> @@ -4075,8 +3972,7 @@ void scx_cgroup_move_task(struct task_struct *p)
>>   	 * cgrp_moving_from set.
>>   	 */
>>   	if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from))
>> -		SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p,
>> -			p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
>> +		SCX_CALL_OP_TASK(cgroup_move, p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p)));
>>   	p->scx.cgrp_moving_from = NULL;
>>   }
>>   
>> @@ -4095,8 +3991,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
>>   
>>   	cgroup_taskset_for_each(p, css, tset) {
>>   		if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from)
>> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p,
>> -				    p->scx.cgrp_moving_from, css->cgroup);
>> +			SCX_CALL_OP(cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup);
>>   		p->scx.cgrp_moving_from = NULL;
>>   	}
>>   out_unlock:
>> @@ -4109,8 +4004,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight)
>>   
>>   	if (scx_cgroup_enabled && tg->scx_weight != weight) {
>>   		if (SCX_HAS_OP(cgroup_set_weight))
>> -			SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight,
>> -				    tg_cgrp(tg), weight);
>> +			SCX_CALL_OP(cgroup_set_weight, tg_cgrp(tg), weight);
>>   		tg->scx_weight = weight;
>>   	}
>>   
>> @@ -4300,7 +4194,7 @@ static void scx_cgroup_exit(void)
>>   			continue;
>>   		rcu_read_unlock();
>>   
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup);
>> +		SCX_CALL_OP(cgroup_exit, css->cgroup);
>>   
>>   		rcu_read_lock();
>>   		css_put(css);
>> @@ -4343,8 +4237,7 @@ static int scx_cgroup_init(void)
>>   			continue;
>>   		rcu_read_unlock();
>>   
>> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
>> -				      css->cgroup, &args);
>> +		ret = SCX_CALL_OP_RET(cgroup_init, css->cgroup, &args);
>>   		if (ret) {
>>   			css_put(css);
>>   			scx_ops_error("ops.cgroup_init() failed (%d)", ret);
>> @@ -4840,7 +4733,7 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>>   	}
>>   
>>   	if (scx_ops.exit)
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);
>> +		SCX_CALL_OP(exit, ei);
>>   
>>   	cancel_delayed_work_sync(&scx_watchdog_work);
>>   
>> @@ -5047,7 +4940,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx,
>>   
>>   	if (SCX_HAS_OP(dump_task)) {
>>   		ops_dump_init(s, "    ");
>> -		SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p);
>> +		SCX_CALL_OP(dump_task, dctx, p);
>>   		ops_dump_exit();
>>   	}
>>   
>> @@ -5094,7 +4987,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>>   
>>   	if (SCX_HAS_OP(dump)) {
>>   		ops_dump_init(&s, "");
>> -		SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx);
>> +		SCX_CALL_OP(dump, &dctx);
>>   		ops_dump_exit();
>>   	}
>>   
>> @@ -5151,7 +5044,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
>>   		used = seq_buf_used(&ns);
>>   		if (SCX_HAS_OP(dump_cpu)) {
>>   			ops_dump_init(&ns, "  ");
>> -			SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle);
>> +			SCX_CALL_OP(dump_cpu, &dctx, cpu, idle);
>>   			ops_dump_exit();
>>   		}
>>   
>> @@ -5405,7 +5298,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>>   	cpus_read_lock();
>>   
>>   	if (scx_ops.init) {
>> -		ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init);
>> +		ret = SCX_CALL_OP_RET(init);
>>   		if (ret) {
>>   			ret = ops_sanitize_err("init", ret);
>>   			cpus_read_unlock();
>> @@ -6146,9 +6039,6 @@ void __init init_sched_ext_class(void)
>>    */
>>   static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
>>   {
>> -	if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
>> -		return false;
>> -
>>   	lockdep_assert_irqs_disabled();
>>   
>>   	if (unlikely(!p)) {
>> @@ -6310,9 +6200,6 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>>   	bool in_balance;
>>   	unsigned long flags;
>>   
>> -	if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
>> -		return false;
>> -
>>   	/*
>>   	 * Can be called from either ops.dispatch() locking this_rq() or any
>>   	 * context where no rq lock is held. If latter, lock @p's task_rq which
>> @@ -6395,9 +6282,6 @@ __bpf_kfunc_start_defs();
>>    */
>>   __bpf_kfunc u32 scx_bpf_dispatch_nr_slots(void)
>>   {
>> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
>> -		return 0;
>> -
>>   	return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx->cursor);
>>   }
>>   
>> @@ -6411,9 +6295,6 @@ __bpf_kfunc void scx_bpf_dispatch_cancel(void)
>>   {
>>   	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
>>   
>> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
>> -		return;
>> -
>>   	if (dspc->cursor > 0)
>>   		dspc->cursor--;
>>   	else
>> @@ -6439,9 +6320,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local(u64 dsq_id)
>>   	struct scx_dsp_ctx *dspc = this_cpu_ptr(scx_dsp_ctx);
>>   	struct scx_dispatch_q *dsq;
>>   
>> -	if (!scx_kf_allowed(SCX_KF_DISPATCH))
>> -		return false;
>> -
>>   	flush_dispatch_buf(dspc->rq);
>>   
>>   	dsq = find_user_dsq(dsq_id);
>> @@ -6632,9 +6510,6 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
>>   	struct rq *rq;
>>   	struct task_struct *p, *n;
>>   
>> -	if (!scx_kf_allowed(SCX_KF_CPU_RELEASE))
>> -		return 0;
>> -
>>   	rq = cpu_rq(smp_processor_id());
>>   	lockdep_assert_rq_held(rq);
>>   
>> @@ -7239,7 +7114,7 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p)
>>   	struct task_group *tg = p->sched_task_group;
>>   	struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
>>   
>> -	if (!scx_kf_allowed_on_arg_tasks(__SCX_KF_RQ_LOCKED, p))
>> +	if (!scx_kf_allowed_on_arg_tasks(p))
>>   		goto out;
>>   
>>   	cgrp = tg_cgrp(tg);
>> @@ -7479,10 +7354,6 @@ static int __init scx_init(void)
>>   	 *
>>   	 * Some kfuncs are context-sensitive and can only be called from
>>   	 * specific SCX ops. They are grouped into BTF sets accordingly.
>> -	 * Unfortunately, BPF currently doesn't have a way of enforcing such
>> -	 * restrictions. Eventually, the verifier should be able to enforce
>> -	 * them. For now, register them the same and make each kfunc explicitly
>> -	 * check using scx_kf_allowed().
>>   	 */
>>   	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
>>   					     &scx_kfunc_set_ops_context_sensitive)) ||
>> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
>> index efb6077810d8..e241935021eb 100644
>> --- a/kernel/sched/ext_idle.c
>> +++ b/kernel/sched/ext_idle.c
>> @@ -658,7 +658,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
>>   	 * managed by put_prev_task_idle()/set_next_task_idle().
>>   	 */
>>   	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
>> -		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
>> +		SCX_CALL_OP(update_idle, cpu_of(rq), idle);
>>   
>>   	/*
>>   	 * Update the idle masks:
>> @@ -803,9 +803,6 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>>   	if (!check_builtin_idle_enabled())
>>   		goto prev_cpu;
>>   
>> -	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
>> -		goto prev_cpu;
>> -
>>   #ifdef CONFIG_SMP
>>   	return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
>>   #endif
>> -- 
>> 2.39.5
>>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations Juntong Deng
@ 2025-02-27 19:19   ` Tejun Heo
  2025-02-28 21:57   ` Tejun Heo
  1 sibling, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-02-27 19:19 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On Wed, Feb 26, 2025 at 07:28:17PM +0000, Juntong Deng wrote:
> This patch declare context-sensitive kfunc groups that can be used by
             ^
             declares

> different SCX operations.
...
> +/* Each flag corresponds to a btf kfunc id set */
> +enum scx_ops_kf_flags {
> +	SCX_OPS_KF_ANY			= 0,
> +	SCX_OPS_KF_UNLOCKED		= 1 << 1,

nit: Any specific reason to skip bit 0?

> +	SCX_OPS_KF_CPU_RELEASE		= 1 << 2,
> +	SCX_OPS_KF_DISPATCH		= 1 << 3,
> +	SCX_OPS_KF_ENQUEUE		= 1 << 4,
> +	SCX_OPS_KF_SELECT_CPU		= 1 << 5,
> +};
> +
> +static const u32 scx_ops_context_flags[] = {
> +	[SCX_OP_IDX(select_cpu)]		= SCX_OPS_KF_SELECT_CPU | SCX_OPS_KF_ENQUEUE,
> +	[SCX_OP_IDX(enqueue)]			= SCX_OPS_KF_ENQUEUE,
> +	[SCX_OP_IDX(dequeue)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(dispatch)]			= SCX_OPS_KF_DISPATCH | SCX_OPS_KF_ENQUEUE,
> +	[SCX_OP_IDX(tick)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(runnable)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(running)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(stopping)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(quiescent)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(yield)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(core_sched_before)]		= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(set_weight)]		= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(set_cpumask)]		= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(update_idle)]		= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(cpu_acquire)]		= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(cpu_release)]		= SCX_OPS_KF_CPU_RELEASE,
> +	[SCX_OP_IDX(init_task)]			= SCX_OPS_KF_UNLOCKED,
> +	[SCX_OP_IDX(exit_task)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(enable)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(disable)]			= SCX_OPS_KF_ANY,
> +	[SCX_OP_IDX(dump)]			= SCX_OPS_KF_DISPATCH,

Shouldn't this be SCX_OPS_KF_UNLOCKED?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs Juntong Deng
@ 2025-02-27 20:25   ` Tejun Heo
  2025-02-27 21:23     ` Juntong Deng
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-02-27 20:25 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

Hello,

On Wed, Feb 26, 2025 at 07:28:18PM +0000, Juntong Deng wrote:
> +BTF_KFUNCS_START(scx_kfunc_ids_ops_context_sensitive)
> +/* scx_kfunc_ids_select_cpu */
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +/* scx_kfunc_ids_enqueue_dispatch */
> +BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
> +/* scx_kfunc_ids_dispatch */
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local)
> +BTF_ID_FLAGS(func, scx_bpf_consume)
> +/* scx_kfunc_ids_cpu_release */
> +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
> +/* scx_kfunc_ids_unlocked */
> +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
> +/* Intersection of scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked */
> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
> +BTF_KFUNCS_END(scx_kfunc_ids_ops_context_sensitive)

I'm not a big fan of repeating the kfuncs. This is going to be error-prone.
Can't it register and test the existing sets in the filter function instead?
If that's too cumbersome, maybe we can put these in a macro so that we don't
have to repeat the functions?

> +static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{
> +	u32 moff, flags;
> +
> +	if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
> +		return 0;
> +
> +	if (prog->type == BPF_PROG_TYPE_SYSCALL &&
> +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
> +		return 0;

Not from this change but these can probably be allowed from TRACING too.

> +	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> +	    prog->aux->st_ops != &bpf_sched_ext_ops)
> +		return 0;

Why can't other struct_ops progs call scx_kfunc_ids_unlocked kfuncs?

> +	/* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/
> +
> +	moff = prog->aux->attach_st_ops_member_off;
> +	flags = scx_ops_context_flags[SCX_MOFF_IDX(moff)];
> +
> +	if ((flags & SCX_OPS_KF_UNLOCKED) &&
> +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
> +		return 0;

Wouldn't this disallow e.g. ops.dispatch() from calling scx_dsq_move()?

Have you tested that the before and after behaviors match?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-27 20:25   ` Tejun Heo
@ 2025-02-27 21:23     ` Juntong Deng
  2025-02-27 21:32       ` Tejun Heo
  2025-02-28  2:38       ` Alexei Starovoitov
  0 siblings, 2 replies; 23+ messages in thread
From: Juntong Deng @ 2025-02-27 21:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On 2025/2/27 20:25, Tejun Heo wrote:

>> +/* Each flag corresponds to a btf kfunc id set */
>> +enum scx_ops_kf_flags {
>> +	SCX_OPS_KF_ANY			= 0,
>> +	SCX_OPS_KF_UNLOCKED		= 1 << 1,

> nit: Any specific reason to skip bit 0?

Thanks for your reply.

This is a mistake and will be fixed in the next version.

>> +	[SCX_OP_IDX(exit_task)]			= SCX_OPS_KF_ANY,
>> +	[SCX_OP_IDX(enable)]			= SCX_OPS_KF_ANY,
>> +	[SCX_OP_IDX(disable)]			= SCX_OPS_KF_ANY,
>> +	[SCX_OP_IDX(dump)]			= SCX_OPS_KF_DISPATCH,

> Shouldn't this be SCX_OPS_KF_UNLOCKED?

This is another mistake and will be fixed in the next version.

> Hello,
> 
> On Wed, Feb 26, 2025 at 07:28:18PM +0000, Juntong Deng wrote:
>> +BTF_KFUNCS_START(scx_kfunc_ids_ops_context_sensitive)
>> +/* scx_kfunc_ids_select_cpu */
>> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
>> +/* scx_kfunc_ids_enqueue_dispatch */
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_insert, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_insert_vtime, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
>> +/* scx_kfunc_ids_dispatch */
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local)
>> +BTF_ID_FLAGS(func, scx_bpf_consume)
>> +/* scx_kfunc_ids_cpu_release */
>> +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local)
>> +/* scx_kfunc_ids_unlocked */
>> +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
>> +/* Intersection of scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked */
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_slice)
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_set_vtime)
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_move, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dsq_move_vtime, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_slice)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq_set_vtime)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
>> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
>> +BTF_KFUNCS_END(scx_kfunc_ids_ops_context_sensitive)
> 
> I'm not a big fan of repeating the kfuncs. This is going to be error-prone.
> Can't it register and test the existing sets in the filter function instead?
> If that's too cumbersome, maybe we can put these in a macro so that we don't
> have to repeat the functions?
> 

The core idea of the current design is to separate the kfunc id set used
for filtering purpose and grouping purpose, so that we only need one
filter and do not need to add separate filters for each kfunc id set.
So although kfuncs appear repeatedly in two kfunc id sets, they are
used for different purposes.

scx_kfunc_ids_ops_context_sensitive is only used for filtering purposes
and includes all context-sensitive kfuncs. We need to rely on another
grouping purpose kfunc id set, for example, scx_kfunc_ids_dispatch,
to determine whether a kfunc is allowed to be called in the
dispatch context.

Macro is a good idea, I will try it in the next version.

>> +static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
>> +{
>> +	u32 moff, flags;
>> +
>> +	if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
>> +		return 0;
>> +
>> +	if (prog->type == BPF_PROG_TYPE_SYSCALL &&
>> +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
>> +		return 0;
> 
> Not from this change but these can probably be allowed from TRACING too.
> 

Not sure if it is safe to make these kfuncs available in TRACING.
If Alexei sees this email, could you please leave a comment?

>> +	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
>> +	    prog->aux->st_ops != &bpf_sched_ext_ops)
>> +		return 0;
> 
> Why can't other struct_ops progs call scx_kfunc_ids_unlocked kfuncs?
> 

Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
called by other struct_ops programs.

>> +	/* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/
>> +
>> +	moff = prog->aux->attach_st_ops_member_off;
>> +	flags = scx_ops_context_flags[SCX_MOFF_IDX(moff)];
>> +
>> +	if ((flags & SCX_OPS_KF_UNLOCKED) &&
>> +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
>> +		return 0;
> 
> Wouldn't this disallow e.g. ops.dispatch() from calling scx_dsq_move()?
> 

No, because

>> [SCX_OP_IDX(dispatch)] = SCX_OPS_KF_DISPATCH | SCX_OPS_KF_ENQUEUE,

Therefore, kfuncs (scx_bpf_dsq_move_*) in scx_kfunc_ids_dispatch can be
called in the dispatch context.

> Have you tested that the before and after behaviors match?
>

I tested the programs in tools/testing/selftests/sched_ext and
tools/sched_ext and all worked fine.

If there are other cases that are not covered, we may need to add new
test cases.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-27 21:23     ` Juntong Deng
@ 2025-02-27 21:32       ` Tejun Heo
  2025-02-28  2:34         ` Alexei Starovoitov
  2025-02-28 18:42         ` Juntong Deng
  2025-02-28  2:38       ` Alexei Starovoitov
  1 sibling, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2025-02-27 21:32 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

Hello,

On Thu, Feb 27, 2025 at 09:23:20PM +0000, Juntong Deng wrote:
> > > +	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> > > +	    prog->aux->st_ops != &bpf_sched_ext_ops)
> > > +		return 0;
> > 
> > Why can't other struct_ops progs call scx_kfunc_ids_unlocked kfuncs?
> > 
> 
> Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
> called by other struct_ops programs.

Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
scx_bpf_dsq_insert()?

> > > +	/* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/
> > > +
> > > +	moff = prog->aux->attach_st_ops_member_off;
> > > +	flags = scx_ops_context_flags[SCX_MOFF_IDX(moff)];
> > > +
> > > +	if ((flags & SCX_OPS_KF_UNLOCKED) &&
> > > +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
> > > +		return 0;
> > 
> > Wouldn't this disallow e.g. ops.dispatch() from calling scx_dsq_move()?
> > 
> 
> No, because
> 
> > > [SCX_OP_IDX(dispatch)] = SCX_OPS_KF_DISPATCH | SCX_OPS_KF_ENQUEUE,
> 
> Therefore, kfuncs (scx_bpf_dsq_move_*) in scx_kfunc_ids_dispatch can be
> called in the dispatch context.

I see, scx_dsq_move_*() are in both groups, so it should be fine. I'm not
fully sure the groupings are the actually implemented filtering are in sync.
They are intended to be but the grouping didn't really matter in the
previous implementation. So, they need to be carefully audited.

> > Have you tested that the before and after behaviors match?
> 
> I tested the programs in tools/testing/selftests/sched_ext and
> tools/sched_ext and all worked fine.
> 
> If there are other cases that are not covered, we may need to add new
> test cases.

Right, the coverage there isn't perfect. Testing all conditions would be too
much but it'd be nice to have a test case which at least confirms that all
allowed cases verify successfully.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-27 21:32       ` Tejun Heo
@ 2025-02-28  2:34         ` Alexei Starovoitov
  2025-02-28 21:31           ` Tejun Heo
  2025-02-28 18:42         ` Juntong Deng
  1 sibling, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2025-02-28  2:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Juntong Deng, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, David Vernet, Andrea Righi, Changwoo Min,
	bpf, LKML

On Thu, Feb 27, 2025 at 1:32 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Feb 27, 2025 at 09:23:20PM +0000, Juntong Deng wrote:
> > > > + if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
> > > > +     prog->aux->st_ops != &bpf_sched_ext_ops)
> > > > +         return 0;
> > >
> > > Why can't other struct_ops progs call scx_kfunc_ids_unlocked kfuncs?
> > >
> >
> > Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
> > called by other struct_ops programs.
>
> Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> scx_bpf_dsq_insert()?

Not as far as I can tell.
scx_kfunc_ids_unlocked[] doesn't include scx_bpf_dsq_insert.
It's part of scx_kfunc_ids_enqueue_dispatch[].

So this bit in patch 3 enables it:
+       if ((flags & SCX_OPS_KF_ENQUEUE) &&
+           btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id))

and in patch 2:
+       [SCX_OP_IDX(enqueue)]                   = SCX_OPS_KF_ENQUEUE,

So scx_bpf_dsq_insert() kfunc can only be called out
of enqueue() sched-ext hook.

So the restriction is still the same. afaict.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-27 21:23     ` Juntong Deng
  2025-02-27 21:32       ` Tejun Heo
@ 2025-02-28  2:38       ` Alexei Starovoitov
  2025-02-28 21:14         ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2025-02-28  2:38 UTC (permalink / raw)
  To: Juntong Deng
  Cc: Tejun Heo, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, David Vernet, Andrea Righi, Changwoo Min,
	bpf, LKML

On Thu, Feb 27, 2025 at 1:23 PM Juntong Deng <juntong.deng@outlook.com> wrote:
>
> >> +static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
> >> +{
> >> +    u32 moff, flags;
> >> +
> >> +    if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
> >> +            return 0;
> >> +
> >> +    if (prog->type == BPF_PROG_TYPE_SYSCALL &&
> >> +        btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
> >> +            return 0;
> >
> > Not from this change but these can probably be allowed from TRACING too.
> >
>
> Not sure if it is safe to make these kfuncs available in TRACING.
> If Alexei sees this email, could you please leave a comment?

Hold on, you want to enable all of scx_kfunc_ids_unlocked[] set
to all of TRACING ? What is the use case ?
Maybe it's safe, but without in-depth analysis we shouldn't.
Currently sched-ext allows scx_kfunc_set_any[] for tracing.
I would stick to that in this patch set.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-27 21:32       ` Tejun Heo
  2025-02-28  2:34         ` Alexei Starovoitov
@ 2025-02-28 18:42         ` Juntong Deng
  2025-02-28 21:20           ` Tejun Heo
  1 sibling, 1 reply; 23+ messages in thread
From: Juntong Deng @ 2025-02-28 18:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On 2025/2/27 21:32, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 27, 2025 at 09:23:20PM +0000, Juntong Deng wrote:
>>>> +	if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
>>>> +	    prog->aux->st_ops != &bpf_sched_ext_ops)
>>>> +		return 0;
>>>
>>> Why can't other struct_ops progs call scx_kfunc_ids_unlocked kfuncs?
>>>
>>
>> Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
>> called by other struct_ops programs.
> 
> Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> scx_bpf_dsq_insert()?
> 

For other struct_ops programs, yes, in the current logic,
when prog->aux->st_ops != &bpf_sched_ext_ops, all calls are allowed.

This may seem a bit weird, but the reason I did it is that in other
struct_ops programs, the meaning of member_off changes, so the logic
that follows makes no sense at all.

Of course, we can change this, and ideally there would be some groupings
(kfunc id set) that declare which kfunc can be called by other
struct_ops programs and which cannot.

>>>> +	/* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/
>>>> +
>>>> +	moff = prog->aux->attach_st_ops_member_off;
>>>> +	flags = scx_ops_context_flags[SCX_MOFF_IDX(moff)];
>>>> +
>>>> +	if ((flags & SCX_OPS_KF_UNLOCKED) &&
>>>> +	    btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
>>>> +		return 0;
>>>
>>> Wouldn't this disallow e.g. ops.dispatch() from calling scx_dsq_move()?
>>>
>>
>> No, because
>>
>>>> [SCX_OP_IDX(dispatch)] = SCX_OPS_KF_DISPATCH | SCX_OPS_KF_ENQUEUE,
>>
>> Therefore, kfuncs (scx_bpf_dsq_move_*) in scx_kfunc_ids_dispatch can be
>> called in the dispatch context.
> 
> I see, scx_dsq_move_*() are in both groups, so it should be fine. I'm not
> fully sure the groupings are the actually implemented filtering are in sync.
> They are intended to be but the grouping didn't really matter in the
> previous implementation. So, they need to be carefully audited.
> 

After you audit the current groupings of scx kfuncs, please tell me how
you would like to change the current groupings.

>>> Have you tested that the before and after behaviors match?
>>
>> I tested the programs in tools/testing/selftests/sched_ext and
>> tools/sched_ext and all worked fine.
>>
>> If there are other cases that are not covered, we may need to add new
>> test cases.
> 
> Right, the coverage there isn't perfect. Testing all conditions would be too
> much but it'd be nice to have a test case which at least confirms that all
> allowed cases verify successfully.
> 

Yes, we can add a simple test case for each operation that is not
SCX_OPS_KF_ANY.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-28  2:38       ` Alexei Starovoitov
@ 2025-02-28 21:14         ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-02-28 21:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Juntong Deng, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, David Vernet, Andrea Righi, Changwoo Min,
	bpf, LKML

Hello,

On Thu, Feb 27, 2025 at 06:38:32PM -0800, Alexei Starovoitov wrote:
...
> > > Not from this change but these can probably be allowed from TRACING too.
> > >
> >
> > Not sure if it is safe to make these kfuncs available in TRACING.
> > If Alexei sees this email, could you please leave a comment?
> 
> Hold on, you want to enable all of scx_kfunc_ids_unlocked[] set
> to all of TRACING ? What is the use case ?

I thought it may be useful to be able to iterate DSQs and trigger
dispatching from there but

> Maybe it's safe, but without in-depth analysis we shouldn't.
> Currently sched-ext allows scx_kfunc_set_any[] for tracing.
> I would stick to that in this patch set.

I haven't thought through about safety at all and was mostly naively
thinking that if _any is safe _unlocked should too. Right now, unlocked has
two groups of kfuncs - the ones that should be able to sleep and the ones
that can be called from within scx_dsq_iter iterations. The former is
excluded from TRACING through KF_SLEEPABLE, I think. I don't know whether
dsq iteration is allowed in TRACING.

Anyways, not a real issue either way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-28 18:42         ` Juntong Deng
@ 2025-02-28 21:20           ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2025-02-28 21:20 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

Hello,

On Fri, Feb 28, 2025 at 06:42:11PM +0000, Juntong Deng wrote:
> > > Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
> > > called by other struct_ops programs.
> > 
> > Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> > scx_bpf_dsq_insert()?
> 
> For other struct_ops programs, yes, in the current logic,
> when prog->aux->st_ops != &bpf_sched_ext_ops, all calls are allowed.
> 
> This may seem a bit weird, but the reason I did it is that in other
> struct_ops programs, the meaning of member_off changes, so the logic
> that follows makes no sense at all.
> 
> Of course, we can change this, and ideally there would be some groupings
> (kfunc id set) that declare which kfunc can be called by other
> struct_ops programs and which cannot.

Other than any and unlocked, I don't think other bpf struct ops should be
able to call SCX kfuncs. They all assume rq lock to be held which wouldn't
be true for other struct_ops after all.

...
> > I see, scx_dsq_move_*() are in both groups, so it should be fine. I'm not
> > fully sure the groupings are the actually implemented filtering are in sync.
> > They are intended to be but the grouping didn't really matter in the
> > previous implementation. So, they need to be carefully audited.
> 
> After you audit the current groupings of scx kfuncs, please tell me how
> you would like to change the current groupings.

Yeah, I'll go over them but after all, we need to ensure that the behavior
currently implemented by scx_kf_allowed*() matches what the new code does,
so I'd appreciate if you can go over with that in mind too. This is kinda
confusing so we can definitely use more eyes.

> > Right, the coverage there isn't perfect. Testing all conditions would be too
> > much but it'd be nice to have a test case which at least confirms that all
> > allowed cases verify successfully.
> 
> Yes, we can add a simple test case for each operation that is not
> SCX_OPS_KF_ANY.

That'd be great. Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-28  2:34         ` Alexei Starovoitov
@ 2025-02-28 21:31           ` Tejun Heo
  2025-02-28 23:06             ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-02-28 21:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Juntong Deng, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, David Vernet, Andrea Righi, Changwoo Min,
	bpf, LKML

Hello,

On Thu, Feb 27, 2025 at 06:34:37PM -0800, Alexei Starovoitov wrote:
> > Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> > scx_bpf_dsq_insert()?
> 
> Not as far as I can tell.
> scx_kfunc_ids_unlocked[] doesn't include scx_bpf_dsq_insert.
> It's part of scx_kfunc_ids_enqueue_dispatch[].
> 
> So this bit in patch 3 enables it:
> +       if ((flags & SCX_OPS_KF_ENQUEUE) &&
> +           btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id))
> 
> and in patch 2:
> +       [SCX_OP_IDX(enqueue)]                   = SCX_OPS_KF_ENQUEUE,
> 
> So scx_bpf_dsq_insert() kfunc can only be called out
> of enqueue() sched-ext hook.
> 
> So the restriction is still the same. afaict.

Hmm... maybe I'm missing something:

  static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
  {
         u32 moff, flags;

         // allow non-context sensitive kfuncs
         if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
                 return 0;

         // allow unlocked to be called form all SYSCALL progs
         if (prog->type == BPF_PROG_TYPE_SYSCALL &&
             btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
                 return 0;

         // *** HERE, allow if the prog is not SCX ***
         if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
             prog->aux->st_ops != &bpf_sched_ext_ops)
                 return 0;

         /* prog->type == BPF_PROG_TYPE_STRUCT_OPS && prog->aux->st_ops == &bpf_sched_ext_ops*/

         // other context sensitive allow's

So, I think it'd say yes if a non-SCX BPF prog tries to call a context
sensitive SCX kfunc.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations Juntong Deng
  2025-02-27 19:19   ` Tejun Heo
@ 2025-02-28 21:57   ` Tejun Heo
  2025-03-03 21:12     ` Juntong Deng
  1 sibling, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-02-28 21:57 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On Wed, Feb 26, 2025 at 07:28:17PM +0000, Juntong Deng wrote:
> This patch declare context-sensitive kfunc groups that can be used by
> different SCX operations.
> 
> In SCX, some kfuncs are context-sensitive and can only be used in
> specific SCX operations.
> 
> Currently context-sensitive kfuncs can be grouped into UNLOCKED,
> CPU_RELEASE, DISPATCH, ENQUEUE, SELECT_CPU.
> 
> In this patch enum scx_ops_kf_flags was added to represent these groups,
> which is based on scx_kf_mask.
> 
> SCX_OPS_KF_ANY is a special value that indicates kfuncs can be used in
> any context.
> 
> scx_ops_context_flags is used to declare the groups of kfuncs that can
> be used by each SCX operation. An SCX operation can use multiple groups
> of kfuncs.
> 

Can you merge this into the next patch? I don't think separating this out
helps with reviewing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs
  2025-02-28 21:31           ` Tejun Heo
@ 2025-02-28 23:06             ` Alexei Starovoitov
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2025-02-28 23:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Juntong Deng, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, David Vernet, Andrea Righi, Changwoo Min,
	bpf, LKML

On Fri, Feb 28, 2025 at 1:31 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Feb 27, 2025 at 06:34:37PM -0800, Alexei Starovoitov wrote:
> > > Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> > > scx_bpf_dsq_insert()?
> >
> > Not as far as I can tell.
> > scx_kfunc_ids_unlocked[] doesn't include scx_bpf_dsq_insert.
> > It's part of scx_kfunc_ids_enqueue_dispatch[].
> >
> > So this bit in patch 3 enables it:
> > +       if ((flags & SCX_OPS_KF_ENQUEUE) &&
> > +           btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id))
> >
> > and in patch 2:
> > +       [SCX_OP_IDX(enqueue)]                   = SCX_OPS_KF_ENQUEUE,
> >
> > So scx_bpf_dsq_insert() kfunc can only be called out
> > of enqueue() sched-ext hook.
> >
> > So the restriction is still the same. afaict.
>
> Hmm... maybe I'm missing something:
>
>   static int scx_kfunc_ids_ops_context_sensitive_filter(const struct bpf_prog *prog, u32 kfunc_id)
>   {
>          u32 moff, flags;
>
>          // allow non-context sensitive kfuncs
>          if (!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive, kfunc_id))
>                  return 0;
>
>          // allow unlocked to be called form all SYSCALL progs
>          if (prog->type == BPF_PROG_TYPE_SYSCALL &&
>              btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
>                  return 0;
>
>          // *** HERE, allow if the prog is not SCX ***
>          if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
>              prog->aux->st_ops != &bpf_sched_ext_ops)

Ohh. You're right. My bad.
Here it probably needs
&& !!btf_id_set8_contains(&scx_kfunc_ids_ops_context_sensitive

since later scx_kfunc_set_ops_context_sensitive
is registered for all of struct-ops.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-02-28 21:57   ` Tejun Heo
@ 2025-03-03 21:12     ` Juntong Deng
  2025-03-04 18:00       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Juntong Deng @ 2025-03-03 21:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On 2025/2/28 21:57, Tejun Heo wrote:

> On Fri, Feb 28, 2025 at 06:42:11PM +0000, Juntong Deng wrote:
> > > > Return 0 means allowed. So kfuncs in scx_kfunc_ids_unlocked can be
> > > > called by other struct_ops programs.
> >
> > > Hmm... would that mean a non-sched_ext bpf prog would be able to call e.g.
> > > scx_bpf_dsq_insert()?
> >
> > For other struct_ops programs, yes, in the current logic,
> > when prog->aux->st_ops != &bpf_sched_ext_ops, all calls are allowed.
> >
> > This may seem a bit weird, but the reason I did it is that in other
> > struct_ops programs, the meaning of member_off changes, so the logic
> > that follows makes no sense at all.
> >
> > Of course, we can change this, and ideally there would be some groupings
> > (kfunc id set) that declare which kfunc can be called by other
> > struct_ops programs and which cannot.

> Other than any and unlocked, I don't think other bpf struct ops should be
> able to call SCX kfuncs. They all assume rq lock to be held which wouldn't
> be true for other struct_ops after all.

Ok, I will allow only any and unlocked to be called by other struct_ops
programs in the next version.

> ...
> > > I see, scx_dsq_move_*() are in both groups, so it should be fine. I'm not
> > > fully sure the groupings are the actually implemented filtering are in sync.
> > > They are intended to be but the grouping didn't really matter in the
> > > previous implementation. So, they need to be carefully audited.
> >
> > After you audit the current groupings of scx kfuncs, please tell me how
> > you would like to change the current groupings.

> Yeah, I'll go over them but after all, we need to ensure that the behavior
> currently implemented by scx_kf_allowed*() matches what the new code does,
> so I'd appreciate if you can go over with that in mind too. This is kinda
> confusing so we can definitely use more eyes.

Yes, I will use more eyes and be more careful on consistency.

> On Wed, Feb 26, 2025 at 07:28:17PM +0000, Juntong Deng wrote:
>> This patch declare context-sensitive kfunc groups that can be used by
>> different SCX operations.
>>
>> In SCX, some kfuncs are context-sensitive and can only be used in
>> specific SCX operations.
>>
>> Currently context-sensitive kfuncs can be grouped into UNLOCKED,
>> CPU_RELEASE, DISPATCH, ENQUEUE, SELECT_CPU.
>>
>> In this patch enum scx_ops_kf_flags was added to represent these groups,
>> which is based on scx_kf_mask.
>>
>> SCX_OPS_KF_ANY is a special value that indicates kfuncs can be used in
>> any context.
>>
>> scx_ops_context_flags is used to declare the groups of kfuncs that can
>> be used by each SCX operation. An SCX operation can use multiple groups
>> of kfuncs.
>>
> 
> Can you merge this into the next patch? I don't think separating this out
> helps with reviewing.
> 

Yes, I can merge them in the next version.

I am not sure, but it seems to me that the two patches are doing
different things?

> Thanks.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-03-03 21:12     ` Juntong Deng
@ 2025-03-04 18:00       ` Tejun Heo
  2025-03-07  0:07         ` Juntong Deng
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2025-03-04 18:00 UTC (permalink / raw)
  To: Juntong Deng
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On Mon, Mar 03, 2025 at 09:12:52PM +0000, Juntong Deng wrote:
> > Can you merge this into the next patch? I don't think separating this out
> > helps with reviewing.
> > 
> 
> Yes, I can merge them in the next version.
> 
> I am not sure, but it seems to me that the two patches are doing
> different things?

I don't know. It can be argued either way but it's more that the table added
by the first patch is not used in the first patch and the second patch is
difficult to review without referring to the table added in the first patch.
It can be either way but I don't see benefits of them being separate
patches.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations
  2025-03-04 18:00       ` Tejun Heo
@ 2025-03-07  0:07         ` Juntong Deng
  0 siblings, 0 replies; 23+ messages in thread
From: Juntong Deng @ 2025-03-07  0:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, arighi,
	changwoo, bpf, linux-kernel

On 2025/3/4 18:00, Tejun Heo wrote:
> On Mon, Mar 03, 2025 at 09:12:52PM +0000, Juntong Deng wrote:
>>> Can you merge this into the next patch? I don't think separating this out
>>> helps with reviewing.
>>>
>>
>> Yes, I can merge them in the next version.
>>
>> I am not sure, but it seems to me that the two patches are doing
>> different things?
> 
> I don't know. It can be argued either way but it's more that the table added
> by the first patch is not used in the first patch and the second patch is
> difficult to review without referring to the table added in the first patch.
> It can be either way but I don't see benefits of them being separate
> patches.
> 

Yes, I agree, that is a good point.

I will merge them in the next version.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-03-07  0:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-26 19:24 [PATCH sched_ext/for-6.15 v3 0/5] bpf, sched_ext: Make kfunc filters support struct_ops context to reduce runtime overhead Juntong Deng
2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 1/5] bpf: Add struct_ops context information to struct bpf_prog_aux Juntong Deng
2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 2/5] sched_ext: Declare context-sensitive kfunc groups that can be used by different SCX operations Juntong Deng
2025-02-27 19:19   ` Tejun Heo
2025-02-28 21:57   ` Tejun Heo
2025-03-03 21:12     ` Juntong Deng
2025-03-04 18:00       ` Tejun Heo
2025-03-07  0:07         ` Juntong Deng
2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 3/5] sched_ext: Add scx_kfunc_ids_ops_context_sensitive for unified filtering of context-sensitive SCX kfuncs Juntong Deng
2025-02-27 20:25   ` Tejun Heo
2025-02-27 21:23     ` Juntong Deng
2025-02-27 21:32       ` Tejun Heo
2025-02-28  2:34         ` Alexei Starovoitov
2025-02-28 21:31           ` Tejun Heo
2025-02-28 23:06             ` Alexei Starovoitov
2025-02-28 18:42         ` Juntong Deng
2025-02-28 21:20           ` Tejun Heo
2025-02-28  2:38       ` Alexei Starovoitov
2025-02-28 21:14         ` Tejun Heo
2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 4/5] sched_ext: Removed mask-based runtime restrictions on calling kfuncs in different contexts Juntong Deng
2025-02-27 16:24   ` Andrea Righi
2025-02-27 16:49     ` Juntong Deng
2025-02-26 19:28 ` [PATCH sched_ext/for-6.15 v3 5/5] selftests/sched_ext: Update enq_select_cpu_fails to adapt to struct_ops context filter Juntong Deng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).