* [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities)
@ 2025-01-16 19:35 Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 1/7] bpf: Add capability field to BTF_ID_FLAGS Juntong Deng
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:35 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
Overview
--------
This is a proof-of-concept patch series that aims to rethink the current
permission management of bpf programs.
This patch series demonstrates the idea of BPF (internal) capabilities,
a fine-grained permission model, to solve the problems caused by BPF's
current coarse-grained, program-type-based permission model.
In this patch series, I treat which BPF kfuncs a bpf program may call,
which BPF helpers it may use, which BPF maps it may access, and so on,
as permissions of the bpf program.
Note that the "capabilities" mentioned in this patch series have nothing
to do with Linux capabilities, nor with userspace.
The BPF capabilities in this patch series are capabilities that are ONLY
used internally in the bpf subsystem.
The ideas in this patch series come from previous discussions [0].
[0]: https://lore.kernel.org/bpf/AM6PR03MB5080DC63013560E26507079E99042@AM6PR03MB5080.eurprd03.prod.outlook.com/T/#t
Motivation
----------
Currently, the permission management of bpf programs is a coarse-grained
model based on program types. The program type determines the
permissions of the bpf program.
This was fine when BPF had only a few usage scenarios, but it becomes
inappropriate as the scenarios multiply.
The following are the current problems:
1. The permissions of a bpf program cannot change across contexts
Since permission management in BPF is based on program type, once a
bpf program selects a program type, its permissions are fixed.
Currently sched-ext (SCX) is implemented based on the
BPF_PROG_TYPE_STRUCT_OPS program type, but SCX needs to enforce
different restrictions in different contexts. For example, some kfuncs
can only be used in the DISPATCH context, and some kfuncs can only be
used in the CPU_RELEASE context.
However, the current BPF permission management based on program type
cannot natively implement these restrictions. The current approach used
by SCX is runtime detection: masks are checked at runtime to see whether
a disallowed kfunc is being called, which incurs runtime overhead.
Ideally, we could check for these incorrect uses of kfuncs via the
verifier without any runtime overhead.
2. Permission rules cannot be inherited and extended between program types
When one program type shares a large number of base permissions with
another program type, the current program-type-based permission model
cannot express "inheritance".
All kfuncs need to be registered to each program type separately and
populated into struct btf_id_set8 of each program type via
btf_populate_kfunc_set.
The closest existing feature to "inheritance" is "aliasing".
BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_PERF_EVENT, BPF_PROG_TYPE_LSM are actually "aliases"
of BTF_KFUNC_HOOK_TRACING.
So what should we do if there are differences in permissions between
"aliased" program types? We need to implement a filter callback function
to filter out some commonly registered kfuncs under different specific
program types.
This is obviously not an elegant solution.
The essence of all the above problems comes from the fact that the
current coarse-grained bpf permission model based on program type is
no longer appropriate and we need to rethink it.
What we need to face is:
1. ONE bpf program type can be used in MANY different contexts
(scenarios), and these contexts may have different restrictions.
2. There will be more bpf program types, and there will be a lot of
common permissions between different program types.
Complex permission management calls for a fine-grained permission model,
and fine-grained permission division is difficult to achieve on top of a
coarse-grained model. The current SCX masks and filter callbacks are
band-aids over this coarse-grained permission model.
BPF Capabilities
----------------
BPF capabilities is a capability-based permission model used internally
in the BPF subsystem. With BPF capabilities, all kfuncs are registered
under different capabilities according to a fine-grained permission
division, rather than registered directly to a program type.
What BPF capabilities aims to achieve is:
1. Fine-grained permission division
All kfuncs can be divided into different sets according to their
functionality and registered under different capabilities, such as
BPF_CAP_FS, BPF_CAP_LSM, or BPF_CAP_SCX_DISPATCH.
In this way, we can enable or disable some features in
different contexts.
2. Dynamically enable and disable capabilities
The bpf verifier maintains a list of capabilities that are
currently enabled for the bpf program. This list can be modified in
different contexts.
When a bpf program accesses a feature corresponding to an enabled
capability, it will be allowed, but if it accesses a feature
corresponding to a disabled capability, it will be denied.
3. Capabilities hierarchy
Capabilities can be organized in a hierarchy. For example, we can
define TRACING_CAP_BASE, which includes all common capabilities in
tracing scenarios and can be used in BPF_PROG_TYPE_TRACING,
BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT,
and BPF_PROG_TYPE_LSM.
We do not need to list all required capabilities separately for
each program type.
4. Low-coupling capabilities system
Different subsystems can define their own capabilities and enable or
disable a bpf program's capabilities in the verifier as the context
requires.
None of this requires modifying the BPF core; the capabilities system
stays decoupled from it.
Proof of Concept Alert
----------------------
Note that this is an early-stage proof of concept and the code in this
patch series is not well-designed.
It is a minimal proof of concept intended only to demonstrate the idea;
the code has bugs and rough edges here and there, so please bear
with it.
Current Implementation
----------------------
The implementation in this patch series is a possible way to implement
BPF capabilities. We can discuss other better implementations of
BPF capabilities.
1. Fine-grained permission division
I added a new field "capability" to BTF_ID to record the capability of
each kfunc. This field is set when the kfunc sets are registered.
All kfuncs are put into the same struct btf_id_set8 and are no longer
divided into different sets according to program type.
All permission management is based on capabilities, not program types.
2. Dynamically enable and disable capabilities
I added a bitmap "bpf_capabilities" to struct bpf_verifier_env to record
the capabilities currently enabled for the bpf program.
This bitmap can be changed in different contexts. In check_kfunc_call,
the bitmap is used to determine whether the kfunc call is legal.
3. Capabilities hierarchy
I used macros to define sets of base capabilities, such as
STRUCT_OPS_BASE_CAPS.
The default enabled capabilities for each program type are defined via
an array, which can contain the base capability macros.
4. Low-coupling capabilities system
I added the bpf_capabilities_adjust callback function to
struct bpf_verifier_ops and the context information context_info
to struct bpf_verifier_env (in the case of SCX, this context
information may be "moff").
Passing context_info to the bpf_capabilities_adjust callback function
allows the implementer to determine the current context and make changes
to the enabled capabilities list of the bpf program in the verifier.
Test Results
------------
For testing I added scx_simple_cap_test, in which
scx_bpf_dsq_move_to_local is called from enqueue, even though that is
not allowed. Running this program makes the verifier report an error.
./build/bin/scx_simple_cap_test
libbpf: prog 'simple_enqueue': BPF program load failed: -EACCES
libbpf: prog 'simple_enqueue': -- BEGIN PROG LOAD LOG --
...
17: (85) call scx_bpf_dsq_move_to_local#135437
The bpf program does not have the capability to call scx_bpf_dsq_move_to_local
...
libbpf: failed to load BPF skeleton 'scx_simple_cap_test': -EACCES
[SCX_BUG] scx_simple_cap_test.c:88 (Permission denied)
Failed to load skel
But if we run scx_simple, the program can run normally.
./build/bin/scx_simple
[ 152.792015] sched_ext: BPF scheduler "simple" enabled
local=7 global=0
local=30 global=3
local=33 global=11
More
----
BPF capabilities is a general mechanism that is flexible and extensible.
In my opinion, BPF capabilities can be used not only to manage kfuncs
but also to manage permissions for all BPF features, including
BPF helpers, BPF maps, etc.
We can associate these features with a capability, so that the bpf
verifier can manage them according to different contexts.
Maybe we can also make BPF capabilities configurable through /sys/bpf
or associate some BPF capabilities with Linux capabilities, so that
system administrators can choose to expose only a subset of BPF
features to certain users.
Related Suggestions
-------------------
In the current implementation, I need to add capability information to
each kfunc; this is implemented by modifying the BTF_ID structure.
But I cannot modify BTF_ID directly, because BTF_ID is used for data
structures in addition to kfuncs, and data structures do not need
capability information.
My suggestion is to use BTF_ID_FLAGS for all kfuncs and only use BTF_ID
for data structures.
This way we can distinguish kfuncs from data structures.
At The End
----------
This is a proof-of-concept patch series that rethinks the current BPF
permissions management.
The ideas and implementation are not complete yet, but BPF capabilities
may be a better solution than the current program type-based
permission management.
Welcome to discuss and give feedback!
Many thanks.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Juntong Deng (7):
bpf: Add capability field to BTF_ID_FLAGS
bpf: Add enum bpf_capability
bpf: Add capabilities version of kfuncs registration
bpf: Make the verifier support BPF capabilities
bpf: Add default BPF capabilities initialization for program types
sched_ext: Make SCX use BPF capabilities
sched_ext: Add proof-of-concept test case
include/linux/bpf.h | 2 +
include/linux/bpf_verifier.h | 6 +
include/linux/btf.h | 8 +-
include/linux/btf_ids.h | 6 +-
include/uapi/linux/bpf.h | 15 ++
kernel/bpf/btf.c | 165 +++++++++++++++++++++-
kernel/bpf/verifier.c | 66 ++++++++-
kernel/sched/ext.c | 74 ++++++++--
tools/bpf/resolve_btfids/main.c | 2 +-
tools/include/linux/btf_ids.h | 1 +
tools/sched_ext/Makefile | 2 +-
tools/sched_ext/scx_simple_cap_test.bpf.c | 159 +++++++++++++++++++++
tools/sched_ext/scx_simple_cap_test.c | 107 ++++++++++++++
13 files changed, 590 insertions(+), 23 deletions(-)
create mode 100644 tools/sched_ext/scx_simple_cap_test.bpf.c
create mode 100644 tools/sched_ext/scx_simple_cap_test.c
--
2.39.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next 1/7] bpf: Add capability field to BTF_ID_FLAGS
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability Juntong Deng
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch adds a capability field to BTF_ID_FLAGS to record the
capability information of each kfunc.
Note that the capability field is just a placeholder, the actual
capability value is set in btf_populate_kfunc_set_cap.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
include/linux/btf_ids.h | 6 +++++-
tools/bpf/resolve_btfids/main.c | 2 +-
tools/include/linux/btf_ids.h | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 139bdececdcf..40231ea36058 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -19,6 +19,7 @@ struct btf_id_set8 {
struct {
u32 id;
u32 flags;
+ u32 capability;
} pairs[];
};
@@ -65,7 +66,10 @@ word \
__BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), "")
#define ____BTF_ID_FLAGS(prefix, name, flags) \
- __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), ".long " #flags "\n")
+ __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__), \
+ ".long " #flags "\n" \
+ ".zero 4 \n")
+
#define __BTF_ID_FLAGS(prefix, name, flags, ...) \
____BTF_ID_FLAGS(prefix, name, flags)
#define BTF_ID_FLAGS(prefix, name, ...) \
diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c
index d47191c6e55e..48be22f9a14e 100644
--- a/tools/bpf/resolve_btfids/main.c
+++ b/tools/bpf/resolve_btfids/main.c
@@ -495,7 +495,7 @@ static int symbols_collect(struct object *obj)
* that - 1.
*/
if (id) {
- id->cnt = sym.st_size / sizeof(uint64_t) - 1;
+ id->cnt = (sym.st_size - sizeof(u32) * 2) / (sizeof(u32) * 3);
id->is_set8 = true;
}
/* set */
diff --git a/tools/include/linux/btf_ids.h b/tools/include/linux/btf_ids.h
index 72ea363d434d..a6c9b560b6ce 100644
--- a/tools/include/linux/btf_ids.h
+++ b/tools/include/linux/btf_ids.h
@@ -16,6 +16,7 @@ struct btf_id_set8 {
struct {
u32 id;
u32 flags;
+ u32 capability;
} pairs[];
};
--
2.39.5
* [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 1/7] bpf: Add capability field to BTF_ID_FLAGS Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-16 22:56 ` Song Liu
2025-01-16 19:41 ` [RFC PATCH bpf-next 3/7] bpf: Add capabilities version of kfuncs registration Juntong Deng
` (4 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch adds enum bpf_capability, currently only as a proof
of concept.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
include/uapi/linux/bpf.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2acf9b336371..94c21d4eb786 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1058,6 +1058,21 @@ enum bpf_prog_type {
__MAX_BPF_PROG_TYPE
};
+enum bpf_capability {
+ BPF_CAP_NONE = 0,
+ BPF_CAP_TEST_1,
+ BPF_CAP_TEST_2,
+ BPF_CAP_TEST_3,
+ BPF_CAP_SCX_ANY,
+ BPF_CAP_SCX_KF_UNLOCKED,
+ BPF_CAP_SCX_KF_CPU_RELEASE,
+ BPF_CAP_SCX_KF_DISPATCH,
+ BPF_CAP_SCX_KF_ENQUEUE,
+ BPF_CAP_SCX_KF_SELECT_CPU,
+ BPF_CAP_SCX_KF_REST,
+ __MAX_BPF_CAP
+};
+
enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
--
2.39.5
* [RFC PATCH bpf-next 3/7] bpf: Add capabilities version of kfuncs registration
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 1/7] bpf: Add capability field to BTF_ID_FLAGS Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 4/7] bpf: Make the verifier support BPF capabilities Juntong Deng
` (3 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch adds capability versions of the kfunc registration related
functions and data structures.
btf_populate_kfunc_set_cap, __btf_kfunc_id_set_contains_cap,
__register_btf_kfunc_id_set_cap, and register_btf_kfunc_id_set_cap,
corresponding to btf_populate_kfunc_set, __btf_kfunc_id_set_contains,
__register_btf_kfunc_id_set, and register_btf_kfunc_id_set respectively.
Note that these are proof-of-concept versions of the functions. In a
real implementation, the original functions should be modified directly.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
include/linux/btf.h | 8 ++-
kernel/bpf/btf.c | 165 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 170 insertions(+), 3 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 2a08a2b55592..71d9658ee328 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -569,11 +569,14 @@ const char *btf_str_by_offset(const struct btf *btf, u32 offset);
struct btf *btf_parse_vmlinux(void);
struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
- const struct bpf_prog *prog);
+ const struct bpf_prog *prog,
+ u32 *capability);
u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
const struct bpf_prog *prog);
int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
const struct btf_kfunc_id_set *s);
+int register_btf_kfunc_id_set_cap(enum bpf_capability capability,
+ const struct btf_kfunc_id_set *s);
int register_btf_fmodret_id_set(const struct btf_kfunc_id_set *kset);
s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
@@ -632,7 +635,8 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
}
static inline u32 *btf_kfunc_id_set_contains(const struct btf *btf,
u32 kfunc_btf_id,
- struct bpf_prog *prog)
+ struct bpf_prog *prog,
+ u32 *capability)
{
return NULL;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8396ce1d0fba..535074527e80 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -236,6 +236,7 @@ struct btf_kfunc_hook_filter {
struct btf_kfunc_set_tab {
struct btf_id_set8 *sets[BTF_KFUNC_HOOK_MAX];
struct btf_kfunc_hook_filter hook_filters[BTF_KFUNC_HOOK_MAX];
+ struct btf_id_set8 *cap_poc_set;
};
struct btf_id_dtor_kfunc_tab {
@@ -8483,6 +8484,96 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
return ret;
}
+static int btf_populate_kfunc_set_cap(struct btf *btf, enum bpf_capability capability,
+ const struct btf_kfunc_id_set *kset)
+{
+ struct btf_id_set8 *add_set = kset->set;
+ bool vmlinux_set = !btf_is_module(btf);
+ struct btf_kfunc_set_tab *tab;
+ struct btf_id_set8 *set;
+ u32 set_cnt, i;
+ int ret;
+
+ if (capability >= __MAX_BPF_CAP) {
+ ret = -EINVAL;
+ goto end;
+ }
+
+ if (!add_set->cnt)
+ return 0;
+
+ tab = btf->kfunc_set_tab;
+
+ if (!tab) {
+ tab = kzalloc(sizeof(*tab), GFP_KERNEL | __GFP_NOWARN);
+ if (!tab)
+ return -ENOMEM;
+ btf->kfunc_set_tab = tab;
+ }
+
+ set = tab->cap_poc_set;
+ /* Warn when register_btf_kfunc_id_set is called twice for the same hook
+ * for module sets.
+ */
+ if (WARN_ON_ONCE(set && !vmlinux_set)) {
+ ret = -EINVAL;
+ goto end;
+ }
+
+ /* In case of vmlinux sets, there may be more than one set being
+ * registered per hook. To create a unified set, we allocate a new set
+ * and concatenate all individual sets being registered. While each set
+ * is individually sorted, they may become unsorted when concatenated,
+ * hence re-sorting the final set again is required to make binary
+ * searching the set using btf_id_set8_contains function work.
+ *
+ * For module sets, we need to allocate as we may need to relocate
+ * BTF ids.
+ */
+ set_cnt = set ? set->cnt : 0;
+
+ if (set_cnt > U32_MAX - add_set->cnt) {
+ ret = -EOVERFLOW;
+ goto end;
+ }
+
+ if (set_cnt + add_set->cnt > BTF_KFUNC_SET_MAX_CNT) {
+ ret = -E2BIG;
+ goto end;
+ }
+
+ /* Grow set */
+ set = krealloc(tab->cap_poc_set,
+ offsetof(struct btf_id_set8, pairs[set_cnt + add_set->cnt]),
+ GFP_KERNEL | __GFP_NOWARN);
+ if (!set) {
+ ret = -ENOMEM;
+ goto end;
+ }
+
+ /* For newly allocated set, initialize set->cnt to 0 */
+ if (!tab->cap_poc_set)
+ set->cnt = 0;
+ tab->cap_poc_set = set;
+
+ /* Concatenate the two sets */
+ memcpy(set->pairs + set->cnt, add_set->pairs, add_set->cnt * sizeof(set->pairs[0]));
+ /* Now that the set is copied, update with relocated BTF ids */
+ for (i = set->cnt; i < set->cnt + add_set->cnt; i++) {
+ set->pairs[i].id = btf_relocate_id(btf, set->pairs[i].id);
+ set->pairs[i].capability = capability;
+ }
+
+ set->cnt += add_set->cnt;
+
+ sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL);
+
+ return 0;
+end:
+ btf_free_kfunc_set_tab(btf);
+ return ret;
+}
+
static u32 *__btf_kfunc_id_set_contains(const struct btf *btf,
enum btf_kfunc_hook hook,
u32 kfunc_btf_id,
@@ -8511,6 +8602,30 @@ static u32 *__btf_kfunc_id_set_contains(const struct btf *btf,
return id + 1;
}
+static u32 *__btf_kfunc_id_set_contains_cap(const struct btf *btf,
+ u32 kfunc_btf_id,
+ const struct bpf_prog *prog,
+ u32 *capability)
+{
+ struct btf_id_set8 *set;
+ u32 *id;
+
+ if (!btf->kfunc_set_tab)
+ return NULL;
+
+ set = btf->kfunc_set_tab->cap_poc_set;
+ if (!set)
+ return NULL;
+ id = btf_id_set8_contains(set, kfunc_btf_id);
+ if (!id)
+ return NULL;
+ /* The capability is next to flags */
+ if (capability)
+ *capability = *(id + 2);
+ /* The flags for BTF ID are located next to it */
+ return id + 1;
+}
+
static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
{
switch (prog_type) {
@@ -8565,12 +8680,20 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
*/
u32 *btf_kfunc_id_set_contains(const struct btf *btf,
u32 kfunc_btf_id,
- const struct bpf_prog *prog)
+ const struct bpf_prog *prog,
+ u32 *capability)
{
enum bpf_prog_type prog_type = resolve_prog_type(prog);
enum btf_kfunc_hook hook;
u32 *kfunc_flags;
+ kfunc_flags = __btf_kfunc_id_set_contains_cap(btf, kfunc_btf_id, prog, capability);
+ if (kfunc_flags)
+ return kfunc_flags;
+
+ if (capability)
+ *capability = BPF_CAP_NONE;
+
kfunc_flags = __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_COMMON, kfunc_btf_id, prog);
if (kfunc_flags)
return kfunc_flags;
@@ -8611,6 +8734,31 @@ static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook,
return ret;
}
+static int __register_btf_kfunc_id_set_cap(enum bpf_capability capability,
+ const struct btf_kfunc_id_set *kset)
+{
+ struct btf *btf;
+ int ret, i;
+
+ btf = btf_get_module_btf(kset->owner);
+ if (!btf)
+ return check_btf_kconfigs(kset->owner, "kfunc");
+ if (IS_ERR(btf))
+ return PTR_ERR(btf);
+
+ for (i = 0; i < kset->set->cnt; i++) {
+ ret = btf_check_kfunc_protos(btf, btf_relocate_id(btf, kset->set->pairs[i].id),
+ kset->set->pairs[i].flags);
+ if (ret)
+ goto err_out;
+ }
+
+ ret = btf_populate_kfunc_set_cap(btf, capability, kset);
+err_out:
+ btf_put(btf);
+ return ret;
+}
+
/* This function must be invoked only from initcalls/module init functions */
int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
const struct btf_kfunc_id_set *kset)
@@ -8630,6 +8778,21 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
}
EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set);
+int register_btf_kfunc_id_set_cap(enum bpf_capability capability,
+ const struct btf_kfunc_id_set *kset)
+{
+ /* All kfuncs need to be tagged as such in BTF.
+ * WARN() for initcall registrations that do not check errors.
+ */
+ if (!(kset->set->flags & BTF_SET8_KFUNCS)) {
+ WARN_ON(!kset->owner);
+ return -EINVAL;
+ }
+
+ return __register_btf_kfunc_id_set_cap(capability, kset);
+}
+EXPORT_SYMBOL_GPL(register_btf_kfunc_id_set_cap);
+
/* This function must be invoked only from initcalls/module init functions */
int register_btf_fmodret_id_set(const struct btf_kfunc_id_set *kset)
{
--
2.39.5
* [RFC PATCH bpf-next 4/7] bpf: Make the verifier support BPF capabilities
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
` (2 preceding siblings ...)
2025-01-16 19:41 ` [RFC PATCH bpf-next 3/7] bpf: Add capabilities version of kfuncs registration Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 5/7] bpf: Add default BPF capabilities initialization for program types Juntong Deng
` (2 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch makes the verifier support BPF capabilities.
Add bpf_capabilities bitmap and context_info to
struct bpf_verifier_env.
Add bpf_capabilities_adjust callback function to
struct bpf_verifier_ops.
Add check for BPF capabilities in check_kfunc_call.
Add call to bpf_capabilities_adjust callback function in
do_check_common.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
include/linux/bpf.h | 2 ++
include/linux/bpf_verifier.h | 6 ++++++
kernel/bpf/verifier.c | 29 ++++++++++++++++++++++++-----
3 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index feda0ce90f5a..73d2ff1003ac 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1021,6 +1021,8 @@ struct bpf_verifier_ops {
int (*btf_struct_access)(struct bpf_verifier_log *log,
const struct bpf_reg_state *reg,
int off, int size);
+ int (*bpf_capabilities_adjust)(unsigned long *bpf_capabilities,
+ u32 context_info, bool enter);
};
struct bpf_prog_offload_ops {
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 32c23f2a3086..6d0dad5f756d 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -784,8 +784,14 @@ struct bpf_verifier_env {
char tmp_str_buf[TMP_STR_BUF_LEN];
struct bpf_insn insn_buf[INSN_BUF_SIZE];
struct bpf_insn epilogue_buf[INSN_BUF_SIZE];
+ DECLARE_BITMAP(bpf_capabilities, __MAX_BPF_CAP);
+ u32 context_info;
};
+#define ENABLE_BPF_CAPABILITY(caps, cap) __set_bit(cap, caps)
+#define DISABLE_BPF_CAPABILITY(caps, cap) __clear_bit(cap, caps)
+#define IS_BPF_CAPABILITY_ENABLED(caps, cap) test_bit(cap, caps)
+
static inline struct bpf_func_info_aux *subprog_aux(struct bpf_verifier_env *env, int subprog)
{
return &env->prog->aux->func_info_aux[subprog];
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b8ca227c78af..2a321a641b4a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -30,6 +30,7 @@
#include <net/xdp.h>
#include <linux/trace_events.h>
#include <linux/kallsyms.h>
+#include <uapi/linux/bpf.h>
#include "disasm.h"
@@ -12917,7 +12918,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
static int fetch_kfunc_meta(struct bpf_verifier_env *env,
struct bpf_insn *insn,
struct bpf_kfunc_call_arg_meta *meta,
- const char **kfunc_name)
+ const char **kfunc_name,
+ u32 *capability)
{
const struct btf_type *func, *func_proto;
u32 func_id, *kfunc_flags;
@@ -12941,7 +12943,7 @@ static int fetch_kfunc_meta(struct bpf_verifier_env *env,
*kfunc_name = func_name;
func_proto = btf_type_by_id(desc_btf, func->type);
- kfunc_flags = btf_kfunc_id_set_contains(desc_btf, func_id, env->prog);
+ kfunc_flags = btf_kfunc_id_set_contains(desc_btf, func_id, env->prog, capability);
if (!kfunc_flags) {
return -EACCES;
}
@@ -12972,16 +12974,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
const struct btf_param *args;
const struct btf_type *ret_t;
struct btf *desc_btf;
+ u32 capability;
/* skip for now, but return error when we find this in fixup_kfunc_call */
if (!insn->imm)
return 0;
- err = fetch_kfunc_meta(env, insn, &meta, &func_name);
+ err = fetch_kfunc_meta(env, insn, &meta, &func_name, &capability);
if (err == -EACCES && func_name)
verbose(env, "calling kernel function %s is not allowed\n", func_name);
if (err)
return err;
+
+ if (capability != BPF_CAP_NONE) {
+ if (!IS_BPF_CAPABILITY_ENABLED(env->bpf_capabilities, capability) && func_name) {
+ verbose(env, "The bpf program does not have the capability to call %s\n",
+ func_name);
+ return -EACCES;
+ }
+ }
+
desc_btf = meta.btf;
insn_aux = &env->insn_aux_data[insn_idx];
@@ -16824,7 +16836,7 @@ static void mark_fastcall_pattern_for_call(struct bpf_verifier_env *env,
struct bpf_kfunc_call_arg_meta meta;
int err;
- err = fetch_kfunc_meta(env, call, &meta, NULL);
+ err = fetch_kfunc_meta(env, call, &meta, NULL, NULL);
if (err < 0)
/* error would be reported later */
return;
@@ -16980,7 +16992,7 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
struct bpf_kfunc_call_arg_meta meta;
- ret = fetch_kfunc_meta(env, insn, &meta, NULL);
+ ret = fetch_kfunc_meta(env, insn, &meta, NULL, NULL);
if (ret == 0 && is_iter_next_kfunc(&meta)) {
mark_prune_point(env, t);
/* Checking and saving state checkpoints at iter_next() call
@@ -22093,6 +22105,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
state->first_insn_idx = env->subprog_info[subprog].start;
state->last_insn_idx = -1;
+ if (env->ops->bpf_capabilities_adjust)
+ env->ops->bpf_capabilities_adjust(env->bpf_capabilities, env->context_info, true);
+
regs = state->frame[state->curframe]->regs;
if (subprog || env->prog->type == BPF_PROG_TYPE_EXT) {
const char *sub_name = subprog_name(env, subprog);
@@ -22176,6 +22191,9 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
ret = do_check(env);
out:
+ if (env->ops->bpf_capabilities_adjust)
+ env->ops->bpf_capabilities_adjust(env->bpf_capabilities, env->context_info, false);
+
/* check for NULL is necessary, since cur_state can be freed inside
* do_check() under memory pressure.
*/
@@ -22385,6 +22403,7 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
prog->aux->attach_func_proto = func_proto;
prog->aux->attach_func_name = mname;
env->ops = st_ops->verifier_ops;
+ env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
return 0;
}
--
2.39.5
* [RFC PATCH bpf-next 5/7] bpf: Add default BPF capabilities initialization for program types
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
` (3 preceding siblings ...)
2025-01-16 19:41 ` [RFC PATCH bpf-next 4/7] bpf: Make the verifier support BPF capabilities Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 7/7] sched_ext: Add proof-of-concept test case Juntong Deng
6 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch adds default BPF capabilities initialization for
program types.
Since this is a proof of concept, only BPF capabilities initialization
for BPF_PROG_TYPE_STRUCT_OPS is added.
BPF_PROG_TYPE_STRUCT_OPS enables only STRUCT_OPS_BASE_CAPS and
BPF_CAP_SCX_ANY by default.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2a321a641b4a..7a69a581861f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -22959,6 +22959,40 @@ static int process_fd_array(struct bpf_verifier_env *env, union bpf_attr *attr,
return 0;
}
+#define STRUCT_OPS_BASE_CAPS \
+ BPF_CAP_TEST_1, \
+ BPF_CAP_TEST_2, \
+ BPF_CAP_TEST_3
+
+static const enum bpf_capability struct_ops_caps[] = {
+ STRUCT_OPS_BASE_CAPS,
+ BPF_CAP_SCX_ANY
+};
+
+struct bpf_program_type_caps {
+ enum bpf_prog_type type;
+ u32 size;
+ const enum bpf_capability *capabilities;
+};
+
+static const struct bpf_program_type_caps bpf_default_capabilities[] = {
+ {
+ .type = BPF_PROG_TYPE_STRUCT_OPS,
+ .size = ARRAY_SIZE(struct_ops_caps),
+ .capabilities = struct_ops_caps
+ },
+};
+
+static void setup_bpf_capabilities(unsigned long *bpf_capabilities,
+ const struct bpf_program_type_caps *caps)
+{
+ int i;
+
+ bitmap_zero(bpf_capabilities, __MAX_BPF_CAP);
+ for (i = 0; i < caps->size; i++)
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, caps->capabilities[i]);
+}
+
int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u32 uattr_size)
{
u64 start_time = ktime_get_ns();
@@ -22997,6 +23031,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
env->bypass_spec_v4 = bpf_bypass_spec_v4(env->prog->aux->token);
env->bpf_capable = is_priv = bpf_token_capable(env->prog->aux->token, CAP_BPF);
+ if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS)
+ setup_bpf_capabilities(env->bpf_capabilities, &bpf_default_capabilities[0]);
+
bpf_get_btf_vmlinux();
/* grab the mutex to protect few globals used by verifier */
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
` (4 preceding siblings ...)
2025-01-16 19:41 ` [RFC PATCH bpf-next 5/7] bpf: Add default BPF capabilities initialization for program types Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
2025-01-17 16:58 ` Tejun Heo
2025-01-24 4:52 ` Alexei Starovoitov
2025-01-16 19:41 ` [RFC PATCH bpf-next 7/7] sched_ext: Add proof-of-concept test case Juntong Deng
6 siblings, 2 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch modifies SCX to use BPF capabilities.
All SCX kfuncs are now registered under BPF capabilities instead of
BPF_PROG_TYPE_STRUCT_OPS.
Add bpf_scx_bpf_capabilities_adjust as the bpf_capabilities_adjust
callback.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
1 file changed, 62 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7fff1d045477..53cc7c3ed80b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
+static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
+ u32 context_info, bool enter)
+{
+ if (enter) {
+ switch (context_info) {
+ case offsetof(struct sched_ext_ops, select_cpu):
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
+ break;
+ case offsetof(struct sched_ext_ops, enqueue):
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
+ break;
+ case offsetof(struct sched_ext_ops, dispatch):
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
+ break;
+ case offsetof(struct sched_ext_ops, running):
+ case offsetof(struct sched_ext_ops, stopping):
+ case offsetof(struct sched_ext_ops, enable):
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
+ break;
+ case offsetof(struct sched_ext_ops, init):
+ case offsetof(struct sched_ext_ops, exit):
+ ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
+ break;
+ default:
+ return -EINVAL;
+ }
+ } else {
+ switch (context_info) {
+ case offsetof(struct sched_ext_ops, select_cpu):
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
+ break;
+ case offsetof(struct sched_ext_ops, enqueue):
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
+ break;
+ case offsetof(struct sched_ext_ops, dispatch):
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
+ break;
+ case offsetof(struct sched_ext_ops, running):
+ case offsetof(struct sched_ext_ops, stopping):
+ case offsetof(struct sched_ext_ops, enable):
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
+ break;
+ case offsetof(struct sched_ext_ops, init):
+ case offsetof(struct sched_ext_ops, exit):
+ DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
static const struct bpf_verifier_ops bpf_scx_verifier_ops = {
.get_func_proto = bpf_scx_get_func_proto,
.is_valid_access = bpf_scx_is_valid_access,
.btf_struct_access = bpf_scx_btf_struct_access,
+ .bpf_capabilities_adjust = bpf_scx_bpf_capabilities_adjust
};
static int bpf_scx_init_member(const struct btf_type *t,
@@ -7596,23 +7652,17 @@ static int __init scx_init(void)
* them. For now, register them the same and make each kfunc explicitly
* check using scx_kf_allowed().
*/
- if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ if ((ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_KF_SELECT_CPU,
&scx_kfunc_set_select_cpu)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ (ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_KF_ENQUEUE,
&scx_kfunc_set_enqueue_dispatch)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ (ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_KF_DISPATCH,
&scx_kfunc_set_dispatch)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ (ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_KF_CPU_RELEASE,
&scx_kfunc_set_cpu_release)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
- &scx_kfunc_set_unlocked)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+ (ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_KF_UNLOCKED,
&scx_kfunc_set_unlocked)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
- &scx_kfunc_set_any)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
- &scx_kfunc_set_any)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+ (ret = register_btf_kfunc_id_set_cap(BPF_CAP_SCX_ANY,
&scx_kfunc_set_any))) {
pr_err("sched_ext: Failed to register kfunc sets (%d)\n", ret);
return ret;
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH bpf-next 7/7] sched_ext: Add proof-of-concept test case
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
` (5 preceding siblings ...)
2025-01-16 19:41 ` [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities Juntong Deng
@ 2025-01-16 19:41 ` Juntong Deng
6 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-16 19:41 UTC (permalink / raw)
To: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void
Cc: bpf, linux-kernel
This patch adds a proof-of-concept test case scx_simple_cap_test, which
is based on scx_simple.
A call to scx_bpf_dsq_move_to_local() is added to enqueue(), where it
is not allowed, so the program will be rejected by the verifier.
Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
tools/sched_ext/Makefile | 2 +-
tools/sched_ext/scx_simple_cap_test.bpf.c | 159 ++++++++++++++++++++++
tools/sched_ext/scx_simple_cap_test.c | 107 +++++++++++++++
3 files changed, 267 insertions(+), 1 deletion(-)
create mode 100644 tools/sched_ext/scx_simple_cap_test.bpf.c
create mode 100644 tools/sched_ext/scx_simple_cap_test.c
diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
index ca3815e572d8..0e87aa77594b 100644
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@@ -176,7 +176,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP
SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
-c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
+c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg scx_simple_cap_test
$(addprefix $(BINDIR)/,$(c-sched-targets)): \
$(BINDIR)/%: \
diff --git a/tools/sched_ext/scx_simple_cap_test.bpf.c b/tools/sched_ext/scx_simple_cap_test.bpf.c
new file mode 100644
index 000000000000..6bcf4dcbfcb4
--- /dev/null
+++ b/tools/sched_ext/scx_simple_cap_test.bpf.c
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * A simple scheduler.
+ *
+ * By default, it operates as a simple global weighted vtime scheduler and can
+ * be switched to FIFO scheduling. It also demonstrates the following niceties.
+ *
+ * - Statistics tracking how many tasks are queued to local and global dsq's.
+ * - Termination notification for userspace.
+ *
+ * While very simple, this scheduler should work reasonably well on CPUs with a
+ * uniform L3 cache topology. While preemption is not implemented, the fact that
+ * the scheduling queue is shared across all CPUs means that whatever is at the
+ * front of the queue is likely to be executed fairly quickly given enough
+ * number of CPUs. The FIFO scheduling mode may be beneficial to some workloads
+ * but comes with the usual problems with FIFO scheduling where saturating
+ * threads can easily drown out interactive ones.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile bool fifo_sched;
+
+static u64 vtime_now;
+UEI_DEFINE(uei);
+
+/*
+ * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues
+ * (meaning, cannot be dispatched to with scx_bpf_dsq_insert_vtime()). We
+ * therefore create a separate DSQ with ID 0 that we dispatch to and consume
+ * from. If scx_simple only supported global FIFO scheduling, then we could just
+ * use SCX_DSQ_GLOBAL.
+ */
+#define SHARED_DSQ 0
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __uint(key_size, sizeof(u32));
+ __uint(value_size, sizeof(u64));
+ __uint(max_entries, 2); /* [local, global] */
+} stats SEC(".maps");
+
+static void stat_inc(u32 idx)
+{
+ u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
+ if (cnt_p)
+ (*cnt_p)++;
+}
+
+static inline bool vtime_before(u64 a, u64 b)
+{
+ return (s64)(a - b) < 0;
+}
+
+s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ bool is_idle = false;
+ s32 cpu;
+
+ cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
+ if (is_idle) {
+ stat_inc(0); /* count local queueing */
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ }
+
+ return cpu;
+}
+
+void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
+{
+ stat_inc(1); /* count global queueing */
+
+ /* bpf capabilities test! */
+ scx_bpf_dsq_move_to_local(SHARED_DSQ);
+
+ if (fifo_sched) {
+ scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
+ } else {
+ u64 vtime = p->scx.dsq_vtime;
+
+ /*
+ * Limit the amount of budget that an idling task can accumulate
+ * to one slice.
+ */
+ if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
+ vtime = vtime_now - SCX_SLICE_DFL;
+
+ scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
+ enq_flags);
+ }
+}
+
+void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
+{
+ scx_bpf_dsq_move_to_local(SHARED_DSQ);
+}
+
+void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
+{
+ if (fifo_sched)
+ return;
+
+ /*
+ * Global vtime always progresses forward as tasks start executing. The
+ * test and update can be performed concurrently from multiple CPUs and
+ * thus racy. Any error should be contained and temporary. Let's just
+ * live with it.
+ */
+ if (vtime_before(vtime_now, p->scx.dsq_vtime))
+ vtime_now = p->scx.dsq_vtime;
+}
+
+void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
+{
+ if (fifo_sched)
+ return;
+
+ /*
+ * Scale the execution time by the inverse of the weight and charge.
+ *
+ * Note that the default yield implementation yields by setting
+ * @p->scx.slice to zero and the following would treat the yielding task
+ * as if it has consumed all its slice. If this penalizes yielding tasks
+ * too much, determine the execution time by taking explicit timestamps
+ * instead of depending on @p->scx.slice.
+ */
+ p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
+}
+
+void BPF_STRUCT_OPS(simple_enable, struct task_struct *p)
+{
+ p->scx.dsq_vtime = vtime_now;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
+{
+ return scx_bpf_create_dsq(SHARED_DSQ, -1);
+}
+
+void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SCX_OPS_DEFINE(simple_ops,
+ .select_cpu = (void *)simple_select_cpu,
+ .enqueue = (void *)simple_enqueue,
+ .dispatch = (void *)simple_dispatch,
+ .running = (void *)simple_running,
+ .stopping = (void *)simple_stopping,
+ .enable = (void *)simple_enable,
+ .init = (void *)simple_init,
+ .exit = (void *)simple_exit,
+ .name = "simple");
diff --git a/tools/sched_ext/scx_simple_cap_test.c b/tools/sched_ext/scx_simple_cap_test.c
new file mode 100644
index 000000000000..c4a0a5b1e0cf
--- /dev/null
+++ b/tools/sched_ext/scx_simple_cap_test.c
@@ -0,0 +1,107 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
+ * Copyright (c) 2022 David Vernet <dvernet@meta.com>
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <libgen.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include "scx_simple_cap_test.bpf.skel.h"
+
+const char help_fmt[] =
+"A simple sched_ext scheduler.\n"
+"\n"
+"See the top-level comment in .bpf.c for more details.\n"
+"\n"
+"Usage: %s [-f] [-v]\n"
+"\n"
+" -f Use FIFO scheduling instead of weighted vtime scheduling\n"
+" -v Print libbpf debug messages\n"
+" -h Display this help and exit\n";
+
+static bool verbose;
+static volatile int exit_req;
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+ if (level == LIBBPF_DEBUG && !verbose)
+ return 0;
+ return vfprintf(stderr, format, args);
+}
+
+static void sigint_handler(int simple)
+{
+ exit_req = 1;
+}
+
+static void read_stats(struct scx_simple_cap_test *skel, __u64 *stats)
+{
+ int nr_cpus = libbpf_num_possible_cpus();
+ __u64 cnts[2][nr_cpus];
+ __u32 idx;
+
+ memset(stats, 0, sizeof(stats[0]) * 2);
+
+ for (idx = 0; idx < 2; idx++) {
+ int ret, cpu;
+
+ ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
+ &idx, cnts[idx]);
+ if (ret < 0)
+ continue;
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ stats[idx] += cnts[idx][cpu];
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct scx_simple_cap_test *skel;
+ struct bpf_link *link;
+ __u32 opt;
+ __u64 ecode;
+
+ libbpf_set_print(libbpf_print_fn);
+ signal(SIGINT, sigint_handler);
+ signal(SIGTERM, sigint_handler);
+restart:
+ skel = SCX_OPS_OPEN(simple_ops, scx_simple_cap_test);
+
+ while ((opt = getopt(argc, argv, "fvh")) != -1) {
+ switch (opt) {
+ case 'f':
+ skel->rodata->fifo_sched = true;
+ break;
+ case 'v':
+ verbose = true;
+ break;
+ default:
+ fprintf(stderr, help_fmt, basename(argv[0]));
+ return opt != 'h';
+ }
+ }
+
+ SCX_OPS_LOAD(skel, simple_ops, scx_simple_cap_test, uei);
+ link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple_cap_test);
+
+ while (!exit_req && !UEI_EXITED(skel, uei)) {
+ __u64 stats[2];
+
+ read_stats(skel, stats);
+ printf("local=%llu global=%llu\n", stats[0], stats[1]);
+ fflush(stdout);
+ sleep(1);
+ }
+
+ bpf_link__destroy(link);
+ ecode = UEI_REPORT(skel, uei);
+ scx_simple_cap_test__destroy(skel);
+
+ if (UEI_ECODE_RESTART(ecode))
+ goto restart;
+ return 0;
+}
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-16 19:41 ` [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability Juntong Deng
@ 2025-01-16 22:56 ` Song Liu
2025-01-17 19:37 ` Juntong Deng
0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2025-01-16 22:56 UTC (permalink / raw)
To: Juntong Deng
Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void, bpf,
linux-kernel
On Thu, Jan 16, 2025 at 11:43 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>
> This patch adds enum bpf_capability, currently only for proof
> of concept.
>
> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
> ---
> include/uapi/linux/bpf.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2acf9b336371..94c21d4eb786 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1058,6 +1058,21 @@ enum bpf_prog_type {
> __MAX_BPF_PROG_TYPE
> };
>
> +enum bpf_capability {
> + BPF_CAP_NONE = 0,
> + BPF_CAP_TEST_1,
> + BPF_CAP_TEST_2,
> + BPF_CAP_TEST_3,
> + BPF_CAP_SCX_ANY,
> + BPF_CAP_SCX_KF_UNLOCKED,
> + BPF_CAP_SCX_KF_CPU_RELEASE,
> + BPF_CAP_SCX_KF_DISPATCH,
> + BPF_CAP_SCX_KF_ENQUEUE,
> + BPF_CAP_SCX_KF_SELECT_CPU,
> + BPF_CAP_SCX_KF_REST,
> + __MAX_BPF_CAP
> +};
> +
I don't think we need to handle these in the core verifier.
Instead, we can put the same logic in:
fetch_kfunc_meta =>
btf_kfunc_id_set_contains =>
__btf_kfunc_id_set_contains =>
hook_filter->filters[i]()
Thanks,
Song
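The filter-based alternative Song points at can be illustrated with a small
userspace mock. Everything below is an illustrative stand-in, not the real
kernel API: the idea is that the kfunc-set lookup path consults a per-set
filter callback (given the program and kfunc id) instead of a capability
bitmap in the verifier env, so context-sensitive checks stay out of the core
verifier. The type names, member offsets, and filter semantics here are
assumptions for the sketch.

```c
/* Userspace mock of a per-kfunc-set filter callback. In the real kernel,
 * a comparable check would run inside __btf_kfunc_id_set_contains(); all
 * names and types here are invented stand-ins for illustration only.
 */
#include <assert.h>
#include <stddef.h>

struct mock_prog {
	size_t attach_moff; /* member offset of the attached struct_ops op */
};

/* stand-in member offsets within a struct_ops map */
enum { MOFF_SELECT_CPU = 0, MOFF_ENQUEUE = 8, MOFF_DISPATCH = 16 };

/* returns nonzero if this kfunc set is usable from the program's context */
typedef int (*mock_kfunc_filter_t)(const struct mock_prog *prog,
				   unsigned int kfunc_id);

static int scx_enqueue_set_filter(const struct mock_prog *prog,
				  unsigned int kfunc_id)
{
	(void)kfunc_id;
	/* enqueue-class kfuncs: callable from select_cpu and enqueue ops */
	return prog->attach_moff == MOFF_SELECT_CPU ||
	       prog->attach_moff == MOFF_ENQUEUE;
}
```

The filter answers yes or no procedurally from the calling context, which is
also the shape Tejun suggests later in the thread.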
> enum bpf_attach_type {
> BPF_CGROUP_INET_INGRESS,
> BPF_CGROUP_INET_EGRESS,
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-16 19:41 ` [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities Juntong Deng
@ 2025-01-17 16:58 ` Tejun Heo
2025-01-17 20:09 ` Juntong Deng
2025-01-24 4:52 ` Alexei Starovoitov
1 sibling, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2025-01-17 16:58 UTC (permalink / raw)
To: Juntong Deng
Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, bpf,
linux-kernel
Hello,
On Thu, Jan 16, 2025 at 07:41:11PM +0000, Juntong Deng wrote:
...
> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
> + u32 context_info, bool enter)
> +{
> + if (enter) {
> + switch (context_info) {
> + case offsetof(struct sched_ext_ops, select_cpu):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
...
> + }
> + } else {
> + switch (context_info) {
> + case offsetof(struct sched_ext_ops, select_cpu):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
...
> + }
> + return 0;
> +}
From sched_ext's POV, this, or whatever works is fine but I have some basic
comments:
- The caps are u32. If all kfuncs use this facility for access control, it's
plausible or even likely that 32 is not going to be enough. I suppose it
can be turned into u64 and then a proper bitmap later? Maybe good idea to
start out with a proper bitmap in the first place?
- There are benefits to centralizing all the caps in a single place but it
can also be kinda cumbersome.
- Even with global defs, the cap adjustments are procedural, not
declarative. If it needs to be procedural anyway, I wonder whether the
global defs are necessary in the first place. What prevents implementing
it the other way around - pass in the calling context and provide helpers
and macros to respond yay or nay procedurally.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-16 22:56 ` Song Liu
@ 2025-01-17 19:37 ` Juntong Deng
2025-01-17 21:40 ` Song Liu
0 siblings, 1 reply; 18+ messages in thread
From: Juntong Deng @ 2025-01-17 19:37 UTC (permalink / raw)
To: Song Liu
Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void, bpf,
linux-kernel
On 2025/1/16 22:56, Song Liu wrote:
> On Thu, Jan 16, 2025 at 11:43 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>>
>> This patch adds enum bpf_capability, currently only for proof
>> of concept.
>>
>> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
>> ---
>> include/uapi/linux/bpf.h | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2acf9b336371..94c21d4eb786 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1058,6 +1058,21 @@ enum bpf_prog_type {
>> __MAX_BPF_PROG_TYPE
>> };
>>
>> +enum bpf_capability {
>> + BPF_CAP_NONE = 0,
>> + BPF_CAP_TEST_1,
>> + BPF_CAP_TEST_2,
>> + BPF_CAP_TEST_3,
>> + BPF_CAP_SCX_ANY,
>> + BPF_CAP_SCX_KF_UNLOCKED,
>> + BPF_CAP_SCX_KF_CPU_RELEASE,
>> + BPF_CAP_SCX_KF_DISPATCH,
>> + BPF_CAP_SCX_KF_ENQUEUE,
>> + BPF_CAP_SCX_KF_SELECT_CPU,
>> + BPF_CAP_SCX_KF_REST,
>> + __MAX_BPF_CAP
>> +};
>> +
>
> I don't think we need to handle these in the core verifier.
> Instead, we can put the same logic in:
>
> fetch_kfunc_meta =>
> btf_kfunc_id_set_contains =>
> __btf_kfunc_id_set_contains =>
> hook_filter->filters[i]()
>
> Thanks,
> Song
>
Thanks for your reply.
I am not sure if BPF capabilities is a good approach.
But we currently need filters because we register all kfuncs to program
types, which are too coarse-grained, so we need additional filters for
further filtering (make the granularity finer).
We added struct btf_kfunc_hook_filter and added filter logic in
btf_populate_kfunc_set, __btf_kfunc_id_set_contains, essentially to
mitigate the problem of coarse-grained permissions management.
If we register all kfuncs to BPF capabilities, then we will no longer
need additional filters for further filtering because BPF capabilities
is already fine-grained.
Would it be a better idea for us to let each kfunc have its own
capability attribute?
In addition, BPF capabilities seem like an extensible idea. Would it be
valuable if we make other features of BPF (BPF helpers, BPF maps, even
attach targets) have their own capability attributes and can be managed
uniformly through BPF capabilities?
For example, if a bpf program has BPF_CAP_TRACING, then it will be able
to use kprobes and can use tracing related kfuncs and helpers. If a bpf
program has BPF_CAP_SOCK then it will be able to use
BPF_MAP_TYPE_SOCKMAP and use socket related kfuncs and helpers.
In other words, if we add a general internal permissions management
system to the BPF subsystem, would it be valuable?
BPF is general, and it is foreseeable that BPF will be applied to more
and more subsystems and scenarios, so maybe a general fine-grained
permissions management would be better?
Fine-grained permissions management will bring potential flexibility
in configurability.
For example, if a system administrator wants to open the features of the
HID-BPF driver to users, but the system administrator does not want to
open other BPF features to users, such as sched_ext.
But both HID-BPF and sched_ext are using BPF_PROG_TYPE_STRUCT_OPS, so we
cannot restrict based on program type.
If based on BPF capabilities, maybe the system administrator can give
the user CAP_BPF (Linux capabilities) and set only BPF_CAP_HID
(BPF capabilities) to be enabled in /sys/bpf/bpf_enabled_capabilities
(which can only be modified by CAP_SYS_ADMIN), then the restriction of
only allowing the use of HID-BPF can be applied.
In this example, CAP_BPF is also too coarse-grained. Although BPF
looks like a separate subsystem, in fact BPF (low-level) is
the infrastructure, and we implement features (high-level) for
different scenarios based on BPF. It might be better to manage
BPF at a finer granularity.
If we want to restrict some features while opening only others,
coarse-grained permission management cannot do it.
Maybe BPF capabilities not only can be used to solve the current SCX
context problem, but other problems as well?
Maybe we can have more discussion?
Many thanks.
>> enum bpf_attach_type {
>> BPF_CGROUP_INET_INGRESS,
>> BPF_CGROUP_INET_EGRESS,
>> --
>> 2.39.5
>>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-17 16:58 ` Tejun Heo
@ 2025-01-17 20:09 ` Juntong Deng
0 siblings, 0 replies; 18+ messages in thread
From: Juntong Deng @ 2025-01-17 20:09 UTC (permalink / raw)
To: Tejun Heo
Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, void, bpf,
linux-kernel
On 2025/1/17 16:58, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 16, 2025 at 07:41:11PM +0000, Juntong Deng wrote:
> ...
>> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
>> + u32 context_info, bool enter)
>> +{
>> + if (enter) {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
> ...
>> + }
>> + } else {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
> ...
>> + }
>> + return 0;
>> +}
>
> From sched_ext's POV, this, or whatever works is fine but I have some basic
> comments:
>
> - The caps are u32. If all kfuncs use this facility for access control, it's
> plausible or even likely that 32 is not going to be enough. I suppose it
> can be turned into u64 and then a proper bitmap later? Maybe good idea to
> start out with a proper bitmap in the first place?
>
Thanks for your reply.
I considered this, and 32 capabilities is definitely not enough.
So although the capability field in BTF_ID_FLAGS is 32 bits, it is used
as an index, and a bitmap is used in struct bpf_verifier_env.
We can have up to UINT_MAX capabilities.
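The index-into-a-bitmap scheme described here can be sketched in plain
userspace C. The macro bodies below are assumptions (the actual
ENABLE_BPF_CAPABILITY/DISABLE_BPF_CAPABILITY implementations are not shown
in this thread); the point is that the 32-bit capability value is only a bit
index, so the set of capabilities is not limited to 32 or 64.

```c
/* Minimal sketch of capability-as-bitmap-index, mirroring the macro
 * names used in the patches. Macro bodies are illustrative assumptions.
 */
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

enum bpf_capability {
	BPF_CAP_NONE = 0,
	BPF_CAP_SCX_KF_ENQUEUE,
	BPF_CAP_SCX_KF_DISPATCH,
	__MAX_BPF_CAP
};

#define ENABLE_BPF_CAPABILITY(map, cap) \
	((map)[(cap) / BITS_PER_LONG] |= 1UL << ((cap) % BITS_PER_LONG))
#define DISABLE_BPF_CAPABILITY(map, cap) \
	((map)[(cap) / BITS_PER_LONG] &= ~(1UL << ((cap) % BITS_PER_LONG)))
#define HAS_BPF_CAPABILITY(map, cap) \
	(((map)[(cap) / BITS_PER_LONG] >> ((cap) % BITS_PER_LONG)) & 1UL)
```

A verifier-side check would then reduce to testing one bit in the env's
bitmap when a kfunc call is encountered.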
> - There are benefits to centralizing all the caps in a single place but it
> can also be kinda cumbersome.
>
Yes, I agree, maybe centralized declarations are cumbersome.
But the purpose of centralized declaration is to give each capability a
unique number for distinction.
I am not sure there is a non-centralized declarative way to do this.
Do you have any suggested approach?
> - Even with global defs, the cap adjustments are procedural, not
> declarative. If it needs to be procedural anyway, I wonder whether the
> global defs are necessary in the first place. What prevents implementing
> it the other way around - pass in the calling context and provide helpers
> and macros to respond yay or nay procedurally.
>
You are right!
Thanks for pointing this out!
If it is procedural, then global definitions are really not necessary.
I will rethink this.
> Thanks.
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-17 19:37 ` Juntong Deng
@ 2025-01-17 21:40 ` Song Liu
2025-01-20 21:49 ` Juntong Deng
0 siblings, 1 reply; 18+ messages in thread
From: Song Liu @ 2025-01-17 21:40 UTC (permalink / raw)
To: Juntong Deng
Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void, bpf,
linux-kernel
On Fri, Jan 17, 2025 at 11:37 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>
[...]
>
> Thanks for your reply.
>
> I am not sure if BPF capabilities is a good approach.
>
> But we currently need filters because we register all kfuncs to program
> types, which are too coarse-grained, so we need additional filters for
> further filtering (make the granularity finer).
>
> We added struct btf_kfunc_hook_filter and added filter logic in
> btf_populate_kfunc_set, __btf_kfunc_id_set_contains, essentially to
> mitigate the problem of coarse-grained permissions management.
>
> If we register all kfuncs to BPF capabilities, then we will no longer
> need additional filters for further filtering because BPF capabilities
> is already fine-grained.
bpf_capabilities_adjust is the filter function with a different name.
So the extra capability concept doesn't give us much benefit.
>
> Would it be a better idea for us to let each kfunc have its own
> capability attribute?
This is no different to the BPF helper function ID, which turned
out to be not scalable.
>
> In addition, BPF capabilities seem like a extensible idea. Would it be
> valuable if we make other features of BPF (BPF helpers, BPF maps, even
> attach targets) have their own capability attributes and can be managed
> uniformly through BPF capabilities?
>
> For example, if a bpf program has BPF_CAP_TRACING, then it will be able
> to use kprobes and can use tracing related kfuncs and helpers. If a bpf
> program has BPF_CAP_SOCK then it will be able to use
> BPF_MAP_TYPE_SOCKMAP and use socket related kfuncs and helpers.
>
> In other words, if we add a general internal permissions management
> system to the BPF subsystem, would it be valuable?
>
> BPF is general, and it is foreseeable that BPF will be applied to more
> and more subsystems and scenarios, so maybe a general fine-grained
> permissions management would be better?
>
> Fine-grained permissions management will bring potential flexibility
> in configurability.
>
> For example, if a system administrator wants to open the features of the
> HID-BPF driver to users, but the system administrator does not want to
> open other BPF features to users, such as sched_ext.
This appears to be a totally separate topic.
[...]
> Maybe we can have more discussion?
We sure need more discussion before shipping any changes for this
topic.
Thanks,
Song
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-17 21:40 ` Song Liu
@ 2025-01-20 21:49 ` Juntong Deng
2025-01-21 17:41 ` Song Liu
0 siblings, 1 reply; 18+ messages in thread
From: Juntong Deng @ 2025-01-20 21:49 UTC (permalink / raw)
To: Song Liu, andrii
Cc: ast, daniel, john.fastabend, martin.lau, eddyz87, yonghong.song,
kpsingh, sdf, haoluo, jolsa, memxor, tj, void, bpf, linux-kernel
On 2025/1/17 21:40, Song Liu wrote:
> On Fri, Jan 17, 2025 at 11:37 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>>
> [...]
>>
>> Thanks for your reply.
>>
>> I am not sure if BPF capabilities is a good approach.
>>
>> But we currently need filters because we register all kfuncs to program
>> types, which are too coarse-grained, so we need additional filters for
>> further filtering (make the granularity finer).
>>
>> We added struct btf_kfunc_hook_filter and added filter logic in
>> btf_populate_kfunc_set, __btf_kfunc_id_set_contains, essentially to
>> mitigate the problem of coarse-grained permissions management.
>>
>> If we register all kfuncs to BPF capabilities, then we will no longer
>> need additional filters for further filtering because BPF capabilities
>> is already fine-grained.
>
> bpf_capabilities_adjust is the filter function with a different name.
> So the extra capability concept doesn't give us much benefit.
>
Yes, you are right, I realized this too.
Especially since bpf_capabilities_adjust is procedural, adjusting
capabilities based on different contexts is essentially no different
from filters.
We could use BTF_LIST to store the capabilities of different contexts
in BTF sections, making the BPF capabilities corresponding to each
context declarative.
But it seems not worth it, because the filter is more straightforward.
I totally agree that BPF capabilities will not give us much benefit in
solving the SCX context problem.
>>
>> Would it be a better idea for us to let each kfunc have its own
>> capability attribute?
>
> This is no different to the BPF helper function ID, which turned
> out to be not scalable.
>
There still seems to be a difference: BPF capabilities are not
one-to-one with kfuncs; multiple kfuncs can be bound to one
BPF capability.
BPF capabilities are more like fine-grained versions of program types.
>>
>> In addition, BPF capabilities seem like a extensible idea. Would it be
>> valuable if we make other features of BPF (BPF helpers, BPF maps, even
>> attach targets) have their own capability attributes and can be managed
>> uniformly through BPF capabilities?
>>
>> For example, if a bpf program has BPF_CAP_TRACING, then it will be able
>> to use kprobes and can use tracing related kfuncs and helpers. If a bpf
>> program has BPF_CAP_SOCK then it will be able to use
>> BPF_MAP_TYPE_SOCKMAP and use socket related kfuncs and helpers.
>>
>> In other words, if we add a general internal permissions management
>> system to the BPF subsystem, would it be valuable?
>>
>> BPF is general, and it is foreseeable that BPF will be applied to more
>> and more subsystems and scenarios, so maybe a general fine-grained
>> permissions management would be better?
>>
>> Fine-grained permissions management will bring potential flexibility
>> in configurability.
>>
>> For example, if a system administrator wants to open the features of the
>> HID-BPF driver to users, but the system administrator does not want to
>> open other BPF features to users, such as sched_ext.
>
> This appears to be a totally separate topic.
>
Although I am not sure, I suspect general fine-grained permission
management might still be valuable (not necessarily BPF capabilities).
I found that Andrii Nakryiko implemented something similar in
BPF Token[0].
As with SCX, BPF features are made fine-grained through masks, so that
only a subset of the BPF features is opened up.
This seems to indicate that the demand for fine-grained BPF permission
management has always existed, and that the demand for opening up only
part of the BPF features will keep reappearing in different forms.
Maybe we do need a general fine-grained permission management solution?
Andrii, if you see this email, could you please join the discussion?
[0]: https://lwn.net/Articles/947173/
> [...]
>
>> Maybe we can have more discussion?
>
> We sure need more discussion before shipping any changes for this
> topic.
>
> Thanks,
> Song
* Re: [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability
2025-01-20 21:49 ` Juntong Deng
@ 2025-01-21 17:41 ` Song Liu
0 siblings, 0 replies; 18+ messages in thread
From: Song Liu @ 2025-01-21 17:41 UTC (permalink / raw)
To: Juntong Deng
Cc: andrii, ast, daniel, john.fastabend, martin.lau, eddyz87,
yonghong.song, kpsingh, sdf, haoluo, jolsa, memxor, tj, void, bpf,
linux-kernel
On Mon, Jan 20, 2025 at 1:50 PM Juntong Deng <juntong.deng@outlook.com> wrote:
>
[...]
>
> >>
> >> Would it be a better idea for us to let each kfunc have its own
> >> capability attribute?
> >
> > This is no different to the BPF helper function ID, which turned
> > out to be not scalable.
> >
>
> There still seems to be a difference? BPF capabilities are not
> one-to-one with kfuncs, and multiple kfuncs can be bound to one
> BPF capability.
>
> BPF capabilities are more like fine-grained versions of program types.
I personally think struct_ops gives good enough fine-grained control.
Therefore, I don't see a real need for a different concept.
[...]
> >>
> >> For example, if a system administrator wants to open the features of the
> >> HID-BPF driver to users, but the system administrator does not want to
> >> open other BPF features to users, such as sched_ext.
> >
> > This appears to be a totally separate topic.
> >
>
> Although I am not sure, I guess general fine-grained permissions
> management might still be valuable (not necessarily BPF capabilities).
>
> I found that Andrii Nakryiko implemented something similar in
> BPF Token[0].
>
> Similar to SCX, BPF features are fine-grained through masks to restrict
> only part of the BPF features to be opened.
>
> This seems to indicate that the demand for making BPF permissions
> management fine-grained has always existed, and the demand for opening
> only part of the BPF features will reappear in different forms.
>
> Maybe we do need a general fine-grained permissions management solution?
I don't think it is easy to build a fine-grained permission management
solution that fits most scenarios. It is better to do this via programmable
interfaces, e.g. with BPF LSM.
Thanks,
Song
> If Andrii saw this email, could you please join the discussion?
>
> [0]: https://lwn.net/Articles/947173/
>
* Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-16 19:41 ` [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities Juntong Deng
2025-01-17 16:58 ` Tejun Heo
@ 2025-01-24 4:52 ` Alexei Starovoitov
2025-01-24 22:44 ` Juntong Deng
1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-24 4:52 UTC (permalink / raw)
To: Juntong Deng
Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Kumar Kartikeya Dwivedi, Tejun Heo, David Vernet, bpf, LKML
On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>
> This patch modifies SCX to use BPF capabilities.
>
> Make all SCX kfuncs register to BPF capabilities instead of
> BPF_PROG_TYPE_STRUCT_OPS.
>
> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
> callback function.
>
> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
> ---
> kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 62 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 7fff1d045477..53cc7c3ed80b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> }
> }
'capabilities' name doesn't fit.
The word already has its meaning in the kernel.
It cannot be reused for a different purpose.
> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
> + u32 context_info, bool enter)
> +{
> + if (enter) {
> + switch (context_info) {
> + case offsetof(struct sched_ext_ops, select_cpu):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
> + case offsetof(struct sched_ext_ops, enqueue):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
> + case offsetof(struct sched_ext_ops, dispatch):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> + break;
> + case offsetof(struct sched_ext_ops, running):
> + case offsetof(struct sched_ext_ops, stopping):
> + case offsetof(struct sched_ext_ops, enable):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> + break;
> + case offsetof(struct sched_ext_ops, init):
> + case offsetof(struct sched_ext_ops, exit):
> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> + break;
> + default:
> + return -EINVAL;
> + }
> + } else {
> + switch (context_info) {
> + case offsetof(struct sched_ext_ops, select_cpu):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
> + case offsetof(struct sched_ext_ops, enqueue):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> + break;
> + case offsetof(struct sched_ext_ops, dispatch):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> + break;
> + case offsetof(struct sched_ext_ops, running):
> + case offsetof(struct sched_ext_ops, stopping):
> + case offsetof(struct sched_ext_ops, enable):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> + break;
> + case offsetof(struct sched_ext_ops, init):
> + case offsetof(struct sched_ext_ops, exit):
> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> + break;
> + default:
> + return -EINVAL;
> + }
> + }
> + return 0;
> +}
and this callback defeats the whole point of u32 bitmask.
In earlier patch
env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
is also wrong.
The context_info name is too generic and misleading.
and 'env' isn't a right place to save moff.
Let's try to implement what was discussed earlier:
1
After successful check_struct_ops_btf_id() save moff in
prog->aux->attach_st_ops_member_off.
2
Add .filter callback to sched-ext kfunc registration path and
let it allow/deny kfuncs based on st_ops attach point.
3
Remove scx_kf_allow() and current->scx.kf_mask.
That will be a nice perf win and will prove that
this approach works end-to-end.
* Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-24 4:52 ` Alexei Starovoitov
@ 2025-01-24 22:44 ` Juntong Deng
2025-01-30 0:32 ` Alexei Starovoitov
0 siblings, 1 reply; 18+ messages in thread
From: Juntong Deng @ 2025-01-24 22:44 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Kumar Kartikeya Dwivedi, Tejun Heo, David Vernet, bpf, LKML
On 2025/1/24 04:52, Alexei Starovoitov wrote:
> On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>>
>> This patch modifies SCX to use BPF capabilities.
>>
>> Make all SCX kfuncs register to BPF capabilities instead of
>> BPF_PROG_TYPE_STRUCT_OPS.
>>
>> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
>> callback function.
>>
>> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
>> ---
>> kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 62 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 7fff1d045477..53cc7c3ed80b 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> }
>> }
>
> 'capabilities' name doesn't fit.
> The word already has its meaning in the kernel.
> It cannot be reused for a different purpose.
>
>> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
>> + u32 context_info, bool enter)
>> +{
>> + if (enter) {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, enqueue):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, dispatch):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> + break;
>> + case offsetof(struct sched_ext_ops, running):
>> + case offsetof(struct sched_ext_ops, stopping):
>> + case offsetof(struct sched_ext_ops, enable):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> + break;
>> + case offsetof(struct sched_ext_ops, init):
>> + case offsetof(struct sched_ext_ops, exit):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + } else {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, enqueue):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, dispatch):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> + break;
>> + case offsetof(struct sched_ext_ops, running):
>> + case offsetof(struct sched_ext_ops, stopping):
>> + case offsetof(struct sched_ext_ops, enable):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> + break;
>> + case offsetof(struct sched_ext_ops, init):
>> + case offsetof(struct sched_ext_ops, exit):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + }
>> + return 0;
>> +}
>
> and this callback defeats the whole point of u32 bitmask.
>
Yes, you are right, I agree that procedural callbacks defeat the purpose
of BPF capabilities.
> In earlier patch
> env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
>
> is also wrong.
> The context_info name is too generic and misleading.
> and 'env' isn't a right place to save moff.
>
> Let's try to implement what was discussed earlier:
>
> 1
> After successful check_struct_ops_btf_id() save moff in
> prog->aux->attach_st_ops_member_off.
>
> 2
> Add .filter callback to sched-ext kfunc registration path and
> let it allow/deny kfuncs based on st_ops attach point.
>
> 3
> Remove scx_kf_allow() and current->scx.kf_mask.
>
> That will be a nice perf win and will prove that
> this approach works end-to-end.
I am trying, but I found a problem (bug?) when I added test cases
to bpf_testmod.c.
Filters currently do not work with kernel modules.
Filters rely heavily on checks like (taking bpf_fs_kfunc_set_ids
as an example)
if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id))
to exclude kfuncs that are not part of their own set
(__btf_kfunc_id_set_contains runs all the filters for each kfunc);
otherwise they would cause false rejects.
But this method cannot be used in kernel modules, because the BTF ids
of all kfuncs in a kernel module are relocated by btf_relocate_id in
btf_populate_kfunc_set.
As a result, the kfunc_id passed into the filter differs from the
BTF id stored in set_ids.
One possible solution is to export btf_relocate_id and
btf_get_module_btf, and let the kernel module do the relocation itself.
But I am not sure exporting them is a good idea.
Do you have any suggestions?
In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which does not
seem large enough.
If we use filters to enforce restrictions on struct_ops for different
contexts, then each context needs its own filter.
All filters for scenarios using struct_ops (SCX, HID, TCP congestion
control, etc.) end up in the same struct btf_kfunc_hook_filter
(the filters array).
It is foreseeable that the 16 slots will be exhausted soon.
Should we change it to a linked list?
* Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
2025-01-24 22:44 ` Juntong Deng
@ 2025-01-30 0:32 ` Alexei Starovoitov
0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-01-30 0:32 UTC (permalink / raw)
To: Juntong Deng
Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eddy Z, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Kumar Kartikeya Dwivedi, Tejun Heo, David Vernet, bpf, LKML
On Fri, Jan 24, 2025 at 2:45 PM Juntong Deng <juntong.deng@outlook.com> wrote:
>
> On 2025/1/24 04:52, Alexei Starovoitov wrote:
> > On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng <juntong.deng@outlook.com> wrote:
> >>
> >> This patch modifies SCX to use BPF capabilities.
> >>
> >> Make all SCX kfuncs register to BPF capabilities instead of
> >> BPF_PROG_TYPE_STRUCT_OPS.
> >>
> >> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
> >> callback function.
> >>
> >> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
> >> ---
> >> kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
> >> 1 file changed, 62 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 7fff1d045477..53cc7c3ed80b 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >> }
> >> }
> >
> > 'capabilities' name doesn't fit.
> > The word already has its meaning in the kernel.
> > It cannot be reused for a different purpose.
> >
> >> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
> >> + u32 context_info, bool enter)
> >> +{
> >> + if (enter) {
> >> + switch (context_info) {
> >> + case offsetof(struct sched_ext_ops, select_cpu):
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, enqueue):
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, dispatch):
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, running):
> >> + case offsetof(struct sched_ext_ops, stopping):
> >> + case offsetof(struct sched_ext_ops, enable):
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, init):
> >> + case offsetof(struct sched_ext_ops, exit):
> >> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> + } else {
> >> + switch (context_info) {
> >> + case offsetof(struct sched_ext_ops, select_cpu):
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, enqueue):
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, dispatch):
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, running):
> >> + case offsetof(struct sched_ext_ops, stopping):
> >> + case offsetof(struct sched_ext_ops, enable):
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
> >> + break;
> >> + case offsetof(struct sched_ext_ops, init):
> >> + case offsetof(struct sched_ext_ops, exit):
> >> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> + }
> >> + return 0;
> >> +}
> >
> > and this callback defeats the whole point of u32 bitmask.
> >
>
> Yes, you are right, I agree that procedural callbacks defeat the purpose
> of BPF capabilities.
>
> > In earlier patch
> > env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
> >
> > is also wrong.
> > The context_info name is too generic and misleading.
> > and 'env' isn't a right place to save moff.
> >
> > Let's try to implement what was discussed earlier:
> >
> > 1
> > After successful check_struct_ops_btf_id() save moff in
> > prog->aux->attach_st_ops_member_off.
> >
> > 2
> > Add .filter callback to sched-ext kfunc registration path and
> > let it allow/deny kfuncs based on st_ops attach point.
> >
> > 3
> > Remove scx_kf_allow() and current->scx.kf_mask.
> >
> > That will be a nice perf win and will prove that
> > this approach works end-to-end.
>
> I am trying, but I found a problem (bug?) when I added test cases
> to bpf_testmod.c.
>
> Filters currently do not work with kernel modules.
>
> Filters rely heavily on (bpf_fs_kfunc_set_ids as an example)
>
> if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id)
>
> exclude kfuncs that are not part of its own set
> (__btf_kfunc_id_set_contains performs all the filters for each kfunc),
> otherwise it will result in false rejects.
>
> But this method cannot be used in kernel modules because the BTF ids of
> all kfuncs are relocated.
>
> The BTF ids of all kfuncs in the kernel module will be relocated by
> btf_relocate_id in btf_populate_kfunc_set.
>
> This results in the kfunc_id passed into the filter being different from
> the BTF id in set_ids.
>
> One possible solution is to export btf_relocate_id and
> btf_get_module_btf, and let the kernel module do the relocation itself.
>
> But I am not sure exporting them is a good idea.
>
> Do you have any suggestions?
That's not a problem to fix today.
Currently only sched-ext needs this new ->filter() logic
and it's builtin.
Let's prototype the steps 1,2,3 end-to-end and see whether
anything big is missing.
The lack of bpf_testmod can be addressed later if the whole
approach works for sched-ext.
> In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which is not a
> large enough size.
>
> If we use filters to enforce restrictions on struct_ops for different
> contexts, then each different context needs a filter.
>
> All filters for scenarios using struct_ops (SCX, HID, TCP congestion,
> etc.) are placed in the same struct btf_kfunc_hook_filter
> (filters array).
>
> It is foreseeable that the 16 slots will be exhausted soon.
>
> Should we change it to a linked list?
No. Don't fix what is not broken.
We have a concrete run-time inefficiency on sched-ext side
let's address that first and deal with everything later.
Not the other way around.
end of thread, other threads:[~2025-01-30 0:32 UTC | newest]
Thread overview: 18+ messages
2025-01-16 19:35 [RFC PATCH bpf-next 0/7] bpf: BPF internal fine-grained permission management (BPF internal capabilities) Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 1/7] bpf: Add capability field to BTF_ID_FLAGS Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 2/7] bpf: Add enum bpf_capability Juntong Deng
2025-01-16 22:56 ` Song Liu
2025-01-17 19:37 ` Juntong Deng
2025-01-17 21:40 ` Song Liu
2025-01-20 21:49 ` Juntong Deng
2025-01-21 17:41 ` Song Liu
2025-01-16 19:41 ` [RFC PATCH bpf-next 3/7] bpf: Add capabilities version of kfuncs registration Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 4/7] bpf: Make the verifier support BPF capabilities Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 5/7] bpf: Add default BPF capabilities initialization for program types Juntong Deng
2025-01-16 19:41 ` [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities Juntong Deng
2025-01-17 16:58 ` Tejun Heo
2025-01-17 20:09 ` Juntong Deng
2025-01-24 4:52 ` Alexei Starovoitov
2025-01-24 22:44 ` Juntong Deng
2025-01-30 0:32 ` Alexei Starovoitov
2025-01-16 19:41 ` [RFC PATCH bpf-next 7/7] sched_ext: Add proof-of-concept test case Juntong Deng