* [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs
@ 2024-10-20 19:13 Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
` (8 more replies)
0 siblings, 9 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:13 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
The main motivation for private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
- each cgroup will have its own associated bpf program, and
- the bpf program of a parent cgroup will call the bpf programs of its
immediate child cgroups.
Let us say we have the following cgroup hierarchy:
root_cg (prog0):
cg1 (prog1):
cg11 (prog11):
cg111 (prog111)
cg112 (prog112)
cg12 (prog12):
cg121 (prog121)
cg122 (prog122)
cg2 (prog2):
cg21 (prog21)
cg22 (prog22)
cg23 (prog23)
In the above example, prog0 will call a kfunc which in turn calls prog1 and
prog2 to get sched info for cg1 and cg2; the information is then summarized
and sent back to prog0. Similarly, prog11 and prog12 will be invoked through
the kfunc and the result summarized and sent back to prog1, etc. The
following illustrates a possible call sequence:
... -> bpf prog A -> kfunc -> ops.<callback_fn> (bpf prog B) ...
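A hedged sketch of what such a kfunc can look like on the kernel side
(hypothetical names; the real testmod kfuncs added in patch 9 follow the
same pattern). Bpf prog A calls the kfunc, and the kfunc invokes the
struct_ops callback, which is bpf prog B:

/* hypothetical struct_ops pointer, set at struct_ops reg() time */
static struct sched_like_ops *st_ops;

__bpf_kfunc void sched_ops_call_child(void)
{
	st_ops->child_fn();	/* ops.<callback_fn>, i.e. bpf prog B */
}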
Currently, for each thread, the x86 kernel allocates a 16KB stack. Each
bpf program (including its subprograms) has a maximum stack size of 512
bytes to avoid potential stack overflow. Nested bpf programs further
increase the risk of stack overflow. To avoid potential stack overflow
caused by bpf programs, this patch set supports a private stack whose
space is allocated during verification time. The private stack is applied
to tracing programs like kprobe/uprobe, perf_event, tracepoint, raw
tracepoint and struct_ops progs. For struct_ops progs, if the callback
stub function name has a format like
<st_ops_name>__<member_name>__priv_stack
then the prog for that callback func will use the private stack. For other
tracing programs, if the prog's stack depth (including subprogs, but not
including callback functions) is greater than or equal to 128 bytes, the
private stack will be used.
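For illustration, a minimal sketch (hypothetical prog; the real coverage is
in the patch 8 selftests) of a tracing prog that crosses the 128-byte
threshold and hence gets a private stack:

SEC("kprobe")
int prog_with_big_stack(void *ctx)
{
	volatile char buf[128];	/* stack depth >= 128 bytes */

	buf[0] = 0;
	return 0;
}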
More than one instance of the same bpf program may run in the system.
To keep things simple, a percpu private stack is allocated for each program,
so if the same program runs on different cpus concurrently, there is no
issue. Note that the kernel already has logic to prevent recursion of the
same bpf program on the same cpu (kprobe, fentry, etc.).
This patch set implements a percpu private stack based approach for the x86
arch. Please see each individual patch for details.
Change logs:
v5 -> v6:
- v5 link: https://lore.kernel.org/bpf/20241017223138.3175885-1-yonghong.song@linux.dev/
- Instead of using (or not using) private stack at struct_ops level,
each prog in struct_ops can decide whether to use private stack or not.
v4 -> v5:
- v4 link: https://lore.kernel.org/bpf/20241010175552.1895980-1-yonghong.song@linux.dev/
- Remove bpf_prog_call() related implementation.
- Allow (opt-in) private stack for sched-ext progs.
v3 -> v4:
- v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
There is a long discussion in the above v3 link trying to allow the private
stack to be used by kernel functions in order to simplify the implementation.
Unfortunately we haven't found a workable solution yet, so we returned
to the approach where the private stack is only used by bpf programs.
- Add bpf_prog_call() kfunc.
v2 -> v3:
- Instead of per-subprog private stack allocation, allocate private
stacks at the main prog or callback entry prog. Subprogs that are not main
or callback progs will increment the inherited stack pointer to form their
frame pointer.
- Private stack allows each prog to have a max stack size of 512 bytes,
instead of the whole prog hierarchy being limited to 512 bytes.
- Add some tests.
Yonghong Song (9):
bpf: Allow each subprog having stack size of 512 bytes
bpf: Rename bpf_struct_ops_arg_info to bpf_struct_ops_func_info
bpf: Support private stack for struct ops programs
bpf: Mark each subprog with proper private stack modes
bpf, x86: Refactor func emit_prologue
bpf, x86: Create a helper for certain "reg <op>= imm" operations
bpf, x86: Add jit support for private stack
selftests/bpf: Add tracing prog private stack tests
selftests/bpf: Add struct_ops prog private stack tests
arch/x86/net/bpf_jit_comp.c | 187 +++++++++++----
include/linux/bpf.h | 16 +-
include/linux/bpf_verifier.h | 4 +
include/linux/filter.h | 1 +
kernel/bpf/bpf_struct_ops.c | 71 +++---
kernel/bpf/core.c | 24 ++
kernel/bpf/verifier.c | 139 +++++++++--
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 77 +++++++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 6 +
.../bpf/prog_tests/struct_ops_private_stack.c | 106 +++++++++
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../bpf/progs/struct_ops_private_stack.c | 62 +++++
.../bpf/progs/struct_ops_private_stack_fail.c | 62 +++++
.../progs/struct_ops_private_stack_recur.c | 50 ++++
.../bpf/progs/verifier_private_stack.c | 216 ++++++++++++++++++
15 files changed, 933 insertions(+), 90 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c
--
2.43.5
* [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
@ 2024-10-20 19:13 ` Yonghong Song
2024-10-22 1:18 ` Alexei Starovoitov
2024-10-20 19:13 ` [PATCH bpf-next v6 2/9] bpf: Rename bpf_struct_ops_arg_info to bpf_struct_ops_func_info Yonghong Song
` (7 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:13 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
With private stack support, each subprog can have a stack of up to 512
bytes. The limit of 512 bytes per subprog is kept to avoid increasing
verifier complexity, since going beyond 512 bytes would require a big
verifier change and would increase memory consumption and verification time.
If private stack is supported, for a bpf prog, especially one with
subprogs, a private stack will be allocated for the main prog
and for each callback subprog. For example,
main_prog
subprog1
calling helper
subprog10 (callback func)
subprog11
subprog2
calling helper
subprog10 (callback func)
subprog11
Separate private stack allocations for main_prog and the callback_fn
subprog10 will make things easier since the helper function uses the
kernel stack.
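A minimal sketch of this shape (hypothetical names; bpf_loop stands in for
"calling helper" above):

static long callback_fn(u64 index, void *ctx)
{
	/* callback entry: a subtree root with its own private stack */
	return 0;
}

SEC("kprobe")
int main_prog(void *ctx)
{
	/* the helper itself runs on the kernel stack */
	bpf_loop(16, callback_fn, NULL, 0);
	return 0;
}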
In this patch, some tracing programs are allowed to use the private
stack since a tracing prog may be triggered in the middle of some other
prog's run. Additional subprog info is also collected so that private
stacks can later be allocated for the main prog and each callback function.
Note that if any tail_call is used in the prog (including all subprogs),
then the private stack is not used.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 1 +
include/linux/bpf_verifier.h | 3 ++
include/linux/filter.h | 1 +
kernel/bpf/core.c | 5 ++
kernel/bpf/verifier.c | 100 ++++++++++++++++++++++++++++++-----
5 files changed, 97 insertions(+), 13 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0c216e71cec7..6ad8ace7075a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1490,6 +1490,7 @@ struct bpf_prog_aux {
bool exception_cb;
bool exception_boundary;
bool is_extended; /* true if extended by freplace program */
+ bool priv_stack_eligible;
u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
struct bpf_arena *arena;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 4513372c5bc8..bcfe868e3801 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -659,6 +659,8 @@ struct bpf_subprog_info {
* are used for bpf_fastcall spills and fills.
*/
s16 fastcall_stack_off;
+ u16 subtree_stack_depth;
+ u16 subtree_top_idx;
bool has_tail_call: 1;
bool tail_call_reachable: 1;
bool has_ld_abs: 1;
@@ -668,6 +670,7 @@ struct bpf_subprog_info {
bool args_cached: 1;
/* true if bpf_fastcall stack region is used by functions that can't be inlined */
bool keep_fastcall_stack: 1;
+ bool priv_stack_eligible: 1;
u8 arg_cnt;
struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d7578a8eac1..3a21947f2fd4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
bool bpf_jit_supports_ptr_xchg(void);
bool bpf_jit_supports_arena(void);
bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
+bool bpf_jit_supports_private_stack(void);
u64 bpf_arch_uaddress_limit(void);
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
bool bpf_helper_changes_pkt_data(void *func);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 233ea78f8f1b..14d9288441f2 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3045,6 +3045,11 @@ bool __weak bpf_jit_supports_exceptions(void)
return false;
}
+bool __weak bpf_jit_supports_private_stack(void)
+{
+ return false;
+}
+
void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
{
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f514247ba8ba..45bea4066272 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -194,6 +194,8 @@ struct bpf_verifier_stack_elem {
#define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512
+#define BPF_PRIV_STACK_MIN_SUBTREE_SIZE 128
+
static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
@@ -5982,6 +5984,41 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
strict);
}
+static bool bpf_enable_private_stack(struct bpf_verifier_env *env)
+{
+ if (!bpf_jit_supports_private_stack())
+ return false;
+
+ switch (env->prog->type) {
+ case BPF_PROG_TYPE_KPROBE:
+ case BPF_PROG_TYPE_TRACEPOINT:
+ case BPF_PROG_TYPE_PERF_EVENT:
+ case BPF_PROG_TYPE_RAW_TRACEPOINT:
+ return true;
+ case BPF_PROG_TYPE_TRACING:
+ if (env->prog->expected_attach_type != BPF_TRACE_ITER)
+ return true;
+ fallthrough;
+ default:
+ return false;
+ }
+}
+
+static bool is_priv_stack_supported(struct bpf_verifier_env *env)
+{
+ struct bpf_subprog_info *si = env->subprog_info;
+ bool has_tail_call = false;
+
+ for (int i = 0; i < env->subprog_cnt; i++) {
+ if (si[i].has_tail_call) {
+ has_tail_call = true;
+ break;
+ }
+ }
+
+ return !has_tail_call && bpf_enable_private_stack(env);
+}
+
static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
{
if (env->prog->jit_requested)
@@ -5999,16 +6036,21 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
* Since recursion is prevented by check_cfg() this algorithm
* only needs a local stack of MAX_CALL_FRAMES to remember callsites
*/
-static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
+static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
+ bool check_priv_stack, bool priv_stack_supported)
{
struct bpf_subprog_info *subprog = env->subprog_info;
struct bpf_insn *insn = env->prog->insnsi;
int depth = 0, frame = 0, i, subprog_end;
bool tail_call_reachable = false;
+ bool priv_stack_eligible = false;
int ret_insn[MAX_CALL_FRAMES];
int ret_prog[MAX_CALL_FRAMES];
- int j;
+ int j, subprog_stack_depth;
+ int orig_idx = idx;
+ if (check_priv_stack)
+ subprog[idx].subtree_top_idx = idx;
i = subprog[idx].start;
process_func:
/* protect against potential stack overflow that might happen when
@@ -6030,18 +6072,33 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
* tailcall will unwind the current stack frame but it will not get rid
* of caller's stack as shown on the example above.
*/
- if (idx && subprog[idx].has_tail_call && depth >= 256) {
+ if (!check_priv_stack && idx && subprog[idx].has_tail_call && depth >= 256) {
verbose(env,
"tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
depth);
return -EACCES;
}
- depth += round_up_stack_depth(env, subprog[idx].stack_depth);
- if (depth > MAX_BPF_STACK) {
+ subprog_stack_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
+ depth += subprog_stack_depth;
+ if (!check_priv_stack && !priv_stack_supported && depth > MAX_BPF_STACK) {
verbose(env, "combined stack size of %d calls is %d. Too large\n",
frame + 1, depth);
return -EACCES;
}
+ if (check_priv_stack) {
+ if (subprog_stack_depth > MAX_BPF_STACK) {
+ verbose(env, "stack size of subprog %d is %d. Too large\n",
+ idx, subprog_stack_depth);
+ return -EACCES;
+ }
+
+ if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
+ subprog[orig_idx].priv_stack_eligible = true;
+ env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
+ }
+ subprog[orig_idx].subtree_stack_depth =
+ max_t(u16, subprog[orig_idx].subtree_stack_depth, depth);
+ }
continue_func:
subprog_end = subprog[idx + 1].start;
for (; i < subprog_end; i++) {
@@ -6078,6 +6135,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
next_insn = i + insn[i].imm + 1;
sidx = find_subprog(env, next_insn);
if (sidx < 0) {
+ /* It is possible that callback func has been removed as dead code after
+ * instruction rewrites, e.g. bpf_loop with cnt 0.
+ */
+ if (check_priv_stack)
+ continue;
+
WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
next_insn);
return -EFAULT;
@@ -6097,8 +6160,10 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
}
i = next_insn;
idx = sidx;
+ if (check_priv_stack)
+ subprog[idx].subtree_top_idx = orig_idx;
- if (subprog[idx].has_tail_call)
+ if (!check_priv_stack && subprog[idx].has_tail_call)
tail_call_reachable = true;
frame++;
@@ -6122,7 +6187,7 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
}
subprog[ret_prog[j]].tail_call_reachable = true;
}
- if (subprog[0].tail_call_reachable)
+ if (!check_priv_stack && subprog[0].tail_call_reachable)
env->prog->aux->tail_call_reachable = true;
/* end of for() loop means the last insn of the 'subprog'
@@ -6137,14 +6202,18 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
goto continue_func;
}
-static int check_max_stack_depth(struct bpf_verifier_env *env)
+static int check_max_stack_depth(struct bpf_verifier_env *env, bool check_priv_stack,
+ bool priv_stack_supported)
{
struct bpf_subprog_info *si = env->subprog_info;
+ bool check_subprog;
int ret;
for (int i = 0; i < env->subprog_cnt; i++) {
- if (!i || si[i].is_async_cb) {
- ret = check_max_stack_depth_subprog(env, i);
+ check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
+ if (check_subprog) {
+ ret = check_max_stack_depth_subprog(env, i, check_priv_stack,
+ priv_stack_supported);
if (ret < 0)
return ret;
}
@@ -22303,7 +22372,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
struct bpf_verifier_env *env;
int i, len, ret = -EINVAL, err;
u32 log_true_size;
- bool is_priv;
+ bool is_priv, priv_stack_supported = false;
/* no program is valid */
if (ARRAY_SIZE(bpf_verifier_ops) == 0)
@@ -22430,8 +22499,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
if (ret == 0)
ret = remove_fastcall_spills_fills(env);
- if (ret == 0)
- ret = check_max_stack_depth(env);
+ if (ret == 0) {
+ priv_stack_supported = is_priv_stack_supported(env);
+ ret = check_max_stack_depth(env, false, priv_stack_supported);
+ }
/* instruction rewrites happen after this point */
if (ret == 0)
@@ -22465,6 +22536,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
: false;
}
+ if (ret == 0 && priv_stack_supported)
+ ret = check_max_stack_depth(env, true, true);
+
if (ret == 0)
ret = fixup_call_args(env);
--
2.43.5
* [PATCH bpf-next v6 2/9] bpf: Rename bpf_struct_ops_arg_info to bpf_struct_ops_func_info
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
@ 2024-10-20 19:13 ` Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs Yonghong Song
` (6 subsequent siblings)
8 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:13 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
In the subsequent patch, some non-argument information will be added to
struct bpf_struct_ops_arg_info. So let us rename the struct to
bpf_struct_ops_func_info. No functionality change.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 4 ++--
kernel/bpf/bpf_struct_ops.c | 36 ++++++++++++++++++------------------
kernel/bpf/verifier.c | 4 ++--
3 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6ad8ace7075a..f3884ce2603d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1773,7 +1773,7 @@ struct bpf_struct_ops {
* btf_ctx_access() will lookup prog->aux->ctx_arg_info to find the
* corresponding entry for an given argument.
*/
-struct bpf_struct_ops_arg_info {
+struct bpf_struct_ops_func_info {
struct bpf_ctx_arg_aux *info;
u32 cnt;
};
@@ -1787,7 +1787,7 @@ struct bpf_struct_ops_desc {
u32 value_id;
/* Collection of argument information for each member */
- struct bpf_struct_ops_arg_info *arg_info;
+ struct bpf_struct_ops_func_info *func_info;
};
enum bpf_struct_ops_state {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index fda3dd2ee984..8279b5a57798 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -182,11 +182,11 @@ find_stub_func_proto(const struct btf *btf, const char *st_op_name,
/* Prepare argument info for every nullable argument of a member of a
* struct_ops type.
*
- * Initialize a struct bpf_struct_ops_arg_info according to type info of
+ * Initialize a struct bpf_struct_ops_func_info according to type info of
* the arguments of a stub function. (Check kCFI for more information about
* stub functions.)
*
- * Each member in the struct_ops type has a struct bpf_struct_ops_arg_info
+ * Each member in the struct_ops type has a struct bpf_struct_ops_func_info
* to provide an array of struct bpf_ctx_arg_aux, which in turn provides
* the information that used by the verifier to check the arguments of the
* BPF struct_ops program assigned to the member. Here, we only care about
@@ -196,14 +196,14 @@ find_stub_func_proto(const struct btf *btf, const char *st_op_name,
* prog->aux->ctx_arg_info of BPF struct_ops programs and passed to the
* verifier. (See check_struct_ops_btf_id())
*
- * arg_info->info will be the list of struct bpf_ctx_arg_aux if success. If
+ * func_info->info will be the list of struct bpf_ctx_arg_aux if success. If
* fails, it will be kept untouched.
*/
-static int prepare_arg_info(struct btf *btf,
+static int prepare_func_info(struct btf *btf,
const char *st_ops_name,
const char *member_name,
const struct btf_type *func_proto,
- struct bpf_struct_ops_arg_info *arg_info)
+ struct bpf_struct_ops_func_info *func_info)
{
const struct btf_type *stub_func_proto, *pointed_type;
const struct btf_param *stub_args, *args;
@@ -282,8 +282,8 @@ static int prepare_arg_info(struct btf *btf,
}
if (info_cnt) {
- arg_info->info = info_buf;
- arg_info->cnt = info_cnt;
+ func_info->info = info_buf;
+ func_info->cnt = info_cnt;
} else {
kfree(info_buf);
}
@@ -296,17 +296,17 @@ static int prepare_arg_info(struct btf *btf,
return -EINVAL;
}
-/* Clean up the arg_info in a struct bpf_struct_ops_desc. */
+/* Clean up the func_info in a struct bpf_struct_ops_desc. */
void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_ops_desc)
{
- struct bpf_struct_ops_arg_info *arg_info;
+ struct bpf_struct_ops_func_info *func_info;
int i;
- arg_info = st_ops_desc->arg_info;
+ func_info = st_ops_desc->func_info;
for (i = 0; i < btf_type_vlen(st_ops_desc->type); i++)
- kfree(arg_info[i].info);
+ kfree(func_info[i].info);
- kfree(arg_info);
+ kfree(func_info);
}
int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
@@ -314,7 +314,7 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
struct bpf_verifier_log *log)
{
struct bpf_struct_ops *st_ops = st_ops_desc->st_ops;
- struct bpf_struct_ops_arg_info *arg_info;
+ struct bpf_struct_ops_func_info *func_info;
const struct btf_member *member;
const struct btf_type *t;
s32 type_id, value_id;
@@ -359,12 +359,12 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
if (!is_valid_value_type(btf, value_id, t, value_name))
return -EINVAL;
- arg_info = kcalloc(btf_type_vlen(t), sizeof(*arg_info),
+ func_info = kcalloc(btf_type_vlen(t), sizeof(*func_info),
GFP_KERNEL);
- if (!arg_info)
+ if (!func_info)
return -ENOMEM;
- st_ops_desc->arg_info = arg_info;
+ st_ops_desc->func_info = func_info;
st_ops_desc->type = t;
st_ops_desc->type_id = type_id;
st_ops_desc->value_id = value_id;
@@ -403,9 +403,9 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
goto errout;
}
- err = prepare_arg_info(btf, st_ops->name, mname,
+ err = prepare_func_info(btf, st_ops->name, mname,
func_proto,
- arg_info + i);
+ func_info + i);
if (err)
goto errout;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 45bea4066272..ccfe159cfbde 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -21880,9 +21880,9 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
/* btf_ctx_access() used this to provide argument type info */
prog->aux->ctx_arg_info =
- st_ops_desc->arg_info[member_idx].info;
+ st_ops_desc->func_info[member_idx].info;
prog->aux->ctx_arg_info_size =
- st_ops_desc->arg_info[member_idx].cnt;
+ st_ops_desc->func_info[member_idx].cnt;
prog->aux->attach_func_proto = func_proto;
prog->aux->attach_func_name = mname;
--
2.43.5
* [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 2/9] bpf: Rename bpf_struct_ops_arg_info to bpf_struct_ops_func_info Yonghong Song
@ 2024-10-20 19:13 ` Yonghong Song
2024-10-22 1:34 ` Alexei Starovoitov
2024-10-20 19:14 ` [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes Yonghong Song
` (5 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:13 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
To identify whether a st_ops program requests a private stack or not,
the st_ops stub function is checked. If the stub function has the
following name
<st_ops_name>__<member_name>__priv_stack
then the corresponding st_ops member func requests to use a private
stack. Whether or not the private stack is requested is encoded in
struct bpf_struct_ops_func_info, which will later be used by the
verifier.
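For example, with the hypothetical "foo_ops"/"bar" names used in the
comment in the diff below, the kernel-side stub opting the member into
the private stack would be declared as:

/* the "__priv_stack" suffix alone requests private stack for "bar" progs */
static int foo_ops__bar__priv_stack(void)
{
	return 0;
}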
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 2 ++
kernel/bpf/bpf_struct_ops.c | 35 +++++++++++++++++++++++++----------
kernel/bpf/verifier.c | 8 +++++++-
3 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f3884ce2603d..376e43fc72b9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1491,6 +1491,7 @@ struct bpf_prog_aux {
bool exception_boundary;
bool is_extended; /* true if extended by freplace program */
bool priv_stack_eligible;
+ bool priv_stack_always;
u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
struct bpf_arena *arena;
@@ -1776,6 +1777,7 @@ struct bpf_struct_ops {
struct bpf_struct_ops_func_info {
struct bpf_ctx_arg_aux *info;
u32 cnt;
+ bool priv_stack_always;
};
struct bpf_struct_ops_desc {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 8279b5a57798..2cd4bd086c7a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -145,33 +145,44 @@ void bpf_struct_ops_image_free(void *image)
}
#define MAYBE_NULL_SUFFIX "__nullable"
-#define MAX_STUB_NAME 128
+#define MAX_STUB_NAME 140
/* Return the type info of a stub function, if it exists.
*
- * The name of a stub function is made up of the name of the struct_ops and
- * the name of the function pointer member, separated by "__". For example,
- * if the struct_ops type is named "foo_ops" and the function pointer
- * member is named "bar", the stub function name would be "foo_ops__bar".
+ * The name of a stub function is made up of the name of the struct_ops,
+ * the name of the function pointer member and optionally "priv_stack"
+ * suffix, separated by "__". For example, if the struct_ops type is named
+ * "foo_ops" and the function pointer member is named "bar", the stub
+ * function name would be "foo_ops__bar". If a suffix "priv_stack" exists,
+ * the stub function name would be "foo_ops__bar__priv_stack".
*/
static const struct btf_type *
find_stub_func_proto(const struct btf *btf, const char *st_op_name,
- const char *member_name)
+ const char *member_name, bool *priv_stack_always)
{
char stub_func_name[MAX_STUB_NAME];
const struct btf_type *func_type;
s32 btf_id;
int cp;
- cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
+ cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s__priv_stack",
st_op_name, member_name);
if (cp >= MAX_STUB_NAME) {
pr_warn("Stub function name too long\n");
return NULL;
}
+
btf_id = btf_find_by_name_kind(btf, stub_func_name, BTF_KIND_FUNC);
- if (btf_id < 0)
- return NULL;
+ if (btf_id >= 0) {
+ *priv_stack_always = true;
+ } else {
+ cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
+ st_op_name, member_name);
+ btf_id = btf_find_by_name_kind(btf, stub_func_name, BTF_KIND_FUNC);
+ if (btf_id < 0)
+ return NULL;
+ }
+
func_type = btf_type_by_id(btf, btf_id);
if (!func_type)
return NULL;
@@ -209,10 +220,12 @@ static int prepare_func_info(struct btf *btf,
const struct btf_param *stub_args, *args;
struct bpf_ctx_arg_aux *info, *info_buf;
u32 nargs, arg_no, info_cnt = 0;
+ bool priv_stack_always = false;
u32 arg_btf_id;
int offset;
- stub_func_proto = find_stub_func_proto(btf, st_ops_name, member_name);
+ stub_func_proto = find_stub_func_proto(btf, st_ops_name, member_name,
+ &priv_stack_always);
if (!stub_func_proto)
return 0;
@@ -226,6 +239,8 @@ static int prepare_func_info(struct btf *btf,
return -EINVAL;
}
+ func_info->priv_stack_always = priv_stack_always;
+
if (!nargs)
return 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ccfe159cfbde..25283ee6f86f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5995,6 +5995,8 @@ static bool bpf_enable_private_stack(struct bpf_verifier_env *env)
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_RAW_TRACEPOINT:
return true;
+ case BPF_PROG_TYPE_STRUCT_OPS:
+ return env->prog->aux->priv_stack_always;
case BPF_PROG_TYPE_TRACING:
if (env->prog->expected_attach_type != BPF_TRACE_ITER)
return true;
@@ -6092,7 +6094,9 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
return -EACCES;
}
- if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
+ if (!priv_stack_eligible &&
+ (env->prog->aux->priv_stack_always ||
+ depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE)) {
subprog[orig_idx].priv_stack_eligible = true;
env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
}
@@ -21883,6 +21887,8 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
st_ops_desc->func_info[member_idx].info;
prog->aux->ctx_arg_info_size =
st_ops_desc->func_info[member_idx].cnt;
+ prog->aux->priv_stack_always =
+ st_ops_desc->func_info[member_idx].priv_stack_always;
prog->aux->attach_func_proto = func_proto;
prog->aux->attach_func_name = mname;
--
2.43.5
* [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (2 preceding siblings ...)
2024-10-20 19:13 ` [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
2024-10-20 22:01 ` Jiri Olsa
2024-10-20 19:14 ` [PATCH bpf-next v6 5/9] bpf, x86: Refactor func emit_prologue Yonghong Song
` (4 subsequent siblings)
8 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Three private stack modes are used to direct the jit action:
NO_PRIV_STACK: do not use private stack
PRIV_STACK_SUB_PROG: adjust frame pointer address (similar to normal stack)
PRIV_STACK_ROOT_PROG: set the frame pointer
Note that for a subtree root prog (main prog or callback fn), even if the
bpf_prog stack size is 0, PRIV_STACK_ROOT_PROG mode is still used.
This is for bpf exception handling. More details can be found in the
subsequent jit support and selftest patches.
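For example, for the call tree in patch 1 (a sketch), the modes would be
assigned as follows:
main_prog              <== PRIV_STACK_ROOT_PROG (allocates subtree stack)
  subprog1             <== PRIV_STACK_SUB_PROG (bumps inherited frame ptr)
  subprog10 (callback) <== PRIV_STACK_ROOT_PROG (its own subtree allocation)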
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 9 +++++++++
kernel/bpf/core.c | 19 +++++++++++++++++++
kernel/bpf/verifier.c | 29 +++++++++++++++++++++++++++++
3 files changed, 57 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 376e43fc72b9..27430e9dcfe3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1456,6 +1456,12 @@ struct btf_mod_pair {
struct bpf_kfunc_desc_tab;
+enum bpf_priv_stack_mode {
+ NO_PRIV_STACK,
+ PRIV_STACK_SUB_PROG,
+ PRIV_STACK_ROOT_PROG,
+};
+
struct bpf_prog_aux {
atomic64_t refcnt;
u32 used_map_cnt;
@@ -1472,6 +1478,9 @@ struct bpf_prog_aux {
u32 ctx_arg_info_size;
u32 max_rdonly_access;
u32 max_rdwr_access;
+ enum bpf_priv_stack_mode priv_stack_mode;
+ u16 subtree_stack_depth; /* Subtree stack depth if PRIV_STACK_ROOT_PROG, 0 otherwise */
+ void __percpu *priv_stack_ptr;
struct btf *attach_btf;
const struct bpf_ctx_arg_aux *ctx_arg_info;
struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 14d9288441f2..aee0055def4f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1240,6 +1240,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
bpf_jit_binary_free(hdr);
+ free_percpu(fp->aux->priv_stack_ptr);
WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
}
@@ -2421,6 +2422,24 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
if (*err)
return fp;
+ if (fp->aux->priv_stack_eligible) {
+ if (!fp->aux->stack_depth) {
+ fp->aux->priv_stack_mode = NO_PRIV_STACK;
+ } else {
+ void __percpu *priv_stack_ptr;
+
+ fp->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+ priv_stack_ptr =
+ __alloc_percpu_gfp(fp->aux->stack_depth, 8, GFP_KERNEL);
+ if (!priv_stack_ptr) {
+ *err = -ENOMEM;
+ return fp;
+ }
+ fp->aux->subtree_stack_depth = fp->aux->stack_depth;
+ fp->aux->priv_stack_ptr = priv_stack_ptr;
+ }
+ }
+
fp = bpf_int_jit_compile(fp);
bpf_prog_jit_attempt_done(fp);
if (!fp->jited && jit_needed) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 25283ee6f86f..f770015d6ad1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20018,6 +20018,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
{
struct bpf_prog *prog = env->prog, **func, *tmp;
int i, j, subprog_start, subprog_end = 0, len, subprog;
+ int subtree_top_idx, subtree_stack_depth;
+ void __percpu *priv_stack_ptr;
struct bpf_map *map_ptr;
struct bpf_insn *insn;
void *old_bpf_func;
@@ -20096,6 +20098,33 @@ static int jit_subprogs(struct bpf_verifier_env *env)
func[i]->is_func = 1;
func[i]->sleepable = prog->sleepable;
func[i]->aux->func_idx = i;
+
+ subtree_top_idx = env->subprog_info[i].subtree_top_idx;
+ if (env->subprog_info[subtree_top_idx].priv_stack_eligible) {
+ if (subtree_top_idx == i)
+ func[i]->aux->subtree_stack_depth =
+ env->subprog_info[i].subtree_stack_depth;
+
+ subtree_stack_depth = func[i]->aux->subtree_stack_depth;
+ if (subtree_top_idx != i) {
+ if (env->subprog_info[subtree_top_idx].subtree_stack_depth)
+ func[i]->aux->priv_stack_mode = PRIV_STACK_SUB_PROG;
+ else
+ func[i]->aux->priv_stack_mode = NO_PRIV_STACK;
+ } else if (!subtree_stack_depth) {
+ func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+ } else {
+ func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+ priv_stack_ptr =
+ __alloc_percpu_gfp(subtree_stack_depth, 8, GFP_KERNEL);
+ if (!priv_stack_ptr) {
+ err = -ENOMEM;
+ goto out_free;
+ }
+ func[i]->aux->priv_stack_ptr = priv_stack_ptr;
+ }
+ }
+
/* Below members will be freed only at prog->aux */
func[i]->aux->btf = prog->aux->btf;
func[i]->aux->func_info = prog->aux->func_info;
--
2.43.5
* [PATCH bpf-next v6 5/9] bpf, x86: Refactor func emit_prologue
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (3 preceding siblings ...)
2024-10-20 19:14 ` [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 6/9] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
` (3 subsequent siblings)
8 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Refactor function emit_prologue() such that it takes bpf_prog as one of its
arguments. This reduces the total number of arguments, since more
arguments will be added to this function later on.
Also add a variable 'stack_depth' to hold the value of
bpf_prog->aux->stack_depth
to simplify the code.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 06b080b61aa5..6d24389e58a1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -489,10 +489,12 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
* bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
* while jumping to another program
*/
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
- bool tail_call_reachable, bool is_subprog,
- bool is_exception_cb)
+static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog,
+ bool tail_call_reachable)
{
+ bool ebpf_from_cbpf = bpf_prog_was_classic(bpf_prog);
+ bool is_exception_cb = bpf_prog->aux->exception_cb;
+ bool is_subprog = bpf_is_subprog(bpf_prog);
u8 *prog = *pprog;
emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
@@ -1424,17 +1426,18 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u64 arena_vm_start, user_vm_start;
int i, excnt = 0;
int ilen, proglen = 0;
+ u32 stack_depth;
u8 *prog = temp;
int err;
+ stack_depth = bpf_prog->aux->stack_depth;
+
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
detect_reg_usage(insn, insn_cnt, callee_regs_used);
- emit_prologue(&prog, bpf_prog->aux->stack_depth,
- bpf_prog_was_classic(bpf_prog), tail_call_reachable,
- bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+ emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable);
/* Exception callback will clobber callee regs for its own use, and
* restore the original callee regs from main prog's stack frame.
*/
@@ -2128,7 +2131,7 @@ st: if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
if (tail_call_reachable) {
- LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+ LOAD_TAIL_CALL_CNT_PTR(stack_depth);
ip += 7;
}
if (!imm32)
@@ -2145,13 +2148,13 @@ st: if (is_imm8(insn->off))
&bpf_prog->aux->poke_tab[imm32 - 1],
&prog, image + addrs[i - 1],
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
ctx);
else
emit_bpf_tail_call_indirect(bpf_prog,
&prog,
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
image + addrs[i - 1],
ctx);
break;
--
2.43.5
* [PATCH bpf-next v6 6/9] bpf, x86: Create a helper for certain "reg <op>= imm" operations
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (4 preceding siblings ...)
2024-10-20 19:14 ` [PATCH bpf-next v6 5/9] bpf, x86: Refactor func emit_prologue Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 7/9] bpf, x86: Add jit support for private stack Yonghong Song
` (2 subsequent siblings)
8 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Create a helper to generate jited code for certain "reg <op>= imm"
operations, where the operation is one of add/sub/and/or/xor. This helper
will be used in the subsequent patch.
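For example, the private stack jit in the subsequent patch uses it to
advance the frame pointer in a subprog:

	/* r9 += orig_stack_depth */
	emit_alu_imm(&prog, BPF_ALU64 | BPF_ADD | BPF_K, X86_REG_R9,
		     orig_stack_depth);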
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 82 +++++++++++++++++++++----------------
1 file changed, 46 insertions(+), 36 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 6d24389e58a1..6be8c739c3c2 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1406,6 +1406,51 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+/* emit ADD/SUB/AND/OR/XOR 'reg <op>= imm' operations */
+static void emit_alu_imm(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
+{
+ u8 b2 = 0, b3 = 0;
+ u8 *prog = *pprog;
+
+ maybe_emit_1mod(&prog, dst_reg, BPF_CLASS(insn_code) == BPF_ALU64);
+
+ /*
+ * b3 holds 'normal' opcode, b2 short form only valid
+ * in case dst is eax/rax.
+ */
+ switch (BPF_OP(insn_code)) {
+ case BPF_ADD:
+ b3 = 0xC0;
+ b2 = 0x05;
+ break;
+ case BPF_SUB:
+ b3 = 0xE8;
+ b2 = 0x2D;
+ break;
+ case BPF_AND:
+ b3 = 0xE0;
+ b2 = 0x25;
+ break;
+ case BPF_OR:
+ b3 = 0xC8;
+ b2 = 0x0D;
+ break;
+ case BPF_XOR:
+ b3 = 0xF0;
+ b2 = 0x35;
+ break;
+ }
+
+ if (is_imm8(imm32))
+ EMIT3(0x83, add_1reg(b3, dst_reg), imm32);
+ else if (is_axreg(dst_reg))
+ EMIT1_off32(b2, imm32);
+ else
+ EMIT2_off32(0x81, add_1reg(b3, dst_reg), imm32);
+
+ *pprog = prog;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
#define __LOAD_TCC_PTR(off) \
@@ -1567,42 +1612,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
case BPF_ALU64 | BPF_AND | BPF_K:
case BPF_ALU64 | BPF_OR | BPF_K:
case BPF_ALU64 | BPF_XOR | BPF_K:
- maybe_emit_1mod(&prog, dst_reg,
- BPF_CLASS(insn->code) == BPF_ALU64);
-
- /*
- * b3 holds 'normal' opcode, b2 short form only valid
- * in case dst is eax/rax.
- */
- switch (BPF_OP(insn->code)) {
- case BPF_ADD:
- b3 = 0xC0;
- b2 = 0x05;
- break;
- case BPF_SUB:
- b3 = 0xE8;
- b2 = 0x2D;
- break;
- case BPF_AND:
- b3 = 0xE0;
- b2 = 0x25;
- break;
- case BPF_OR:
- b3 = 0xC8;
- b2 = 0x0D;
- break;
- case BPF_XOR:
- b3 = 0xF0;
- b2 = 0x35;
- break;
- }
-
- if (is_imm8(imm32))
- EMIT3(0x83, add_1reg(b3, dst_reg), imm32);
- else if (is_axreg(dst_reg))
- EMIT1_off32(b2, imm32);
- else
- EMIT2_off32(0x81, add_1reg(b3, dst_reg), imm32);
+ emit_alu_imm(&prog, insn->code, dst_reg, imm32);
break;
case BPF_ALU64 | BPF_MOV | BPF_K:
--
2.43.5
* [PATCH bpf-next v6 7/9] bpf, x86: Add jit support for private stack
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (5 preceding siblings ...)
2024-10-20 19:14 ` [PATCH bpf-next v6 6/9] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 9/9] selftests/bpf: Add struct_ops " Yonghong Song
8 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Add jit support for private stack. For a particular subtree, e.g.,
subtree_root <== stack depth 120
subprog1 <== stack depth 80
subprog2 <== stack depth 40
subprog3 <== stack depth 160
Let us say that priv_stack_ptr is the memory address allocated for the
private stack. The frame pointer for each of the above is calculated as
below:
subtree_root <== subtree_root_fp = private_stack_ptr + 120
subprog1 <== subtree_subprog1_fp = subtree_root_fp + 80
subprog2 <== subtree_subprog2_fp = subtree_subprog1_fp + 40
subprog3 <== subtree_subprog3_fp = subtree_root_fp + 160
For any call to a helper/kfunc, a push/pop of the prog frame pointer
is needed in order to preserve the frame pointer value.
To deal with exception handling, push/pop of the frame pointer is also
used around the call to a subsequent subprog. For example,
subtree_root
subprog1
...
insn: call bpf_throw
...
After jit, we will have
subtree_root
insn: push r9
subprog1
...
insn: push r9
insn: call bpf_throw
insn: pop r9
...
insn: pop r9
exception_handler
pop r9
...
where r9 represents the fp for each subprog.
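In C terms, the frame pointer setup emitted for a subtree root is roughly
(a sketch of what emit_root_priv_frame_ptr() in the diff below does):

	/* r9 = this cpu's private stack top for the subtree root */
	void *fp = this_cpu_ptr(bpf_prog->aux->priv_stack_ptr) +
		   round_up(bpf_prog->aux->stack_depth, 8);
	/* BPF_REG_FP (r10) uses in the body are then rewritten to r9 */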
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 88 +++++++++++++++++++++++++++++++++++-
include/linux/bpf_verifier.h | 1 +
2 files changed, 87 insertions(+), 2 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 6be8c739c3c2..86ebca32befc 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -325,6 +325,22 @@ struct jit_context {
/* Number of bytes that will be skipped on tailcall */
#define X86_TAIL_CALL_OFFSET (12 + ENDBR_INSN_SIZE)
+static void push_r9(u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ EMIT2(0x41, 0x51); /* push r9 */
+ *pprog = prog;
+}
+
+static void pop_r9(u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ EMIT2(0x41, 0x59); /* pop r9 */
+ *pprog = prog;
+}
+
static void push_r12(u8 **pprog)
{
u8 *prog = *pprog;
@@ -484,13 +500,17 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
*pprog = prog;
}
+static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+ enum bpf_priv_stack_mode priv_stack_mode);
+
/*
* Emit x86-64 prologue code for BPF program.
* bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
* while jumping to another program
*/
static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog,
- bool tail_call_reachable)
+ bool tail_call_reachable,
+ enum bpf_priv_stack_mode priv_stack_mode)
{
bool ebpf_from_cbpf = bpf_prog_was_classic(bpf_prog);
bool is_exception_cb = bpf_prog->aux->exception_cb;
@@ -520,6 +540,8 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog
* first restore those callee-saved regs from stack, before
* reusing the stack frame.
*/
+ if (priv_stack_mode != NO_PRIV_STACK)
+ pop_r9(&prog);
pop_callee_regs(&prog, all_callee_regs_used);
pop_r12(&prog);
/* Reset the stack frame. */
@@ -532,6 +554,8 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog
/* X86_TAIL_CALL_OFFSET is here */
EMIT_ENDBR();
+ emit_priv_frame_ptr(&prog, bpf_prog, priv_stack_mode);
+
/* sub rsp, rounded_stack_depth */
if (stack_depth)
EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
@@ -1451,6 +1475,42 @@ static void emit_alu_imm(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
*pprog = prog;
}
+static void emit_root_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+ u32 orig_stack_depth)
+{
+ void __percpu *priv_frame_ptr;
+ u8 *prog = *pprog;
+
+ priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + orig_stack_depth;
+
+ /* movabs r9, priv_frame_ptr */
+ emit_mov_imm64(&prog, X86_REG_R9, (long) priv_frame_ptr >> 32,
+ (u32) (long) priv_frame_ptr);
+#ifdef CONFIG_SMP
+ /* add <r9>, gs:[<off>] */
+ EMIT2(0x65, 0x4c);
+ EMIT3(0x03, 0x0c, 0x25);
+ EMIT((u32)(unsigned long)&this_cpu_off, 4);
+#endif
+ *pprog = prog;
+}
+
+static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+ enum bpf_priv_stack_mode priv_stack_mode)
+{
+ u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
+ u8 *prog = *pprog;
+
+ if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
+ emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
+ else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
+ /* r9 += orig_stack_depth */
+ emit_alu_imm(&prog, BPF_ALU64 | BPF_ADD | BPF_K, X86_REG_R9,
+ orig_stack_depth);
+
+ *pprog = prog;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
#define __LOAD_TCC_PTR(off) \
@@ -1464,6 +1524,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
{
bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
struct bpf_insn *insn = bpf_prog->insnsi;
+ enum bpf_priv_stack_mode priv_stack_mode;
bool callee_regs_used[4] = {};
int insn_cnt = bpf_prog->len;
bool seen_exit = false;
@@ -1476,13 +1537,17 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int err;
stack_depth = bpf_prog->aux->stack_depth;
+ priv_stack_mode = bpf_prog->aux->priv_stack_mode;
+ if (priv_stack_mode != NO_PRIV_STACK)
+ stack_depth = 0;
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
detect_reg_usage(insn, insn_cnt, callee_regs_used);
- emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable);
+ emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable,
+ priv_stack_mode);
/* Exception callback will clobber callee regs for its own use, and
* restore the original callee regs from main prog's stack frame.
*/
@@ -1521,6 +1586,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 *func;
int nops;
+ if (priv_stack_mode != NO_PRIV_STACK) {
+ if (src_reg == BPF_REG_FP)
+ src_reg = X86_REG_R9;
+
+ if (dst_reg == BPF_REG_FP)
+ dst_reg = X86_REG_R9;
+ }
+
switch (insn->code) {
/* ALU */
case BPF_ALU | BPF_ADD | BPF_X:
@@ -2146,9 +2219,15 @@ st: if (is_imm8(insn->off))
}
if (!imm32)
return -EINVAL;
+ if (priv_stack_mode != NO_PRIV_STACK) {
+ push_r9(&prog);
+ ip += 2;
+ }
ip += x86_call_depth_emit_accounting(&prog, func, ip);
if (emit_call(&prog, func, ip))
return -EINVAL;
+ if (priv_stack_mode != NO_PRIV_STACK)
+ pop_r9(&prog);
break;
}
@@ -3572,6 +3651,11 @@ bool bpf_jit_supports_exceptions(void)
return IS_ENABLED(CONFIG_UNWINDER_ORC);
}
+bool bpf_jit_supports_private_stack(void)
+{
+ return true;
+}
+
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
{
#if defined(CONFIG_UNWINDER_ORC)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index bcfe868e3801..dd28b05bcff0 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -891,6 +891,7 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
case BPF_PROG_TYPE_TRACING:
return prog->expected_attach_type != BPF_TRACE_ITER;
case BPF_PROG_TYPE_STRUCT_OPS:
+ return prog->aux->priv_stack_eligible;
case BPF_PROG_TYPE_LSM:
return false;
default:
--
2.43.5
* [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (6 preceding siblings ...)
2024-10-20 19:14 ` [PATCH bpf-next v6 7/9] bpf, x86: Add jit support for private stack Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
2024-10-20 21:59 ` Jiri Olsa
2024-10-20 19:14 ` [PATCH bpf-next v6 9/9] selftests/bpf: Add struct_ops " Yonghong Song
8 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Some private stack tests are added, including:
- prog with stack size greater than BPF_PRIV_STACK_MIN_SUBTREE_SIZE.
- prog with stack size less than BPF_PRIV_STACK_MIN_SUBTREE_SIZE.
- prog with one subprog having MAX_BPF_STACK stack size and another
subprog having non-zero stack size.
- prog with callback function.
- prog with exception in main prog or subprog.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../bpf/progs/verifier_private_stack.c | 216 ++++++++++++++++++
2 files changed, 218 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c
diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index e26b5150fc43..635ff3509403 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -59,6 +59,7 @@
#include "verifier_or_jmp32_k.skel.h"
#include "verifier_precision.skel.h"
#include "verifier_prevent_map_lookup.skel.h"
+#include "verifier_private_stack.skel.h"
#include "verifier_raw_stack.skel.h"
#include "verifier_raw_tp_writable.skel.h"
#include "verifier_reg_equal.skel.h"
@@ -185,6 +186,7 @@ void test_verifier_bpf_fastcall(void) { RUN(verifier_bpf_fastcall); }
void test_verifier_or_jmp32_k(void) { RUN(verifier_or_jmp32_k); }
void test_verifier_precision(void) { RUN(verifier_precision); }
void test_verifier_prevent_map_lookup(void) { RUN(verifier_prevent_map_lookup); }
+void test_verifier_private_stack(void) { RUN(verifier_private_stack); }
void test_verifier_raw_stack(void) { RUN(verifier_raw_stack); }
void test_verifier_raw_tp_writable(void) { RUN(verifier_raw_tp_writable); }
void test_verifier_reg_equal(void) { RUN(verifier_reg_equal); }
diff --git a/tools/testing/selftests/bpf/progs/verifier_private_stack.c b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
new file mode 100644
index 000000000000..e8de565f8b34
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+/* From include/linux/filter.h */
+#define MAX_BPF_STACK 512
+
+#if defined(__TARGET_ARCH_x86)
+
+SEC("kprobe")
+__description("Private stack, single prog")
+__success
+__arch_x86_64
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x100(%r9)")
+__naked void private_stack_single_prog(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - 256) = r1;"
+ "r0 = 0;"
+ "exit;"
+ :
+ :
+ : __clobber_all);
+}
+
+__used
+__naked static void cumulative_stack_depth_subprog(void)
+{
+ asm volatile (
+ "r1 = 41;"
+ "*(u64 *)(r10 - 32) = r1;"
+ "call %[bpf_get_smp_processor_id];"
+ "exit;"
+ :: __imm(bpf_get_smp_processor_id)
+ : __clobber_all);
+}
+
+SEC("kprobe")
+__description("Private stack, subtree > MAX_BPF_STACK")
+__success
+__arch_x86_64
+/* private stack fp for the main prog */
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq 0x{{.*}}")
+__jited(" popq %r9")
+__jited(" xorl %eax, %eax")
+__naked void private_stack_nested_1(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - %[max_bpf_stack]) = r1;"
+ "call cumulative_stack_depth_subprog;"
+ "r0 = 0;"
+ "exit;"
+ :
+ : __imm_const(max_bpf_stack, MAX_BPF_STACK)
+ : __clobber_all);
+}
+
+SEC("kprobe")
+__description("Private stack, subtree > MAX_BPF_STACK")
+__success
+__arch_x86_64
+/* private stack fp for the subprog */
+__jited(" addq $0x20, %r9")
+__naked void private_stack_nested_2(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - %[max_bpf_stack]) = r1;"
+ "call cumulative_stack_depth_subprog;"
+ "r0 = 0;"
+ "exit;"
+ :
+ : __imm_const(max_bpf_stack, MAX_BPF_STACK)
+ : __clobber_all);
+}
+
+SEC("raw_tp")
+__description("No private stack, nested")
+__success
+__arch_x86_64
+__jited(" subq $0x8, %rsp")
+__naked void no_private_stack_nested(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - 8) = r1;"
+ "call cumulative_stack_depth_subprog;"
+ "r0 = 0;"
+ "exit;"
+ :
+ :
+ : __clobber_all);
+}
+
+__naked __noinline __used
+static unsigned long loop_callback(void)
+{
+ asm volatile (
+ "call %[bpf_get_prandom_u32];"
+ "r1 = 42;"
+ "*(u64 *)(r10 - 512) = r1;"
+ "call cumulative_stack_depth_subprog;"
+ "r0 = 0;"
+ "exit;"
+ :
+ : __imm(bpf_get_prandom_u32)
+ : __clobber_common);
+}
+
+SEC("raw_tp")
+__description("Private stack, callback")
+__success
+__arch_x86_64
+/* for func loop_callback */
+__jited("func #1")
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+__naked void private_stack_callback(void)
+{
+ asm volatile (
+ "r1 = 1;"
+ "r2 = %[loop_callback];"
+ "r3 = 0;"
+ "r4 = 0;"
+ "call %[bpf_loop];"
+ "r0 = 0;"
+ "exit;"
+ :
+ : __imm_ptr(loop_callback),
+ __imm(bpf_loop)
+ : __clobber_common);
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in main prog")
+__success __retval(0)
+__arch_x86_64
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+int private_stack_exception_main_prog(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - 512) = r1;"
+ ::: __clobber_common);
+
+ bpf_throw(0);
+ return 0;
+}
+
+__used static int subprog_exception(void)
+{
+ bpf_throw(0);
+ return 0;
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in subprog")
+__success __retval(0)
+__arch_x86_64
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+int private_stack_exception_sub_prog(void)
+{
+ asm volatile (
+ "r1 = 42;"
+ "*(u64 *)(r10 - 512) = r1;"
+ "call subprog_exception;"
+ ::: __clobber_common);
+
+ return 0;
+}
+
+#else
+
+SEC("kprobe")
+__description("private stack is not supported, use a dummy test")
+__success
+int dummy_test(void)
+{
+ return 0;
+}
+
+#endif
+
+char _license[] SEC("license") = "GPL";
--
2.43.5
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH bpf-next v6 9/9] selftests/bpf: Add struct_ops prog private stack tests
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
` (7 preceding siblings ...)
2024-10-20 19:14 ` [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests Yonghong Song
@ 2024-10-20 19:14 ` Yonghong Song
8 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-20 19:14 UTC
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Add three tests for struct_ops using private stack.
./test_progs -t struct_ops_private_stack
#333/1 struct_ops_private_stack/private_stack:OK
#333/2 struct_ops_private_stack/private_stack_fail:OK
#333/3 struct_ops_private_stack/private_stack_recur:OK
#333 struct_ops_private_stack:OK
The first one nests two different callback functions, where the first
prog has a stack size (including subprogs) of more than 512 bytes and
has private stack enabled.
The second one is a negative test where the second prog has a stack size
of more than 512 bytes but private stack is not enabled.
The third one has the same callback function recursing into itself. At
run time, the JIT trampoline recursion check kicks in to prevent the
recursion.
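For context, the recursion check mentioned here is (roughly) the per-CPU
"active" counter taken on trampoline entry; a simplified sketch of the
idea, not the exact kernel code:

static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog,
					  struct bpf_tramp_run_ctx *run_ctx)
{
	/* If this prog is already running on this cpu, count a miss
	 * and return 0 so the trampoline skips the prog body.
	 */
	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		return 0;
	}
	return bpf_prog_start_time();
}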
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 77 +++++++++++++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 6 +
.../bpf/prog_tests/struct_ops_private_stack.c | 106 ++++++++++++++++++
.../bpf/progs/struct_ops_private_stack.c | 62 ++++++++++
.../bpf/progs/struct_ops_private_stack_fail.c | 62 ++++++++++
.../progs/struct_ops_private_stack_recur.c | 50 +++++++++
6 files changed, 363 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 8835761d9a12..00bb23cfa66e 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -245,6 +245,39 @@ __bpf_kfunc void bpf_testmod_ctx_release(struct bpf_testmod_ctx *ctx)
call_rcu(&ctx->rcu, testmod_free_cb);
}
+static struct bpf_testmod_ops3 *st_ops3;
+
+static int bpf_testmod_ops3__test_1__priv_stack(void)
+{
+ return 0;
+}
+
+static int bpf_testmod_test_4(void)
+{
+ return 0;
+}
+
+static struct bpf_testmod_ops3 __bpf_testmod_ops3 = {
+ .test_1 = bpf_testmod_ops3__test_1__priv_stack,
+ .test_2 = bpf_testmod_test_4,
+};
+
+static void bpf_testmod_test_struct_ops3(void)
+{
+ if (st_ops3)
+ st_ops3->test_1();
+}
+
+__bpf_kfunc void bpf_testmod_ops3_call_test_1(void)
+{
+ st_ops3->test_1();
+}
+
+__bpf_kfunc void bpf_testmod_ops3_call_test_2(void)
+{
+ st_ops3->test_2();
+}
+
struct bpf_testmod_btf_type_tag_1 {
int a;
};
@@ -380,6 +413,8 @@ bpf_testmod_test_read(struct file *file, struct kobject *kobj,
(void)bpf_testmod_test_arg_ptr_to_struct(&struct_arg1_2);
+ bpf_testmod_test_struct_ops3();
+
struct_arg3 = kmalloc((sizeof(struct bpf_testmod_struct_arg_3) +
sizeof(int)), GFP_KERNEL);
if (struct_arg3 != NULL) {
@@ -584,6 +619,8 @@ BTF_ID_FLAGS(func, bpf_kfunc_trusted_num_test, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_kfunc_rcu_task_test, KF_RCU)
BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_1)
+BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_2)
BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids)
BTF_ID_LIST(bpf_testmod_dtor_ids)
@@ -1094,6 +1131,10 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
.is_valid_access = bpf_testmod_ops_is_valid_access,
};
+static const struct bpf_verifier_ops bpf_testmod_verifier_ops3 = {
+ .is_valid_access = bpf_testmod_ops_is_valid_access,
+};
+
static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
{
struct bpf_testmod_ops *ops = kdata;
@@ -1173,6 +1214,41 @@ struct bpf_struct_ops bpf_testmod_ops2 = {
.owner = THIS_MODULE,
};
+static int st_ops3_reg(void *kdata, struct bpf_link *link)
+{
+ int err = 0;
+
+ mutex_lock(&st_ops_mutex);
+ if (st_ops3) {
+ pr_err("st_ops has already been registered\n");
+ err = -EEXIST;
+ goto unlock;
+ }
+ st_ops3 = kdata;
+
+unlock:
+ mutex_unlock(&st_ops_mutex);
+ return err;
+}
+
+static void st_ops3_unreg(void *kdata, struct bpf_link *link)
+{
+ mutex_lock(&st_ops_mutex);
+ st_ops3 = NULL;
+ mutex_unlock(&st_ops_mutex);
+}
+
+struct bpf_struct_ops bpf_testmod_ops3 = {
+ .verifier_ops = &bpf_testmod_verifier_ops3,
+ .init = bpf_testmod_ops_init,
+ .init_member = bpf_testmod_ops_init_member,
+ .reg = st_ops3_reg,
+ .unreg = st_ops3_unreg,
+ .cfi_stubs = &__bpf_testmod_ops3,
+ .name = "bpf_testmod_ops3",
+ .owner = THIS_MODULE,
+};
+
static int bpf_test_mod_st_ops__test_prologue(struct st_ops_args *args)
{
return 0;
@@ -1331,6 +1407,7 @@ static int bpf_testmod_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_testmod_kfunc_set);
ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops);
ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops2, bpf_testmod_ops2);
+ ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops3, bpf_testmod_ops3);
ret = ret ?: register_bpf_struct_ops(&testmod_st_ops, bpf_testmod_st_ops);
ret = ret ?: register_btf_id_dtor_kfuncs(bpf_testmod_dtors,
ARRAY_SIZE(bpf_testmod_dtors),
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index fb7dff47597a..59c600074eea 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -92,6 +92,12 @@ struct bpf_testmod_ops {
struct bpf_testmod_ops2 {
int (*test_1)(void);
+ int (*test_2)(void);
+};
+
+struct bpf_testmod_ops3 {
+ int (*test_1)(void);
+ int (*test_2)(void);
};
struct st_ops_args {
diff --git a/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c b/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
new file mode 100644
index 000000000000..4006879ca3fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include "struct_ops_private_stack.skel.h"
+#include "struct_ops_private_stack_fail.skel.h"
+#include "struct_ops_private_stack_recur.skel.h"
+
+static void test_private_stack(void)
+{
+ struct struct_ops_private_stack *skel;
+ struct bpf_link *link;
+ int err;
+
+ skel = struct_ops_private_stack__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack__load(skel);
+ if (!ASSERT_OK(err, "struct_ops_private_stack__load"))
+ goto cleanup;
+
+ link = bpf_map__attach_struct_ops(skel->maps.testmod_1);
+ if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+ goto cleanup;
+
+ ASSERT_OK(trigger_module_test_read(256), "trigger_read");
+
+ ASSERT_EQ(skel->bss->val_i, 3, "val_i");
+ ASSERT_EQ(skel->bss->val_j, 8, "val_j");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ struct_ops_private_stack__destroy(skel);
+}
+
+static void test_private_stack_fail(void)
+{
+ struct struct_ops_private_stack_fail *skel;
+ int err;
+
+ skel = struct_ops_private_stack_fail__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack_fail__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack_fail__load(skel);
+ if (!ASSERT_ERR(err, "struct_ops_private_stack_fail__load"))
+ goto cleanup;
+ return;
+
+cleanup:
+ struct_ops_private_stack_fail__destroy(skel);
+}
+
+static void test_private_stack_recur(void)
+{
+ struct struct_ops_private_stack_recur *skel;
+ struct bpf_link *link;
+ int err;
+
+ skel = struct_ops_private_stack_recur__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack_recur__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack_recur__load(skel);
+ if (!ASSERT_OK(err, "struct_ops_private_stack_recur__load"))
+ goto cleanup;
+
+ link = bpf_map__attach_struct_ops(skel->maps.testmod_1);
+ if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+ goto cleanup;
+
+ ASSERT_OK(trigger_module_test_read(256), "trigger_read");
+
+ ASSERT_EQ(skel->bss->val_j, 3, "val_j");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ struct_ops_private_stack_recur__destroy(skel);
+}
+
+void test_struct_ops_private_stack(void)
+{
+ if (test__start_subtest("private_stack"))
+ test_private_stack();
+ if (test__start_subtest("private_stack_fail"))
+ test_private_stack_fail();
+ if (test__start_subtest("private_stack_recur"))
+ test_private_stack_recur();
+}
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
new file mode 100644
index 000000000000..8ea57e5348ab
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_2(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 200 bytes */
+ int b[50] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 1;
+ val_i = subprog1(a);
+ bpf_testmod_ops3_call_test_2();
+ return 0;
+}
+
+SEC("struct_ops")
+int BPF_PROG(test_2)
+{
+ /* stack size 200 bytes */
+ int a[50] = {};
+
+ a[10] = 3;
+ val_j = subprog1(a);
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+ .test_2 = (void *)test_2,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
new file mode 100644
index 000000000000..1f55ec4cee37
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_2(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 200 bytes */
+ int b[50] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 100 bytes */
+ int a[25] = {};
+
+ a[10] = 1;
+ val_i = subprog1(a);
+ bpf_testmod_ops3_call_test_2();
+ return 0;
+}
+
+SEC("struct_ops")
+int BPF_PROG(test_2)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 3;
+ val_j = subprog1(a);
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+ .test_2 = (void *)test_2,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
new file mode 100644
index 000000000000..15d4e914dc92
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_1(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 400 bytes */
+ int b[100] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 1;
+ val_j += subprog1(a);
+ bpf_testmod_ops3_call_test_1();
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+};
--
2.43.5
* Re: [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-20 19:14 ` [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests Yonghong Song
@ 2024-10-20 21:59 ` Jiri Olsa
2024-10-21 4:32 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Jiri Olsa @ 2024-10-20 21:59 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau, Tejun Heo
On Sun, Oct 20, 2024 at 12:14:31PM -0700, Yonghong Song wrote:
SNIP
> +__naked __noinline __used
> +static unsigned long loop_callback(void)
> +{
> + asm volatile (
> + "call %[bpf_get_prandom_u32];"
> + "r1 = 42;"
> + "*(u64 *)(r10 - 512) = r1;"
> + "call cumulative_stack_depth_subprog;"
> + "r0 = 0;"
> + "exit;"
> + :
> + : __imm(bpf_get_prandom_u32)
> + : __clobber_common);
> +}
> +
> +SEC("raw_tp")
> +__description("Private stack, callback")
> +__success
> +__arch_x86_64
> +/* for func loop_callback */
> +__jited("func #1")
> +__jited(" endbr64")
this should fail if CONFIG_X86_KERNEL_IBT is not enabled, right?
hm, but I can see that also in other tests, so I guess it's fine,
should we add it to config.x86_64 ?
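For reference, such an entry would be a single line in
tools/testing/selftests/bpf/config.x86_64:

CONFIG_X86_KERNEL_IBT=y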
jirka
* Re: [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes
2024-10-20 19:14 ` [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes Yonghong Song
@ 2024-10-20 22:01 ` Jiri Olsa
2024-10-21 4:22 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Jiri Olsa @ 2024-10-20 22:01 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau, Tejun Heo
On Sun, Oct 20, 2024 at 12:14:05PM -0700, Yonghong Song wrote:
> Three private stack modes are used to direct jit action:
> NO_PRIV_STACK: do not use private stack
> PRIV_STACK_SUB_PROG: adjust frame pointer address (similar to normal stack)
> PRIV_STACK_ROOT_PROG: set the frame pointer
>
> Note that for subtree root prog (main prog or callback fn), even if the
> bpf_prog stack size is 0, PRIV_STACK_ROOT_PROG mode is still used.
> This is for bpf exception handling. More details can be found in
> subsequent jit support and selftest patches.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> include/linux/bpf.h | 9 +++++++++
> kernel/bpf/core.c | 19 +++++++++++++++++++
> kernel/bpf/verifier.c | 29 +++++++++++++++++++++++++++++
> 3 files changed, 57 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 376e43fc72b9..27430e9dcfe3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1456,6 +1456,12 @@ struct btf_mod_pair {
>
> struct bpf_kfunc_desc_tab;
>
> +enum bpf_priv_stack_mode {
> + NO_PRIV_STACK,
> + PRIV_STACK_SUB_PROG,
> + PRIV_STACK_ROOT_PROG,
> +};
> +
> struct bpf_prog_aux {
> atomic64_t refcnt;
> u32 used_map_cnt;
> @@ -1472,6 +1478,9 @@ struct bpf_prog_aux {
> u32 ctx_arg_info_size;
> u32 max_rdonly_access;
> u32 max_rdwr_access;
> + enum bpf_priv_stack_mode priv_stack_mode;
> + u16 subtree_stack_depth; /* Subtree stack depth if PRIV_STACK_ROOT_PROG, 0 otherwise */
> + void __percpu *priv_stack_ptr;
> struct btf *attach_btf;
> const struct bpf_ctx_arg_aux *ctx_arg_info;
> struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 14d9288441f2..aee0055def4f 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1240,6 +1240,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
> struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
>
> bpf_jit_binary_free(hdr);
> + free_percpu(fp->aux->priv_stack_ptr);
this should also be put into the x86 version of bpf_jit_free?
jirka
> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
> }
>
> @@ -2421,6 +2422,24 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
> if (*err)
> return fp;
>
> + if (fp->aux->priv_stack_eligible) {
> + if (!fp->aux->stack_depth) {
> + fp->aux->priv_stack_mode = NO_PRIV_STACK;
> + } else {
> + void __percpu *priv_stack_ptr;
> +
> + fp->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
> + priv_stack_ptr =
> + __alloc_percpu_gfp(fp->aux->stack_depth, 8, GFP_KERNEL);
> + if (!priv_stack_ptr) {
> + *err = -ENOMEM;
> + return fp;
> + }
> + fp->aux->subtree_stack_depth = fp->aux->stack_depth;
> + fp->aux->priv_stack_ptr = priv_stack_ptr;
> + }
> + }
> +
> fp = bpf_int_jit_compile(fp);
> bpf_prog_jit_attempt_done(fp);
> if (!fp->jited && jit_needed) {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 25283ee6f86f..f770015d6ad1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20018,6 +20018,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> {
> struct bpf_prog *prog = env->prog, **func, *tmp;
> int i, j, subprog_start, subprog_end = 0, len, subprog;
> + int subtree_top_idx, subtree_stack_depth;
> + void __percpu *priv_stack_ptr;
> struct bpf_map *map_ptr;
> struct bpf_insn *insn;
> void *old_bpf_func;
> @@ -20096,6 +20098,33 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> func[i]->is_func = 1;
> func[i]->sleepable = prog->sleepable;
> func[i]->aux->func_idx = i;
> +
> + subtree_top_idx = env->subprog_info[i].subtree_top_idx;
> + if (env->subprog_info[subtree_top_idx].priv_stack_eligible) {
> + if (subtree_top_idx == i)
> + func[i]->aux->subtree_stack_depth =
> + env->subprog_info[i].subtree_stack_depth;
> +
> + subtree_stack_depth = func[i]->aux->subtree_stack_depth;
> + if (subtree_top_idx != i) {
> + if (env->subprog_info[subtree_top_idx].subtree_stack_depth)
> + func[i]->aux->priv_stack_mode = PRIV_STACK_SUB_PROG;
> + else
> + func[i]->aux->priv_stack_mode = NO_PRIV_STACK;
> + } else if (!subtree_stack_depth) {
> + func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
> + } else {
> + func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
> + priv_stack_ptr =
> + __alloc_percpu_gfp(subtree_stack_depth, 8, GFP_KERNEL);
> + if (!priv_stack_ptr) {
> + err = -ENOMEM;
> + goto out_free;
> + }
> + func[i]->aux->priv_stack_ptr = priv_stack_ptr;
> + }
> + }
> +
> /* Below members will be freed only at prog->aux */
> func[i]->aux->btf = prog->aux->btf;
> func[i]->aux->func_info = prog->aux->func_info;
> --
> 2.43.5
>
>
* Re: [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes
2024-10-20 22:01 ` Jiri Olsa
@ 2024-10-21 4:22 ` Yonghong Song
0 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-21 4:22 UTC (permalink / raw)
To: Jiri Olsa
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau, Tejun Heo
On 10/20/24 3:01 PM, Jiri Olsa wrote:
> On Sun, Oct 20, 2024 at 12:14:05PM -0700, Yonghong Song wrote:
>> Three private stack modes are used to direct jit action:
>> NO_PRIV_STACK: do not use private stack
>> PRIV_STACK_SUB_PROG: adjust frame pointer address (similar to normal stack)
>> PRIV_STACK_ROOT_PROG: set the frame pointer
>>
>> Note that for subtree root prog (main prog or callback fn), even if the
>> bpf_prog stack size is 0, PRIV_STACK_ROOT_PROG mode is still used.
>> This is for bpf exception handling. More details can be found in
>> subsequent jit support and selftest patches.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf.h | 9 +++++++++
>> kernel/bpf/core.c | 19 +++++++++++++++++++
>> kernel/bpf/verifier.c | 29 +++++++++++++++++++++++++++++
>> 3 files changed, 57 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 376e43fc72b9..27430e9dcfe3 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1456,6 +1456,12 @@ struct btf_mod_pair {
>>
>> struct bpf_kfunc_desc_tab;
>>
>> +enum bpf_priv_stack_mode {
>> + NO_PRIV_STACK,
>> + PRIV_STACK_SUB_PROG,
>> + PRIV_STACK_ROOT_PROG,
>> +};
>> +
>> struct bpf_prog_aux {
>> atomic64_t refcnt;
>> u32 used_map_cnt;
>> @@ -1472,6 +1478,9 @@ struct bpf_prog_aux {
>> u32 ctx_arg_info_size;
>> u32 max_rdonly_access;
>> u32 max_rdwr_access;
>> + enum bpf_priv_stack_mode priv_stack_mode;
>> + u16 subtree_stack_depth; /* Subtree stack depth if PRIV_STACK_ROOT_PROG, 0 otherwise */
>> + void __percpu *priv_stack_ptr;
>> struct btf *attach_btf;
>> const struct bpf_ctx_arg_aux *ctx_arg_info;
>> struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 14d9288441f2..aee0055def4f 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -1240,6 +1240,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
>> struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
>>
>> bpf_jit_binary_free(hdr);
>> + free_percpu(fp->aux->priv_stack_ptr);
> this should also be put into the x86 version of bpf_jit_free?
Thanks for spotting this! Indeed, the x86 version of bpf_jit_free should
be used. Will fix in the next revision.
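A minimal sketch of the planned fix, assuming free_percpu() simply moves
into the x86 bpf_jit_free() in arch/x86/net/bpf_jit_comp.c (the exact
placement is illustrative):

void bpf_jit_free(struct bpf_prog *prog)
{
	if (prog->jited) {
		/* ... existing x86-specific teardown of the jited image ... */
		free_percpu(prog->aux->priv_stack_ptr);
	}

	bpf_prog_unlock_free(prog);
}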
>
> jirka
>
>> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
>> }
[...]
* Re: [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-20 21:59 ` Jiri Olsa
@ 2024-10-21 4:32 ` Yonghong Song
2024-10-21 10:40 ` Jiri Olsa
0 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-21 4:32 UTC (permalink / raw)
To: Jiri Olsa
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau, Tejun Heo
On 10/20/24 2:59 PM, Jiri Olsa wrote:
> On Sun, Oct 20, 2024 at 12:14:31PM -0700, Yonghong Song wrote:
>
> SNIP
>
>> +__naked __noinline __used
>> +static unsigned long loop_callback(void)
>> +{
>> + asm volatile (
>> + "call %[bpf_get_prandom_u32];"
>> + "r1 = 42;"
>> + "*(u64 *)(r10 - 512) = r1;"
>> + "call cumulative_stack_depth_subprog;"
>> + "r0 = 0;"
>> + "exit;"
>> + :
>> + : __imm(bpf_get_prandom_u32)
>> + : __clobber_common);
>> +}
>> +
>> +SEC("raw_tp")
>> +__description("Private stack, callback")
>> +__success
>> +__arch_x86_64
>> +/* for func loop_callback */
>> +__jited("func #1")
>> +__jited(" endbr64")
> this should fail if CONFIG_X86_KERNEL_IBT is not enabled, right?
>
> hm, but I can see that also in other tests, so I guess it's fine,
> should we add it to config.x86_64 ?
The CI has CONFIG_X86_KERNEL_IBT as well.
I checked x86 kconfig, I see
config CC_HAS_IBT
# GCC >= 9 and binutils >= 2.29
# Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
# Clang/LLVM >= 14
# https://github.com/llvm/llvm-project/commit/e0b89df2e0f0130881bf6c39bf31d7f6aac00e0f
# https://github.com/llvm/llvm-project/commit/dfcf69770bc522b9e411c66454934a37c1f35332
def_bool ((CC_IS_GCC && $(cc-option, -fcf-protection=branch -mindirect-branch-register)) || \
(CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
$(as-instr,endbr64)
config X86_KERNEL_IBT
prompt "Indirect Branch Tracking"
def_bool y
depends on X86_64 && CC_HAS_IBT && HAVE_OBJTOOL
# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
depends on !LD_IS_LLD || LLD_VERSION >= 140000
select OBJTOOL
select X86_CET
help
Build the kernel with support for Indirect Branch Tracking, a
hardware support course-grain forward-edge Control Flow Integrity
protection. It enforces that all indirect calls must land on
an ENDBR instruction, as such, the compiler will instrument the
code with them to make this happen.
In addition to building the kernel with IBT, seal all functions that
are not indirect call targets, avoiding them ever becoming one.
This requires LTO like objtool runs and will slow down the build. It
does significantly reduce the number of ENDBR instructions in the
kernel image.
So CONFIG_X86_KERNEL_IBT will be enabled with clang >= 14 or a new enough gcc.
In my system, the gcc version is 13.1. So there is no need to explicitly add
CONFIG_X86_KERNEL_IBT to the selftests/bpf/config.x86_64 file.
>
> jirka
* Re: [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-21 4:32 ` Yonghong Song
@ 2024-10-21 10:40 ` Jiri Olsa
2024-10-21 16:19 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Jiri Olsa @ 2024-10-21 10:40 UTC (permalink / raw)
To: Yonghong Song
Cc: Jiri Olsa, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, kernel-team, Martin KaFai Lau, Tejun Heo
On Sun, Oct 20, 2024 at 09:32:38PM -0700, Yonghong Song wrote:
>
> On 10/20/24 2:59 PM, Jiri Olsa wrote:
> > On Sun, Oct 20, 2024 at 12:14:31PM -0700, Yonghong Song wrote:
> >
> > SNIP
> >
> > > +__naked __noinline __used
> > > +static unsigned long loop_callback(void)
> > > +{
> > > + asm volatile (
> > > + "call %[bpf_get_prandom_u32];"
> > > + "r1 = 42;"
> > > + "*(u64 *)(r10 - 512) = r1;"
> > > + "call cumulative_stack_depth_subprog;"
> > > + "r0 = 0;"
> > > + "exit;"
> > > + :
> > > + : __imm(bpf_get_prandom_u32)
> > > + : __clobber_common);
> > > +}
> > > +
> > > +SEC("raw_tp")
> > > +__description("Private stack, callback")
> > > +__success
> > > +__arch_x86_64
> > > +/* for func loop_callback */
> > > +__jited("func #1")
> > > +__jited(" endbr64")
> > this should fail if CONFIG_X86_KERNEL_IBT is not enabled, right?
> >
> > hm, but I can see that also in other tests, so I guess it's fine,
> > should we add it to config.x86_64 ?
>
> The CI has CONFIG_X86_KERNEL_IBT as well.
>
> I checked x86 kconfig, I see
>
> config CC_HAS_IBT
> # GCC >= 9 and binutils >= 2.29
> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
> # Clang/LLVM >= 14
> # https://github.com/llvm/llvm-project/commit/e0b89df2e0f0130881bf6c39bf31d7f6aac00e0f
> # https://github.com/llvm/llvm-project/commit/dfcf69770bc522b9e411c66454934a37c1f35332
> def_bool ((CC_IS_GCC && $(cc-option, -fcf-protection=branch -mindirect-branch-register)) || \
> (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
> $(as-instr,endbr64)
>
> config X86_KERNEL_IBT
> prompt "Indirect Branch Tracking"
> def_bool y
> depends on X86_64 && CC_HAS_IBT && HAVE_OBJTOOL
> # https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
> depends on !LD_IS_LLD || LLD_VERSION >= 140000
> select OBJTOOL
> select X86_CET
> help
> Build the kernel with support for Indirect Branch Tracking, a
> hardware support course-grain forward-edge Control Flow Integrity
> protection. It enforces that all indirect calls must land on
> an ENDBR instruction, as such, the compiler will instrument the
> code with them to make this happen.
> In addition to building the kernel with IBT, seal all functions that
> are not indirect call targets, avoiding them ever becoming one.
> This requires LTO like objtool runs and will slow down the build. It
> does significantly reduce the number of ENDBR instructions in the
> kernel image.
>
> So CONFIG_X86_KERNEL_IBT will be enabled with clang >= 14 or a new enough gcc.
IIUC it's just a dependency, no? That doesn't mean it'll get enabled automatically.
> In my system, the gcc version is 13.1. So there is no need to explicitly add
> CONFIG_X86_KERNEL_IBT to the selftests/bpf/config.x86_64 file.
I had to enable it manually for gcc 13.3.1
jirka
* Re: [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-21 10:40 ` Jiri Olsa
@ 2024-10-21 16:19 ` Yonghong Song
2024-10-21 21:13 ` Jiri Olsa
0 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-21 16:19 UTC (permalink / raw)
To: Jiri Olsa
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
kernel-team, Martin KaFai Lau, Tejun Heo
On 10/21/24 3:40 AM, Jiri Olsa wrote:
> On Sun, Oct 20, 2024 at 09:32:38PM -0700, Yonghong Song wrote:
>> On 10/20/24 2:59 PM, Jiri Olsa wrote:
>>> On Sun, Oct 20, 2024 at 12:14:31PM -0700, Yonghong Song wrote:
>>>
>>> SNIP
>>>
>>>> +__naked __noinline __used
>>>> +static unsigned long loop_callback(void)
>>>> +{
>>>> + asm volatile (
>>>> + "call %[bpf_get_prandom_u32];"
>>>> + "r1 = 42;"
>>>> + "*(u64 *)(r10 - 512) = r1;"
>>>> + "call cumulative_stack_depth_subprog;"
>>>> + "r0 = 0;"
>>>> + "exit;"
>>>> + :
>>>> + : __imm(bpf_get_prandom_u32)
>>>> + : __clobber_common);
>>>> +}
>>>> +
>>>> +SEC("raw_tp")
>>>> +__description("Private stack, callback")
>>>> +__success
>>>> +__arch_x86_64
>>>> +/* for func loop_callback */
>>>> +__jited("func #1")
>>>> +__jited(" endbr64")
>>> this should fail if CONFIG_X86_KERNEL_IBT is not enabled, right?
>>>
>>> hm, but I can see that also in other tests, so I guess it's fine,
>>> should we add it to config.x86_64 ?
>> The CI has CONFIG_X86_KERNEL_IBT as well.
>>
>> I checked x86 kconfig, I see
>>
>> config CC_HAS_IBT
>> # GCC >= 9 and binutils >= 2.29
>> # Retpoline check to work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93654
>> # Clang/LLVM >= 14
>> # https://github.com/llvm/llvm-project/commit/e0b89df2e0f0130881bf6c39bf31d7f6aac00e0f
>> # https://github.com/llvm/llvm-project/commit/dfcf69770bc522b9e411c66454934a37c1f35332
>> def_bool ((CC_IS_GCC && $(cc-option, -fcf-protection=branch -mindirect-branch-register)) || \
>> (CC_IS_CLANG && CLANG_VERSION >= 140000)) && \
>> $(as-instr,endbr64)
>>
>> config X86_KERNEL_IBT
>> prompt "Indirect Branch Tracking"
>> def_bool y
>> depends on X86_64 && CC_HAS_IBT && HAVE_OBJTOOL
>> # https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
>> depends on !LD_IS_LLD || LLD_VERSION >= 140000
>> select OBJTOOL
>> select X86_CET
>> help
>> Build the kernel with support for Indirect Branch Tracking, a
>> hardware support course-grain forward-edge Control Flow Integrity
>> protection. It enforces that all indirect calls must land on
>> an ENDBR instruction, as such, the compiler will instrument the
>> code with them to make this happen.
>> In addition to building the kernel with IBT, seal all functions that
>> are not indirect call targets, avoiding them ever becoming one.
>> This requires LTO like objtool runs and will slow down the build. It
>> does significantly reduce the number of ENDBR instructions in the
>> kernel image.
>>
> > So CONFIG_X86_KERNEL_IBT will be enabled with clang >= 14 or a new enough gcc.
> IIUC it's just a dependency, no? That doesn't mean it'll get enabled automatically.
>
>> In my system, the gcc version is 13.1. So there is no need to explicitly add
>> CONFIG_X86_KERNEL_IBT to the selftests/bpf/config.x86_64 file.
> I had to enable it manually for gcc 13.3.1
IIUC, the CI config is generated based on config + config.x86_64 + config.vm
in the tools/testing/selftests/bpf directory.
In my case .config is generated from config + config.x86_64 + config.vm
With my local gcc 11.5, I did
make olddefconfig
and I see CONFIG_X86_KERNEL_IBT=y is set.
Maybe your base config is a little bit different from what ci used.
My local config is based on ci config + some more e.g. enabling KASAN etc.
Could you debug a little more into why CONFIG_X86_KERNEL_IBT is not enabled
by default in your case? For reference:
config X86_KERNEL_IBT
prompt "Indirect Branch Tracking"
def_bool y
depends on X86_64 && CC_HAS_IBT && HAVE_OBJTOOL
# https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
depends on !LD_IS_LLD || LLD_VERSION >= 140000
select OBJTOOL
select X86_CET
default is 'y' so if all dependencies are met, CONFIG_X86_KERNEL_IBT
is supposed to be on by default.
>
> jirka
* Re: [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests
2024-10-21 16:19 ` Yonghong Song
@ 2024-10-21 21:13 ` Jiri Olsa
0 siblings, 0 replies; 37+ messages in thread
From: Jiri Olsa @ 2024-10-21 21:13 UTC (permalink / raw)
To: Yonghong Song
Cc: Jiri Olsa, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, kernel-team, Martin KaFai Lau, Tejun Heo
On Mon, Oct 21, 2024 at 09:19:57AM -0700, Yonghong Song wrote:
SNIP
> > > In my system, the gcc version is 13.1. So there is no need to explicitly add
> > > CONFIG_X86_KERNEL_IBT to the selftests/bpf/config.x86_64 file.
> > I had to enable it manualy for gcc 13.3.1
>
> IIUC, the CI config is generated based on config + config.x86_64 + config.vm
> in the tools/testing/selftests/bpf directory.
>
> In my case .config is generated from config + config.x86_64 + config.vm
> With my local gcc 11.5, I did
> make olddefconfig
> and I see CONFIG_X86_KERNEL_IBT=y is set.
>
> Maybe your base config is a little bit different from what ci used.
> My local config is based on ci config + some more e.g. enabling KASAN etc.
>
> Could you debug a little more into why CONFIG_X86_KERNEL_IBT is not enabled
> by default in your case? For reference:
ok, I think I disabled that manually
>
> config X86_KERNEL_IBT
> prompt "Indirect Branch Tracking"
> def_bool y
> depends on X86_64 && CC_HAS_IBT && HAVE_OBJTOOL
> # https://github.com/llvm/llvm-project/commit/9d7001eba9c4cb311e03cd8cdc231f9e579f2d0f
> depends on !LD_IS_LLD || LLD_VERSION >= 140000
> select OBJTOOL
> select X86_CET
>
> default is 'y' so if all dependencies are met, CONFIG_X86_KERNEL_IBT
> is supposed to be on by default.
ah right, that should work then.. thanks for the details
jirka
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-20 19:13 ` [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
@ 2024-10-22 1:18 ` Alexei Starovoitov
2024-10-22 3:21 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 1:18 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Sun, Oct 20, 2024 at 12:14 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> With private stack support, each subprog can have stack with up to 512
> bytes. The limit of 512 bytes per subprog is kept to avoid increasing
> verifier complexity since greater than 512 bytes will cause big verifier
> change and increase memory consumption and verification time.
>
> If private stack is supported, for a bpf prog, esp. when it has
> subprogs, private stack will be allocated for the main prog
> and for each callback subprog. For example,
> main_prog
> subprog1
> calling helper
> subprog10 (callback func)
> subprog11
> subprog2
> calling helper
> subprog10 (callback func)
> subprog11
>
> Separate private allocations for main_prog and callback_fn subprog10
> will make things easier since the helper function uses the kernel stack.
>
> In this patch, some tracing programs are allowed to use private
> stack since tracing prog may be triggered in the middle of some other
> prog runs. Additional subprog info is also collected for later to
> allocate private stack for main prog and each callback functions.
>
> Note that if any tail_call is called in the prog (including all subprogs),
> then private stack is not used.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> include/linux/bpf.h | 1 +
> include/linux/bpf_verifier.h | 3 ++
> include/linux/filter.h | 1 +
> kernel/bpf/core.c | 5 ++
> kernel/bpf/verifier.c | 100 ++++++++++++++++++++++++++++++-----
> 5 files changed, 97 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0c216e71cec7..6ad8ace7075a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1490,6 +1490,7 @@ struct bpf_prog_aux {
> bool exception_cb;
> bool exception_boundary;
> bool is_extended; /* true if extended by freplace program */
> + bool priv_stack_eligible;
> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
> struct bpf_arena *arena;
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 4513372c5bc8..bcfe868e3801 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -659,6 +659,8 @@ struct bpf_subprog_info {
> * are used for bpf_fastcall spills and fills.
> */
> s16 fastcall_stack_off;
> + u16 subtree_stack_depth;
> + u16 subtree_top_idx;
> bool has_tail_call: 1;
> bool tail_call_reachable: 1;
> bool has_ld_abs: 1;
> @@ -668,6 +670,7 @@ struct bpf_subprog_info {
> bool args_cached: 1;
> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
> bool keep_fastcall_stack: 1;
> + bool priv_stack_eligible: 1;
>
> u8 arg_cnt;
> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 7d7578a8eac1..3a21947f2fd4 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
> bool bpf_jit_supports_ptr_xchg(void);
> bool bpf_jit_supports_arena(void);
> bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
> +bool bpf_jit_supports_private_stack(void);
> u64 bpf_arch_uaddress_limit(void);
> void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
> bool bpf_helper_changes_pkt_data(void *func);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 233ea78f8f1b..14d9288441f2 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -3045,6 +3045,11 @@ bool __weak bpf_jit_supports_exceptions(void)
> return false;
> }
>
> +bool __weak bpf_jit_supports_private_stack(void)
> +{
> + return false;
> +}
> +
> void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
> {
> }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f514247ba8ba..45bea4066272 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -194,6 +194,8 @@ struct bpf_verifier_stack_elem {
>
> #define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512
>
> +#define BPF_PRIV_STACK_MIN_SUBTREE_SIZE 128
> +
> static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
> static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> @@ -5982,6 +5984,41 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
> strict);
> }
>
> +static bool bpf_enable_private_stack(struct bpf_verifier_env *env)
> +{
> + if (!bpf_jit_supports_private_stack())
> + return false;
> +
> + switch (env->prog->type) {
> + case BPF_PROG_TYPE_KPROBE:
> + case BPF_PROG_TYPE_TRACEPOINT:
> + case BPF_PROG_TYPE_PERF_EVENT:
> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> + return true;
> + case BPF_PROG_TYPE_TRACING:
> + if (env->prog->expected_attach_type != BPF_TRACE_ITER)
> + return true;
> + fallthrough;
> + default:
> + return false;
> + }
> +}
> +
> +static bool is_priv_stack_supported(struct bpf_verifier_env *env)
> +{
> + struct bpf_subprog_info *si = env->subprog_info;
> + bool has_tail_call = false;
> +
> + for (int i = 0; i < env->subprog_cnt; i++) {
> + if (si[i].has_tail_call) {
> + has_tail_call = true;
> + break;
> + }
> + }
> +
> + return !has_tail_call && bpf_enable_private_stack(env);
> +}
> +
> static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
> {
> if (env->prog->jit_requested)
> @@ -5999,16 +6036,21 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
> * Since recursion is prevented by check_cfg() this algorithm
> * only needs a local stack of MAX_CALL_FRAMES to remember callsites
> */
> -static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> +static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> + bool check_priv_stack, bool priv_stack_supported)
> {
> struct bpf_subprog_info *subprog = env->subprog_info;
> struct bpf_insn *insn = env->prog->insnsi;
> int depth = 0, frame = 0, i, subprog_end;
> bool tail_call_reachable = false;
> + bool priv_stack_eligible = false;
> int ret_insn[MAX_CALL_FRAMES];
> int ret_prog[MAX_CALL_FRAMES];
> - int j;
> + int j, subprog_stack_depth;
> + int orig_idx = idx;
>
> + if (check_priv_stack)
> + subprog[idx].subtree_top_idx = idx;
> i = subprog[idx].start;
> process_func:
> /* protect against potential stack overflow that might happen when
> @@ -6030,18 +6072,33 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> * tailcall will unwind the current stack frame but it will not get rid
> * of caller's stack as shown on the example above.
> */
> - if (idx && subprog[idx].has_tail_call && depth >= 256) {
> + if (!check_priv_stack && idx && subprog[idx].has_tail_call && depth >= 256) {
> verbose(env,
> "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
> depth);
> return -EACCES;
> }
> - depth += round_up_stack_depth(env, subprog[idx].stack_depth);
> - if (depth > MAX_BPF_STACK) {
> + subprog_stack_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
> + depth += subprog_stack_depth;
> + if (!check_priv_stack && !priv_stack_supported && depth > MAX_BPF_STACK) {
> verbose(env, "combined stack size of %d calls is %d. Too large\n",
> frame + 1, depth);
> return -EACCES;
> }
> + if (check_priv_stack) {
> + if (subprog_stack_depth > MAX_BPF_STACK) {
> + verbose(env, "stack size of subprog %d is %d. Too large\n",
> + idx, subprog_stack_depth);
> + return -EACCES;
> + }
> +
> + if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
> + subprog[orig_idx].priv_stack_eligible = true;
> + env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
> + }
> + subprog[orig_idx].subtree_stack_depth =
> + max_t(u16, subprog[orig_idx].subtree_stack_depth, depth);
> + }
> continue_func:
> subprog_end = subprog[idx + 1].start;
> for (; i < subprog_end; i++) {
> @@ -6078,6 +6135,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> next_insn = i + insn[i].imm + 1;
> sidx = find_subprog(env, next_insn);
> if (sidx < 0) {
> + /* It is possible that callback func has been removed as dead code after
> + * instruction rewrites, e.g. bpf_loop with cnt 0.
> + */
> + if (check_priv_stack)
> + continue;
> +
And this extra hack is only needed because check_max_stack_depth() will
be called a 2nd time?
Why call it twice at all?
Record everything in the first pass.
> WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
> next_insn);
> return -EFAULT;
> @@ -6097,8 +6160,10 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> }
> i = next_insn;
> idx = sidx;
> + if (check_priv_stack)
> + subprog[idx].subtree_top_idx = orig_idx;
>
> - if (subprog[idx].has_tail_call)
> + if (!check_priv_stack && subprog[idx].has_tail_call)
> tail_call_reachable = true;
>
> frame++;
> @@ -6122,7 +6187,7 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> }
> subprog[ret_prog[j]].tail_call_reachable = true;
> }
> - if (subprog[0].tail_call_reachable)
> + if (!check_priv_stack && subprog[0].tail_call_reachable)
> env->prog->aux->tail_call_reachable = true;
>
> /* end of for() loop means the last insn of the 'subprog'
> @@ -6137,14 +6202,18 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
> goto continue_func;
> }
>
> -static int check_max_stack_depth(struct bpf_verifier_env *env)
> +static int check_max_stack_depth(struct bpf_verifier_env *env, bool check_priv_stack,
> + bool priv_stack_supported)
> {
> struct bpf_subprog_info *si = env->subprog_info;
> + bool check_subprog;
> int ret;
>
> for (int i = 0; i < env->subprog_cnt; i++) {
> - if (!i || si[i].is_async_cb) {
> - ret = check_max_stack_depth_subprog(env, i);
> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
why?
This looks very suspicious.
> + if (check_subprog) {
> + ret = check_max_stack_depth_subprog(env, i, check_priv_stack,
> + priv_stack_supported);
> if (ret < 0)
> return ret;
> }
> @@ -22303,7 +22372,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
> struct bpf_verifier_env *env;
> int i, len, ret = -EINVAL, err;
> u32 log_true_size;
> - bool is_priv;
> + bool is_priv, priv_stack_supported = false;
>
> /* no program is valid */
> if (ARRAY_SIZE(bpf_verifier_ops) == 0)
> @@ -22430,8 +22499,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
> if (ret == 0)
> ret = remove_fastcall_spills_fills(env);
>
> - if (ret == 0)
> - ret = check_max_stack_depth(env);
> + if (ret == 0) {
> + priv_stack_supported = is_priv_stack_supported(env);
> + ret = check_max_stack_depth(env, false, priv_stack_supported);
> + }
>
> /* instruction rewrites happen after this point */
> if (ret == 0)
> @@ -22465,6 +22536,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
> : false;
> }
>
> + if (ret == 0 && priv_stack_supported)
> + ret = check_max_stack_depth(env, true, true);
> +
> if (ret == 0)
> ret = fixup_call_args(env);
>
> --
> 2.43.5
>
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-20 19:13 ` [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs Yonghong Song
@ 2024-10-22 1:34 ` Alexei Starovoitov
2024-10-22 2:59 ` Yonghong Song
2024-10-22 17:26 ` Martin KaFai Lau
0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 1:34 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Sun, Oct 20, 2024 at 12:16 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> To identify whether a st_ops program requests private stack or not,
> the st_ops stub function is checked. If the stub function has the
> following name
> <st_ops_name>__<member_name>__priv_stack
> then the corresponding st_ops member func requests to use private
> stack. The information that the private stack is requested or not
> is encoded in struct bpf_struct_ops_func_info which will later be
> used by verifier.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> include/linux/bpf.h | 2 ++
> kernel/bpf/bpf_struct_ops.c | 35 +++++++++++++++++++++++++----------
> kernel/bpf/verifier.c | 8 +++++++-
> 3 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f3884ce2603d..376e43fc72b9 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1491,6 +1491,7 @@ struct bpf_prog_aux {
> bool exception_boundary;
> bool is_extended; /* true if extended by freplace program */
> bool priv_stack_eligible;
> + bool priv_stack_always;
> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
> struct bpf_arena *arena;
> @@ -1776,6 +1777,7 @@ struct bpf_struct_ops {
> struct bpf_struct_ops_func_info {
> struct bpf_ctx_arg_aux *info;
> u32 cnt;
> + bool priv_stack_always;
> };
>
> struct bpf_struct_ops_desc {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 8279b5a57798..2cd4bd086c7a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -145,33 +145,44 @@ void bpf_struct_ops_image_free(void *image)
> }
>
> #define MAYBE_NULL_SUFFIX "__nullable"
> -#define MAX_STUB_NAME 128
> +#define MAX_STUB_NAME 140
>
> /* Return the type info of a stub function, if it exists.
> *
> - * The name of a stub function is made up of the name of the struct_ops and
> - * the name of the function pointer member, separated by "__". For example,
> - * if the struct_ops type is named "foo_ops" and the function pointer
> - * member is named "bar", the stub function name would be "foo_ops__bar".
> + * The name of a stub function is made up of the name of the struct_ops,
> + * the name of the function pointer member and optionally "priv_stack"
> + * suffix, separated by "__". For example, if the struct_ops type is named
> + * "foo_ops" and the function pointer member is named "bar", the stub
> + * function name would be "foo_ops__bar". If a suffix "priv_stack" exists,
> + * the stub function name would be "foo_ops__bar__priv_stack".
> */
> static const struct btf_type *
> find_stub_func_proto(const struct btf *btf, const char *st_op_name,
> - const char *member_name)
> + const char *member_name, bool *priv_stack_always)
> {
> char stub_func_name[MAX_STUB_NAME];
> const struct btf_type *func_type;
> s32 btf_id;
> int cp;
>
> - cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
> + cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s__priv_stack",
> st_op_name, member_name);
I don't think this approach fits.
pw-bot: cr
Also, looking at the original
commit 1611603537a4 ("bpf: Create argument information for nullable arguments.")
that added this %s__%s notation, I'm not sure why we went
with that approach.
Just to avoid adding __nullable suffix in the actual callback
and using cfi stub callback names with such suffixes as
a "proxy" for the real callback?
Did we ever use this functionality for anything other than
bpf_testmod_ops__test_maybe_null selftest ?
Martin ?
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-22 1:34 ` Alexei Starovoitov
@ 2024-10-22 2:59 ` Yonghong Song
2024-10-22 17:26 ` Martin KaFai Lau
1 sibling, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 2:59 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/21/24 6:34 PM, Alexei Starovoitov wrote:
> On Sun, Oct 20, 2024 at 12:16 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> To identify whether a st_ops program requests private stack or not,
>> the st_ops stub function is checked. If the stub function has the
>> following name
>> <st_ops_name>__<member_name>__priv_stack
>> then the corresponding st_ops member func requests to use private
>> stack. The information that the private stack is requested or not
>> is encoded in struct bpf_struct_ops_func_info which will later be
>> used by verifier.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf.h | 2 ++
>> kernel/bpf/bpf_struct_ops.c | 35 +++++++++++++++++++++++++----------
>> kernel/bpf/verifier.c | 8 +++++++-
>> 3 files changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index f3884ce2603d..376e43fc72b9 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1491,6 +1491,7 @@ struct bpf_prog_aux {
>> bool exception_boundary;
>> bool is_extended; /* true if extended by freplace program */
>> bool priv_stack_eligible;
>> + bool priv_stack_always;
>> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
>> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
>> struct bpf_arena *arena;
>> @@ -1776,6 +1777,7 @@ struct bpf_struct_ops {
>> struct bpf_struct_ops_func_info {
>> struct bpf_ctx_arg_aux *info;
>> u32 cnt;
>> + bool priv_stack_always;
>> };
>>
>> struct bpf_struct_ops_desc {
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 8279b5a57798..2cd4bd086c7a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -145,33 +145,44 @@ void bpf_struct_ops_image_free(void *image)
>> }
>>
>> #define MAYBE_NULL_SUFFIX "__nullable"
>> -#define MAX_STUB_NAME 128
>> +#define MAX_STUB_NAME 140
>>
>> /* Return the type info of a stub function, if it exists.
>> *
>> - * The name of a stub function is made up of the name of the struct_ops and
>> - * the name of the function pointer member, separated by "__". For example,
>> - * if the struct_ops type is named "foo_ops" and the function pointer
>> - * member is named "bar", the stub function name would be "foo_ops__bar".
>> + * The name of a stub function is made up of the name of the struct_ops,
>> + * the name of the function pointer member and optionally "priv_stack"
>> + * suffix, separated by "__". For example, if the struct_ops type is named
>> + * "foo_ops" and the function pointer member is named "bar", the stub
>> + * function name would be "foo_ops__bar". If a suffix "priv_stack" exists,
>> + * the stub function name would be "foo_ops__bar__priv_stack".
>> */
>> static const struct btf_type *
>> find_stub_func_proto(const struct btf *btf, const char *st_op_name,
>> - const char *member_name)
>> + const char *member_name, bool *priv_stack_always)
>> {
>> char stub_func_name[MAX_STUB_NAME];
>> const struct btf_type *func_type;
>> s32 btf_id;
>> int cp;
>>
>> - cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
>> + cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s__priv_stack",
>> st_op_name, member_name);
> I don't think this approach fits.
> pw-bot: cr
Okay, I will use the check_member() callback function then. It should avoid
this hack.
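For example, a rough sketch of that direction (hypothetical: how the
private stack request gets recorded for the verifier is not settled
here):

static int bpf_testmod_ops3_check_member(const struct btf_type *t,
					 const struct btf_member *member,
					 const struct bpf_prog *prog)
{
	u32 moff = __btf_member_bit_offset(t, member) / 8;

	if (moff == offsetof(struct bpf_testmod_ops3, test_1)) {
		/* hypothetical: mark this member's prog as wanting a
		 * private stack, e.g. via a prog->aux flag once the
		 * verifier plumbing exists
		 */
	}
	return 0;
}

with the callback wired up via .check_member = bpf_testmod_ops3_check_member
in struct bpf_struct_ops bpf_testmod_ops3.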
>
> Also, looking at the original
> commit 1611603537a4 ("bpf: Create argument information for nullable arguments.")
> that added this %s__%s notation, I'm not sure why we went
> with that approach.
>
> Just to avoid adding __nullable suffix in the actual callback
> and using cfi stub callback names with such suffixes as
> a "proxy" for the real callback?
>
> Did we ever use this functionality for anything other than
> bpf_testmod_ops__test_maybe_null selftest ?
>
> Martin ?
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 1:18 ` Alexei Starovoitov
@ 2024-10-22 3:21 ` Yonghong Song
2024-10-22 3:43 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 3:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/21/24 6:18 PM, Alexei Starovoitov wrote:
> On Sun, Oct 20, 2024 at 12:14 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> With private stack support, each subprog can have stack with up to 512
>> bytes. The limit of 512 bytes per subprog is kept to avoid increasing
>> verifier complexity since greater than 512 bytes will cause big verifier
>> change and increase memory consumption and verification time.
>>
>> If private stack is supported, for a bpf prog, esp. when it has
>> subprogs, private stack will be allocated for the main prog
>> and for each callback subprog. For example,
>> main_prog
>> subprog1
>> calling helper
>> subprog10 (callback func)
>> subprog11
>> subprog2
>> calling helper
>> subprog10 (callback func)
>> subprog11
>>
>> Separate private allocations for main_prog and callback_fn subprog10
>> will make things easier since the helper function uses the kernel stack.
>>
>> In this patch, some tracing programs are allowed to use private
>> stack since tracing prog may be triggered in the middle of some other
>> prog runs. Additional subprog info is also collected for later to
>> allocate private stack for main prog and each callback functions.
>>
>> Note that if any tail_call is called in the prog (including all subprogs),
>> then private stack is not used.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf.h | 1 +
>> include/linux/bpf_verifier.h | 3 ++
>> include/linux/filter.h | 1 +
>> kernel/bpf/core.c | 5 ++
>> kernel/bpf/verifier.c | 100 ++++++++++++++++++++++++++++++-----
>> 5 files changed, 97 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 0c216e71cec7..6ad8ace7075a 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1490,6 +1490,7 @@ struct bpf_prog_aux {
>> bool exception_cb;
>> bool exception_boundary;
>> bool is_extended; /* true if extended by freplace program */
>> + bool priv_stack_eligible;
>> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
>> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
>> struct bpf_arena *arena;
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 4513372c5bc8..bcfe868e3801 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -659,6 +659,8 @@ struct bpf_subprog_info {
>> * are used for bpf_fastcall spills and fills.
>> */
>> s16 fastcall_stack_off;
>> + u16 subtree_stack_depth;
>> + u16 subtree_top_idx;
>> bool has_tail_call: 1;
>> bool tail_call_reachable: 1;
>> bool has_ld_abs: 1;
>> @@ -668,6 +670,7 @@ struct bpf_subprog_info {
>> bool args_cached: 1;
>> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
>> bool keep_fastcall_stack: 1;
>> + bool priv_stack_eligible: 1;
>>
>> u8 arg_cnt;
>> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 7d7578a8eac1..3a21947f2fd4 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
>> bool bpf_jit_supports_ptr_xchg(void);
>> bool bpf_jit_supports_arena(void);
>> bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
>> +bool bpf_jit_supports_private_stack(void);
>> u64 bpf_arch_uaddress_limit(void);
>> void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
>> bool bpf_helper_changes_pkt_data(void *func);
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 233ea78f8f1b..14d9288441f2 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -3045,6 +3045,11 @@ bool __weak bpf_jit_supports_exceptions(void)
>> return false;
>> }
>>
>> +bool __weak bpf_jit_supports_private_stack(void)
>> +{
>> + return false;
>> +}
>> +
>> void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
>> {
>> }
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f514247ba8ba..45bea4066272 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -194,6 +194,8 @@ struct bpf_verifier_stack_elem {
>>
>> #define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512
>>
>> +#define BPF_PRIV_STACK_MIN_SUBTREE_SIZE 128
>> +
>> static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
>> static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
>> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
>> @@ -5982,6 +5984,41 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
>> strict);
>> }
>>
>> +static bool bpf_enable_private_stack(struct bpf_verifier_env *env)
>> +{
>> + if (!bpf_jit_supports_private_stack())
>> + return false;
>> +
>> + switch (env->prog->type) {
>> + case BPF_PROG_TYPE_KPROBE:
>> + case BPF_PROG_TYPE_TRACEPOINT:
>> + case BPF_PROG_TYPE_PERF_EVENT:
>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
>> + return true;
>> + case BPF_PROG_TYPE_TRACING:
>> + if (env->prog->expected_attach_type != BPF_TRACE_ITER)
>> + return true;
>> + fallthrough;
>> + default:
>> + return false;
>> + }
>> +}
>> +
>> +static bool is_priv_stack_supported(struct bpf_verifier_env *env)
>> +{
>> + struct bpf_subprog_info *si = env->subprog_info;
>> + bool has_tail_call = false;
>> +
>> + for (int i = 0; i < env->subprog_cnt; i++) {
>> + if (si[i].has_tail_call) {
>> + has_tail_call = true;
>> + break;
>> + }
>> + }
>> +
>> + return !has_tail_call && bpf_enable_private_stack(env);
>> +}
>> +
>> static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
>> {
>> if (env->prog->jit_requested)
>> @@ -5999,16 +6036,21 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
>> * Since recursion is prevented by check_cfg() this algorithm
>> * only needs a local stack of MAX_CALL_FRAMES to remember callsites
>> */
>> -static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> +static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>> + bool check_priv_stack, bool priv_stack_supported)
>> {
>> struct bpf_subprog_info *subprog = env->subprog_info;
>> struct bpf_insn *insn = env->prog->insnsi;
>> int depth = 0, frame = 0, i, subprog_end;
>> bool tail_call_reachable = false;
>> + bool priv_stack_eligible = false;
>> int ret_insn[MAX_CALL_FRAMES];
>> int ret_prog[MAX_CALL_FRAMES];
>> - int j;
>> + int j, subprog_stack_depth;
>> + int orig_idx = idx;
>>
>> + if (check_priv_stack)
>> + subprog[idx].subtree_top_idx = idx;
>> i = subprog[idx].start;
>> process_func:
>> /* protect against potential stack overflow that might happen when
>> @@ -6030,18 +6072,33 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> * tailcall will unwind the current stack frame but it will not get rid
>> * of caller's stack as shown on the example above.
>> */
>> - if (idx && subprog[idx].has_tail_call && depth >= 256) {
>> + if (!check_priv_stack && idx && subprog[idx].has_tail_call && depth >= 256) {
>> verbose(env,
>> "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
>> depth);
>> return -EACCES;
>> }
>> - depth += round_up_stack_depth(env, subprog[idx].stack_depth);
>> - if (depth > MAX_BPF_STACK) {
>> + subprog_stack_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
>> + depth += subprog_stack_depth;
>> + if (!check_priv_stack && !priv_stack_supported && depth > MAX_BPF_STACK) {
>> verbose(env, "combined stack size of %d calls is %d. Too large\n",
>> frame + 1, depth);
>> return -EACCES;
>> }
>> + if (check_priv_stack) {
>> + if (subprog_stack_depth > MAX_BPF_STACK) {
>> + verbose(env, "stack size of subprog %d is %d. Too large\n",
>> + idx, subprog_stack_depth);
>> + return -EACCES;
>> + }
>> +
>> + if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
>> + subprog[orig_idx].priv_stack_eligible = true;
>> + env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
>> + }
>> + subprog[orig_idx].subtree_stack_depth =
>> + max_t(u16, subprog[orig_idx].subtree_stack_depth, depth);
>> + }
>> continue_func:
>> subprog_end = subprog[idx + 1].start;
>> for (; i < subprog_end; i++) {
>> @@ -6078,6 +6135,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> next_insn = i + insn[i].imm + 1;
>> sidx = find_subprog(env, next_insn);
>> if (sidx < 0) {
>> + /* It is possible that callback func has been removed as dead code after
>> + * instruction rewrites, e.g. bpf_loop with cnt 0.
>> + */
>> + if (check_priv_stack)
>> + continue;
>> +
> and this extra hack only because check_max_stack_depth() will
> be called the 2nd time ?
> Why call it twice at all ?
> Record everything in the first pass.
The individual stack size may increase between check_max_stack_depth() and jit.
So we have to go through a second pass to compute the precise subtree (prog + subprogs)
stack size, which is needed to allocate the percpu private stack.
One thing we could do is to record the (sub)prog<->subprog relations in the first
pass and, right before the jit, do another pass to calculate the subtree stack size.
I guess that is what you suggest?
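If so, something along these lines is what I would try. A rough sketch only:
compute_priv_stack_sizes() is hypothetical, it assumes subtree_stack_depth was
reset to zero beforehand, and it conservatively sums the rounded stack depths
of all subprogs attributed to a subtree top in the first pass instead of
re-walking call chains.
static void compute_priv_stack_sizes(struct bpf_verifier_env *env)
{
	struct bpf_subprog_info *si = env->subprog_info;

	for (int i = 0; i < env->subprog_cnt; i++) {
		/* pass one stored the owning subtree in subtree_top_idx;
		 * summing every member is a safe upper bound on the real
		 * max-call-chain depth of that subtree
		 */
		u16 top = si[i].subtree_top_idx;

		si[top].subtree_stack_depth +=
			round_up_stack_depth(env, si[i].stack_depth);
	}
}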
>
>> WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
>> next_insn);
>> return -EFAULT;
>> @@ -6097,8 +6160,10 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> }
>> i = next_insn;
>> idx = sidx;
>> + if (check_priv_stack)
>> + subprog[idx].subtree_top_idx = orig_idx;
>>
>> - if (subprog[idx].has_tail_call)
>> + if (!check_priv_stack && subprog[idx].has_tail_call)
>> tail_call_reachable = true;
>>
>> frame++;
>> @@ -6122,7 +6187,7 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> }
>> subprog[ret_prog[j]].tail_call_reachable = true;
>> }
>> - if (subprog[0].tail_call_reachable)
>> + if (!check_priv_stack && subprog[0].tail_call_reachable)
>> env->prog->aux->tail_call_reachable = true;
>>
>> /* end of for() loop means the last insn of the 'subprog'
>> @@ -6137,14 +6202,18 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
>> goto continue_func;
>> }
>>
>> -static int check_max_stack_depth(struct bpf_verifier_env *env)
>> +static int check_max_stack_depth(struct bpf_verifier_env *env, bool check_priv_stack,
>> + bool priv_stack_supported)
>> {
>> struct bpf_subprog_info *si = env->subprog_info;
>> + bool check_subprog;
>> int ret;
>>
>> for (int i = 0; i < env->subprog_cnt; i++) {
>> - if (!i || si[i].is_async_cb) {
>> - ret = check_max_stack_depth_subprog(env, i);
>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
> why?
> This looks very suspicious.
This is to simplify jit. For example,
main_prog <=== main_prog_priv_stack_ptr
subprog1 <=== there is a helper which has a callback_fn
<=== for example bpf_for_each_map_elem
callback_fn
subprog2
In callback_fn, we cannot simply do
r9 += stack_size_for_callback_fn
since r9 may have been clobbered between subprog1 and callback_fn.
That is why currently I allocate private_stack separately for callback_fn.
Alternatively we could do
callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
where off equals the subtree stack size of main_prog + subprog1.
I can do this approach too with a little more information in prog->aux.
WDYT?
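To spell out the two options at the JIT level, in pseudo-insns only (r9 as
in this series; this is an illustration, not actual emitted code):
Option 1 (current): each callback gets its own percpu stack, so its
prologue reloads the base and a clobbered r9 does not matter:
	r9 = callback_fn_priv_stack_ptr   /* percpu base of its own stack */
Option 2: one percpu stack per subtree with a fixed offset per frame:
	r9 = main_prog_priv_stack_ptr     /* percpu base of the subtree stack */
	r9 += off                         /* off = subtree stack size of main_prog + subprog1 */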
>
>> + if (check_subprog) {
>> + ret = check_max_stack_depth_subprog(env, i, check_priv_stack,
>> + priv_stack_supported);
>> if (ret < 0)
>> return ret;
>> }
>> @@ -22303,7 +22372,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
>> struct bpf_verifier_env *env;
>> int i, len, ret = -EINVAL, err;
>> u32 log_true_size;
>> - bool is_priv;
>> + bool is_priv, priv_stack_supported = false;
>>
>> /* no program is valid */
>> if (ARRAY_SIZE(bpf_verifier_ops) == 0)
>> @@ -22430,8 +22499,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
>> if (ret == 0)
>> ret = remove_fastcall_spills_fills(env);
>>
>> - if (ret == 0)
>> - ret = check_max_stack_depth(env);
>> + if (ret == 0) {
>> + priv_stack_supported = is_priv_stack_supported(env);
>> + ret = check_max_stack_depth(env, false, priv_stack_supported);
>> + }
>>
>> /* instruction rewrites happen after this point */
>> if (ret == 0)
>> @@ -22465,6 +22536,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
>> : false;
>> }
>>
>> + if (ret == 0 && priv_stack_supported)
>> + ret = check_max_stack_depth(env, true, true);
>> +
>> if (ret == 0)
>> ret = fixup_call_args(env);
>>
>> --
>> 2.43.5
>>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 3:21 ` Yonghong Song
@ 2024-10-22 3:43 ` Alexei Starovoitov
2024-10-22 4:08 ` Yonghong Song
2024-10-22 20:13 ` Yonghong Song
0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 3:43 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> >> for (int i = 0; i < env->subprog_cnt; i++) {
> >> - if (!i || si[i].is_async_cb) {
> >> - ret = check_max_stack_depth_subprog(env, i);
> >> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
> > why?
> > This looks very suspicious.
>
> This is to simplify jit. For example,
> main_prog <=== main_prog_priv_stack_ptr
> subprog1 <=== there is a helper which has a callback_fn
> <=== for example bpf_for_each_map_elem
>
> callback_fn
> subprog2
>
> In callback_fn, we cannot simply do
> r9 += stack_size_for_callback_fn
> since r9 may have been clobbered between subprog1 and callback_fn.
> That is why currently I allocate private_stack separately for callback_fn.
>
> Alternatively we could do
> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
> where off equals the subtree stack size of main_prog + subprog1.
> I can do this approach too with a little more information in prog->aux.
> WDYT?
I see. I think we're overcomplicating the verifier just to
be able to do 'r9 += stack' in the subprog.
The cases of async vs sync and directly vs kfunc/helper
(and soon with inlining of kfuncs) are getting too hard
to reason about.
I think we need to go back to the earlier approach
where every subprog had its own private stack and was
setting up r9 = my_priv_stack in the prologue.
I suspect it's possible to construct a convoluted subprog
that calls itself a limited number of times and the verifier allows that.
I feel it will be easier to detect just that condition
in the verifier and fallback to the normal stack.
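The detection could be as simple as checking, during the existing walk,
whether the callee is already on the simulated call stack. A sketch against
the ret_prog[] bookkeeping in check_max_stack_depth_subprog(); the
use_priv_stack flag is hypothetical:
	/* after sidx is resolved for a bpf-to-bpf call */
	bool on_stack = sidx == idx;

	for (j = 0; j < frame && !on_stack; j++)
		on_stack = ret_prog[j] == sidx;
	if (on_stack)
		/* potential self recursion: use the normal stack instead */
		subprog[sidx].use_priv_stack = false;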
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 3:43 ` Alexei Starovoitov
@ 2024-10-22 4:08 ` Yonghong Song
2024-10-22 20:13 ` Yonghong Song
1 sibling, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 4:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
> On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> for (int i = 0; i < env->subprog_cnt; i++) {
>>>> - if (!i || si[i].is_async_cb) {
>>>> - ret = check_max_stack_depth_subprog(env, i);
>>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
>>> why?
>>> This looks very suspicious.
>> This is to simplify jit. For example,
>> main_prog <=== main_prog_priv_stack_ptr
>> subprog1 <=== there is a helper which has a callback_fn
>> <=== for example bpf_for_each_map_elem
>>
>> callback_fn
>> subprog2
>>
>> In callback_fn, we cannot simply do
>> r9 += stack_size_for_callback_fn
>> since r9 may have been clobbered between subprog1 and callback_fn.
>> That is why currently I allocate private_stack separately for callback_fn.
>>
>> Alternatively we could do
>> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
>> where off equals the subtree stack size of main_prog + subprog1.
>> I can do this approach too with a little more information in prog->aux.
>> WDYT?
> I see. I think we're overcomplicating the verifier just to
> be able to do 'r9 += stack' in the subprog.
> The cases of async vs sync and directly vs kfunc/helper
> (and soon with inlining of kfuncs) are getting too hard
> to reason about.
>
> I think we need to go back to the earlier approach
> where every subprog had its own private stack and was
> setting up r9 = my_priv_stack in the prologue.
Indeed, a private stack per prog (and per subprog) will be much
simpler.
>
> I suspect it's possible to construct a convoluted subprog
> that calls itself a limited number of times and the verifier allows that.
> I feel it will be easier to detect just that condition
> in the verifier and fallback to the normal stack.
Yes, I think check_max_stack_depth_subprog() should be able to detect
subprog recursion.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-22 1:34 ` Alexei Starovoitov
2024-10-22 2:59 ` Yonghong Song
@ 2024-10-22 17:26 ` Martin KaFai Lau
2024-10-22 20:19 ` Alexei Starovoitov
1 sibling, 1 reply; 37+ messages in thread
From: Martin KaFai Lau @ 2024-10-22 17:26 UTC (permalink / raw)
To: Alexei Starovoitov, Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/21/24 6:34 PM, Alexei Starovoitov wrote:
> On Sun, Oct 20, 2024 at 12:16 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> To identify whether a st_ops program requests private stack or not,
>> the st_ops stub function is checked. If the stub function has the
>> following name
>> <st_ops_name>__<member_name>__priv_stack
>> then the corresponding st_ops member func requests to use private
>> stack. The information that the private stack is requested or not
>> is encoded in struct bpf_struct_ops_func_info which will later be
>> used by verifier.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf.h | 2 ++
>> kernel/bpf/bpf_struct_ops.c | 35 +++++++++++++++++++++++++----------
>> kernel/bpf/verifier.c | 8 +++++++-
>> 3 files changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index f3884ce2603d..376e43fc72b9 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1491,6 +1491,7 @@ struct bpf_prog_aux {
>> bool exception_boundary;
>> bool is_extended; /* true if extended by freplace program */
>> bool priv_stack_eligible;
>> + bool priv_stack_always;
>> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
>> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
>> struct bpf_arena *arena;
>> @@ -1776,6 +1777,7 @@ struct bpf_struct_ops {
>> struct bpf_struct_ops_func_info {
>> struct bpf_ctx_arg_aux *info;
>> u32 cnt;
>> + bool priv_stack_always;
>> };
>>
>> struct bpf_struct_ops_desc {
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 8279b5a57798..2cd4bd086c7a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -145,33 +145,44 @@ void bpf_struct_ops_image_free(void *image)
>> }
>>
>> #define MAYBE_NULL_SUFFIX "__nullable"
>> -#define MAX_STUB_NAME 128
>> +#define MAX_STUB_NAME 140
>>
>> /* Return the type info of a stub function, if it exists.
>> *
>> - * The name of a stub function is made up of the name of the struct_ops and
>> - * the name of the function pointer member, separated by "__". For example,
>> - * if the struct_ops type is named "foo_ops" and the function pointer
>> - * member is named "bar", the stub function name would be "foo_ops__bar".
>> + * The name of a stub function is made up of the name of the struct_ops,
>> + * the name of the function pointer member and optionally "priv_stack"
>> + * suffix, separated by "__". For example, if the struct_ops type is named
>> + * "foo_ops" and the function pointer member is named "bar", the stub
>> + * function name would be "foo_ops__bar". If a suffix "priv_stack" exists,
>> + * the stub function name would be "foo_ops__bar__priv_stack".
>> */
>> static const struct btf_type *
>> find_stub_func_proto(const struct btf *btf, const char *st_op_name,
>> - const char *member_name)
>> + const char *member_name, bool *priv_stack_always)
>> {
>> char stub_func_name[MAX_STUB_NAME];
>> const struct btf_type *func_type;
>> s32 btf_id;
>> int cp;
>>
>> - cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
>> + cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s__priv_stack",
>> st_op_name, member_name);
>
> I don't think this approach fits.
> pw-bot: cr
>
> Also looking at original
> commit 1611603537a4 ("bpf: Create argument information for nullable arguments.")
> that added this %s__%s notation I'm not sure why we went
> with that approach.
>
> Just to avoid adding __nullable suffix in the actual callback
> and using cfi stub callback names with such suffixes as
> a "proxy" for the real callback?
>
> Did we ever use this functionality for anything other than
> bpf_testmod_ops__test_maybe_null selftest ?
>
> Martin ?
The __nullable is to tag an argument of an ops. The member in the struct (e.g.
tcp_congestion_ops) is a pointer to FUNC_PROTO and its argument does not have an
argument name to tag. Hence, we went with tagging the actual FUNC in the cfi object.
The __nullable argument tagging request was originally from sched_ext but I also
don't see its usage in-tree for now.
For the priv_stack tagging, I also don't think it is a good way of doing it. It
is like adding __nullable to flag that the ops may return a NULL pointer, which I also
tried to avoid in the bpf-qdisc patch set.
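For context, the selftest stub looks like this (quoting from memory, so the
exact signature may be slightly off):
static int bpf_testmod_ops__test_maybe_null(int dummy,
					    struct task_struct *task__nullable)
{
	return 0;
}
The FUNC_PROTO member in bpf_testmod_ops itself carries no argument names,
so the __nullable tag can only live on this stub.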
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 3:43 ` Alexei Starovoitov
2024-10-22 4:08 ` Yonghong Song
@ 2024-10-22 20:13 ` Yonghong Song
2024-10-22 20:41 ` Alexei Starovoitov
1 sibling, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 20:13 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
> On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> for (int i = 0; i < env->subprog_cnt; i++) {
>>>> - if (!i || si[i].is_async_cb) {
>>>> - ret = check_max_stack_depth_subprog(env, i);
>>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
>>> why?
>>> This looks very suspicious.
>> This is to simplify jit. For example,
>> main_prog <=== main_prog_priv_stack_ptr
>> subprog1 <=== there is a helper which has a callback_fn
>> <=== for example bpf_for_each_map_elem
>>
>> callback_fn
>> subprog2
>>
>> In callback_fn, we cannot simply do
>> r9 += stack_size_for_callback_fn
>> since r9 may have been clobbered between subprog1 and callback_fn.
>> That is why currently I allocate private_stack separately for callback_fn.
>>
>> Alternatively we could do
>> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
>> where off equals the subtree stack size of main_prog + subprog1.
>> I can do this approach too with a little more information in prog->aux.
>> WDYT?
> I see. I think we're overcomplicating the verifier just to
> be able to do 'r9 += stack' in the subprog.
> The cases of async vs sync and directly vs kfunc/helper
> (and soon with inlining of kfuncs) are getting too hard
> to reason about.
>
> I think we need to go back to the earlier approach
> where every subprog had its own private stack and was
> setting up r9 = my_priv_stack in the prologue.
>
> I suspect it's possible to construct a convoluted subprog
> that calls itself a limited number of times and the verifier allows that.
> I feel it will be easier to detect just that condition
> in the verifier and fallback to the normal stack.
I tried a simple bpf prog below.
$ cat private_stack_subprog_recur.c
// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "../bpf_testmod/bpf_testmod.h"
char _license[] SEC("license") = "GPL";
#if defined(__TARGET_ARCH_x86)
bool skip __attribute((__section__(".data"))) = false;
#else
bool skip = true;
#endif
int i;
__noinline static void subprog1(int level)
{
if (level > 0) {
subprog1(level >> 1);
i++;
}
}
SEC("kprobe")
int prog1(void)
{
subprog1(1);
return 0;
}
In the above prog, we have a recursion of subprog1. The
callchain is:
prog -> subprog1 -> subprog1
The insn-level verification is successful since the argument
of subprog1() has a precise value.
But eventually, verification failed with the following message:
the call stack of 8 frames is too deep !
The error message is
if (frame >= MAX_CALL_FRAMES) {
verbose(env, "the call stack of %d frames is too deep !\n",
frame);
return -E2BIG;
}
in function check_max_stack_depth_subprog().
Basically, in check_max_stack_depth_subprog(), subprog calls are traced
based only on call insns. All conditionals are ignored.
In the above example, check_max_stack_depth_subprog() will have the
call graph like
prog -> subprog1 -> subprog1 -> subprog1 -> subprog1 -> ...
and eventually hit the error.
Basically, with check_max_stack_depth_subprog(), self recursion is not
possible for a bpf prog.
This limitation dates back to 2017:
commit 70a87ffea8ac ("bpf: fix maximum stack depth tracking logic")
So I assume people really do not write progs with self recursion inside
the main prog (including subprogs).
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-22 17:26 ` Martin KaFai Lau
@ 2024-10-22 20:19 ` Alexei Starovoitov
2024-10-23 21:00 ` Tejun Heo
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 20:19 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Oct 22, 2024 at 10:27 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/21/24 6:34 PM, Alexei Starovoitov wrote:
> > On Sun, Oct 20, 2024 at 12:16 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >> To identify whether a st_ops program requests private stack or not,
> >> the st_ops stub function is checked. If the stub function has the
> >> following name
> >> <st_ops_name>__<member_name>__priv_stack
> >> then the corresponding st_ops member func requests to use private
> >> stack. The information that the private stack is requested or not
> >> is encoded in struct bpf_struct_ops_func_info which will later be
> >> used by verifier.
> >>
> >> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> >> ---
> >> include/linux/bpf.h | 2 ++
> >> kernel/bpf/bpf_struct_ops.c | 35 +++++++++++++++++++++++++----------
> >> kernel/bpf/verifier.c | 8 +++++++-
> >> 3 files changed, 34 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index f3884ce2603d..376e43fc72b9 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1491,6 +1491,7 @@ struct bpf_prog_aux {
> >> bool exception_boundary;
> >> bool is_extended; /* true if extended by freplace program */
> >> bool priv_stack_eligible;
> >> + bool priv_stack_always;
> >> u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
> >> struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
> >> struct bpf_arena *arena;
> >> @@ -1776,6 +1777,7 @@ struct bpf_struct_ops {
> >> struct bpf_struct_ops_func_info {
> >> struct bpf_ctx_arg_aux *info;
> >> u32 cnt;
> >> + bool priv_stack_always;
> >> };
> >>
> >> struct bpf_struct_ops_desc {
> >> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> >> index 8279b5a57798..2cd4bd086c7a 100644
> >> --- a/kernel/bpf/bpf_struct_ops.c
> >> +++ b/kernel/bpf/bpf_struct_ops.c
> >> @@ -145,33 +145,44 @@ void bpf_struct_ops_image_free(void *image)
> >> }
> >>
> >> #define MAYBE_NULL_SUFFIX "__nullable"
> >> -#define MAX_STUB_NAME 128
> >> +#define MAX_STUB_NAME 140
> >>
> >> /* Return the type info of a stub function, if it exists.
> >> *
> >> - * The name of a stub function is made up of the name of the struct_ops and
> >> - * the name of the function pointer member, separated by "__". For example,
> >> - * if the struct_ops type is named "foo_ops" and the function pointer
> >> - * member is named "bar", the stub function name would be "foo_ops__bar".
> >> + * The name of a stub function is made up of the name of the struct_ops,
> >> + * the name of the function pointer member and optionally "priv_stack"
> >> + * suffix, separated by "__". For example, if the struct_ops type is named
> >> + * "foo_ops" and the function pointer member is named "bar", the stub
> >> + * function name would be "foo_ops__bar". If a suffix "priv_stack" exists,
> >> + * the stub function name would be "foo_ops__bar__priv_stack".
> >> */
> >> static const struct btf_type *
> >> find_stub_func_proto(const struct btf *btf, const char *st_op_name,
> >> - const char *member_name)
> >> + const char *member_name, bool *priv_stack_always)
> >> {
> >> char stub_func_name[MAX_STUB_NAME];
> >> const struct btf_type *func_type;
> >> s32 btf_id;
> >> int cp;
> >>
> >> - cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s",
> >> + cp = snprintf(stub_func_name, MAX_STUB_NAME, "%s__%s__priv_stack",
> >> st_op_name, member_name);
> >
> > I don't think this approach fits.
> > pw-bot: cr
> >
> > Also looking at original
> > commit 1611603537a4 ("bpf: Create argument information for nullable arguments.")
> > that added this %s__%s notation I'm not sure why we went
> > with that approach.
> >
> > Just to avoid adding __nullable suffix in the actual callback
> > and using cfi stub callback names with such suffixes as
> > a "proxy" for the real callback?
> >
> > Did we ever use this functionality for anything other than
> > bpf_testmod_ops__test_maybe_null selftest ?
> >
> > Martin ?
>
> The __nullable is to tag an argument of an ops. The member in the struct (e.g.
> tcp_congestion_ops) is a pointer to FUNC_PROTO and its argument does not have an
> argument name to tag. Hence, we went with tagging the actual FUNC in the cfi object.
Ahh. Right. That makes sense.
> The __nullable argument tagging request was originally from sched_ext but I also
> don't see its usage in-tree for now.
ok. Let's sync up with Tejun on whether they have plans to use it.
> For the priv_stack tagging, I also don't think it is a good way of doing it. It
> is like adding __nullable to flag that the ops may return a NULL pointer, which I also
> tried to avoid in the bpf-qdisc patch set.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 20:13 ` Yonghong Song
@ 2024-10-22 20:41 ` Alexei Starovoitov
2024-10-22 21:29 ` Kumar Kartikeya Dwivedi
2024-10-22 21:43 ` Yonghong Song
0 siblings, 2 replies; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 20:41 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Oct 22, 2024 at 1:13 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
> > On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>> for (int i = 0; i < env->subprog_cnt; i++) {
> >>>> - if (!i || si[i].is_async_cb) {
> >>>> - ret = check_max_stack_depth_subprog(env, i);
> >>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
> >>> why?
> >>> This looks very suspicious.
> >> This is to simplify jit. For example,
> >> main_prog <=== main_prog_priv_stack_ptr
> >> subprog1 <=== there is a helper which has a callback_fn
> >> <=== for example bpf_for_each_map_elem
> >>
> >> callback_fn
> >> subprog2
> >>
> >> In callback_fn, we cannot simply do
> >> r9 += stack_size_for_callback_fn
> >> since r9 may have been clobbered between subprog1 and callback_fn.
> >> That is why currently I allocate private_stack separately for callback_fn.
> >>
> >> Alternatively we could do
> >> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
> >> where off equals the subtree stack size of main_prog + subprog1.
> >> I can do this approach too with a little more information in prog->aux.
> >> WDYT?
> > I see. I think we're overcomplicating the verifier just to
> > be able to do 'r9 += stack' in the subprog.
> > The cases of async vs sync and directly vs kfunc/helper
> > (and soon with inlining of kfuncs) are getting too hard
> > to reason about.
> >
> > I think we need to go back to the earlier approach
> > where every subprog had its own private stack and was
> > setting up r9 = my_priv_stack in the prologue.
> >
> > I suspect it's possible to construct a convoluted subprog
> > > that calls itself a limited number of times and the verifier allows that.
> > I feel it will be easier to detect just that condition
> > in the verifier and fallback to the normal stack.
>
> I tried a simple bpf prog below.
>
> $ cat private_stack_subprog_recur.c
> // SPDX-License-Identifier: GPL-2.0
>
> #include <vmlinux.h>
> #include <bpf/bpf_helpers.h>
> #include <bpf/bpf_tracing.h>
> #include "../bpf_testmod/bpf_testmod.h"
>
> char _license[] SEC("license") = "GPL";
>
> #if defined(__TARGET_ARCH_x86)
> bool skip __attribute((__section__(".data"))) = false;
> #else
> bool skip = true;
> #endif
>
> int i;
>
> __noinline static void subprog1(int level)
> {
> if (level > 0) {
> subprog1(level >> 1);
> i++;
> }
> }
>
> SEC("kprobe")
> int prog1(void)
> {
> subprog1(1);
> return 0;
> }
>
> In the above prog, we have a recursion of subprog1. The
> callchain is:
> prog -> subprog1 -> subprog1
>
> The insn-level verification is successful since argument
> of subprog1() has precise value.
>
> But eventually, verification failed with the following message:
> the call stack of 8 frames is too deep !
>
> The error message is
> if (frame >= MAX_CALL_FRAMES) {
> verbose(env, "the call stack of %d frames is too deep !\n",
> frame);
> return -E2BIG;
> }
> in function check_max_stack_depth_subprog().
> Basically in function check_max_stack_depth_subprog(), tracing subprog
> call is done only based on call insn. All conditionals are ignored.
> In the above example, check_max_stack_depth_subprog() will have the
> call graph like
> prog -> subprog1 -> subprog1 -> subprog1 -> subprog1 -> ...
> and eventually hit the error.
>
> Basically with check_max_stack_depth_subprog() self recursion is not
> possible for a bpf prog.
>
> This limitation is back to year 2017.
> commit 70a87ffea8ac bpf: fix maximum stack depth tracking logic
>
> So I assume people really do not write progs with self recursion inside
> the main prog (including subprogs).
Thanks for checking this part.
What about sync and async callbacks? Can they recurse?
Since progs are preemptible is the following possible:
__noinline static void subprog(void)
{
/* delay */
}
static int timer_cb(void *map, int *key, void *val)
{
subprog();
}
SEC("tc")
int prog1(void)
{
bpf_timer_set_callback( &timer_cb);
subprog();
return 0;
}
timers use softirq.
I'm not sure whether it's the same stack or not.
So it may be borderline ok-ish for other reasons,
but the question remains. Will subprog recurse this way?
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 20:41 ` Alexei Starovoitov
@ 2024-10-22 21:29 ` Kumar Kartikeya Dwivedi
2024-10-22 21:36 ` Kumar Kartikeya Dwivedi
2024-10-22 21:43 ` Yonghong Song
1 sibling, 1 reply; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-10-22 21:29 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, 22 Oct 2024 at 22:41, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 22, 2024 at 1:13 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> >
> > On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
> > > On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >>>> for (int i = 0; i < env->subprog_cnt; i++) {
> > >>>> - if (!i || si[i].is_async_cb) {
> > >>>> - ret = check_max_stack_depth_subprog(env, i);
> > >>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
> > >>> why?
> > >>> This looks very suspicious.
> > >> This is to simplify jit. For example,
> > >> main_prog <=== main_prog_priv_stack_ptr
> > >> subprog1 <=== there is a helper which has a callback_fn
> > >> <=== for example bpf_for_each_map_elem
> > >>
> > >> callback_fn
> > >> subprog2
> > >>
> > >> In callback_fn, we cannot simply do
> > >> r9 += stack_size_for_callback_fn
> > >> since r9 may have been clobbered between subprog1 and callback_fn.
> > >> That is why currently I allocate private_stack separately for callback_fn.
> > >>
> > >> Alternatively we could do
> > >> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
> > >> where off equals the subtree stack size of main_prog + subprog1.
> > >> I can do this approach too with a little more information in prog->aux.
> > >> WDYT?
> > > I see. I think we're overcomplicating the verifier just to
> > > be able to do 'r9 += stack' in the subprog.
> > > The cases of async vs sync and directly vs kfunc/helper
> > > (and soon with inlining of kfuncs) are getting too hard
> > > to reason about.
> > >
> > > I think we need to go back to the earlier approach
> > > where every subprog had its own private stack and was
> > > setting up r9 = my_priv_stack in the prologue.
> > >
> > > I suspect it's possible to construct a convoluted subprog
> > > > that calls itself a limited number of times and the verifier allows that.
> > > I feel it will be easier to detect just that condition
> > > in the verifier and fallback to the normal stack.
> >
> > I tried a simple bpf prog below.
> >
> > $ cat private_stack_subprog_recur.c
> > // SPDX-License-Identifier: GPL-2.0
> >
> > #include <vmlinux.h>
> > #include <bpf/bpf_helpers.h>
> > #include <bpf/bpf_tracing.h>
> > #include "../bpf_testmod/bpf_testmod.h"
> >
> > char _license[] SEC("license") = "GPL";
> >
> > #if defined(__TARGET_ARCH_x86)
> > bool skip __attribute((__section__(".data"))) = false;
> > #else
> > bool skip = true;
> > #endif
> >
> > int i;
> >
> > __noinline static void subprog1(int level)
> > {
> > if (level > 0) {
> > subprog1(level >> 1);
> > i++;
> > }
> > }
> >
> > SEC("kprobe")
> > int prog1(void)
> > {
> > subprog1(1);
> > return 0;
> > }
> >
> > In the above prog, we have a recursion of subprog1. The
> > callchain is:
> > prog -> subprog1 -> subprog1
> >
> > The insn-level verification is successful since argument
> > of subprog1() has precise value.
> >
> > But eventually, verification failed with the following message:
> > the call stack of 8 frames is too deep !
> >
> > The error message is
> > if (frame >= MAX_CALL_FRAMES) {
> > verbose(env, "the call stack of %d frames is too deep !\n",
> > frame);
> > return -E2BIG;
> > }
> > in function check_max_stack_depth_subprog().
> > Basically in function check_max_stack_depth_subprog(), tracing subprog
> > call is done only based on call insn. All conditionals are ignored.
> > In the above example, check_max_stack_depth_subprog() will have the
> > call graph like
> > prog -> subprog1 -> subprog1 -> subprog1 -> subprog1 -> ...
> > and eventually hit the error.
> >
> > Basically with check_max_stack_depth_subprog() self recursion is not
> > possible for a bpf prog.
> >
> > This limitation is back to year 2017.
> > commit 70a87ffea8ac bpf: fix maximum stack depth tracking logic
> >
> > So I assume people really do not write progs with self recursion inside
> > the main prog (including subprogs).
>
> Thanks for checking this part.
>
> What about sync and async callbacks? Can they recurse?
>
> Since progs are preemptible is the following possible:
>
> __noinline static void subprog(void)
> {
> /* delay */
> }
>
> static int timer_cb(void *map, int *key, void *val)
> {
> subprog();
> }
>
> SEC("tc")
> int prog1(void)
> {
> bpf_timer_set_callback( &timer_cb);
> subprog();
> return 0;
> }
>
> timers use softirq.
> I'm not sure whether it's the same stack or not.
> So it may be borderline ok-ish for other reasons,
> but the question remains. Will subprog recurse this way?
>
Yes, but not in the normal ways.
There can be only one softirq context per CPU (even on preemptible RT
with timers running in kthreads), but timer_cb can also be called
directly by the prog. So any other context the same prog can execute
in will allow it to call timer_cb while another invocation is
potentially preempted out on the same CPU.
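Spelled out as one possible interleaving on a single cpu:
	task ctx: prog1 -> subprog()     /* frame lives on the percpu priv stack */
	softirq:  timer_cb -> subprog()  /* same percpu stack, same offsets */
	task ctx: subprog() resumes      /* its frame may now be clobbered */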
It might be better to disallow directly calling such async callbacks,
because I'm not sure anyone relies on that behavior, and it is
something I've previously looked at (for exception_cb, which is
disallowed to be called directly due to the distinct way its prologue is
set up).
We'll also need to remember this when/if we introduce hardirq mode for
BPF timers.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 21:29 ` Kumar Kartikeya Dwivedi
@ 2024-10-22 21:36 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 37+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-10-22 21:36 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, 22 Oct 2024 at 23:29, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Tue, 22 Oct 2024 at 22:41, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2024 at 1:13 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > >
> > >
> > > On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
> > > > On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> > > >>>> for (int i = 0; i < env->subprog_cnt; i++) {
> > > >>>> - if (!i || si[i].is_async_cb) {
> > > >>>> - ret = check_max_stack_depth_subprog(env, i);
> > > >>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
> > > >>> why?
> > > >>> This looks very suspicious.
> > > >> This is to simplify jit. For example,
> > > >> main_prog <=== main_prog_priv_stack_ptr
> > > >> subprog1 <=== there is a helper which has a callback_fn
> > > >> <=== for example bpf_for_each_map_elem
> > > >>
> > > >> callback_fn
> > > >> subprog2
> > > >>
> > > >> In callback_fn, we cannot simply do
> > > >> r9 += stack_size_for_callback_fn
> > > >> since r9 may have been clobbered between subprog1 and callback_fn.
> > > >> That is why currently I allocate private_stack separately for callback_fn.
> > > >>
> > > >> Alternatively we could do
> > > >> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
> > > >> where off equals the subtree stack size of main_prog + subprog1.
> > > >> I can do this approach too with a little more information in prog->aux.
> > > >> WDYT?
> > > > I see. I think we're overcomplicating the verifier just to
> > > > be able to do 'r9 += stack' in the subprog.
> > > > The cases of async vs sync and directly vs kfunc/helper
> > > > (and soon with inlining of kfuncs) are getting too hard
> > > > to reason about.
> > > >
> > > > I think we need to go back to the earlier approach
> > > > where every subprog had its own private stack and was
> > > > setting up r9 = my_priv_stack in the prologue.
> > > >
> > > > I suspect it's possible to construct a convoluted subprog
> > > > that calls itself a limited number of times and the verifier allows that.
> > > > I feel it will be easier to detect just that condition
> > > > in the verifier and fallback to the normal stack.
> > >
> > > I tried a simple bpf prog below.
> > >
> > > $ cat private_stack_subprog_recur.c
> > > // SPDX-License-Identifier: GPL-2.0
> > >
> > > #include <vmlinux.h>
> > > #include <bpf/bpf_helpers.h>
> > > #include <bpf/bpf_tracing.h>
> > > #include "../bpf_testmod/bpf_testmod.h"
> > >
> > > char _license[] SEC("license") = "GPL";
> > >
> > > #if defined(__TARGET_ARCH_x86)
> > > bool skip __attribute((__section__(".data"))) = false;
> > > #else
> > > bool skip = true;
> > > #endif
> > >
> > > int i;
> > >
> > > __noinline static void subprog1(int level)
> > > {
> > > if (level > 0) {
> > > subprog1(level >> 1);
> > > i++;
> > > }
> > > }
> > >
> > > SEC("kprobe")
> > > int prog1(void)
> > > {
> > > subprog1(1);
> > > return 0;
> > > }
> > >
> > > In the above prog, we have a recursion of subprog1. The
> > > callchain is:
> > > prog -> subprog1 -> subprog1
> > >
> > > The insn-level verification is successful since argument
> > > of subprog1() has precise value.
> > >
> > > But eventually, verification failed with the following message:
> > > the call stack of 8 frames is too deep !
> > >
> > > The error message is
> > > if (frame >= MAX_CALL_FRAMES) {
> > > verbose(env, "the call stack of %d frames is too deep !\n",
> > > frame);
> > > return -E2BIG;
> > > }
> > > in function check_max_stack_depth_subprog().
> > > Basically in function check_max_stack_depth_subprog(), tracing subprog
> > > call is done only based on call insn. All conditionals are ignored.
> > > In the above example, check_max_stack_depth_subprog() will have the
> > > call graph like
> > > prog -> subprog1 -> subprog1 -> subprog1 -> subprog1 -> ...
> > > and eventually hit the error.
> > >
> > > Basically with check_max_stack_depth_subprog() self recursion is not
> > > possible for a bpf prog.
> > >
> > > This limitation is back to year 2017.
> > > commit 70a87ffea8ac bpf: fix maximum stack depth tracking logic
> > >
> > > So I assume people really do not write progs with self recursion inside
> > > the main prog (including subprogs).
> >
> > Thanks for checking this part.
> >
> > What about sync and async callbacks? Can they recurse?
> >
> > Since progs are preemptible is the following possible:
> >
> > __noinline static void subprog(void)
> > {
> > /* delay */
> > }
> >
> > static int timer_cb(void *map, int *key, void *val)
> > {
> > subprog();
> > }
> >
> > SEC("tc")
> > int prog1(void)
> > {
> > bpf_timer_set_callback( &timer_cb);
> > subprog();
> > return 0;
> > }
> >
> > timers use softirq.
> > I'm not sure whether it's the same stack or not.
> > So it may be borderline ok-ish for other reasons,
> > but the question remains. Will subprog recurse this way?
> >
>
> Yes, but not in the normal ways.
> There can be only one softirq context per-CPU (even on preemptible RT
> with timers running in kthreads), but timer_cb can also be called
> directly by the prog. So any other context the same prog can execute
> in will allow it to call timer_cb while another invocation is
> potentially preempted out on the same CPU.
> It might be better to disallow direct calling such async callbacks,
> because I'm not sure anyone relies on that behavior, but it is
> something I've previously looked at (for exception_cb, which is
> disallowed to be called directly due to the distinct way prologue is
> set up).
Ah, in your example it's a subprog() called by both. Yeah, I guess we
can't really prevent that from happening.
>
> We'll also need to remember this when/if we introduce hardirq mode for
> BPF timers.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 20:41 ` Alexei Starovoitov
2024-10-22 21:29 ` Kumar Kartikeya Dwivedi
@ 2024-10-22 21:43 ` Yonghong Song
2024-10-22 21:57 ` Alexei Starovoitov
1 sibling, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 21:43 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/22/24 1:41 PM, Alexei Starovoitov wrote:
> On Tue, Oct 22, 2024 at 1:13 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/21/24 8:43 PM, Alexei Starovoitov wrote:
>>> On Mon, Oct 21, 2024 at 8:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>>> for (int i = 0; i < env->subprog_cnt; i++) {
>>>>>> - if (!i || si[i].is_async_cb) {
>>>>>> - ret = check_max_stack_depth_subprog(env, i);
>>>>>> + check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
>>>>> why?
>>>>> This looks very suspicious.
>>>> This is to simplify jit. For example,
>>>> main_prog <=== main_prog_priv_stack_ptr
>>>> subprog1 <=== there is a helper which has a callback_fn
>>>> <=== for example bpf_for_each_map_elem
>>>>
>>>> callback_fn
>>>> subprog2
>>>>
>>>> In callback_fn, we cannot simply do
>>>> r9 += stack_size_for_callback_fn
>>>> since r9 may have been clobbered between subprog1 and callback_fn.
>>>> That is why currently I allocate private_stack separately for callback_fn.
>>>>
>>>> Alternatively we could do
>>>> callback_fn_priv_stack_ptr = main_prog_priv_stack_ptr + off
>>>> where off equals the subtree stack size of main_prog + subprog1.
>>>> I can do this approach too with a little more information in prog->aux.
>>>> WDYT?
>>> I see. I think we're overcomplicating the verifier just to
>>> be able to do 'r9 += stack' in the subprog.
>>> The cases of async vs sync and directly vs kfunc/helper
>>> (and soon with inlining of kfuncs) are getting too hard
>>> to reason about.
>>>
>>> I think we need to go back to the earlier approach
>>> where every subprog had its own private stack and was
>>> setting up r9 = my_priv_stack in the prologue.
>>>
>>> I suspect it's possible to construct a convoluted subprog
>>> that calls itself a limited number of times and the verifier allows that.
>>> I feel it will be easier to detect just that condition
>>> in the verifier and fallback to the normal stack.
>> I tried a simple bpf prog below.
>>
>> $ cat private_stack_subprog_recur.c
>> // SPDX-License-Identifier: GPL-2.0
>>
>> #include <vmlinux.h>
>> #include <bpf/bpf_helpers.h>
>> #include <bpf/bpf_tracing.h>
>> #include "../bpf_testmod/bpf_testmod.h"
>>
>> char _license[] SEC("license") = "GPL";
>>
>> #if defined(__TARGET_ARCH_x86)
>> bool skip __attribute((__section__(".data"))) = false;
>> #else
>> bool skip = true;
>> #endif
>>
>> int i;
>>
>> __noinline static void subprog1(int level)
>> {
>> if (level > 0) {
>> subprog1(level >> 1);
>> i++;
>> }
>> }
>>
>> SEC("kprobe")
>> int prog1(void)
>> {
>> subprog1(1);
>> return 0;
>> }
>>
>> In the above prog, we have a recursion of subprog1. The
>> callchain is:
>> prog -> subprog1 -> subprog1
>>
>> The insn-level verification is successful since argument
>> of subprog1() has precise value.
>>
>> But eventually, verification failed with the following message:
>> the call stack of 8 frames is too deep !
>>
>> The error message is
>> if (frame >= MAX_CALL_FRAMES) {
>> verbose(env, "the call stack of %d frames is too deep !\n",
>> frame);
>> return -E2BIG;
>> }
>> in function check_max_stack_depth_subprog().
>> Basically in function check_max_stack_depth_subprog(), tracing subprog
>> call is done only based on call insn. All conditionals are ignored.
>> In the above example, check_max_stack_depth_subprog() will have the
>> call graph like
>> prog -> subprog1 -> subprog1 -> subprog1 -> subprog1 -> ...
>> and eventually hit the error.
>>
>> Basically with check_max_stack_depth_subprog() self recursion is not
>> possible for a bpf prog.
>>
>> This limitation is back to year 2017.
>> commit 70a87ffea8ac bpf: fix maximum stack depth tracking logic
>>
>> So I assume people really do not write progs with self recursion inside
>> the main prog (including subprogs).
> Thanks for checking this part.
>
> What about sync and async callbacks? Can they recurse?
For sync callbacks, there is no recursion between subprogs.
This is due to the following function:
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
struct bpf_subprog_info *si = env->subprog_info;
int ret;
for (int i = 0; i < env->subprog_cnt; i++) {
if (!i || si[i].is_async_cb) {
ret = check_max_stack_depth_subprog(env, i);
if (ret < 0)
return ret;
}
continue;
}
return 0;
}
A subprog walk only starts from the main prog or an async_cb,
so a regular sync callback is treated similarly
to any other directly called subprog.
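For example, with
	main_prog -> subprog1 -> callback_fn (sync, via helper)
	timer_cb (async) -> subprog2
only main_prog and timer_cb are walk roots; callback_fn and the
subprogs are covered by the walks that reach them.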
>
> Since progs are preemptible is the following possible:
>
> __noinline static void subprog(void)
> {
> /* delay */
> }
>
> static int timer_cb(void *map, int *key, void *val)
> {
> subprog();
> }
>
> SEC("tc")
> int prog1(void)
> {
> bpf_timer_set_callback( &timer_cb);
> subprog();
> return 0;
> }
>
> timers use softirq.
> I'm not sure whether it's the same stack or not.
> So it may be borderline ok-ish for other reasons,
> but the question remains. Will subprog recurse this way?
But for an async cb, as you mentioned, it is possible that
prog1->subprog is called in process context
and the callback chain timer_cb->subprog is called in a
nested way on top of prog1->subprog.
To handle such cases, I guess I can refactor the code
to record the maximum stack_tree_depth in the subprog info and
do the checking after subprog 0 and all async
progs are processed.
To handle the case where a subprog may be used in more than one
subtree (subprog 0 tree or async tree), I need to
add a 'visited' field to bpf_subprog_info.
I think this should work.
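Roughly like this (the visited bit and the reset loop are hypothetical
at this point):
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
	struct bpf_subprog_info *si = env->subprog_info;
	int ret;

	for (int i = 0; i < env->subprog_cnt; i++) {
		if (i && !si[i].is_async_cb)
			continue;
		/* a subprog may sit in both the subprog-0 tree and an
		 * async tree, so reset visited before each root walk and
		 * let the walk mark every subprog it reaches
		 */
		for (int j = 0; j < env->subprog_cnt; j++)
			si[j].visited = false;
		ret = check_max_stack_depth_subprog(env, i);
		if (ret < 0)
			return ret;
	}
	return 0;
}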
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 21:43 ` Yonghong Song
@ 2024-10-22 21:57 ` Alexei Starovoitov
2024-10-22 22:41 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 21:57 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Oct 22, 2024 at 2:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> To handle the case where a subprog may be used in more than one
> subtree (subprog 0 tree or async tree), I need to
> add a 'visited' field to bpf_subprog_info.
> I think this should work.
This is getting quite complicated.
But looks like we have even bigger problem:
SEC("lsm/...")
int BPF_PROG(...)
{
volatile char buf[..];
buf[..] =
}
The approach to have per-prog per-cpu priv stack
doesn't work for the above.
Sleepable and non-sleepable LSM progs are preemptible.
Multiple tasks can be running the same program on the same cpu
preempting each other.
The priv stack of this prog will be corrupted.
Maybe it won't be an issue for sched-ext prog
attached to a cgroup, but it feels fragile for bpf infra
to rely on implementation detail of another subsystem.
We probably need to go back to the drawing board.
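Concretely (hypothetical interleaving):
	task A: runs the LSM prog on cpu 0, writes buf[] on the priv stack
	task A: gets preempted mid-prog
	task B: runs the same prog on cpu 0, writing the same percpu addresses
	task A: resumes with its buf[] corrupted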
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 21:57 ` Alexei Starovoitov
@ 2024-10-22 22:41 ` Yonghong Song
2024-10-22 22:59 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 22:41 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/22/24 2:57 PM, Alexei Starovoitov wrote:
> On Tue, Oct 22, 2024 at 2:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>> To handle the case where a subprog may be used in more than one
>> subtree (subprog 0 tree or async tree), I need to
>> add a 'visited' field to bpf_subprog_info.
>> I think this should work.
> This is getting quite complicated.
>
> But looks like we have even bigger problem:
>
> SEC("lsm/...")
> int BPF_PROG(...)
> {
> volatile char buf[..];
> buf[..] =
> }
If I understand correctly, lsm/... corresponds to BPF_PROG_TYPE_LSM prog type.
The current implementation only supports the following plus struct_ops programs.
+ switch (env->prog->type) {
+ case BPF_PROG_TYPE_KPROBE:
+ case BPF_PROG_TYPE_TRACEPOINT:
+ case BPF_PROG_TYPE_PERF_EVENT:
+ case BPF_PROG_TYPE_RAW_TRACEPOINT:
+ return true;
+ case BPF_PROG_TYPE_TRACING:
+ if (env->prog->expected_attach_type != BPF_TRACE_ITER)
+ return true;
+ fallthrough;
+ default:
+ return false;
+ }
I do agree that lsm programs will have issues if using a private stack,
since preemption is possible and we don't have a recursion check for
them (which is right in order to provide correct functionality).
>
> The approach to have per-prog per-cpu priv stack
> doesn't work for the above.
> Sleepable and non-sleepable LSM progs are preemptible.
> Multiple tasks can be running the same program on the same cpu
> preempting each other.
> The priv stack of this prog will be corrupted.
>
> Maybe it won't be an issue for sched-ext prog
> attached to a cgroup, but it feels fragile for bpf infra
> to rely on implementation detail of another subsystem.
> We probably need to go back to the drawing board.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 22:41 ` Yonghong Song
@ 2024-10-22 22:59 ` Alexei Starovoitov
2024-10-22 23:53 ` Yonghong Song
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-22 22:59 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Oct 22, 2024 at 3:41 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 10/22/24 2:57 PM, Alexei Starovoitov wrote:
> > On Tue, Oct 22, 2024 at 2:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >> To handle the case where a subprog may be used in more than one
> >> subtree (subprog 0 tree or async tree), I need to
> >> add a 'visited' field to bpf_subprog_info.
> >> I think this should work.
> > This is getting quite complicated.
> >
> > But looks like we have even bigger problem:
> >
> > SEC("lsm/...")
> > int BPF_PROG(...)
> > {
> > volatile char buf[..];
> > buf[..] =
> > }
>
> If I understand correctly, lsm/... corresponds to BPF_PROG_TYPE_LSM prog type.
> The current implementation only supports the following plus struct_ops programs.
>
> + switch (env->prog->type) {
> + case BPF_PROG_TYPE_KPROBE:
> + case BPF_PROG_TYPE_TRACEPOINT:
> + case BPF_PROG_TYPE_PERF_EVENT:
> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> + return true;
> + case BPF_PROG_TYPE_TRACING:
> + if (env->prog->expected_attach_type != BPF_TRACE_ITER)
> + return true;
> + fallthrough;
> + default:
> + return false;
> + }
>
> I do agree that lsm programs will have issues if using a private stack,
> since preemption is possible and we don't have a recursion check for
> them (which is right in order to provide correct functionality).
static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
{
switch (resolve_prog_type(prog)) {
case BPF_PROG_TYPE_TRACING:
return prog->expected_attach_type != BPF_TRACE_ITER;
case BPF_PROG_TYPE_STRUCT_OPS:
case BPF_PROG_TYPE_LSM:
return false;
default:
return true;
}
}
The LSM prog is just an example. The same issue exists with struct_ops
progs, and struct_ops sched-ext progs are the main motivation for adding
the priv stack.

sched-ext will signal to bpf that it needs a priv stack, and we would
have to add a "recursion no more than 1" check. There is a chance (as
the LSM prog above demonstrates) that struct_ops progs will hit this
recursion check and the prog will not be run.
The miss count will increment, of course, but the whole priv stack
feature for struct_ops becomes unreliable.
Hence the patches become questionable: why add a feature when the main
user will struggle to use it?
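For reference, the tracing-side protection works roughly like the sketch
below, modeled on the trampoline's per-prog per-cpu active counter (a
simplified sketch, not the exact kernel code):

static __always_inline bool prog_enter_guard(struct bpf_prog *prog)
{
	/* prog->active is a per-cpu counter; any value other than 1
	 * after the increment means this prog was re-entered on this
	 * cpu, via recursion or preemption.
	 */
	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		/* Skip this run and record a miss -- this skipped run
		 * is exactly what makes priv stack for struct_ops
		 * unreliable.
		 */
		bpf_prog_inc_misses_counter(prog);
		this_cpu_dec(*(prog->active));
		return false;
	}
	return true;
}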
* Re: [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes
2024-10-22 22:59 ` Alexei Starovoitov
@ 2024-10-22 23:53 ` Yonghong Song
0 siblings, 0 replies; 37+ messages in thread
From: Yonghong Song @ 2024-10-22 23:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 10/22/24 3:59 PM, Alexei Starovoitov wrote:
> On Tue, Oct 22, 2024 at 3:41 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/22/24 2:57 PM, Alexei Starovoitov wrote:
>>> On Tue, Oct 22, 2024 at 2:43 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> To handle the case where a subprog may be used in more than one
>>>> subtree (subprog 0 tree or async tree), I need to
>>>> add a 'visited' field to bpf_subprog_info.
>>>> I think this should work.
>>> This is getting quite complicated.
>>>
>>> But looks like we have even bigger problem:
>>>
>>> SEC("lsm/...")
>>> int BPF_PROG(...)
>>> {
>>> volatile char buf[..];
>>> buf[..] =
>>> }
>> If I understand correctly, lsm/... corresponds to the BPF_PROG_TYPE_LSM prog type.
>> The current implementation only supports the following prog types, plus struct_ops programs:
>>
>> + switch (env->prog->type) {
>> + case BPF_PROG_TYPE_KPROBE:
>> + case BPF_PROG_TYPE_TRACEPOINT:
>> + case BPF_PROG_TYPE_PERF_EVENT:
>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
>> + return true;
>> + case BPF_PROG_TYPE_TRACING:
>> + if (env->prog->expected_attach_type != BPF_TRACE_ITER)
>> + return true;
>> + fallthrough;
>> + default:
>> + return false;
>> + }
>>
>> I do agree that lsm programs will have issues if they use a private stack,
>> since preemption is possible and we don't have a recursion check for them
>> (which is the right choice in order to provide correct functionality).
> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
> {
> switch (resolve_prog_type(prog)) {
> case BPF_PROG_TYPE_TRACING:
> return prog->expected_attach_type != BPF_TRACE_ITER;
> case BPF_PROG_TYPE_STRUCT_OPS:
> case BPF_PROG_TYPE_LSM:
> return false;
> default:
> return true;
> }
> }
>
> The LSM prog is just an example. The same issue exists with struct_ops
> progs, and struct_ops sched-ext progs are the main motivation for adding
> the priv stack.
>
> sched-ext will signal to bpf that it needs a priv stack, and we would
> have to add a "recursion no more than 1" check. There is a chance (as
> the LSM prog above demonstrates) that struct_ops progs will hit this
> recursion check and the prog will not be run.
> The miss count will increment, of course, but the whole priv stack
> feature for struct_ops becomes unreliable.
> Hence the patches become questionable: why add a feature when the main
> user will struggle to use it?
Indeed, this is a known issue we were kind of already aware of.
The recursion check (regardless of whether the limit is one or four)
may cause the prog not to run if the actual recursion level goes
beyond what the check allows.
I guess we indeed need to go back to the drawing board again,
starting from struct_ops, which is the main motivation for this
idea.
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-22 20:19 ` Alexei Starovoitov
@ 2024-10-23 21:00 ` Tejun Heo
2024-10-23 23:07 ` Alexei Starovoitov
0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2024-10-23 21:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Martin KaFai Lau, Yonghong Song, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Kernel Team, Martin KaFai Lau
Hello,
On Tue, Oct 22, 2024 at 01:19:58PM -0700, Alexei Starovoitov wrote:
> > The __nullable argument tagging request was originally from sched_ext but I also
> > don't see its usage in-tree for now.
>
> ok. Let's sync up with Tejun whether they have plans to use it.
Yeah, in sched_ext_ops.dispatch(s32 cpu, struct task_struct *prev), @prev
can be NULL, and right now, if a BPF scheduler derefs it without checking for
NULL, it can trigger a kernel crash, I think, so it needs __nullable tagging.
Thanks.
--
tejun
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-23 21:00 ` Tejun Heo
@ 2024-10-23 23:07 ` Alexei Starovoitov
2024-10-24 0:56 ` Tejun Heo
0 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2024-10-23 23:07 UTC (permalink / raw)
To: Tejun Heo
Cc: Martin KaFai Lau, Yonghong Song, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Wed, Oct 23, 2024 at 2:00 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Oct 22, 2024 at 01:19:58PM -0700, Alexei Starovoitov wrote:
> > > The __nullable argument tagging request was originally from sched_ext but I also
> > > don't see its usage in-tree for now.
> >
> > ok. Let's sync up with Tejun whether they have plans to use it.
>
> Yeah, in sched_ext_ops.dispatch(s32 cpu, struct task_struct *prev), @prev
> can be NULL, and right now, if a BPF scheduler derefs it without checking for
> NULL, it can trigger a kernel crash, I think, so it needs __nullable tagging.
I see. The following should do it:
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3cd7c50a51c5..82bef41d7eae 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5492,7 +5492,7 @@ static int bpf_scx_validate(void *kdata)
static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) { return -EINVAL; }
static void enqueue_stub(struct task_struct *p, u64 enq_flags) {}
static void dequeue_stub(struct task_struct *p, u64 enq_flags) {}
-static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {}
+static void dispatch_stub(s32 prev_cpu, struct task_struct *p__nullable) {}
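On the BPF scheduler side the verifier would then refuse to load a
dispatch callback that dereferences @prev without a NULL check. A minimal
sketch, assuming the usual sched_ext BPF helpers (the callback name is
made up):

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	/* @prev may be NULL; with the __nullable tag the verifier
	 * rejects any dereference that is not behind this check.
	 */
	if (!prev)
		return;
	bpf_printk("dispatch on cpu %d, prev pid %d", cpu, prev->pid);
}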
* Re: [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs
2024-10-23 23:07 ` Alexei Starovoitov
@ 2024-10-24 0:56 ` Tejun Heo
0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2024-10-24 0:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Martin KaFai Lau, Yonghong Song, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Kernel Team, Martin KaFai Lau
On Wed, Oct 23, 2024 at 04:07:49PM -0700, Alexei Starovoitov wrote:
> On Wed, Oct 23, 2024 at 2:00 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Tue, Oct 22, 2024 at 01:19:58PM -0700, Alexei Starovoitov wrote:
> > > > The __nullable argument tagging request was originally from sched_ext but I also
> > > > don't see its usage in-tree for now.
> > >
> > > ok. Let's sync up with Tejun whether they have plans to use it.
> >
> > Yeah, in sched_ext_ops.dispatch(s32 cpu, struct task_struct *prev), @prev
> > can be NULL, and right now, if a BPF scheduler derefs it without checking for
> > NULL, it can trigger a kernel crash, I think, so it needs __nullable tagging.
>
> I see. The following should do it:
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 3cd7c50a51c5..82bef41d7eae 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5492,7 +5492,7 @@ static int bpf_scx_validate(void *kdata)
> static s32 select_cpu_stub(struct task_struct *p, s32 prev_cpu, u64 wake_flags) { return -EINVAL; }
> static void enqueue_stub(struct task_struct *p, u64 enq_flags) {}
> static void dequeue_stub(struct task_struct *p, u64 enq_flags) {}
> -static void dispatch_stub(s32 prev_cpu, struct task_struct *p) {}
> +static void dispatch_stub(s32 prev_cpu, struct task_struct *p__nullable) {}
This is a lot neater than the existing workaround:
http://lkml.kernel.org/r/Zxma-ZFPKYZDqCGu@slm.duckdns.org
Thanks!
--
tejun
Thread overview: 37+ messages
2024-10-20 19:13 [PATCH bpf-next v6 0/9] bpf: Support private stack for bpf progs Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 1/9] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
2024-10-22 1:18 ` Alexei Starovoitov
2024-10-22 3:21 ` Yonghong Song
2024-10-22 3:43 ` Alexei Starovoitov
2024-10-22 4:08 ` Yonghong Song
2024-10-22 20:13 ` Yonghong Song
2024-10-22 20:41 ` Alexei Starovoitov
2024-10-22 21:29 ` Kumar Kartikeya Dwivedi
2024-10-22 21:36 ` Kumar Kartikeya Dwivedi
2024-10-22 21:43 ` Yonghong Song
2024-10-22 21:57 ` Alexei Starovoitov
2024-10-22 22:41 ` Yonghong Song
2024-10-22 22:59 ` Alexei Starovoitov
2024-10-22 23:53 ` Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 2/9] bpf: Rename bpf_struct_ops_arg_info to bpf_struct_ops_func_info Yonghong Song
2024-10-20 19:13 ` [PATCH bpf-next v6 3/9] bpf: Support private stack for struct ops programs Yonghong Song
2024-10-22 1:34 ` Alexei Starovoitov
2024-10-22 2:59 ` Yonghong Song
2024-10-22 17:26 ` Martin KaFai Lau
2024-10-22 20:19 ` Alexei Starovoitov
2024-10-23 21:00 ` Tejun Heo
2024-10-23 23:07 ` Alexei Starovoitov
2024-10-24 0:56 ` Tejun Heo
2024-10-20 19:14 ` [PATCH bpf-next v6 4/9] bpf: Mark each subprog with proper private stack modes Yonghong Song
2024-10-20 22:01 ` Jiri Olsa
2024-10-21 4:22 ` Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 5/9] bpf, x86: Refactor func emit_prologue Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 6/9] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 7/9] bpf, x86: Add jit support for private stack Yonghong Song
2024-10-20 19:14 ` [PATCH bpf-next v6 8/9] selftests/bpf: Add tracing prog private stack tests Yonghong Song
2024-10-20 21:59 ` Jiri Olsa
2024-10-21 4:32 ` Yonghong Song
2024-10-21 10:40 ` Jiri Olsa
2024-10-21 16:19 ` Yonghong Song
2024-10-21 21:13 ` Jiri Olsa
2024-10-20 19:14 ` [PATCH bpf-next v6 9/9] selftests/bpf: Add struct_ops " Yonghong Song