* [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs
@ 2024-11-04 19:34 Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 01/10] bpf: Check stack depth limit after visiting all subprogs Yonghong Song
` (9 more replies)
0 siblings, 10 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:34 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
The main motivation for private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
- each cgroup will have its own associated bpf program, and
- the bpf program of a parent cgroup will call the bpf programs
in its immediate child cgroups.
Let us say we have the following cgroup hierarchy:
root_cg (prog0):
cg1 (prog1):
cg11 (prog11):
cg111 (prog111)
cg112 (prog112)
cg12 (prog12):
cg121 (prog121)
cg122 (prog122)
cg2 (prog2):
cg21 (prog21)
cg22 (prog22)
cg23 (prog23)
In the above example, prog0 will call a kfunc which will call prog1 and
prog2 to get sched info for cg1 and cg2 and then the information is
summarized and sent back to prog0. Similarly, prog11 and prog12 will be
invoked in the kfunc and the result will be summarized and sent back to
prog1, etc. The following illustrates a possible call sequence:
... -> bpf prog A -> kfunc -> ops.<callback_fn> (bpf prog B) ...
Currently, for each thread, the x86 kernel allocates a 16KB stack. Each
bpf program (including its subprograms) has a maximum stack size of 512
bytes to avoid potential stack overflow. Nested bpf programs further
increase the risk of stack overflow. To avoid potential stack overflow
caused by bpf programs, this patch set adds private stack support, with
bpf program stack space allocated at verification time. Using private
stacks for bpf progs can reduce or avoid potential kernel stack overflow.
Currently private stack is applied to tracing programs like kprobe/uprobe,
perf_event, tracepoint, raw tracepoint and struct_ops progs.
Tracing progs enable private stack if any subprog stack size exceeds a
threshold (i.e. 64 bytes). Struct-ops progs enable private stack based
on the particular struct_ops implementation, which can enable private
stack before verification at the per-insn level.
For all these progs, the kernel does a recursion check (no nesting of
the same prog on the same cpu) to ensure that the private stack won't
be overwritten.
The bpf_prog_aux struct has a callback func recursion_detected() which
can be implemented by a kernel subsystem to synchronously detect recursion,
report an error, etc.
Only the x86_64 arch supports private stack for now. It can be extended
to other archs later. Please see each individual patch for details.
Change logs:
v8 -> v9:
- v8 link: https://lore.kernel.org/bpf/20241101030950.2677215-1-yonghong.song@linux.dev/
- Use enum to express priv stack mode.
- Use bits in bpf_subprog_info struct to do subprog recursion check between
main/async and async subprogs.
- Fix potential memory leak.
- Rename recursion detection func from recursion_skipped() to recursion_detected().
v7 -> v8:
- v7 link: https://lore.kernel.org/bpf/20241029221637.264348-1-yonghong.song@linux.dev/
- Add recursion_skipped() callback func to the bpf_prog->aux structure such that if
a recursion miss happens and bpf_prog->aux->recursion_skipped is not NULL, the
callback fn will be called so the subsystem can take proper action based on its
respective design.
v6 -> v7:
- v6 link: https://lore.kernel.org/bpf/20241020191341.2104841-1-yonghong.song@linux.dev/
- Going back to private stack allocation per prog instead of per subtree. This
simplifies the implementation and avoids verifier complexity.
- Handle potential nested subprog run if async callback exists.
- Use struct_ops->check_member() callback to set whether a particular struct-ops
prog wants private stack or not.
v5 -> v6:
- v5 link: https://lore.kernel.org/bpf/20241017223138.3175885-1-yonghong.song@linux.dev/
- Instead of using (or not using) private stack at struct_ops level,
each prog in struct_ops can decide whether to use private stack or not.
v4 -> v5:
- v4 link: https://lore.kernel.org/bpf/20241010175552.1895980-1-yonghong.song@linux.dev/
- Remove bpf_prog_call() related implementation.
- Allow (opt-in) private stack for sched-ext progs.
v3 -> v4:
- v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
There is a long discussion in the above v3 link trying to allow private
stack to be used by kernel functions in order to simplify implementation.
But unfortunately we didn't find a workable solution yet, so we return
to the approach where private stack is only used by bpf programs.
- Add bpf_prog_call() kfunc.
v2 -> v3:
- Instead of per-subprog private stack allocation, allocate private
stacks at the main prog or callback entry prog. Subprogs that are not main
or callback progs will increment the inherited stack pointer to use as their
frame pointer.
- Private stack allows each prog's max stack size to be 512 bytes, instead
of limiting the whole prog hierarchy to 512 bytes.
- Add some tests.
Yonghong Song (10):
bpf: Check stack depth limit after visiting all subprogs
bpf: Return false for bpf_prog_check_recur() default case
bpf: Allow private stack to have each subprog having stack size of 512
bytes
bpf: Check potential private stack recursion for progs with async
callback
bpf: Allocate private stack for eligible main prog or subprogs
bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth
bpf, x86: Support private stack in jit
selftests/bpf: Add tracing prog private stack tests
bpf: Support private stack for struct_ops progs
selftests/bpf: Add struct_ops prog private stack tests
arch/x86/net/bpf_jit_comp.c | 73 ++++-
include/linux/bpf.h | 3 +
include/linux/bpf_verifier.h | 7 +-
include/linux/filter.h | 1 +
kernel/bpf/core.c | 24 +-
kernel/bpf/trampoline.c | 4 +
kernel/bpf/verifier.c | 150 +++++++++-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 104 +++++++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 5 +
.../bpf/prog_tests/struct_ops_private_stack.c | 106 +++++++
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../bpf/progs/struct_ops_private_stack.c | 62 ++++
.../bpf/progs/struct_ops_private_stack_fail.c | 62 ++++
.../progs/struct_ops_private_stack_recur.c | 50 ++++
.../bpf/progs/verifier_private_stack.c | 272 ++++++++++++++++++
15 files changed, 906 insertions(+), 19 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c
--
2.43.5
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH bpf-next v9 01/10] bpf: Check stack depth limit after visiting all subprogs
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case Yonghong Song
` (8 subsequent siblings)
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Check the stack depth limit after all subprogs are visited. Note that if
private stack is enabled, the only stack size restriction is for a single
subprog, whose size must be less than or equal to MAX_BPF_STACK.
In subsequent patches, in function check_max_stack_depth(), there could
be a flip from enabling private stack to disabling private stack due to
a potential nested bpf subprog run. Moving the stack depth limit check after
visiting all subprogs ensures the check is not missed in such flipping
cases.
The useless 'continue' statement in the loop in function
check_max_stack_depth() is also removed.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
kernel/bpf/verifier.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ba800c7611e3..ed8f70e51141 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6032,7 +6032,8 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
* Since recursion is prevented by check_cfg() this algorithm
* only needs a local stack of MAX_CALL_FRAMES to remember callsites
*/
-static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
+static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
+ int *subtree_depth, int *depth_frame)
{
struct bpf_subprog_info *subprog = env->subprog_info;
struct bpf_insn *insn = env->prog->insnsi;
@@ -6070,10 +6071,9 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
return -EACCES;
}
depth += round_up_stack_depth(env, subprog[idx].stack_depth);
- if (depth > MAX_BPF_STACK) {
- verbose(env, "combined stack size of %d calls is %d. Too large\n",
- frame + 1, depth);
- return -EACCES;
+ if (depth > MAX_BPF_STACK && !*subtree_depth) {
+ *subtree_depth = depth;
+ *depth_frame = frame + 1;
}
continue_func:
subprog_end = subprog[idx + 1].start;
@@ -6173,15 +6173,19 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
struct bpf_subprog_info *si = env->subprog_info;
- int ret;
+ int ret, subtree_depth = 0, depth_frame;
for (int i = 0; i < env->subprog_cnt; i++) {
if (!i || si[i].is_async_cb) {
- ret = check_max_stack_depth_subprog(env, i);
+ ret = check_max_stack_depth_subprog(env, i, &subtree_depth, &depth_frame);
if (ret < 0)
return ret;
}
- continue;
+ }
+ if (subtree_depth > MAX_BPF_STACK) {
+ verbose(env, "combined stack size of %d calls is %d. Too large\n",
+ depth_frame, subtree_depth);
+ return -EACCES;
}
return 0;
}
--
2.43.5
* [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 01/10] bpf: Check stack depth limit after visiting all subprogs Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-05 1:21 ` Alexei Starovoitov
2024-11-04 19:35 ` [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes Yonghong Song
` (7 subsequent siblings)
9 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
The bpf_prog_check_recur() function is currently used by trampoline
and tracing programs (also using trampolines) to check whether a
particular prog supports recursion checking or not. The default case
(non-trampoline progs) returns true in the current implementation.
Let us make the non-trampoline prog recursion check return false
instead. This does not impact any existing use cases and allows the
function to be used outside the trampoline context in the next patch.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf_verifier.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 4513372c5bc8..ad887c68d3e1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -889,9 +889,8 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
return prog->expected_attach_type != BPF_TRACE_ITER;
case BPF_PROG_TYPE_STRUCT_OPS:
case BPF_PROG_TYPE_LSM:
- return false;
default:
- return true;
+ return false;
}
}
--
2.43.5
* [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 01/10] bpf: Check stack depth limit after visiting all subprogs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-05 2:47 ` Alexei Starovoitov
2024-11-04 19:35 ` [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback Yonghong Song
` (6 subsequent siblings)
9 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
With private stack support, each subprog can have a stack of up to 512
bytes. The limit of 512 bytes per subprog is kept to avoid increasing
verifier complexity, since going beyond 512 bytes would require big
verifier changes and increase memory consumption and verification time.
If private stack is supported and a certain stack size threshold is reached,
that subprog will have its own private stack allocated.
In this patch, some tracing programs are allowed to use private
stack since a tracing prog may be triggered in the middle of some other
prog's run. The supported tracing programs already have a recursion check
such that if the same prog runs on the same cpu again, the nested
prog run will be skipped. This ensures the bpf prog's private stack is not
overwritten.
Note that if any tail_call is used in the prog (including all subprogs),
then private stack is not used.
The function bpf_enable_priv_stack() returns one of NO_PRIV_STACK,
PRIV_STACK_ADAPTIVE and PRIV_STACK_ALWAYS. NO_PRIV_STACK means private
stack is not enabled, PRIV_STACK_ADAPTIVE means private stack is enabled
under certain conditions (e.g. a stack size threshold), and PRIV_STACK_ALWAYS
means private stack is always enabled. PRIV_STACK_ALWAYS will be used by
struct_ops progs in later patches.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 1 +
include/linux/bpf_verifier.h | 1 +
include/linux/filter.h | 1 +
kernel/bpf/core.c | 5 +++
kernel/bpf/verifier.c | 73 +++++++++++++++++++++++++++++++++---
5 files changed, 76 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c3ba4d475174..8db3c5d7404b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1523,6 +1523,7 @@ struct bpf_prog_aux {
bool exception_cb;
bool exception_boundary;
bool is_extended; /* true if extended by freplace program */
+ bool use_priv_stack;
u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
struct bpf_arena *arena;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index ad887c68d3e1..0622c11a7e19 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -668,6 +668,7 @@ struct bpf_subprog_info {
bool args_cached: 1;
/* true if bpf_fastcall stack region is used by functions that can't be inlined */
bool keep_fastcall_stack: 1;
+ bool use_priv_stack: 1;
u8 arg_cnt;
struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d7578a8eac1..3a21947f2fd4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
bool bpf_jit_supports_ptr_xchg(void);
bool bpf_jit_supports_arena(void);
bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
+bool bpf_jit_supports_private_stack(void);
u64 bpf_arch_uaddress_limit(void);
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
bool bpf_helper_changes_pkt_data(void *func);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 233ea78f8f1b..14d9288441f2 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3045,6 +3045,11 @@ bool __weak bpf_jit_supports_exceptions(void)
return false;
}
+bool __weak bpf_jit_supports_private_stack(void)
+{
+ return false;
+}
+
void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
{
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ed8f70e51141..406195c433ea 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -194,6 +194,8 @@ struct bpf_verifier_stack_elem {
#define BPF_GLOBAL_PERCPU_MA_MAX_SIZE 512
+#define BPF_PRIV_STACK_MIN_SIZE 64
+
static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
@@ -6015,6 +6017,38 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
strict);
}
+enum priv_stack_mode {
+ NO_PRIV_STACK,
+ PRIV_STACK_ADAPTIVE,
+ PRIV_STACK_ALWAYS,
+};
+
+static enum priv_stack_mode bpf_enable_priv_stack(struct bpf_prog *prog)
+{
+ if (!bpf_jit_supports_private_stack())
+ return NO_PRIV_STACK;
+
+ /* bpf_prog_check_recur() checks all prog types that use bpf trampoline
+ * while kprobe/tp/perf_event/raw_tp don't use trampoline hence checked
+ * explicitly.
+ */
+ switch (prog->type) {
+ case BPF_PROG_TYPE_KPROBE:
+ case BPF_PROG_TYPE_TRACEPOINT:
+ case BPF_PROG_TYPE_PERF_EVENT:
+ case BPF_PROG_TYPE_RAW_TRACEPOINT:
+ return PRIV_STACK_ADAPTIVE;
+ default:
+ break;
+ }
+
+ if (!bpf_prog_check_recur(prog))
+ return NO_PRIV_STACK;
+
+
+ return PRIV_STACK_ADAPTIVE;
+}
+
static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
{
if (env->prog->jit_requested)
@@ -6033,11 +6067,12 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
* only needs a local stack of MAX_CALL_FRAMES to remember callsites
*/
static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
- int *subtree_depth, int *depth_frame)
+ int *subtree_depth, int *depth_frame,
+ int priv_stack_supported)
{
struct bpf_subprog_info *subprog = env->subprog_info;
struct bpf_insn *insn = env->prog->insnsi;
- int depth = 0, frame = 0, i, subprog_end;
+ int depth = 0, frame = 0, i, subprog_end, subprog_depth;
bool tail_call_reachable = false;
int ret_insn[MAX_CALL_FRAMES];
int ret_prog[MAX_CALL_FRAMES];
@@ -6070,11 +6105,23 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
depth);
return -EACCES;
}
- depth += round_up_stack_depth(env, subprog[idx].stack_depth);
+ subprog_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
+ depth += subprog_depth;
if (depth > MAX_BPF_STACK && !*subtree_depth) {
*subtree_depth = depth;
*depth_frame = frame + 1;
}
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ if (!subprog[idx].use_priv_stack) {
+ if (subprog_depth > MAX_BPF_STACK) {
+ verbose(env, "stack size of subprog %d is %d. Too large\n",
+ idx, subprog_depth);
+ return -EACCES;
+ }
+ if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
+ subprog[idx].use_priv_stack = true;
+ }
+ }
continue_func:
subprog_end = subprog[idx + 1].start;
for (; i < subprog_end; i++) {
@@ -6173,20 +6220,36 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
struct bpf_subprog_info *si = env->subprog_info;
+ enum priv_stack_mode priv_stack_supported;
int ret, subtree_depth = 0, depth_frame;
+ priv_stack_supported = bpf_enable_priv_stack(env->prog);
+
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ for (int i = 0; i < env->subprog_cnt; i++) {
+ if (!si[i].has_tail_call)
+ continue;
+ priv_stack_supported = NO_PRIV_STACK;
+ break;
+ }
+ }
+
for (int i = 0; i < env->subprog_cnt; i++) {
if (!i || si[i].is_async_cb) {
- ret = check_max_stack_depth_subprog(env, i, &subtree_depth, &depth_frame);
+ ret = check_max_stack_depth_subprog(env, i, &subtree_depth, &depth_frame,
+ priv_stack_supported);
if (ret < 0)
return ret;
}
}
- if (subtree_depth > MAX_BPF_STACK) {
+
+ if (priv_stack_supported == NO_PRIV_STACK && subtree_depth > MAX_BPF_STACK) {
verbose(env, "combined stack size of %d calls is %d. Too large\n",
depth_frame, subtree_depth);
return -EACCES;
}
+ if (si[0].use_priv_stack)
+ env->prog->aux->use_priv_stack = true;
return 0;
}
--
2.43.5
* [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (2 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-05 2:51 ` Alexei Starovoitov
2024-11-04 19:35 ` [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs Yonghong Song
` (5 subsequent siblings)
9 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
In the previous patch, tracing progs were enabled for private stack since
recursion checking ensures there is no nested run of the same bpf prog on
the same cpu.
But a nested bpf subprog run on the same cpu is still possible
if the same subprog is called in both the main prog and an async callback,
or in different async callbacks. For example,
main_prog
bpf_timer_set_callback(timer, timer_cb);
call sub1
sub1
...
timer_cb
call sub1
In the above case, a nested subprog run for sub1 is possible, with one run
in process context and the other in softirq context. If this is the case,
the verifier will disable private stack for this bpf prog.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf_verifier.h | 2 ++
kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
2 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0622c11a7e19..e921589abc72 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -669,6 +669,8 @@ struct bpf_subprog_info {
/* true if bpf_fastcall stack region is used by functions that can't be inlined */
bool keep_fastcall_stack: 1;
bool use_priv_stack: 1;
+ bool visited_with_priv_stack_accum: 1;
+ bool visited_with_priv_stack: 1;
u8 arg_cnt;
struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 406195c433ea..e01b3f0fd314 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
idx, subprog_depth);
return -EACCES;
}
- if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
+ if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
subprog[idx].use_priv_stack = true;
+ subprog[idx].visited_with_priv_stack = true;
+ }
+ } else {
+ subprog[idx].visited_with_priv_stack = true;
}
}
continue_func:
@@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
struct bpf_subprog_info *si = env->subprog_info;
+ enum priv_stack_mode orig_priv_stack_supported;
enum priv_stack_mode priv_stack_supported;
int ret, subtree_depth = 0, depth_frame;
priv_stack_supported = bpf_enable_priv_stack(env->prog);
+ orig_priv_stack_supported = priv_stack_supported;
if (priv_stack_supported != NO_PRIV_STACK) {
for (int i = 0; i < env->subprog_cnt; i++) {
@@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
priv_stack_supported);
if (ret < 0)
return ret;
+
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ for (int j = 0; j < env->subprog_cnt; j++) {
+ if (si[j].visited_with_priv_stack_accum &&
+ si[j].visited_with_priv_stack) {
+ /* si[j] is visited by both main/async subprog
+ * and another async subprog.
+ */
+ priv_stack_supported = NO_PRIV_STACK;
+ break;
+ }
+ if (!si[j].visited_with_priv_stack_accum)
+ si[j].visited_with_priv_stack_accum =
+ si[j].visited_with_priv_stack;
+ }
+ }
+ if (priv_stack_supported != NO_PRIV_STACK) {
+ for (int j = 0; j < env->subprog_cnt; j++)
+ si[j].visited_with_priv_stack = false;
+ }
}
}
- if (priv_stack_supported == NO_PRIV_STACK && subtree_depth > MAX_BPF_STACK) {
- verbose(env, "combined stack size of %d calls is %d. Too large\n",
- depth_frame, subtree_depth);
- return -EACCES;
+ if (priv_stack_supported == NO_PRIV_STACK) {
+ if (subtree_depth > MAX_BPF_STACK) {
+ verbose(env, "combined stack size of %d calls is %d. Too large\n",
+ depth_frame, subtree_depth);
+ return -EACCES;
+ }
+ if (orig_priv_stack_supported == PRIV_STACK_ADAPTIVE) {
+ for (int i = 0; i < env->subprog_cnt; i++)
+ si[i].use_priv_stack = false;
+ }
}
if (si[0].use_priv_stack)
env->prog->aux->use_priv_stack = true;
--
2.43.5
* [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (3 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-05 1:38 ` Alexei Starovoitov
2024-11-04 19:35 ` [PATCH bpf-next v9 06/10] bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth Yonghong Song
` (4 subsequent siblings)
9 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
For the main prog or any subprog, allocate private stack space if requested
by the subprog info or the main prog. The private stack is 16-byte aligned
since the maximum stack alignment is 16 for bpf-enabled archs.
If jit fails, the allocated private stack is freed in the same
function where the allocation happened. If jit succeeds, e.g., for the
x86_64 arch, the allocated private stack is freed in the arch-specific
implementation of bpf_jit_free().
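The kernel's __alloc_percpu_gfp(stack_depth, 16, GFP_KERNEL) has no userspace equivalent, but the allocation scheme can be sketched as follows (alloc_priv_stacks and nr_cpus are illustrative stand-ins, not kernel APIs):

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch of the private stack allocation scheme: one stack region per
 * cpu, 16-byte aligned, sized by the verifier-computed stack_depth.
 * Returns NULL on failure after freeing partial allocations, mirroring
 * the "free in the same function on failure" rule of the patch.
 */
static void **alloc_priv_stacks(int nr_cpus, size_t stack_depth)
{
	void **stacks = calloc(nr_cpus, sizeof(void *));

	if (!stacks)
		return NULL;
	for (int i = 0; i < nr_cpus; i++) {
		/* aligned_alloc requires size to be a multiple of alignment */
		stacks[i] = aligned_alloc(16, (stack_depth + 15) & ~(size_t)15);
		if (!stacks[i]) {
			while (i--)
				free(stacks[i]);
			free(stacks);
			return NULL;
		}
	}
	return stacks;
}
```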
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 1 +
include/linux/bpf.h | 1 +
kernel/bpf/core.c | 19 ++++++++++++++++---
kernel/bpf/verifier.c | 13 +++++++++++++
4 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 06b080b61aa5..59d294b8dd67 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -3544,6 +3544,7 @@ void bpf_jit_free(struct bpf_prog *prog)
prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset();
hdr = bpf_jit_binary_pack_hdr(prog);
bpf_jit_binary_pack_free(hdr, NULL);
+ free_percpu(prog->aux->priv_stack_ptr);
WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8db3c5d7404b..8a3ea7440a4a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1507,6 +1507,7 @@ struct bpf_prog_aux {
u32 max_rdwr_access;
struct btf *attach_btf;
const struct bpf_ctx_arg_aux *ctx_arg_info;
+ void __percpu *priv_stack_ptr;
struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
struct bpf_prog *dst_prog;
struct bpf_trampoline *dst_trampoline;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 14d9288441f2..f7a3e93c41e1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2396,6 +2396,7 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
*/
struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
{
+ void __percpu *priv_stack_ptr = NULL;
/* In case of BPF to BPF calls, verifier did all the prep
* work with regards to JITing, etc.
*/
@@ -2421,11 +2422,23 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
if (*err)
return fp;
+ if (fp->aux->use_priv_stack && fp->aux->stack_depth) {
+ priv_stack_ptr = __alloc_percpu_gfp(fp->aux->stack_depth, 16, GFP_KERNEL);
+ if (!priv_stack_ptr) {
+ *err = -ENOMEM;
+ return fp;
+ }
+ fp->aux->priv_stack_ptr = priv_stack_ptr;
+ }
+
fp = bpf_int_jit_compile(fp);
bpf_prog_jit_attempt_done(fp);
- if (!fp->jited && jit_needed) {
- *err = -ENOTSUPP;
- return fp;
+ if (!fp->jited) {
+ free_percpu(priv_stack_ptr);
+ if (jit_needed) {
+ *err = -ENOTSUPP;
+ return fp;
+ }
}
} else {
*err = bpf_prog_offload_compile(fp);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e01b3f0fd314..03ae76d57076 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20073,6 +20073,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
{
struct bpf_prog *prog = env->prog, **func, *tmp;
int i, j, subprog_start, subprog_end = 0, len, subprog;
+ void __percpu *priv_stack_ptr;
struct bpf_map *map_ptr;
struct bpf_insn *insn;
void *old_bpf_func;
@@ -20169,6 +20170,17 @@ static int jit_subprogs(struct bpf_verifier_env *env)
func[i]->aux->name[0] = 'F';
func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
+
+ if (env->subprog_info[i].use_priv_stack && func[i]->aux->stack_depth) {
+ priv_stack_ptr = __alloc_percpu_gfp(func[i]->aux->stack_depth, 16,
+ GFP_KERNEL);
+ if (!priv_stack_ptr) {
+ err = -ENOMEM;
+ goto out_free;
+ }
+ func[i]->aux->priv_stack_ptr = priv_stack_ptr;
+ }
+
func[i]->jit_requested = 1;
func[i]->blinding_requested = prog->blinding_requested;
func[i]->aux->kfunc_tab = prog->aux->kfunc_tab;
@@ -20201,6 +20213,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
func[i]->aux->exception_boundary = env->seen_exception;
func[i] = bpf_int_jit_compile(func[i]);
if (!func[i]->jited) {
+ free_percpu(func[i]->aux->priv_stack_ptr);
err = -ENOTSUPP;
goto out_free;
}
--
2.43.5
* [PATCH bpf-next v9 06/10] bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (4 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 07/10] bpf, x86: Support private stack in jit Yonghong Song
` (3 subsequent siblings)
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Refactor the code to avoid repeated usage of bpf_prog->aux->stack_depth
in the do_jit() func. If the private stack is used, the stack_depth will be
0 for that prog. The refactoring makes it easy to adjust stack_depth.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 59d294b8dd67..181d9f04418f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1425,14 +1425,17 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int i, excnt = 0;
int ilen, proglen = 0;
u8 *prog = temp;
+ u32 stack_depth;
int err;
+ stack_depth = bpf_prog->aux->stack_depth;
+
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
detect_reg_usage(insn, insn_cnt, callee_regs_used);
- emit_prologue(&prog, bpf_prog->aux->stack_depth,
+ emit_prologue(&prog, stack_depth,
bpf_prog_was_classic(bpf_prog), tail_call_reachable,
bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
/* Exception callback will clobber callee regs for its own use, and
@@ -2128,7 +2131,7 @@ st: if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
if (tail_call_reachable) {
- LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+ LOAD_TAIL_CALL_CNT_PTR(stack_depth);
ip += 7;
}
if (!imm32)
@@ -2145,13 +2148,13 @@ st: if (is_imm8(insn->off))
&bpf_prog->aux->poke_tab[imm32 - 1],
&prog, image + addrs[i - 1],
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
ctx);
else
emit_bpf_tail_call_indirect(bpf_prog,
&prog,
callee_regs_used,
- bpf_prog->aux->stack_depth,
+ stack_depth,
image + addrs[i - 1],
ctx);
break;
--
2.43.5
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH bpf-next v9 07/10] bpf, x86: Support private stack in jit
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (5 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 06/10] bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 08/10] selftests/bpf: Add tracing prog private stack tests Yonghong Song
` (2 subsequent siblings)
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Support the private stack in the JIT. The x86 register R9 (X86_REG_R9)
is used to replace the bpf frame register (BPF_REG_10). The private
stack is used per subprog if it is enabled by the verifier. X86_REG_R9
is saved and restored around every function call (not including tail
calls) to maintain its correctness.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
arch/x86/net/bpf_jit_comp.c | 61 +++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 181d9f04418f..4ee69071c26d 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -325,6 +325,22 @@ struct jit_context {
/* Number of bytes that will be skipped on tailcall */
#define X86_TAIL_CALL_OFFSET (12 + ENDBR_INSN_SIZE)
+static void push_r9(u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ EMIT2(0x41, 0x51); /* push r9 */
+ *pprog = prog;
+}
+
+static void pop_r9(u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ EMIT2(0x41, 0x59); /* pop r9 */
+ *pprog = prog;
+}
+
static void push_r12(u8 **pprog)
{
u8 *prog = *pprog;
@@ -1404,6 +1420,24 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+static void emit_priv_frame_ptr(u8 **pprog, void __percpu *priv_frame_ptr)
+{
+ u8 *prog = *pprog;
+
+ /* movabs r9, priv_frame_ptr */
+ emit_mov_imm64(&prog, X86_REG_R9, (__force long) priv_frame_ptr >> 32,
+ (u32) (__force long) priv_frame_ptr);
+
+#ifdef CONFIG_SMP
+ /* add <r9>, gs:[<off>] */
+ EMIT2(0x65, 0x4c);
+ EMIT3(0x03, 0x0c, 0x25);
+ EMIT((u32)(unsigned long)&this_cpu_off, 4);
+#endif
+
+ *pprog = prog;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
#define __LOAD_TCC_PTR(off) \
@@ -1421,6 +1455,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int insn_cnt = bpf_prog->len;
bool seen_exit = false;
u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ void __percpu *priv_frame_ptr = NULL;
u64 arena_vm_start, user_vm_start;
int i, excnt = 0;
int ilen, proglen = 0;
@@ -1429,6 +1464,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
int err;
stack_depth = bpf_prog->aux->stack_depth;
+ if (bpf_prog->aux->priv_stack_ptr) {
+ priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + round_up(stack_depth, 8);
+ stack_depth = 0;
+ }
arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
@@ -1457,6 +1496,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
emit_mov_imm64(&prog, X86_REG_R12,
arena_vm_start >> 32, (u32) arena_vm_start);
+ if (priv_frame_ptr)
+ emit_priv_frame_ptr(&prog, priv_frame_ptr);
+
ilen = prog - temp;
if (rw_image)
memcpy(rw_image + proglen, temp, ilen);
@@ -1476,6 +1518,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 *func;
int nops;
+ if (priv_frame_ptr) {
+ if (src_reg == BPF_REG_FP)
+ src_reg = X86_REG_R9;
+
+ if (dst_reg == BPF_REG_FP)
+ dst_reg = X86_REG_R9;
+ }
+
switch (insn->code) {
/* ALU */
case BPF_ALU | BPF_ADD | BPF_X:
@@ -2136,9 +2186,15 @@ st: if (is_imm8(insn->off))
}
if (!imm32)
return -EINVAL;
+ if (priv_frame_ptr) {
+ push_r9(&prog);
+ ip += 2;
+ }
ip += x86_call_depth_emit_accounting(&prog, func, ip);
if (emit_call(&prog, func, ip))
return -EINVAL;
+ if (priv_frame_ptr)
+ pop_r9(&prog);
break;
}
@@ -3563,6 +3619,11 @@ bool bpf_jit_supports_exceptions(void)
return IS_ENABLED(CONFIG_UNWINDER_ORC);
}
+bool bpf_jit_supports_private_stack(void)
+{
+ return true;
+}
+
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
{
#if defined(CONFIG_UNWINDER_ORC)
--
2.43.5
* [PATCH bpf-next v9 08/10] selftests/bpf: Add tracing prog private stack tests
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (6 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 07/10] bpf, x86: Support private stack in jit Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 09/10] bpf: Support private stack for struct_ops progs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 10/10] selftests/bpf: Add struct_ops prog private stack tests Yonghong Song
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Several private stack tests are added, including:
- main prog only with stack size greater than BPF_PRIV_STACK_MIN_SIZE.
- main prog only with stack size smaller than BPF_PRIV_STACK_MIN_SIZE.
- prog with one subprog having MAX_BPF_STACK stack size and another
subprog having a non-zero small stack size.
- prog with a callback function.
- prog with an exception in the main prog or a subprog.
- prog with an async callback without nesting.
- prog with an async callback with possible nesting.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
.../selftests/bpf/prog_tests/verifier.c | 2 +
.../bpf/progs/verifier_private_stack.c | 272 ++++++++++++++++++
2 files changed, 274 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c
diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index 75f7a2ce334b..d9f65adb456b 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -61,6 +61,7 @@
#include "verifier_or_jmp32_k.skel.h"
#include "verifier_precision.skel.h"
#include "verifier_prevent_map_lookup.skel.h"
+#include "verifier_private_stack.skel.h"
#include "verifier_raw_stack.skel.h"
#include "verifier_raw_tp_writable.skel.h"
#include "verifier_reg_equal.skel.h"
@@ -188,6 +189,7 @@ void test_verifier_bpf_fastcall(void) { RUN(verifier_bpf_fastcall); }
void test_verifier_or_jmp32_k(void) { RUN(verifier_or_jmp32_k); }
void test_verifier_precision(void) { RUN(verifier_precision); }
void test_verifier_prevent_map_lookup(void) { RUN(verifier_prevent_map_lookup); }
+void test_verifier_private_stack(void) { RUN(verifier_private_stack); }
void test_verifier_raw_stack(void) { RUN(verifier_raw_stack); }
void test_verifier_raw_tp_writable(void) { RUN(verifier_raw_tp_writable); }
void test_verifier_reg_equal(void) { RUN(verifier_reg_equal); }
diff --git a/tools/testing/selftests/bpf/progs/verifier_private_stack.c b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
new file mode 100644
index 000000000000..b1fbdf119553
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+/* From include/linux/filter.h */
+#define MAX_BPF_STACK 512
+
+#if defined(__TARGET_ARCH_x86)
+
+struct elem {
+ struct bpf_timer t;
+ char pad[256];
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, int);
+ __type(value, struct elem);
+} array SEC(".maps");
+
+SEC("kprobe")
+__description("Private stack, single prog")
+__success
+__arch_x86_64
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x100(%r9)")
+__naked void private_stack_single_prog(void)
+{
+ asm volatile (" \
+ r1 = 42; \
+ *(u64 *)(r10 - 256) = r1; \
+ r0 = 0; \
+ exit; \
+" ::: __clobber_all);
+}
+
+SEC("raw_tp")
+__description("No private stack")
+__success
+__arch_x86_64
+__jited(" subq $0x8, %rsp")
+__naked void no_private_stack_nested(void)
+{
+ asm volatile (" \
+ r1 = 42; \
+ *(u64 *)(r10 - 8) = r1; \
+ r0 = 0; \
+ exit; \
+" ::: __clobber_all);
+}
+
+__used
+__naked static void cumulative_stack_depth_subprog(void)
+{
+ asm volatile (" \
+ r1 = 41; \
+ *(u64 *)(r10 - 32) = r1; \
+ call %[bpf_get_smp_processor_id]; \
+ exit; \
+" :
+ : __imm(bpf_get_smp_processor_id)
+ : __clobber_all);
+}
+
+SEC("kprobe")
+__description("Private stack, subtree > MAX_BPF_STACK")
+__success
+__arch_x86_64
+/* private stack fp for the main prog */
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq 0x{{.*}}")
+__jited(" popq %r9")
+__jited(" xorl %eax, %eax")
+__naked void private_stack_nested_1(void)
+{
+ asm volatile (" \
+ r1 = 42; \
+ *(u64 *)(r10 - %[max_bpf_stack]) = r1; \
+ call cumulative_stack_depth_subprog; \
+ r0 = 0; \
+ exit; \
+" :
+ : __imm_const(max_bpf_stack, MAX_BPF_STACK)
+ : __clobber_all);
+}
+
+__naked __noinline __used
+static unsigned long loop_callback(void)
+{
+ asm volatile (" \
+ call %[bpf_get_prandom_u32]; \
+ r1 = 42; \
+ *(u64 *)(r10 - 512) = r1; \
+ call cumulative_stack_depth_subprog; \
+ r0 = 0; \
+ exit; \
+" :
+ : __imm(bpf_get_prandom_u32)
+ : __clobber_common);
+}
+
+SEC("raw_tp")
+__description("Private stack, callback")
+__success
+__arch_x86_64
+/* for func loop_callback */
+__jited("func #1")
+__jited(" endbr64")
+__jited(" nopl (%rax,%rax)")
+__jited(" nopl (%rax)")
+__jited(" pushq %rbp")
+__jited(" movq %rsp, %rbp")
+__jited(" endbr64")
+__jited(" movabsq $0x{{.*}}, %r9")
+__jited(" addq %gs:0x{{.*}}, %r9")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+__jited(" movl $0x2a, %edi")
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+__naked void private_stack_callback(void)
+{
+ asm volatile (" \
+ r1 = 1; \
+ r2 = %[loop_callback]; \
+ r3 = 0; \
+ r4 = 0; \
+ call %[bpf_loop]; \
+ r0 = 0; \
+ exit; \
+" :
+ : __imm_ptr(loop_callback),
+ __imm(bpf_loop)
+ : __clobber_common);
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in main prog")
+__success __retval(0)
+__arch_x86_64
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+int private_stack_exception_main_prog(void)
+{
+ asm volatile (" \
+ r1 = 42; \
+ *(u64 *)(r10 - 512) = r1; \
+" ::: __clobber_common);
+
+ bpf_throw(0);
+ return 0;
+}
+
+__used static int subprog_exception(void)
+{
+ bpf_throw(0);
+ return 0;
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in subprog")
+__success __retval(0)
+__arch_x86_64
+__jited(" movq %rdi, -0x200(%r9)")
+__jited(" pushq %r9")
+__jited(" callq")
+__jited(" popq %r9")
+int private_stack_exception_sub_prog(void)
+{
+ asm volatile (" \
+ r1 = 42; \
+ *(u64 *)(r10 - 512) = r1; \
+ call subprog_exception; \
+" ::: __clobber_common);
+
+ return 0;
+}
+
+int glob;
+__noinline static void subprog2(int *val)
+{
+ glob += val[0] * 2;
+}
+
+__noinline static void subprog1(int *val)
+{
+ int tmp[64] = {};
+
+ tmp[0] = *val;
+ subprog2(tmp);
+}
+
+__noinline static int timer_cb1(void *map, int *key, struct bpf_timer *timer)
+{
+ subprog1(key);
+ return 0;
+}
+
+__noinline static int timer_cb2(void *map, int *key, struct bpf_timer *timer)
+{
+ return 0;
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, async callback, not nested")
+__success __retval(0)
+__arch_x86_64
+__jited(" movabsq $0x{{.*}}, %r9")
+int private_stack_async_callback_1(void)
+{
+ struct bpf_timer *arr_timer;
+ int array_key = 0;
+
+ arr_timer = bpf_map_lookup_elem(&array, &array_key);
+ if (!arr_timer)
+ return 0;
+
+ bpf_timer_init(arr_timer, &array, 1);
+ bpf_timer_set_callback(arr_timer, timer_cb2);
+ bpf_timer_start(arr_timer, 0, 0);
+ subprog1(&array_key);
+ return 0;
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, async callback, potential nesting")
+__success __retval(0)
+__arch_x86_64
+__jited(" subq $0x100, %rsp")
+int private_stack_async_callback_2(void)
+{
+ struct bpf_timer *arr_timer;
+ int array_key = 0;
+
+ arr_timer = bpf_map_lookup_elem(&array, &array_key);
+ if (!arr_timer)
+ return 0;
+
+ bpf_timer_init(arr_timer, &array, 1);
+ bpf_timer_set_callback(arr_timer, timer_cb1);
+ bpf_timer_start(arr_timer, 0, 0);
+ subprog1(&array_key);
+ return 0;
+}
+
+#else
+
+SEC("kprobe")
+__description("private stack is not supported, use a dummy test")
+__success
+int dummy_test(void)
+{
+ return 0;
+}
+
+#endif
+
+char _license[] SEC("license") = "GPL";
--
2.43.5
* [PATCH bpf-next v9 09/10] bpf: Support private stack for struct_ops progs
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (7 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 08/10] selftests/bpf: Add tracing prog private stack tests Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 10/10] selftests/bpf: Add struct_ops prog private stack tests Yonghong Song
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
For struct_ops progs, whether a particular prog uses the private stack
(prog->aux->use_priv_stack) is decided before actual insn-level
verification of that prog. One possible implementation is to piggyback
on struct_ops->check_member(); the next patch has an example of this.
The struct_ops->check_member() callback sets
prog->aux->use_priv_stack to true, which enables private stack
usage while ignoring the BPF_PRIV_STACK_MIN_SIZE limit.
If use_priv_stack is true for a particular struct_ops prog, the bpf
trampoline needs to do a recursion check (one level at this point)
to avoid stack overwrites. A callback field, recursion_detected(), is
added to the bpf_prog_aux structure: if
bpf_prog->aux->recursion_detected is set by the struct_ops subsystem,
it is called on a recursion miss to report an error, collect related
info, etc.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
include/linux/bpf.h | 1 +
include/linux/bpf_verifier.h | 1 +
kernel/bpf/trampoline.c | 4 ++++
kernel/bpf/verifier.c | 20 +++++++++++++++++++-
4 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8a3ea7440a4a..d04f990dd6d9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1528,6 +1528,7 @@ struct bpf_prog_aux {
u64 prog_array_member_cnt; /* counts how many times as member of prog_array */
struct mutex ext_mutex; /* mutex for is_extended and prog_array_member_cnt */
struct bpf_arena *arena;
+ void (*recursion_detected)(struct bpf_prog *prog); /* callback if recursion is detected */
/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
const struct btf_type *attach_func_proto;
/* function name for valid attach_btf_id */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index e921589abc72..f65431c42f9e 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -891,6 +891,7 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
case BPF_PROG_TYPE_TRACING:
return prog->expected_attach_type != BPF_TRACE_ITER;
case BPF_PROG_TYPE_STRUCT_OPS:
+ return prog->aux->use_priv_stack;
case BPF_PROG_TYPE_LSM:
default:
return false;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 9f36c049f4c2..a8d188b31da5 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -899,6 +899,8 @@ static u64 notrace __bpf_prog_enter_recur(struct bpf_prog *prog, struct bpf_tram
if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
bpf_prog_inc_misses_counter(prog);
+ if (prog->aux->recursion_detected)
+ prog->aux->recursion_detected(prog);
return 0;
}
return bpf_prog_start_time();
@@ -975,6 +977,8 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
bpf_prog_inc_misses_counter(prog);
+ if (prog->aux->recursion_detected)
+ prog->aux->recursion_detected(prog);
return 0;
}
return bpf_prog_start_time();
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 03ae76d57076..ee16c328dffc 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6045,6 +6045,8 @@ static enum priv_stack_mode bpf_enable_priv_stack(struct bpf_prog *prog)
if (!bpf_prog_check_recur(prog))
return NO_PRIV_STACK;
+ if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
+ return PRIV_STACK_ALWAYS;
return PRIV_STACK_ADAPTIVE;
}
@@ -6118,7 +6120,8 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
idx, subprog_depth);
return -EACCES;
}
- if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
+ if (priv_stack_supported == PRIV_STACK_ALWAYS ||
+ subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
subprog[idx].use_priv_stack = true;
subprog[idx].visited_with_priv_stack = true;
}
@@ -6235,6 +6238,11 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
for (int i = 0; i < env->subprog_cnt; i++) {
if (!si[i].has_tail_call)
continue;
+ if (priv_stack_supported == PRIV_STACK_ALWAYS) {
+ verbose(env,
+ "Private stack not supported due to tail call\n");
+ return -EACCES;
+ }
priv_stack_supported = NO_PRIV_STACK;
break;
}
@@ -6275,6 +6283,11 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
depth_frame, subtree_depth);
return -EACCES;
}
+ if (orig_priv_stack_supported == PRIV_STACK_ALWAYS) {
+ verbose(env,
+ "Private stack not supported due to possible nested subprog run\n");
+ return -EACCES;
+ }
if (orig_priv_stack_supported == PRIV_STACK_ADAPTIVE) {
for (int i = 0; i < env->subprog_cnt; i++)
si[i].use_priv_stack = false;
@@ -21950,6 +21963,11 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
}
}
+ if (prog->aux->use_priv_stack && !bpf_jit_supports_private_stack()) {
+ verbose(env, "Private stack not supported by jit\n");
+ return -EACCES;
+ }
+
/* btf_ctx_access() used this to provide argument type info */
prog->aux->ctx_arg_info =
st_ops_desc->arg_info[member_idx].info;
--
2.43.5
* [PATCH bpf-next v9 10/10] selftests/bpf: Add struct_ops prog private stack tests
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
` (8 preceding siblings ...)
2024-11-04 19:35 ` [PATCH bpf-next v9 09/10] bpf: Support private stack for struct_ops progs Yonghong Song
@ 2024-11-04 19:35 ` Yonghong Song
9 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-04 19:35 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
Martin KaFai Lau, Tejun Heo
Add three tests for struct_ops using private stack.
./test_progs -t struct_ops_private_stack
#333/1 struct_ops_private_stack/private_stack:OK
#333/2 struct_ops_private_stack/private_stack_fail:OK
#333/3 struct_ops_private_stack/private_stack_recur:OK
#333 struct_ops_private_stack:OK
The following is a snippet of a struct_ops check_member() implementation:
u32 moff = __btf_member_bit_offset(t, member) / 8;
switch (moff) {
case offsetof(struct bpf_testmod_ops3, test_1):
prog->aux->use_priv_stack = true;
prog->aux->recursion_detected = test_1_recursion_detected;
fallthrough;
default:
break;
}
return 0;
The first test nests two different callback functions, where the first
prog has a stack size of more than 512 bytes (including subprogs) with
the private stack enabled.
The second test is a negative test where the second prog has a stack
size of more than 512 bytes but the private stack is not enabled.
The third test has the same callback function recursing into itself. At
run time, the trampoline recursion check kicks in to prevent the
recursion. The recursion_detected() callback function is implemented by
bpf_testmod; the following message in dmesg
bpf_testmod: oh no, recursing into test_1, recursion_misses 1
demonstrates that the callback function is indeed triggered when a
recursion miss happens.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 104 +++++++++++++++++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 5 +
.../bpf/prog_tests/struct_ops_private_stack.c | 106 ++++++++++++++++++
.../bpf/progs/struct_ops_private_stack.c | 62 ++++++++++
.../bpf/progs/struct_ops_private_stack_fail.c | 62 ++++++++++
.../progs/struct_ops_private_stack_recur.c | 50 +++++++++
6 files changed, 389 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 8835761d9a12..46fe282b2e93 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -245,6 +245,39 @@ __bpf_kfunc void bpf_testmod_ctx_release(struct bpf_testmod_ctx *ctx)
call_rcu(&ctx->rcu, testmod_free_cb);
}
+static struct bpf_testmod_ops3 *st_ops3;
+
+static int bpf_testmod_test_3(void)
+{
+ return 0;
+}
+
+static int bpf_testmod_test_4(void)
+{
+ return 0;
+}
+
+static struct bpf_testmod_ops3 __bpf_testmod_ops3 = {
+ .test_1 = bpf_testmod_test_3,
+ .test_2 = bpf_testmod_test_4,
+};
+
+static void bpf_testmod_test_struct_ops3(void)
+{
+ if (st_ops3)
+ st_ops3->test_1();
+}
+
+__bpf_kfunc void bpf_testmod_ops3_call_test_1(void)
+{
+ st_ops3->test_1();
+}
+
+__bpf_kfunc void bpf_testmod_ops3_call_test_2(void)
+{
+ st_ops3->test_2();
+}
+
struct bpf_testmod_btf_type_tag_1 {
int a;
};
@@ -380,6 +413,8 @@ bpf_testmod_test_read(struct file *file, struct kobject *kobj,
(void)bpf_testmod_test_arg_ptr_to_struct(&struct_arg1_2);
+ bpf_testmod_test_struct_ops3();
+
struct_arg3 = kmalloc((sizeof(struct bpf_testmod_struct_arg_3) +
sizeof(int)), GFP_KERNEL);
if (struct_arg3 != NULL) {
@@ -584,6 +619,8 @@ BTF_ID_FLAGS(func, bpf_kfunc_trusted_num_test, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_kfunc_rcu_task_test, KF_RCU)
BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_1)
+BTF_ID_FLAGS(func, bpf_testmod_ops3_call_test_2)
BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids)
BTF_ID_LIST(bpf_testmod_dtor_ids)
@@ -1094,6 +1131,10 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
.is_valid_access = bpf_testmod_ops_is_valid_access,
};
+static const struct bpf_verifier_ops bpf_testmod_verifier_ops3 = {
+ .is_valid_access = bpf_testmod_ops_is_valid_access,
+};
+
static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
{
struct bpf_testmod_ops *ops = kdata;
@@ -1173,6 +1214,68 @@ struct bpf_struct_ops bpf_testmod_ops2 = {
.owner = THIS_MODULE,
};
+static int st_ops3_reg(void *kdata, struct bpf_link *link)
+{
+ int err = 0;
+
+ mutex_lock(&st_ops_mutex);
+ if (st_ops3) {
+ pr_err("st_ops has already been registered\n");
+ err = -EEXIST;
+ goto unlock;
+ }
+ st_ops3 = kdata;
+
+unlock:
+ mutex_unlock(&st_ops_mutex);
+ return err;
+}
+
+static void st_ops3_unreg(void *kdata, struct bpf_link *link)
+{
+ mutex_lock(&st_ops_mutex);
+ st_ops3 = NULL;
+ mutex_unlock(&st_ops_mutex);
+}
+
+static void test_1_recursion_detected(struct bpf_prog *prog)
+{
+ struct bpf_prog_stats *stats;
+
+ stats = this_cpu_ptr(prog->stats);
+ printk("bpf_testmod: oh no, recursing into test_1, recursion_misses %llu",
+ u64_stats_read(&stats->misses));
+}
+
+static int st_ops3_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct bpf_testmod_ops3, test_1):
+ prog->aux->use_priv_stack = true;
+ prog->aux->recursion_detected = test_1_recursion_detected;
+ fallthrough;
+ default:
+ break;
+ }
+ return 0;
+}
+
+struct bpf_struct_ops bpf_testmod_ops3 = {
+ .verifier_ops = &bpf_testmod_verifier_ops3,
+ .init = bpf_testmod_ops_init,
+ .init_member = bpf_testmod_ops_init_member,
+ .reg = st_ops3_reg,
+ .unreg = st_ops3_unreg,
+ .check_member = st_ops3_check_member,
+ .cfi_stubs = &__bpf_testmod_ops3,
+ .name = "bpf_testmod_ops3",
+ .owner = THIS_MODULE,
+};
+
static int bpf_test_mod_st_ops__test_prologue(struct st_ops_args *args)
{
return 0;
@@ -1331,6 +1434,7 @@ static int bpf_testmod_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_testmod_kfunc_set);
ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops);
ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops2, bpf_testmod_ops2);
+ ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops3, bpf_testmod_ops3);
ret = ret ?: register_bpf_struct_ops(&testmod_st_ops, bpf_testmod_st_ops);
ret = ret ?: register_btf_id_dtor_kfuncs(bpf_testmod_dtors,
ARRAY_SIZE(bpf_testmod_dtors),
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index fb7dff47597a..356803d1c10e 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -94,6 +94,11 @@ struct bpf_testmod_ops2 {
int (*test_1)(void);
};
+struct bpf_testmod_ops3 {
+ int (*test_1)(void);
+ int (*test_2)(void);
+};
+
struct st_ops_args {
u64 a;
};
diff --git a/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c b/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
new file mode 100644
index 000000000000..4006879ca3fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/struct_ops_private_stack.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include "struct_ops_private_stack.skel.h"
+#include "struct_ops_private_stack_fail.skel.h"
+#include "struct_ops_private_stack_recur.skel.h"
+
+static void test_private_stack(void)
+{
+ struct struct_ops_private_stack *skel;
+ struct bpf_link *link;
+ int err;
+
+ skel = struct_ops_private_stack__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack__load(skel);
+ if (!ASSERT_OK(err, "struct_ops_private_stack__load"))
+ goto cleanup;
+
+ link = bpf_map__attach_struct_ops(skel->maps.testmod_1);
+ if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+ goto cleanup;
+
+ ASSERT_OK(trigger_module_test_read(256), "trigger_read");
+
+ ASSERT_EQ(skel->bss->val_i, 3, "val_i");
+ ASSERT_EQ(skel->bss->val_j, 8, "val_j");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ struct_ops_private_stack__destroy(skel);
+}
+
+static void test_private_stack_fail(void)
+{
+ struct struct_ops_private_stack_fail *skel;
+ int err;
+
+ skel = struct_ops_private_stack_fail__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack_fail__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack_fail__load(skel);
+ if (!ASSERT_ERR(err, "struct_ops_private_stack_fail__load"))
+ goto cleanup;
+ return;
+
+cleanup:
+ struct_ops_private_stack_fail__destroy(skel);
+}
+
+static void test_private_stack_recur(void)
+{
+ struct struct_ops_private_stack_recur *skel;
+ struct bpf_link *link;
+ int err;
+
+ skel = struct_ops_private_stack_recur__open();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_private_stack_recur__open"))
+ return;
+
+ if (skel->data->skip) {
+ test__skip();
+ goto cleanup;
+ }
+
+ err = struct_ops_private_stack_recur__load(skel);
+ if (!ASSERT_OK(err, "struct_ops_private_stack_recur__load"))
+ goto cleanup;
+
+ link = bpf_map__attach_struct_ops(skel->maps.testmod_1);
+ if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+ goto cleanup;
+
+ ASSERT_OK(trigger_module_test_read(256), "trigger_read");
+
+ ASSERT_EQ(skel->bss->val_j, 3, "val_j");
+
+ bpf_link__destroy(link);
+
+cleanup:
+ struct_ops_private_stack_recur__destroy(skel);
+}
+
+void test_struct_ops_private_stack(void)
+{
+ if (test__start_subtest("private_stack"))
+ test_private_stack();
+ if (test__start_subtest("private_stack_fail"))
+ test_private_stack_fail();
+ if (test__start_subtest("private_stack_recur"))
+ test_private_stack_recur();
+}
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
new file mode 100644
index 000000000000..8ea57e5348ab
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_2(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 200 bytes */
+ int b[50] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 1;
+ val_i = subprog1(a);
+ bpf_testmod_ops3_call_test_2();
+ return 0;
+}
+
+SEC("struct_ops")
+int BPF_PROG(test_2)
+{
+ /* stack size 200 bytes */
+ int a[50] = {};
+
+ a[10] = 3;
+ val_j = subprog1(a);
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+ .test_2 = (void *)test_2,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
new file mode 100644
index 000000000000..1f55ec4cee37
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_fail.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_2(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 200 bytes */
+ int b[50] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 100 bytes */
+ int a[25] = {};
+
+ a[10] = 1;
+ val_i = subprog1(a);
+ bpf_testmod_ops3_call_test_2();
+ return 0;
+}
+
+SEC("struct_ops")
+int BPF_PROG(test_2)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 3;
+ val_j = subprog1(a);
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+ .test_2 = (void *)test_2,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
new file mode 100644
index 000000000000..15d4e914dc92
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_private_stack_recur.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+#if defined(__TARGET_ARCH_x86)
+bool skip __attribute((__section__(".data"))) = false;
+#else
+bool skip = true;
+#endif
+
+void bpf_testmod_ops3_call_test_1(void) __ksym;
+
+int val_i, val_j;
+
+__noinline static int subprog2(int *a, int *b)
+{
+ return val_i + a[10] + b[20];
+}
+
+__noinline static int subprog1(int *a)
+{
+ /* stack size 400 bytes */
+ int b[100] = {};
+
+ b[20] = 2;
+ return subprog2(a, b);
+}
+
+
+SEC("struct_ops")
+int BPF_PROG(test_1)
+{
+ /* stack size 400 bytes */
+ int a[100] = {};
+
+ a[10] = 1;
+ val_j += subprog1(a);
+ bpf_testmod_ops3_call_test_1();
+ return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops3 testmod_1 = {
+ .test_1 = (void *)test_1,
+};
--
2.43.5
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-04 19:35 ` [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case Yonghong Song
@ 2024-11-05 1:21 ` Alexei Starovoitov
2024-11-05 1:35 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 1:21 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> The bpf_prog_check_recur() function is currently used by trampoline
> and tracing programs (also using trampoline) to check whether a
> particular prog supports recursion checking or not. The default case
> (non-trampoline progs) returns true in the current implementation.
>
> Let us make the non-trampoline prog recursion check return false
> instead. It does not impact any existing use cases and allows the
> function to be used outside the trampoline context in the next patch.
Does not impact ?! But it does.
This patch removes recursion check from fentry progs.
This cannot be right.
pw-bot: cr
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> include/linux/bpf_verifier.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 4513372c5bc8..ad887c68d3e1 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -889,9 +889,8 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
> return prog->expected_attach_type != BPF_TRACE_ITER;
> case BPF_PROG_TYPE_STRUCT_OPS:
> case BPF_PROG_TYPE_LSM:
> - return false;
> default:
> - return true;
> + return false;
> }
> }
>
> --
> 2.43.5
>
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 1:21 ` Alexei Starovoitov
@ 2024-11-05 1:35 ` Yonghong Song
2024-11-05 1:55 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 1:35 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> The bpf_prog_check_recur() function is currently used by trampoline
>> and tracing programs (also using trampoline) to check whether a
>> particular prog supports recursion checking or not. The default case
>> (non-trampoline progs) returns true in the current implementation.
>>
>> Let us make the non-trampoline prog recursion check return false
>> instead. It does not impact any existing use cases and allows the
>> function to be used outside the trampoline context in the next patch.
> Does not impact ?! But it does.
> This patch removes recursion check from fentry progs.
> This cannot be right.
The original bpf_prog_check_recur() implementation:
static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
{
switch (resolve_prog_type(prog)) {
case BPF_PROG_TYPE_TRACING:
return prog->expected_attach_type != BPF_TRACE_ITER;
case BPF_PROG_TYPE_STRUCT_OPS:
case BPF_PROG_TYPE_LSM:
return false;
default:
return true;
}
}
fentry prog is a TRACING prog, so it is covered. Did I miss anything?
>
> pw-bot: cr
>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf_verifier.h | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 4513372c5bc8..ad887c68d3e1 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -889,9 +889,8 @@ static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
>> return prog->expected_attach_type != BPF_TRACE_ITER;
>> case BPF_PROG_TYPE_STRUCT_OPS:
>> case BPF_PROG_TYPE_LSM:
>> - return false;
>> default:
>> - return true;
>> + return false;
>> }
>> }
>>
>> --
>> 2.43.5
>>
* Re: [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-04 19:35 ` [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs Yonghong Song
@ 2024-11-05 1:38 ` Alexei Starovoitov
2024-11-05 3:07 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 1:38 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> For any main prog or subprogs, allocate private stack space if requested
> by subprog info or main prog. The alignment for private stack is 16
> since maximum stack alignment is 16 for bpf-enabled archs.
>
> If jit failed, the allocated private stack will be freed in the same
> function where the allocation happens. If jit succeeded, e.g., for
> x86_64 arch, the allocated private stack is freed in arch specific
> implementation of bpf_jit_free().
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> arch/x86/net/bpf_jit_comp.c | 1 +
> include/linux/bpf.h | 1 +
> kernel/bpf/core.c | 19 ++++++++++++++++---
> kernel/bpf/verifier.c | 13 +++++++++++++
> 4 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 06b080b61aa5..59d294b8dd67 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -3544,6 +3544,7 @@ void bpf_jit_free(struct bpf_prog *prog)
> prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset();
> hdr = bpf_jit_binary_pack_hdr(prog);
> bpf_jit_binary_pack_free(hdr, NULL);
> + free_percpu(prog->aux->priv_stack_ptr);
> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
> }
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8db3c5d7404b..8a3ea7440a4a 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1507,6 +1507,7 @@ struct bpf_prog_aux {
> u32 max_rdwr_access;
> struct btf *attach_btf;
> const struct bpf_ctx_arg_aux *ctx_arg_info;
> + void __percpu *priv_stack_ptr;
> struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
> struct bpf_prog *dst_prog;
> struct bpf_trampoline *dst_trampoline;
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 14d9288441f2..f7a3e93c41e1 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2396,6 +2396,7 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
> */
> struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
> {
> + void __percpu *priv_stack_ptr = NULL;
> /* In case of BPF to BPF calls, verifier did all the prep
> * work with regards to JITing, etc.
> */
> @@ -2421,11 +2422,23 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
> if (*err)
> return fp;
>
> + if (fp->aux->use_priv_stack && fp->aux->stack_depth) {
> + priv_stack_ptr = __alloc_percpu_gfp(fp->aux->stack_depth, 16, GFP_KERNEL);
> + if (!priv_stack_ptr) {
> + *err = -ENOMEM;
> + return fp;
> + }
> + fp->aux->priv_stack_ptr = priv_stack_ptr;
> + }
> +
> fp = bpf_int_jit_compile(fp);
> bpf_prog_jit_attempt_done(fp);
> - if (!fp->jited && jit_needed) {
> - *err = -ENOTSUPP;
> - return fp;
> + if (!fp->jited) {
> + free_percpu(priv_stack_ptr);
> + if (jit_needed) {
> + *err = -ENOTSUPP;
> + return fp;
> + }
> }
> } else {
> *err = bpf_prog_offload_compile(fp);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e01b3f0fd314..03ae76d57076 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20073,6 +20073,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> {
> struct bpf_prog *prog = env->prog, **func, *tmp;
> int i, j, subprog_start, subprog_end = 0, len, subprog;
> + void __percpu *priv_stack_ptr;
> struct bpf_map *map_ptr;
> struct bpf_insn *insn;
> void *old_bpf_func;
> @@ -20169,6 +20170,17 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>
> func[i]->aux->name[0] = 'F';
> func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
> +
> + if (env->subprog_info[i].use_priv_stack && func[i]->aux->stack_depth) {
> + priv_stack_ptr = __alloc_percpu_gfp(func[i]->aux->stack_depth, 16,
> + GFP_KERNEL);
> + if (!priv_stack_ptr) {
> + err = -ENOMEM;
> + goto out_free;
> + }
> + func[i]->aux->priv_stack_ptr = priv_stack_ptr;
> + }
> +
> func[i]->jit_requested = 1;
> func[i]->blinding_requested = prog->blinding_requested;
> func[i]->aux->kfunc_tab = prog->aux->kfunc_tab;
> @@ -20201,6 +20213,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
> func[i]->aux->exception_boundary = env->seen_exception;
> func[i] = bpf_int_jit_compile(func[i]);
> if (!func[i]->jited) {
> + free_percpu(func[i]->aux->priv_stack_ptr);
> err = -ENOTSUPP;
> goto out_free;
> }
Looks correct from leaks pov, but this is so hard to follow.
I still don't like this imbalanced alloc/free.
Either both need to be done by core or both by JIT.
And JIT is probably better, since in:
_alloc_percpu_gfp(func[i]->aux->stack_depth, 16
16 alignment is x86 specific.
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 1:35 ` Yonghong Song
@ 2024-11-05 1:55 ` Alexei Starovoitov
2024-11-05 2:53 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 1:55 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 5:35 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
> > On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >> The bpf_prog_check_recur() function is currently used by trampoline
> >> and tracing programs (also using trampoline) to check whether a
> >> particular prog supports recursion checking or not. The default case
> >> (non-trampoline progs) returns true in the current implementation.
> >>
> >> Let us make the non-trampoline prog recursion check return false
> >> instead. It does not impact any existing use cases and allows the
> >> function to be used outside the trampoline context in the next patch.
> > Does not impact ?! But it does.
> > This patch removes recursion check from fentry progs.
> > This cannot be right.
>
> The original bpf_prog_check_recur() implementation:
>
> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
> {
> switch (resolve_prog_type(prog)) {
> case BPF_PROG_TYPE_TRACING:
> return prog->expected_attach_type != BPF_TRACE_ITER;
> case BPF_PROG_TYPE_STRUCT_OPS:
> case BPF_PROG_TYPE_LSM:
> return false;
> default:
> return true;
> }
> }
>
> fentry prog is a TRACING prog, so it is covered. Did I miss anything?
I see. This is way too subtle.
You're correct that fentry is TYPE_TRACING,
so it could have "worked" if it was used to build trampolines only.
But this helper is called for other prog types:
case BPF_FUNC_task_storage_get:
if (bpf_prog_check_recur(prog))
return &bpf_task_storage_get_recur_proto;
return &bpf_task_storage_get_proto;
so it's still not correct, but for a different reason.
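To make the subtlety concrete, the before/after behavior can be sketched as a standalone C model (simplified stand-in enums, not the kernel's types): fentry, being TYPE_TRACING, gets the same answer either way, while a prog type that previously fell through to the default case flips from true to false.

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of bpf_prog_check_recur() before/after the patch.
 * Names and values are simplified stand-ins, not kernel definitions. */
enum prog_type { PROG_TRACING, PROG_STRUCT_OPS, PROG_LSM, PROG_OTHER };
enum attach_type { ATTACH_FENTRY, ATTACH_TRACE_ITER };

static bool check_recur_old(enum prog_type t, enum attach_type a)
{
	switch (t) {
	case PROG_TRACING:
		return a != ATTACH_TRACE_ITER;
	case PROG_STRUCT_OPS:
	case PROG_LSM:
		return false;
	default:
		return true;	/* old default case */
	}
}

static bool check_recur_new(enum prog_type t, enum attach_type a)
{
	switch (t) {
	case PROG_TRACING:
		return a != ATTACH_TRACE_ITER;
	case PROG_STRUCT_OPS:
	case PROG_LSM:
	default:
		return false;	/* new default case */
	}
}
```

Under this model, fentry (PROG_TRACING) is unaffected, but any caller reaching the helper with a non-trampoline prog type now gets the non-recur helper proto.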
* Re: [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes
2024-11-04 19:35 ` [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes Yonghong Song
@ 2024-11-05 2:47 ` Alexei Starovoitov
2024-11-05 3:09 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 2:47 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> @@ -6070,11 +6105,23 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> depth);
> return -EACCES;
> }
> - depth += round_up_stack_depth(env, subprog[idx].stack_depth);
> + subprog_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
> + depth += subprog_depth;
> if (depth > MAX_BPF_STACK && !*subtree_depth) {
> *subtree_depth = depth;
> *depth_frame = frame + 1;
> }
> + if (priv_stack_supported != NO_PRIV_STACK) {
> + if (!subprog[idx].use_priv_stack) {
> + if (subprog_depth > MAX_BPF_STACK) {
> + verbose(env, "stack size of subprog %d is %d. Too large\n",
> + idx, subprog_depth);
> + return -EACCES;
> + }
> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
> + subprog[idx].use_priv_stack = true;
> + }
> + }
Hold on. If I'm reading this correctly this adaptive priv stack
concept will make some subprogs with stack >= 64 to use priv_stack
while other subprogs will still use normal stack?
Same for the main prog. It may or may not use priv stack ?
I guess this is ok-ish, but needs to be clearly explained in comments
and commit log.
My first reaction to such adaptive concept was negative, since
such "random" mix of priv stack in some subprogs makes
the whole thing pretty hard to reason about it,
but I guess it's valid to use normal stack when stack usage
is small. No need to penalize every subprog.
I wonder what others think about it.
Also it would be cleaner to rewrite above as:
if (subprog_depth > MAX_BPF_STACK) {
verbose();
> return -EACCES;
}
if (priv_stack_supported == PRIV_STACK_ADAPTIVE &&
subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
subprog[idx].use_priv_stack = true;
less indent and easier to read.
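The suggested rewrite boils down to the per-subprog decision below, sketched as a standalone model (MAX_BPF_STACK of 512 and the 64-byte threshold come from the discussion above; the function itself is illustrative, not kernel code).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BPF_STACK		512
#define BPF_PRIV_STACK_MIN_SIZE	64
#define EACCES			13

enum priv_stack_mode { NO_PRIV_STACK, PRIV_STACK_ADAPTIVE };

/* Model of the adaptive decision: reject an oversized subprog stack
 * first, then opt into the private stack only when the subprog's
 * stack is large enough to be worth the per-cpu allocation. */
static int decide_priv_stack(int subprog_depth, enum priv_stack_mode mode,
			     bool *use_priv_stack)
{
	if (subprog_depth > MAX_BPF_STACK)
		return -EACCES;
	if (mode == PRIV_STACK_ADAPTIVE &&
	    subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
		*use_priv_stack = true;
	return 0;
}
```

The flat structure (early return, then a single combined condition) is what keeps the indentation shallow.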
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-04 19:35 ` [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback Yonghong Song
@ 2024-11-05 2:51 ` Alexei Starovoitov
2024-11-05 3:37 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 2:51 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> In previous patch, tracing progs are enabled for private stack since
> recursion checking ensures there exists no nested same bpf prog run on
> the same cpu.
>
> But it is still possible for nested bpf subprog run on the same cpu
> if the same subprog is called in both main prog and async callback,
> or in different async callbacks. For example,
> main_prog
> bpf_timer_set_callback(timer, timer_cb);
> call sub1
> sub1
> ...
> > timer_cb
> call sub1
>
> In the above case, nested subprog run for sub1 is possible with one in
> process context and the other in softirq context. If this is the case,
> the verifier will disable private stack for this bpf prog.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> include/linux/bpf_verifier.h | 2 ++
> kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
> 2 files changed, 39 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 0622c11a7e19..e921589abc72 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -669,6 +669,8 @@ struct bpf_subprog_info {
> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
> bool keep_fastcall_stack: 1;
> bool use_priv_stack: 1;
> + bool visited_with_priv_stack_accum: 1;
> + bool visited_with_priv_stack: 1;
>
> u8 arg_cnt;
> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 406195c433ea..e01b3f0fd314 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> idx, subprog_depth);
> return -EACCES;
> }
> - if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
> subprog[idx].use_priv_stack = true;
> + subprog[idx].visited_with_priv_stack = true;
> + }
> + } else {
> + subprog[idx].visited_with_priv_stack = true;
See suggestion for patch 3.
It's cleaner to rewrite with a single visited_with_priv_stack = true; statement.
> }
> }
> continue_func:
> @@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> static int check_max_stack_depth(struct bpf_verifier_env *env)
> {
> struct bpf_subprog_info *si = env->subprog_info;
> + enum priv_stack_mode orig_priv_stack_supported;
> enum priv_stack_mode priv_stack_supported;
> int ret, subtree_depth = 0, depth_frame;
>
> priv_stack_supported = bpf_enable_priv_stack(env->prog);
> + orig_priv_stack_supported = priv_stack_supported;
>
> if (priv_stack_supported != NO_PRIV_STACK) {
> for (int i = 0; i < env->subprog_cnt; i++) {
> @@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
> priv_stack_supported);
> if (ret < 0)
> return ret;
> +
> + if (priv_stack_supported != NO_PRIV_STACK) {
> + for (int j = 0; j < env->subprog_cnt; j++) {
> + if (si[j].visited_with_priv_stack_accum &&
> + si[j].visited_with_priv_stack) {
> + /* si[j] is visited by both main/async subprog
> + * and another async subprog.
> + */
> + priv_stack_supported = NO_PRIV_STACK;
> + break;
> + }
> + if (!si[j].visited_with_priv_stack_accum)
> + si[j].visited_with_priv_stack_accum =
> + si[j].visited_with_priv_stack;
> + }
> + }
> + if (priv_stack_supported != NO_PRIV_STACK) {
> + for (int j = 0; j < env->subprog_cnt; j++)
> + si[j].visited_with_priv_stack = false;
> + }
I cannot understand what this algorithm is doing.
What is the meaning of visited_with_priv_stack_accum ?
> }
> }
>
> - if (priv_stack_supported == NO_PRIV_STACK && subtree_depth > MAX_BPF_STACK) {
> - verbose(env, "combined stack size of %d calls is %d. Too large\n",
> - depth_frame, subtree_depth);
> - return -EACCES;
> + if (priv_stack_supported == NO_PRIV_STACK) {
> + if (subtree_depth > MAX_BPF_STACK) {
> + verbose(env, "combined stack size of %d calls is %d. Too large\n",
> + depth_frame, subtree_depth);
> + return -EACCES;
> + }
> + if (orig_priv_stack_supported == PRIV_STACK_ADAPTIVE) {
> + for (int i = 0; i < env->subprog_cnt; i++)
> + si[i].use_priv_stack = false;
> + }
why? This patch suppose clear use_priv_stack from subprogs
that are dual called and only from those subprogs.
All other subprogs are fine.
But it seems the alog attempts to detect one such calling scenario
and disables priv_stack everywhere?
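One way to read the accumulation pass being questioned here, sketched as a standalone model (illustrative interpretation, not the kernel code): each walk rooted at the main prog or at an async callback yields a set of visited subprogs, and a subprog that appears in the accumulated set from earlier walks as well as in the current walk is reachable from two contexts (e.g. process and softirq), so it could run nested on one cpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SUBPROGS 8

/* visited[i][j] == true means walk i (main prog or one async callback)
 * reaches subprog j. Returns false if any subprog is reached from two
 * different walks, i.e. private stack must be disabled. */
static bool priv_stack_ok(bool visited[][MAX_SUBPROGS], int nwalks,
			  int nsubprogs)
{
	bool accum[MAX_SUBPROGS];
	int i, j;

	memset(accum, 0, sizeof(accum));
	for (i = 0; i < nwalks; i++) {
		for (j = 0; j < nsubprogs; j++) {
			if (accum[j] && visited[i][j])
				return false;	/* reached from two roots */
			if (visited[i][j])
				accum[j] = true;
		}
	}
	return true;
}
```

In this reading, `visited_with_priv_stack_accum` plays the role of `accum` and `visited_with_priv_stack` the per-walk set that is cleared between walks.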
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 1:55 ` Alexei Starovoitov
@ 2024-11-05 2:53 ` Yonghong Song
2024-11-05 3:50 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 2:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 5:55 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 5:35 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
>>> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> The bpf_prog_check_recur() funciton is currently used by trampoline
>>>> and tracing programs (also using trampoline) to check whether a
>>>> particular prog supports recursion checking or not. The default case
>>>> (non-trampoline progs) return true in the current implementation.
>>>>
>>>> Let us make the non-trampoline prog recursion check return false
>>>> instead. It does not impact any existing use cases and allows the
>>>> function to be used outside the trampoline context in the next patch.
>>> Does not impact ?! But it does.
>>> This patch removes recursion check from fentry progs.
>>> This cannot be right.
>> The original bpf_prog_check_recur() implementation:
>>
>> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
>> {
>> switch (resolve_prog_type(prog)) {
>> case BPF_PROG_TYPE_TRACING:
>> return prog->expected_attach_type != BPF_TRACE_ITER;
>> case BPF_PROG_TYPE_STRUCT_OPS:
>> case BPF_PROG_TYPE_LSM:
>> return false;
>> default:
>> return true;
>> }
>> }
>>
>> fentry prog is a TRACING prog, so it is covered. Did I miss anything?
> I see. This is way too subtle.
> You're correct that fentry is TYPE_TRACING,
> so it could have "worked" if it was used to build trampolines only.
>
> But this helper is called for other prog types:
>
> case BPF_FUNC_task_storage_get:
> if (bpf_prog_check_recur(prog))
> return &bpf_task_storage_get_recur_proto;
> return &bpf_task_storage_get_proto;
>
> so it's still not correct, but for a different reason.
There are four uses of bpf_prog_check_recur() in the kernel based on
cscope:

  0 kernel/bpf/trampoline.c   bpf_trampoline_enter    1053  if (bpf_prog_check_recur(prog))
  1 kernel/bpf/trampoline.c   bpf_trampoline_exit     1068  if (bpf_prog_check_recur(prog))
  2 kernel/trace/bpf_trace.c  bpf_tracing_func_proto  1549  if (bpf_prog_check_recur(prog))
  3 kernel/trace/bpf_trace.c  bpf_tracing_func_proto  1553  if (bpf_prog_check_recur(prog))

The 2nd and 3rd ones are in bpf_trace.c:

1444 static const struct bpf_func_proto *
1445 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
1446 {
1447         switch (func_id) {
...
1548         case BPF_FUNC_task_storage_get:
1549                 if (bpf_prog_check_recur(prog))
1550                         return &bpf_task_storage_get_recur_proto;
1551                 return &bpf_task_storage_get_proto;
1552         case BPF_FUNC_task_storage_delete:
1553                 if (bpf_prog_check_recur(prog))
1554                         return &bpf_task_storage_delete_recur_proto;
1555                 return &bpf_task_storage_delete_proto;
...
1568         default:
1569                 return bpf_base_func_proto(func_id, prog);
1570         }
1571 }

They are used for tracing programs, so we should be safe here. But if
you think that changing bpf_prog_check_recur() and calling
bpf_prog_check_recur() in bpf_enable_priv_stack() is too subtle, I can
go back to my original approach, which makes all supported prog types
explicit in bpf_enable_priv_stack().
* Re: [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-05 1:38 ` Alexei Starovoitov
@ 2024-11-05 3:07 ` Yonghong Song
2024-11-05 3:44 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 3:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 5:38 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> For any main prog or subprogs, allocate private stack space if requested
>> by subprog info or main prog. The alignment for private stack is 16
>> since maximum stack alignment is 16 for bpf-enabled archs.
>>
>> If jit failed, the allocated private stack will be freed in the same
>> function where the allocation happens. If jit succeeded, e.g., for
>> x86_64 arch, the allocated private stack is freed in arch specific
>> implementation of bpf_jit_free().
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> arch/x86/net/bpf_jit_comp.c | 1 +
>> include/linux/bpf.h | 1 +
>> kernel/bpf/core.c | 19 ++++++++++++++++---
>> kernel/bpf/verifier.c | 13 +++++++++++++
>> 4 files changed, 31 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 06b080b61aa5..59d294b8dd67 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -3544,6 +3544,7 @@ void bpf_jit_free(struct bpf_prog *prog)
>> prog->bpf_func = (void *)prog->bpf_func - cfi_get_offset();
>> hdr = bpf_jit_binary_pack_hdr(prog);
>> bpf_jit_binary_pack_free(hdr, NULL);
>> + free_percpu(prog->aux->priv_stack_ptr);
>> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
>> }
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 8db3c5d7404b..8a3ea7440a4a 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1507,6 +1507,7 @@ struct bpf_prog_aux {
>> u32 max_rdwr_access;
>> struct btf *attach_btf;
>> const struct bpf_ctx_arg_aux *ctx_arg_info;
>> + void __percpu *priv_stack_ptr;
>> struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
>> struct bpf_prog *dst_prog;
>> struct bpf_trampoline *dst_trampoline;
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 14d9288441f2..f7a3e93c41e1 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -2396,6 +2396,7 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
>> */
>> struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
>> {
>> + void __percpu *priv_stack_ptr = NULL;
>> /* In case of BPF to BPF calls, verifier did all the prep
>> * work with regards to JITing, etc.
>> */
>> @@ -2421,11 +2422,23 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
>> if (*err)
>> return fp;
>>
>> + if (fp->aux->use_priv_stack && fp->aux->stack_depth) {
>> + priv_stack_ptr = __alloc_percpu_gfp(fp->aux->stack_depth, 16, GFP_KERNEL);
>> + if (!priv_stack_ptr) {
>> + *err = -ENOMEM;
>> + return fp;
>> + }
>> + fp->aux->priv_stack_ptr = priv_stack_ptr;
>> + }
>> +
>> fp = bpf_int_jit_compile(fp);
>> bpf_prog_jit_attempt_done(fp);
>> - if (!fp->jited && jit_needed) {
>> - *err = -ENOTSUPP;
>> - return fp;
>> + if (!fp->jited) {
>> + free_percpu(priv_stack_ptr);
>> + if (jit_needed) {
>> + *err = -ENOTSUPP;
>> + return fp;
>> + }
>> }
>> } else {
>> *err = bpf_prog_offload_compile(fp);
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index e01b3f0fd314..03ae76d57076 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -20073,6 +20073,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>> {
>> struct bpf_prog *prog = env->prog, **func, *tmp;
>> int i, j, subprog_start, subprog_end = 0, len, subprog;
>> + void __percpu *priv_stack_ptr;
>> struct bpf_map *map_ptr;
>> struct bpf_insn *insn;
>> void *old_bpf_func;
>> @@ -20169,6 +20170,17 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>>
>> func[i]->aux->name[0] = 'F';
>> func[i]->aux->stack_depth = env->subprog_info[i].stack_depth;
>> +
>> + if (env->subprog_info[i].use_priv_stack && func[i]->aux->stack_depth) {
>> + priv_stack_ptr = __alloc_percpu_gfp(func[i]->aux->stack_depth, 16,
>> + GFP_KERNEL);
>> + if (!priv_stack_ptr) {
>> + err = -ENOMEM;
>> + goto out_free;
>> + }
>> + func[i]->aux->priv_stack_ptr = priv_stack_ptr;
>> + }
>> +
>> func[i]->jit_requested = 1;
>> func[i]->blinding_requested = prog->blinding_requested;
>> func[i]->aux->kfunc_tab = prog->aux->kfunc_tab;
>> @@ -20201,6 +20213,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
>> func[i]->aux->exception_boundary = env->seen_exception;
>> func[i] = bpf_int_jit_compile(func[i]);
>> if (!func[i]->jited) {
>> + free_percpu(func[i]->aux->priv_stack_ptr);
>> err = -ENOTSUPP;
>> goto out_free;
>> }
> Looks correct from leaks pov, but this is so hard to follow.
> I still don't like this imbalanced alloc/free.
> Either both need to be done by core or both by JIT.
>
> And JIT is probably better, since in:
> _alloc_percpu_gfp(func[i]->aux->stack_depth, 16
>
> 16 alignment is x86 specific.
Agree. I use alignment 16 to cover all architectures; for x86_64,
alignment 8 is used. I did some checking in the arch/ directory:

[~/work/bpf-next/arch (master)]$ find . -name 'net'
./arm/net
./mips/net
./parisc/net
./powerpc/net
./s390/net
./sparc/net
./x86/net
./arc/net
./arm64/net
./loongarch/net
./riscv/net
[~/work/bpf-next/arch (master)]$ egrep -r bpf_jit_free   # excluding non-definition matches
powerpc/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *fp)
sparc/net/bpf_jit_comp_32.c:void bpf_jit_free(struct bpf_prog *fp)
x86/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *prog)
arm64/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *prog)
riscv/net/bpf_jit_core.c:void bpf_jit_free(struct bpf_prog *prog)

Looks like all the important archs, like x86_64, arm64 and riscv, have
their own bpf_jit_free(); some others, like s390, do not. I think we
can do the allocation in the JIT. If s390 starts to implement private
stack, it can implement an arch-specific version of bpf_jit_free() at
that time.
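As a userspace illustration of the alignment contract under discussion (C11 aligned_alloc standing in for __alloc_percpu_gfp(); NR_CPUS is a made-up constant, and the real percpu allocator works very differently):

```c
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS	4

/* Illustrative stand-in for __alloc_percpu_gfp(stack_depth, align, ...):
 * one `align`-byte-aligned private stack region per cpu. aligned_alloc
 * requires the total size to be a multiple of the alignment, so the
 * per-cpu size is rounded up first. */
static void *alloc_priv_stacks(size_t stack_depth, size_t align)
{
	size_t per_cpu = (stack_depth + align - 1) & ~(align - 1);

	return aligned_alloc(align, per_cpu * NR_CPUS);
}
```

With align = 16 every per-cpu region starts on a 16-byte boundary, which is why 16 is the conservative choice across bpf-enabled archs even though x86_64 itself only needs 8.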
* Re: [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes
2024-11-05 2:47 ` Alexei Starovoitov
@ 2024-11-05 3:09 ` Yonghong Song
0 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 3:09 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 6:47 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> @@ -6070,11 +6105,23 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>> depth);
>> return -EACCES;
>> }
>> - depth += round_up_stack_depth(env, subprog[idx].stack_depth);
>> + subprog_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
>> + depth += subprog_depth;
>> if (depth > MAX_BPF_STACK && !*subtree_depth) {
>> *subtree_depth = depth;
>> *depth_frame = frame + 1;
>> }
>> + if (priv_stack_supported != NO_PRIV_STACK) {
>> + if (!subprog[idx].use_priv_stack) {
>> + if (subprog_depth > MAX_BPF_STACK) {
>> + verbose(env, "stack size of subprog %d is %d. Too large\n",
>> + idx, subprog_depth);
>> + return -EACCES;
>> + }
>> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
>> + subprog[idx].use_priv_stack = true;
>> + }
>> + }
> Hold on. If I'm reading this correctly this adaptive priv stack
> concept will make some subprogs with stack >= 64 to use priv_stack
> while other subprogs will still use normal stack?
> Same for the main prog. It may or may not use priv stack ?
>
> I guess this is ok-ish, but needs to be clearly explained in comments
> and commit log.
Will do.
> My first reaction to such an adaptive concept was negative, since
> such a "random" mix of priv stack in some subprogs makes
> the whole thing pretty hard to reason about,
> but I guess it's valid to use the normal stack when stack usage
> is small. No need to penalize every subprog.
>
> I wonder what others think about it.
Ya, other opinions are very welcome!
>
> Also it would be cleaner to rewrite above as:
> if (subprog_depth > MAX_BPF_STACK) {
> verbose();
> return -EACCESS;
> }
> if (priv_stack_supported == PRIV_STACK_ADAPTIVE &&
> subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
> subprog[idx].use_priv_stack = true;
>
> less indent and easier to read.
Okay, will do with less indent.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-05 2:51 ` Alexei Starovoitov
@ 2024-11-05 3:37 ` Yonghong Song
2024-11-05 20:26 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 3:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 6:51 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> In the previous patch, tracing progs are enabled for private stack since
>> recursion checking ensures the same bpf prog cannot run nested on
>> the same cpu.
>>
>> But it is still possible for a nested bpf subprog to run on the same cpu
>> if the same subprog is called in both the main prog and an async callback,
>> or in different async callbacks. For example,
>> main_prog
>> bpf_timer_set_callback(timer, timer_cb);
>> call sub1
>> sub1
>> ...
>> timer_cb
>> call sub1
>>
>> In the above case, a nested run of sub1 is possible, with one run in
>> process context and the other in softirq context. If this is the case,
>> the verifier will disable private stack for this bpf prog.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> include/linux/bpf_verifier.h | 2 ++
>> kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
>> 2 files changed, 39 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 0622c11a7e19..e921589abc72 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -669,6 +669,8 @@ struct bpf_subprog_info {
>> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
>> bool keep_fastcall_stack: 1;
>> bool use_priv_stack: 1;
>> + bool visited_with_priv_stack_accum: 1;
>> + bool visited_with_priv_stack: 1;
>>
>> u8 arg_cnt;
>> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 406195c433ea..e01b3f0fd314 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>> idx, subprog_depth);
>> return -EACCES;
>> }
>> - if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
>> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
>> subprog[idx].use_priv_stack = true;
>> + subprog[idx].visited_with_priv_stack = true;
>> + }
>> + } else {
>> + subprog[idx].visited_with_priv_stack = true;
> See suggestion for patch 3.
> It's cleaner to rewrite with a single visited_with_priv_stack = true; statement.
Ack.
>
>> }
>> }
>> continue_func:
>> @@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>> static int check_max_stack_depth(struct bpf_verifier_env *env)
>> {
>> struct bpf_subprog_info *si = env->subprog_info;
>> + enum priv_stack_mode orig_priv_stack_supported;
>> enum priv_stack_mode priv_stack_supported;
>> int ret, subtree_depth = 0, depth_frame;
>>
>> priv_stack_supported = bpf_enable_priv_stack(env->prog);
>> + orig_priv_stack_supported = priv_stack_supported;
>>
>> if (priv_stack_supported != NO_PRIV_STACK) {
>> for (int i = 0; i < env->subprog_cnt; i++) {
>> @@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
>> priv_stack_supported);
>> if (ret < 0)
>> return ret;
>> +
>> + if (priv_stack_supported != NO_PRIV_STACK) {
>> + for (int j = 0; j < env->subprog_cnt; j++) {
>> + if (si[j].visited_with_priv_stack_accum &&
>> + si[j].visited_with_priv_stack) {
>> + /* si[j] is visited by both main/async subprog
>> + * and another async subprog.
>> + */
>> + priv_stack_supported = NO_PRIV_STACK;
>> + break;
>> + }
>> + if (!si[j].visited_with_priv_stack_accum)
>> + si[j].visited_with_priv_stack_accum =
>> + si[j].visited_with_priv_stack;
>> + }
>> + }
>> + if (priv_stack_supported != NO_PRIV_STACK) {
>> + for (int j = 0; j < env->subprog_cnt; j++)
>> + si[j].visited_with_priv_stack = false;
>> + }
> I cannot understand what this algorithm is doing.
> What is the meaning of visited_with_priv_stack_accum ?
The following is an example to show how the algorithm works.
Let us say we have prog like
main_prog0 si[0]
    sub1 si[1]
        sub2 si[2]
async1 si[3]
    sub4 si[4]
    sub2 si[2]
async2 si[5]
    sub4 si[4]
    sub5 si[6]
Nine call sites over seven distinct subprogs (si[0] .. si[6]).
after iteration 1 (main_prog0)
visited_with_priv_stack_accum: si[i] = false for all i
visited_with_priv_stack: si[0] = si[1] = si[2] = true, others false
for all i, visited_with_priv_stack_accum[i] && visited_with_priv_stack[i]
is false, so main_prog0 can use the priv stack.
visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
visited_with_priv_stack cleared to false.
after iteration 2 (async1)
visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
visited_with_priv_stack: si[2] = si[3] = si[4] = true, others false
Here, si[2] appears in both visited_with_priv_stack_accum and
visited_with_priv_stack, so async1 cannot have a priv stack.
In my algorithm, I flipped the whole prog to no_priv_stack, which is
too conservative. We should just skip async1 and continue.
Let us say async1 does not get a priv stack while main_prog0 does.
/* the same as end of iteration 1 */
visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
visited_with_priv_stack cleared to false.
after iteration 3 (async2)
visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
visited_with_priv_stack: si[4] = si[5] = si[6] = true;
there is no conflict, so async2 can use the private stack.
If we had only one bit in bpf_subprog_info, then after marking subprogs
true while walking an async tree and later finding a conflict in that
tree (making the whole async tree ineligible for priv stack), it would
be hard to undo the previous markings.
So visited_with_priv_stack_accum accumulates the "true" results from
the main_prog/async trees.
Maybe we change two bit names to
visited_with_priv_stack
visited_with_priv_stack_tmp
?
>
>> }
>> }
>>
>> - if (priv_stack_supported == NO_PRIV_STACK && subtree_depth > MAX_BPF_STACK) {
>> - verbose(env, "combined stack size of %d calls is %d. Too large\n",
>> - depth_frame, subtree_depth);
>> - return -EACCES;
>> + if (priv_stack_supported == NO_PRIV_STACK) {
>> + if (subtree_depth > MAX_BPF_STACK) {
>> + verbose(env, "combined stack size of %d calls is %d. Too large\n",
>> + depth_frame, subtree_depth);
>> + return -EACCES;
>> + }
>> + if (orig_priv_stack_supported == PRIV_STACK_ADAPTIVE) {
>> + for (int i = 0; i < env->subprog_cnt; i++)
>> + si[i].use_priv_stack = false;
>> + }
> why? This patch is supposed to clear use_priv_stack only from subprogs
> that are called from both contexts.
> All other subprogs are fine.
>
> But it seems the algo attempts to detect one such calling scenario
> and disables priv_stack everywhere?
Sorry about this. Will fix in the next revision.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-05 3:07 ` Yonghong Song
@ 2024-11-05 3:44 ` Yonghong Song
2024-11-05 5:19 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 3:44 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 7:07 PM, Yonghong Song wrote:
>
> On 11/4/24 5:38 PM, Alexei Starovoitov wrote:
>> On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song
>> <yonghong.song@linux.dev> wrote:
>>> For any main prog or subprogs, allocate private stack space if
>>> requested
>>> by subprog info or main prog. The alignment for private stack is 16
>>> since maximum stack alignment is 16 for bpf-enabled archs.
>>>
>>> If jit failed, the allocated private stack will be freed in the same
>>> function where the allocation happens. If jit succeeded, e.g., for
>>> x86_64 arch, the allocated private stack is freed in arch specific
>>> implementation of bpf_jit_free().
>>>
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>> arch/x86/net/bpf_jit_comp.c | 1 +
>>> include/linux/bpf.h | 1 +
>>> kernel/bpf/core.c | 19 ++++++++++++++++---
>>> kernel/bpf/verifier.c | 13 +++++++++++++
>>> 4 files changed, 31 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>>> index 06b080b61aa5..59d294b8dd67 100644
>>> --- a/arch/x86/net/bpf_jit_comp.c
>>> +++ b/arch/x86/net/bpf_jit_comp.c
>>> @@ -3544,6 +3544,7 @@ void bpf_jit_free(struct bpf_prog *prog)
>>> prog->bpf_func = (void *)prog->bpf_func -
>>> cfi_get_offset();
>>> hdr = bpf_jit_binary_pack_hdr(prog);
>>> bpf_jit_binary_pack_free(hdr, NULL);
>>> + free_percpu(prog->aux->priv_stack_ptr);
>>> WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog));
>>> }
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 8db3c5d7404b..8a3ea7440a4a 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1507,6 +1507,7 @@ struct bpf_prog_aux {
>>> u32 max_rdwr_access;
>>> struct btf *attach_btf;
>>> const struct bpf_ctx_arg_aux *ctx_arg_info;
>>> + void __percpu *priv_stack_ptr;
>>> struct mutex dst_mutex; /* protects dst_* pointers below,
>>> *after* prog becomes visible */
>>> struct bpf_prog *dst_prog;
>>> struct bpf_trampoline *dst_trampoline;
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 14d9288441f2..f7a3e93c41e1 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -2396,6 +2396,7 @@ static void bpf_prog_select_func(struct
>>> bpf_prog *fp)
>>> */
>>> struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int
>>> *err)
>>> {
>>> + void __percpu *priv_stack_ptr = NULL;
>>> /* In case of BPF to BPF calls, verifier did all the prep
>>> * work with regards to JITing, etc.
>>> */
>>> @@ -2421,11 +2422,23 @@ struct bpf_prog
>>> *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
>>> if (*err)
>>> return fp;
>>>
>>> + if (fp->aux->use_priv_stack && fp->aux->stack_depth) {
>>> + priv_stack_ptr =
>>> __alloc_percpu_gfp(fp->aux->stack_depth, 16, GFP_KERNEL);
>>> + if (!priv_stack_ptr) {
>>> + *err = -ENOMEM;
>>> + return fp;
>>> + }
>>> + fp->aux->priv_stack_ptr = priv_stack_ptr;
>>> + }
>>> +
>>> fp = bpf_int_jit_compile(fp);
>>> bpf_prog_jit_attempt_done(fp);
>>> - if (!fp->jited && jit_needed) {
>>> - *err = -ENOTSUPP;
>>> - return fp;
>>> + if (!fp->jited) {
>>> + free_percpu(priv_stack_ptr);
>>> + if (jit_needed) {
>>> + *err = -ENOTSUPP;
>>> + return fp;
>>> + }
>>> }
>>> } else {
>>> *err = bpf_prog_offload_compile(fp);
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index e01b3f0fd314..03ae76d57076 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -20073,6 +20073,7 @@ static int jit_subprogs(struct
>>> bpf_verifier_env *env)
>>> {
>>> struct bpf_prog *prog = env->prog, **func, *tmp;
>>> int i, j, subprog_start, subprog_end = 0, len, subprog;
>>> + void __percpu *priv_stack_ptr;
>>> struct bpf_map *map_ptr;
>>> struct bpf_insn *insn;
>>> void *old_bpf_func;
>>> @@ -20169,6 +20170,17 @@ static int jit_subprogs(struct
>>> bpf_verifier_env *env)
>>>
>>> func[i]->aux->name[0] = 'F';
>>> func[i]->aux->stack_depth =
>>> env->subprog_info[i].stack_depth;
>>> +
>>> + if (env->subprog_info[i].use_priv_stack &&
>>> func[i]->aux->stack_depth) {
>>> + priv_stack_ptr =
>>> __alloc_percpu_gfp(func[i]->aux->stack_depth, 16,
>>> + GFP_KERNEL);
>>> + if (!priv_stack_ptr) {
>>> + err = -ENOMEM;
>>> + goto out_free;
>>> + }
>>> + func[i]->aux->priv_stack_ptr = priv_stack_ptr;
>>> + }
>>> +
>>> func[i]->jit_requested = 1;
>>> func[i]->blinding_requested =
>>> prog->blinding_requested;
>>> func[i]->aux->kfunc_tab = prog->aux->kfunc_tab;
>>> @@ -20201,6 +20213,7 @@ static int jit_subprogs(struct
>>> bpf_verifier_env *env)
>>> func[i]->aux->exception_boundary = env->seen_exception;
>>> func[i] = bpf_int_jit_compile(func[i]);
>>> if (!func[i]->jited) {
>>> + free_percpu(func[i]->aux->priv_stack_ptr);
>>> err = -ENOTSUPP;
>>> goto out_free;
>>> }
>> Looks correct from leaks pov, but this is so hard to follow.
>> I still don't like this imbalanced alloc/free.
>> Either both need to be done by core or both by JIT.
>>
>> And JIT is probably better, since in:
>> _alloc_percpu_gfp(func[i]->aux->stack_depth, 16
>>
>> 16 alignment is x86 specific.
>
Sorry, I need to fix my formatting. The following is a reformat.
Agree. I used alignment 16 to cover all architectures; for x86_64,
alignment 8 is used. I did some checking in the arch/ directory.
[~/work/bpf-next/arch (master)]$ find . -name 'net'
./arm/net
./mips/net
./parisc/net
./powerpc/net
./s390/net
./sparc/net
./x86/net
./arc/net
./arm64/net
./loongarch/net
./riscv/net
[~/work/bpf-next/arch (master)]$ egrep -r bpf_jit_free (matches that are not function definitions excluded)
powerpc/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *fp)
sparc/net/bpf_jit_comp_32.c:void bpf_jit_free(struct bpf_prog *fp)
x86/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *prog)
arm64/net/bpf_jit_comp.c:void bpf_jit_free(struct bpf_prog *prog)
riscv/net/bpf_jit_core.c:void bpf_jit_free(struct bpf_prog *prog)
Looks like all important archs (x86_64, arm64, riscv) have their own
bpf_jit_free(). Some others, like s390, do not. I think we can do the
allocation in the JIT. If s390 starts to implement private stack, it
can implement an arch-specific version of bpf_jit_free() at that time.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 2:53 ` Yonghong Song
@ 2024-11-05 3:50 ` Yonghong Song
2024-11-05 4:28 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 3:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 6:53 PM, Yonghong Song wrote:
>
> On 11/4/24 5:55 PM, Alexei Starovoitov wrote:
>> On Mon, Nov 4, 2024 at 5:35 PM Yonghong Song
>> <yonghong.song@linux.dev> wrote:
>>>
>>> On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
>>>> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song
>>>> <yonghong.song@linux.dev> wrote:
>>>>> The bpf_prog_check_recur() function is currently used by trampoline
>>>>> and tracing programs (also using trampoline) to check whether a
>>>>> particular prog supports recursion checking or not. The default case
>>>>> (non-trampoline progs) returns true in the current implementation.
>>>>>
>>>>> Let us make the non-trampoline prog recursion check return false
>>>>> instead. It does not impact any existing use cases and allows the
>>>>> function to be used outside the trampoline context in the next patch.
>>>> Does not impact ?! But it does.
>>>> This patch removes recursion check from fentry progs.
>>>> This cannot be right.
>>> The original bpf_prog_check_recur() implementation:
>>>
>>> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
>>> {
>>> switch (resolve_prog_type(prog)) {
>>> case BPF_PROG_TYPE_TRACING:
>>> return prog->expected_attach_type != BPF_TRACE_ITER;
>>> case BPF_PROG_TYPE_STRUCT_OPS:
>>> case BPF_PROG_TYPE_LSM:
>>> return false;
>>> default:
>>> return true;
>>> }
>>> }
>>>
>>> fentry prog is a TRACING prog, so it is covered. Did I miss anything?
>> I see. This is way too subtle.
>> You're correct that fentry is TYPE_TRACING,
>> so it could have "worked" if it was used to build trampolines only.
>>
>> But this helper is called for other prog types:
>>
>> case BPF_FUNC_task_storage_get:
>> if (bpf_prog_check_recur(prog))
>> return &bpf_task_storage_get_recur_proto;
>> return &bpf_task_storage_get_proto;
>>
>> so it's still not correct, but for a different reason.
>
> There are four uses for func bpf_prog_check_recur() in kernel based on
> cscope: 0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if
> (bpf_prog_check_recur(prog)) 1 kernel/bpf/trampoline.c
> bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog)) 2
> kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if
> (bpf_prog_check_recur(prog)) 3 kernel/trace/bpf_trace.c
> bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog)) The 2nd
> and 3rd ones are in bpf_trace.c. 1444 static const struct
> bpf_func_proto * 1445 bpf_tracing_func_proto(enum bpf_func_id func_id,
> const struct bpf_prog *prog) 1446 { 1447 switch (func_id) { ... 1548
> case BPF_FUNC_task_storage_get: 1549 if (bpf_prog_check_recur(prog))
> 1550 return &bpf_task_storage_get_recur_proto; 1551 return
> &bpf_task_storage_get_proto; 1552 case BPF_FUNC_task_storage_delete:
> 1553 if (bpf_prog_check_recur(prog)) 1554 return
> &bpf_task_storage_delete_recur_proto; 1555 return
> &bpf_task_storage_delete_proto; ... 1568 default: 1569 return
> bpf_base_func_proto(func_id, prog); 1570 } 1571 } They are used for
> tracing programs. So we should be safe here. But if you think that
> changing bpf_proc_check_recur() and calling function
> bpf_prog_check_recur() in bpf_enable_priv_stack() is too subtle, I can
> go back to my original approach which makes all supported prog types
> explicit in bpf_enable_priv_stack().
Sorry. Format issue again. The below is a better format:
There are four uses for func bpf_prog_check_recur() in kernel based on cscope:
0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if (bpf_prog_check_recur(prog))
1 kernel/bpf/trampoline.c bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog))
2 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if (bpf_prog_check_recur(prog))
3 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog))
The 2nd and 3rd ones are in bpf_trace.c.
1444 static const struct bpf_func_proto *
1445 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
1446 {
1447 switch (func_id) {
...
1548 case BPF_FUNC_task_storage_get:
1549 if (bpf_prog_check_recur(prog))
1550 return &bpf_task_storage_get_recur_proto;
1551 return &bpf_task_storage_get_proto;
1552 case BPF_FUNC_task_storage_delete:
1553 if (bpf_prog_check_recur(prog))
1554 return &bpf_task_storage_delete_recur_proto;
1555 return &bpf_task_storage_delete_proto;
...
1568 default:
1569 return bpf_base_func_proto(func_id, prog);
1570 }
1571 }
They are used for tracing programs. So we should be safe here. But if you think that
changing bpf_prog_check_recur() and calling bpf_prog_check_recur()
in bpf_enable_priv_stack() is too subtle, I can go back to my original approach
which makes all supported prog types explicit in bpf_enable_priv_stack().
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 3:50 ` Yonghong Song
@ 2024-11-05 4:28 ` Alexei Starovoitov
2024-11-05 6:02 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 4:28 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 7:50 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 11/4/24 6:53 PM, Yonghong Song wrote:
> >
> > On 11/4/24 5:55 PM, Alexei Starovoitov wrote:
> >> On Mon, Nov 4, 2024 at 5:35 PM Yonghong Song
> >> <yonghong.song@linux.dev> wrote:
> >>>
> >>> On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
> >>>> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song
> >>>> <yonghong.song@linux.dev> wrote:
> >>>>> The bpf_prog_check_recur() function is currently used by trampoline
> >>>>> and tracing programs (also using trampoline) to check whether a
> >>>>> particular prog supports recursion checking or not. The default case
> >>>>> (non-trampoline progs) returns true in the current implementation.
> >>>>>
> >>>>> Let us make the non-trampoline prog recursion check return false
> >>>>> instead. It does not impact any existing use cases and allows the
> >>>>> function to be used outside the trampoline context in the next patch.
> >>>> Does not impact ?! But it does.
> >>>> This patch removes recursion check from fentry progs.
> >>>> This cannot be right.
> >>> The original bpf_prog_check_recur() implementation:
> >>>
> >>> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
> >>> {
> >>> switch (resolve_prog_type(prog)) {
> >>> case BPF_PROG_TYPE_TRACING:
> >>> return prog->expected_attach_type != BPF_TRACE_ITER;
> >>> case BPF_PROG_TYPE_STRUCT_OPS:
> >>> case BPF_PROG_TYPE_LSM:
> >>> return false;
> >>> default:
> >>> return true;
> >>> }
> >>> }
> >>>
> >>> fentry prog is a TRACING prog, so it is covered. Did I miss anything?
> >> I see. This is way too subtle.
> >> You're correct that fentry is TYPE_TRACING,
> >> so it could have "worked" if it was used to build trampolines only.
> >>
> >> But this helper is called for other prog types:
> >>
> >> case BPF_FUNC_task_storage_get:
> >> if (bpf_prog_check_recur(prog))
> >> return &bpf_task_storage_get_recur_proto;
> >> return &bpf_task_storage_get_proto;
> >>
> >> so it's still not correct, but for a different reason.
> >
> > There are four uses for func bpf_prog_check_recur() in kernel based on
> > cscope: 0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if
> > (bpf_prog_check_recur(prog)) 1 kernel/bpf/trampoline.c
> > bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog)) 2
> > kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if
> > (bpf_prog_check_recur(prog)) 3 kernel/trace/bpf_trace.c
> > bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog)) The 2nd
> > and 3rd ones are in bpf_trace.c. 1444 static const struct
> > bpf_func_proto * 1445 bpf_tracing_func_proto(enum bpf_func_id func_id,
> > const struct bpf_prog *prog) 1446 { 1447 switch (func_id) { ... 1548
> > case BPF_FUNC_task_storage_get: 1549 if (bpf_prog_check_recur(prog))
> > 1550 return &bpf_task_storage_get_recur_proto; 1551 return
> > &bpf_task_storage_get_proto; 1552 case BPF_FUNC_task_storage_delete:
> > 1553 if (bpf_prog_check_recur(prog)) 1554 return
> > &bpf_task_storage_delete_recur_proto; 1555 return
> > &bpf_task_storage_delete_proto; ... 1568 default: 1569 return
> > bpf_base_func_proto(func_id, prog); 1570 } 1571 } They are used for
> > tracing programs. So we should be safe here. But if you think that
> > changing bpf_proc_check_recur() and calling function
> > bpf_prog_check_recur() in bpf_enable_priv_stack() is too subtle, I can
> > go back to my original approach which makes all supported prog types
> > explicit in bpf_enable_priv_stack().
>
> Sorry. Format issue again. The below is a better format:
>
> There are four uses for func bpf_prog_check_recur() in kernel based on cscope:
>
> 0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if (bpf_prog_check_recur(prog))
> 1 kernel/bpf/trampoline.c bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog))
> 2 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if (bpf_prog_check_recur(prog))
> 3 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog))
>
> The 2nd and 3rd ones are in bpf_trace.c.
>
> 1444 static const struct bpf_func_proto *
> 1445 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> 1446 {
> 1447 switch (func_id) {
> ...
> 1548 case BPF_FUNC_task_storage_get:
> 1549 if (bpf_prog_check_recur(prog))
> 1550 return &bpf_task_storage_get_recur_proto;
> 1551 return &bpf_task_storage_get_proto;
> 1552 case BPF_FUNC_task_storage_delete:
> 1553 if (bpf_prog_check_recur(prog))
> 1554 return &bpf_task_storage_delete_recur_proto;
> 1555 return &bpf_task_storage_delete_proto;
> ...
> 1568 default:
> 1569 return bpf_base_func_proto(func_id, prog);
> 1570 }
> 1571 }
>
> They are used for tracing programs. So we should be safe here. But if you think that
> changing bpf_prog_check_recur() and calling bpf_prog_check_recur()
> in bpf_enable_priv_stack() is too subtle, I can go back to my original approach
> which makes all supported prog types explicit in bpf_enable_priv_stack().
What do you mean 'it's safe' ?
If you change bpf_prog_check_recur() to return false like this patch does,
then kprobe progs will not have recursion protection
when calling the task_storage_get() helper.
In the context of this helper it means that kprobe progs have to use:
nobusy = bpf_task_storage_trylock();
With this patch as-is there will be a deadlock in bpf_task_storage_lock()
when kprobe is using task storage.
So it looks broken to me.
I also don't understand the point of this patch 2.
The patch 3 can still do:
+ switch (prog->type) {
+ case BPF_PROG_TYPE_KPROBE:
+ case BPF_PROG_TYPE_TRACEPOINT:
+ case BPF_PROG_TYPE_PERF_EVENT:
+ case BPF_PROG_TYPE_RAW_TRACEPOINT:
+ return PRIV_STACK_ADAPTIVE;
+ default:
+ break;
+ }
+
+ if (!bpf_prog_check_recur(prog))
+ return NO_PRIV_STACK;
which would mean that iter, lsm, struct_ops will not be allowed
to use priv stack.
Unless struct_ops will explicit request priv stack via bool flag.
Then we will also add recursion protection in trampoline.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-05 3:44 ` Yonghong Song
@ 2024-11-05 5:19 ` Alexei Starovoitov
2024-11-05 6:05 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 5:19 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 7:44 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> Agree. I use alignment 16 to cover all architectures. for x86_64,
> alignment 8 is used. I did some checking in arch/ directory.
hmm. I'm pretty sure x86 psABI requires 16-byte stack alignment,
but I don't know why.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 4:28 ` Alexei Starovoitov
@ 2024-11-05 6:02 ` Yonghong Song
2024-11-05 15:50 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 6:02 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 8:28 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 7:50 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 11/4/24 6:53 PM, Yonghong Song wrote:
>>> On 11/4/24 5:55 PM, Alexei Starovoitov wrote:
>>>> On Mon, Nov 4, 2024 at 5:35 PM Yonghong Song
>>>> <yonghong.song@linux.dev> wrote:
>>>>> On 11/4/24 5:21 PM, Alexei Starovoitov wrote:
>>>>>> On Mon, Nov 4, 2024 at 11:35 AM Yonghong Song
>>>>>> <yonghong.song@linux.dev> wrote:
>>>>>>> The bpf_prog_check_recur() function is currently used by trampoline
>>>>>>> and tracing programs (also using trampoline) to check whether a
>>>>>>> particular prog supports recursion checking or not. The default case
>>>>>>> (non-trampoline progs) returns true in the current implementation.
>>>>>>>
>>>>>>> Let us make the non-trampoline prog recursion check return false
>>>>>>> instead. It does not impact any existing use cases and allows the
>>>>>>> function to be used outside the trampoline context in the next patch.
>>>>>> Does not impact ?! But it does.
>>>>>> This patch removes recursion check from fentry progs.
>>>>>> This cannot be right.
>>>>> The original bpf_prog_check_recur() implementation:
>>>>>
>>>>> static inline bool bpf_prog_check_recur(const struct bpf_prog *prog)
>>>>> {
>>>>> switch (resolve_prog_type(prog)) {
>>>>> case BPF_PROG_TYPE_TRACING:
>>>>> return prog->expected_attach_type != BPF_TRACE_ITER;
>>>>> case BPF_PROG_TYPE_STRUCT_OPS:
>>>>> case BPF_PROG_TYPE_LSM:
>>>>> return false;
>>>>> default:
>>>>> return true;
>>>>> }
>>>>> }
>>>>>
>>>>> fentry prog is a TRACING prog, so it is covered. Did I miss anything?
>>>> I see. This is way too subtle.
>>>> You're correct that fentry is TYPE_TRACING,
>>>> so it could have "worked" if it was used to build trampolines only.
>>>>
>>>> But this helper is called for other prog types:
>>>>
>>>> case BPF_FUNC_task_storage_get:
>>>> if (bpf_prog_check_recur(prog))
>>>> return &bpf_task_storage_get_recur_proto;
>>>> return &bpf_task_storage_get_proto;
>>>>
>>>> so it's still not correct, but for a different reason.
>>> There are four uses for func bpf_prog_check_recur() in kernel based on
>>> cscope: 0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if
>>> (bpf_prog_check_recur(prog)) 1 kernel/bpf/trampoline.c
>>> bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog)) 2
>>> kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if
>>> (bpf_prog_check_recur(prog)) 3 kernel/trace/bpf_trace.c
>>> bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog)) The 2nd
>>> and 3rd ones are in bpf_trace.c. 1444 static const struct
>>> bpf_func_proto * 1445 bpf_tracing_func_proto(enum bpf_func_id func_id,
>>> const struct bpf_prog *prog) 1446 { 1447 switch (func_id) { ... 1548
>>> case BPF_FUNC_task_storage_get: 1549 if (bpf_prog_check_recur(prog))
>>> 1550 return &bpf_task_storage_get_recur_proto; 1551 return
>>> &bpf_task_storage_get_proto; 1552 case BPF_FUNC_task_storage_delete:
>>> 1553 if (bpf_prog_check_recur(prog)) 1554 return
>>> &bpf_task_storage_delete_recur_proto; 1555 return
>>> &bpf_task_storage_delete_proto; ... 1568 default: 1569 return
>>> bpf_base_func_proto(func_id, prog); 1570 } 1571 } They are used for
>>> tracing programs. So we should be safe here. But if you think that
>>> changing bpf_proc_check_recur() and calling function
>>> bpf_prog_check_recur() in bpf_enable_priv_stack() is too subtle, I can
>>> go back to my original approach which makes all supported prog types
>>> explicit in bpf_enable_priv_stack().
>> Sorry. Format issue again. The below is a better format:
>>
>> There are four uses for func bpf_prog_check_recur() in kernel based on cscope:
>>
>> 0 kernel/bpf/trampoline.c bpf_trampoline_enter 1053 if (bpf_prog_check_recur(prog))
>> 1 kernel/bpf/trampoline.c bpf_trampoline_exit 1068 if (bpf_prog_check_recur(prog))
>> 2 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1549 if (bpf_prog_check_recur(prog))
>> 3 kernel/trace/bpf_trace.c bpf_tracing_func_proto 1553 if (bpf_prog_check_recur(prog))
>>
>> The 2nd and 3rd ones are in bpf_trace.c.
>>
>> 1444 static const struct bpf_func_proto *
>> 1445 bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> 1446 {
>> 1447 switch (func_id) {
>> ...
>> 1548 case BPF_FUNC_task_storage_get:
>> 1549 if (bpf_prog_check_recur(prog))
>> 1550 return &bpf_task_storage_get_recur_proto;
>> 1551 return &bpf_task_storage_get_proto;
>> 1552 case BPF_FUNC_task_storage_delete:
>> 1553 if (bpf_prog_check_recur(prog))
>> 1554 return &bpf_task_storage_delete_recur_proto;
>> 1555 return &bpf_task_storage_delete_proto;
>> ...
>> 1568 default:
>> 1569 return bpf_base_func_proto(func_id, prog);
>> 1570 }
>> 1571 }
>>
>> They are used for tracing programs. So we should be safe here. But if you think that
>> changing bpf_prog_check_recur() and calling bpf_prog_check_recur()
>> in bpf_enable_priv_stack() is too subtle, I can go back to my original approach
>> which makes all supported prog types explicit in bpf_enable_priv_stack().
> What do you mean 'it's safe' ?
> If you change bpf_prog_check_recur() to return false like this patch does
> then kprobe progs will not have recursion protection
> calling task_storage_get() helper.
> In the context of this helper it means that kprobe progs have to use:
> nobusy = bpf_task_storage_trylock();
> With this patch as-is there will be a deadlock in bpf_task_storage_lock()
> when kprobe is using task storage.
> So it looks broken to me.
>
> I also don't understand the point of this patch 2.
> The patch 3 can still do:
>
> + switch (prog->type) {
> + case BPF_PROG_TYPE_KPROBE:
> + case BPF_PROG_TYPE_TRACEPOINT:
> + case BPF_PROG_TYPE_PERF_EVENT:
> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> + return PRIV_STACK_ADAPTIVE;
> + default:
> + break;
> + }
> +
> + if (!bpf_prog_check_recur(prog))
> + return NO_PRIV_STACK;
>
> which would mean that iter, lsm, struct_ops will not be allowed
> to use priv stack.
One example is a TC prog: bpf_prog_check_recur(prog)
will return true (meaning recursion is supported), but private stack
does not really support TC progs, so the logic will become more
complicated.
I am totally okay with removing patch 2 and going back to my
previous approach of explicitly listing the prog types that support
private stack.
>
> Unless struct_ops will explicit request priv stack via bool flag.
> Then we will also add recursion protection in trampoline.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs
2024-11-05 5:19 ` Alexei Starovoitov
@ 2024-11-05 6:05 ` Yonghong Song
0 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 6:05 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/4/24 9:19 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 7:44 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> Agree. I use alignment 16 to cover all architectures. for x86_64,
>> alignment 8 is used. I did some checking in arch/ directory.
> hmm. I'm pretty sure x86 psABI requires 16-byte stack alignment,
> but I don't know why.
One possible reason is to accommodate 16-byte values like __int128 or
128-bit floating point. I will make the percpu allocation 16-byte
aligned.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 6:02 ` Yonghong Song
@ 2024-11-05 15:50 ` Alexei Starovoitov
2024-11-05 16:33 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 15:50 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 10:02 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> >
> > I also don't understand the point of this patch 2.
> > The patch 3 can still do:
> >
> > + switch (prog->type) {
> > + case BPF_PROG_TYPE_KPROBE:
> > + case BPF_PROG_TYPE_TRACEPOINT:
> > + case BPF_PROG_TYPE_PERF_EVENT:
> > + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> > + return PRIV_STACK_ADAPTIVE;
> > + default:
> > + break;
> > + }
> > +
> > + if (!bpf_prog_check_recur(prog))
> > + return NO_PRIV_STACK;
> >
> > which would mean that iter, lsm, struct_ops will not be allowed
> > to use priv stack.
>
> One example is e.g. a TC prog. Since bpf_prog_check_recur(prog)
> will return true (means supporting recursion), and private stack
> does not really support TC prog, the logic will become more
> complicated.
>
> I am totally okay with removing patch 2 and go back to my
> previous approach to explicitly list prog types supporting
> private stack.
The point of reusing bpf_prog_check_recur() is that we don't
need to duplicate the logic.
We can still do something like:
switch (prog->type) {
case BPF_PROG_TYPE_KPROBE:
case BPF_PROG_TYPE_TRACEPOINT:
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_RAW_TRACEPOINT:
return PRIV_STACK_ADAPTIVE;
case BPF_PROG_TYPE_TRACING:
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS:
if (bpf_prog_check_recur(prog))
return PRIV_STACK_ADAPTIVE;
/* fallthrough */
default:
return NO_PRIV_STACK;
}
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 15:50 ` Alexei Starovoitov
@ 2024-11-05 16:33 ` Yonghong Song
2024-11-05 16:38 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 16:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 7:50 AM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 10:02 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> I also don't understand the point of this patch 2.
>>> The patch 3 can still do:
>>>
>>> + switch (prog->type) {
>>> + case BPF_PROG_TYPE_KPROBE:
>>> + case BPF_PROG_TYPE_TRACEPOINT:
>>> + case BPF_PROG_TYPE_PERF_EVENT:
>>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
>>> + return PRIV_STACK_ADAPTIVE;
>>> + default:
>>> + break;
>>> + }
>>> +
>>> + if (!bpf_prog_check_recur(prog))
>>> + return NO_PRIV_STACK;
>>>
>>> which would mean that iter, lsm, struct_ops will not be allowed
>>> to use priv stack.
>> One example is e.g. a TC prog. Since bpf_prog_check_recur(prog)
>> will return true (means supporting recursion), and private stack
>> does not really support TC prog, the logic will become more
>> complicated.
>>
>> I am totally okay with removing patch 2 and go back to my
>> previous approach to explicitly list prog types supporting
>> private stack.
> The point of reusing bpf_prog_check_recur() is that we don't
> need to duplicate the logic.
> We can still do something like:
> switch (prog->type) {
> case BPF_PROG_TYPE_KPROBE:
> case BPF_PROG_TYPE_TRACEPOINT:
> case BPF_PROG_TYPE_PERF_EVENT:
> case BPF_PROG_TYPE_RAW_TRACEPOINT:
> return PRIV_STACK_ADAPTIVE;
> case BPF_PROG_TYPE_TRACING:
> case BPF_PROG_TYPE_LSM:
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (bpf_prog_check_recur())
> return PRIV_STACK_ADAPTIVE;
> /* fallthrough */
> default:
> return NO_PRIV_STACK;
> }
Right. Listing trampoline-related prog types explicitly
and using bpf_prog_check_recur() will be safe.
One difference: for BPF_PROG_TYPE_STRUCT_OPS, PRIV_STACK_ALWAYS
should be returned. I will make an adjustment like
switch (prog->type) {
case BPF_PROG_TYPE_KPROBE:
case BPF_PROG_TYPE_TRACEPOINT:
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_RAW_TRACEPOINT:
return PRIV_STACK_ADAPTIVE;
case BPF_PROG_TYPE_TRACING:
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS:
if (bpf_prog_check_recur(prog)) {
if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
return PRIV_STACK_ALWAYS;
else
return PRIV_STACK_ADAPTIVE;
}
/* fallthrough */
default:
return NO_PRIV_STACK;
}
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 16:33 ` Yonghong Song
@ 2024-11-05 16:38 ` Alexei Starovoitov
2024-11-05 16:48 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 16:38 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Nov 5, 2024 at 8:33 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
>
> On 11/5/24 7:50 AM, Alexei Starovoitov wrote:
> > On Mon, Nov 4, 2024 at 10:02 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>> I also don't understand the point of this patch 2.
> >>> The patch 3 can still do:
> >>>
> >>> + switch (prog->type) {
> >>> + case BPF_PROG_TYPE_KPROBE:
> >>> + case BPF_PROG_TYPE_TRACEPOINT:
> >>> + case BPF_PROG_TYPE_PERF_EVENT:
> >>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> >>> + return PRIV_STACK_ADAPTIVE;
> >>> + default:
> >>> + break;
> >>> + }
> >>> +
> >>> + if (!bpf_prog_check_recur(prog))
> >>> + return NO_PRIV_STACK;
> >>>
> >>> which would mean that iter, lsm, struct_ops will not be allowed
> >>> to use priv stack.
> >> One example is e.g. a TC prog. Since bpf_prog_check_recur(prog)
> >> will return true (means supporting recursion), and private stack
> >> does not really support TC prog, the logic will become more
> >> complicated.
> >>
> >> I am totally okay with removing patch 2 and go back to my
> >> previous approach to explicitly list prog types supporting
> >> private stack.
> > The point of reusing bpf_prog_check_recur() is that we don't
> > need to duplicate the logic.
> > We can still do something like:
> > switch (prog->type) {
> > case BPF_PROG_TYPE_KPROBE:
> > case BPF_PROG_TYPE_TRACEPOINT:
> > case BPF_PROG_TYPE_PERF_EVENT:
> > case BPF_PROG_TYPE_RAW_TRACEPOINT:
> > return PRIV_STACK_ADAPTIVE;
> > case BPF_PROG_TYPE_TRACING:
> > case BPF_PROG_TYPE_LSM:
> > case BPF_PROG_TYPE_STRUCT_OPS:
> > if (bpf_prog_check_recur())
> > return PRIV_STACK_ADAPTIVE;
> > /* fallthrough */
> > default:
> > return NO_PRIV_STACK;
> > }
>
> Right. Listing trampoline related prog types explicitly
> and using bpf_prog_check_recur() will be safe.
>
> One thing is for BPF_PROG_TYPE_STRUCT_OPS, PRIV_STACK_ALWAYS
> will be returned. I will make adjustment like
>
> switch (prog->type) {
> case BPF_PROG_TYPE_KPROBE:
> case BPF_PROG_TYPE_TRACEPOINT:
> case BPF_PROG_TYPE_PERF_EVENT:
> case BPF_PROG_TYPE_RAW_TRACEPOINT:
> return PRIV_STACK_ADAPTIVE;
> case BPF_PROG_TYPE_TRACING:
> case BPF_PROG_TYPE_LSM:
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (bpf_prog_check_recur()) {
> if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> return PRIV_STACK_ALWAYS;
hmm. definitely not unconditionally.
Only when explicitly requested in callback.
Something like this:
case BPF_PROG_TYPE_TRACING:
case BPF_PROG_TYPE_LSM:
if (bpf_prog_check_recur(prog))
return PRIV_STACK_ADAPTIVE;
case BPF_PROG_TYPE_STRUCT_OPS:
if (prog->aux->priv_stack_requested)
return PRIV_STACK_ALWAYS;
default:
return NO_PRIV_STACK;
and then we also change bpf_prog_check_recur()
to return true when prog->aux->priv_stack_requested is set
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 16:38 ` Alexei Starovoitov
@ 2024-11-05 16:48 ` Yonghong Song
2024-11-05 17:47 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 16:48 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 8:38 AM, Alexei Starovoitov wrote:
> On Tue, Nov 5, 2024 at 8:33 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 11/5/24 7:50 AM, Alexei Starovoitov wrote:
>>> On Mon, Nov 4, 2024 at 10:02 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>> I also don't understand the point of this patch 2.
>>>>> The patch 3 can still do:
>>>>>
>>>>> + switch (prog->type) {
>>>>> + case BPF_PROG_TYPE_KPROBE:
>>>>> + case BPF_PROG_TYPE_TRACEPOINT:
>>>>> + case BPF_PROG_TYPE_PERF_EVENT:
>>>>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
>>>>> + return PRIV_STACK_ADAPTIVE;
>>>>> + default:
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + if (!bpf_prog_check_recur(prog))
>>>>> + return NO_PRIV_STACK;
>>>>>
>>>>> which would mean that iter, lsm, struct_ops will not be allowed
>>>>> to use priv stack.
>>>> One example is e.g. a TC prog. Since bpf_prog_check_recur(prog)
>>>> will return true (means supporting recursion), and private stack
>>>> does not really support TC prog, the logic will become more
>>>> complicated.
>>>>
>>>> I am totally okay with removing patch 2 and go back to my
>>>> previous approach to explicitly list prog types supporting
>>>> private stack.
>>> The point of reusing bpf_prog_check_recur() is that we don't
>>> need to duplicate the logic.
>>> We can still do something like:
>>> switch (prog->type) {
>>> case BPF_PROG_TYPE_KPROBE:
>>> case BPF_PROG_TYPE_TRACEPOINT:
>>> case BPF_PROG_TYPE_PERF_EVENT:
>>> case BPF_PROG_TYPE_RAW_TRACEPOINT:
>>> return PRIV_STACK_ADAPTIVE;
>>> case BPF_PROG_TYPE_TRACING:
>>> case BPF_PROG_TYPE_LSM:
>>> case BPF_PROG_TYPE_STRUCT_OPS:
>>> if (bpf_prog_check_recur())
>>> return PRIV_STACK_ADAPTIVE;
>>> /* fallthrough */
>>> default:
>>> return NO_PRIV_STACK;
>>> }
>> Right. Listing trampoline related prog types explicitly
>> and using bpf_prog_check_recur() will be safe.
>>
>> One thing is for BPF_PROG_TYPE_STRUCT_OPS, PRIV_STACK_ALWAYS
>> will be returned. I will make adjustment like
>>
>> switch (prog->type) {
>> case BPF_PROG_TYPE_KPROBE:
>> case BPF_PROG_TYPE_TRACEPOINT:
>> case BPF_PROG_TYPE_PERF_EVENT:
>> case BPF_PROG_TYPE_RAW_TRACEPOINT:
>> return PRIV_STACK_ADAPTIVE;
>> case BPF_PROG_TYPE_TRACING:
>> case BPF_PROG_TYPE_LSM:
>> case BPF_PROG_TYPE_STRUCT_OPS:
>> if (bpf_prog_check_recur()) {
>> if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
>> return PRIV_STACK_ALWAYS;
> hmm. definitely not unconditionally.
> Only when explicitly requested in callback.
>
> Something like this:
> case BPF_PROG_TYPE_TRACING:
> case BPF_PROG_TYPE_LSM:
> if (bpf_prog_check_recur())
> return PRIV_STACK_ADAPTIVE;
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (prog->aux->priv_stack_requested)
> return PRIV_STACK_ALWAYS;
> default:
> return NO_PRIV_STACK;
>
> and then we also change bpf_prog_check_recur()
> to return true when prog->aux->priv_stack_requested
This works too. I was also thinking about
case BPF_PROG_TYPE_LSM:
if (bpf_prog_check_recur(prog))
return PRIV_STACK_ADAPTIVE;
case BPF_PROG_TYPE_STRUCT_OPS:
if (bpf_prog_check_recur(prog))
return PRIV_STACK_ALWAYS;
Note that bpf_prog_check_recur() would, for struct_ops,
return prog->aux->priv_stack_requested.
But I thought it was too verbose, so I didn't propose it.
So explicitly using prog->aux->priv_stack_requested
is more visible. Maybe we can even do
case BPF_PROG_TYPE_TRACING:
case BPF_PROG_TYPE_LSM:
case BPF_PROG_TYPE_STRUCT_OPS:
if (prog->aux->priv_stack_requested)
return PRIV_STACK_ALWAYS;
else if (bpf_prog_check_recur(prog))
return PRIV_STACK_ADAPTIVE;
/* fallthrough */
default:
return NO_PRIV_STACK;
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case
2024-11-05 16:48 ` Yonghong Song
@ 2024-11-05 17:47 ` Alexei Starovoitov
0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 17:47 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Nov 5, 2024 at 8:48 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
>
> On 11/5/24 8:38 AM, Alexei Starovoitov wrote:
> > On Tue, Nov 5, 2024 at 8:33 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >>
> >>
> >> On 11/5/24 7:50 AM, Alexei Starovoitov wrote:
> >>> On Mon, Nov 4, 2024 at 10:02 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>>> I also don't understand the point of this patch 2.
> >>>>> The patch 3 can still do:
> >>>>>
> >>>>> + switch (prog->type) {
> >>>>> + case BPF_PROG_TYPE_KPROBE:
> >>>>> + case BPF_PROG_TYPE_TRACEPOINT:
> >>>>> + case BPF_PROG_TYPE_PERF_EVENT:
> >>>>> + case BPF_PROG_TYPE_RAW_TRACEPOINT:
> >>>>> + return PRIV_STACK_ADAPTIVE;
> >>>>> + default:
> >>>>> + break;
> >>>>> + }
> >>>>> +
> >>>>> + if (!bpf_prog_check_recur(prog))
> >>>>> + return NO_PRIV_STACK;
> >>>>>
> >>>>> which would mean that iter, lsm, struct_ops will not be allowed
> >>>>> to use priv stack.
> >>>> One example is e.g. a TC prog. Since bpf_prog_check_recur(prog)
> >>>> will return true (means supporting recursion), and private stack
> >>>> does not really support TC prog, the logic will become more
> >>>> complicated.
> >>>>
> >>>> I am totally okay with removing patch 2 and go back to my
> >>>> previous approach to explicitly list prog types supporting
> >>>> private stack.
> >>> The point of reusing bpf_prog_check_recur() is that we don't
> >>> need to duplicate the logic.
> >>> We can still do something like:
> >>> switch (prog->type) {
> >>> case BPF_PROG_TYPE_KPROBE:
> >>> case BPF_PROG_TYPE_TRACEPOINT:
> >>> case BPF_PROG_TYPE_PERF_EVENT:
> >>> case BPF_PROG_TYPE_RAW_TRACEPOINT:
> >>> return PRIV_STACK_ADAPTIVE;
> >>> case BPF_PROG_TYPE_TRACING:
> >>> case BPF_PROG_TYPE_LSM:
> >>> case BPF_PROG_TYPE_STRUCT_OPS:
> >>> if (bpf_prog_check_recur())
> >>> return PRIV_STACK_ADAPTIVE;
> >>> /* fallthrough */
> >>> default:
> >>> return NO_PRIV_STACK;
> >>> }
> >> Right. Listing trampoline related prog types explicitly
> >> and using bpf_prog_check_recur() will be safe.
> >>
> >> One thing is for BPF_PROG_TYPE_STRUCT_OPS, PRIV_STACK_ALWAYS
> >> will be returned. I will make adjustment like
> >>
> >> switch (prog->type) {
> >> case BPF_PROG_TYPE_KPROBE:
> >> case BPF_PROG_TYPE_TRACEPOINT:
> >> case BPF_PROG_TYPE_PERF_EVENT:
> >> case BPF_PROG_TYPE_RAW_TRACEPOINT:
> >> return PRIV_STACK_ADAPTIVE;
> >> case BPF_PROG_TYPE_TRACING:
> >> case BPF_PROG_TYPE_LSM:
> >> case BPF_PROG_TYPE_STRUCT_OPS:
> >> if (bpf_prog_check_recur()) {
> >> if (prog->type == BPF_PROG_TYPE_STRUCT_OPS)
> >> return PRIV_STACK_ALWAYS;
> > hmm. definitely not unconditionally.
> > Only when explicitly requested in callback.
> >
> > Something like this:
> > case BPF_PROG_TYPE_TRACING:
> > case BPF_PROG_TYPE_LSM:
> > if (bpf_prog_check_recur())
> > return PRIV_STACK_ADAPTIVE;
> > case BPF_PROG_TYPE_STRUCT_OPS:
> > if (prog->aux->priv_stack_requested)
> > return PRIV_STACK_ALWAYS;
> > default:
> > return NO_PRIV_STACK;
> >
> > and then we also change bpf_prog_check_recur()
> > to return true when prog->aux->priv_stack_requested
>
> This works too. I had another thinking about
> case BPF_PROG_TYPE_LSM:
> if (bpf_prog_check_recur())
> return PRIV_STACK_ADAPTIVE;
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (bpf_prog_check_recur())
> return PRIV_STACK_ALWAYS;
>
> Note that in bpf_prog_check_recur(), for struct_ops,
> will return prog->aux->priv_stack_request.
> But think it is too verbose so didn't propose.
>
> So explicitly using prog->aux->priv_stack_requested
> is more visible. Maybe we can even do
>
> case BPF_PROG_TYPE_TRACING:
> case BPF_PROG_TYPE_LSM:
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (prog->aux->priv_stack_requested)
> return PRIV_STACK_ALWYAS;
> else if (bpf_prog_check_recur())
> return PRIV_STACK_ADAPTIVE;
> /* fallthrough */
> default:
> return NO_PRIV_STACK;
The last version makes sense to me.
bpf_prog_check_recur() should also return true when
prog->aux->priv_stack_requested is set, to make sure the trampoline
adds a run-time recursion check.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-05 3:37 ` Yonghong Song
@ 2024-11-05 20:26 ` Alexei Starovoitov
2024-11-05 21:26 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 20:26 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Mon, Nov 4, 2024 at 7:37 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 11/4/24 6:51 PM, Alexei Starovoitov wrote:
> > On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >> In previous patch, tracing progs are enabled for private stack since
> >> recursion checking ensures there exists no nested same bpf prog run on
> >> the same cpu.
> >>
> >> But it is still possible for nested bpf subprog run on the same cpu
> >> if the same subprog is called in both main prog and async callback,
> >> or in different async callbacks. For example,
> >> main_prog
> >> bpf_timer_set_callback(timer, timer_cb);
> >> call sub1
> >> sub1
> >> ...
> >> timer_cb
> >> call sub1
> >>
> >> In the above case, nested subprog run for sub1 is possible with one in
> >> process context and the other in softirq context. If this is the case,
> >> the verifier will disable private stack for this bpf prog.
> >>
> >> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> >> ---
> >> include/linux/bpf_verifier.h | 2 ++
> >> kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
> >> 2 files changed, 39 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> >> index 0622c11a7e19..e921589abc72 100644
> >> --- a/include/linux/bpf_verifier.h
> >> +++ b/include/linux/bpf_verifier.h
> >> @@ -669,6 +669,8 @@ struct bpf_subprog_info {
> >> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
> >> bool keep_fastcall_stack: 1;
> >> bool use_priv_stack: 1;
> >> + bool visited_with_priv_stack_accum: 1;
> >> + bool visited_with_priv_stack: 1;
> >>
> >> u8 arg_cnt;
> >> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 406195c433ea..e01b3f0fd314 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> >> idx, subprog_depth);
> >> return -EACCES;
> >> }
> >> - if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
> >> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
> >> subprog[idx].use_priv_stack = true;
> >> + subprog[idx].visited_with_priv_stack = true;
> >> + }
> >> + } else {
> >> + subprog[idx].visited_with_priv_stack = true;
> > See suggestion for patch 3.
> > It's cleaner to rewrite with a single visited_with_priv_stack = true; statement.
>
> Ack.
>
> >
> >> }
> >> }
> >> continue_func:
> >> @@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
> >> static int check_max_stack_depth(struct bpf_verifier_env *env)
> >> {
> >> struct bpf_subprog_info *si = env->subprog_info;
> >> + enum priv_stack_mode orig_priv_stack_supported;
> >> enum priv_stack_mode priv_stack_supported;
> >> int ret, subtree_depth = 0, depth_frame;
> >>
> >> priv_stack_supported = bpf_enable_priv_stack(env->prog);
> >> + orig_priv_stack_supported = priv_stack_supported;
> >>
> >> if (priv_stack_supported != NO_PRIV_STACK) {
> >> for (int i = 0; i < env->subprog_cnt; i++) {
> >> @@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
> >> priv_stack_supported);
> >> if (ret < 0)
> >> return ret;
> >> +
> >> + if (priv_stack_supported != NO_PRIV_STACK) {
> >> + for (int j = 0; j < env->subprog_cnt; j++) {
> >> + if (si[j].visited_with_priv_stack_accum &&
> >> + si[j].visited_with_priv_stack) {
> >> + /* si[j] is visited by both main/async subprog
> >> + * and another async subprog.
> >> + */
> >> + priv_stack_supported = NO_PRIV_STACK;
> >> + break;
> >> + }
> >> + if (!si[j].visited_with_priv_stack_accum)
> >> + si[j].visited_with_priv_stack_accum =
> >> + si[j].visited_with_priv_stack;
> >> + }
> >> + }
> >> + if (priv_stack_supported != NO_PRIV_STACK) {
> >> + for (int j = 0; j < env->subprog_cnt; j++)
> >> + si[j].visited_with_priv_stack = false;
> >> + }
> > I cannot understand what this algorithm is doing.
> > What is the meaning of visited_with_priv_stack_accum ?
>
> The following is an example to show how the algorithm works.
> Let us say we have prog like
> main_prog0 si[0]
> sub1 si[1]
> sub2 si[2]
> async1 si[3]
> sub4 si[4]
> sub2 si[2]
> async2 si[5]
> sub4 si[4]
> sub5 si[6]
>
>
> Seven distinct subprograms (si[0]..si[6]) across nine call sites.
>
> after iteration 1 (main_prog0)
> visited_with_priv_stack_accum: si[i] = false for i = 0 ... 6
> visited_with_priv_stack: si[0] = si[1] = si[2] = true, others false
>
> for all i, visited_with_priv_stack_accum[i] and visited_with_priv_stack[i]
> is false, so main_prog0 can use priv stack.
>
> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
> visited_with_priv_stack cleared with false.
>
> after iteration 2 (async1)
> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
> visited_with_priv_stack: si[2] = si[3] = si[4] = true, others false
>
> Here, si[2] appears in both visited_with_priv_stack_accum and
> visited_with_priv_stack, so async1 cannot have priv stack.
>
> In my algorithm, I flipped the whole thing to no_priv_stack, which is
> too conservative. We should just skip async1 and continue.
>
> Let us say async1 does not get a priv stack while main_prog0 does.
>
> /* the same as end of iteration 1 */
> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
> visited_with_priv_stack cleared with false.
>
> after iteration 3 (async2)
> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
> visited_with_priv_stack: si[4] = si[5] = si[6] = true;
>
> there are no conflict, so async2 can use private stack.
>
>
> If we only have one bit in bpf_subprog_info: after marking subprogs
> true for an async tree, if we later find a conflict in that tree and
> need to make all of its subprogs ineligible for priv stack,
> then it will be hard to undo the previous markings.
>
> So visited_with_priv_stack_accum is to accumulate "true" results from
> main_prog/async's.
I see. I think it works, but it feels complicated.
It should be possible to do this without extra flags. Like
check_max_stack_depth_subprog() will know whether it was called
to verify async_cb or not.
So it's just a matter of adding single 'if' to it:
if (subprog[idx].use_priv_stack && checking_async_cb)
/* reset to false due to potential recursion */
subprog[idx].use_priv_stack = false;
check_max_stack_depth() starts with i==0,
so reachable and eligible subprogs will be marked with use_priv_stack.
Then check_max_stack_depth_subprog() will be called again
to verify async. If it sees the mark it's a bad case.
what am I missing?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-05 20:26 ` Alexei Starovoitov
@ 2024-11-05 21:26 ` Yonghong Song
2024-11-05 21:52 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-05 21:26 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 12:26 PM, Alexei Starovoitov wrote:
> On Mon, Nov 4, 2024 at 7:37 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 11/4/24 6:51 PM, Alexei Starovoitov wrote:
>>> On Mon, Nov 4, 2024 at 11:38 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> In previous patch, tracing progs are enabled for private stack since
>>>> recursion checking ensures there exists no nested same bpf prog run on
>>>> the same cpu.
>>>>
>>>> But it is still possible for nested bpf subprog run on the same cpu
>>>> if the same subprog is called in both main prog and async callback,
>>>> or in different async callbacks. For example,
>>>> main_prog
>>>> bpf_timer_set_callback(timer, timer_cb);
>>>> call sub1
>>>> sub1
>>>> ...
>>>> timer_cb
>>>> call sub1
>>>>
>>>> In the above case, nested subprog run for sub1 is possible with one in
>>>> process context and the other in softirq context. If this is the case,
>>>> the verifier will disable private stack for this bpf prog.
>>>>
>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>>> ---
>>>> include/linux/bpf_verifier.h | 2 ++
>>>> kernel/bpf/verifier.c | 42 +++++++++++++++++++++++++++++++-----
>>>> 2 files changed, 39 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>>>> index 0622c11a7e19..e921589abc72 100644
>>>> --- a/include/linux/bpf_verifier.h
>>>> +++ b/include/linux/bpf_verifier.h
>>>> @@ -669,6 +669,8 @@ struct bpf_subprog_info {
>>>> /* true if bpf_fastcall stack region is used by functions that can't be inlined */
>>>> bool keep_fastcall_stack: 1;
>>>> bool use_priv_stack: 1;
>>>> + bool visited_with_priv_stack_accum: 1;
>>>> + bool visited_with_priv_stack: 1;
>>>>
>>>> u8 arg_cnt;
>>>> struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index 406195c433ea..e01b3f0fd314 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -6118,8 +6118,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>>>> idx, subprog_depth);
>>>> return -EACCES;
>>>> }
>>>> - if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE)
>>>> + if (subprog_depth >= BPF_PRIV_STACK_MIN_SIZE) {
>>>> subprog[idx].use_priv_stack = true;
>>>> + subprog[idx].visited_with_priv_stack = true;
>>>> + }
>>>> + } else {
>>>> + subprog[idx].visited_with_priv_stack = true;
>>> See suggestion for patch 3.
>>> It's cleaner to rewrite with a single visited_with_priv_stack = true; statement.
>> Ack.
>>
>>>> }
>>>> }
>>>> continue_func:
>>>> @@ -6220,10 +6224,12 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
>>>> static int check_max_stack_depth(struct bpf_verifier_env *env)
>>>> {
>>>> struct bpf_subprog_info *si = env->subprog_info;
>>>> + enum priv_stack_mode orig_priv_stack_supported;
>>>> enum priv_stack_mode priv_stack_supported;
>>>> int ret, subtree_depth = 0, depth_frame;
>>>>
>>>> priv_stack_supported = bpf_enable_priv_stack(env->prog);
>>>> + orig_priv_stack_supported = priv_stack_supported;
>>>>
>>>> if (priv_stack_supported != NO_PRIV_STACK) {
>>>> for (int i = 0; i < env->subprog_cnt; i++) {
>>>> @@ -6240,13 +6246,39 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
>>>> priv_stack_supported);
>>>> if (ret < 0)
>>>> return ret;
>>>> +
>>>> + if (priv_stack_supported != NO_PRIV_STACK) {
>>>> + for (int j = 0; j < env->subprog_cnt; j++) {
>>>> + if (si[j].visited_with_priv_stack_accum &&
>>>> + si[j].visited_with_priv_stack) {
>>>> + /* si[j] is visited by both main/async subprog
>>>> + * and another async subprog.
>>>> + */
>>>> + priv_stack_supported = NO_PRIV_STACK;
>>>> + break;
>>>> + }
>>>> + if (!si[j].visited_with_priv_stack_accum)
>>>> + si[j].visited_with_priv_stack_accum =
>>>> + si[j].visited_with_priv_stack;
>>>> + }
>>>> + }
>>>> + if (priv_stack_supported != NO_PRIV_STACK) {
>>>> + for (int j = 0; j < env->subprog_cnt; j++)
>>>> + si[j].visited_with_priv_stack = false;
>>>> + }
>>> I cannot understand what this algorithm is doing.
>>> What is the meaning of visited_with_priv_stack_accum ?
>> The following is an example to show how the algorithm works.
>> Let us say we have prog like
>> main_prog0 si[0]
>> sub1 si[1]
>> sub2 si[2]
>> async1 si[3]
>> sub4 si[4]
>> sub2 si[2]
>> async2 si[5]
>> sub4 si[4]
>> sub5 si[6]
>>
>>
>> Total 9 subprograms.
>>
>> after iteration 1 (main_prog0)
>> visited_with_priv_stack_accum: si[i] = false for i = 0 ... 9
>> visited_with_priv_stack: si[0] = si[1] = si[2] = true, others false
>>
>> for all i, visited_with_priv_stack_accum[i] and visited_with_priv_stack[i]
>> is false, so main_prog0 can use priv stack.
>>
>> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
>> visited_with_priv_stack cleared with false.
>>
>> after iteration 2 (async1)
>> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
>> visited_with_priv_stack: si[2] = si[3] = si[4] = true, others false
>>
>> Here, si[2] appears in both visited_with_priv_stack_accum and
>> visited_with_priv_stack, so async1 cannot have priv stack.
>>
>> In my algorithm, I flipped the whole thing to no_priv_stack, which is
>> too conservative. We should just skip async1 and continues.
>>
>> Let us say, we say async1 not having priv stack while main_prog0 has.
>>
>> /* the same as end of iteration 1 */
>> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
>> visited_with_priv_stack cleared with false.
>>
>> after iteration 3 (async2)
>> visited_with_priv_stack_accum: si[0] = si[1] = si[2] = true; others false
>> visited_with_priv_stack: si[4] = si[5] = si[6] = true;
>>
>> there is no conflict, so async2 can use private stack.
>>
>>
>> If we only have one bit in bpf_subprog_info: for an async tree, if we
>> mark a subprog true and later find a conflict in the async tree, so that
>> the whole async tree's subprogs become ineligible for priv stack,
>> then it will be hard to undo the previous markings.
>>
>> So visited_with_priv_stack_accum is to accumulate "true" results from
>> main_prog/async's.
> I see. I think it works, but feels complicated.
> It feels it should be possible to do without extra flags. Like
> check_max_stack_depth_subprog() will know whether it was called
> to verify async_cb or not.
> So it's just a matter of adding single 'if' to it:
> if (subprog[idx].use_priv_stack && checking_async_cb)
> /* reset to false due to potential recursion */
> subprog[idx].use_priv_stack = false;
>
> check_max_stack_depth() starts with i==0,
> so reachable and eligible subprogs will be marked with use_priv_stack.
> Then check_max_stack_depth_subprog() will be called again
> to verify async. If it sees the mark it's a bad case.
> what am I missing?
First, I think we still want to mark some subprogs in the async tree
to use private stack, right? If this is the case, then let us see
the following example:
main_prog:
sub1: use_priv_stack = true
sub2: use_priv_stack = true
async: /* calling sub1 twice */
sub1
<=== we do
if (subprog[idx].use_priv_stack && checking_async_cb)
subprog[idx].use_priv_stack = false;
sub1
<=== here we have subprog[idx].use_priv_stack = false;
we could mark use_priv_stack = true again here
since the logic didn't keep track of whether sub1
had been visited before.
To solve the above issue, we need one visited bit in bpf_subprog_info.
After finishing the async tree, if for any subprog
visited_bit && subprog[idx].use_priv_stack
is true, we mark subprog[idx].use_priv_stack = false.
So one visited bit is enough.
More complicated case is two asyncs. For example:
main_prog:
sub1
sub2
async1:
sub3
async2:
sub3
If async1/sub3 and async2/sub3 can be nested, then we will
need two visited bits as described above.
If async1/sub3 and async2/sub3 cannot be nested, then one
visited bit should be enough, since we can traverse
async1/async2 with 'visited' marking and then compare against
the main prog.
So the question would be:
1. Is it possible that two async callbacks may nest with
each other? I actually do not know the answer.
2. Do we want to allow subprogs in async tree to use private
stacks?
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-05 21:26 ` Yonghong Song
@ 2024-11-05 21:52 ` Alexei Starovoitov
2024-11-06 0:19 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-05 21:52 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> > I see. I think it works, but feels complicated.
> > It feels it should be possible to do without extra flags. Like
> > check_max_stack_depth_subprog() will know whether it was called
> > to verify async_cb or not.
> > So it's just a matter of adding single 'if' to it:
> > if (subprog[idx].use_priv_stack && checking_async_cb)
> > /* reset to false due to potential recursion */
> > subprog[idx].use_priv_stack = false;
> >
> > check_max_stack_depth() starts with i==0,
> > so reachable and eligible subprogs will be marked with use_priv_stack.
> > Then check_max_stack_depth_subprog() will be called again
> > to verify async. If it sees the mark it's a bad case.
> > what am I missing?
>
> First I think we still want to mark some subprogs in async tree
> to use private stack, right? If this is the case, then let us see
> the following examle:
>
> main_prog:
> sub1: use_priv_stack = true
> sub2" use_priv_stack = true
>
> async: /* calling sub1 twice */
> sub1
> <=== we do
> if (subprog[idx].use_priv_stack && checking_async_cb)
> subprog[idx].use_priv_stack = false;
> sub1
> <=== here we have subprog[idx].use_priv_stack = false;
> we could mark use_priv_stack = true again here
> since logic didn't keep track of sub1 has been
> visited before.
This case needs a sticky state to solve.
Instead of bool use_priv_stack it can be tri-state:
no_priv_stack
priv_stack_unknown <- start state
priv_stack_maybe
main_prog pass will set it to priv_stack_maybe
while async pass will clear it to no_priv_stack
and it cannot be bumped up.
> To solve the above issue, we need one visited bit in bpf_subprog_info.
> After finishing async tree, if for any subprog,
> visited_bit && subprog[idx].use_priv_stack
> is true, we can mark subprog[idx].use_priv_stack = false
>
> So one visited bit is enough.
>
> More complicated case is two asyncs. For example:
>
> main_prog:
> sub1
> sub2
>
> async1:
> sub3
>
> async2:
> sub3
>
> If async1/sub3 and async2/sub3 can be nested, then we will
> need two visited bits as I have above.
> If async1/sub3 and async2/sub3 cannot be nested, then one
> visited bit should be enough, since we can traverse
> async1/async2 with 'visited' marking and then compare against
> main prog.
>
> So the question would be:
> 1. Is it possible that two async call backs may nest with
> each other? I actually do not know the answer.
I think we have to assume that they can. Doing otherwise
would subject us to implementation details.
I think above tri-state approach works for two callbacks case too:
async1 will bump sub3 to priv_stack_maybe
while async2 will clear it to sticky no_priv_stack.
Ideally we reuse the same enum for this algorithm and for earlier
patches.
> 2. Do we want to allow subprogs in async tree to use private
> stacks?
yes. when sched-ext requests priv stack it would want it everywhere.
I think the request_priv_stack should be treated as
PRIV_STACK_ADAPTIVE. Meaning that subprogs with stack_depth < 64
don't need to use it.
In other words struct_ops prog with request_priv_stack == true
tells the verifier: add run-time recursion check at main prog entry,
otherwise treat it like fentry and pick priv stack vs normal
as the best for performance.
Then for both fentry and struct_ops w/request_priv_stack
the async callbacks will be considered for priv stack too and
will be cleared to normal stack when potential recursion via async
is detected.
I don't think it's an error for either prog type.
Overall we shouldn't treat struct_ops too special.
fentry progs with large stack are automatically candidates for priv stack.
struct_ops w/request_priv_stack are in the same category.
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-05 21:52 ` Alexei Starovoitov
@ 2024-11-06 0:19 ` Yonghong Song
2024-11-06 1:07 ` Alexei Starovoitov
0 siblings, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-06 0:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
> On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> I see. I think it works, but feels complicated.
>>> It feels it should be possible to do without extra flags. Like
>>> check_max_stack_depth_subprog() will know whether it was called
>>> to verify async_cb or not.
>>> So it's just a matter of adding single 'if' to it:
>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>> /* reset to false due to potential recursion */
>>> subprog[idx].use_priv_stack = false;
>>>
>>> check_max_stack_depth() starts with i==0,
>>> so reachable and eligible subprogs will be marked with use_priv_stack.
>>> Then check_max_stack_depth_subprog() will be called again
>>> to verify async. If it sees the mark it's a bad case.
>>> what am I missing?
>> First I think we still want to mark some subprogs in async tree
>> to use private stack, right? If this is the case, then let us see
>> the following examle:
>>
>> main_prog:
>> sub1: use_priv_stack = true
>> sub2" use_priv_stack = true
>>
>> async: /* calling sub1 twice */
>> sub1
>> <=== we do
>> if (subprog[idx].use_priv_stack && checking_async_cb)
>> subprog[idx].use_priv_stack = false;
>> sub1
>> <=== here we have subprog[idx].use_priv_stack = false;
>> we could mark use_priv_stack = true again here
>> since logic didn't keep track of sub1 has been
>> visited before.
> This case needs a sticky state to solve.
> Instead of bool use_priv_stack it can be tri-state:
> no_priv_stack
> priv_stack_unknown <- start state
> priv_stack_maybe
>
> main_prog pass will set it to priv_stack_maybe
> while async pass will clear it to no_priv_stack
> and it cannot be bumped up.
The tri-state may not work. For example,
main_prog:
call sub1
call sub2
call sub1
call sub3
async:
call sub4 ==> UNKNOWN -> MAYBE
call sub3
call sub4 ==> MAYBE -> NO_PRIV_STACK?
For sub4 in async, which is called twice: for the second sub4 call,
it is not clear whether the UNKNOWN->MAYBE transition happened in
main_prog or in async. So, based on the transition protocol, the
second sub4 call will transition to NO_PRIV_STACK, which is not
what we want.
So I think we still need a separate bit in bpf_subprog_info to
accumulate information for the main_prog tree or any async tree.
>
>> To solve the above issue, we need one visited bit in bpf_subprog_info.
>> After finishing async tree, if for any subprog,
>> visited_bit && subprog[idx].use_priv_stack
>> is true, we can mark subprog[idx].use_priv_stack = false
>>
>> So one visited bit is enough.
>>
>> More complicated case is two asyncs. For example:
>>
>> main_prog:
>> sub1
>> sub2
>>
>> async1:
>> sub3
>>
>> async2:
>> sub3
>>
>> If async1/sub3 and async2/sub3 can be nested, then we will
>> need two visited bits as I have above.
>> If async1/sub3 and async2/sub3 cannot be nested, then one
>> visited bit should be enough, since we can traverse
>> async1/async2 with 'visited' marking and then compare against
>> main prog.
>>
>> So the question would be:
>> 1. Is it possible that two async call backs may nest with
>> each other? I actually do not know the answer.
> I think we have to assume that they can. Doing otherwise
> would subject us to implementation details.
> I think above tri-state approach works for two callbacks case too:
> async1 will bump sub3 to priv_stack_maybe
> while async2 will clear it to sticky no_priv_stack.
>
> Ideally we reuse the same enum for this algorithm and for earlier
> patches.
>
>> 2. Do we want to allow subprogs in async tree to use private
>> stacks?
> yes. when sched-ext requests priv stack it would want it everywhere.
> I think the request_priv_stack should be treated as
> PRIV_STACK_ADAPTIVE. Meaning that subprogs with stack_depth < 64
> don't need to use it.
> In other words struct_ops prog with request_priv_stack == true
> tells the verifier: add run-time recursion check at main prog entry,
> otherwise treat it like fentry and pick priv stack vs normal
> as the best for performance.
>
> Then for both fentry and struct_ops w/request_priv_stack
> the async callbacks will be considered for priv stack too and
> will be cleared to normals stack when potential recursion via async
> is detected.
> I don't think it's an error for either prog type.
> Overall we shouldn't treat struct_ops too special.
> fentry progs with large stack are automatically candidates for priv stack.
> struct_ops w/request_priv_stack are in the same category.
Okay, indeed we can treat struct_ops the same as other tracing programs.
They are adaptive too, subject to various conditions where the private stack
may not be used, e.g. small stack size, tail calls, and potentially nested subprogs.
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-06 0:19 ` Yonghong Song
@ 2024-11-06 1:07 ` Alexei Starovoitov
2024-11-06 2:33 ` Yonghong Song
2024-11-06 6:55 ` Yonghong Song
0 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-06 1:07 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Nov 5, 2024 at 4:19 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
>
> On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>> I see. I think it works, but feels complicated.
> >>> It feels it should be possible to do without extra flags. Like
> >>> check_max_stack_depth_subprog() will know whether it was called
> >>> to verify async_cb or not.
> >>> So it's just a matter of adding single 'if' to it:
> >>> if (subprog[idx].use_priv_stack && checking_async_cb)
> >>> /* reset to false due to potential recursion */
> >>> subprog[idx].use_priv_stack = false;
> >>>
> >>> check_max_stack_depth() starts with i==0,
> >>> so reachable and eligible subprogs will be marked with use_priv_stack.
> >>> Then check_max_stack_depth_subprog() will be called again
> >>> to verify async. If it sees the mark it's a bad case.
> >>> what am I missing?
> >> First I think we still want to mark some subprogs in async tree
> >> to use private stack, right? If this is the case, then let us see
> >> the following examle:
> >>
> >> main_prog:
> >> sub1: use_priv_stack = true
> >> sub2" use_priv_stack = true
> >>
> >> async: /* calling sub1 twice */
> >> sub1
> >> <=== we do
> >> if (subprog[idx].use_priv_stack && checking_async_cb)
> >> subprog[idx].use_priv_stack = false;
> >> sub1
> >> <=== here we have subprog[idx].use_priv_stack = false;
> >> we could mark use_priv_stack = true again here
> >> since logic didn't keep track of sub1 has been
> >> visited before.
> > This case needs a sticky state to solve.
> > Instead of bool use_priv_stack it can be tri-state:
> > no_priv_stack
> > priv_stack_unknown <- start state
> > priv_stack_maybe
> >
> > main_prog pass will set it to priv_stack_maybe
> > while async pass will clear it to no_priv_stack
> > and it cannot be bumped up.
>
> The tri-state may not work. For example,
>
> main_prog:
> call sub1
> call sub2
> call sub1
sub1 cannot be called nested like this.
I think we discussed it already.
> call sub3
>
> async:
> call sub4 ==> UNKNOWN -> MAYBE
> call sub3
> call sub4 ==> MAYBE -> NO_PRIV_STACK?
>
> For sub4 in async which is called twice, for the second sub4 call,
> it is not clear whether UNKNOWN->MAYBE transition happens in
> main_prog or async. So based on transition prototol,
> second sub4 call will transition to NO_PRIV_STACK which is not
> what we want.
I see. Good point.
> So I think we still need a separate bit in bpf_subprog_info to
> accumulate information for main_prog tree or any async tree.
This is getting quite convoluted. To support priv stack
in multiple async cb-s we may need to remember async cb id or something.
I say let's force all subprogs in async cb to use normal stack.
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-06 1:07 ` Alexei Starovoitov
@ 2024-11-06 2:33 ` Yonghong Song
2024-11-06 6:55 ` Yonghong Song
1 sibling, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-06 2:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 5:07 PM, Alexei Starovoitov wrote:
> On Tue, Nov 5, 2024 at 4:19 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>> I see. I think it works, but feels complicated.
>>>>> It feels it should be possible to do without extra flags. Like
>>>>> check_max_stack_depth_subprog() will know whether it was called
>>>>> to verify async_cb or not.
>>>>> So it's just a matter of adding single 'if' to it:
>>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>>> /* reset to false due to potential recursion */
>>>>> subprog[idx].use_priv_stack = false;
>>>>>
>>>>> check_max_stack_depth() starts with i==0,
>>>>> so reachable and eligible subprogs will be marked with use_priv_stack.
>>>>> Then check_max_stack_depth_subprog() will be called again
>>>>> to verify async. If it sees the mark it's a bad case.
>>>>> what am I missing?
>>>> First I think we still want to mark some subprogs in async tree
>>>> to use private stack, right? If this is the case, then let us see
>>>> the following examle:
>>>>
>>>> main_prog:
>>>> sub1: use_priv_stack = true
>>>> sub2" use_priv_stack = true
>>>>
>>>> async: /* calling sub1 twice */
>>>> sub1
>>>> <=== we do
>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>> subprog[idx].use_priv_stack = false;
>>>> sub1
>>>> <=== here we have subprog[idx].use_priv_stack = false;
>>>> we could mark use_priv_stack = true again here
>>>> since logic didn't keep track of sub1 has been
>>>> visited before.
>>> This case needs a sticky state to solve.
>>> Instead of bool use_priv_stack it can be tri-state:
>>> no_priv_stack
>>> priv_stack_unknown <- start state
>>> priv_stack_maybe
>>>
>>> main_prog pass will set it to priv_stack_maybe
>>> while async pass will clear it to no_priv_stack
>>> and it cannot be bumped up.
>> The tri-state may not work. For example,
>>
>> main_prog:
>> call sub1
>> call sub2
>> call sub1
> sub1 cannot be called nested like this.
> I think we discussed it already.
>
>> call sub3
>>
>> async:
>> call sub4 ==> UNKNOWN -> MAYBE
>> call sub3
>> call sub4 ==> MAYBE -> NO_PRIV_STACK?
>>
>> For sub4 in async which is called twice, for the second sub4 call,
>> it is not clear whether UNKNOWN->MAYBE transition happens in
>> main_prog or async. So based on transition prototol,
>> second sub4 call will transition to NO_PRIV_STACK which is not
>> what we want.
> I see. Good point.
>
>> So I think we still need a separate bit in bpf_subprog_info to
>> accumulate information for main_prog tree or any async tree.
> This is getting quite convoluted. To support priv stack
> in multiple async cb-s we may need to remember async cb id or something.
> I say let's force all subprogs in async cb to use normal stack.
Okay, let's do this. We only have a few helpers/kfuncs with async callbacks:
bpf_timer_set_callback
bpf_wq_set_callback
exception handling
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-06 1:07 ` Alexei Starovoitov
2024-11-06 2:33 ` Yonghong Song
@ 2024-11-06 6:55 ` Yonghong Song
2024-11-06 15:26 ` Alexei Starovoitov
1 sibling, 1 reply; 41+ messages in thread
From: Yonghong Song @ 2024-11-06 6:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/5/24 5:07 PM, Alexei Starovoitov wrote:
> On Tue, Nov 5, 2024 at 4:19 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>> I see. I think it works, but feels complicated.
>>>>> It feels it should be possible to do without extra flags. Like
>>>>> check_max_stack_depth_subprog() will know whether it was called
>>>>> to verify async_cb or not.
>>>>> So it's just a matter of adding single 'if' to it:
>>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>>> /* reset to false due to potential recursion */
>>>>> subprog[idx].use_priv_stack = false;
>>>>>
>>>>> check_max_stack_depth() starts with i==0,
>>>>> so reachable and eligible subprogs will be marked with use_priv_stack.
>>>>> Then check_max_stack_depth_subprog() will be called again
>>>>> to verify async. If it sees the mark it's a bad case.
>>>>> what am I missing?
>>>> First I think we still want to mark some subprogs in async tree
>>>> to use private stack, right? If this is the case, then let us see
>>>> the following examle:
>>>>
>>>> main_prog:
>>>> sub1: use_priv_stack = true
>>>> sub2" use_priv_stack = true
>>>>
>>>> async: /* calling sub1 twice */
>>>> sub1
>>>> <=== we do
>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>> subprog[idx].use_priv_stack = false;
>>>> sub1
>>>> <=== here we have subprog[idx].use_priv_stack = false;
>>>> we could mark use_priv_stack = true again here
>>>> since logic didn't keep track of sub1 has been
>>>> visited before.
>>> This case needs a sticky state to solve.
>>> Instead of bool use_priv_stack it can be tri-state:
>>> no_priv_stack
>>> priv_stack_unknown <- start state
>>> priv_stack_maybe
>>>
>>> main_prog pass will set it to priv_stack_maybe
>>> while async pass will clear it to no_priv_stack
>>> and it cannot be bumped up.
>> The tri-state may not work. For example,
>>
>> main_prog:
>> call sub1
>> call sub2
>> call sub1
> sub1 cannot be called nested like this.
> I think we discussed it already.
>
>> call sub3
>>
>> async:
>> call sub4 ==> UNKNOWN -> MAYBE
>> call sub3
>> call sub4 ==> MAYBE -> NO_PRIV_STACK?
>>
>> For sub4 in async which is called twice, for the second sub4 call,
>> it is not clear whether UNKNOWN->MAYBE transition happens in
>> main_prog or async. So based on transition prototol,
>> second sub4 call will transition to NO_PRIV_STACK which is not
>> what we want.
> I see. Good point.
>
>> So I think we still need a separate bit in bpf_subprog_info to
>> accumulate information for main_prog tree or any async tree.
> This is getting quite convoluted. To support priv stack
> in multiple async cb-s we may need to remember async cb id or something.
> I say let's force all subprogs in async cb to use normal stack.
I did a quick prototype. Among others, it uses a tri-state (UNKNOWN, NO,
ADAPTIVE) and traverses the subprogs in reverse, like the diff below:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cb82254484ff..1084432dbe83 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6192,7 +6192,7 @@ static int check_max_stack_depth(struct bpf_verifier_env *env)
 	struct bpf_subprog_info *si = env->subprog_info;
 	int ret;
 
-	for (int i = 0; i < env->subprog_cnt; i++) {
+	for (int i = env->subprog_cnt - 1; i >= 0; i--) {
 		if (i && !si[i].is_async_cb)
 			continue;

This works correctly. Basically, all subprogs touched by an async_cb are
marked as 'NO'. Finally, for the main prog tree, UNKNOWN subprogs are
converted to ADAPTIVE if their stack size is >= the threshold. Stack size
checking can also be done properly for the async_cb trees and the main
prog tree.
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-06 6:55 ` Yonghong Song
@ 2024-11-06 15:26 ` Alexei Starovoitov
2024-11-06 15:44 ` Yonghong Song
0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2024-11-06 15:26 UTC (permalink / raw)
To: Yonghong Song
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On Tue, Nov 5, 2024 at 10:55 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
>
> On 11/5/24 5:07 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 5, 2024 at 4:19 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >>
> >>
> >> On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
> >>> On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>>> I see. I think it works, but feels complicated.
> >>>>> It feels it should be possible to do without extra flags. Like
> >>>>> check_max_stack_depth_subprog() will know whether it was called
> >>>>> to verify async_cb or not.
> >>>>> So it's just a matter of adding single 'if' to it:
> >>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
> >>>>> /* reset to false due to potential recursion */
> >>>>> subprog[idx].use_priv_stack = false;
> >>>>>
> >>>>> check_max_stack_depth() starts with i==0,
> >>>>> so reachable and eligible subprogs will be marked with use_priv_stack.
> >>>>> Then check_max_stack_depth_subprog() will be called again
> >>>>> to verify async. If it sees the mark it's a bad case.
> >>>>> what am I missing?
> >>>> First I think we still want to mark some subprogs in async tree
> >>>> to use private stack, right? If this is the case, then let us see
> >>>> the following examle:
> >>>>
> >>>> main_prog:
> >>>> sub1: use_priv_stack = true
> >>>> sub2" use_priv_stack = true
> >>>>
> >>>> async: /* calling sub1 twice */
> >>>> sub1
> >>>> <=== we do
> >>>> if (subprog[idx].use_priv_stack && checking_async_cb)
> >>>> subprog[idx].use_priv_stack = false;
> >>>> sub1
> >>>> <=== here we have subprog[idx].use_priv_stack = false;
> >>>> we could mark use_priv_stack = true again here
> >>>> since logic didn't keep track of sub1 has been
> >>>> visited before.
> >>> This case needs a sticky state to solve.
> >>> Instead of bool use_priv_stack it can be tri-state:
> >>> no_priv_stack
> >>> priv_stack_unknown <- start state
> >>> priv_stack_maybe
> >>>
> >>> main_prog pass will set it to priv_stack_maybe
> >>> while async pass will clear it to no_priv_stack
> >>> and it cannot be bumped up.
> >> The tri-state may not work. For example,
> >>
> >> main_prog:
> >> call sub1
> >> call sub2
> >> call sub1
> > sub1 cannot be called nested like this.
> > I think we discussed it already.
> >
> >> call sub3
> >>
> >> async:
> >> call sub4 ==> UNKNOWN -> MAYBE
> >> call sub3
> >> call sub4 ==> MAYBE -> NO_PRIV_STACK?
> >>
> >> For sub4 in async which is called twice, for the second sub4 call,
> >> it is not clear whether UNKNOWN->MAYBE transition happens in
> >> main_prog or async. So based on transition prototol,
> >> second sub4 call will transition to NO_PRIV_STACK which is not
> >> what we want.
> > I see. Good point.
> >
> >> So I think we still need a separate bit in bpf_subprog_info to
> >> accumulate information for main_prog tree or any async tree.
> > This is getting quite convoluted. To support priv stack
> > in multiple async cb-s we may need to remember async cb id or something.
> > I say let's force all subprogs in async cb to use normal stack.
>
> I did a quick prototype. Among others, tri-state (UNKNOWN, NO, ADAPTIVE)
> and reverse traversing subprogs like below diff --git
> a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index
> cb82254484ff..1084432dbe83 100644 --- a/kernel/bpf/verifier.c +++
> b/kernel/bpf/verifier.c @@ -6192,7 +6192,7 @@ static int
> check_max_stack_depth(struct bpf_verifier_env *env) struct
> bpf_subprog_info *si = env->subprog_info; int ret; - for (int i = 0; i <
> env->subprog_cnt; i++) { + for (int i = env->subprog_cnt - 1; i >= 0;
> i--) { if (i && !si[i].is_async_cb) continue; works correctly.
> Basically, all async_cb touched subprogs are marked as 'NO'. Finally for
> main prog tree, UNKNOWN subprog will convert to ADAPTIVE if >= stack
> size threshold. Stack size checking can also be done properly for
> async_cb tree and main prog tree.
Your emailer still spits out garbage :(
but I think I got the idea. Will wait for respin.
* Re: [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback
2024-11-06 15:26 ` Alexei Starovoitov
@ 2024-11-06 15:44 ` Yonghong Song
0 siblings, 0 replies; 41+ messages in thread
From: Yonghong Song @ 2024-11-06 15:44 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Kernel Team, Martin KaFai Lau, Tejun Heo
On 11/6/24 7:26 AM, Alexei Starovoitov wrote:
> On Tue, Nov 5, 2024 at 10:55 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>>
>>
>> On 11/5/24 5:07 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 5, 2024 at 4:19 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>
>>>>
>>>> On 11/5/24 1:52 PM, Alexei Starovoitov wrote:
>>>>> On Tue, Nov 5, 2024 at 1:26 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>>>> I see. I think it works, but feels complicated.
>>>>>>> It feels it should be possible to do without extra flags. Like
>>>>>>> check_max_stack_depth_subprog() will know whether it was called
>>>>>>> to verify async_cb or not.
>>>>>>> So it's just a matter of adding single 'if' to it:
>>>>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>>>>> /* reset to false due to potential recursion */
>>>>>>> subprog[idx].use_priv_stack = false;
>>>>>>>
>>>>>>> check_max_stack_depth() starts with i==0,
>>>>>>> so reachable and eligible subprogs will be marked with use_priv_stack.
>>>>>>> Then check_max_stack_depth_subprog() will be called again
>>>>>>> to verify async. If it sees the mark it's a bad case.
>>>>>>> what am I missing?
>>>>>> First I think we still want to mark some subprogs in async tree
>>>>>> to use private stack, right? If this is the case, then let us see
>>>>>> the following examle:
>>>>>>
>>>>>> main_prog:
>>>>>> sub1: use_priv_stack = true
>>>>>> sub2" use_priv_stack = true
>>>>>>
>>>>>> async: /* calling sub1 twice */
>>>>>> sub1
>>>>>> <=== we do
>>>>>> if (subprog[idx].use_priv_stack && checking_async_cb)
>>>>>> subprog[idx].use_priv_stack = false;
>>>>>> sub1
>>>>>> <=== here we have subprog[idx].use_priv_stack = false;
>>>>>> we could mark use_priv_stack = true again here
>>>>>> since logic didn't keep track of sub1 has been
>>>>>> visited before.
>>>>> This case needs a sticky state to solve.
>>>>> Instead of bool use_priv_stack it can be tri-state:
>>>>> no_priv_stack
>>>>> priv_stack_unknown <- start state
>>>>> priv_stack_maybe
>>>>>
>>>>> main_prog pass will set it to priv_stack_maybe
>>>>> while async pass will clear it to no_priv_stack
>>>>> and it cannot be bumped up.
>>>> The tri-state may not work. For example,
>>>>
>>>> main_prog:
>>>> call sub1
>>>> call sub2
>>>> call sub1
>>> sub1 cannot be called nested like this.
>>> I think we discussed it already.
>>>
>>>> call sub3
>>>>
>>>> async:
>>>> call sub4 ==> UNKNOWN -> MAYBE
>>>> call sub3
>>>> call sub4 ==> MAYBE -> NO_PRIV_STACK?
>>>>
>>>> For sub4 in async which is called twice, for the second sub4 call,
>>>> it is not clear whether UNKNOWN->MAYBE transition happens in
>>>> main_prog or async. So based on transition prototol,
>>>> second sub4 call will transition to NO_PRIV_STACK which is not
>>>> what we want.
>>> I see. Good point.
>>>
>>>> So I think we still need a separate bit in bpf_subprog_info to
>>>> accumulate information for main_prog tree or any async tree.
>>> This is getting quite convoluted. To support priv stack
>>> in multiple async cb-s we may need to remember async cb id or something.
>>> I say let's force all subprogs in async cb to use normal stack.
>> I did a quick prototype. Among others, tri-state (UNKNOWN, NO, ADAPTIVE)
>> and reverse traversing subprogs like below diff --git
>> a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index
>> cb82254484ff..1084432dbe83 100644 --- a/kernel/bpf/verifier.c +++
>> b/kernel/bpf/verifier.c @@ -6192,7 +6192,7 @@ static int
>> check_max_stack_depth(struct bpf_verifier_env *env) struct
>> bpf_subprog_info *si = env->subprog_info; int ret; - for (int i = 0; i <
>> env->subprog_cnt; i++) { + for (int i = env->subprog_cnt - 1; i >= 0;
>> i--) { if (i && !si[i].is_async_cb) continue; works correctly.
>> Basically, all async_cb touched subprogs are marked as 'NO'. Finally for
>> main prog tree, UNKNOWN subprog will convert to ADAPTIVE if >= stack
>> size threshold. Stack size checking can also be done properly for
>> async_cb tree and main prog tree.
> Your emailer still spits out garbage :(
Somehow recently, my Thunderbird mailer may change the text/width setting
depending on the original format, unless I explicitly set the format
again. Sometimes it uses a 'Variable width' setting, which glues everything
into one paragraph. I am trying to find a solution to fix the mailer now.
> but I think I got the idea. Will wait for respin.
Will do.
Thread overview: 41+ messages
2024-11-04 19:34 [PATCH bpf-next v9 00/10] bpf: Support private stack for bpf progs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 01/10] bpf: Check stack depth limit after visiting all subprogs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 02/10] bpf: Return false for bpf_prog_check_recur() default case Yonghong Song
2024-11-05 1:21 ` Alexei Starovoitov
2024-11-05 1:35 ` Yonghong Song
2024-11-05 1:55 ` Alexei Starovoitov
2024-11-05 2:53 ` Yonghong Song
2024-11-05 3:50 ` Yonghong Song
2024-11-05 4:28 ` Alexei Starovoitov
2024-11-05 6:02 ` Yonghong Song
2024-11-05 15:50 ` Alexei Starovoitov
2024-11-05 16:33 ` Yonghong Song
2024-11-05 16:38 ` Alexei Starovoitov
2024-11-05 16:48 ` Yonghong Song
2024-11-05 17:47 ` Alexei Starovoitov
2024-11-04 19:35 ` [PATCH bpf-next v9 03/10] bpf: Allow private stack to have each subprog having stack size of 512 bytes Yonghong Song
2024-11-05 2:47 ` Alexei Starovoitov
2024-11-05 3:09 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 04/10] bpf: Check potential private stack recursion for progs with async callback Yonghong Song
2024-11-05 2:51 ` Alexei Starovoitov
2024-11-05 3:37 ` Yonghong Song
2024-11-05 20:26 ` Alexei Starovoitov
2024-11-05 21:26 ` Yonghong Song
2024-11-05 21:52 ` Alexei Starovoitov
2024-11-06 0:19 ` Yonghong Song
2024-11-06 1:07 ` Alexei Starovoitov
2024-11-06 2:33 ` Yonghong Song
2024-11-06 6:55 ` Yonghong Song
2024-11-06 15:26 ` Alexei Starovoitov
2024-11-06 15:44 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 05/10] bpf: Allocate private stack for eligible main prog or subprogs Yonghong Song
2024-11-05 1:38 ` Alexei Starovoitov
2024-11-05 3:07 ` Yonghong Song
2024-11-05 3:44 ` Yonghong Song
2024-11-05 5:19 ` Alexei Starovoitov
2024-11-05 6:05 ` Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 06/10] bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 07/10] bpf, x86: Support private stack in jit Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 08/10] selftests/bpf: Add tracing prog private stack tests Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 09/10] bpf: Support private stack for struct_ops progs Yonghong Song
2024-11-04 19:35 ` [PATCH bpf-next v9 10/10] selftests/bpf: Add struct_ops prog private stack tests Yonghong Song