BPF List
* [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs
@ 2024-10-10 17:55 Yonghong Song
  2024-10-10 17:55 ` [PATCH bpf-next v4 01/10] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

The main motivation for the private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
 - each cgroup has its own associated bpf program, and
 - the bpf program of a parent cgroup calls the bpf programs
   of its immediate child cgroups.

Let us say we have the following cgroup hierarchy:
  root_cg (prog0):
    cg1 (prog1):
      cg11 (prog11):
        cg111 (prog111)
        cg112 (prog112)
      cg12 (prog12):
        cg121 (prog121)
        cg122 (prog122)
    cg2 (prog2):
      cg21 (prog21)
      cg22 (prog22)
      cg23 (prog23)

In the above example, prog0 will call a kfunc which will call prog1 and
prog2 to get sched info for cg1 and cg2; the information is then
summarized and sent back to prog0. Similarly, prog11 and prog12 will be
invoked by the kfunc and their results summarized and sent back to
prog1, and so on.

Currently, for each thread, the x86 kernel allocates an 8KB stack. Each
bpf program (including its subprograms) is limited to a 512B stack to
avoid potential stack overflow, and nested bpf programs increase that
risk. To avoid potential stack overflow caused by bpf programs, this
patch set implements a private stack, so bpf program stack space is
allocated dynamically when the program is jited. The private stack is
applied to tracing programs such as kprobe/uprobe, perf_event,
tracepoint, raw tracepoint and tracing.

But more than one instance of the same bpf program may run in the system.
To keep things simple, a percpu private stack is allocated for each
program, so the same program running concurrently on different cpus will
not cause any issue. Note that the kernel already has logic to prevent
recursion of the same bpf program on the same cpu (kprobe, fentry, etc.).

This patch set implements a percpu private-stack-based approach for the
x86 arch. A new kfunc, bpf_prog_call(), is introduced for the nested
scheduler use case above. If bpf_prog_call() is used in a program and
bpf_tail_call() is not used in the same program, the private stack will
be used. Internally, the private stack allows a certain number of
recursions by allocating more space. Please see the individual patches
for details.

Change logs:
  v3 -> v4:
    - v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
      There was a long discussion in the above v3 thread about letting
      kernel functions also use the private stack in order to simplify the
      implementation. Unfortunately we have not found a workable solution yet,
      so we returned to the approach where the private stack is used only by
      bpf programs.
    - Add bpf_prog_call() kfunc.
  v2 -> v3:
    - Instead of per-subprog private stack allocation, allocate private
      stacks at the main prog or callback entry progs. Subprogs that are not
      main or callback progs increment the inherited stack pointer to form
      their frame pointer.
    - The private stack allows each prog a max stack size of 512 bytes,
      instead of limiting the whole prog hierarchy to 512 bytes.
    - Add some tests.

Yonghong Song (10):
  bpf: Allow each subprog having stack size of 512 bytes
  bpf: Mark each subprog with proper private stack modes
  bpf, x86: Refactor func emit_prologue
  bpf, x86: Create a helper for certain "reg <op>= imm" operations
  bpf, x86: Add jit support for private stack
  selftests/bpf: Add private stack tests
  bpf: Support calling non-tailcall bpf prog
  bpf, x86: Create two helpers for some arith operations
  bpf, x86: Jit support for nested bpf_prog_call
  selftests/bpf: Add tests for bpf_prog_call()

 arch/x86/net/bpf_jit_comp.c                   | 318 ++++++++++++++----
 include/linux/bpf.h                           |  14 +
 include/linux/bpf_verifier.h                  |   3 +
 include/linux/filter.h                        |   1 +
 kernel/bpf/core.c                             |  27 ++
 kernel/bpf/helpers.c                          |  20 ++
 kernel/bpf/trampoline.c                       |  16 +
 kernel/bpf/verifier.c                         | 145 +++++++-
 .../selftests/bpf/prog_tests/prog_call.c      |  78 +++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 tools/testing/selftests/bpf/progs/prog_call.c |  92 +++++
 .../bpf/progs/verifier_private_stack.c        | 216 ++++++++++++
 12 files changed, 856 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/prog_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/prog_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c

-- 
2.43.5


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 01/10] bpf: Allow each subprog having stack size of 512 bytes
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
@ 2024-10-10 17:55 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 02/10] bpf: Mark each subprog with proper private stack modes Yonghong Song
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:55 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

With private stack support, each subprog can have a stack of up to 512
bytes. The 512-byte-per-subprog limit is kept to avoid increasing
verifier complexity, since a larger limit would require substantial
verifier changes and increase memory consumption and verification time.

If the private stack is supported, then for a bpf prog, especially one
with subprogs, a private stack will be allocated for the main prog
and for each callback subprog. For example,
  main_prog
    subprog1
      calling helper
        subprog10 (callback func)
          subprog11
    subprog2
      calling helper
        subprog10 (callback func)
          subprog11

Separate private stack allocations for main_prog and callback_fn
subprog10 keep things simple, since the helper function in between uses
the kernel stack.

Additional subprog info is also collected so that private stacks can
later be allocated for the main prog and each callback function.

Note that if tail_call is used anywhere in the prog (including all
subprogs), the private stack is not used.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h          |  1 +
 include/linux/bpf_verifier.h |  3 ++
 include/linux/filter.h       |  1 +
 kernel/bpf/core.c            |  5 ++
 kernel/bpf/verifier.c        | 94 +++++++++++++++++++++++++++++++-----
 5 files changed, 91 insertions(+), 13 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 19d8ca8ac960..9ef9133e0470 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1483,6 +1483,7 @@ struct bpf_prog_aux {
 	bool xdp_has_frags;
 	bool exception_cb;
 	bool exception_boundary;
+	bool priv_stack_eligible;
 	struct bpf_arena *arena;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 4513372c5bc8..bcfe868e3801 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -659,6 +659,8 @@ struct bpf_subprog_info {
 	 * are used for bpf_fastcall spills and fills.
 	 */
 	s16 fastcall_stack_off;
+	u16 subtree_stack_depth;
+	u16 subtree_top_idx;
 	bool has_tail_call: 1;
 	bool tail_call_reachable: 1;
 	bool has_ld_abs: 1;
@@ -668,6 +670,7 @@ struct bpf_subprog_info {
 	bool args_cached: 1;
 	/* true if bpf_fastcall stack region is used by functions that can't be inlined */
 	bool keep_fastcall_stack: 1;
+	bool priv_stack_eligible: 1;
 
 	u8 arg_cnt;
 	struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d7578a8eac1..3a21947f2fd4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1119,6 +1119,7 @@ bool bpf_jit_supports_exceptions(void);
 bool bpf_jit_supports_ptr_xchg(void);
 bool bpf_jit_supports_arena(void);
 bool bpf_jit_supports_insn(struct bpf_insn *insn, bool in_arena);
+bool bpf_jit_supports_private_stack(void);
 u64 bpf_arch_uaddress_limit(void);
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
 bool bpf_helper_changes_pkt_data(void *func);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 5e77c58e0601..ba088b58746f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3044,6 +3044,11 @@ bool __weak bpf_jit_supports_exceptions(void)
 	return false;
 }
 
+bool __weak bpf_jit_supports_private_stack(void)
+{
+	return false;
+}
+
 void __weak arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
 {
 }
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7d9b38ffd220..3972606f97d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -194,6 +194,8 @@ struct bpf_verifier_stack_elem {
 
 #define BPF_GLOBAL_PERCPU_MA_MAX_SIZE  512
 
+#define BPF_PRIV_STACK_MIN_SUBTREE_SIZE	128
+
 static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx);
 static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
 static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
@@ -5982,6 +5984,41 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 					   strict);
 }
 
+static bool bpf_enable_private_stack(struct bpf_prog *prog)
+{
+	if (!bpf_jit_supports_private_stack())
+		return false;
+
+	switch (prog->aux->prog->type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+		return true;
+	case BPF_PROG_TYPE_TRACING:
+		if (prog->expected_attach_type != BPF_TRACE_ITER)
+			return true;
+		fallthrough;
+	default:
+		return false;
+	}
+}
+
+static bool is_priv_stack_supported(struct bpf_verifier_env *env)
+{
+	struct bpf_subprog_info *si = env->subprog_info;
+	bool has_tail_call = false;
+
+	for (int i = 0; i < env->subprog_cnt; i++) {
+		if (si[i].has_tail_call) {
+			has_tail_call = true;
+			break;
+		}
+	}
+
+	return !has_tail_call && bpf_enable_private_stack(env->prog);
+}
+
 static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
 {
 	if (env->prog->jit_requested)
@@ -5999,16 +6036,21 @@ static int round_up_stack_depth(struct bpf_verifier_env *env, int stack_depth)
  * Since recursion is prevented by check_cfg() this algorithm
  * only needs a local stack of MAX_CALL_FRAMES to remember callsites
  */
-static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
+static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
+					 bool check_priv_stack, bool priv_stack_supported)
 {
 	struct bpf_subprog_info *subprog = env->subprog_info;
 	struct bpf_insn *insn = env->prog->insnsi;
 	int depth = 0, frame = 0, i, subprog_end;
 	bool tail_call_reachable = false;
+	bool priv_stack_eligible = false;
 	int ret_insn[MAX_CALL_FRAMES];
 	int ret_prog[MAX_CALL_FRAMES];
-	int j;
+	int j, subprog_stack_depth;
+	int orig_idx = idx;
 
+	if (check_priv_stack)
+		subprog[idx].subtree_top_idx = idx;
 	i = subprog[idx].start;
 process_func:
 	/* protect against potential stack overflow that might happen when
@@ -6030,18 +6072,33 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 	 * tailcall will unwind the current stack frame but it will not get rid
 	 * of caller's stack as shown on the example above.
 	 */
-	if (idx && subprog[idx].has_tail_call && depth >= 256) {
+	if (!check_priv_stack && idx && subprog[idx].has_tail_call && depth >= 256) {
 		verbose(env,
 			"tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
 			depth);
 		return -EACCES;
 	}
-	depth += round_up_stack_depth(env, subprog[idx].stack_depth);
-	if (depth > MAX_BPF_STACK) {
+	subprog_stack_depth = round_up_stack_depth(env, subprog[idx].stack_depth);
+	depth += subprog_stack_depth;
+	if (!check_priv_stack && !priv_stack_supported && depth > MAX_BPF_STACK) {
 		verbose(env, "combined stack size of %d calls is %d. Too large\n",
 			frame + 1, depth);
 		return -EACCES;
 	}
+	if (check_priv_stack) {
+		if (subprog_stack_depth > MAX_BPF_STACK) {
+			verbose(env, "stack size of subprog %d is %d. Too large\n",
+				idx, subprog_stack_depth);
+			return -EACCES;
+		}
+
+		if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
+			subprog[orig_idx].priv_stack_eligible = true;
+			env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
+		}
+		subprog[orig_idx].subtree_stack_depth =
+			max_t(u16, subprog[orig_idx].subtree_stack_depth, depth);
+	}
 continue_func:
 	subprog_end = subprog[idx + 1].start;
 	for (; i < subprog_end; i++) {
@@ -6097,8 +6154,10 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 		}
 		i = next_insn;
 		idx = sidx;
+		if (check_priv_stack)
+			subprog[idx].subtree_top_idx = orig_idx;
 
-		if (subprog[idx].has_tail_call)
+		if (!check_priv_stack && subprog[idx].has_tail_call)
 			tail_call_reachable = true;
 
 		frame++;
@@ -6122,7 +6181,7 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 			}
 			subprog[ret_prog[j]].tail_call_reachable = true;
 		}
-	if (subprog[0].tail_call_reachable)
+	if (!check_priv_stack && subprog[0].tail_call_reachable)
 		env->prog->aux->tail_call_reachable = true;
 
 	/* end of for() loop means the last insn of the 'subprog'
@@ -6137,14 +6196,18 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx)
 	goto continue_func;
 }
 
-static int check_max_stack_depth(struct bpf_verifier_env *env)
+static int check_max_stack_depth(struct bpf_verifier_env *env, bool check_priv_stack,
+				 bool priv_stack_supported)
 {
 	struct bpf_subprog_info *si = env->subprog_info;
+	bool check_subprog;
 	int ret;
 
 	for (int i = 0; i < env->subprog_cnt; i++) {
-		if (!i || si[i].is_async_cb) {
-			ret = check_max_stack_depth_subprog(env, i);
+		check_subprog = !i || (check_priv_stack ? si[i].is_cb : si[i].is_async_cb);
+		if (check_subprog) {
+			ret = check_max_stack_depth_subprog(env, i, check_priv_stack,
+							    priv_stack_supported);
 			if (ret < 0)
 				return ret;
 		}
@@ -22298,7 +22361,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	struct bpf_verifier_env *env;
 	int i, len, ret = -EINVAL, err;
 	u32 log_true_size;
-	bool is_priv;
+	bool is_priv, priv_stack_supported = false;
 
 	/* no program is valid */
 	if (ARRAY_SIZE(bpf_verifier_ops) == 0)
@@ -22425,8 +22488,10 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 	if (ret == 0)
 		ret = remove_fastcall_spills_fills(env);
 
-	if (ret == 0)
-		ret = check_max_stack_depth(env);
+	if (ret == 0) {
+		priv_stack_supported = is_priv_stack_supported(env);
+		ret = check_max_stack_depth(env, false, priv_stack_supported);
+	}
 
 	/* instruction rewrites happen after this point */
 	if (ret == 0)
@@ -22460,6 +22525,9 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
 								     : false;
 	}
 
+	if (ret == 0 && priv_stack_supported)
+		ret = check_max_stack_depth(env, true, true);
+
 	if (ret == 0)
 		ret = fixup_call_args(env);
 
-- 
2.43.5



* [PATCH bpf-next v4 02/10] bpf: Mark each subprog with proper private stack modes
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
  2024-10-10 17:55 ` [PATCH bpf-next v4 01/10] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 03/10] bpf, x86: Refactor func emit_prologue Yonghong Song
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Three private stack modes are used to direct the jit action:
  NO_PRIV_STACK:        do not use the private stack
  PRIV_STACK_SUB_PROG:  adjust the frame pointer address (similar to the
                        normal stack)
  PRIV_STACK_ROOT_PROG: set the frame pointer

Note that for a subtree root prog (main prog or callback fn),
PRIV_STACK_ROOT_PROG mode is used even if the bpf_prog stack size is 0.
This is for bpf exception handling. More details can be found in the
subsequent jit support and selftest patches.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h   |  9 +++++++++
 kernel/bpf/core.c     | 19 +++++++++++++++++++
 kernel/bpf/verifier.c | 29 +++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9ef9133e0470..f22ddb423fd0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1450,6 +1450,12 @@ struct btf_mod_pair {
 
 struct bpf_kfunc_desc_tab;
 
+enum bpf_priv_stack_mode {
+	NO_PRIV_STACK,
+	PRIV_STACK_SUB_PROG,
+	PRIV_STACK_ROOT_PROG,
+};
+
 struct bpf_prog_aux {
 	atomic64_t refcnt;
 	u32 used_map_cnt;
@@ -1466,6 +1472,9 @@ struct bpf_prog_aux {
 	u32 ctx_arg_info_size;
 	u32 max_rdonly_access;
 	u32 max_rdwr_access;
+	enum bpf_priv_stack_mode priv_stack_mode;
+	u16 subtree_stack_depth; /* Subtree stack depth if PRIV_STACK_ROOT_PROG, 0 otherwise */
+	void __percpu *priv_stack_ptr;
 	struct btf *attach_btf;
 	const struct bpf_ctx_arg_aux *ctx_arg_info;
 	struct mutex dst_mutex; /* protects dst_* pointers below, *after* prog becomes visible */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba088b58746f..f79d951a061f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1239,6 +1239,7 @@ void __weak bpf_jit_free(struct bpf_prog *fp)
 		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
 
 		bpf_jit_binary_free(hdr);
+		free_percpu(fp->aux->priv_stack_ptr);
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
 	}
 
@@ -2420,6 +2421,24 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 		if (*err)
 			return fp;
 
+		if (fp->aux->priv_stack_eligible) {
+			if (!fp->aux->stack_depth) {
+				fp->aux->priv_stack_mode = NO_PRIV_STACK;
+			} else {
+				void __percpu *priv_stack_ptr;
+
+				fp->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+				priv_stack_ptr =
+					__alloc_percpu_gfp(fp->aux->stack_depth, 8, GFP_KERNEL);
+				if (!priv_stack_ptr) {
+					*err = -ENOMEM;
+					return fp;
+				}
+				fp->aux->subtree_stack_depth = fp->aux->stack_depth;
+				fp->aux->priv_stack_ptr = priv_stack_ptr;
+			}
+		}
+
 		fp = bpf_int_jit_compile(fp);
 		bpf_prog_jit_attempt_done(fp);
 		if (!fp->jited && jit_needed) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3972606f97d2..46b0c277c6a8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20003,6 +20003,8 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 {
 	struct bpf_prog *prog = env->prog, **func, *tmp;
 	int i, j, subprog_start, subprog_end = 0, len, subprog;
+	int subtree_top_idx, subtree_stack_depth;
+	void __percpu *priv_stack_ptr;
 	struct bpf_map *map_ptr;
 	struct bpf_insn *insn;
 	void *old_bpf_func;
@@ -20081,6 +20083,33 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 		func[i]->is_func = 1;
 		func[i]->sleepable = prog->sleepable;
 		func[i]->aux->func_idx = i;
+
+		subtree_top_idx = env->subprog_info[i].subtree_top_idx;
+		if (env->subprog_info[subtree_top_idx].priv_stack_eligible) {
+			if (subtree_top_idx == i)
+				func[i]->aux->subtree_stack_depth =
+					env->subprog_info[i].subtree_stack_depth;
+
+			subtree_stack_depth = func[i]->aux->subtree_stack_depth;
+			if (subtree_top_idx != i) {
+				if (env->subprog_info[subtree_top_idx].subtree_stack_depth)
+					func[i]->aux->priv_stack_mode = PRIV_STACK_SUB_PROG;
+				else
+					func[i]->aux->priv_stack_mode = NO_PRIV_STACK;
+			} else if (!subtree_stack_depth) {
+				func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+			} else {
+				func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
+				priv_stack_ptr =
+					__alloc_percpu_gfp(subtree_stack_depth, 8, GFP_KERNEL);
+				if (!priv_stack_ptr) {
+					err = -ENOMEM;
+					goto out_free;
+				}
+				func[i]->aux->priv_stack_ptr = priv_stack_ptr;
+			}
+		}
+
 		/* Below members will be freed only at prog->aux */
 		func[i]->aux->btf = prog->aux->btf;
 		func[i]->aux->func_info = prog->aux->func_info;
-- 
2.43.5



* [PATCH bpf-next v4 03/10] bpf, x86: Refactor func emit_prologue
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
  2024-10-10 17:55 ` [PATCH bpf-next v4 01/10] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 02/10] bpf: Mark each subprog with proper private stack modes Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 04/10] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Refactor function emit_prologue() so that it takes bpf_prog as one of
its arguments. This reduces the total number of arguments, since later
patches will add more arguments to this function.

Also add a variable 'stack_depth' to hold the value of
  bpf_prog->aux->stack_depth
to simplify the code.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 06b080b61aa5..6d24389e58a1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -489,10 +489,12 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
  * while jumping to another program
  */
-static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
-			  bool tail_call_reachable, bool is_subprog,
-			  bool is_exception_cb)
+static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog,
+			  bool tail_call_reachable)
 {
+	bool ebpf_from_cbpf = bpf_prog_was_classic(bpf_prog);
+	bool is_exception_cb = bpf_prog->aux->exception_cb;
+	bool is_subprog = bpf_is_subprog(bpf_prog);
 	u8 *prog = *pprog;
 
 	emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);
@@ -1424,17 +1426,18 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	u64 arena_vm_start, user_vm_start;
 	int i, excnt = 0;
 	int ilen, proglen = 0;
+	u32 stack_depth;
 	u8 *prog = temp;
 	int err;
 
+	stack_depth = bpf_prog->aux->stack_depth;
+
 	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
 	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
-	emit_prologue(&prog, bpf_prog->aux->stack_depth,
-		      bpf_prog_was_classic(bpf_prog), tail_call_reachable,
-		      bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb);
+	emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -2128,7 +2131,7 @@ st:			if (is_imm8(insn->off))
 
 			func = (u8 *) __bpf_call_base + imm32;
 			if (tail_call_reachable) {
-				LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
+				LOAD_TAIL_CALL_CNT_PTR(stack_depth);
 				ip += 7;
 			}
 			if (!imm32)
@@ -2145,13 +2148,13 @@ st:			if (is_imm8(insn->off))
 							  &bpf_prog->aux->poke_tab[imm32 - 1],
 							  &prog, image + addrs[i - 1],
 							  callee_regs_used,
-							  bpf_prog->aux->stack_depth,
+							  stack_depth,
 							  ctx);
 			else
 				emit_bpf_tail_call_indirect(bpf_prog,
 							    &prog,
 							    callee_regs_used,
-							    bpf_prog->aux->stack_depth,
+							    stack_depth,
 							    image + addrs[i - 1],
 							    ctx);
 			break;
-- 
2.43.5



* [PATCH bpf-next v4 04/10] bpf, x86: Create a helper for certain "reg <op>= imm" operations
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (2 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 03/10] bpf, x86: Refactor func emit_prologue Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 05/10] bpf, x86: Add jit support for private stack Yonghong Song
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Create a helper that generates jited code for certain "reg <op>= imm"
operations, where the operation is one of add/sub/and/or/xor. This
helper will be used in a subsequent patch.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 82 +++++++++++++++++++++----------------
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 6d24389e58a1..f01fdabf786e 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1406,6 +1406,51 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
 	*pprog = prog;
 }
 
+/* emit ADD/SUB/AND/OR/XOR 'reg <op>= imm' operations */
+static void emit_alu_helper_1(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
+{
+	u8 b2 = 0, b3 = 0;
+	u8 *prog = *pprog;
+
+	maybe_emit_1mod(&prog, dst_reg, BPF_CLASS(insn_code) == BPF_ALU64);
+
+	/*
+	 * b3 holds 'normal' opcode, b2 short form only valid
+	 * in case dst is eax/rax.
+	 */
+	switch (BPF_OP(insn_code)) {
+	case BPF_ADD:
+		b3 = 0xC0;
+		b2 = 0x05;
+		break;
+	case BPF_SUB:
+		b3 = 0xE8;
+		b2 = 0x2D;
+		break;
+	case BPF_AND:
+		b3 = 0xE0;
+		b2 = 0x25;
+		break;
+	case BPF_OR:
+		b3 = 0xC8;
+		b2 = 0x0D;
+		break;
+	case BPF_XOR:
+		b3 = 0xF0;
+		b2 = 0x35;
+		break;
+	}
+
+	if (is_imm8(imm32))
+		EMIT3(0x83, add_1reg(b3, dst_reg), imm32);
+	else if (is_axreg(dst_reg))
+		EMIT1_off32(b2, imm32);
+	else
+		EMIT2_off32(0x81, add_1reg(b3, dst_reg), imm32);
+
+	*pprog = prog;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 #define __LOAD_TCC_PTR(off)			\
@@ -1567,42 +1612,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		case BPF_ALU64 | BPF_AND | BPF_K:
 		case BPF_ALU64 | BPF_OR | BPF_K:
 		case BPF_ALU64 | BPF_XOR | BPF_K:
-			maybe_emit_1mod(&prog, dst_reg,
-					BPF_CLASS(insn->code) == BPF_ALU64);
-
-			/*
-			 * b3 holds 'normal' opcode, b2 short form only valid
-			 * in case dst is eax/rax.
-			 */
-			switch (BPF_OP(insn->code)) {
-			case BPF_ADD:
-				b3 = 0xC0;
-				b2 = 0x05;
-				break;
-			case BPF_SUB:
-				b3 = 0xE8;
-				b2 = 0x2D;
-				break;
-			case BPF_AND:
-				b3 = 0xE0;
-				b2 = 0x25;
-				break;
-			case BPF_OR:
-				b3 = 0xC8;
-				b2 = 0x0D;
-				break;
-			case BPF_XOR:
-				b3 = 0xF0;
-				b2 = 0x35;
-				break;
-			}
-
-			if (is_imm8(imm32))
-				EMIT3(0x83, add_1reg(b3, dst_reg), imm32);
-			else if (is_axreg(dst_reg))
-				EMIT1_off32(b2, imm32);
-			else
-				EMIT2_off32(0x81, add_1reg(b3, dst_reg), imm32);
+			emit_alu_helper_1(&prog, insn->code, dst_reg, imm32);
 			break;
 
 		case BPF_ALU64 | BPF_MOV | BPF_K:
-- 
2.43.5



* [PATCH bpf-next v4 05/10] bpf, x86: Add jit support for private stack
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (3 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 04/10] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 06/10] selftests/bpf: Add private stack tests Yonghong Song
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Add jit support for the private stack. For a particular subtree, e.g.,
  subtree_root <== stack depth 120
   subprog1    <== stack depth 80
    subprog2   <== stack depth 40
   subprog3    <== stack depth 160

Let us say that priv_stack_ptr is the memory address allocated for the
private stack. The frame pointer for each prog above is then calculated
as below:
  subtree_root  <== subtree_root_fp = priv_stack_ptr + 120
   subprog1     <== subtree_subprog1_fp = subtree_root_fp + 80
    subprog2    <== subtree_subprog2_fp = subtree_subprog1_fp + 40
   subprog3     <== subtree_subprog3_fp = subtree_root_fp + 160

For any call to a helper/kfunc, a push/pop of the prog frame pointer is
needed in order to preserve its value across the call.

To deal with exception handling, a push/pop of the frame pointer is also
emitted around calls to subsequent subprogs. For example,
  subtree_root
   subprog1
     ...
     insn: call bpf_throw
     ...

After jit, we will have
  subtree_root
   insn: push r9
   subprog1
     ...
     insn: push r9
     insn: call bpf_throw
     insn: pop r9
     ...
   insn: pop r9

  exception_handler
     pop r9
     ...
where r9 represents the fp for each subprog.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 88 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 86 insertions(+), 2 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index f01fdabf786e..a6ba85cec49a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -325,6 +325,22 @@ struct jit_context {
 /* Number of bytes that will be skipped on tailcall */
 #define X86_TAIL_CALL_OFFSET	(12 + ENDBR_INSN_SIZE)
 
+static void push_r9(u8 **pprog)
+{
+	u8 *prog = *pprog;
+
+	EMIT2(0x41, 0x51);   /* push r9 */
+	*pprog = prog;
+}
+
+static void pop_r9(u8 **pprog)
+{
+	u8 *prog = *pprog;
+
+	EMIT2(0x41, 0x59);   /* pop r9 */
+	*pprog = prog;
+}
+
 static void push_r12(u8 **pprog)
 {
 	u8 *prog = *pprog;
@@ -484,13 +500,17 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
 	*pprog = prog;
 }
 
+static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+				enum bpf_priv_stack_mode priv_stack_mode);
+
 /*
  * Emit x86-64 prologue code for BPF program.
  * bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes
  * while jumping to another program
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog,
-			  bool tail_call_reachable)
+			  bool tail_call_reachable,
+			  enum bpf_priv_stack_mode priv_stack_mode)
 {
 	bool ebpf_from_cbpf = bpf_prog_was_classic(bpf_prog);
 	bool is_exception_cb = bpf_prog->aux->exception_cb;
@@ -520,6 +540,8 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog
 		 * first restore those callee-saved regs from stack, before
 		 * reusing the stack frame.
 		 */
+		if (priv_stack_mode != NO_PRIV_STACK)
+			pop_r9(&prog);
 		pop_callee_regs(&prog, all_callee_regs_used);
 		pop_r12(&prog);
 		/* Reset the stack frame. */
@@ -532,6 +554,8 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
 
+	emit_priv_frame_ptr(&prog, bpf_prog, priv_stack_mode);
+
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
 		EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
@@ -1451,6 +1475,42 @@ static void emit_alu_helper_1(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
 	*pprog = prog;
 }
 
+static void emit_root_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+				     u32 orig_stack_depth)
+{
+	void __percpu *priv_frame_ptr;
+	u8 *prog = *pprog;
+
+	priv_frame_ptr = bpf_prog->aux->priv_stack_ptr + orig_stack_depth;
+
+	/* movabs r9, priv_frame_ptr */
+	emit_mov_imm64(&prog, X86_REG_R9, (long) priv_frame_ptr >> 32,
+		       (u32) (long) priv_frame_ptr);
+#ifdef CONFIG_SMP
+	/* add <r9>, gs:[<off>] */
+	EMIT2(0x65, 0x4c);
+	EMIT3(0x03, 0x0c, 0x25);
+	EMIT((u32)(unsigned long)&this_cpu_off, 4);
+#endif
+	*pprog = prog;
+}
+
+static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
+				enum bpf_priv_stack_mode priv_stack_mode)
+{
+	u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
+	u8 *prog = *pprog;
+
+	if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
+		emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
+	else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
+		/* r9 += orig_stack_depth */
+		emit_alu_helper_1(&prog, BPF_ALU64 | BPF_ADD | BPF_K, X86_REG_R9,
+				  orig_stack_depth);
+
+	*pprog = prog;
+}
+
 #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
 
 #define __LOAD_TCC_PTR(off)			\
@@ -1464,6 +1524,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 {
 	bool tail_call_reachable = bpf_prog->aux->tail_call_reachable;
 	struct bpf_insn *insn = bpf_prog->insnsi;
+	enum bpf_priv_stack_mode priv_stack_mode;
 	bool callee_regs_used[4] = {};
 	int insn_cnt = bpf_prog->len;
 	bool seen_exit = false;
@@ -1476,13 +1537,17 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	int err;
 
 	stack_depth = bpf_prog->aux->stack_depth;
+	priv_stack_mode = bpf_prog->aux->priv_stack_mode;
+	if (priv_stack_mode != NO_PRIV_STACK)
+		stack_depth = 0;
 
 	arena_vm_start = bpf_arena_get_kern_vm_start(bpf_prog->aux->arena);
 	user_vm_start = bpf_arena_get_user_vm_start(bpf_prog->aux->arena);
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
-	emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable);
+	emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable,
+		      priv_stack_mode);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -1521,6 +1586,14 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		u8 *func;
 		int nops;
 
+		if (priv_stack_mode != NO_PRIV_STACK) {
+			if (src_reg == BPF_REG_FP)
+				src_reg = X86_REG_R9;
+
+			if (dst_reg == BPF_REG_FP)
+				dst_reg = X86_REG_R9;
+		}
+
 		switch (insn->code) {
 			/* ALU */
 		case BPF_ALU | BPF_ADD | BPF_X:
@@ -2146,9 +2219,15 @@ st:			if (is_imm8(insn->off))
 			}
 			if (!imm32)
 				return -EINVAL;
+			if (priv_stack_mode != NO_PRIV_STACK) {
+				push_r9(&prog);
+				ip += 2;
+			}
 			ip += x86_call_depth_emit_accounting(&prog, func, ip);
 			if (emit_call(&prog, func, ip))
 				return -EINVAL;
+			if (priv_stack_mode != NO_PRIV_STACK)
+				pop_r9(&prog);
 			break;
 		}
 
@@ -3572,6 +3651,11 @@ bool bpf_jit_supports_exceptions(void)
 	return IS_ENABLED(CONFIG_UNWINDER_ORC);
 }
 
+bool bpf_jit_supports_private_stack(void)
+{
+	return true;
+}
+
 void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie)
 {
 #if defined(CONFIG_UNWINDER_ORC)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH bpf-next v4 06/10] selftests/bpf: Add private stack tests
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (4 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 05/10] bpf, x86: Add jit support for private stack Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 17:56 ` [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog Yonghong Song
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Several private stack tests are added, including:
  - prog with stack size greater than BPF_PSTACK_MIN_SUBTREE_SIZE.
  - prog with stack size less than BPF_PSTACK_MIN_SUBTREE_SIZE.
  - prog with one subprog having MAX_BPF_STACK stack size and another
    subprog having non-zero stack size.
  - prog with callback function.
  - prog with exception in main prog or subprog.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 .../bpf/progs/verifier_private_stack.c        | 216 ++++++++++++++++++
 2 files changed, 218 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c

diff --git a/tools/testing/selftests/bpf/prog_tests/verifier.c b/tools/testing/selftests/bpf/prog_tests/verifier.c
index e26b5150fc43..635ff3509403 100644
--- a/tools/testing/selftests/bpf/prog_tests/verifier.c
+++ b/tools/testing/selftests/bpf/prog_tests/verifier.c
@@ -59,6 +59,7 @@
 #include "verifier_or_jmp32_k.skel.h"
 #include "verifier_precision.skel.h"
 #include "verifier_prevent_map_lookup.skel.h"
+#include "verifier_private_stack.skel.h"
 #include "verifier_raw_stack.skel.h"
 #include "verifier_raw_tp_writable.skel.h"
 #include "verifier_reg_equal.skel.h"
@@ -185,6 +186,7 @@ void test_verifier_bpf_fastcall(void)         { RUN(verifier_bpf_fastcall); }
 void test_verifier_or_jmp32_k(void)           { RUN(verifier_or_jmp32_k); }
 void test_verifier_precision(void)            { RUN(verifier_precision); }
 void test_verifier_prevent_map_lookup(void)   { RUN(verifier_prevent_map_lookup); }
+void test_verifier_private_stack(void)        { RUN(verifier_private_stack); }
 void test_verifier_raw_stack(void)            { RUN(verifier_raw_stack); }
 void test_verifier_raw_tp_writable(void)      { RUN(verifier_raw_tp_writable); }
 void test_verifier_reg_equal(void)            { RUN(verifier_reg_equal); }
diff --git a/tools/testing/selftests/bpf/progs/verifier_private_stack.c b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
new file mode 100644
index 000000000000..e8de565f8b34
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/verifier_private_stack.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+/* From include/linux/filter.h */
+#define MAX_BPF_STACK    512
+
+#if defined(__TARGET_ARCH_x86)
+
+SEC("kprobe")
+__description("Private stack, single prog")
+__success
+__arch_x86_64
+__jited("	movabsq	$0x{{.*}}, %r9")
+__jited("	addq	%gs:0x{{.*}}, %r9")
+__jited("	movl	$0x2a, %edi")
+__jited("	movq	%rdi, -0x100(%r9)")
+__naked void private_stack_single_prog(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - 256) = r1;"
+	"r0 = 0;"
+	"exit;"
+	:
+	:
+	: __clobber_all);
+}
+
+__used
+__naked static void cumulative_stack_depth_subprog(void)
+{
+        asm volatile (
+	"r1 = 41;"
+        "*(u64 *)(r10 - 32) = r1;"
+        "call %[bpf_get_smp_processor_id];"
+        "exit;"
+        :: __imm(bpf_get_smp_processor_id)
+	: __clobber_all);
+}
+
+SEC("kprobe")
+__description("Private stack, subtree > MAX_BPF_STACK")
+__success
+__arch_x86_64
+/* private stack fp for the main prog */
+__jited("	movabsq	$0x{{.*}}, %r9")
+__jited("	addq	%gs:0x{{.*}}, %r9")
+__jited("	movl	$0x2a, %edi")
+__jited("	movq	%rdi, -0x200(%r9)")
+__jited("	pushq	%r9")
+__jited("	callq	0x{{.*}}")
+__jited("	popq	%r9")
+__jited("	xorl	%eax, %eax")
+__naked void private_stack_nested_1(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - %[max_bpf_stack]) = r1;"
+	"call cumulative_stack_depth_subprog;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm_const(max_bpf_stack, MAX_BPF_STACK)
+	: __clobber_all);
+}
+
+SEC("kprobe")
+__description("Private stack, subtree > MAX_BPF_STACK")
+__success
+__arch_x86_64
+/* private stack fp for the subprog */
+__jited("	addq	$0x20, %r9")
+__naked void private_stack_nested_2(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - %[max_bpf_stack]) = r1;"
+	"call cumulative_stack_depth_subprog;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm_const(max_bpf_stack, MAX_BPF_STACK)
+	: __clobber_all);
+}
+
+SEC("raw_tp")
+__description("No private stack, nested")
+__success
+__arch_x86_64
+__jited("	subq	$0x8, %rsp")
+__naked void no_private_stack_nested(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - 8) = r1;"
+	"call cumulative_stack_depth_subprog;"
+	"r0 = 0;"
+	"exit;"
+	:
+	:
+	: __clobber_all);
+}
+
+__naked __noinline __used
+static unsigned long loop_callback(void)
+{
+	asm volatile (
+	"call %[bpf_get_prandom_u32];"
+	"r1 = 42;"
+	"*(u64 *)(r10 - 512) = r1;"
+	"call cumulative_stack_depth_subprog;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm(bpf_get_prandom_u32)
+	: __clobber_common);
+}
+
+SEC("raw_tp")
+__description("Private stack, callback")
+__success
+__arch_x86_64
+/* for func loop_callback */
+__jited("func #1")
+__jited("	endbr64")
+__jited("	nopl	(%rax,%rax)")
+__jited("	nopl	(%rax)")
+__jited("	pushq	%rbp")
+__jited("	movq	%rsp, %rbp")
+__jited("	endbr64")
+__jited("	movabsq	$0x{{.*}}, %r9")
+__jited("	addq	%gs:0x{{.*}}, %r9")
+__jited("	pushq	%r9")
+__jited("	callq")
+__jited("	popq	%r9")
+__jited("	movl	$0x2a, %edi")
+__jited("	movq	%rdi, -0x200(%r9)")
+__jited("	pushq	%r9")
+__jited("	callq")
+__jited("	popq	%r9")
+__naked void private_stack_callback(void)
+{
+	asm volatile (
+	"r1 = 1;"
+	"r2 = %[loop_callback];"
+	"r3 = 0;"
+	"r4 = 0;"
+	"call %[bpf_loop];"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm_ptr(loop_callback),
+	  __imm(bpf_loop)
+	: __clobber_common);
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in main prog")
+__success __retval(0)
+__arch_x86_64
+__jited("	pushq	%r9")
+__jited("	callq")
+__jited("	popq	%r9")
+int private_stack_exception_main_prog(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - 512) = r1;"
+	::: __clobber_common);
+
+	bpf_throw(0);
+	return 0;
+}
+
+__used static int subprog_exception(void)
+{
+	bpf_throw(0);
+	return 0;
+}
+
+SEC("fentry/bpf_fentry_test9")
+__description("Private stack, exception in subprog")
+__success __retval(0)
+__arch_x86_64
+__jited("	movq	%rdi, -0x200(%r9)")
+__jited("	pushq	%r9")
+__jited("	callq")
+__jited("	popq	%r9")
+int private_stack_exception_sub_prog(void)
+{
+	asm volatile (
+	"r1 = 42;"
+	"*(u64 *)(r10 - 512) = r1;"
+	"call subprog_exception;"
+	::: __clobber_common);
+
+	return 0;
+}
+
+#else
+
+SEC("kprobe")
+__description("private stack is not supported, use a dummy test")
+__success
+int dummy_test(void)
+{
+        return 0;
+}
+
+#endif
+
+char _license[] SEC("license") = "GPL";
-- 
2.43.5



* [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (5 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 06/10] selftests/bpf: Add private stack tests Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 20:28   ` Alexei Starovoitov
  2024-10-10 17:56 ` [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations Yonghong Song
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

A new kfunc bpf_prog_call() is introduced so that one bpf prog can call
another bpf prog. It has the same parameters as bpf_tail_call() but acts
like a normal function call.

Since bpf_prog_call() could recurse back to the caller prog itself, any
bpf prog that calls bpf_prog_call() will use private stacks with a
maximum recursion level of 4, which should be sufficient for most cases.

bpf_prog_call() cannot be used if a tail_call exists in the same prog,
since tail_call does not use the private stack. If both prog_call and
tail_call appear in the same prog, verification will fail.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 include/linux/bpf.h   |  2 ++
 kernel/bpf/core.c     |  7 +++++--
 kernel/bpf/helpers.c  | 20 ++++++++++++++++++++
 kernel/bpf/verifier.c | 30 ++++++++++++++++++++++++++----
 4 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f22ddb423fd0..952cb398eb30 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1493,6 +1493,7 @@ struct bpf_prog_aux {
 	bool exception_cb;
 	bool exception_boundary;
 	bool priv_stack_eligible;
+	bool has_prog_call;
 	struct bpf_arena *arena;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
@@ -1929,6 +1930,7 @@ struct bpf_array {
 
 #define BPF_COMPLEXITY_LIMIT_INSNS      1000000 /* yes. 1M insns */
 #define MAX_TAIL_CALL_CNT 33
+#define BPF_MAX_PRIV_STACK_NEST_LEVEL	4
 
 /* Maximum number of loops for bpf_loop and bpf_iter_num.
  * It's enum to expose it (and thus make it discoverable) through BTF.
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f79d951a061f..0d2c97f63ecf 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2426,10 +2426,13 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 				fp->aux->priv_stack_mode = NO_PRIV_STACK;
 			} else {
 				void __percpu *priv_stack_ptr;
+				int nest_level = 1;
 
+				if (fp->aux->has_prog_call)
+					nest_level = BPF_MAX_PRIV_STACK_NEST_LEVEL;
 				fp->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
-				priv_stack_ptr =
-					__alloc_percpu_gfp(fp->aux->stack_depth, 8, GFP_KERNEL);
+				priv_stack_ptr = __alloc_percpu_gfp(
+					fp->aux->stack_depth * nest_level, 8, GFP_KERNEL);
 				if (!priv_stack_ptr) {
 					*err = -ENOMEM;
 					return fp;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4053f279ed4c..9cc880dc213e 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2749,6 +2749,25 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+__bpf_kfunc int bpf_prog_call(void *ctx, struct bpf_map *p__map, u32 index)
+{
+	struct bpf_array *array;
+	struct bpf_prog *prog;
+
+	if (p__map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
+		return -EINVAL;
+
+	array = container_of(p__map, struct bpf_array, map);
+	if (unlikely(index >= array->map.max_entries))
+		return -E2BIG;
+
+	prog = READ_ONCE(array->ptrs[index]);
+	if (!prog)
+		return -ENOENT;
+
+	return bpf_prog_run(prog, ctx);
+}
+
 struct bpf_throw_ctx {
 	struct bpf_prog_aux *aux;
 	u64 sp;
@@ -3035,6 +3054,7 @@ BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_throw)
+BTF_ID_FLAGS(func, bpf_prog_call)
 BTF_KFUNCS_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 46b0c277c6a8..e3d9820618a1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5986,6 +5986,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 
 static bool bpf_enable_private_stack(struct bpf_prog *prog)
 {
+	if (prog->aux->has_prog_call)
+		return true;
+
 	if (!bpf_jit_supports_private_stack())
 		return false;
 
@@ -6092,7 +6095,9 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
 			return -EACCES;
 		}
 
-		if (!priv_stack_eligible && depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE) {
+		if (!priv_stack_eligible &&
+		    (depth >= BPF_PRIV_STACK_MIN_SUBTREE_SIZE ||
+		     env->prog->aux->has_prog_call)) {
 			subprog[orig_idx].priv_stack_eligible = true;
 			env->prog->aux->priv_stack_eligible = priv_stack_eligible = true;
 		}
@@ -6181,8 +6186,13 @@ static int check_max_stack_depth_subprog(struct bpf_verifier_env *env, int idx,
 			}
 			subprog[ret_prog[j]].tail_call_reachable = true;
 		}
-	if (!check_priv_stack && subprog[0].tail_call_reachable)
+	if (!check_priv_stack && subprog[0].tail_call_reachable) {
+		if (env->prog->aux->has_prog_call) {
+			verbose(env, "cannot do prog call and tail call in the same prog\n");
+			return -EINVAL;
+		}
 		env->prog->aux->tail_call_reachable = true;
+	}
 
 	/* end of for() loop means the last insn of the 'subprog'
 	 * was reached. Doesn't matter whether it was JA or EXIT
@@ -11322,6 +11332,7 @@ enum special_kfunc_type {
 	KF_bpf_preempt_enable,
 	KF_bpf_iter_css_task_new,
 	KF_bpf_session_cookie,
+	KF_bpf_prog_call,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -11387,6 +11398,7 @@ BTF_ID(func, bpf_session_cookie)
 #else
 BTF_ID_UNUSED
 #endif
+BTF_ID(func, bpf_prog_call)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -11433,6 +11445,11 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (meta->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx])
 		return KF_ARG_PTR_TO_CTX;
 
+	if (meta->func_id == special_kfunc_list[KF_bpf_prog_call] && argno == 0) {
+		env->prog->aux->has_prog_call = true;
+		return KF_ARG_PTR_TO_CTX;
+	}
+
 	/* In this function, we verify the kfunc's BTF as per the argument type,
 	 * leaving the rest of the verification with respect to the register
 	 * type to our caller. When a set of conditions hold in the BTF type of
@@ -20009,6 +20026,7 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	struct bpf_insn *insn;
 	void *old_bpf_func;
 	int err, num_exentries;
+	int nest_level = 1;
 
 	if (env->subprog_cnt <= 1)
 		return 0;
@@ -20099,9 +20117,13 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			} else if (!subtree_stack_depth) {
 				func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
 			} else {
+				if (env->prog->aux->has_prog_call) {
+					func[i]->aux->has_prog_call = true;
+					nest_level = BPF_MAX_PRIV_STACK_NEST_LEVEL;
+				}
 				func[i]->aux->priv_stack_mode = PRIV_STACK_ROOT_PROG;
-				priv_stack_ptr =
-					__alloc_percpu_gfp(subtree_stack_depth, 8, GFP_KERNEL);
+				priv_stack_ptr = __alloc_percpu_gfp(
+					subtree_stack_depth * nest_level, 8, GFP_KERNEL);
 				if (!priv_stack_ptr) {
 					err = -ENOMEM;
 					goto out_free;
-- 
2.43.5



* [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (6 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 20:21   ` Alexei Starovoitov
  2024-10-10 17:56 ` [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call Yonghong Song
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Two helpers are extracted from bpf/x86 jit:
  - a helper to handle 'reg1 <op>= reg2' where <op> is add/sub/and/or/xor
  - a helper to handle 'reg *= imm'

Both helpers will be used in the subsequent patch.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 51 ++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index a6ba85cec49a..297dd64f4b6a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1475,6 +1475,37 @@ static void emit_alu_helper_1(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
 	*pprog = prog;
 }
 
+/* emit ADD/SUB/AND/OR/XOR 'reg1 <op>= reg2' operations */
+static void emit_alu_helper_2(u8 **pprog, u8 insn_code, u32 dst_reg, u32 src_reg)
+{
+	u8 b2 = 0;
+	u8 *prog = *pprog;
+
+	maybe_emit_mod(&prog, dst_reg, src_reg,
+		       BPF_CLASS(insn_code) == BPF_ALU64);
+	b2 = simple_alu_opcodes[BPF_OP(insn_code)];
+	EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
+
+	*pprog = prog;
+}
+
+/* emit 'reg *= imm' operations */
+static void emit_alu_helper_3(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
+{
+	u8 *prog = *pprog;
+
+	maybe_emit_mod(&prog, dst_reg, dst_reg, BPF_CLASS(insn_code) == BPF_ALU64);
+
+	if (is_imm8(imm32))
+		/* imul dst_reg, dst_reg, imm8 */
+		EMIT3(0x6B, add_2reg(0xC0, dst_reg, dst_reg), imm32);
+	else
+		/* imul dst_reg, dst_reg, imm32 */
+		EMIT2_off32(0x69, add_2reg(0xC0, dst_reg, dst_reg), imm32);
+
+	*pprog = prog;
+}
+
 static void emit_root_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
 				     u32 orig_stack_depth)
 {
@@ -1578,7 +1609,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		const s32 imm32 = insn->imm;
 		u32 dst_reg = insn->dst_reg;
 		u32 src_reg = insn->src_reg;
-		u8 b2 = 0, b3 = 0;
+		u8 b3 = 0;
 		u8 *start_of_ldx;
 		s64 jmp_offset;
 		s16 insn_off;
@@ -1606,10 +1637,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 		case BPF_ALU64 | BPF_AND | BPF_X:
 		case BPF_ALU64 | BPF_OR | BPF_X:
 		case BPF_ALU64 | BPF_XOR | BPF_X:
-			maybe_emit_mod(&prog, dst_reg, src_reg,
-				       BPF_CLASS(insn->code) == BPF_ALU64);
-			b2 = simple_alu_opcodes[BPF_OP(insn->code)];
-			EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
+			emit_alu_helper_2(&prog, insn->code, dst_reg, src_reg);
 			break;
 
 		case BPF_ALU64 | BPF_MOV | BPF_X:
@@ -1772,18 +1800,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 
 		case BPF_ALU | BPF_MUL | BPF_K:
 		case BPF_ALU64 | BPF_MUL | BPF_K:
-			maybe_emit_mod(&prog, dst_reg, dst_reg,
-				       BPF_CLASS(insn->code) == BPF_ALU64);
-
-			if (is_imm8(imm32))
-				/* imul dst_reg, dst_reg, imm8 */
-				EMIT3(0x6B, add_2reg(0xC0, dst_reg, dst_reg),
-				      imm32);
-			else
-				/* imul dst_reg, dst_reg, imm32 */
-				EMIT2_off32(0x69,
-					    add_2reg(0xC0, dst_reg, dst_reg),
-					    imm32);
+			emit_alu_helper_3(&prog, insn->code, dst_reg, imm32);
 			break;
 
 		case BPF_ALU | BPF_MUL | BPF_X:
-- 
2.43.5



* [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (7 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-10 20:53   ` Alexei Starovoitov
  2024-10-10 17:56 ` [PATCH bpf-next v4 10/10] selftests/bpf: Add tests for bpf_prog_call() Yonghong Song
  2024-10-15 21:28 ` [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Tejun Heo
  10 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Two functions are added to the kernel:
  - int notrace __bpf_prog_enter_recur_limited(struct bpf_prog *prog)
  - void notrace __bpf_prog_exit_recur_limited(struct bpf_prog *prog)
They are called from bpf progs through jited code.

__bpf_prog_enter_recur_limited() returns 0 if the maximum recursion
level has been reached, in which case the bpf prog returns to its caller
directly. Otherwise, it returns the current recursion level, which the
jit uses to calculate the proper frame pointer for that recursion level.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 94 +++++++++++++++++++++++++++++++++----
 include/linux/bpf.h         |  2 +
 kernel/bpf/trampoline.c     | 16 +++++++
 3 files changed, 104 insertions(+), 8 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 297dd64f4b6a..a763e018e87f 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -501,7 +501,8 @@ static void emit_prologue_tail_call(u8 **pprog, bool is_subprog)
 }
 
 static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
-				enum bpf_priv_stack_mode priv_stack_mode);
+				enum bpf_priv_stack_mode priv_stack_mode,
+				bool is_subprog, u8 *image, u8 *temp);
 
 /*
  * Emit x86-64 prologue code for BPF program.
@@ -510,7 +511,8 @@ static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
  */
 static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog,
 			  bool tail_call_reachable,
-			  enum bpf_priv_stack_mode priv_stack_mode)
+			  enum bpf_priv_stack_mode priv_stack_mode, u8 *image,
+			  u8 *temp)
 {
 	bool ebpf_from_cbpf = bpf_prog_was_classic(bpf_prog);
 	bool is_exception_cb = bpf_prog->aux->exception_cb;
@@ -554,7 +556,7 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, struct bpf_prog *bpf_prog
 	/* X86_TAIL_CALL_OFFSET is here */
 	EMIT_ENDBR();
 
-	emit_priv_frame_ptr(&prog, bpf_prog, priv_stack_mode);
+	emit_priv_frame_ptr(&prog, bpf_prog, priv_stack_mode, is_subprog, image, temp);
 
 	/* sub rsp, rounded_stack_depth */
 	if (stack_depth)
@@ -696,6 +698,15 @@ static void emit_return(u8 **pprog, u8 *ip)
 	*pprog = prog;
 }
 
+static int num_bytes_of_emit_return(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
+		return 5;
+	if (IS_ENABLED(CONFIG_MITIGATION_SLS))
+		return 2;
+	return 1;
+}
+
 #define BPF_TAIL_CALL_CNT_PTR_STACK_OFF(stack)	(-16 - round_up(stack, 8))
 
 /*
@@ -1527,17 +1538,67 @@ static void emit_root_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
 }
 
 static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
-				enum bpf_priv_stack_mode priv_stack_mode)
+				enum bpf_priv_stack_mode priv_stack_mode,
+				bool is_subprog, u8 *image, u8 *temp)
 {
 	u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
 	u8 *prog = *pprog;
 
-	if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
-		emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
-	else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
+	if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
+		int offs;
+		u8 *func;
+
+		if (!bpf_prog->aux->has_prog_call) {
+			emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
+		} else {
+			EMIT1(0x57);		/* push rdi */
+			if (is_subprog) {
+				/* subprog may have up to 5 arguments */
+				EMIT1(0x56);		/* push rsi */
+				EMIT1(0x52);		/* push rdx */
+				EMIT1(0x51);		/* push rcx */
+				EMIT2(0x41, 0x50);	/* push r8 */
+			}
+			emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
+				       (u32) (long) bpf_prog);
+			func = (u8 *)__bpf_prog_enter_recur_limited;
+			offs = prog - temp;
+			offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
+			emit_call(&prog, func, image + offs);
+			if (is_subprog) {
+				EMIT2(0x41, 0x58);	/* pop r8 */
+				EMIT1(0x59);		/* pop rcx */
+				EMIT1(0x5a);		/* pop rdx */
+				EMIT1(0x5e);		/* pop rsi */
+			}
+			EMIT1(0x5f);		/* pop rdi */
+
+			EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
+			EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
+
+			/* return if stack recursion has been reached */
+			EMIT1(0xC9);    /* leave */
+			emit_return(&prog, image + (prog - temp));
+
+			/* cnt -= 1 */
+			emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
+					  BPF_REG_0, 1);
+
+			/* accum_stack_depth = cnt * subtree_stack_depth */
+			emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
+					  bpf_prog->aux->subtree_stack_depth);
+
+			emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
+
+			/* r9 += accum_stack_depth */
+			emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
+					  BPF_REG_0);
+		}
+	} else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth) {
 		/* r9 += orig_stack_depth */
 		emit_alu_helper_1(&prog, BPF_ALU64 | BPF_ADD | BPF_K, X86_REG_R9,
 				  orig_stack_depth);
+	}
 
 	*pprog = prog;
 }
@@ -1578,7 +1639,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
 	detect_reg_usage(insn, insn_cnt, callee_regs_used);
 
 	emit_prologue(&prog, stack_depth, bpf_prog, tail_call_reachable,
-		      priv_stack_mode);
+		      priv_stack_mode, image, temp);
 	/* Exception callback will clobber callee regs for its own use, and
 	 * restore the original callee regs from main prog's stack frame.
 	 */
@@ -2519,6 +2580,23 @@ st:			if (is_imm8(insn->off))
 				if (arena_vm_start)
 					pop_r12(&prog);
 			}
+
+			if (bpf_prog->aux->has_prog_call) {
+				u8 *func, *ip;
+				int offs;
+
+				ip = image + addrs[i - 1];
+				/* save and restore the return value */
+				EMIT1(0x50);    /* push rax */
+				emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
+					       (u32) (long) bpf_prog);
+				func = (u8 *)__bpf_prog_exit_recur_limited;
+				offs = prog - temp;
+				offs += x86_call_depth_emit_accounting(&prog, func, ip + offs);
+				emit_call(&prog, func, ip + offs);
+				EMIT1(0x58);    /* pop rax */
+			}
+
 			EMIT1(0xC9);         /* leave */
 			emit_return(&prog, image + addrs[i - 1] + (prog - temp));
 			break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 952cb398eb30..605004cba9f7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1148,6 +1148,8 @@ u64 notrace __bpf_prog_enter_sleepable_recur(struct bpf_prog *prog,
 					     struct bpf_tramp_run_ctx *run_ctx);
 void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
 					     struct bpf_tramp_run_ctx *run_ctx);
+int notrace __bpf_prog_enter_recur_limited(struct bpf_prog *prog);
+void notrace __bpf_prog_exit_recur_limited(struct bpf_prog *prog);
 void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
 void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
 typedef u64 (*bpf_trampoline_enter_t)(struct bpf_prog *prog,
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index f8302a5ca400..d9e7260e4b39 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -960,6 +960,22 @@ void notrace __bpf_prog_exit_sleepable_recur(struct bpf_prog *prog, u64 start,
 	rcu_read_unlock_trace();
 }
 
+int notrace __bpf_prog_enter_recur_limited(struct bpf_prog *prog)
+{
+	int cnt = this_cpu_inc_return(*(prog->active));
+
+	if (cnt > BPF_MAX_PRIV_STACK_NEST_LEVEL) {
+		bpf_prog_inc_misses_counter(prog);
+		return 0;
+	}
+	return cnt;
+}
+
+void notrace __bpf_prog_exit_recur_limited(struct bpf_prog *prog)
+{
+	this_cpu_dec(*(prog->active));
+}
+
 static u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog,
 					      struct bpf_tramp_run_ctx *run_ctx)
 {
-- 
2.43.5



* [PATCH bpf-next v4 10/10] selftests/bpf: Add tests for bpf_prog_call()
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (8 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call Yonghong Song
@ 2024-10-10 17:56 ` Yonghong Song
  2024-10-15 21:28 ` [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Tejun Heo
  10 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-10 17:56 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau, Tejun Heo

Add two subtests for nested bpf_prog_call(): one exercises recursion in the
main prog, and the other recursion in a callback func.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 .../selftests/bpf/prog_tests/prog_call.c      | 78 ++++++++++++++++
 tools/testing/selftests/bpf/progs/prog_call.c | 92 +++++++++++++++++++
 2 files changed, 170 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/prog_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/prog_call.c

diff --git a/tools/testing/selftests/bpf/prog_tests/prog_call.c b/tools/testing/selftests/bpf/prog_tests/prog_call.c
new file mode 100644
index 000000000000..573c67c9af12
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/prog_call.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "prog_call.skel.h"
+
+static void test_nest_prog_call(int prog_index)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		.data_in = &pkt_v4,
+		.data_size_in = sizeof(pkt_v4),
+	);
+	int err, idx = 0, prog_fd, map_fd;
+	struct prog_call *skel;
+	struct bpf_program *prog;
+
+	skel = prog_call__open();
+	if (!ASSERT_OK_PTR(skel, "prog_call__open"))
+		return;
+
+	switch (prog_index) {
+	case 0:
+		prog = skel->progs.entry_no_subprog;
+		break;
+	case 1:
+		prog = skel->progs.entry_subprog;
+		break;
+	case 2:
+		prog = skel->progs.entry_callback;
+		break;
+	}
+
+	bpf_program__set_autoload(prog, true);
+
+	err = prog_call__load(skel);
+	if (!ASSERT_OK(err, "prog_call__load"))
+		return;
+
+	map_fd = bpf_map__fd(skel->maps.jmp_table);
+	prog_fd = bpf_program__fd(prog);
+	/* maximum recursion level 4 */
+	err = bpf_map_update_elem(map_fd, &idx, &prog_fd, 0);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto out;
+
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(skel->bss->vali, 4, "i");
+	ASSERT_EQ(skel->bss->valj, 6, "j");
+out:
+	prog_call__destroy(skel);
+}
+
+static void test_prog_call_with_tailcall(void)
+{
+	struct prog_call *skel;
+	int err;
+
+	skel = prog_call__open();
+	if (!ASSERT_OK_PTR(skel, "prog_call__open"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.entry_tail_call, true);
+	err = prog_call__load(skel);
+	if (!ASSERT_ERR(err, "prog_call__load"))
+		prog_call__destroy(skel);
+}
+
+void test_prog_call(void)
+{
+	if (test__start_subtest("single_main_prog"))
+		test_nest_prog_call(0);
+	if (test__start_subtest("sub_prog"))
+		test_nest_prog_call(1);
+	if (test__start_subtest("callback_fn"))
+		test_nest_prog_call(2);
+	if (test__start_subtest("with_tailcall"))
+		test_prog_call_with_tailcall();
+}
diff --git a/tools/testing/selftests/bpf/progs/prog_call.c b/tools/testing/selftests/bpf/progs/prog_call.c
new file mode 100644
index 000000000000..c494cfcf653b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/prog_call.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 3);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+} jmp_table SEC(".maps");
+
+struct callback_ctx {
+	struct __sk_buff *skb;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} arraymap SEC(".maps");
+
+int vali, valj;
+
+int glb;
+__noinline static void subprog2(volatile int *a)
+{
+	glb = a[20] + a[10];
+}
+
+__noinline static void subprog1(struct __sk_buff *skb)
+{
+	volatile int a[100] = {};
+
+	a[10] = vali;
+	subprog2(a);
+	vali++;
+	bpf_prog_call(skb, (struct bpf_map *)&jmp_table, 0);
+	valj += a[10];
+}
+
+SEC("?tc")
+int entry_no_subprog(struct __sk_buff *skb)
+{
+	volatile int a[100] = {};
+
+	a[10] = vali;
+	subprog2(a);
+	vali++;
+	bpf_prog_call(skb, (struct bpf_map *)&jmp_table, 0);
+	valj += a[10];
+	return 0;
+}
+
+SEC("?tc")
+int entry_subprog(struct __sk_buff *skb)
+{
+	subprog1(skb);
+	return 0;
+}
+
+static __u64
+check_array_elem(struct bpf_map *map, __u32 *key, __u64 *val,
+		 struct callback_ctx *data)
+{
+	subprog1(data->skb);
+	return 0;
+}
+
+SEC("?tc")
+int entry_callback(struct __sk_buff *skb)
+{
+	struct callback_ctx data;
+
+	data.skb = skb;
+	bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
+	return 0;
+}
+
+SEC("?tc")
+int entry_tail_call(struct __sk_buff *skb)
+{
+	struct callback_ctx data;
+
+	bpf_tail_call_static(skb, &jmp_table, 0);
+
+	data.skb = skb;
+	bpf_for_each_map_elem(&arraymap, check_array_elem, &data, 0);
+	return 0;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations
  2024-10-10 17:56 ` [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations Yonghong Song
@ 2024-10-10 20:21   ` Alexei Starovoitov
  2024-10-11  4:16     ` Yonghong Song
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-10 20:21 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo

On Thu, Oct 10, 2024 at 10:56 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> Two helpers are extracted from bpf/x86 jit:
>   - a helper to handle 'reg1 <op>= reg2' where <op> is add/sub/and/or/xor
>   - a helper to handle 'reg *= imm'
>
> Both helpers will be used in the subsequent patch.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>  arch/x86/net/bpf_jit_comp.c | 51 ++++++++++++++++++++++++-------------
>  1 file changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index a6ba85cec49a..297dd64f4b6a 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -1475,6 +1475,37 @@ static void emit_alu_helper_1(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
>         *pprog = prog;
>  }
>
> +/* emit ADD/SUB/AND/OR/XOR 'reg1 <op>= reg2' operations */
> +static void emit_alu_helper_2(u8 **pprog, u8 insn_code, u32 dst_reg, u32 src_reg)
> +{
> +       u8 b2 = 0;
> +       u8 *prog = *pprog;
> +
> +       maybe_emit_mod(&prog, dst_reg, src_reg,
> +                      BPF_CLASS(insn_code) == BPF_ALU64);
> +       b2 = simple_alu_opcodes[BPF_OP(insn_code)];
> +       EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
> +
> +       *pprog = prog;
> +}
> +
> +/* emit 'reg *= imm' operations */
> +static void emit_alu_helper_3(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)

_1, _2, _3 ?!

There must be a better way to name the helpers. Like:

_1 -> emit_alu_imm
_2 -> emit_alu_reg
_3 -> emit_mul_imm

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog
  2024-10-10 17:56 ` [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog Yonghong Song
@ 2024-10-10 20:28   ` Alexei Starovoitov
  2024-10-11  4:12     ` Yonghong Song
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-10 20:28 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo

On Thu, Oct 10, 2024 at 10:56 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> A kfunc bpf_prog_call() is introduced such that it can call another bpf
> prog within a bpf prog. It has the same parameters as bpf_tail_call()
> but acts like a normal function call.
>
> But bpf_prog_call() could recurse to the caller prog itself. So if a bpf
> prog calls bpf_prog_call(), that bpf prog will use private stacks with
> maximum recursion level 4. The 4 level recursion should work for most
> cases.
>
> bpf_prog_call() cannot be used if tail_call exists in the same prog
> since tail_call does not use private stack. If both prog_call and
> tail_call in the same prog, verification will fail.

..

> +__bpf_kfunc int bpf_prog_call(void *ctx, struct bpf_map *p__map, u32 index)
> +{
> +       struct bpf_array *array;
> +       struct bpf_prog *prog;
> +
> +       if (p__map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
> +               return -EINVAL;
> +
> +       array = container_of(p__map, struct bpf_array, map);
> +       if (unlikely(index >= array->map.max_entries))
> +               return -E2BIG;
> +
> +       prog = READ_ONCE(array->ptrs[index]);
> +       if (!prog)
> +               return -ENOENT;
> +
> +       return bpf_prog_run(prog, ctx);
> +}

bpf_tail_call() was a hack during the early days,
since I didn't know any better :(
I really don't want to use that as a pattern.
prog life time rules, tail call cnt, prog_array_compatible, etc.
caused plenty of pain. Don't want to see a repeat.

Progs that need to call another prog can use freplace mechanism already.
There is no need for bpf_prog_call.

Let's get priv_stack in shape first (the first ~6 patches).

pw-bot: cr

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-10 17:56 ` [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call Yonghong Song
@ 2024-10-10 20:53   ` Alexei Starovoitov
  2024-10-11  4:20     ` Yonghong Song
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-10 20:53 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo

On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>  static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
> -                               enum bpf_priv_stack_mode priv_stack_mode)
> +                               enum bpf_priv_stack_mode priv_stack_mode,
> +                               bool is_subprog, u8 *image, u8 *temp)
>  {
>         u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
>         u8 *prog = *pprog;
>
> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
> +               int offs;
> +               u8 *func;
> +
> +               if (!bpf_prog->aux->has_prog_call) {
> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> +               } else {
> +                       EMIT1(0x57);            /* push rdi */
> +                       if (is_subprog) {
> +                               /* subprog may have up to 5 arguments */
> +                               EMIT1(0x56);            /* push rsi */
> +                               EMIT1(0x52);            /* push rdx */
> +                               EMIT1(0x51);            /* push rcx */
> +                               EMIT2(0x41, 0x50);      /* push r8 */
> +                       }
> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
> +                                      (u32) (long) bpf_prog);
> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
> +                       offs = prog - temp;
> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
> +                       emit_call(&prog, func, image + offs);
> +                       if (is_subprog) {
> +                               EMIT2(0x41, 0x58);      /* pop r8 */
> +                               EMIT1(0x59);            /* pop rcx */
> +                               EMIT1(0x5a);            /* pop rdx */
> +                               EMIT1(0x5e);            /* pop rsi */
> +                       }
> +                       EMIT1(0x5f);            /* pop rdi */
> +
> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
> +
> +                       /* return if stack recursion has been reached */
> +                       EMIT1(0xC9);    /* leave */
> +                       emit_return(&prog, image + (prog - temp));
> +
> +                       /* cnt -= 1 */
> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
> +                                         BPF_REG_0, 1);
> +
> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
> +                                         bpf_prog->aux->subtree_stack_depth);
> +
> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> +
> +                       /* r9 += accum_stack_depth */
> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
> +                                         BPF_REG_0);

That's way too much asm for logic that can stay in C.

bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
for appropriate prog_type/attach_type/etc.

JITs don't need to change.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog
  2024-10-10 20:28   ` Alexei Starovoitov
@ 2024-10-11  4:12     ` Yonghong Song
  2024-10-15 21:18       ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-11  4:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo


On 10/10/24 1:28 PM, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 10:56 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> A kfunc bpf_prog_call() is introduced such that it can call another bpf
>> prog within a bpf prog. It has the same parameters as bpf_tail_call()
>> but acts like a normal function call.
>>
>> But bpf_prog_call() could recurse to the caller prog itself. So if a bpf
>> prog calls bpf_prog_call(), that bpf prog will use private stacks with
>> maximum recursion level 4. The 4 level recursion should work for most
>> cases.
>>
>> bpf_prog_call() cannot be used if tail_call exists in the same prog
>> since tail_call does not use private stack. If both prog_call and
>> tail_call in the same prog, verification will fail.
> ..
>
>> +__bpf_kfunc int bpf_prog_call(void *ctx, struct bpf_map *p__map, u32 index)
>> +{
>> +       struct bpf_array *array;
>> +       struct bpf_prog *prog;
>> +
>> +       if (p__map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
>> +               return -EINVAL;
>> +
>> +       array = container_of(p__map, struct bpf_array, map);
>> +       if (unlikely(index >= array->map.max_entries))
>> +               return -E2BIG;
>> +
>> +       prog = READ_ONCE(array->ptrs[index]);
>> +       if (!prog)
>> +               return -ENOENT;
>> +
>> +       return bpf_prog_run(prog, ctx);
>> +}
> bpf_tail_call() was a hack during the early days,
> since I didn't know any better :(
> I really don't want to use that as a pattern.
> prog life time rules, tail call cnt, prog_array_compatible, etc.
> caused plenty of pain. Don't want to see a repeat.
>
> Progs that need to call another prog can use freplace mechanism already.
> There is no need for bpf_prog_call.

In this case, the prog could call itself, which freplace cannot express.

>
> Let's get priv_stack in shape first (the first ~6 patches).

I am okay with focusing on the first 6 patches. But I would like to get
Tejun's comments on the best way to support a hierarchical bpf-based
scheduler.

>
> pw-bot: cr

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations
  2024-10-10 20:21   ` Alexei Starovoitov
@ 2024-10-11  4:16     ` Yonghong Song
  0 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-11  4:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo


On 10/10/24 1:21 PM, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 10:56 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>> Two helpers are extracted from bpf/x86 jit:
>>    - a helper to handle 'reg1 <op>= reg2' where <op> is add/sub/and/or/xor
>>    - a helper to handle 'reg *= imm'
>>
>> Both helpers will be used in the subsequent patch.
>>
>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   arch/x86/net/bpf_jit_comp.c | 51 ++++++++++++++++++++++++-------------
>>   1 file changed, 34 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index a6ba85cec49a..297dd64f4b6a 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -1475,6 +1475,37 @@ static void emit_alu_helper_1(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
>>          *pprog = prog;
>>   }
>>
>> +/* emit ADD/SUB/AND/OR/XOR 'reg1 <op>= reg2' operations */
>> +static void emit_alu_helper_2(u8 **pprog, u8 insn_code, u32 dst_reg, u32 src_reg)
>> +{
>> +       u8 b2 = 0;
>> +       u8 *prog = *pprog;
>> +
>> +       maybe_emit_mod(&prog, dst_reg, src_reg,
>> +                      BPF_CLASS(insn_code) == BPF_ALU64);
>> +       b2 = simple_alu_opcodes[BPF_OP(insn_code)];
>> +       EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
>> +
>> +       *pprog = prog;
>> +}
>> +
>> +/* emit 'reg *= imm' operations */
>> +static void emit_alu_helper_3(u8 **pprog, u8 insn_code, u32 dst_reg, s32 imm32)
> _1, _2, _3 ?!
>
> There must be a better way to name the helpers. Like:
>
> _1 -> emit_alu_imm
> _2 -> emit_alu_reg
> _3 -> emit_mul_imm

I struggled to come up with proper names here. I originally thought about
using emit_alu_reg_imm and emit_alu_reg_reg, but emit_alu_reg_imm only
supports add/sub/and/or/xor, not mul/div/mod, so it does not really cover
all alu operations; that is why I fell back to the numeric suffixes, which
are not good either.

I will use the names you suggested above, which cover most alu operations.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-10 20:53   ` Alexei Starovoitov
@ 2024-10-11  4:20     ` Yonghong Song
  2024-10-11  4:29       ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-11  4:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo


On 10/10/24 1:53 PM, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>   static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
>> -                               enum bpf_priv_stack_mode priv_stack_mode)
>> +                               enum bpf_priv_stack_mode priv_stack_mode,
>> +                               bool is_subprog, u8 *image, u8 *temp)
>>   {
>>          u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
>>          u8 *prog = *pprog;
>>
>> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
>> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
>> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
>> +               int offs;
>> +               u8 *func;
>> +
>> +               if (!bpf_prog->aux->has_prog_call) {
>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>> +               } else {
>> +                       EMIT1(0x57);            /* push rdi */
>> +                       if (is_subprog) {
>> +                               /* subprog may have up to 5 arguments */
>> +                               EMIT1(0x56);            /* push rsi */
>> +                               EMIT1(0x52);            /* push rdx */
>> +                               EMIT1(0x51);            /* push rcx */
>> +                               EMIT2(0x41, 0x50);      /* push r8 */
>> +                       }
>> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
>> +                                      (u32) (long) bpf_prog);
>> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
>> +                       offs = prog - temp;
>> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
>> +                       emit_call(&prog, func, image + offs);
>> +                       if (is_subprog) {
>> +                               EMIT2(0x41, 0x58);      /* pop r8 */
>> +                               EMIT1(0x59);            /* pop rcx */
>> +                               EMIT1(0x5a);            /* pop rdx */
>> +                               EMIT1(0x5e);            /* pop rsi */
>> +                       }
>> +                       EMIT1(0x5f);            /* pop rdi */
>> +
>> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
>> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
>> +
>> +                       /* return if stack recursion has been reached */
>> +                       EMIT1(0xC9);    /* leave */
>> +                       emit_return(&prog, image + (prog - temp));
>> +
>> +                       /* cnt -= 1 */
>> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
>> +                                         BPF_REG_0, 1);
>> +
>> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
>> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
>> +                                         bpf_prog->aux->subtree_stack_depth);
>> +
>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>> +
>> +                       /* r9 += accum_stack_depth */
>> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
>> +                                         BPF_REG_0);
> That's way too much asm for logic that can stay in C.
>
> bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
> for appropriate prog_type/attach_type/etc.

The above jit code is not just for the main prog but also for callback fn's,
since a callback fn could call a bpf prog as well. So putting it in the bpf
trampoline alone is not enough.

But I can improve the above by moving most of the logic
    cnt -= 1; accum_stack_depth = cnt * subtree_stack_depth; r9 += accum_stack_depth
into __bpf_prog_enter_recur_limited().

>
> JITs don't need to change.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-11  4:20     ` Yonghong Song
@ 2024-10-11  4:29       ` Alexei Starovoitov
  2024-10-11 15:38         ` Yonghong Song
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-11  4:29 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo

On Thu, Oct 10, 2024 at 9:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 10/10/24 1:53 PM, Alexei Starovoitov wrote:
> > On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>   static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
> >> -                               enum bpf_priv_stack_mode priv_stack_mode)
> >> +                               enum bpf_priv_stack_mode priv_stack_mode,
> >> +                               bool is_subprog, u8 *image, u8 *temp)
> >>   {
> >>          u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
> >>          u8 *prog = *pprog;
> >>
> >> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
> >> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
> >> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
> >> +               int offs;
> >> +               u8 *func;
> >> +
> >> +               if (!bpf_prog->aux->has_prog_call) {
> >> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >> +               } else {
> >> +                       EMIT1(0x57);            /* push rdi */
> >> +                       if (is_subprog) {
> >> +                               /* subprog may have up to 5 arguments */
> >> +                               EMIT1(0x56);            /* push rsi */
> >> +                               EMIT1(0x52);            /* push rdx */
> >> +                               EMIT1(0x51);            /* push rcx */
> >> +                               EMIT2(0x41, 0x50);      /* push r8 */
> >> +                       }
> >> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
> >> +                                      (u32) (long) bpf_prog);
> >> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
> >> +                       offs = prog - temp;
> >> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
> >> +                       emit_call(&prog, func, image + offs);
> >> +                       if (is_subprog) {
> >> +                               EMIT2(0x41, 0x58);      /* pop r8 */
> >> +                               EMIT1(0x59);            /* pop rcx */
> >> +                               EMIT1(0x5a);            /* pop rdx */
> >> +                               EMIT1(0x5e);            /* pop rsi */
> >> +                       }
> >> +                       EMIT1(0x5f);            /* pop rdi */
> >> +
> >> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
> >> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
> >> +
> >> +                       /* return if stack recursion has been reached */
> >> +                       EMIT1(0xC9);    /* leave */
> >> +                       emit_return(&prog, image + (prog - temp));
> >> +
> >> +                       /* cnt -= 1 */
> >> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
> >> +                                         BPF_REG_0, 1);
> >> +
> >> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
> >> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
> >> +                                         bpf_prog->aux->subtree_stack_depth);
> >> +
> >> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >> +
> >> +                       /* r9 += accum_stack_depth */
> >> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
> >> +                                         BPF_REG_0);
> > That's way too much asm for logic that can stay in C.
> >
> > bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
> > for appropriate prog_type/attach_type/etc.
>
> The above jit code not just for the main prog, but also for callback fn's
> since callback fn could call bpf prog as well. So putting in bpf trampoline
> not enough.

A callback can call the prog only if the bpf_prog_call() kfunc exists,
and that's one more reason to avoid going in that direction.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-11  4:29       ` Alexei Starovoitov
@ 2024-10-11 15:38         ` Yonghong Song
  2024-10-11 15:40           ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Yonghong Song @ 2024-10-11 15:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo


On 10/10/24 9:29 PM, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 9:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/10/24 1:53 PM, Alexei Starovoitov wrote:
>>> On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>    static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
>>>> -                               enum bpf_priv_stack_mode priv_stack_mode)
>>>> +                               enum bpf_priv_stack_mode priv_stack_mode,
>>>> +                               bool is_subprog, u8 *image, u8 *temp)
>>>>    {
>>>>           u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
>>>>           u8 *prog = *pprog;
>>>>
>>>> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
>>>> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
>>>> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
>>>> +               int offs;
>>>> +               u8 *func;
>>>> +
>>>> +               if (!bpf_prog->aux->has_prog_call) {
>>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>> +               } else {
>>>> +                       EMIT1(0x57);            /* push rdi */
>>>> +                       if (is_subprog) {
>>>> +                               /* subprog may have up to 5 arguments */
>>>> +                               EMIT1(0x56);            /* push rsi */
>>>> +                               EMIT1(0x52);            /* push rdx */
>>>> +                               EMIT1(0x51);            /* push rcx */
>>>> +                               EMIT2(0x41, 0x50);      /* push r8 */
>>>> +                       }
>>>> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
>>>> +                                      (u32) (long) bpf_prog);
>>>> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
>>>> +                       offs = prog - temp;
>>>> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
>>>> +                       emit_call(&prog, func, image + offs);
>>>> +                       if (is_subprog) {
>>>> +                               EMIT2(0x41, 0x58);      /* pop r8 */
>>>> +                               EMIT1(0x59);            /* pop rcx */
>>>> +                               EMIT1(0x5a);            /* pop rdx */
>>>> +                               EMIT1(0x5e);            /* pop rsi */
>>>> +                       }
>>>> +                       EMIT1(0x5f);            /* pop rdi */
>>>> +
>>>> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
>>>> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
>>>> +
>>>> +                       /* return if stack recursion has been reached */
>>>> +                       EMIT1(0xC9);    /* leave */
>>>> +                       emit_return(&prog, image + (prog - temp));
>>>> +
>>>> +                       /* cnt -= 1 */
>>>> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
>>>> +                                         BPF_REG_0, 1);
>>>> +
>>>> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
>>>> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
>>>> +                                         bpf_prog->aux->subtree_stack_depth);
>>>> +
>>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>> +
>>>> +                       /* r9 += accum_stack_depth */
>>>> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
>>>> +                                         BPF_REG_0);
>>> That's way too much asm for logic that can stay in C.
>>>
>>> bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
>>> for appropriate prog_type/attach_type/etc.
>> The above jit code is not just for the main prog, but also for callback fn's,
>> since a callback fn could call a bpf prog as well. So putting it in the bpf trampoline
>> is not enough.
> callback can call the prog only if bpf_call_prog() kfunc exists
> and that's one more reason to avoid going that direction.

Okay, I will add a verifier check to prevent bpf_call_prog() in callback functions.
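For readers tracing the emitted sequence quoted above: the enter helper returns the current nesting count (0 once the limit is reached), and the private frame pointer is offset by (cnt - 1) * subtree_stack_depth. Below is a minimal user-space C sketch of that logic (single CPU only; the counter handling, names, and the limit value are assumptions, not the kernel code) — roughly the shape the logic would take if kept in C as suggested:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NEST 4		/* assumed recursion limit */
#define SUBTREE_STACK 512	/* per-level private stack budget */

/* Hypothetical per-prog nesting counter (one CPU shown; the real
 * thing would be per-cpu). */
static int active_cnt;

/* Sketch of what an enter helper like __bpf_prog_enter_recur_limited()
 * is expected to do: bump the nesting counter and return it, or
 * return 0 when the limit is hit. */
static int enter_recur_limited(void)
{
	if (active_cnt >= MAX_NEST)
		return 0;	/* budget exhausted: caller must bail out */
	return ++active_cnt;	/* 1 for the outermost invocation */
}

static void exit_recur(void)
{
	active_cnt--;
}

static uint8_t priv_stack[MAX_NEST * SUBTREE_STACK];

/* Mirror of the emitted sequence: derive this invocation's private
 * frame pointer from the nesting depth. */
static uint8_t *priv_frame_ptr(void)
{
	int cnt = enter_recur_limited();

	if (!cnt)
		return NULL;	/* corresponds to the early "leave; ret" */
	/* accum_stack_depth = (cnt - 1) * subtree_stack_depth */
	return priv_stack + (size_t)(cnt - 1) * SUBTREE_STACK;
}
```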


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-11 15:38         ` Yonghong Song
@ 2024-10-11 15:40           ` Alexei Starovoitov
  2024-10-11 16:14             ` Yonghong Song
  0 siblings, 1 reply; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-11 15:40 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo

On Fri, Oct 11, 2024 at 8:39 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 10/10/24 9:29 PM, Alexei Starovoitov wrote:
> > On Thu, Oct 10, 2024 at 9:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>
> >> On 10/10/24 1:53 PM, Alexei Starovoitov wrote:
> >>> On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >>>>    static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
> >>>> -                               enum bpf_priv_stack_mode priv_stack_mode)
> >>>> +                               enum bpf_priv_stack_mode priv_stack_mode,
> >>>> +                               bool is_subprog, u8 *image, u8 *temp)
> >>>>    {
> >>>>           u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
> >>>>           u8 *prog = *pprog;
> >>>>
> >>>> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
> >>>> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >>>> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
> >>>> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
> >>>> +               int offs;
> >>>> +               u8 *func;
> >>>> +
> >>>> +               if (!bpf_prog->aux->has_prog_call) {
> >>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >>>> +               } else {
> >>>> +                       EMIT1(0x57);            /* push rdi */
> >>>> +                       if (is_subprog) {
> >>>> +                               /* subprog may have up to 5 arguments */
> >>>> +                               EMIT1(0x56);            /* push rsi */
> >>>> +                               EMIT1(0x52);            /* push rdx */
> >>>> +                               EMIT1(0x51);            /* push rcx */
> >>>> +                               EMIT2(0x41, 0x50);      /* push r8 */
> >>>> +                       }
> >>>> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
> >>>> +                                      (u32) (long) bpf_prog);
> >>>> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
> >>>> +                       offs = prog - temp;
> >>>> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
> >>>> +                       emit_call(&prog, func, image + offs);
> >>>> +                       if (is_subprog) {
> >>>> +                               EMIT2(0x41, 0x58);      /* pop r8 */
> >>>> +                               EMIT1(0x59);            /* pop rcx */
> >>>> +                               EMIT1(0x5a);            /* pop rdx */
> >>>> +                               EMIT1(0x5e);            /* pop rsi */
> >>>> +                       }
> >>>> +                       EMIT1(0x5f);            /* pop rdi */
> >>>> +
> >>>> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
> >>>> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
> >>>> +
> >>>> +                       /* return if stack recursion has been reached */
> >>>> +                       EMIT1(0xC9);    /* leave */
> >>>> +                       emit_return(&prog, image + (prog - temp));
> >>>> +
> >>>> +                       /* cnt -= 1 */
> >>>> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
> >>>> +                                         BPF_REG_0, 1);
> >>>> +
> >>>> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
> >>>> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
> >>>> +                                         bpf_prog->aux->subtree_stack_depth);
> >>>> +
> >>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
> >>>> +
> >>>> +                       /* r9 += accum_stack_depth */
> >>>> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
> >>>> +                                         BPF_REG_0);
> >>> That's way too much asm for logic that can stay in C.
> >>>
> >>> bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
> >>> for appropriate prog_type/attach_type/etc.
> >> The above jit code is not just for the main prog, but also for callback fn's,
> >> since a callback fn could call a bpf prog as well. So putting it in the bpf trampoline
> >> is not enough.
> > callback can call the prog only if bpf_call_prog() kfunc exists
> > and that's one more reason to avoid going that direction.
>
> Okay, I will add a verifier check to prevent bpf_call_prog() in callback functions.

We're talking past each other.
It's a nack to introduce bpf_call_prog kfunc.


* Re: [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call
  2024-10-11 15:40           ` Alexei Starovoitov
@ 2024-10-11 16:14             ` Yonghong Song
  0 siblings, 0 replies; 25+ messages in thread
From: Yonghong Song @ 2024-10-11 16:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Kernel Team, Martin KaFai Lau, Tejun Heo


On 10/11/24 8:40 AM, Alexei Starovoitov wrote:
> On Fri, Oct 11, 2024 at 8:39 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 10/10/24 9:29 PM, Alexei Starovoitov wrote:
>>> On Thu, Oct 10, 2024 at 9:21 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> On 10/10/24 1:53 PM, Alexei Starovoitov wrote:
>>>>> On Thu, Oct 10, 2024 at 10:59 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>>>     static void emit_priv_frame_ptr(u8 **pprog, struct bpf_prog *bpf_prog,
>>>>>> -                               enum bpf_priv_stack_mode priv_stack_mode)
>>>>>> +                               enum bpf_priv_stack_mode priv_stack_mode,
>>>>>> +                               bool is_subprog, u8 *image, u8 *temp)
>>>>>>     {
>>>>>>            u32 orig_stack_depth = round_up(bpf_prog->aux->stack_depth, 8);
>>>>>>            u8 *prog = *pprog;
>>>>>>
>>>>>> -       if (priv_stack_mode == PRIV_STACK_ROOT_PROG)
>>>>>> -               emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>>>> -       else if (priv_stack_mode == PRIV_STACK_SUB_PROG && orig_stack_depth)
>>>>>> +       if (priv_stack_mode == PRIV_STACK_ROOT_PROG) {
>>>>>> +               int offs;
>>>>>> +               u8 *func;
>>>>>> +
>>>>>> +               if (!bpf_prog->aux->has_prog_call) {
>>>>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>>>> +               } else {
>>>>>> +                       EMIT1(0x57);            /* push rdi */
>>>>>> +                       if (is_subprog) {
>>>>>> +                               /* subprog may have up to 5 arguments */
>>>>>> +                               EMIT1(0x56);            /* push rsi */
>>>>>> +                               EMIT1(0x52);            /* push rdx */
>>>>>> +                               EMIT1(0x51);            /* push rcx */
>>>>>> +                               EMIT2(0x41, 0x50);      /* push r8 */
>>>>>> +                       }
>>>>>> +                       emit_mov_imm64(&prog, BPF_REG_1, (long) bpf_prog >> 32,
>>>>>> +                                      (u32) (long) bpf_prog);
>>>>>> +                       func = (u8 *)__bpf_prog_enter_recur_limited;
>>>>>> +                       offs = prog - temp;
>>>>>> +                       offs += x86_call_depth_emit_accounting(&prog, func, image + offs);
>>>>>> +                       emit_call(&prog, func, image + offs);
>>>>>> +                       if (is_subprog) {
>>>>>> +                               EMIT2(0x41, 0x58);      /* pop r8 */
>>>>>> +                               EMIT1(0x59);            /* pop rcx */
>>>>>> +                               EMIT1(0x5a);            /* pop rdx */
>>>>>> +                               EMIT1(0x5e);            /* pop rsi */
>>>>>> +                       }
>>>>>> +                       EMIT1(0x5f);            /* pop rdi */
>>>>>> +
>>>>>> +                       EMIT4(0x48, 0x83, 0xf8, 0x0);   /* cmp rax,0x0 */
>>>>>> +                       EMIT2(X86_JNE, num_bytes_of_emit_return() + 1);
>>>>>> +
>>>>>> +                       /* return if stack recursion has been reached */
>>>>>> +                       EMIT1(0xC9);    /* leave */
>>>>>> +                       emit_return(&prog, image + (prog - temp));
>>>>>> +
>>>>>> +                       /* cnt -= 1 */
>>>>>> +                       emit_alu_helper_1(&prog, BPF_ALU64 | BPF_SUB | BPF_K,
>>>>>> +                                         BPF_REG_0, 1);
>>>>>> +
>>>>>> +                       /* accum_stack_depth = cnt * subtree_stack_depth */
>>>>>> +                       emit_alu_helper_3(&prog, BPF_ALU64 | BPF_MUL | BPF_K, BPF_REG_0,
>>>>>> +                                         bpf_prog->aux->subtree_stack_depth);
>>>>>> +
>>>>>> +                       emit_root_priv_frame_ptr(&prog, bpf_prog, orig_stack_depth);
>>>>>> +
>>>>>> +                       /* r9 += accum_stack_depth */
>>>>>> +                       emit_alu_helper_2(&prog, BPF_ALU64 | BPF_ADD | BPF_X, X86_REG_R9,
>>>>>> +                                         BPF_REG_0);
>>>>> That's way too much asm for logic that can stay in C.
>>>>>
>>>>> bpf_trampoline_enter() should select __bpf_prog_enter_recur_limited()
>>>>> for appropriate prog_type/attach_type/etc.
>>>> The above jit code is not just for the main prog, but also for callback fn's,
>>>> since a callback fn could call a bpf prog as well. So putting it in the bpf trampoline
>>>> is not enough.
>>> callback can call the prog only if bpf_call_prog() kfunc exists
>>> and that's one more reason to avoid going that direction.
>> Okay, I will add a verifier check to prevent bpf_call_prog() in callback functions.
> We're talking past each other.
> It's a nack to introduce bpf_call_prog kfunc.

Okay. Will remove it in the next revision.



* Re: [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog
  2024-10-11  4:12     ` Yonghong Song
@ 2024-10-15 21:18       ` Tejun Heo
  2024-10-15 21:35         ` Alexei Starovoitov
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2024-10-15 21:18 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Martin KaFai Lau

Hello,

On Thu, Oct 10, 2024 at 09:12:19PM -0700, Yonghong Song wrote:
> > Let's get priv_stack in shape first (the first ~6 patches).
> 
> I am okay to focus on the first 6 patches. But I would like to get
> Tejun's comments about what is the best way to support hierarchical
> bpf based scheduler.

There isn't a concrete design yet, so it's difficult to say anything
definitive, but I was thinking more along the lines of providing sched_ext
kfunc helpers that perform nesting calls rather than each BPF program
directly calling nested BPF programs.

For example, let's say the scheduler hierarchy looks like this:

  R + A + AA
    |   + AB
    + B

Let's say AB has a task waking up to it and is calling ops.select_cpu():

 ops.select_cpu()
 {
	if (does AB already have the perfect CPU sitting around)
		direct dispatch and return the CPU;
	if (scx_bpf_get_cpus(describe the perfect CPU))
		direct dispatch and return the CPU;
	if (is there any eligible idle CPU that AB is holding)
		direct dispatch and return the CPU;
	if (scx_bpf_get_cpus(any eligible CPUs))
		direct dispatch and return the CPU;
	// no idle CPU, proceed to enqueue
	return prev_cpu;
 }

Note that the scheduler at AB doesn't have any knowledge of what's up the
tree. It's just describing what it wants through the kfunc, which is then
responsible for nesting calls up the hierarchy. Up a layer, this can be
implemented like:

 ops.get_cpus(CPUs description)
 {
	if (has any CPUs matching the description)
		claim and return the CPUs;
	modify CPUs description to enforce e.g. cache sharing policy;
	and possibly to request more CPUs for batching;
	if (scx_bpf_get_cpus(CPUs description)) {
		store extra CPUs;
		claim and return some of the CPUs;
	}
	return no CPUs available;
 }

This way, the schedulers at different layers are isolated and each only has
to express what it wants.
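The upward delegation sketched above can be modeled as a toy recursion over the layer hierarchy. This is purely illustrative: the struct, the field names, and the claiming policy are invented here, not sched_ext API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the layered delegation: a layer satisfies a CPU
 * request from its own idle pool, else escalates to its parent. */
struct layer {
	struct layer *parent;
	int idle_cpus;			/* CPUs this layer is holding */
};

static int get_cpus(struct layer *l, int want)
{
	if (l->idle_cpus >= want) {
		l->idle_cpus -= want;	/* claim locally */
		return want;
	}
	/* ask up the hierarchy; a granted request is handed down whole */
	if (l->parent && get_cpus(l->parent, want) == want)
		return want;
	return 0;			/* no CPUs available anywhere up-tree */
}
```

Each layer only expresses what it wants to its parent, matching the isolation property described above.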

Thanks.

-- 
tejun


* Re: [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs
  2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
                   ` (9 preceding siblings ...)
  2024-10-10 17:56 ` [PATCH bpf-next v4 10/10] selftests/bpf: Add tests for bpf_prog_call() Yonghong Song
@ 2024-10-15 21:28 ` Tejun Heo
  2024-10-15 21:39   ` Alexei Starovoitov
  10 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2024-10-15 21:28 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	kernel-team, Martin KaFai Lau

Hello,

On Thu, Oct 10, 2024 at 10:55:52AM -0700, Yonghong Song wrote:
> The main motivation for private stack comes from nested scheduler in
> sched-ext from Tejun. The basic idea is that
>  - each cgroup will have its own associated bpf program,
>  - the bpf program of a parent cgroup will call the bpf programs
>    in its immediate child cgroups.
> 
> Let us say we have the following cgroup hierarchy:
>   root_cg (prog0):
>     cg1 (prog1):
>       cg11 (prog11):
>         cg111 (prog111)
>         cg112 (prog112)
>       cg12 (prog12):
>         cg121 (prog121)
>         cg122 (prog122)
>     cg2 (prog2):
>       cg21 (prog21)
>       cg22 (prog22)
>       cg23 (prog23)

Thank you so much for working on this. I have some basic and a bit
tangential questions around how stacks are allocated. So, for sched_ext,
each scheduler would be represented by struct_ops and I think the interface
to load them would be attaching a struct_ops to a cgroup.

- I suppose each operation in a struct_ops would count as a separate program
  and would thus allocate 512 * nr_cpus stacks, right?

- If the same scheduler implementation is attached to more than one cgroup,
  would each instance be treated as a separate set of programs or would they
  share the stack?

- Most struct_ops operations won't need to be nested and thus wouldn't need
  to use a private stack. Would it be possible to indicate which one should
  use a private stack?

Thanks.

-- 
tejun


* Re: [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog
  2024-10-15 21:18       ` Tejun Heo
@ 2024-10-15 21:35         ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-15 21:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Martin KaFai Lau

On Tue, Oct 15, 2024 at 2:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Oct 10, 2024 at 09:12:19PM -0700, Yonghong Song wrote:
> > > Let's get priv_stack in shape first (the first ~6 patches).
> >
> > I am okay to focus on the first 6 patches. But I would like to get
> > Tejun's comments about what is the best way to support hierarchical
> > bpf based scheduler.
>
> There isn't a concrete design yet, so it's difficult to say anything
> definitive but I was thinking more along the line of providing sched_ext
> kfunc helpers that perform nesting calls rather than each BPF program
> directly calling nested BPF programs.
>
> For example, let's say the scheduler hierarchy looks like this:
>
>   R + A + AA
>     |   + AB
>     + B
>
> Let's say AB has a task waking up to it and is calling ops.select_cpu():
>
>  ops.select_cpu()
>  {
>         if (does AB already have the perfect CPU sitting around)
>                 direct dispatch and return the CPU;
>         if (scx_bpf_get_cpus(describe the perfect CPU))
>                 direct dispatch and return the CPU;
>         if (is there any eligible idle CPU that AB is holding)
>                 direct dispatch and return the CPU;
>         if (scx_bpf_get_cpus(any eligible CPUs))
>                 direct dispatch and return the CPU;
>         // no idle CPU, proceed to enqueue
>         return prev_cpu;
>  }
>
> Note that the scheduler at AB doesn't have any knowledge of what's up the
> tree. It's just describing what it wants through the kfunc, which is then
> responsible for nesting calls up the hierarchy. Up a layer, this can be
> implemented like:
>
>  ops.get_cpus(CPUs description)
>  {
>         if (has any CPUs matching the description)
>                 claim and return the CPUs;
>         modify CPUs description to enforce e.g. cache sharing policy;
>         and possibly to request more CPUs for batching;
>         if (scx_bpf_get_cpus(CPUs description)) {
>                 store extra CPUs;
>                 claim and return some of the CPUs;
>         }
>         return no CPUs available;
>  }
>
> This way, the schedulers at different layers are isolated and each only has
> to express what it wants.

What we've been discussing is something like this:

ops.get_cpus -> bpf prog A -> kfunc

where the kfunc will call one of the struct_ops callbacks,
which may call bpf prog A again, since it's the only one attached
to this get_cpus callback.
So
ops.get_cpus -> bpf prog A -> kfunc -> ops.get_cpus -> bpf prog A.

If kfunc calls a different struct_ops callback it will call
a different bpf prog B and it will have its own private stack.

During struct_ops registration one of bpf_verifier_ops() callbacks
like bpf_scx_check_member (or a new callback) will indicate
back to bpf trampoline that limited recursion for a specific
ops.get_cpus is allowed.
Then bpf trampoline's bpf_trampoline_enter() selector will
pick an entry helper that allows limited recursion.

Currently the bpf trampoline doesn't check recursion for struct_ops progs,
so it needs to be tightened to allow limited recursion
and to let the bpf jit prologue know which part of the priv stack to use.


* Re: [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs
  2024-10-15 21:28 ` [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Tejun Heo
@ 2024-10-15 21:39   ` Alexei Starovoitov
  0 siblings, 0 replies; 25+ messages in thread
From: Alexei Starovoitov @ 2024-10-15 21:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yonghong Song, bpf, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Kernel Team, Martin KaFai Lau

On Tue, Oct 15, 2024 at 2:28 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Oct 10, 2024 at 10:55:52AM -0700, Yonghong Song wrote:
> > The main motivation for private stack comes from nested scheduler in
> > sched-ext from Tejun. The basic idea is that
> >  - each cgroup will have its own associated bpf program,
> >  - the bpf program of a parent cgroup will call the bpf programs
> >    in its immediate child cgroups.
> >
> > Let us say we have the following cgroup hierarchy:
> >   root_cg (prog0):
> >     cg1 (prog1):
> >       cg11 (prog11):
> >         cg111 (prog111)
> >         cg112 (prog112)
> >       cg12 (prog12):
> >         cg121 (prog121)
> >         cg122 (prog122)
> >     cg2 (prog2):
> >       cg21 (prog21)
> >       cg22 (prog22)
> >       cg23 (prog23)
>
> Thank you so much for working on this. I have some basic and a bit
> tangential questions around how stacks are allocated. So, for sched_ext,
> each scheduler would be represented by struct_ops and I think the interface
> to load them would be attaching a struct_ops to a cgroup.
>
> - I suppose each operation in a struct_ops would count as a separate program
>   and would thus allocate 512 * nr_cpus stacks, right?

It's one stack per program.
Its size will be ~512 * nr_cpus * max_allowed_recursion.

We hope max_allowed_recursion == 4 or something small.
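As a ballpark under those assumptions (512 bytes per nesting level, a limit of 4), a 64-CPU machine would spend 128 KiB of private stack per program. A trivial helper to make the arithmetic concrete (the CPU count is only an example):

```c
#include <assert.h>

/* Per-program private stack footprint under the scheme sketched above:
 * one slot of ~512 bytes per CPU per allowed nesting level. */
static unsigned long priv_stack_bytes(unsigned int nr_cpus,
				      unsigned int max_recursion)
{
	return 512UL * nr_cpus * max_recursion;
}
```

e.g. priv_stack_bytes(64, 4) == 131072, i.e. 128 KiB per program.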

> - If the same scheduler implementation is attached to more than one cgroup,
>   would each instance be treated as a separate set of programs or would they
>   share the stack?

I think there is only one sched_ext struct_ops with
its set of progs. They are global and not "attached to a cgroup".

> - Most struct_ops operations won't need to be nested and thus wouldn't need
>   to use a private stack. Would it be possible to indicate which one should
>   use a private stack?

See my other reply. One of the bpf_verifier_ops callbacks would need to
indicate back to the trampoline which callback is nested with limited recursion.


end of thread, other threads:[~2024-10-15 21:40 UTC | newest]

Thread overview: 25+ messages
2024-10-10 17:55 [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Yonghong Song
2024-10-10 17:55 ` [PATCH bpf-next v4 01/10] bpf: Allow each subprog having stack size of 512 bytes Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 02/10] bpf: Mark each subprog with proper private stack modes Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 03/10] bpf, x86: Refactor func emit_prologue Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 04/10] bpf, x86: Create a helper for certain "reg <op>= imm" operations Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 05/10] bpf, x86: Add jit support for private stack Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 06/10] selftests/bpf: Add private stack tests Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 07/10] bpf: Support calling non-tailcall bpf prog Yonghong Song
2024-10-10 20:28   ` Alexei Starovoitov
2024-10-11  4:12     ` Yonghong Song
2024-10-15 21:18       ` Tejun Heo
2024-10-15 21:35         ` Alexei Starovoitov
2024-10-10 17:56 ` [PATCH bpf-next v4 08/10] bpf, x86: Create two helpers for some arith operations Yonghong Song
2024-10-10 20:21   ` Alexei Starovoitov
2024-10-11  4:16     ` Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 09/10] bpf, x86: Jit support for nested bpf_prog_call Yonghong Song
2024-10-10 20:53   ` Alexei Starovoitov
2024-10-11  4:20     ` Yonghong Song
2024-10-11  4:29       ` Alexei Starovoitov
2024-10-11 15:38         ` Yonghong Song
2024-10-11 15:40           ` Alexei Starovoitov
2024-10-11 16:14             ` Yonghong Song
2024-10-10 17:56 ` [PATCH bpf-next v4 10/10] selftests/bpf: Add tests for bpf_prog_call() Yonghong Song
2024-10-15 21:28 ` [PATCH bpf-next v4 00/10] bpf: Support private stack for bpf progs Tejun Heo
2024-10-15 21:39   ` Alexei Starovoitov
