* [PATCH bpf-next v1 00/13] bpf qdisc
@ 2024-12-13 23:29 Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Hi all,

This patchset aims to support implementing qdisc using bpf struct_ops.
This version takes a step back and implements only the minimum support
for bpf qdisc. Two features are deferred to future patchsets:
1) support for adding an skb to bpf_list and bpf_rbtree directly, and
2) classful qdisc.

* Overview *

This series supports implementing qdisc using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement core qdisc logic using bpf and implements the
mundane parts for them. In addition, the ability to easily communicate
between the qdisc and other components will bring new opportunities for
applications and optimizations.

* struct_ops changes *

To make struct_ops work better with bpf qdisc, two new changes are
introduced to bpf specifically for struct_ops programs. First, we
introduce the "__ref" suffix for arguments in stub functions in
patches 1-2. It allows Qdisc_ops->enqueue to acquire a referenced kptr
to an skb just once. Through the reference object tracking mechanism in
the verifier, we can make sure that the acquired skb will be either
enqueued or dropped. Additionally, no duplicate references can be
acquired. Second, we allow a reference to be leaked through the return
value of struct_ops programs so that we can return an skb naturally.
This is done and tested in patches 3 and 4.
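
For reference, the kernel-side stub for enqueue (from patch 5 in this
series) shows how an argument gets tagged; the "__ref" suffix on the
stub's parameter is what tells the verifier to treat the skb as a
referenced kptr:

    /* Stub function; only its BTF signature matters. The "__ref" suffix
     * marks skb as a referenced kptr that the bpf program must consume
     * (enqueue or drop) exactly once.
     */
    static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
                                  struct sk_buff **to_free)
    {
            return 0;
    }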

* Performance of bpf qdisc *

We tested several bpf qdiscs included in the selftests and their
in-tree counterparts to give you a sense of the performance of qdiscs
implemented in bpf.

The implementation of bpf_fq is fairly complex and differs slightly
from fq, so below we only compare the two fifo qdiscs directly. bpf_fq
implements the same fair queueing algorithm as fq, but without flow hash
collision avoidance and garbage collection of inactive flows. bpf_fifo
uses a single bpf_list as a queue instead of the three priority queues
in pfifo_fast. The time complexity of the two fifos, however, should be
similar since queue selection time is negligible.
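
As a rough sketch (the struct and variable names here are illustrative,
not the exact selftest code), bpf_fifo's single queue can be built from
a bpf_list whose nodes hold the skb as a referenced kptr:

    struct skb_node {
            struct sk_buff __kptr *skb;   /* referenced skb owned by the node */
            struct bpf_list_node node;
    };

    #define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))

    /* single global fifo queue guarded by a spin lock */
    private(A) struct bpf_spin_lock q_fifo_lock;
    private(A) struct bpf_list_head q_fifo __contains(skb_node, node);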

Test setup:

    client -> qdisc ------------->  server
    ~~~~~~~~~~~~~~~                 ~~~~~~
    nested VM1 @ DC1               VM2 @ DC2

Throughput: iperf3 -t 600, 5 times

      Qdisc        Average (GBits/sec)
    ----------     -------------------
    pfifo_fast       12.52 ± 0.26
    bpf_fifo         11.72 ± 0.32 
    fq               10.24 ± 0.13
    bpf_fq           11.92 ± 0.64 

Latency: sockperf pp --tcp -t 600, 5 times

      Qdisc        Average (usec)
    ----------     --------------
    pfifo_fast      244.58 ± 7.93
    bpf_fifo        244.92 ± 15.22
    fq              234.30 ± 19.25
    bpf_fq          221.34 ± 10.76

Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
implementation is consistent with previous observations (v8 throughput
test on a loopback device). This could be mitigated in the future by
supporting adding an skb to bpf_list or bpf_rbtree directly.

* Clean up skb in bpf qdisc during reset *

The current implementation relies on bpf qdisc implementers to
correctly release skbs in queues (bpf graphs or maps) in .reset, which
might not be safe. The solution, as Martin suggested, would be to
support private data in struct_ops. This would also help simplify the
implementation of qdiscs that work with mq. For example, qdiscs in the
selftests mostly use global data; therefore, even if a user adds
multiple qdisc instances under mq, they would still share the same
queue.

---
v1:
    Fix struct_ops referenced kptr acquire/return mechanisms
    Allow creating dynptr from skb
    Add bpf qdisc kfunc filter
    Support updating bstats and qstats
    Update qdiscs in selftest to update stats
    Add gc, handle hash collision and fix bugs in fq_bpf

past RFCs

v9: Drop classful qdisc operations and kfuncs
    Drop support of enqueuing skb directly to bpf_rbtree/list
    Link: https://lore.kernel.org/bpf/20240714175130.4051012-1-amery.hung@bytedance.com/

v8: Implement support of bpf qdisc using struct_ops
    Allow struct_ops to acquire referenced kptr via argument
    Allow struct_ops to release and return referenced kptr
    Support enqueuing sk_buff to bpf_rbtree/list
    Move examples from samples to selftests
    Add a classful qdisc selftest
    Link: https://lore.kernel.org/netdev/20240510192412.3297104-15-amery.hung@bytedance.com/

v7: Reference skb using kptr to sk_buff instead of __sk_buff
    Use the new bpf rbtree/link to for skb queues
    Add reset and init programs
    Add a bpf fq qdisc sample
    Add a bpf netem qdisc sample
    Link: https://lore.kernel.org/netdev/cover.1705432850.git.amery.hung@bytedance.com/

v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)


Amery Hung (13):
  bpf: Support getting referenced kptr from struct_ops argument
  selftests/bpf: Test referenced kptr arguments of struct_ops programs
  bpf: Allow struct_ops prog to return referenced kptr
  selftests/bpf: Test returning referenced kptr from struct_ops programs
  bpf: net_sched: Support implementation of Qdisc_ops in bpf
  bpf: net_sched: Add basic bpf qdisc kfuncs
  bpf: net_sched: Add a qdisc watchdog timer
  bpf: net_sched: Support updating bstats
  bpf: net_sched: Support updating qstats
  bpf: net_sched: Allow writing to more Qdisc members
  libbpf: Support creating and destroying qdisc
  selftests: Add a basic fifo qdisc test
  selftests: Add a bpf fq qdisc to selftest

 include/linux/bpf.h                           |   3 +
 include/linux/btf.h                           |   1 +
 include/net/sch_generic.h                     |   4 +
 kernel/bpf/bpf_struct_ops.c                   |  26 +-
 kernel/bpf/btf.c                              |   5 +-
 kernel/bpf/verifier.c                         |  77 +-
 net/sched/Kconfig                             |  12 +
 net/sched/Makefile                            |   1 +
 net/sched/bpf_qdisc.c                         | 394 ++++++++++
 net/sched/sch_api.c                           |  18 +-
 net/sched/sch_generic.c                       |  11 +-
 tools/lib/bpf/libbpf.h                        |   5 +-
 tools/lib/bpf/netlink.c                       |  20 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  15 +
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   6 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/bpf_qdisc.c      | 185 +++++
 .../prog_tests/test_struct_ops_kptr_return.c  |  87 +++
 .../prog_tests/test_struct_ops_refcounted.c   |  58 ++
 .../selftests/bpf/progs/bpf_qdisc_common.h    |  27 +
 .../selftests/bpf/progs/bpf_qdisc_fifo.c      | 117 +++
 .../selftests/bpf/progs/bpf_qdisc_fq.c        | 726 ++++++++++++++++++
 .../bpf/progs/struct_ops_kptr_return.c        |  29 +
 ...uct_ops_kptr_return_fail__invalid_scalar.c |  24 +
 .../struct_ops_kptr_return_fail__local_kptr.c |  30 +
 ...uct_ops_kptr_return_fail__nonzero_offset.c |  23 +
 .../struct_ops_kptr_return_fail__wrong_type.c |  28 +
 .../bpf/progs/struct_ops_refcounted.c         |  67 ++
 ...ruct_ops_refcounted_fail__global_subprog.c |  32 +
 .../struct_ops_refcounted_fail__ref_leak.c    |  17 +
 30 files changed, 2026 insertions(+), 23 deletions(-)
 create mode 100644 net/sched/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c

-- 
2.20.1



* [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-18  0:58   ` Martin KaFai Lau
  2024-12-13 23:29 ` [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Allow struct_ops programs to acquire referenced kptrs through arguments
by directly reading the argument.

The verifier will acquire a reference for a struct_ops argument tagged
with "__ref" in the stub function at the beginning of the main program.
The user will be able to access the referenced kptr directly by reading
the context, as long as it has not been released by the program.

This new mechanism to acquire a referenced kptr (compared to the
existing "kfunc with KF_ACQUIRE") is introduced for ergonomic and
semantic reasons. In the first use case, Qdisc_ops, an skb is passed to
.enqueue as the first argument. This mechanism provides a natural way
for users to get a referenced kptr in the .enqueue struct_ops program
and makes sure that a qdisc will always enqueue or drop the skb.
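
As a minimal sketch (using the test-only bpf_testmod_ops added in the
next patch), the kernel side tags the stub argument and the bpf program
then reads the referenced kptr directly from the context:

    /* kernel stub: the "__ref" suffix marks task as a referenced kptr */
    static int bpf_testmod_ops__test_refcounted(int dummy,
                                                struct task_struct *task__ref)
    {
            return 0;
    }

    /* bpf program: task read from the context carries ref_obj_id > 0
     * and must be released (here via bpf_task_release()) before exit.
     */
    SEC("struct_ops/test_refcounted")
    int BPF_PROG(test_refcounted, int dummy, struct task_struct *task)
    {
            bpf_task_release(task);
            return 0;
    }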

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/bpf.h         |  3 +++
 kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
 kernel/bpf/btf.c            |  1 +
 kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
 4 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1b84613b10ac..72bf941d1daf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
 		struct {
 			struct btf *btf;
 			u32 btf_id;
+			u32 ref_obj_id;
 		};
 	};
 	struct bpf_verifier_log *log; /* for verbose logs */
@@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
 	enum bpf_reg_type reg_type;
 	struct btf *btf;
 	u32 btf_id;
+	u32 ref_obj_id;
+	bool refcounted;
 };
 
 struct btf_mod_pair {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index fda3dd2ee984..6e7795744f6a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
 }
 
 #define MAYBE_NULL_SUFFIX "__nullable"
+#define REFCOUNTED_SUFFIX "__ref"
 #define MAX_STUB_NAME 128
 
 /* Return the type info of a stub function, if it exists.
@@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
 			    struct bpf_struct_ops_arg_info *arg_info)
 {
 	const struct btf_type *stub_func_proto, *pointed_type;
+	bool is_nullable = false, is_refcounted = false;
 	const struct btf_param *stub_args, *args;
 	struct bpf_ctx_arg_aux *info, *info_buf;
 	u32 nargs, arg_no, info_cnt = 0;
+	const char *suffix;
 	u32 arg_btf_id;
 	int offset;
 
@@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
 	info = info_buf;
 	for (arg_no = 0; arg_no < nargs; arg_no++) {
 		/* Skip arguments that is not suffixed with
-		 * "__nullable".
+		 * "__nullable" or "__ref".
 		 */
-		if (!btf_param_match_suffix(btf, &stub_args[arg_no],
-					    MAYBE_NULL_SUFFIX))
+		is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
+						     MAYBE_NULL_SUFFIX);
+		is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
+						       REFCOUNTED_SUFFIX);
+		if (!is_nullable && !is_refcounted)
 			continue;
 
+		if (is_nullable)
+			suffix = MAYBE_NULL_SUFFIX;
+		else if (is_refcounted)
+			suffix = REFCOUNTED_SUFFIX;
 		/* Should be a pointer to struct */
 		pointed_type = btf_type_resolve_ptr(btf,
 						    args[arg_no].type,
@@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
 		if (!pointed_type ||
 		    !btf_type_is_struct(pointed_type)) {
 			pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
-				st_ops_name, member_name, MAYBE_NULL_SUFFIX);
+				st_ops_name, member_name, suffix);
 			goto err_out;
 		}
 
@@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
 		}
 
 		/* Fill the information of the new argument */
-		info->reg_type =
-			PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
 		info->btf_id = arg_btf_id;
 		info->btf = btf;
 		info->offset = offset;
+		if (is_nullable) {
+			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
+		} else if (is_refcounted) {
+			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
+			info->refcounted = true;
+		}
 
 		info++;
 		info_cnt++;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index e7a59e6462a9..a05ccf9ee032 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 			info->reg_type = ctx_arg_info->reg_type;
 			info->btf = ctx_arg_info->btf ? : btf_vmlinux;
 			info->btf_id = ctx_arg_info->btf_id;
+			info->ref_obj_id = ctx_arg_info->ref_obj_id;
 			return true;
 		}
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9f5de8d4fbd0..69753096075f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
 	return -EINVAL;
 }
 
+static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
+{
+	int i;
+
+	for (i = 0; i < state->acquired_refs; i++)
+		if (state->refs[i].id == ptr_id)
+			return true;
+
+	return false;
+}
+
 static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
 {
 	int i, last_idx;
@@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
 /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
 static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
 			    enum bpf_access_type t, enum bpf_reg_type *reg_type,
-			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
+			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
+			    u32 *ref_obj_id)
 {
 	struct bpf_insn_access_aux info = {
 		.reg_type = *reg_type,
@@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 		*is_retval = info.is_retval;
 
 		if (base_type(*reg_type) == PTR_TO_BTF_ID) {
+			if (info.ref_obj_id &&
+			    !find_reference_state(cur_func(env), info.ref_obj_id)) {
+				verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
+					off);
+				return -EACCES;
+			}
+
 			*btf = info.btf;
 			*btf_id = info.btf_id;
+			*ref_obj_id = info.ref_obj_id;
 		} else {
 			env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
 		}
@@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		struct bpf_retval_range range;
 		enum bpf_reg_type reg_type = SCALAR_VALUE;
 		struct btf *btf = NULL;
-		u32 btf_id = 0;
+		u32 btf_id = 0, ref_obj_id = 0;
 
 		if (t == BPF_WRITE && value_regno >= 0 &&
 		    is_pointer_value(env, value_regno)) {
@@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			return err;
 
 		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
-				       &btf_id, &is_retval, is_ldsx);
+				       &btf_id, &is_retval, is_ldsx, &ref_obj_id);
 		if (err)
 			verbose_linfo(env, insn_idx, "; ");
 		if (!err && t == BPF_READ && value_regno >= 0) {
@@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				if (base_type(reg_type) == PTR_TO_BTF_ID) {
 					regs[value_regno].btf = btf;
 					regs[value_regno].btf_id = btf_id;
+					regs[value_regno].ref_obj_id = ref_obj_id;
 				}
 			}
 			regs[value_regno].type = reg_type;
@@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
 {
 	bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
 	struct bpf_subprog_info *sub = subprog_info(env, subprog);
+	struct bpf_ctx_arg_aux *ctx_arg_info;
 	struct bpf_verifier_state *state;
 	struct bpf_reg_state *regs;
 	int ret, i;
@@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
 		mark_reg_known_zero(env, regs, BPF_REG_1);
 	}
 
+	if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
+		for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
+			if (ctx_arg_info[i].refcounted)
+				ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
+	}
+
 	ret = do_check(env);
 out:
 	/* check for NULL is necessary, since cur_state can be freed inside
-- 
2.20.1



* [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-18  1:17   ` Martin KaFai Lau
  2024-12-19  3:40   ` Yonghong Song
  2024-12-13 23:29 ` [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Test referenced kptrs acquired through struct_ops arguments tagged with
"__ref". The success case checks whether 1) a reference to the correct
type is acquired, and 2) the referenced kptr argument can be accessed
from multiple paths as long as it hasn't been released. In the fail
cases, we first confirm that a referenced kptr acquired through a
struct_ops argument is not allowed to be leaked. Then, we make sure this
new referenced kptr acquiring mechanism does not accidentally allow
referenced kptrs to flow into global subprograms through their
arguments.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  7 ++
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  2 +
 .../prog_tests/test_struct_ops_refcounted.c   | 58 ++++++++++++++++
 .../bpf/progs/struct_ops_refcounted.c         | 67 +++++++++++++++++++
 ...ruct_ops_refcounted_fail__global_subprog.c | 32 +++++++++
 .../struct_ops_refcounted_fail__ref_leak.c    | 17 +++++
 6 files changed, 183 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 987d41af71d2..244234546ae2 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -1135,10 +1135,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
 	return 0;
 }
 
+static int bpf_testmod_ops__test_refcounted(int dummy,
+					    struct task_struct *task__ref)
+{
+	return 0;
+}
+
 static struct bpf_testmod_ops __bpf_testmod_ops = {
 	.test_1 = bpf_testmod_test_1,
 	.test_2 = bpf_testmod_test_2,
 	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+	.test_refcounted = bpf_testmod_ops__test_refcounted,
 };
 
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index fb7dff47597a..0e31586c1353 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -36,6 +36,8 @@ struct bpf_testmod_ops {
 	/* Used to test nullable arguments. */
 	int (*test_maybe_null)(int dummy, struct task_struct *task);
 	int (*unsupported_ops)(void);
+	/* Used to test arguments tagged with "__ref". */
+	int (*test_refcounted)(int dummy, struct task_struct *task);
 
 	/* The following fields are used to test shadow copies. */
 	char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
new file mode 100644
index 000000000000..976df951b700
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
@@ -0,0 +1,58 @@
+#include <test_progs.h>
+
+#include "struct_ops_refcounted.skel.h"
+#include "struct_ops_refcounted_fail__ref_leak.skel.h"
+#include "struct_ops_refcounted_fail__global_subprog.skel.h"
+
+/* Test that the verifier accepts a program that first acquires a referenced
+ * kptr through context and then releases the reference
+ */
+static void refcounted(void)
+{
+	struct struct_ops_refcounted *skel;
+
+	skel = struct_ops_refcounted__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+		return;
+
+	struct_ops_refcounted__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that acquires a referenced
+ * kptr through context without releasing the reference
+ */
+static void refcounted_fail__ref_leak(void)
+{
+	struct struct_ops_refcounted_fail__ref_leak *skel;
+
+	skel = struct_ops_refcounted_fail__ref_leak__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
+		return;
+
+	struct_ops_refcounted_fail__ref_leak__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that contains a global
+ * subprogram with referenced kptr arguments
+ */
+static void refcounted_fail__global_subprog(void)
+{
+	struct struct_ops_refcounted_fail__global_subprog *skel;
+
+	skel = struct_ops_refcounted_fail__global_subprog__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
+		return;
+
+	struct_ops_refcounted_fail__global_subprog__destroy(skel);
+}
+
+void test_struct_ops_refcounted(void)
+{
+	if (test__start_subtest("refcounted"))
+		refcounted();
+	if (test__start_subtest("refcounted_fail__ref_leak"))
+		refcounted_fail__ref_leak();
+	if (test__start_subtest("refcounted_fail__global_subprog"))
+		refcounted_fail__global_subprog();
+}
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
new file mode 100644
index 000000000000..2c1326668b92
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
@@ -0,0 +1,67 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This is a test BPF program that uses struct_ops to access a referenced
+ * kptr argument. This is a test for the verifier to ensure that it
+ * 1) recognizes the task as a referenced object (i.e., ref_obj_id > 0), and
+ * 2) the same reference can be acquired from multiple paths as long as it
+ *    has not been released.
+ *
+ * test_refcounted() is equivalent to the C code below. It is written in assembly
+ * to avoid reads from task (i.e., getting referenced kptrs to task) being merged
+ * into single path by the compiler.
+ *
+ * int test_refcounted(int dummy, struct task_struct *task)
+ * {
+ *         if (dummy % 2)
+ *                 bpf_task_release(task);
+ *         else
+ *                 bpf_task_release(task);
+ *         return 0;
+ * }
+ */
+SEC("struct_ops/test_refcounted")
+int test_refcounted(unsigned long long *ctx)
+{
+	asm volatile ("					\
+	/* r6 = dummy */				\
+	r6 = *(u64 *)(r1 + 0x0);			\
+	/* if (r6 & 0x1 != 0) */			\
+	r6 &= 0x1;					\
+	if r6 == 0 goto l0_%=;				\
+	/* r1 = task */					\
+	r1 = *(u64 *)(r1 + 0x8);			\
+	call %[bpf_task_release];			\
+	goto l1_%=;					\
+l0_%=:	/* r1 = task */					\
+	r1 = *(u64 *)(r1 + 0x8);			\
+	call %[bpf_task_release];			\
+l1_%=:	/* return 0 */					\
+"	:
+	: __imm(bpf_task_release)
+	: __clobber_all);
+	return 0;
+}
+
+/* BTF FUNC records are not generated for kfuncs referenced
+ * from inline assembly. These records are necessary for
+ * libbpf to link the program. The function below is a hack
+ * to ensure that BTF FUNC records are generated.
+ */
+void __btf_root(void)
+{
+	bpf_task_release(NULL);
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_refcounted = {
+	.test_refcounted = (void *)test_refcounted,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
new file mode 100644
index 000000000000..c7e84e63b053
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
@@ -0,0 +1,32 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern void bpf_task_release(struct task_struct *p) __ksym;
+
+__noinline int subprog_release(__u64 *ctx __arg_ctx)
+{
+	struct task_struct *task = (struct task_struct *)ctx[1];
+	int dummy = (int)ctx[0];
+
+	bpf_task_release(task);
+
+	return dummy + 1;
+}
+
+SEC("struct_ops/test_refcounted")
+int test_refcounted(unsigned long long *ctx)
+{
+	struct task_struct *task = (struct task_struct *)ctx[1];
+
+	bpf_task_release(task);
+
+	return subprog_release(ctx);
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+	.test_refcounted = (void *)test_refcounted,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
new file mode 100644
index 000000000000..6e82859eb187
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
@@ -0,0 +1,17 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/test_refcounted")
+int BPF_PROG(test_refcounted, int dummy,
+	     struct task_struct *task)
+{
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+	.test_refcounted = (void *)test_refcounted,
+};
-- 
2.20.1



* [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-18 22:29   ` Martin KaFai Lau
  2024-12-13 23:29 ` [PATCH bpf-next v1 04/13] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Allow a struct_ops program to return a referenced kptr if the
struct_ops operator's return type is a struct pointer. To make sure the
returned pointer continues to be valid in the kernel, several
constraints are required:

1) The type of the pointer must match the return type
2) The pointer originally comes from the kernel (not locally allocated)
3) The pointer is in its unmodified form

Implementation-wise, a referenced kptr first needs to be allowed to
leak in check_reference_leak() if it is in the return register. Then,
in check_return_code(), constraints 1-3 are checked.

In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
pointer to be returned when there is no skb to be dequeued, we allow a
scalar value equal to NULL (i.e., zero) to be returned.

In the future, when there is a struct_ops user that always expects a
valid pointer to be returned from an operator, we may extend tagging to
the return value. We could then tell the verifier to only allow a NULL
pointer to be returned if the return value is tagged with MAY_BE_NULL.
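
For example (mirroring the selftest added later in this series), an
operator whose return type is struct task_struct * may either release
the reference and return NULL, or return the referenced kptr in its
unmodified form:

    SEC("struct_ops/test_return_ref_kptr")
    struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
                                 struct task_struct *task, struct cgroup *cgrp)
    {
            if (dummy % 2) {
                    bpf_task_release(task);
                    return NULL;    /* scalar zero is allowed */
            }
            return task;            /* reference consumed by returning it */
    }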

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 kernel/bpf/verifier.c | 42 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 69753096075f..c04028106710 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10453,6 +10453,8 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 
 static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
 {
+	enum bpf_prog_type type = resolve_prog_type(env->prog);
+	struct bpf_reg_state *reg = reg_state(env, BPF_REG_0);
 	struct bpf_func_state *state = cur_func(env);
 	bool refs_lingering = false;
 	int i;
@@ -10463,6 +10465,12 @@ static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exi
 	for (i = 0; i < state->acquired_refs; i++) {
 		if (state->refs[i].type != REF_TYPE_PTR)
 			continue;
+		/* Allow struct_ops programs to leak referenced kptr through return value.
+		 * Type checks are performed later in check_return_code.
+		 */
+		if (type == BPF_PROG_TYPE_STRUCT_OPS && !exception_exit &&
+		    reg->ref_obj_id == state->refs[i].id)
+			continue;
 		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
 			state->refs[i].id, state->refs[i].insn_idx);
 		refs_lingering = true;
@@ -15993,13 +16001,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 	const char *exit_ctx = "At program exit";
 	struct tnum enforce_attach_type_range = tnum_unknown;
 	const struct bpf_prog *prog = env->prog;
-	struct bpf_reg_state *reg;
+	struct bpf_reg_state *reg = reg_state(env, regno);
 	struct bpf_retval_range range = retval_range(0, 1);
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
 	int err;
 	struct bpf_func_state *frame = env->cur_state->frame[0];
 	const bool is_subprog = frame->subprogno;
 	bool return_32bit = false;
+	struct btf *btf = bpf_prog_get_target_btf(prog);
+	const struct btf_type *ret_type = NULL;
 
 	/* LSM and struct_ops func-ptr's return type could be "void" */
 	if (!is_subprog || frame->in_exception_callback_fn) {
@@ -16008,10 +16018,31 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 			if (prog->expected_attach_type == BPF_LSM_CGROUP)
 				/* See below, can be 0 or 0-1 depending on hook. */
 				break;
-			fallthrough;
+			if (!prog->aux->attach_func_proto->type)
+				return 0;
+			break;
 		case BPF_PROG_TYPE_STRUCT_OPS:
 			if (!prog->aux->attach_func_proto->type)
 				return 0;
+
+			if (frame->in_exception_callback_fn)
+				break;
+
+			/* Allow a struct_ops program to return a referenced kptr if it
+			 * matches the operator's return type and is in its unmodified
+			 * form. A scalar zero (i.e., a null pointer) is also allowed.
+			 */
+			ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
+			if (btf_type_is_ptr(ret_type) && reg->type & PTR_TO_BTF_ID &&
+			    reg->ref_obj_id) {
+				if (reg->btf_id != ret_type->type) {
+					verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
+						btf_type_name(reg->btf, reg->btf_id),
+						btf_type_name(btf, ret_type->type));
+					return -EINVAL;
+				}
+				return __check_ptr_off_reg(env, reg, regno, false);
+			}
 			break;
 		default:
 			break;
@@ -16033,8 +16064,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 		return -EACCES;
 	}
 
-	reg = cur_regs(env) + regno;
-
 	if (frame->in_async_callback_fn) {
 		/* enforce return zero from async callbacks like timer */
 		exit_ctx = "At async callback return";
@@ -16133,6 +16162,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 	case BPF_PROG_TYPE_NETFILTER:
 		range = retval_range(NF_DROP, NF_ACCEPT);
 		break;
+	case BPF_PROG_TYPE_STRUCT_OPS:
+		if (!ret_type || !btf_type_is_ptr(ret_type))
+			return 0;
+		range = retval_range(0, 0);
+		break;
 	case BPF_PROG_TYPE_EXT:
 		/* freplace program can return anything as its return value
 		 * depends on the to-be-replaced kernel func or bpf program.
-- 
2.20.1



* [PATCH bpf-next v1 04/13] selftests/bpf: Test returning referenced kptr from struct_ops programs
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (2 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Test struct_ops programs returning referenced kptrs. When the return
type of a struct_ops operator is a pointer to struct, the verifier
should only allow programs that return a scalar NULL or a non-local
kptr of the correct type in its unmodified form.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  8 ++
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  4 +
 .../prog_tests/test_struct_ops_kptr_return.c  | 87 +++++++++++++++++++
 .../bpf/progs/struct_ops_kptr_return.c        | 29 +++++++
 ...uct_ops_kptr_return_fail__invalid_scalar.c | 24 +++++
 .../struct_ops_kptr_return_fail__local_kptr.c | 30 +++++++
 ...uct_ops_kptr_return_fail__nonzero_offset.c | 23 +++++
 .../struct_ops_kptr_return_fail__wrong_type.c | 28 ++++++
 8 files changed, 233 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 244234546ae2..cfab09f16cc2 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -1141,11 +1141,19 @@ static int bpf_testmod_ops__test_refcounted(int dummy,
 	return 0;
 }
 
+static struct task_struct *
+bpf_testmod_ops__test_return_ref_kptr(int dummy, struct task_struct *task__ref,
+				      struct cgroup *cgrp)
+{
+	return NULL;
+}
+
 static struct bpf_testmod_ops __bpf_testmod_ops = {
 	.test_1 = bpf_testmod_test_1,
 	.test_2 = bpf_testmod_test_2,
 	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
 	.test_refcounted = bpf_testmod_ops__test_refcounted,
+	.test_return_ref_kptr = bpf_testmod_ops__test_return_ref_kptr,
 };
 
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index 0e31586c1353..a66659314e67 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct cgroup;
 
 struct bpf_testmod_test_read_ctx {
 	char *buf;
@@ -38,6 +39,9 @@ struct bpf_testmod_ops {
 	int (*unsupported_ops)(void);
 	/* Used to test ref_acquired arguments. */
 	int (*test_refcounted)(int dummy, struct task_struct *task);
+	/* Used to test returning referenced kptr. */
+	struct task_struct *(*test_return_ref_kptr)(int dummy, struct task_struct *task,
+						    struct cgroup *cgrp);
 
 	/* The following fields are used to test shadow copies. */
 	char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
new file mode 100644
index 000000000000..bc2fac39215a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
@@ -0,0 +1,87 @@
+#include <test_progs.h>
+
+#include "struct_ops_kptr_return.skel.h"
+#include "struct_ops_kptr_return_fail__wrong_type.skel.h"
+#include "struct_ops_kptr_return_fail__invalid_scalar.skel.h"
+#include "struct_ops_kptr_return_fail__nonzero_offset.skel.h"
+#include "struct_ops_kptr_return_fail__local_kptr.skel.h"
+
+/* Test that the verifier accepts a program that acquires a referenced
+ * kptr and releases the reference through return
+ */
+static void kptr_return(void)
+{
+	struct struct_ops_kptr_return *skel;
+
+	skel = struct_ops_kptr_return__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+		return;
+
+	struct_ops_kptr_return__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a kptr of the
+ * wrong type
+ */
+static void kptr_return_fail__wrong_type(void)
+{
+	struct struct_ops_kptr_return_fail__wrong_type *skel;
+
+	skel = struct_ops_kptr_return_fail__wrong_type__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__wrong_type__open_and_load"))
+		return;
+
+	struct_ops_kptr_return_fail__wrong_type__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a non-null scalar */
+static void kptr_return_fail__invalid_scalar(void)
+{
+	struct struct_ops_kptr_return_fail__invalid_scalar *skel;
+
+	skel = struct_ops_kptr_return_fail__invalid_scalar__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__invalid_scalar__open_and_load"))
+		return;
+
+	struct_ops_kptr_return_fail__invalid_scalar__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns kptr with non-zero offset */
+static void kptr_return_fail__nonzero_offset(void)
+{
+	struct struct_ops_kptr_return_fail__nonzero_offset *skel;
+
+	skel = struct_ops_kptr_return_fail__nonzero_offset__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__nonzero_offset__open_and_load"))
+		return;
+
+	struct_ops_kptr_return_fail__nonzero_offset__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns local kptr */
+static void kptr_return_fail__local_kptr(void)
+{
+	struct struct_ops_kptr_return_fail__local_kptr *skel;
+
+	skel = struct_ops_kptr_return_fail__local_kptr__open_and_load();
+	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__local_kptr__open_and_load"))
+		return;
+
+	struct_ops_kptr_return_fail__local_kptr__destroy(skel);
+}
+
+void test_struct_ops_kptr_return(void)
+{
+	if (test__start_subtest("kptr_return"))
+		kptr_return();
+	if (test__start_subtest("kptr_return_fail__wrong_type"))
+		kptr_return_fail__wrong_type();
+	if (test__start_subtest("kptr_return_fail__invalid_scalar"))
+		kptr_return_fail__invalid_scalar();
+	if (test__start_subtest("kptr_return_fail__nonzero_offset"))
+		kptr_return_fail__nonzero_offset();
+	if (test__start_subtest("kptr_return_fail__local_kptr"))
+		kptr_return_fail__local_kptr();
+}
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
new file mode 100644
index 000000000000..29b7719cd4c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
@@ -0,0 +1,29 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests a struct_ops BPF program returning a referenced kptr. The verifier should
+ * allow a referenced kptr or a NULL pointer to be returned. A referenced kptr to task
+ * here is acquired automatically as the task argument is tagged with "__ref".
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	if (dummy % 2) {
+		bpf_task_release(task);
+		return NULL;
+	}
+	return task;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
new file mode 100644
index 000000000000..d67982ba8224
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
@@ -0,0 +1,24 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests a struct_ops BPF program returning a referenced kptr. The verifier should
+ * reject programs returning a non-zero scalar value.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	bpf_task_release(task);
+	return (struct task_struct *)1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
new file mode 100644
index 000000000000..9a4247432539
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
@@ -0,0 +1,30 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests a struct_ops BPF program returning a referenced kptr. The verifier should
+ * reject programs returning a local kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	struct task_struct *t;
+
+	t = bpf_obj_new(typeof(*task));
+	if (!t)
+		return task;
+
+	return t;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
new file mode 100644
index 000000000000..5bb0b4029d11
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
@@ -0,0 +1,23 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests a struct_ops BPF program returning a referenced kptr. The verifier should
+ * reject programs returning a modified referenced kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	return (struct task_struct *)&task->jobctl;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
new file mode 100644
index 000000000000..32365cb7af49
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
@@ -0,0 +1,28 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests a struct_ops BPF program returning a referenced kptr. The verifier should
+ * reject programs returning a referenced kptr of the wrong type.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	struct task_struct *ret;
+
+	ret = (struct task_struct *)bpf_cgroup_acquire(cgrp);
+	bpf_task_release(task);
+
+	return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
-- 
2.20.1



* [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (3 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 04/13] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-14  4:51   ` Cong Wang
  2024-12-18 23:37   ` Martin KaFai Lau
  2024-12-13 23:29 ` [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Enable users to implement a classless qdisc using bpf. The previous few
patches in this series have prepared struct_ops to support core
operators in Qdisc_ops. Recent advancements in bpf, such as allocated
objects, bpf list, and bpf rbtree, have also provided powerful and
flexible building blocks to realize sophisticated scheduling
algorithms. Therefore, in this patch, we start allowing qdiscs to be
implemented using bpf struct_ops. Users can implement
Qdisc_ops.{enqueue, dequeue, init, reset, destroy} in bpf and register
the qdisc dynamically into the kernel.
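
As a minimal sketch of the user side (the program names and the
"bpf_fifo" id are illustrative; complete examples live in the selftests
at the end of this series):

    SEC("struct_ops/bpf_fifo_enqueue")
    int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
                 struct bpf_sk_buff_ptr *to_free)
    {
            /* a real program must enqueue or drop the referenced skb */
            return NET_XMIT_SUCCESS;
    }

    SEC("struct_ops/bpf_fifo_dequeue")
    struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
    {
            return NULL;    /* nothing to dequeue */
    }

    SEC(".struct_ops.link")
    struct Qdisc_ops fifo = {
            .enqueue = (void *)bpf_fifo_enqueue,
            .dequeue = (void *)bpf_fifo_dequeue,
            .id      = "bpf_fifo",
    };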

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/btf.h     |   1 +
 kernel/bpf/btf.c        |   4 +-
 net/sched/Kconfig       |  12 +++
 net/sched/Makefile      |   1 +
 net/sched/bpf_qdisc.c   | 214 ++++++++++++++++++++++++++++++++++++++++
 net/sched/sch_api.c     |   7 +-
 net/sched/sch_generic.c |   3 +-
 7 files changed, 236 insertions(+), 6 deletions(-)
 create mode 100644 net/sched/bpf_qdisc.c

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 4214e76c9168..eb16218fdf52 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -563,6 +563,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
 const char *btf_str_by_offset(const struct btf *btf, u32 offset);
 struct btf *btf_parse_vmlinux(void);
 struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
 u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
 			       const struct bpf_prog *prog);
 u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a05ccf9ee032..f733dbf24261 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6375,8 +6375,8 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
 	return btf_type_is_int(t);
 }
 
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
-			   int off)
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+		    int off)
 {
 	const struct btf_param *args;
 	const struct btf_type *t;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..ccd0255da5a5 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,18 @@ config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	bool "BPF-based Qdisc"
+	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+	help
+	  This option allows BPF-based queueing disciplines. With BPF struct_ops,
+	  users can implement supported operators in Qdisc_ops using BPF programs.
+	  The queue holding skb can be built with BPF maps or graphs.
+
+	  Say Y here if you want to use BPF-based Qdisc.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..904d784902d1 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= bpf_qdisc.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..a2e2db29e5fc
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,214 @@
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+struct bpf_sk_buff_ptr {
+	struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+	return 0;
+}
+
+static const struct bpf_func_proto *
+bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
+			 const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	default:
+		return bpf_base_func_proto(func_id, prog);
+	}
+}
+
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	struct btf *btf = prog->aux->attach_btf;
+	u32 arg;
+
+	arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+		if (arg == 2 && type == BPF_READ) {
+			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+			info->btf = btf;
+			info->btf_id = bpf_sk_buff_ptr_ids[0];
+			return true;
+		}
+	}
+
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+					const struct bpf_reg_state *reg,
+					int off, int size)
+{
+	const struct btf_type *t, *skbt;
+	size_t end;
+
+	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t != skbt) {
+		bpf_log(log, "only read is supported\n");
+		return -EACCES;
+	}
+
+	switch (off) {
+	case offsetof(struct sk_buff, tstamp):
+		end = offsetofend(struct sk_buff, tstamp);
+		break;
+	case offsetof(struct sk_buff, priority):
+		end = offsetofend(struct sk_buff, priority);
+		break;
+	case offsetof(struct sk_buff, mark):
+		end = offsetofend(struct sk_buff, mark);
+		break;
+	case offsetof(struct sk_buff, queue_mapping):
+		end = offsetofend(struct sk_buff, queue_mapping);
+		break;
+	case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+		end = offsetof(struct sk_buff, cb) +
+		      offsetofend(struct qdisc_skb_cb, tc_classid);
+		break;
+	case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+	     offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+						     data[QDISC_CB_PRIV_LEN - 1]):
+		end = offsetof(struct sk_buff, cb) +
+		      offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+		break;
+	case offsetof(struct sk_buff, tc_index):
+		end = offsetofend(struct sk_buff, tc_index);
+		break;
+	default:
+		bpf_log(log, "no write support to sk_buff at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+	.get_func_proto		= bpf_qdisc_get_func_proto,
+	.is_valid_access	= bpf_qdisc_is_valid_access,
+	.btf_struct_access	= bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+				 const struct btf_member *member,
+				 void *kdata, const void *udata)
+{
+	const struct Qdisc_ops *uqdisc_ops;
+	struct Qdisc_ops *qdisc_ops;
+	u32 moff;
+
+	uqdisc_ops = (const struct Qdisc_ops *)udata;
+	qdisc_ops = (struct Qdisc_ops *)kdata;
+
+	moff = __btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct Qdisc_ops, priv_size):
+		if (uqdisc_ops->priv_size)
+			return -EINVAL;
+		return 1;
+	case offsetof(struct Qdisc_ops, static_flags):
+		if (uqdisc_ops->static_flags)
+			return -EINVAL;
+		return 1;
+	case offsetof(struct Qdisc_ops, peek):
+		if (!uqdisc_ops->peek)
+			qdisc_ops->peek = qdisc_peek_dequeued;
+		return 1;
+	case offsetof(struct Qdisc_ops, id):
+		if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+				     sizeof(qdisc_ops->id)) <= 0)
+			return -EINVAL;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
+{
+	return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
+{
+	return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
+			      struct sk_buff **to_free)
+{
+	return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+	return NULL;
+}
+
+static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
+{
+	return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+			   struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+	.enqueue = Qdisc_ops__enqueue,
+	.dequeue = Qdisc_ops__dequeue,
+	.peek = Qdisc_ops__peek,
+	.init = Qdisc_ops__init,
+	.reset = Qdisc_ops__reset,
+	.destroy = Qdisc_ops__destroy,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+	.verifier_ops = &bpf_qdisc_verifier_ops,
+	.reg = bpf_qdisc_reg,
+	.unreg = bpf_qdisc_unreg,
+	.init_member = bpf_qdisc_init_member,
+	.init = bpf_qdisc_init,
+	.name = "Qdisc_ops",
+	.cfi_stubs = &__bpf_ops_qdisc_ops,
+	.owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+	return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 2eefa4783879..f074053c4232 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/slab.h>
 #include <linux/hashtable.h>
+#include <linux/bpf.h>
 
 #include <net/net_namespace.h>
 #include <net/sock.h>
@@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
 		read_lock(&qdisc_mod_lock);
 		for (q = qdisc_base; q; q = q->next) {
 			if (nla_strcmp(kind, q->id) == 0) {
-				if (!try_module_get(q->owner))
+				if (!bpf_try_module_get(q, q->owner))
 					q = NULL;
 				break;
 			}
@@ -1287,7 +1288,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 				/* We will try again qdisc_lookup_ops,
 				 * so don't keep a reference.
 				 */
-				module_put(ops->owner);
+				bpf_module_put(ops, ops->owner);
 				err = -EAGAIN;
 				goto err_out;
 			}
@@ -1398,7 +1399,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	netdev_put(dev, &sch->dev_tracker);
 	qdisc_free(sch);
 err_out2:
-	module_put(ops->owner);
+	bpf_module_put(ops, ops->owner);
 err_out:
 	*errp = err;
 	return NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 38ec18f73de4..1e770ec251a0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
 #include <linux/if_vlan.h>
 #include <linux/skb_array.h>
 #include <linux/if_macvlan.h>
+#include <linux/bpf.h>
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -1083,7 +1084,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 		ops->destroy(qdisc);
 
 	lockdep_unregister_key(&qdisc->root_lock_key);
-	module_put(ops->owner);
+	bpf_module_put(ops, ops->owner);
 	netdev_put(dev, &qdisc->dev_tracker);
 
 	trace_qdisc_destroy(qdisc);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (4 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-18 17:11   ` Amery Hung
  2024-12-19  7:37   ` Martin KaFai Lau
  2024-12-13 23:29 ` [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Add basic kfuncs for working with skbs in bpf qdisc.

Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue, where a to_free skb list is provided by the kernel to
defer the release. bpf_kfree_skb() should be used elsewhere. It is also
used in bpf_obj_free_fields() when cleaning up skbs in maps and
collections.

bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.

Finally, allow users to create a read-only dynptr via
bpf_dynptr_from_skb().
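
As a rough sketch (illustrative only, not part of this patch; it assumes
vmlinux.h plus the usual bpf helpers, and NET_XMIT_DROP as defined in the
selftest header later in this series), a drop-everything enqueue program
shows how the referenced skb handed to .enqueue is released through the
deferred to_free list:

    void bpf_qdisc_skb_drop(struct sk_buff *p,
                            struct bpf_sk_buff_ptr *to_free) __ksym;

    SEC("struct_ops/noop_enqueue")
    int BPF_PROG(noop_enqueue, struct sk_buff *skb, struct Qdisc *sch,
                 struct bpf_sk_buff_ptr *to_free)
    {
            /* .enqueue receives a referenced skb; releasing it via the
             * kernel-provided to_free list satisfies the verifier's
             * reference tracking and defers the actual free.
             */
            bpf_qdisc_skb_drop(skb, to_free);
            return NET_XMIT_DROP;
    }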

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index a2e2db29e5fc..28959424eab0 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -106,6 +106,67 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 	return 0;
 }
 
+__bpf_kfunc_start_defs();
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+	return skb_get_hash(skb);
+}
+
+/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
+ * @skb: The skb whose reference is to be released and dropped.
+ */
+__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
+ * @skb: The skb whose reference is to be released and dropped.
+ * @to_free_list: The list of skbs to be dropped.
+ */
+__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
+				    struct bpf_sk_buff_ptr *to_free_list)
+{
+	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
+}
+
+__bpf_kfunc_end_defs();
+
+#define BPF_QDISC_KFUNC_xxx \
+	BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
+	BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
+	BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
+
+BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
+#define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
+BPF_QDISC_KFUNC_xxx
+#undef BPF_QDISC_KFUNC
+BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
+
+#define BPF_QDISC_KFUNC(name, _) BTF_ID_LIST_SINGLE(name##_ids, func, name)
+BPF_QDISC_KFUNC_xxx
+#undef BPF_QDISC_KFUNC
+
+static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
+		if (strcmp(prog->aux->attach_func_name, "enqueue"))
+			return -EACCES;
+
+	return 0;
+}
+
+static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_qdisc_kfunc_ids,
+	.filter = bpf_qdisc_kfunc_filter,
+};
+
 static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
 	.get_func_proto		= bpf_qdisc_get_func_proto,
 	.is_valid_access	= bpf_qdisc_is_valid_access,
@@ -209,6 +270,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
 
 static int __init bpf_qdisc_kfunc_init(void)
 {
-	return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+	int ret;
+	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+		{
+			.btf_id       = bpf_sk_buff_ids[0],
+			.kfunc_btf_id = bpf_kfree_skb_ids[0]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
+	ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+						 ARRAY_SIZE(skb_kfunc_dtors),
+						 THIS_MODULE);
+	ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+
+	return ret;
 }
 late_initcall(bpf_qdisc_kfunc_init);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (5 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-19  1:16   ` Martin KaFai Lau
  2024-12-13 23:29 ` [PATCH bpf-next v1 08/13] bpf: net_sched: Support updating bstats Amery Hung
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of the qdisc through the kfunc
bpf_qdisc_watchdog_schedule(). It can be useful for building
traffic-shaping scheduling algorithms, where the time the next packet
will be dequeued is known.
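
A minimal sketch of the intended usage (illustrative only; next_tstamp
and the omitted queue logic are hypothetical placeholders):

    void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire,
                                     u64 delta_ns) __ksym;

    static u64 next_tstamp; /* hypothetical: earliest departure time */

    SEC("struct_ops/shaper_dequeue")
    struct sk_buff *BPF_PROG(shaper_dequeue, struct Qdisc *sch)
    {
            u64 now = bpf_ktime_get_ns();

            if (now < next_tstamp) {
                    /* Nothing is eligible yet: arm the watchdog so the
                     * kernel retries dequeue around next_tstamp.
                     */
                    bpf_qdisc_watchdog_schedule(sch, next_tstamp, 0);
                    return NULL;
            }
            /* ... otherwise pop and return the head skb ... */
            return NULL;
    }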

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/net/sch_generic.h |  4 +++
 net/sched/bpf_qdisc.c     | 51 ++++++++++++++++++++++++++++++++++++++-
 net/sched/sch_api.c       | 11 +++++++++
 net/sched/sch_generic.c   |  8 ++++++
 4 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 5d74fa7e694c..6a252b1b0680 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1357,4 +1357,8 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
 		msleep(1);
 }
 
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
+void bpf_qdisc_reset_post_op(struct Qdisc *sch);
+
 #endif
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 28959424eab0..7c155207fe1e 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -8,6 +8,10 @@
 
 static struct bpf_struct_ops bpf_Qdisc_ops;
 
+struct bpf_sched_data {
+	struct qdisc_watchdog watchdog;
+};
+
 struct bpf_sk_buff_ptr {
 	struct sk_buff *skb;
 };
@@ -17,6 +21,32 @@ static int bpf_qdisc_init(struct btf *btf)
 	return 0;
 }
 
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	return 0;
+}
+EXPORT_SYMBOL(bpf_qdisc_init_pre_op);
+
+void bpf_qdisc_reset_post_op(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+EXPORT_SYMBOL(bpf_qdisc_reset_post_op);
+
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+EXPORT_SYMBOL(bpf_qdisc_destroy_post_op);
+
 static const struct bpf_func_proto *
 bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
 			 const struct bpf_prog *prog)
@@ -134,12 +164,25 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
 	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
 }
 
+/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
+ * @sch: The qdisc to be scheduled.
+ * @expire: The expiry time of the timer.
+ * @delta_ns: The slack range of the timer.
+ */
+__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
+}
+
 __bpf_kfunc_end_defs();
 
 #define BPF_QDISC_KFUNC_xxx \
 	BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
 	BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
 	BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
+	BPF_QDISC_KFUNC(bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS) \
 
 BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
 #define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
@@ -154,9 +197,14 @@ BPF_QDISC_KFUNC_xxx
 
 static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 {
-	if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
+	if (kfunc_id == bpf_qdisc_skb_drop_ids[0]) {
 		if (strcmp(prog->aux->attach_func_name, "enqueue"))
 			return -EACCES;
+	} else if (kfunc_id == bpf_qdisc_watchdog_schedule_ids[0]) {
+		if (strcmp(prog->aux->attach_func_name, "enqueue") &&
+		    strcmp(prog->aux->attach_func_name, "dequeue"))
+			return -EACCES;
+	}
 
 	return 0;
 }
@@ -189,6 +237,7 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
 	case offsetof(struct Qdisc_ops, priv_size):
 		if (uqdisc_ops->priv_size)
 			return -EINVAL;
+		qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
 		return 1;
 	case offsetof(struct Qdisc_ops, static_flags):
 		if (uqdisc_ops->static_flags)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f074053c4232..507abddcdafd 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1357,6 +1357,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 		rcu_assign_pointer(sch->stab, stab);
 	}
 
+#ifdef CONFIG_NET_SCH_BPF
+	if (ops->owner == BPF_MODULE_OWNER) {
+		err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
+		if (err != 0)
+			goto err_out4;
+	}
+#endif
 	if (ops->init) {
 		err = ops->init(sch, tca[TCA_OPTIONS], extack);
 		if (err != 0)
@@ -1393,6 +1400,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	 */
 	if (ops->destroy)
 		ops->destroy(sch);
+#ifdef CONFIG_NET_SCH_BPF
+	if (ops->owner == BPF_MODULE_OWNER)
+		bpf_qdisc_destroy_post_op(sch);
+#endif
 	qdisc_put_stab(rtnl_dereference(sch->stab));
 err_out3:
 	lockdep_unregister_key(&sch->root_lock_key);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1e770ec251a0..ea4ee7f914be 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1039,6 +1039,10 @@ void qdisc_reset(struct Qdisc *qdisc)
 
 	if (ops->reset)
 		ops->reset(qdisc);
+#ifdef CONFIG_NET_SCH_BPF
+	if (ops->owner == BPF_MODULE_OWNER)
+		bpf_qdisc_reset_post_op(qdisc);
+#endif
 
 	__skb_queue_purge(&qdisc->gso_skb);
 	__skb_queue_purge(&qdisc->skb_bad_txq);
@@ -1082,6 +1086,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 
 	if (ops->destroy)
 		ops->destroy(qdisc);
+#ifdef CONFIG_NET_SCH_BPF
+	if (ops->owner == BPF_MODULE_OWNER)
+		bpf_qdisc_destroy_post_op(qdisc);
+#endif
 
 	lockdep_unregister_key(&qdisc->root_lock_key);
 	bpf_module_put(ops, ops->owner);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 08/13] bpf: net_sched: Support updating bstats
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (6 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 09/13] bpf: net_sched: Support updating qstats Amery Hung
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is
only available in .dequeue programs.
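
For example, a .dequeue program would call it right before handing the
skb back. A sketch modeled on the selftests later in this series, where
pop_head() stands in for whatever bpf list/rbtree logic holds the queue:

    void bpf_qdisc_bstats_update(struct Qdisc *sch,
                                 const struct sk_buff *skb) __ksym;

    /* hypothetical helper that pops the next skb from the bpf queue */
    static struct sk_buff *pop_head(void);

    SEC("struct_ops/sketch_dequeue")
    struct sk_buff *BPF_PROG(sketch_dequeue, struct Qdisc *sch)
    {
            struct sk_buff *skb = pop_head();

            if (!skb)
                    return NULL;
            /* account the dequeued skb in sch->bstats (bytes/packets) */
            bpf_qdisc_bstats_update(sch, skb);
            return skb;
    }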

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 7c155207fe1e..b5ac3b9923fb 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -176,6 +176,15 @@ __bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64
 	qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
 }
 
+/* bpf_qdisc_bstats_update - Update Qdisc basic statistics
+ * @sch: The qdisc from which an skb is dequeued.
+ * @skb: The skb to be dequeued.
+ */
+__bpf_kfunc void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb)
+{
+	bstats_update(&sch->bstats, skb);
+}
+
 __bpf_kfunc_end_defs();
 
 #define BPF_QDISC_KFUNC_xxx \
@@ -183,6 +192,7 @@ __bpf_kfunc_end_defs();
 	BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
 	BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
 	BPF_QDISC_KFUNC(bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS) \
+	BPF_QDISC_KFUNC(bpf_qdisc_bstats_update, KF_TRUSTED_ARGS) \
 
 BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
 #define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
@@ -204,6 +214,9 @@ static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 		if (strcmp(prog->aux->attach_func_name, "enqueue") &&
 		    strcmp(prog->aux->attach_func_name, "dequeue"))
 			return -EACCES;
+	} else if (kfunc_id == bpf_qdisc_bstats_update_ids[0]) {
+		if (strcmp(prog->aux->attach_func_name, "dequeue"))
+			return -EACCES;
 	}
 
 	return 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 09/13] bpf: net_sched: Support updating qstats
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (7 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 08/13] bpf: net_sched: Support updating bstats Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 10/13] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Allow bpf qdisc programs to update Qdisc qstats directly with btf struct
access.
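
For instance (a fragment; qdisc_pkt_len() is the selftest helper that
reads qdisc_skb_cb(skb)->pkt_len), plain stores now pass the verifier:

    /* inside enqueue, after accepting the skb */
    sch->qstats.backlog += qdisc_pkt_len(skb);

    /* inside dequeue, before returning the skb */
    sch->qstats.backlog -= qdisc_pkt_len(skb);

    /* on a drop path */
    sch->qstats.drops++;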

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 53 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 8 deletions(-)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index b5ac3b9923fb..3901f855effc 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -57,6 +57,7 @@ bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
 	}
 }
 
+BTF_ID_LIST_SINGLE(bpf_qdisc_ids, struct, Qdisc)
 BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
 BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
 
@@ -81,20 +82,37 @@ static bool bpf_qdisc_is_valid_access(int off, int size,
 	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
 }
 
-static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
-					const struct bpf_reg_state *reg,
-					int off, int size)
+static int bpf_qdisc_qdisc_access(struct bpf_verifier_log *log,
+				  const struct bpf_reg_state *reg,
+				  int off, int size)
 {
-	const struct btf_type *t, *skbt;
 	size_t end;
 
-	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
-	t = btf_type_by_id(reg->btf, reg->btf_id);
-	if (t != skbt) {
-		bpf_log(log, "only read is supported\n");
+	switch (off) {
+	case offsetof(struct Qdisc, qstats) ... offsetofend(struct Qdisc, qstats) - 1:
+		end = offsetofend(struct Qdisc, qstats);
+		break;
+	default:
+		bpf_log(log, "no write support to Qdisc at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of Qdisc ended at %zu\n",
+			off, size, end);
 		return -EACCES;
 	}
 
+	return 0;
+}
+
+static int bpf_qdisc_sk_buff_access(struct bpf_verifier_log *log,
+				    const struct bpf_reg_state *reg,
+				    int off, int size)
+{
+	size_t end;
+
 	switch (off) {
 	case offsetof(struct sk_buff, tstamp):
 		end = offsetofend(struct sk_buff, tstamp);
@@ -136,6 +154,25 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 	return 0;
 }
 
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+				       const struct bpf_reg_state *reg,
+				       int off, int size)
+{
+	const struct btf_type *t, *skbt, *qdisct;
+
+	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+	qdisct = btf_type_by_id(reg->btf, bpf_qdisc_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+
+	if (t == skbt)
+		return bpf_qdisc_sk_buff_access(log, reg, off, size);
+	else if (t == qdisct)
+		return bpf_qdisc_qdisc_access(log, reg, off, size);
+
+	bpf_log(log, "only read is supported\n");
+	return -EACCES;
+}
+
 __bpf_kfunc_start_defs();
 
 /* bpf_skb_get_hash - Get the flow hash of an skb.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 10/13] bpf: net_sched: Allow writing to more Qdisc members
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (8 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 09/13] bpf: net_sched: Support updating qstats Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc Amery Hung
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Allow bpf qdisc programs to write to Qdisc->limit and Qdisc->q.qlen.
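
A short sketch of what this enables, mirroring the fifo selftest later
in this series:

    SEC("struct_ops/sketch_init")
    int BPF_PROG(sketch_init, struct Qdisc *sch, struct nlattr *opt,
                 struct netlink_ext_ack *extack)
    {
            sch->limit = 1000;      /* Qdisc->limit is now writable */
            return 0;
    }

    /* ... and in enqueue/dequeue: */
    sch->q.qlen++;                  /* Qdisc->q.qlen is now writable */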

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 3901f855effc..1caa9f696d2d 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -89,6 +89,12 @@ static int bpf_qdisc_qdisc_access(struct bpf_verifier_log *log,
 	size_t end;
 
 	switch (off) {
+	case offsetof(struct Qdisc, limit):
+		end = offsetofend(struct Qdisc, limit);
+		break;
+	case offsetof(struct Qdisc, q) + offsetof(struct qdisc_skb_head, qlen):
+		end = offsetof(struct Qdisc, q) + offsetofend(struct qdisc_skb_head, qlen);
+		break;
 	case offsetof(struct Qdisc, qstats) ... offsetofend(struct Qdisc, qstats) - 1:
 		end = offsetofend(struct Qdisc, qstats);
 		break;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (9 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 10/13] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-17 18:32   ` Andrii Nakryiko
  2024-12-13 23:29 ` [PATCH bpf-next v1 12/13] selftests: Add a basic fifo qdisc test Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 13/13] selftests: Add a bpf fq qdisc to selftest Amery Hung
  12 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

Extend struct bpf_tc_hook with a handle, a qdisc name, and a new attach
type, BPF_TC_QDISC, to allow users to add or remove any specified qdisc,
not just clsact.
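
Usage then looks like the following sketch (the interface name is an
illustrative placeholder; the selftest in the next patch uses the same
pattern on loopback):

    DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                        .ifindex = if_nametoindex("eth0"), /* example dev */
                        .attach_point = BPF_TC_QDISC,
                        .parent = TC_H_ROOT,
                        .handle = 0x8000000,
                        .qdisc = "bpf_fifo");
    int err;

    err = bpf_tc_hook_create(&hook);    /* adds the qdisc */
    if (!err) {
            /* ... attach struct_ops, run traffic ... */
            bpf_tc_hook_destroy(&hook); /* removes the qdisc */
    }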

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 tools/lib/bpf/libbpf.h  |  5 ++++-
 tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index b2ce3a72b11d..b05d95814776 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1268,6 +1268,7 @@ enum bpf_tc_attach_point {
 	BPF_TC_INGRESS = 1 << 0,
 	BPF_TC_EGRESS  = 1 << 1,
 	BPF_TC_CUSTOM  = 1 << 2,
+	BPF_TC_QDISC   = 1 << 3,
 };
 
 #define BPF_TC_PARENT(a, b) 	\
@@ -1282,9 +1283,11 @@ struct bpf_tc_hook {
 	int ifindex;
 	enum bpf_tc_attach_point attach_point;
 	__u32 parent;
+	__u32 handle;
+	char *qdisc;
 	size_t :0;
 };
-#define bpf_tc_hook__last_field parent
+#define bpf_tc_hook__last_field qdisc
 
 struct bpf_tc_opts {
 	size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 68a2def17175..72db8c0add21 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
 }
 
 
-typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
+typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, struct bpf_tc_hook *hook);
 
-static int clsact_config(struct libbpf_nla_req *req)
+static int clsact_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
 {
 	req->tc.tcm_parent = TC_H_CLSACT;
 	req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
@@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
 	return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
 }
 
+static int qdisc_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
+{
+	char *qdisc = OPTS_GET(hook, qdisc, NULL);
+
+	req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
+	req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
+
+	return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
+}
+
 static int attach_point_to_config(struct bpf_tc_hook *hook,
 				  qdisc_config_t *config)
 {
@@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
 		return 0;
 	case BPF_TC_CUSTOM:
 		return -EOPNOTSUPP;
+	case BPF_TC_QDISC:
+		*config = &qdisc_config;
+		return 0;
 	default:
 		return -EINVAL;
 	}
@@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
 	req.tc.tcm_family  = AF_UNSPEC;
 	req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
 
-	ret = config(&req);
+	ret = config(&req, hook);
 	if (ret < 0)
 		return ret;
 
@@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
 	case BPF_TC_INGRESS:
 	case BPF_TC_EGRESS:
 		return libbpf_err(__bpf_tc_detach(hook, NULL, true));
+	case BPF_TC_QDISC:
 	case BPF_TC_INGRESS | BPF_TC_EGRESS:
 		return libbpf_err(tc_qdisc_delete(hook));
 	case BPF_TC_CUSTOM:
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 12/13] selftests: Add a basic fifo qdisc test
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (10 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  2024-12-13 23:29 ` [PATCH bpf-next v1 13/13] selftests: Add a bpf fq qdisc to selftest Amery Hung
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

This selftest shows a bare-minimum fifo qdisc, which simply enqueues skbs
at the back of a bpf list and dequeues from the front of the list.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/bpf_qdisc.c      | 161 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_qdisc_common.h    |  27 +++
 .../selftests/bpf/progs/bpf_qdisc_fifo.c      | 117 +++++++++++++
 4 files changed, 306 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 4ca84c8d9116..cf35e7e473d4 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -70,6 +70,7 @@ CONFIG_NET_IPGRE=y
 CONFIG_NET_IPGRE_DEMUX=y
 CONFIG_NET_IPIP=y
 CONFIG_NET_MPLS_GSO=y
+CONFIG_NET_SCH_BPF=y
 CONFIG_NET_SCH_FQ=y
 CONFIG_NET_SCH_INGRESS=y
 CONFIG_NET_SCHED=y
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
new file mode 100644
index 000000000000..295d0216e70f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -0,0 +1,161 @@
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <test_progs.h>
+
+#include "network_helpers.h"
+#include "bpf_qdisc_fifo.skel.h"
+
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
+#define LO_IFINDEX 1
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static int stop;
+
+static void *server(void *arg)
+{
+	int lfd = (int)(long)arg, err = 0, fd;
+	ssize_t nr_sent = 0, bytes = 0;
+	char batch[1500];
+
+	fd = accept(lfd, NULL, NULL);
+	while (fd == -1) {
+		if (errno == EINTR)
+			continue;
+		err = -errno;
+		goto done;
+	}
+
+	if (settimeo(fd, 0)) {
+		err = -errno;
+		goto done;
+	}
+
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_sent = send(fd, &batch,
+			       MIN(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_sent == -1 && errno == EINTR)
+			continue;
+		if (nr_sent == -1) {
+			err = -errno;
+			break;
+		}
+		bytes += nr_sent;
+	}
+
+	ASSERT_EQ(bytes, total_bytes, "send");
+
+done:
+	if (fd >= 0)
+		close(fd);
+	if (err) {
+		WRITE_ONCE(stop, 1);
+		return ERR_PTR(err);
+	}
+	return NULL;
+}
+
+static void do_test(char *qdisc)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+			    .attach_point = BPF_TC_QDISC,
+			    .parent = TC_H_ROOT,
+			    .handle = 0x8000000,
+			    .qdisc = qdisc);
+	struct sockaddr_in6 sa6 = {};
+	ssize_t nr_recv = 0, bytes = 0;
+	int lfd = -1, fd = -1;
+	pthread_t srv_thread;
+	socklen_t addrlen = sizeof(sa6);
+	void *thread_ret;
+	char batch[1500];
+	int err;
+
+	WRITE_ONCE(stop, 0);
+
+	err = bpf_tc_hook_create(&hook);
+	if (!ASSERT_OK(err, "attach qdisc"))
+		return;
+
+	lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_NEQ(lfd, -1, "socket")) {
+		bpf_tc_hook_destroy(&hook);
+		return;
+	}
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (!ASSERT_NEQ(fd, -1, "socket")) {
+		bpf_tc_hook_destroy(&hook);
+		close(lfd);
+		return;
+	}
+
+	if (settimeo(lfd, 0) || settimeo(fd, 0))
+		goto done;
+
+	err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+	if (!ASSERT_NEQ(err, -1, "getsockname"))
+		goto done;
+
+	/* connect to server */
+	err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+	if (!ASSERT_NEQ(err, -1, "connect"))
+		goto done;
+
+	err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+	if (!ASSERT_OK(err, "pthread_create"))
+		goto done;
+
+	/* recv total_bytes */
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_recv = recv(fd, &batch,
+			       MIN(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_recv == -1 && errno == EINTR)
+			continue;
+		if (nr_recv == -1)
+			break;
+		bytes += nr_recv;
+	}
+
+	ASSERT_EQ(bytes, total_bytes, "recv");
+
+	WRITE_ONCE(stop, 1);
+	pthread_join(srv_thread, &thread_ret);
+	ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
+
+done:
+	close(lfd);
+	close(fd);
+
+	bpf_tc_hook_destroy(&hook);
+	return;
+}
+
+static void test_fifo(void)
+{
+	struct bpf_qdisc_fifo *fifo_skel;
+	struct bpf_link *link;
+
+	fifo_skel = bpf_qdisc_fifo__open_and_load();
+	if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+		bpf_qdisc_fifo__destroy(fifo_skel);
+		return;
+	}
+
+	do_test("bpf_fifo");
+
+	bpf_link__destroy(link);
+	bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+void test_bpf_qdisc(void)
+{
+	if (test__start_subtest("fifo"))
+		test_fifo();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
new file mode 100644
index 000000000000..62a778f94908
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -0,0 +1,27 @@
+#ifndef _BPF_QDISC_COMMON_H
+#define _BPF_QDISC_COMMON_H
+
+#define NET_XMIT_SUCCESS        0x00
+#define NET_XMIT_DROP           0x01    /* skb dropped                  */
+#define NET_XMIT_CN             0x02    /* congestion notification      */
+
+#define TC_PRIO_CONTROL  7
+#define TC_PRIO_MAX      15
+
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_kfree_skb(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
+
+static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
+{
+	return (struct qdisc_skb_cb *)skb->cb;
+}
+
+static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb)
+{
+	return qdisc_skb_cb(skb)->pkt_len;
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
new file mode 100644
index 000000000000..705e7da325da
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
@@ -0,0 +1,117 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct skb_node {
+	struct sk_buff __kptr * skb;
+	struct bpf_list_node node;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock q_fifo_lock;
+private(A) struct bpf_list_head q_fifo __contains(skb_node, node);
+
+SEC("struct_ops/bpf_fifo_enqueue")
+int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+	     struct bpf_sk_buff_ptr *to_free)
+{
+	struct skb_node *skbn;
+	u32 pkt_len;
+
+	if (sch->q.qlen == sch->limit)
+		goto drop;
+
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn)
+		goto drop;
+
+	pkt_len = qdisc_pkt_len(skb);
+
+	sch->q.qlen++;
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (skb)
+		bpf_qdisc_skb_drop(skb, to_free);
+
+	bpf_spin_lock(&q_fifo_lock);
+	bpf_list_push_back(&q_fifo, &skbn->node);
+	bpf_spin_unlock(&q_fifo_lock);
+
+	sch->qstats.backlog += pkt_len;
+	return NET_XMIT_SUCCESS;
+drop:
+	bpf_qdisc_skb_drop(skb, to_free);
+	return NET_XMIT_DROP;
+}
+
+SEC("struct_ops/bpf_fifo_dequeue")
+struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
+{
+	struct bpf_list_node *node;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+
+	bpf_spin_lock(&q_fifo_lock);
+	node = bpf_list_pop_front(&q_fifo);
+	bpf_spin_unlock(&q_fifo_lock);
+	if (!node)
+		return NULL;
+
+	skbn = container_of(node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+	if (!skb)
+		return NULL;
+
+	sch->qstats.backlog -= qdisc_pkt_len(skb);
+	bpf_qdisc_bstats_update(sch, skb);
+	sch->q.qlen--;
+
+	return skb;
+}
+
+SEC("struct_ops/bpf_fifo_init")
+int BPF_PROG(bpf_fifo_init, struct Qdisc *sch, struct nlattr *opt,
+	     struct netlink_ext_ack *extack)
+{
+	sch->limit = 1000;
+	return 0;
+}
+
+SEC("struct_ops/bpf_fifo_reset")
+void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
+{
+	struct bpf_list_node *node;
+	struct skb_node *skbn;
+	int i;
+
+	bpf_for(i, 0, sch->q.qlen) {
+		struct sk_buff *skb = NULL;
+
+		bpf_spin_lock(&q_fifo_lock);
+		node = bpf_list_pop_front(&q_fifo);
+		bpf_spin_unlock(&q_fifo_lock);
+
+		if (!node)
+			break;
+
+		skbn = container_of(node, struct skb_node, node);
+		skb = bpf_kptr_xchg(&skbn->skb, skb);
+		if (skb)
+			bpf_kfree_skb(skb);
+		bpf_obj_drop(skbn);
+	}
+	sch->q.qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fifo = {
+	.enqueue   = (void *)bpf_fifo_enqueue,
+	.dequeue   = (void *)bpf_fifo_dequeue,
+	.init      = (void *)bpf_fifo_init,
+	.reset     = (void *)bpf_fifo_reset,
+	.id        = "bpf_fifo",
+};
+
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH bpf-next v1 13/13] selftests: Add a bpf fq qdisc to selftest
  2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
                   ` (11 preceding siblings ...)
  2024-12-13 23:29 ` [PATCH bpf-next v1 12/13] selftests: Add a basic fifo qdisc test Amery Hung
@ 2024-12-13 23:29 ` Amery Hung
  12 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-13 23:29 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung

This test implements a more sophisticated qdisc using bpf. The bpf
fair-queueing (fq) qdisc gives each flow an equal chance to transmit
data. It also respects the skb timestamp for rate limiting.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../selftests/bpf/prog_tests/bpf_qdisc.c      |  24 +
 .../selftests/bpf/progs/bpf_qdisc_fq.c        | 726 ++++++++++++++++++
 2 files changed, 750 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 295d0216e70f..394bf5a4adae 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -4,6 +4,7 @@
 
 #include "network_helpers.h"
 #include "bpf_qdisc_fifo.skel.h"
+#include "bpf_qdisc_fq.skel.h"
 
 #ifndef ENOTSUPP
 #define ENOTSUPP 524
@@ -154,8 +155,31 @@ static void test_fifo(void)
 	bpf_qdisc_fifo__destroy(fifo_skel);
 }
 
+static void test_fq(void)
+{
+	struct bpf_qdisc_fq *fq_skel;
+	struct bpf_link *link;
+
+	fq_skel = bpf_qdisc_fq__open_and_load();
+	if (!ASSERT_OK_PTR(fq_skel, "bpf_qdisc_fq__open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(fq_skel->maps.fq);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+		bpf_qdisc_fq__destroy(fq_skel);
+		return;
+	}
+
+	do_test("bpf_fq");
+
+	bpf_link__destroy(link);
+	bpf_qdisc_fq__destroy(fq_skel);
+}
+
 void test_bpf_qdisc(void)
 {
 	if (test__start_subtest("fifo"))
 		test_fifo();
+	if (test__start_subtest("fq"))
+		test_fq();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
new file mode 100644
index 000000000000..38a72fde3c5a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
@@ -0,0 +1,726 @@
+#include <vmlinux.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define NSEC_PER_USEC 1000L
+#define NSEC_PER_SEC 1000000000L
+
+#define NUM_QUEUE (1 << 20)
+
+struct fq_bpf_data {
+	u32 quantum;
+	u32 initial_quantum;
+	u32 flow_refill_delay;
+	u32 flow_plimit;
+	u64 horizon;
+	u32 orphan_mask;
+	u32 timer_slack;
+	u64 time_next_delayed_flow;
+	u64 unthrottle_latency_ns;
+	u8 horizon_drop;
+	u32 new_flow_cnt;
+	u32 old_flow_cnt;
+	u64 ktime_cache;
+};
+
+enum {
+	CLS_RET_PRIO	= 0,
+	CLS_RET_NONPRIO = 1,
+	CLS_RET_ERR	= 2,
+};
+
+struct skb_node {
+	u64 tstamp;
+	struct sk_buff __kptr * skb;
+	struct bpf_rb_node node;
+};
+
+struct fq_flow_node {
+	int credit;
+	u32 qlen;
+	u64 age;
+	u64 time_next_packet;
+	struct bpf_list_node list_node;
+	struct bpf_rb_node rb_node;
+	struct bpf_rb_root queue __contains(skb_node, node);
+	struct bpf_spin_lock lock;
+	struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+	bool stop_iter;
+	u64 expire;
+	u64 now;
+};
+
+struct remove_flows_ctx {
+	bool gc_only;
+	u32 reset_cnt;
+	u32 reset_max;
+};
+
+struct unset_throttled_flows_ctx {
+	bool unset_all;
+	u64 now;
+};
+
+struct fq_stashed_flow {
+	struct fq_flow_node __kptr * flow;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, struct fq_stashed_flow);
+	__uint(max_entries, NUM_QUEUE);
+} fq_nonprio_flows SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, struct fq_stashed_flow);
+	__uint(max_entries, 1);
+} fq_prio_flows SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+private(D) struct fq_bpf_data q;
+
+/* Wrapper for bpf_kptr_xchg that expects NULL dst */
+static void bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+	void *ret;
+
+	ret = bpf_kptr_xchg(map_val, ptr);
+	if (ret)
+		bpf_obj_drop(ret);
+}
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct skb_node *skbn_a;
+	struct skb_node *skbn_b;
+
+	skbn_a = container_of(a, struct skb_node, node);
+	skbn_b = container_of(b, struct skb_node, node);
+
+	return skbn_a->tstamp < skbn_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct fq_flow_node *flow_a;
+	struct fq_flow_node *flow_b;
+
+	flow_a = container_of(a, struct fq_flow_node, rb_node);
+	flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+	return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_front(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+	*flow_cnt += 1;
+}
+
+static void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_back(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+	*flow_cnt += 1;
+}
+
+static void
+fq_flows_remove_front(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		      struct bpf_list_node **node, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	*node = bpf_list_pop_front(head);
+	bpf_spin_unlock(lock);
+	*flow_cnt -= 1;
+}
+
+static bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+	struct bpf_list_node *node;
+
+	bpf_spin_lock(lock);
+	node = bpf_list_pop_front(head);
+	if (node) {
+		bpf_list_push_front(head, node);
+		bpf_spin_unlock(lock);
+		return false;
+	}
+	bpf_spin_unlock(lock);
+
+	return true;
+}
+
+/* flow->age is used to denote the state of the flow (not-detached, detached, throttled)
+ * as well as the timestamp when the flow is detached.
+ *
+ * 0: not-detached
+ * 1 - (~0ULL-1): detached
+ * ~0ULL: throttled
+ */
+static void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+	flow->age = bpf_jiffies64();
+}
+
+static bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+	return flow->age != 0 && flow->age != ~0ULL;
+}
+
+static bool sk_listener(struct sock *sk)
+{
+	return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static void fq_gc(void);
+
+static int fq_new_flow(void *flow_map, struct fq_stashed_flow **sflow, u64 hash)
+{
+	struct fq_stashed_flow tmp = {};
+	struct fq_flow_node *flow;
+	int ret;
+
+	flow = bpf_obj_new(typeof(*flow));
+	if (!flow)
+		return -ENOMEM;
+
+	flow->credit = q.initial_quantum;
+	flow->qlen = 0;
+	flow->age = 1;
+	flow->time_next_packet = 0;
+
+	ret = bpf_map_update_elem(flow_map, &hash, &tmp, 0);
+	if (ret == -ENOMEM) {
+		fq_gc();
+		bpf_map_update_elem(&fq_nonprio_flows, &hash, &tmp, 0);
+	}
+
+	*sflow = bpf_map_lookup_elem(flow_map, &hash);
+	if (!*sflow) {
+		bpf_obj_drop(flow);
+		return -ENOMEM;
+	}
+
+	bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+	return 0;
+}
+
+static int
+fq_classify(struct sk_buff *skb, struct fq_stashed_flow **sflow)
+{
+	struct sock *sk = skb->sk;
+	int ret = CLS_RET_NONPRIO;
+	u64 hash = 0;
+
+	if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+		*sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+		ret = CLS_RET_PRIO;
+	} else {
+		if (!sk || sk_listener(sk)) {
+			hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+			/* Avoid collision with an existing flow hash, which
+			 * only uses the lower 32 bits of hash, by setting the
+			 * upper half of hash to 1.
+			 */
+			hash |= (1ULL << 32);
+		} else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+			hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+			hash |= (1ULL << 32);
+		} else {
+			hash = sk->__sk_common.skc_hash;
+		}
+		*sflow = bpf_map_lookup_elem(&fq_nonprio_flows, &hash);
+	}
+
+	if (!*sflow)
+		ret = fq_new_flow(&fq_nonprio_flows, sflow, hash) < 0 ?
+		      CLS_RET_ERR : CLS_RET_NONPRIO;
+
+	return ret;
+}
+
+static bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+	return (s64)skb->tstamp > (s64)(q.ktime_cache + q.horizon);
+}
+
+SEC("struct_ops/bpf_fq_enqueue")
+int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+	     struct bpf_sk_buff_ptr *to_free)
+{
+	struct fq_flow_node *flow = NULL, *flow_copy;
+	struct fq_stashed_flow *sflow;
+	u64 time_to_send, jiffies;
+	struct skb_node *skbn;
+	int ret;
+
+	if (sch->q.qlen >= sch->limit)
+		goto drop;
+
+	if (!skb->tstamp) {
+		time_to_send = q.ktime_cache = bpf_ktime_get_ns();
+	} else {
+		if (fq_packet_beyond_horizon(skb)) {
+			q.ktime_cache = bpf_ktime_get_ns();
+			if (fq_packet_beyond_horizon(skb)) {
+				if (q.horizon_drop)
+					goto drop;
+
+				skb->tstamp = q.ktime_cache + q.horizon;
+			}
+		}
+		time_to_send = skb->tstamp;
+	}
+
+	ret = fq_classify(skb, &sflow);
+	if (ret == CLS_RET_ERR)
+		goto drop;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		goto drop;
+
+	if (ret == CLS_RET_NONPRIO) {
+		if (flow->qlen >= q.flow_plimit) {
+			bpf_kptr_xchg_back(&sflow->flow, flow);
+			goto drop;
+		}
+
+		if (fq_flow_is_detached(flow)) {
+			flow_copy = bpf_refcount_acquire(flow);
+
+			jiffies = bpf_jiffies64();
+			if ((s64)(jiffies - (flow_copy->age + q.flow_refill_delay)) > 0) {
+				if (flow_copy->credit < q.quantum)
+					flow_copy->credit = q.quantum;
+			}
+			flow_copy->age = 0;
+			fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy,
+					  &q.new_flow_cnt);
+		}
+	}
+
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn) {
+		bpf_kptr_xchg_back(&sflow->flow, flow);
+		goto drop;
+	}
+
+	skbn->tstamp = skb->tstamp = time_to_send;
+
+	sch->qstats.backlog += qdisc_pkt_len(skb);
+
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (skb)
+		bpf_qdisc_skb_drop(skb, to_free);
+
+	bpf_spin_lock(&flow->lock);
+	bpf_rbtree_add(&flow->queue, &skbn->node, skbn_tstamp_less);
+	bpf_spin_unlock(&flow->lock);
+
+	flow->qlen++;
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	sch->q.qlen++;
+	return NET_XMIT_SUCCESS;
+
+drop:
+	bpf_qdisc_skb_drop(skb, to_free);
+	sch->qstats.drops++;
+	return NET_XMIT_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, struct unset_throttled_flows_ctx *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct fq_flow_node *flow;
+
+	bpf_spin_lock(&fq_delayed_lock);
+
+	node = bpf_rbtree_first(&fq_delayed);
+	if (!node) {
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+	if (!ctx->unset_all && flow->time_next_packet > ctx->now) {
+		q.time_next_delayed_flow = flow->time_next_packet;
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+	bpf_spin_unlock(&fq_delayed_lock);
+
+	if (!node)
+		return 1;
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+	flow->age = 0;
+	fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+
+	return 0;
+}
+
+static void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+	flow->age = ~0ULL;
+
+	if (q.time_next_delayed_flow > flow->time_next_packet)
+		q.time_next_delayed_flow = flow->time_next_packet;
+
+	bpf_spin_lock(&fq_delayed_lock);
+	bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+	bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static void fq_check_throttled(u64 now)
+{
+	struct unset_throttled_flows_ctx ctx = {
+		.unset_all = false,
+		.now = now,
+	};
+	unsigned long sample;
+
+	if (q.time_next_delayed_flow > now)
+		return;
+
+	sample = (unsigned long)(now - q.time_next_delayed_flow);
+	q.unthrottle_latency_ns -= q.unthrottle_latency_ns >> 3;
+	q.unthrottle_latency_ns += sample >> 3;
+
+	q.time_next_delayed_flow = ~0ULL;
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &ctx, 0);
+}
+
+static struct sk_buff*
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+	u64 time_next_packet, time_to_send;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+	struct bpf_list_head *head;
+	struct bpf_list_node *node;
+	struct bpf_spin_lock *lock;
+	struct fq_flow_node *flow;
+	struct skb_node *skbn;
+	bool is_empty;
+	u32 *cnt;
+
+	if (q.new_flow_cnt) {
+		head = &fq_new_flows;
+		lock = &fq_new_flows_lock;
+		cnt = &q.new_flow_cnt;
+	} else if (q.old_flow_cnt) {
+		head = &fq_old_flows;
+		lock = &fq_old_flows_lock;
+		cnt = &q.old_flow_cnt;
+	} else {
+		if (q.time_next_delayed_flow != ~0ULL)
+			ctx->expire = q.time_next_delayed_flow;
+		goto break_loop;
+	}
+
+	fq_flows_remove_front(head, lock, &node, cnt);
+	if (!node)
+		goto break_loop;
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	if (flow->credit <= 0) {
+		flow->credit += q.quantum;
+		fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+		return NULL;
+	}
+
+	bpf_spin_lock(&flow->lock);
+	rb_node = bpf_rbtree_first(&flow->queue);
+	if (!rb_node) {
+		bpf_spin_unlock(&flow->lock);
+		is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+		if (head == &fq_new_flows && !is_empty) {
+			fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+		} else {
+			fq_flow_set_detached(flow);
+			bpf_obj_drop(flow);
+		}
+		return NULL;
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	time_to_send = skbn->tstamp;
+
+	time_next_packet = (time_to_send > flow->time_next_packet) ?
+		time_to_send : flow->time_next_packet;
+	if (ctx->now < time_next_packet) {
+		bpf_spin_unlock(&flow->lock);
+		flow->time_next_packet = time_next_packet;
+		fq_flow_set_throttled(flow);
+		return NULL;
+	}
+
+	rb_node = bpf_rbtree_remove(&flow->queue, rb_node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!rb_node)
+		goto add_flow_and_break;
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+
+	if (!skb)
+		goto add_flow_and_break;
+
+	flow->credit -= qdisc_skb_cb(skb)->pkt_len;
+	flow->qlen--;
+
+add_flow_and_break:
+	fq_flows_add_head(head, lock, flow, cnt);
+
+break_loop:
+	ctx->stop_iter = true;
+	return skb;
+}
+
+static struct sk_buff *fq_dequeue_prio(void)
+{
+	struct fq_flow_node *flow = NULL;
+	struct fq_stashed_flow *sflow;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+	u64 hash = 0;
+
+	sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+	if (!sflow)
+		return NULL;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		return NULL;
+
+	bpf_spin_lock(&flow->lock);
+	rb_node = bpf_rbtree_first(&flow->queue);
+	if (!rb_node) {
+		bpf_spin_unlock(&flow->lock);
+		goto out;
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!rb_node)
+		goto out;
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+
+out:
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	return skb;
+}
+
+SEC("struct_ops/bpf_fq_dequeue")
+struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
+{
+	struct dequeue_nonprio_ctx cb_ctx = {};
+	struct sk_buff *skb = NULL;
+	int i;
+
+	if (!sch->q.qlen)
+		goto out;
+
+	skb = fq_dequeue_prio();
+	if (skb)
+		goto dequeue;
+
+	q.ktime_cache = cb_ctx.now = bpf_ktime_get_ns();
+	fq_check_throttled(q.ktime_cache);
+	bpf_for(i, 0, sch->limit) {
+		skb = fq_dequeue_nonprio_flows(i, &cb_ctx);
+		if (cb_ctx.stop_iter)
+			break;
+	}
+
+dequeue:
+	if (skb) {
+		sch->q.qlen--;
+		sch->qstats.backlog -= qdisc_pkt_len(skb);
+		bpf_qdisc_bstats_update(sch, skb);
+		return skb;
+	}
+
+	if (cb_ctx.expire)
+		bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q.timer_slack);
+out:
+	return NULL;
+}
+
+static int fq_remove_flows_in_list(u32 index, void *ctx)
+{
+	struct bpf_list_node *node;
+	struct fq_flow_node *flow;
+
+	bpf_spin_lock(&fq_new_flows_lock);
+	node = bpf_list_pop_front(&fq_new_flows);
+	bpf_spin_unlock(&fq_new_flows_lock);
+	if (!node) {
+		bpf_spin_lock(&fq_old_flows_lock);
+		node = bpf_list_pop_front(&fq_old_flows);
+		bpf_spin_unlock(&fq_old_flows_lock);
+		if (!node)
+			return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	bpf_obj_drop(flow);
+
+	return 0;
+}
+
+extern unsigned CONFIG_HZ __kconfig;
+
+/* limit number of collected flows per round */
+#define FQ_GC_MAX 8
+#define FQ_GC_AGE (3*CONFIG_HZ)
+
+static bool fq_gc_candidate(struct fq_flow_node *flow)
+{
+	u64 jiffies = bpf_jiffies64();
+
+	return fq_flow_is_detached(flow) &&
+	       ((s64)(jiffies - (flow->age + FQ_GC_AGE)) > 0);
+}
+
+static int
+fq_remove_flows(struct bpf_map *flow_map, u64 *hash,
+		struct fq_stashed_flow *sflow, struct remove_flows_ctx *ctx)
+{
+	struct fq_flow_node *flow = NULL;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (flow) {
+		if (!ctx->gc_only || fq_gc_candidate(flow)) {
+			bpf_obj_drop(flow);
+			ctx->reset_cnt++;
+		} else {
+			bpf_kptr_xchg_back(&sflow->flow, flow);
+		}
+	}
+
+	return ctx->reset_cnt < ctx->reset_max ? 0 : 1;
+}
+
+static void fq_gc(void)
+{
+	struct remove_flows_ctx cb_ctx = {
+		.gc_only = true,
+		.reset_cnt = 0,
+		.reset_max = FQ_GC_MAX,
+	};
+
+	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &cb_ctx, 0);
+}
+
+SEC("struct_ops/bpf_fq_reset")
+void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
+{
+	struct unset_throttled_flows_ctx utf_ctx = {
+		.unset_all = true,
+	};
+	struct remove_flows_ctx rf_ctx = {
+		.gc_only = false,
+		.reset_cnt = 0,
+		.reset_max = NUM_QUEUE,
+	};
+	struct fq_stashed_flow *sflow;
+	u64 hash = 0;
+
+	sch->q.qlen = 0;
+	sch->qstats.backlog = 0;
+
+	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &rf_ctx, 0);
+
+	rf_ctx.reset_cnt = 0;
+	bpf_for_each_map_elem(&fq_prio_flows, fq_remove_flows, &rf_ctx, 0);
+	fq_new_flow(&fq_prio_flows, &sflow, hash);
+
+	bpf_loop(NUM_QUEUE, fq_remove_flows_in_list, NULL, 0);
+	q.new_flow_cnt = 0;
+	q.old_flow_cnt = 0;
+
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &utf_ctx, 0);
+
+	return;
+}
+
+SEC("struct_ops/bpf_fq_init")
+int BPF_PROG(bpf_fq_init, struct Qdisc *sch, struct nlattr *opt,
+	     struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = sch->dev_queue->dev;
+	u32 psched_mtu = dev->mtu + dev->hard_header_len;
+	struct fq_stashed_flow *sflow;
+	u64 hash = 0;
+
+	if (fq_new_flow(&fq_prio_flows, &sflow, hash) < 0)
+		return -ENOMEM;
+
+	sch->limit = 10000;
+	q.initial_quantum = 10 * psched_mtu;
+	q.quantum = 2 * psched_mtu;
+	q.flow_refill_delay = 40;
+	q.flow_plimit = 100;
+	q.horizon = 10ULL * NSEC_PER_SEC;
+	q.horizon_drop = 1;
+	q.orphan_mask = 1024 - 1;
+	q.timer_slack = 10 * NSEC_PER_USEC;
+	q.time_next_delayed_flow = ~0ULL;
+	q.unthrottle_latency_ns = 0ULL;
+	q.new_flow_cnt = 0;
+	q.old_flow_cnt = 0;
+
+	return 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fq = {
+	.enqueue   = (void *)bpf_fq_enqueue,
+	.dequeue   = (void *)bpf_fq_dequeue,
+	.reset     = (void *)bpf_fq_reset,
+	.init      = (void *)bpf_fq_init,
+	.id        = "bpf_fq",
+};
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-13 23:29 ` [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-12-14  4:51   ` Cong Wang
  2024-12-18 23:37   ` Martin KaFai Lau
  1 sibling, 0 replies; 35+ messages in thread
From: Cong Wang @ 2024-12-14  4:51 UTC (permalink / raw)
  To: Amery Hung
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, yepeilin.cs, ameryhung

On Fri, Dec 13, 2024 at 11:29:50PM +0000, Amery Hung wrote:
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 4214e76c9168..eb16218fdf52 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -563,6 +563,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
>  const char *btf_str_by_offset(const struct btf *btf, u32 offset);
>  struct btf *btf_parse_vmlinux(void);
>  struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
> +u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
>  u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
>  			       const struct bpf_prog *prog);
>  u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index a05ccf9ee032..f733dbf24261 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -6375,8 +6375,8 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
>  	return btf_type_is_int(t);
>  }
>  
> -static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
> -			   int off)
> +u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
> +		    int off)
>  {
>  	const struct btf_param *args;
>  	const struct btf_type *t;

Maybe separate this piece out as a separate patch?


> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 8180d0c12fce..ccd0255da5a5 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -403,6 +403,18 @@ config NET_SCH_ETS
>  
>  	  If unsure, say N.
>  
> +config NET_SCH_BPF
> +	bool "BPF-based Qdisc"
> +	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF

I think new features should default to n, unless you have a reason not
to do so.

> +	help
> +	  This option allows BPF-based queueing disciplines. With BPF struct_ops,
> +	  users can implement supported operators in Qdisc_ops using BPF programs.
> +	  The queue holding skb can be built with BPF maps or graphs.
> +
> +	  Say Y here if you want to use BPF-based Qdisc.
> +
> +	  If unsure, say N.
> +
>  menuconfig NET_SCH_DEFAULT
>  	bool "Allow override default queue discipline"
>  	help

[...]

> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 2eefa4783879..f074053c4232 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -25,6 +25,7 @@
>  #include <linux/hrtimer.h>
>  #include <linux/slab.h>
>  #include <linux/hashtable.h>
> +#include <linux/bpf.h>
>  
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
> @@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
>  		read_lock(&qdisc_mod_lock);
>  		for (q = qdisc_base; q; q = q->next) {
>  			if (nla_strcmp(kind, q->id) == 0) {
> -				if (!try_module_get(q->owner))
> +				if (!bpf_try_module_get(q, q->owner))
>  					q = NULL;
>  				break;
>  			}
> @@ -1287,7 +1288,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>  				/* We will try again qdisc_lookup_ops,
>  				 * so don't keep a reference.
>  				 */
> -				module_put(ops->owner);
> +				bpf_module_put(ops, ops->owner);
>  				err = -EAGAIN;
>  				goto err_out;
>  			}
> @@ -1398,7 +1399,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>  	netdev_put(dev, &sch->dev_tracker);
>  	qdisc_free(sch);
>  err_out2:
> -	module_put(ops->owner);
> +	bpf_module_put(ops, ops->owner);
>  err_out:
>  	*errp = err;
>  	return NULL;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 38ec18f73de4..1e770ec251a0 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -24,6 +24,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/skb_array.h>
>  #include <linux/if_macvlan.h>
> +#include <linux/bpf.h>
>  #include <net/sch_generic.h>
>  #include <net/pkt_sched.h>
>  #include <net/dst.h>
> @@ -1083,7 +1084,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
>  		ops->destroy(qdisc);
>  
>  	lockdep_unregister_key(&qdisc->root_lock_key);
> -	module_put(ops->owner);
> +	bpf_module_put(ops, ops->owner);
>  	netdev_put(dev, &qdisc->dev_tracker);
>  
>  	trace_qdisc_destroy(qdisc);

Maybe this piece should be separated out too? Ideally this patch should
only have bpf_qdisc.c and its Makefile and Kconfig changes.
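
(For readers following along: the bpf_try_module_get()/bpf_module_put()
pairs above are thin wrappers added by this series so that a
struct_ops-backed Qdisc_ops can be refcounted like a module-owned one.
Conceptually something along these lines, where BPF_MODULE_OWNER is a
sentinel owner value; this is a sketch, not the exact patch:)

	static inline bool bpf_try_module_get(void *data, struct module *owner)
	{
		if (owner == BPF_MODULE_OWNER)
			/* hold the struct_ops map backing this Qdisc_ops */
			return bpf_struct_ops_get(data);
		return try_module_get(owner);
	}

	static inline void bpf_module_put(void *data, struct module *owner)
	{
		if (owner == BPF_MODULE_OWNER)
			bpf_struct_ops_put(data);
		else
			module_put(owner);
	}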

Regards.


* Re: [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc
  2024-12-13 23:29 ` [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc Amery Hung
@ 2024-12-17 18:32   ` Andrii Nakryiko
  2024-12-17 19:08     ` Amery Hung
  0 siblings, 1 reply; 35+ messages in thread
From: Andrii Nakryiko @ 2024-12-17 18:32 UTC (permalink / raw)
  To: Amery Hung
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On Fri, Dec 13, 2024 at 3:30 PM Amery Hung <amery.hung@bytedance.com> wrote:
>
> Extend struct bpf_tc_hook with handle, qdisc name and a new attach type,
> BPF_TC_QDISC, to allow users to add or remove any qdisc specified in
> addition to clsact.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>  tools/lib/bpf/libbpf.h  |  5 ++++-
>  tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
>  2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index b2ce3a72b11d..b05d95814776 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -1268,6 +1268,7 @@ enum bpf_tc_attach_point {
>         BPF_TC_INGRESS = 1 << 0,
>         BPF_TC_EGRESS  = 1 << 1,
>         BPF_TC_CUSTOM  = 1 << 2,
> +       BPF_TC_QDISC   = 1 << 3,
>  };
>
>  #define BPF_TC_PARENT(a, b)    \
> @@ -1282,9 +1283,11 @@ struct bpf_tc_hook {
>         int ifindex;
>         enum bpf_tc_attach_point attach_point;
>         __u32 parent;
> +       __u32 handle;
> +       char *qdisc;

const char *?

>         size_t :0;
>  };
> -#define bpf_tc_hook__last_field parent
> +#define bpf_tc_hook__last_field qdisc
>
>  struct bpf_tc_opts {
>         size_t sz;
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 68a2def17175..72db8c0add21 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
>  }
>
>
> -typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
> +typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, struct bpf_tc_hook *hook);

should hook pointer be const?

>
> -static int clsact_config(struct libbpf_nla_req *req)
> +static int clsact_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)

const?

>  {
>         req->tc.tcm_parent = TC_H_CLSACT;
>         req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
> @@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
>         return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
>  }
>
> +static int qdisc_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)

same, const, it's not written into, right?

> +{
> +       char *qdisc = OPTS_GET(hook, qdisc, NULL);
> +
> +       req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
> +       req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
> +
> +       return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
> +}
> +
>  static int attach_point_to_config(struct bpf_tc_hook *hook,
>                                   qdisc_config_t *config)
>  {
> @@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
>                 return 0;
>         case BPF_TC_CUSTOM:
>                 return -EOPNOTSUPP;
> +       case BPF_TC_QDISC:
> +               *config = &qdisc_config;
> +               return 0;
>         default:
>                 return -EINVAL;
>         }
> @@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
>         req.tc.tcm_family  = AF_UNSPEC;
>         req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
>
> -       ret = config(&req);
> +       ret = config(&req, hook);
>         if (ret < 0)
>                 return ret;
>
> @@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
>         case BPF_TC_INGRESS:
>         case BPF_TC_EGRESS:
>                 return libbpf_err(__bpf_tc_detach(hook, NULL, true));
> +       case BPF_TC_QDISC:
>         case BPF_TC_INGRESS | BPF_TC_EGRESS:
>                 return libbpf_err(tc_qdisc_delete(hook));
>         case BPF_TC_CUSTOM:
> --
> 2.20.1
>


* Re: [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc
  2024-12-17 18:32   ` Andrii Nakryiko
@ 2024-12-17 19:08     ` Amery Hung
  0 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-17 19:08 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Amery Hung, netdev, bpf, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

You are right. I will add const to the hook argument and bpf_tc_hook->qdisc.
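
With that, the declarations would presumably end up as:

	struct bpf_tc_hook {
		size_t sz;
		int ifindex;
		enum bpf_tc_attach_point attach_point;
		__u32 parent;
		__u32 handle;
		const char *qdisc;
		size_t :0;
	};

	typedef int (*qdisc_config_t)(struct libbpf_nla_req *req,
				      const struct bpf_tc_hook *hook);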

Thanks,
Amery

On Tue, Dec 17, 2024 at 10:32 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 3:30 PM Amery Hung <amery.hung@bytedance.com> wrote:
> >
> > Extend struct bpf_tc_hook with handle, qdisc name and a new attach type,
> > BPF_TC_QDISC, to allow users to add or remove any qdisc specified in
> > addition to clsact.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >  tools/lib/bpf/libbpf.h  |  5 ++++-
> >  tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
> >  2 files changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > index b2ce3a72b11d..b05d95814776 100644
> > --- a/tools/lib/bpf/libbpf.h
> > +++ b/tools/lib/bpf/libbpf.h
> > @@ -1268,6 +1268,7 @@ enum bpf_tc_attach_point {
> >         BPF_TC_INGRESS = 1 << 0,
> >         BPF_TC_EGRESS  = 1 << 1,
> >         BPF_TC_CUSTOM  = 1 << 2,
> > +       BPF_TC_QDISC   = 1 << 3,
> >  };
> >
> >  #define BPF_TC_PARENT(a, b)    \
> > @@ -1282,9 +1283,11 @@ struct bpf_tc_hook {
> >         int ifindex;
> >         enum bpf_tc_attach_point attach_point;
> >         __u32 parent;
> > +       __u32 handle;
> > +       char *qdisc;
>
> const char *?
>
> >         size_t :0;
> >  };
> > -#define bpf_tc_hook__last_field parent
> > +#define bpf_tc_hook__last_field qdisc
> >
> >  struct bpf_tc_opts {
> >         size_t sz;
> > diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> > index 68a2def17175..72db8c0add21 100644
> > --- a/tools/lib/bpf/netlink.c
> > +++ b/tools/lib/bpf/netlink.c
> > @@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
> >  }
> >
> >
> > -typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
> > +typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, struct bpf_tc_hook *hook);
>
> should hook pointer be const?
>
> >
> > -static int clsact_config(struct libbpf_nla_req *req)
> > +static int clsact_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
>
> const?
>
> >  {
> >         req->tc.tcm_parent = TC_H_CLSACT;
> >         req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
> > @@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
> >         return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
> >  }
> >
> > +static int qdisc_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
>
> same, const, it's not written into, right?
>
> > +{
> > +       char *qdisc = OPTS_GET(hook, qdisc, NULL);
> > +
> > +       req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
> > +       req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
> > +
> > +       return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
> > +}
> > +
> >  static int attach_point_to_config(struct bpf_tc_hook *hook,
> >                                   qdisc_config_t *config)
> >  {
> > @@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
> >                 return 0;
> >         case BPF_TC_CUSTOM:
> >                 return -EOPNOTSUPP;
> > +       case BPF_TC_QDISC:
> > +               *config = &qdisc_config;
> > +               return 0;
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
> >         req.tc.tcm_family  = AF_UNSPEC;
> >         req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
> >
> > -       ret = config(&req);
> > +       ret = config(&req, hook);
> >         if (ret < 0)
> >                 return ret;
> >
> > @@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
> >         case BPF_TC_INGRESS:
> >         case BPF_TC_EGRESS:
> >                 return libbpf_err(__bpf_tc_detach(hook, NULL, true));
> > +       case BPF_TC_QDISC:
> >         case BPF_TC_INGRESS | BPF_TC_EGRESS:
> >                 return libbpf_err(tc_qdisc_delete(hook));
> >         case BPF_TC_CUSTOM:
> > --
> > 2.20.1
> >


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-13 23:29 ` [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2024-12-18  0:58   ` Martin KaFai Lau
  2024-12-18  1:24     ` Alexei Starovoitov
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-18  0:58 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:
> Allows struct_ops programs to acquire referenced kptrs from arguments
> by directly reading the argument.
> 
> The verifier will acquire a reference for a struct_ops argument tagged
> with "__ref" in the stub function in the beginning of the main program.
> The user will be able to access the referenced kptr directly by reading
> the context as long as it has not been released by the program.
> 
> This new mechanism to acquire referenced kptr (compared to the existing
> "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> first argument. This mechanism provides a natural way for users to get a
> referenced kptr in the .enqueue struct_ops programs and makes sure that a
> qdisc will always enqueue or drop the skb.
> 
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>   include/linux/bpf.h         |  3 +++
>   kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
>   kernel/bpf/btf.c            |  1 +
>   kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
>   4 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1b84613b10ac..72bf941d1daf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
>   		struct {
>   			struct btf *btf;
>   			u32 btf_id;
> +			u32 ref_obj_id;
>   		};
>   	};
>   	struct bpf_verifier_log *log; /* for verbose logs */
> @@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
>   	enum bpf_reg_type reg_type;
>   	struct btf *btf;
>   	u32 btf_id;
> +	u32 ref_obj_id;
> +	bool refcounted;
>   };
>   
>   struct btf_mod_pair {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index fda3dd2ee984..6e7795744f6a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
>   }
>   
>   #define MAYBE_NULL_SUFFIX "__nullable"
> +#define REFCOUNTED_SUFFIX "__ref"
>   #define MAX_STUB_NAME 128
>   
>   /* Return the type info of a stub function, if it exists.
> @@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
>   			    struct bpf_struct_ops_arg_info *arg_info)
>   {
>   	const struct btf_type *stub_func_proto, *pointed_type;
> +	bool is_nullable = false, is_refcounted = false;
>   	const struct btf_param *stub_args, *args;
>   	struct bpf_ctx_arg_aux *info, *info_buf;
>   	u32 nargs, arg_no, info_cnt = 0;
> +	const char *suffix;
>   	u32 arg_btf_id;
>   	int offset;
>   
> @@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
>   	info = info_buf;
>   	for (arg_no = 0; arg_no < nargs; arg_no++) {
>   		/* Skip arguments that is not suffixed with
> -		 * "__nullable".
> +		 * "__nullable" or "__ref".
>   		 */
> -		if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> -					    MAYBE_NULL_SUFFIX))
> +		is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> +						     MAYBE_NULL_SUFFIX);
> +		is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
> +						       REFCOUNTED_SUFFIX);
> +		if (!is_nullable && !is_refcounted)
>   			continue;
>   
> +		if (is_nullable)
> +			suffix = MAYBE_NULL_SUFFIX;
> +		else if (is_refcounted)
> +			suffix = REFCOUNTED_SUFFIX;
>   		/* Should be a pointer to struct */
>   		pointed_type = btf_type_resolve_ptr(btf,
>   						    args[arg_no].type,
> @@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
>   		if (!pointed_type ||
>   		    !btf_type_is_struct(pointed_type)) {
>   			pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
> -				st_ops_name, member_name, MAYBE_NULL_SUFFIX);
> +				st_ops_name, member_name, suffix);
>   			goto err_out;
>   		}
>   
> @@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
>   		}
>   
>   		/* Fill the information of the new argument */
> -		info->reg_type =
> -			PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
>   		info->btf_id = arg_btf_id;
>   		info->btf = btf;
>   		info->offset = offset;
> +		if (is_nullable) {
> +			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> +		} else if (is_refcounted) {
> +			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> +			info->refcounted = true;
> +		}
>   
>   		info++;
>   		info_cnt++;
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index e7a59e6462a9..a05ccf9ee032 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
>   			info->reg_type = ctx_arg_info->reg_type;
>   			info->btf = ctx_arg_info->btf ? : btf_vmlinux;
>   			info->btf_id = ctx_arg_info->btf_id;
> +			info->ref_obj_id = ctx_arg_info->ref_obj_id;
>   			return true;
>   		}
>   	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9f5de8d4fbd0..69753096075f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
>   	return -EINVAL;
>   }
>   
> +static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < state->acquired_refs; i++)
> +		if (state->refs[i].id == ptr_id)
> +			return true;
> +
> +	return false;
> +}
> +
>   static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
>   {
>   	int i, last_idx;
> @@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
>   /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
>   static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
>   			    enum bpf_access_type t, enum bpf_reg_type *reg_type,
> -			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
> +			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
> +			    u32 *ref_obj_id)
>   {
>   	struct bpf_insn_access_aux info = {
>   		.reg_type = *reg_type,
> @@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
>   		*is_retval = info.is_retval;
>   
>   		if (base_type(*reg_type) == PTR_TO_BTF_ID) {
> +			if (info.ref_obj_id &&
> +			    !find_reference_state(cur_func(env), info.ref_obj_id)) {
> +				verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
> +					off);
> +				return -EACCES;
> +			}
> +
>   			*btf = info.btf;
>   			*btf_id = info.btf_id;
> +			*ref_obj_id = info.ref_obj_id;
>   		} else {
>   			env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
>   		}
> @@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>   		struct bpf_retval_range range;
>   		enum bpf_reg_type reg_type = SCALAR_VALUE;
>   		struct btf *btf = NULL;
> -		u32 btf_id = 0;
> +		u32 btf_id = 0, ref_obj_id = 0;
>   
>   		if (t == BPF_WRITE && value_regno >= 0 &&
>   		    is_pointer_value(env, value_regno)) {
> @@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>   			return err;
>   
>   		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
> -				       &btf_id, &is_retval, is_ldsx);
> +				       &btf_id, &is_retval, is_ldsx, &ref_obj_id);
>   		if (err)
>   			verbose_linfo(env, insn_idx, "; ");
>   		if (!err && t == BPF_READ && value_regno >= 0) {
> @@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>   				if (base_type(reg_type) == PTR_TO_BTF_ID) {
>   					regs[value_regno].btf = btf;
>   					regs[value_regno].btf_id = btf_id;
> +					regs[value_regno].ref_obj_id = ref_obj_id;
>   				}
>   			}
>   			regs[value_regno].type = reg_type;
> @@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
>   {
>   	bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
>   	struct bpf_subprog_info *sub = subprog_info(env, subprog);
> +	struct bpf_ctx_arg_aux *ctx_arg_info;
>   	struct bpf_verifier_state *state;
>   	struct bpf_reg_state *regs;
>   	int ret, i;
> @@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
>   		mark_reg_known_zero(env, regs, BPF_REG_1);
>   	}
>   
> +	if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> +		ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> +		for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> +			if (ctx_arg_info[i].refcounted)
> +				ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);

There is a conflict with bpf-next/master: acquire_reference_state has been 
refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c 
changes, they should not conflict with net-next/main. I would suggest 
rebasing this set on bpf-next/master.

At first glance, the ref_obj_id assignment looks racy because ctx_arg_info 
is shared by different bpf progs that may be verified in parallel. On second 
thought, this should be fine because it should always end up with the same 
ref_obj_id for the same arg-no, right? Not sure if UBSAN can understand this 
without READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using ref_obj_id 
would be quite puzzling when reading the verifier code. Any better idea?

Other than subprogs, afaik, a bpf prog triggered by bpf_tail_call can 
also take the 'u64 *ctx' array. Maybe disallow using tail_call in all ops of 
the bpf qdisc. env->subprog_info[i].has_tail_call already tracks whether 
tail_call is used.
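
Something along these lines, presumably (rough sketch, untested; the
prog_has_ref_ctx_arg() helper is assumed):

	if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
	    prog_has_ref_ctx_arg(env->prog)) {
		for (i = 0; i < env->subprog_cnt; i++) {
			if (env->subprog_info[i].has_tail_call) {
				verbose(env, "tail_call is not allowed with a __ref ctx arg\n");
				return -EOPNOTSUPP;
			}
		}
	}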

> +	}
> +
>   	ret = do_check(env);
>   out:
>   	/* check for NULL is necessary, since cur_state can be freed inside



* Re: [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-13 23:29 ` [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
@ 2024-12-18  1:17   ` Martin KaFai Lau
  2024-12-18 16:10     ` Amery Hung
  2024-12-19  3:40   ` Yonghong Song
  1 sibling, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-18  1:17 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:

> +void test_struct_ops_refcounted(void)
> +{
> +	if (test__start_subtest("refcounted"))
> +		refcounted();
> +	if (test__start_subtest("refcounted_fail__ref_leak"))
> +		refcounted_fail__ref_leak();

test_loader.c could make writing this test easier and it can also test the 
verifier failure message. e.g. for the ref_leak test, the following should do:

	RUN_TESTS(struct_ops_refcounted_fail__ref_leak);

The same for the other subtests in this patch.
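
i.e. the whole test body could collapse to something like this (a
sketch; skeleton names assumed from the .c files in this patch):

	#include <test_progs.h>
	#include "struct_ops_refcounted.skel.h"
	#include "struct_ops_refcounted_fail__ref_leak.skel.h"
	#include "struct_ops_refcounted_fail__global_subprog.skel.h"

	void test_struct_ops_refcounted(void)
	{
		RUN_TESTS(struct_ops_refcounted);
		RUN_TESTS(struct_ops_refcounted_fail__ref_leak);
		RUN_TESTS(struct_ops_refcounted_fail__global_subprog);
	}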
	
> +	if (test__start_subtest("refcounted_fail__global_subprog"))
> +		refcounted_fail__global_subprog();
> +}

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> new file mode 100644
> index 000000000000..6e82859eb187
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> @@ -0,0 +1,17 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../bpf_testmod/bpf_testmod.h"
> +
> +char _license[] SEC("license") = "GPL";
> +

+#include "bpf_misc.h"

+__failure __msg("Unreleased reference")
> +SEC("struct_ops/test_refcounted")
> +int BPF_PROG(test_refcounted, int dummy,
> +	     struct task_struct *task)
> +{
> +	return 0;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_ref_acquire = {
> +	.test_refcounted = (void *)test_refcounted,
> +};

[ I will stop here for today and will continue the rest tomorrow. ]


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18  0:58   ` Martin KaFai Lau
@ 2024-12-18  1:24     ` Alexei Starovoitov
  2024-12-18 16:09       ` Amery Hung
  2024-12-18  1:44     ` Jakub Kicinski
  2024-12-18 16:57     ` Amery Hung
  2 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2024-12-18  1:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, Network Development, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Kui-Feng Lee,
	Toke Høiland-Jørgensen, Jamal Hadi Salim, Jiri Pirko,
	stfomichev, ekarani.silvestre, yangpeihao, Cong Wang, Peilin Ye,
	Amery Hung

On Tue, Dec 17, 2024 at 4:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
> > Allows struct_ops programs to acquire referenced kptrs from arguments
> > by directly reading the argument.
> >
> > The verifier will acquire a reference for a struct_ops argument tagged
> > with "__ref" in the stub function in the beginning of the main program.
> > The user will be able to access the referenced kptr directly by reading
> > the context as long as it has not been released by the program.
> >
> > This new mechanism to acquire referenced kptr (compared to the existing
> > "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> > In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> > first argument. This mechanism provides a natural way for users to get a
> > referenced kptr in the .enqueue struct_ops programs and makes sure that a
> > qdisc will always enqueue or drop the skb.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   include/linux/bpf.h         |  3 +++
> >   kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
> >   kernel/bpf/btf.c            |  1 +
> >   kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
> >   4 files changed, 56 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 1b84613b10ac..72bf941d1daf 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
> >               struct {
> >                       struct btf *btf;
> >                       u32 btf_id;
> > +                     u32 ref_obj_id;
> >               };
> >       };
> >       struct bpf_verifier_log *log; /* for verbose logs */
> > @@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
> >       enum bpf_reg_type reg_type;
> >       struct btf *btf;
> >       u32 btf_id;
> > +     u32 ref_obj_id;
> > +     bool refcounted;
> >   };
> >
> >   struct btf_mod_pair {
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index fda3dd2ee984..6e7795744f6a 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
> >   }
> >
> >   #define MAYBE_NULL_SUFFIX "__nullable"
> > +#define REFCOUNTED_SUFFIX "__ref"
> >   #define MAX_STUB_NAME 128
> >
> >   /* Return the type info of a stub function, if it exists.
> > @@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
> >                           struct bpf_struct_ops_arg_info *arg_info)
> >   {
> >       const struct btf_type *stub_func_proto, *pointed_type;
> > +     bool is_nullable = false, is_refcounted = false;
> >       const struct btf_param *stub_args, *args;
> >       struct bpf_ctx_arg_aux *info, *info_buf;
> >       u32 nargs, arg_no, info_cnt = 0;
> > +     const char *suffix;
> >       u32 arg_btf_id;
> >       int offset;
> >
> > @@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
> >       info = info_buf;
> >       for (arg_no = 0; arg_no < nargs; arg_no++) {
> >               /* Skip arguments that is not suffixed with
> > -              * "__nullable".
> > +              * "__nullable or __ref".
> >                */
> > -             if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > -                                         MAYBE_NULL_SUFFIX))
> > +             is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > +                                                  MAYBE_NULL_SUFFIX);
> > +             is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
> > +                                                    REFCOUNTED_SUFFIX);
> > +             if (!is_nullable && !is_refcounted)
> >                       continue;
> >
> > +             if (is_nullable)
> > +                     suffix = MAYBE_NULL_SUFFIX;
> > +             else if (is_refcounted)
> > +                     suffix = REFCOUNTED_SUFFIX;
> >               /* Should be a pointer to struct */
> >               pointed_type = btf_type_resolve_ptr(btf,
> >                                                   args[arg_no].type,
> > @@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
> >               if (!pointed_type ||
> >                   !btf_type_is_struct(pointed_type)) {
> >                       pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
> > -                             st_ops_name, member_name, MAYBE_NULL_SUFFIX);
> > +                             st_ops_name, member_name, suffix);
> >                       goto err_out;
> >               }
> >
> > @@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
> >               }
> >
> >               /* Fill the information of the new argument */
> > -             info->reg_type =
> > -                     PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> >               info->btf_id = arg_btf_id;
> >               info->btf = btf;
> >               info->offset = offset;
> > +             if (is_nullable) {
> > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > +             } else if (is_refcounted) {
> > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > +                     info->refcounted = true;
> > +             }
> >
> >               info++;
> >               info_cnt++;
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index e7a59e6462a9..a05ccf9ee032 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> >                       info->reg_type = ctx_arg_info->reg_type;
> >                       info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> >                       info->btf_id = ctx_arg_info->btf_id;
> > +                     info->ref_obj_id = ctx_arg_info->ref_obj_id;
> >                       return true;
> >               }
> >       }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 9f5de8d4fbd0..69753096075f 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
> >       return -EINVAL;
> >   }
> >
> > +static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < state->acquired_refs; i++)
> > +             if (state->refs[i].id == ptr_id)
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +
> >   static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
> >   {
> >       int i, last_idx;
> > @@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
> >   /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
> >   static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
> >                           enum bpf_access_type t, enum bpf_reg_type *reg_type,
> > -                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
> > +                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
> > +                         u32 *ref_obj_id)
> >   {
> >       struct bpf_insn_access_aux info = {
> >               .reg_type = *reg_type,
> > @@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
> >               *is_retval = info.is_retval;
> >
> >               if (base_type(*reg_type) == PTR_TO_BTF_ID) {
> > +                     if (info.ref_obj_id &&
> > +                         !find_reference_state(cur_func(env), info.ref_obj_id)) {
> > +                             verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
> > +                                     off);
> > +                             return -EACCES;
> > +                     }
> > +
> >                       *btf = info.btf;
> >                       *btf_id = info.btf_id;
> > +                     *ref_obj_id = info.ref_obj_id;
> >               } else {
> >                       env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
> >               }
> > @@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >               struct bpf_retval_range range;
> >               enum bpf_reg_type reg_type = SCALAR_VALUE;
> >               struct btf *btf = NULL;
> > -             u32 btf_id = 0;
> > +             u32 btf_id = 0, ref_obj_id = 0;
> >
> >               if (t == BPF_WRITE && value_regno >= 0 &&
> >                   is_pointer_value(env, value_regno)) {
> > @@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                       return err;
> >
> >               err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
> > -                                    &btf_id, &is_retval, is_ldsx);
> > +                                    &btf_id, &is_retval, is_ldsx, &ref_obj_id);
> >               if (err)
> >                       verbose_linfo(env, insn_idx, "; ");
> >               if (!err && t == BPF_READ && value_regno >= 0) {
> > @@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                               if (base_type(reg_type) == PTR_TO_BTF_ID) {
> >                                       regs[value_regno].btf = btf;
> >                                       regs[value_regno].btf_id = btf_id;
> > +                                     regs[value_regno].ref_obj_id = ref_obj_id;
> >                               }
> >                       }
> >                       regs[value_regno].type = reg_type;
> > @@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >   {
> >       bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
> >       struct bpf_subprog_info *sub = subprog_info(env, subprog);
> > +     struct bpf_ctx_arg_aux *ctx_arg_info;
> >       struct bpf_verifier_state *state;
> >       struct bpf_reg_state *regs;
> >       int ret, i;
> > @@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >               mark_reg_known_zero(env, regs, BPF_REG_1);
> >       }
> >
> > +     if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > +             ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> > +             for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > +                     if (ctx_arg_info[i].refcounted)
> > +                             ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
>
> There is a conflict with bpf-next/master: acquire_reference_state has been
> refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c
> changes, they should not conflict with net-next/main. I would suggest
> rebasing this set on bpf-next/master.
>
> At first glance, the ref_obj_id assignment looks racy because ctx_arg_info
> is shared by different bpf progs that may be verified in parallel. On second
> thought, this should be fine because it should always end up with the same
> ref_obj_id for the same arg-no, right? Not sure if UBSAN can understand this
> without READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using ref_obj_id
> would be quite puzzling when reading the verifier code. Any better idea?

ctx_arg_info is kinda read-only from the verifier pov.
bpf_ctx_arg_aux->btf_id is populated before the main verifier loop,
while ref_obj_id is a dynamic property.
It doesn't really fit in bpf_ctx_arg_aux.
It probably needs to be another struct type that is allocated
and populated once with acquire_reference() while the main verifier loop
is happening.
do_check_common() is maybe too early?
Looks like it's a reference that is ok to leak anyway, per patch 3?

It seems the main goal is to pass a ref_obj_id-like argument into the bpf prog
and make sure that the prog doesn't call a KF_RELEASE kfunc on it twice,
but leaking is ok?
Maybe it needs a different type, other than REF_TYPE_PTR.
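
Purely as an illustration of that idea (REF_TYPE_CTX_ARG is an invented
name, not in any tree):

	enum ref_state_type {
		REF_TYPE_PTR = 1,
		/* ... existing kinds ... */
		/* hypothetical: acquired once from a __ref ctx arg;
		 * must not be released twice, but leaking is allowed
		 */
		REF_TYPE_CTX_ARG,
	};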

> > Other than subprogs, afaik, a bpf prog triggered by bpf_tail_call can
> > also take the 'u64 *ctx' array. Maybe disallow using tail_call in all ops of
> > the bpf qdisc. env->subprog_info[i].has_tail_call already tracks whether
> > tail_call is used.

+1. Just disallow tail_call.


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18  0:58   ` Martin KaFai Lau
  2024-12-18  1:24     ` Alexei Starovoitov
@ 2024-12-18  1:44     ` Jakub Kicinski
  2024-12-18 16:57     ` Amery Hung
  2 siblings, 0 replies; 35+ messages in thread
From: Jakub Kicinski @ 2024-12-18  1:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs,
	ameryhung

On Tue, 17 Dec 2024 16:58:11 -0800 Martin KaFai Lau wrote:
> There is a conflict with bpf-next/master: acquire_reference_state has been
> refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c
> changes, they should not conflict with net-next/main. I would suggest
> rebasing this set on bpf-next/master.

Let's worry about merging once it's ready. The previous version was
5 months ago, the one before that 2 months earlier.


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18  1:24     ` Alexei Starovoitov
@ 2024-12-18 16:09       ` Amery Hung
  2024-12-18 17:20         ` Alexei Starovoitov
  0 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-18 16:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Martin KaFai Lau, Amery Hung, bpf, Network Development,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kui-Feng Lee,
	Toke Høiland-Jørgensen, Jamal Hadi Salim, Jiri Pirko,
	stfomichev, ekarani.silvestre, yangpeihao, Cong Wang, Peilin Ye

On Tue, Dec 17, 2024 at 5:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 4:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 12/13/24 3:29 PM, Amery Hung wrote:
> > > Allows struct_ops programs to acquire referenced kptrs from arguments
> > > by directly reading the argument.
> > >
> > > The verifier will acquire a reference for a struct_ops argument tagged
> > > with "__ref" in the stub function in the beginning of the main program.
> > > The user will be able to access the referenced kptr directly by reading
> > > the context as long as it has not been released by the program.
> > >
> > > This new mechanism to acquire referenced kptr (compared to the existing
> > > "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> > > In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> > > first argument. This mechanism provides a natural way for users to get a
> > > referenced kptr in the .enqueue struct_ops programs and makes sure that a
> > > qdisc will always enqueue or drop the skb.
> > >
> > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > ---
> > >   include/linux/bpf.h         |  3 +++
> > >   kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
> > >   kernel/bpf/btf.c            |  1 +
> > >   kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
> > >   4 files changed, 56 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 1b84613b10ac..72bf941d1daf 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
> > >               struct {
> > >                       struct btf *btf;
> > >                       u32 btf_id;
> > > +                     u32 ref_obj_id;
> > >               };
> > >       };
> > >       struct bpf_verifier_log *log; /* for verbose logs */
> > > @@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
> > >       enum bpf_reg_type reg_type;
> > >       struct btf *btf;
> > >       u32 btf_id;
> > > +     u32 ref_obj_id;
> > > +     bool refcounted;
> > >   };
> > >
> > >   struct btf_mod_pair {
> > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > index fda3dd2ee984..6e7795744f6a 100644
> > > --- a/kernel/bpf/bpf_struct_ops.c
> > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > @@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
> > >   }
> > >
> > >   #define MAYBE_NULL_SUFFIX "__nullable"
> > > +#define REFCOUNTED_SUFFIX "__ref"
> > >   #define MAX_STUB_NAME 128
> > >
> > >   /* Return the type info of a stub function, if it exists.
> > > @@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
> > >                           struct bpf_struct_ops_arg_info *arg_info)
> > >   {
> > >       const struct btf_type *stub_func_proto, *pointed_type;
> > > +     bool is_nullable = false, is_refcounted = false;
> > >       const struct btf_param *stub_args, *args;
> > >       struct bpf_ctx_arg_aux *info, *info_buf;
> > >       u32 nargs, arg_no, info_cnt = 0;
> > > +     const char *suffix;
> > >       u32 arg_btf_id;
> > >       int offset;
> > >
> > > @@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
> > >       info = info_buf;
> > >       for (arg_no = 0; arg_no < nargs; arg_no++) {
> > >               /* Skip arguments that is not suffixed with
> > > -              * "__nullable".
> > > +             * "__nullable" or "__ref".
> > >                */
> > > -             if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > > -                                         MAYBE_NULL_SUFFIX))
> > > +             is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > +                                                  MAYBE_NULL_SUFFIX);
> > > +             is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > +                                                    REFCOUNTED_SUFFIX);
> > > +             if (!is_nullable && !is_refcounted)
> > >                       continue;
> > >
> > > +             if (is_nullable)
> > > +                     suffix = MAYBE_NULL_SUFFIX;
> > > +             else if (is_refcounted)
> > > +                     suffix = REFCOUNTED_SUFFIX;
> > >               /* Should be a pointer to struct */
> > >               pointed_type = btf_type_resolve_ptr(btf,
> > >                                                   args[arg_no].type,
> > > @@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
> > >               if (!pointed_type ||
> > >                   !btf_type_is_struct(pointed_type)) {
> > >                       pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
> > > -                             st_ops_name, member_name, MAYBE_NULL_SUFFIX);
> > > +                             st_ops_name, member_name, suffix);
> > >                       goto err_out;
> > >               }
> > >
> > > @@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
> > >               }
> > >
> > >               /* Fill the information of the new argument */
> > > -             info->reg_type =
> > > -                     PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > >               info->btf_id = arg_btf_id;
> > >               info->btf = btf;
> > >               info->offset = offset;
> > > +             if (is_nullable) {
> > > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > +             } else if (is_refcounted) {
> > > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > > +                     info->refcounted = true;
> > > +             }
> > >
> > >               info++;
> > >               info_cnt++;
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index e7a59e6462a9..a05ccf9ee032 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > >                       info->reg_type = ctx_arg_info->reg_type;
> > >                       info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > >                       info->btf_id = ctx_arg_info->btf_id;
> > > +                     info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > >                       return true;
> > >               }
> > >       }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 9f5de8d4fbd0..69753096075f 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
> > >       return -EINVAL;
> > >   }
> > >
> > > +static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
> > > +{
> > > +     int i;
> > > +
> > > +     for (i = 0; i < state->acquired_refs; i++)
> > > +             if (state->refs[i].id == ptr_id)
> > > +                     return true;
> > > +
> > > +     return false;
> > > +}
> > > +
> > >   static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
> > >   {
> > >       int i, last_idx;
> > > @@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
> > >   /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
> > >   static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
> > >                           enum bpf_access_type t, enum bpf_reg_type *reg_type,
> > > -                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
> > > +                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
> > > +                         u32 *ref_obj_id)
> > >   {
> > >       struct bpf_insn_access_aux info = {
> > >               .reg_type = *reg_type,
> > > @@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
> > >               *is_retval = info.is_retval;
> > >
> > >               if (base_type(*reg_type) == PTR_TO_BTF_ID) {
> > > +                     if (info.ref_obj_id &&
> > > +                         !find_reference_state(cur_func(env), info.ref_obj_id)) {
> > > +                             verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
> > > +                                     off);
> > > +                             return -EACCES;
> > > +                     }
> > > +
> > >                       *btf = info.btf;
> > >                       *btf_id = info.btf_id;
> > > +                     *ref_obj_id = info.ref_obj_id;
> > >               } else {
> > >                       env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
> > >               }
> > > @@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > >               struct bpf_retval_range range;
> > >               enum bpf_reg_type reg_type = SCALAR_VALUE;
> > >               struct btf *btf = NULL;
> > > -             u32 btf_id = 0;
> > > +             u32 btf_id = 0, ref_obj_id = 0;
> > >
> > >               if (t == BPF_WRITE && value_regno >= 0 &&
> > >                   is_pointer_value(env, value_regno)) {
> > > @@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > >                       return err;
> > >
> > >               err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
> > > -                                    &btf_id, &is_retval, is_ldsx);
> > > +                                    &btf_id, &is_retval, is_ldsx, &ref_obj_id);
> > >               if (err)
> > >                       verbose_linfo(env, insn_idx, "; ");
> > >               if (!err && t == BPF_READ && value_regno >= 0) {
> > > @@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > >                               if (base_type(reg_type) == PTR_TO_BTF_ID) {
> > >                                       regs[value_regno].btf = btf;
> > >                                       regs[value_regno].btf_id = btf_id;
> > > +                                     regs[value_regno].ref_obj_id = ref_obj_id;
> > >                               }
> > >                       }
> > >                       regs[value_regno].type = reg_type;
> > > @@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> > >   {
> > >       bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
> > >       struct bpf_subprog_info *sub = subprog_info(env, subprog);
> > > +     struct bpf_ctx_arg_aux *ctx_arg_info;
> > >       struct bpf_verifier_state *state;
> > >       struct bpf_reg_state *regs;
> > >       int ret, i;
> > > @@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> > >               mark_reg_known_zero(env, regs, BPF_REG_1);
> > >       }
> > >
> > > +     if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > > +             ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> > > +             for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > > +                     if (ctx_arg_info[i].refcounted)
> > > +                             ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
> >
> > There is a conflict with bpf-next/master: acquire_reference_state has been
> > refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c
> > changes, they should not conflict with net-next/main. I would suggest
> > rebasing this set on bpf-next/master.
> >
> > At first glance, the ref_obj_id assignment looks racy because ctx_arg_info
> > is shared by different bpf progs that may be verified in parallel. On second
> > thought, this should be fine because it should always end up with the same
> > ref_obj_id for the same arg-no, right? Not sure if UBSAN can understand this
> > without READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using ref_obj_id
> > would be quite puzzling when reading the verifier code. Any better idea?
>
> ctx_arg_info is kinda read-only from the verifier pov.
> bpf_ctx_arg_aux->btf_id is populated before the main verifier loop,
> while ref_obj_id is a dynamic property.
> It doesn't really fit in bpf_ctx_arg_aux.
> It probably needs to be another struct type that is allocated
> and populated once with acquire_reference() while the main verifier loop
> is happening.
> do_check_common() is maybe too early?
> Looks like it's a reference that is ok to leak anyway, per patch 3?
>
> It seems the main goal is to pass a ref_obj_id-like argument into the bpf prog
> and make sure that the prog doesn't call a KF_RELEASE kfunc on it twice,
> but leaking is ok?
> Maybe it needs a different type, other than REF_TYPE_PTR.
>

The main goal of this patch is to give a unique ref_obj_id to the skb
arg in an .enqueue call. Therefore, we acquire that one and only
ref_obj_id for the __ref arg early in do_check_common() and do not change
it afterward. Later in the main loop, the liveness is tracked in the
reference states. This feels kind of read-only? Besides, since we
acquire the ref automatically, it forces the user to do something with
the ref ptr (in qdisc's case, .enqueue needs to either drop or enqueue
it).

I tried to break down the requirements from bpf qdisc (1. only a unique
reference to the skb in .enqueue; 2. users must enqueue or drop the
skb in .enqueue; 3. dequeue a single skb) into two orthogonal patches,
1 and 3, so whether this reference can leak or not can be independent.
Taking a step back, maybe we can encapsulate them all in one semantic
(a new kind of REF like you suggest), but I am not sure if that would be
too specific and thus less useful to others.
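
For illustration, the contract seen from an .enqueue prog is roughly the
following (a sketch modeled on the fifo selftest in this series;
bpf_qdisc_skb_drop() is the drop kfunc from this set, and
fifo_stash_skb() is a hypothetical helper standing in for whatever
map/collection the qdisc stashes the skb into):

	SEC("struct_ops/bpf_fifo_enqueue")
	int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
		     struct bpf_sk_buff_ptr *to_free)
	{
		if (sch->q.qlen >= sch->limit) {
			/* the drop path consumes the acquired reference */
			bpf_qdisc_skb_drop(skb, to_free);
			return NET_XMIT_DROP;
		}
		/* the enqueue path must consume it too, e.g. by stashing
		 * the skb with bpf_kptr_xchg(); returning without doing
		 * so fails with "Unreleased reference".
		 */
		return fifo_stash_skb(skb, sch);
	}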

> > Other than subprogs, afaik, a bpf prog triggered by bpf_tail_call can
> > also take the 'u64 *ctx' array. Maybe disallow using tail_call in all ops of
> > the bpf qdisc. env->subprog_info[i].has_tail_call already tracks whether
> > tail_call is used.
>
> +1. Just disallow tail_call.

Got it. Also, thanks Martin for pointing it out.


* Re: [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-18  1:17   ` Martin KaFai Lau
@ 2024-12-18 16:10     ` Amery Hung
  0 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-18 16:10 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On Tue, Dec 17, 2024 at 5:17 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
>
> > +void test_struct_ops_refcounted(void)
> > +{
> > +     if (test__start_subtest("refcounted"))
> > +             refcounted();
> > +     if (test__start_subtest("refcounted_fail__ref_leak"))
> > +             refcounted_fail__ref_leak();
>
> test_loader.c could make writing this test easier and it can also test the
> verifier failure message. e.g. for the ref_leak test, the following should do:
>
>         RUN_TESTS(struct_ops_refcounted_fail__ref_leak);
>
> The same for the other subtests in this patch.
>

Thanks for the pointer. I will change the selftests in this set to use test_loader.

> > +     if (test__start_subtest("refcounted_fail__global_subprog"))
> > +             refcounted_fail__global_subprog();
> > +}
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> > new file mode 100644
> > index 000000000000..6e82859eb187
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> > @@ -0,0 +1,17 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../bpf_testmod/bpf_testmod.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
>
> +#include "bpf_misc.h"
>
> +__failure __msg("Unreleased reference")
> > +SEC("struct_ops/test_refcounted")
> > +int BPF_PROG(test_refcounted, int dummy,
> > +          struct task_struct *task)
> > +{
> > +     return 0;
> > +}
> > +
> > +SEC(".struct_ops.link")
> > +struct bpf_testmod_ops testmod_ref_acquire = {
> > +     .test_refcounted = (void *)test_refcounted,
> > +};
>
> [ I will stop here for today and will continue the rest tomorrow. ]


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18  0:58   ` Martin KaFai Lau
  2024-12-18  1:24     ` Alexei Starovoitov
  2024-12-18  1:44     ` Jakub Kicinski
@ 2024-12-18 16:57     ` Amery Hung
  2024-12-19 23:06       ` Martin KaFai Lau
  2 siblings, 1 reply; 35+ messages in thread
From: Amery Hung @ 2024-12-18 16:57 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On Tue, Dec 17, 2024 at 4:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
> > Allows struct_ops programs to acqurie referenced kptrs from arguments
> > by directly reading the argument.
> >
> > The verifier will acquire a reference for a struct_ops argument tagged
> > with "__ref" in the stub function at the beginning of the main program.
> > The user will be able to access the referenced kptr directly by reading
> > the context as long as it has not been released by the program.
> >
> > This new mechanism to acquire referenced kptr (compared to the existing
> > "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> > In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> > first argument. This mechanism provides a natural way for users to get a
> > referenced kptr in the .enqueue struct_ops programs and makes sure that a
> > qdisc will always enqueue or drop the skb.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   include/linux/bpf.h         |  3 +++
> >   kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
> >   kernel/bpf/btf.c            |  1 +
> >   kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
> >   4 files changed, 56 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 1b84613b10ac..72bf941d1daf 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
> >               struct {
> >                       struct btf *btf;
> >                       u32 btf_id;
> > +                     u32 ref_obj_id;
> >               };
> >       };
> >       struct bpf_verifier_log *log; /* for verbose logs */
> > @@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
> >       enum bpf_reg_type reg_type;
> >       struct btf *btf;
> >       u32 btf_id;
> > +     u32 ref_obj_id;
> > +     bool refcounted;
> >   };
> >
> >   struct btf_mod_pair {
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index fda3dd2ee984..6e7795744f6a 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
> >   }
> >
> >   #define MAYBE_NULL_SUFFIX "__nullable"
> > +#define REFCOUNTED_SUFFIX "__ref"
> >   #define MAX_STUB_NAME 128
> >
> >   /* Return the type info of a stub function, if it exists.
> > @@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
> >                           struct bpf_struct_ops_arg_info *arg_info)
> >   {
> >       const struct btf_type *stub_func_proto, *pointed_type;
> > +     bool is_nullable = false, is_refcounted = false;
> >       const struct btf_param *stub_args, *args;
> >       struct bpf_ctx_arg_aux *info, *info_buf;
> >       u32 nargs, arg_no, info_cnt = 0;
> > +     const char *suffix;
> >       u32 arg_btf_id;
> >       int offset;
> >
> > @@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
> >       info = info_buf;
> >       for (arg_no = 0; arg_no < nargs; arg_no++) {
> >               /* Skip arguments that is not suffixed with
> > -              * "__nullable".
> > +              * "__nullable" or "__ref".
> >                */
> > -             if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > -                                         MAYBE_NULL_SUFFIX))
> > +             is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > +                                                  MAYBE_NULL_SUFFIX);
> > +             is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
> > +                                                    REFCOUNTED_SUFFIX);
> > +             if (!is_nullable && !is_refcounted)
> >                       continue;
> >
> > +             if (is_nullable)
> > +                     suffix = MAYBE_NULL_SUFFIX;
> > +             else if (is_refcounted)
> > +                     suffix = REFCOUNTED_SUFFIX;
> >               /* Should be a pointer to struct */
> >               pointed_type = btf_type_resolve_ptr(btf,
> >                                                   args[arg_no].type,
> > @@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
> >               if (!pointed_type ||
> >                   !btf_type_is_struct(pointed_type)) {
> >                       pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
> > -                             st_ops_name, member_name, MAYBE_NULL_SUFFIX);
> > +                             st_ops_name, member_name, suffix);
> >                       goto err_out;
> >               }
> >
> > @@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
> >               }
> >
> >               /* Fill the information of the new argument */
> > -             info->reg_type =
> > -                     PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> >               info->btf_id = arg_btf_id;
> >               info->btf = btf;
> >               info->offset = offset;
> > +             if (is_nullable) {
> > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > +             } else if (is_refcounted) {
> > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > +                     info->refcounted = true;
> > +             }
> >
> >               info++;
> >               info_cnt++;
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index e7a59e6462a9..a05ccf9ee032 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> >                       info->reg_type = ctx_arg_info->reg_type;
> >                       info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> >                       info->btf_id = ctx_arg_info->btf_id;
> > +                     info->ref_obj_id = ctx_arg_info->ref_obj_id;
> >                       return true;
> >               }
> >       }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 9f5de8d4fbd0..69753096075f 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
> >       return -EINVAL;
> >   }
> >
> > +static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < state->acquired_refs; i++)
> > +             if (state->refs[i].id == ptr_id)
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +
> >   static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
> >   {
> >       int i, last_idx;
> > @@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
> >   /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
> >   static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
> >                           enum bpf_access_type t, enum bpf_reg_type *reg_type,
> > -                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
> > +                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
> > +                         u32 *ref_obj_id)
> >   {
> >       struct bpf_insn_access_aux info = {
> >               .reg_type = *reg_type,
> > @@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
> >               *is_retval = info.is_retval;
> >
> >               if (base_type(*reg_type) == PTR_TO_BTF_ID) {
> > +                     if (info.ref_obj_id &&
> > +                         !find_reference_state(cur_func(env), info.ref_obj_id)) {
> > +                             verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
> > +                                     off);
> > +                             return -EACCES;
> > +                     }
> > +
> >                       *btf = info.btf;
> >                       *btf_id = info.btf_id;
> > +                     *ref_obj_id = info.ref_obj_id;
> >               } else {
> >                       env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
> >               }
> > @@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >               struct bpf_retval_range range;
> >               enum bpf_reg_type reg_type = SCALAR_VALUE;
> >               struct btf *btf = NULL;
> > -             u32 btf_id = 0;
> > +             u32 btf_id = 0, ref_obj_id = 0;
> >
> >               if (t == BPF_WRITE && value_regno >= 0 &&
> >                   is_pointer_value(env, value_regno)) {
> > @@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                       return err;
> >
> >               err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
> > -                                    &btf_id, &is_retval, is_ldsx);
> > +                                    &btf_id, &is_retval, is_ldsx, &ref_obj_id);
> >               if (err)
> >                       verbose_linfo(env, insn_idx, "; ");
> >               if (!err && t == BPF_READ && value_regno >= 0) {
> > @@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> >                               if (base_type(reg_type) == PTR_TO_BTF_ID) {
> >                                       regs[value_regno].btf = btf;
> >                                       regs[value_regno].btf_id = btf_id;
> > +                                     regs[value_regno].ref_obj_id = ref_obj_id;
> >                               }
> >                       }
> >                       regs[value_regno].type = reg_type;
> > @@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >   {
> >       bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
> >       struct bpf_subprog_info *sub = subprog_info(env, subprog);
> > +     struct bpf_ctx_arg_aux *ctx_arg_info;
> >       struct bpf_verifier_state *state;
> >       struct bpf_reg_state *regs;
> >       int ret, i;
> > @@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >               mark_reg_known_zero(env, regs, BPF_REG_1);
> >       }
> >
> > +     if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > +             ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> > +             for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > +                     if (ctx_arg_info[i].refcounted)
> > +                             ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
>
> There is a conflict in the bpf-next/master. acquire_reference_state has been
> refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c
> changes, they should not have conflict with the net-next/main. I would suggest
> to rebase this set on bpf-next/master.
>

Thanks, I have rebased it and will send a new version.

> At first glance, the ref_obj_id assignment looks racy because ctx_arg_info
> is shared by different bpf progs that may be verified in parallel. On second
> thought, this should be fine because it should always end up having the same
> ref_obj_id for the same arg-no, right? Not sure if KCSAN can understand this
> without using READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using
> ref_obj_id will be quite puzzling when reading the verifier code. Any better idea?
>

It looks like ref_obj_id cannot be reused (id always comes from
++env->id_gen), and these will be the earliest references to acquire.
So, maybe we can assume the ref_obj_id without needing to store it in
ctx_arg_info? E.g., the first __ref argument's ref_obj_id is always 1.
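
Roughly (a sketch against the acquire_reference_state() signature quoted above,
before the 769b0f1c8214 refactor):

	/* Acquire refs for __ref args in do_check_common() but don't write
	 * the ids back into the shared ctx_arg_info. Ids come from
	 * ++env->id_gen and nothing else has been acquired yet, so the n-th
	 * __ref argument always gets ref_obj_id == n + 1; btf_ctx_access()
	 * could recompute it by counting refcounted args up to the offset.
	 */
	for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
		if (ctx_arg_info[i].refcounted)
			acquire_reference_state(env, 0);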

> Other than the subprog, afaik, the bpf prog triggered by the bpf_tail_call can
> also take the 'u64 *ctx' array. May be disallow using tailcall in all ops in the
> bpf qdisc. env->subprog_info[i].has_tail_call has already tracked whether the
> tail_call is used.
>
> > +     }
> > +
> >       ret = do_check(env);
> >   out:
> >       /* check for NULL is necessary, since cur_state can be freed inside
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-13 23:29 ` [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2024-12-18 17:11   ` Amery Hung
  2024-12-19  7:37   ` Martin KaFai Lau
  1 sibling, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-18 17:11 UTC (permalink / raw)
  To: Amery Hung
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On Fri, Dec 13, 2024 at 3:30 PM Amery Hung <amery.hung@bytedance.com> wrote:
>
> Add basic kfuncs for working on skb in qdisc.
>
> Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> in .enqueue where a to_free skb list is available from kernel to defer
> the release. bpf_kfree_skb() should be used elsewhere. It is also used
> in bpf_obj_free_fields() when cleaning up skb in maps and collections.
>
> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> to build flow-based queueing algorithms.
>
> Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>  net/sched/bpf_qdisc.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index a2e2db29e5fc..28959424eab0 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -106,6 +106,67 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
>         return 0;
>  }
>
> +__bpf_kfunc_start_defs();
> +
> +/* bpf_skb_get_hash - Get the flow hash of an skb.
> + * @skb: The skb to get the flow hash from.
> + */
> +__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
> +{
> +       return skb_get_hash(skb);
> +}
> +
> +/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
> + * @skb: The skb whose reference to be released and dropped.
> + */
> +__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
> +{
> +       kfree_skb(skb);
> +}
> +
> +/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
> + * @skb: The skb whose reference to be released and dropped.
> + * @to_free_list: The list of skbs to be dropped.
> + */
> +__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> +                                   struct bpf_sk_buff_ptr *to_free_list)
> +{
> +       __qdisc_drop(skb, (struct sk_buff **)to_free_list);
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +#define BPF_QDISC_KFUNC_xxx \
> +       BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
> +       BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
> +       BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
> +
> +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> +#define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
> +BPF_QDISC_KFUNC_xxx
> +#undef BPF_QDISC_KFUNC
> +BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> +
> +#define BPF_QDISC_KFUNC(name, _) BTF_ID_LIST_SINGLE(name##_ids, func, name)
> +BPF_QDISC_KFUNC_xxx
> +#undef BPF_QDISC_KFUNC
> +
> +static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{

There is a null pointer dereference here, since prog->aux->attach_func_name
is not yet populated during check_cfg(). I will add:

        if (!btf_id_set8_contains(&bpf_qdisc_kfunc_ids, kfunc_id) ||
            !prog->aux->attach_func_name)
                return 0;

> +       if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
> +               if (strcmp(prog->aux->attach_func_name, "enqueue"))
> +                       return -EACCES;
> +
> +       return 0;
> +}
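
With that guard added, the resulting filter would look roughly like:

	static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
	{
		/* Not a qdisc kfunc, or attach_func_name not populated yet
		 * (e.g. when the filter is consulted during check_cfg()).
		 */
		if (!btf_id_set8_contains(&bpf_qdisc_kfunc_ids, kfunc_id) ||
		    !prog->aux->attach_func_name)
			return 0;

		/* bpf_qdisc_skb_drop() needs the to_free list, which only
		 * .enqueue gets from the kernel.
		 */
		if (kfunc_id == bpf_qdisc_skb_drop_ids[0] &&
		    strcmp(prog->aux->attach_func_name, "enqueue"))
			return -EACCES;

		return 0;
	}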
> +
> +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> +       .owner = THIS_MODULE,
> +       .set   = &bpf_qdisc_kfunc_ids,
> +       .filter = bpf_qdisc_kfunc_filter,
> +};
> +
>  static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
>         .get_func_proto         = bpf_qdisc_get_func_proto,
>         .is_valid_access        = bpf_qdisc_is_valid_access,
> @@ -209,6 +270,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
>
>  static int __init bpf_qdisc_kfunc_init(void)
>  {
> -       return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +       int ret;
> +       const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> +               {
> +                       .btf_id       = bpf_sk_buff_ids[0],
> +                       .kfunc_btf_id = bpf_kfree_skb_ids[0]
> +               },
> +       };
> +
> +       ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> +       ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> +                                                ARRAY_SIZE(skb_kfunc_dtors),
> +                                                THIS_MODULE);
> +       ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +
> +       return ret;
>  }
>  late_initcall(bpf_qdisc_kfunc_init);
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18 16:09       ` Amery Hung
@ 2024-12-18 17:20         ` Alexei Starovoitov
  0 siblings, 0 replies; 35+ messages in thread
From: Alexei Starovoitov @ 2024-12-18 17:20 UTC (permalink / raw)
  To: Amery Hung
  Cc: Martin KaFai Lau, Amery Hung, bpf, Network Development,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Kui-Feng Lee,
	Toke Høiland-Jørgensen, Jamal Hadi Salim, Jiri Pirko,
	stfomichev, ekarani.silvestre, yangpeihao, Cong Wang, Peilin Ye

On Wed, Dec 18, 2024 at 8:09 AM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Tue, Dec 17, 2024 at 5:24 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Dec 17, 2024 at 4:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 12/13/24 3:29 PM, Amery Hung wrote:
> > > > Allows struct_ops programs to acquire referenced kptrs from arguments
> > > > by directly reading the argument.
> > > >
> > > > The verifier will acquire a reference for a struct_ops argument tagged
> > > > with "__ref" in the stub function at the beginning of the main program.
> > > > The user will be able to access the referenced kptr directly by reading
> > > > the context as long as it has not been released by the program.
> > > >
> > > > This new mechanism to acquire referenced kptr (compared to the existing
> > > > "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> > > > In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> > > > first argument. This mechanism provides a natural way for users to get a
> > > > referenced kptr in the .enqueue struct_ops programs and makes sure that a
> > > > qdisc will always enqueue or drop the skb.
> > > >
> > > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > > ---
> > > >   include/linux/bpf.h         |  3 +++
> > > >   kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
> > > >   kernel/bpf/btf.c            |  1 +
> > > >   kernel/bpf/verifier.c       | 35 ++++++++++++++++++++++++++++++++---
> > > >   4 files changed, 56 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 1b84613b10ac..72bf941d1daf 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
> > > >               struct {
> > > >                       struct btf *btf;
> > > >                       u32 btf_id;
> > > > +                     u32 ref_obj_id;
> > > >               };
> > > >       };
> > > >       struct bpf_verifier_log *log; /* for verbose logs */
> > > > @@ -1480,6 +1481,8 @@ struct bpf_ctx_arg_aux {
> > > >       enum bpf_reg_type reg_type;
> > > >       struct btf *btf;
> > > >       u32 btf_id;
> > > > +     u32 ref_obj_id;
> > > > +     bool refcounted;
> > > >   };
> > > >
> > > >   struct btf_mod_pair {
> > > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > > index fda3dd2ee984..6e7795744f6a 100644
> > > > --- a/kernel/bpf/bpf_struct_ops.c
> > > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > > @@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
> > > >   }
> > > >
> > > >   #define MAYBE_NULL_SUFFIX "__nullable"
> > > > +#define REFCOUNTED_SUFFIX "__ref"
> > > >   #define MAX_STUB_NAME 128
> > > >
> > > >   /* Return the type info of a stub function, if it exists.
> > > > @@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
> > > >                           struct bpf_struct_ops_arg_info *arg_info)
> > > >   {
> > > >       const struct btf_type *stub_func_proto, *pointed_type;
> > > > +     bool is_nullable = false, is_refcounted = false;
> > > >       const struct btf_param *stub_args, *args;
> > > >       struct bpf_ctx_arg_aux *info, *info_buf;
> > > >       u32 nargs, arg_no, info_cnt = 0;
> > > > +     const char *suffix;
> > > >       u32 arg_btf_id;
> > > >       int offset;
> > > >
> > > > @@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
> > > >       info = info_buf;
> > > >       for (arg_no = 0; arg_no < nargs; arg_no++) {
> > > >               /* Skip arguments that is not suffixed with
> > > > -              * "__nullable".
> > > > +              * "__nullable" or "__ref".
> > > >                */
> > > > -             if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > -                                         MAYBE_NULL_SUFFIX))
> > > > +             is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > +                                                  MAYBE_NULL_SUFFIX);
> > > > +             is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > +                                                    REFCOUNTED_SUFFIX);
> > > > +             if (!is_nullable && !is_refcounted)
> > > >                       continue;
> > > >
> > > > +             if (is_nullable)
> > > > +                     suffix = MAYBE_NULL_SUFFIX;
> > > > +             else if (is_refcounted)
> > > > +                     suffix = REFCOUNTED_SUFFIX;
> > > >               /* Should be a pointer to struct */
> > > >               pointed_type = btf_type_resolve_ptr(btf,
> > > >                                                   args[arg_no].type,
> > > > @@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
> > > >               if (!pointed_type ||
> > > >                   !btf_type_is_struct(pointed_type)) {
> > > >                       pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
> > > > -                             st_ops_name, member_name, MAYBE_NULL_SUFFIX);
> > > > +                             st_ops_name, member_name, suffix);
> > > >                       goto err_out;
> > > >               }
> > > >
> > > > @@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
> > > >               }
> > > >
> > > >               /* Fill the information of the new argument */
> > > > -             info->reg_type =
> > > > -                     PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > >               info->btf_id = arg_btf_id;
> > > >               info->btf = btf;
> > > >               info->offset = offset;
> > > > +             if (is_nullable) {
> > > > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > > +             } else if (is_refcounted) {
> > > > +                     info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > > > +                     info->refcounted = true;
> > > > +             }
> > > >
> > > >               info++;
> > > >               info_cnt++;
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index e7a59e6462a9..a05ccf9ee032 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -6580,6 +6580,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > >                       info->reg_type = ctx_arg_info->reg_type;
> > > >                       info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > > >                       info->btf_id = ctx_arg_info->btf_id;
> > > > +                     info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > > >                       return true;
> > > >               }
> > > >       }
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 9f5de8d4fbd0..69753096075f 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -1402,6 +1402,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
> > > >       return -EINVAL;
> > > >   }
> > > >
> > > > +static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
> > > > +{
> > > > +     int i;
> > > > +
> > > > +     for (i = 0; i < state->acquired_refs; i++)
> > > > +             if (state->refs[i].id == ptr_id)
> > > > +                     return true;
> > > > +
> > > > +     return false;
> > > > +}
> > > > +
> > > >   static int release_lock_state(struct bpf_func_state *state, int type, int id, void *ptr)
> > > >   {
> > > >       int i, last_idx;
> > > > @@ -5798,7 +5809,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
> > > >   /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
> > > >   static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
> > > >                           enum bpf_access_type t, enum bpf_reg_type *reg_type,
> > > > -                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
> > > > +                         struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
> > > > +                         u32 *ref_obj_id)
> > > >   {
> > > >       struct bpf_insn_access_aux info = {
> > > >               .reg_type = *reg_type,
> > > > @@ -5820,8 +5832,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
> > > >               *is_retval = info.is_retval;
> > > >
> > > >               if (base_type(*reg_type) == PTR_TO_BTF_ID) {
> > > > +                     if (info.ref_obj_id &&
> > > > +                         !find_reference_state(cur_func(env), info.ref_obj_id)) {
> > > > +                             verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
> > > > +                                     off);
> > > > +                             return -EACCES;
> > > > +                     }
> > > > +
> > > >                       *btf = info.btf;
> > > >                       *btf_id = info.btf_id;
> > > > +                     *ref_obj_id = info.ref_obj_id;
> > > >               } else {
> > > >                       env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
> > > >               }
> > > > @@ -7135,7 +7155,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > > >               struct bpf_retval_range range;
> > > >               enum bpf_reg_type reg_type = SCALAR_VALUE;
> > > >               struct btf *btf = NULL;
> > > > -             u32 btf_id = 0;
> > > > +             u32 btf_id = 0, ref_obj_id = 0;
> > > >
> > > >               if (t == BPF_WRITE && value_regno >= 0 &&
> > > >                   is_pointer_value(env, value_regno)) {
> > > > @@ -7148,7 +7168,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > > >                       return err;
> > > >
> > > >               err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
> > > > -                                    &btf_id, &is_retval, is_ldsx);
> > > > +                                    &btf_id, &is_retval, is_ldsx, &ref_obj_id);
> > > >               if (err)
> > > >                       verbose_linfo(env, insn_idx, "; ");
> > > >               if (!err && t == BPF_READ && value_regno >= 0) {
> > > > @@ -7179,6 +7199,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
> > > >                               if (base_type(reg_type) == PTR_TO_BTF_ID) {
> > > >                                       regs[value_regno].btf = btf;
> > > >                                       regs[value_regno].btf_id = btf_id;
> > > > +                                     regs[value_regno].ref_obj_id = ref_obj_id;
> > > >                               }
> > > >                       }
> > > >                       regs[value_regno].type = reg_type;
> > > > @@ -21662,6 +21683,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> > > >   {
> > > >       bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
> > > >       struct bpf_subprog_info *sub = subprog_info(env, subprog);
> > > > +     struct bpf_ctx_arg_aux *ctx_arg_info;
> > > >       struct bpf_verifier_state *state;
> > > >       struct bpf_reg_state *regs;
> > > >       int ret, i;
> > > > @@ -21769,6 +21791,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> > > >               mark_reg_known_zero(env, regs, BPF_REG_1);
> > > >       }
> > > >
> > > > +     if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > > > +             ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> > > > +             for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > > > +                     if (ctx_arg_info[i].refcounted)
> > > > +                             ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
> > >
> > > There is a conflict in the bpf-next/master. acquire_reference_state has been
> > > refactored in commit 769b0f1c8214. From looking at the net/sched/sch_*.c
> > > changes, they should not have conflict with the net-next/main. I would suggest
> > > to rebase this set on bpf-next/master.
> > >
> > > At first glance, the ref_obj_id assignment looks racy because ctx_arg_info
> > > is shared by different bpf progs that may be verified in parallel. On second
> > > thought, this should be fine because it should always end up having the same
> > > ref_obj_id for the same arg-no, right? Not sure if KCSAN can understand this
> > > without using READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using
> > > ref_obj_id will be quite puzzling when reading the verifier code. Any better idea?
> >
> > ctx_arg_info is kinda read-only from the verifier pov.
> > bpf_ctx_arg_aux->btf_id is populated before the main verifier loop.
> > While ref_obj_id is a dynamic property.
> > It doesn't really fit in bpf_ctx_arg_aux.
> > It probably needs to be another struct type that is allocated
> > and populated once with acquire_reference() when the main verifier loop
> > is happening.
> > do_check_common() maybe too early?
> > Looks like it's a reference that is ok to leak anyway, per patch 3?
> >
> > It seems the main goal is to pass a ref_obj_id-like argument into the bpf prog
> > and make sure that the prog doesn't call a KF_RELEASE kfunc on it twice,
> > but leaking is ok?
> > Maybe it needs a different type, other than REF_TYPE_PTR.
> >
>
> The main goal of this patch is to get a unique ref_obj_id to the skb
> arg in a .enqueue call. Therefore, we acquire that one and only
> ref_obj_id for the __ref arg early in do_check_common() and do not change
> it afterward. Later in the main loop, the liveness is tracked in the
> reference states. This feels kind of read-only? Besides, since we
> acquire the ref automatically, it forces the user to do something with
> the ref ptr (in qdisc's case, .enqueue needs to either drop or enqueue
> it).
>
> I tried to break down the requirements from bpf qdisc (1. only a unique
> reference to the skb in .enqueue; 2. users must enqueue or drop the
> skb in .enqueue; 3. dequeue a single skb) into two orthogonal patches,
> 1 and 3, so whether this reference can leak or not can be independent.
> Taking a step back, maybe we can encapsulate them all in one semantic
> (a new kind of REF like you suggest), but I am not sure if that'd be
> too specific and then less useful to others.

Makes sense to keep the same ref type then.
I misread the patch 3 comment
"leak referenced kptr through return value"
as just "leak referenced kptr".
Probably better to use a different term than "leak".
The ref_obj_id is kept valid through the prog.
It's returned from the prog back to the kernel. Not really leaking.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr
  2024-12-13 23:29 ` [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-12-18 22:29   ` Martin KaFai Lau
  0 siblings, 0 replies; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-18 22:29 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:
> @@ -15993,13 +16001,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>   	const char *exit_ctx = "At program exit";
>   	struct tnum enforce_attach_type_range = tnum_unknown;
>   	const struct bpf_prog *prog = env->prog;
> -	struct bpf_reg_state *reg;
> +	struct bpf_reg_state *reg = reg_state(env, regno);
>   	struct bpf_retval_range range = retval_range(0, 1);
>   	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>   	int err;
>   	struct bpf_func_state *frame = env->cur_state->frame[0];
>   	const bool is_subprog = frame->subprogno;
>   	bool return_32bit = false;
> +	struct btf *btf = bpf_prog_get_target_btf(prog);
> +	const struct btf_type *ret_type = NULL;
>   
>   	/* LSM and struct_ops func-ptr's return type could be "void" */
>   	if (!is_subprog || frame->in_exception_callback_fn) {
> @@ -16008,10 +16018,31 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>   			if (prog->expected_attach_type == BPF_LSM_CGROUP)
>   				/* See below, can be 0 or 0-1 depending on hook. */
>   				break;
> -			fallthrough;
> +			if (!prog->aux->attach_func_proto->type)
> +				return 0;
> +			break;
>   		case BPF_PROG_TYPE_STRUCT_OPS:
>   			if (!prog->aux->attach_func_proto->type)
>   				return 0;
> +
> +			if (frame->in_exception_callback_fn)
> +				break;
> +
> +			/* Allow a struct_ops program to return a referenced kptr if it
> +			 * matches the operator's return type and is in its unmodified
> +			 * form. A scalar zero (i.e., a null pointer) is also allowed.
> +			 */
> +			ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> +			if (btf_type_is_ptr(ret_type) && reg->type & PTR_TO_BTF_ID &&

The "reg->type & PTR_TO_BTF_ID" does not look right. It should be 
"base_type(reg->type) == PTR_TO_BTF_ID".

> +			    reg->ref_obj_id) {
> +				if (reg->btf_id != ret_type->type) {

reg->btf could be a bpf prog's btf (i.e. prog->aux->btf) instead of the kernel
btf, so comparing only the btf_id here is not correct.

One way could be to first compare the reg->btf == prog->aux->attach_btf.
prog->aux->attach_btf here must be a kernel btf.

Another way: btf_type_resolve_ptr() is a better helper than
btf_type_by_id() here. It only returns non-NULL if the type is a pointer and
also skips modifiers like "const" before returning. Then it can directly
compare the "struct btf_type *" returned by 
'btf_type_resolve_ptr(prog->aux->attach_btf, prog->aux->attach_func_proto->type, 
NULL)' and 'btf_type_resolve_ptr(reg->btf, reg->btf_id, NULL)'

May as well enforce that a pointer returned by an "ops" must point to a
struct (i.e. __btf_type_is_struct(t) == true). This enforcement can be done
in bpf_struct_ops_desc_init().
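
For instance (a sketch; the exact placement inside bpf_struct_ops_desc_init()'s
member loop and the variable names are assumptions):

	const struct btf_type *ptr_type;

	/* func_proto->type is the op's return type id */
	ptr_type = btf_type_resolve_ptr(btf, func_proto->type, NULL);
	if (ptr_type && !__btf_type_is_struct(ptr_type)) {
		pr_warn("struct_ops %s: an op returning a pointer must return a struct pointer\n",
			st_ops->name);
		return -EINVAL;
	}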



> +					verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> +						btf_type_name(reg->btf, reg->btf_id),
> +						btf_type_name(btf, ret_type->type));
> +					return -EINVAL;
> +				}
> +				return __check_ptr_off_reg(env, reg, regno, false);
> +			}
>   			break;
>   		default:
>   			break;
> @@ -16033,8 +16064,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>   		return -EACCES;
>   	}
>   
> -	reg = cur_regs(env) + regno;
> -
>   	if (frame->in_async_callback_fn) {
>   		/* enforce return zero from async callbacks like timer */
>   		exit_ctx = "At async callback return";
> @@ -16133,6 +16162,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>   	case BPF_PROG_TYPE_NETFILTER:
>   		range = retval_range(NF_DROP, NF_ACCEPT);
>   		break;
> +	case BPF_PROG_TYPE_STRUCT_OPS:
> +		if (!ret_type || !btf_type_is_ptr(ret_type))
> +			return 0;
> +		range = retval_range(0, 0);
> +		break;
>   	case BPF_PROG_TYPE_EXT:
>   		/* freplace program can return anything as its return value
>   		 * depends on the to-be-replaced kernel func or bpf program.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-13 23:29 ` [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
  2024-12-14  4:51   ` Cong Wang
@ 2024-12-18 23:37   ` Martin KaFai Lau
  1 sibling, 0 replies; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-18 23:37 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:
> +static int bpf_qdisc_init_member(const struct btf_type *t,
> +				 const struct btf_member *member,
> +				 void *kdata, const void *udata)
> +{
> +	const struct Qdisc_ops *uqdisc_ops;
> +	struct Qdisc_ops *qdisc_ops;
> +	u32 moff;
> +
> +	uqdisc_ops = (const struct Qdisc_ops *)udata;
> +	qdisc_ops = (struct Qdisc_ops *)kdata;
> +
> +	moff = __btf_member_bit_offset(t, member) / 8;
> +	switch (moff) {
> +	case offsetof(struct Qdisc_ops, priv_size):
> +		if (uqdisc_ops->priv_size)

bpf_struct_ops_map_update_elem() already enforces that a non-function-pointer
member must be zero if ->init_member() returns 0, so this check is unnecessary.

> +			return -EINVAL;
> +		return 1;
> +	case offsetof(struct Qdisc_ops, static_flags):
> +		if (uqdisc_ops->static_flags)

Same here.

The priv_size and static_flags cases should not be needed; just return 0.

> +			return -EINVAL;
> +		return 1;
> +	case offsetof(struct Qdisc_ops, peek):
> +		if (!uqdisc_ops->peek)

bpf_struct_ops_map_update_elem() will assign the trampoline (that will call the 
bpf prog) to qdisc_ops->peek if the "u"qdisc_ops->peek has the prog fd.

This test is also unnecessary.

> +			qdisc_ops->peek = qdisc_peek_dequeued;

Always do this assignment



> +		return 1;

and return 0 here. Allow bpf_struct_ops_map_update_elem() to do the needed
fd testing instead and reassign qdisc_ops->peek with the trampoline if needed.
(A consolidated sketch follows the quoted function below.)

> +	case offsetof(struct Qdisc_ops, id):
> +		if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
> +				     sizeof(qdisc_ops->id)) <= 0)
> +			return -EINVAL;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
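
Putting the above together, init_member could shrink to something like this
(a sketch; priv_size handling comes back in the watchdog patch later in the
series):

	static int bpf_qdisc_init_member(const struct btf_type *t,
					 const struct btf_member *member,
					 void *kdata, const void *udata)
	{
		const struct Qdisc_ops *uqdisc_ops = udata;
		struct Qdisc_ops *qdisc_ops = kdata;
		u32 moff = __btf_member_bit_offset(t, member) / 8;

		switch (moff) {
		case offsetof(struct Qdisc_ops, peek):
			/* Default; bpf_struct_ops_map_update_elem() replaces
			 * it with the trampoline when a peek prog is given.
			 */
			qdisc_ops->peek = qdisc_peek_dequeued;
			return 0;
		case offsetof(struct Qdisc_ops, id):
			if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
					     sizeof(qdisc_ops->id)) <= 0)
				return -EINVAL;
			return 1;
		}

		return 0;
	}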

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer
  2024-12-13 23:29 ` [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
@ 2024-12-19  1:16   ` Martin KaFai Lau
  2024-12-20 19:24     ` Amery Hung
  0 siblings, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-19  1:16 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:
> Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
> the execution of a qdisc through a kfunc, bpf_qdisc_watchdog_schedule(). It can
> be useful for building traffic-shaping scheduling algorithms, where the time
> the next packet will be dequeued is known.
> 
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>   include/net/sch_generic.h |  4 +++
>   net/sched/bpf_qdisc.c     | 51 ++++++++++++++++++++++++++++++++++++++-
>   net/sched/sch_api.c       | 11 +++++++++
>   net/sched/sch_generic.c   |  8 ++++++
>   4 files changed, 73 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 5d74fa7e694c..6a252b1b0680 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -1357,4 +1357,8 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
>   		msleep(1);
>   }
>   
> +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
> +void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
> +void bpf_qdisc_reset_post_op(struct Qdisc *sch);
> +
>   #endif
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index 28959424eab0..7c155207fe1e 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -8,6 +8,10 @@
>   
>   static struct bpf_struct_ops bpf_Qdisc_ops;
>   
> +struct bpf_sched_data {
> +	struct qdisc_watchdog watchdog;
> +};
> +
>   struct bpf_sk_buff_ptr {
>   	struct sk_buff *skb;
>   };
> @@ -17,6 +21,32 @@ static int bpf_qdisc_init(struct btf *btf)
>   	return 0;
>   }
>   
> +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
> +			  struct netlink_ext_ack *extack)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +
> +	qdisc_watchdog_init(&q->watchdog, sch);
> +	return 0;
> +}
> +EXPORT_SYMBOL(bpf_qdisc_init_pre_op);
> +
> +void bpf_qdisc_reset_post_op(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +
> +	qdisc_watchdog_cancel(&q->watchdog);
> +}
> +EXPORT_SYMBOL(bpf_qdisc_reset_post_op);
> +
> +void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +
> +	qdisc_watchdog_cancel(&q->watchdog);
> +}
> +EXPORT_SYMBOL(bpf_qdisc_destroy_post_op);

These feel like candidates for ".gen_prologue" and ".gen_epilogue". Then the
changes to sch_api.c are not needed.
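
e.g. a rough .gen_prologue sketch for init (not compile-tested; it assumes the
struct_ops ctx is the u64 args array with sch/opt/extack at offsets 0/8/16, that
bpf_qdisc_init_pre_op() keeps the signature from this patch, and that a direct
BPF_EMIT_CALL to it is acceptable; the register choreography is illustrative):

	static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
					  const struct bpf_prog *prog)
	{
		struct bpf_insn *insn = insn_buf;

		if (strcmp(prog->aux->attach_func_name, "init"))
			return 0;

		/* r6 = r1 (save ctx); r2 = opt; r3 = extack; r1 = sch;
		 * call bpf_qdisc_init_pre_op; r1 = r6 (restore ctx).
		 */
		*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
		*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8);
		*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1, 16);
		*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
		*insn++ = BPF_EMIT_CALL(bpf_qdisc_init_pre_op);
		*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);

		return insn - insn_buf;
	}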

> +
>   static const struct bpf_func_proto *
>   bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
>   			 const struct bpf_prog *prog)
> @@ -134,12 +164,25 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
>   	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
>   }
>   
> +/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
> + * @sch: The qdisc to be scheduled.
> + * @expire: The expiry time of the timer.
> + * @delta_ns: The slack range of the timer.
> + */
> +__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +
> +	qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
> +}
> +
>   __bpf_kfunc_end_defs();
>   
>   #define BPF_QDISC_KFUNC_xxx \
>   	BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
>   	BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
>   	BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
> +	BPF_QDISC_KFUNC(bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS) \
>   
>   BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
>   #define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
> @@ -154,9 +197,14 @@ BPF_QDISC_KFUNC_xxx
>   
>   static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
>   {
> -	if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
> +	if (kfunc_id == bpf_qdisc_skb_drop_ids[0]) {
>   		if (strcmp(prog->aux->attach_func_name, "enqueue"))
>   			return -EACCES;
> +	} else if (kfunc_id == bpf_qdisc_watchdog_schedule_ids[0]) {
> +		if (strcmp(prog->aux->attach_func_name, "enqueue") &&
> +		    strcmp(prog->aux->attach_func_name, "dequeue"))
> +			return -EACCES;
> +	}
>   
>   	return 0;
>   }
> @@ -189,6 +237,7 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
>   	case offsetof(struct Qdisc_ops, priv_size):
>   		if (uqdisc_ops->priv_size)
>   			return -EINVAL;
> +		qdisc_ops->priv_size = sizeof(struct bpf_sched_data);

ah. ok. The priv_size case is still needed.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-13 23:29 ` [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
  2024-12-18  1:17   ` Martin KaFai Lau
@ 2024-12-19  3:40   ` Yonghong Song
  2024-12-19 20:49     ` Amery Hung
  1 sibling, 1 reply; 35+ messages in thread
From: Yonghong Song @ 2024-12-19  3:40 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung




On 12/13/24 3:29 PM, Amery Hung wrote:
> Test a referenced kptr acquired through a struct_ops argument tagged with
> "__ref". The success case checks whether 1) a reference to the correct
> type is acquired, and 2) the referenced kptr argument can be accessed in
> multiple paths as long as it hasn't been released. In the fail cases,
> we first confirm that a referenced kptr acquired through a struct_ops
> argument is not allowed to be leaked. Then, we make sure this new
> referenced kptr acquiring mechanism does not accidentally allow referenced
> kptrs to flow into global subprograms through their arguments.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>   .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  7 ++
>   .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  2 +
>   .../prog_tests/test_struct_ops_refcounted.c   | 58 ++++++++++++++++
>   .../bpf/progs/struct_ops_refcounted.c         | 67 +++++++++++++++++++
>   ...ruct_ops_refcounted_fail__global_subprog.c | 32 +++++++++
>   .../struct_ops_refcounted_fail__ref_leak.c    | 17 +++++
>   6 files changed, 183 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
>   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
>   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
>   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
>
> diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> index 987d41af71d2..244234546ae2 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -1135,10 +1135,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
>   	return 0;
>   }
>   
> +static int bpf_testmod_ops__test_refcounted(int dummy,
> +					    struct task_struct *task__ref)
> +{
> +	return 0;
> +}
> +
>   static struct bpf_testmod_ops __bpf_testmod_ops = {
>   	.test_1 = bpf_testmod_test_1,
>   	.test_2 = bpf_testmod_test_2,
>   	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
> +	.test_refcounted = bpf_testmod_ops__test_refcounted,
>   };
>   
>   struct bpf_struct_ops bpf_bpf_testmod_ops = {
> diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> index fb7dff47597a..0e31586c1353 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> @@ -36,6 +36,8 @@ struct bpf_testmod_ops {
>   	/* Used to test nullable arguments. */
>   	int (*test_maybe_null)(int dummy, struct task_struct *task);
>   	int (*unsupported_ops)(void);
> +	/* Used to test referenced kptr (__ref) arguments. */
> +	int (*test_refcounted)(int dummy, struct task_struct *task);
>   
>   	/* The following fields are used to test shadow copies. */
>   	char onebyte;
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
> new file mode 100644
> index 000000000000..976df951b700
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
> @@ -0,0 +1,58 @@
> +#include <test_progs.h>
> +
> +#include "struct_ops_refcounted.skel.h"
> +#include "struct_ops_refcounted_fail__ref_leak.skel.h"
> +#include "struct_ops_refcounted_fail__global_subprog.skel.h"
> +
> +/* Test that the verifier accepts a program that first acquires a referenced
> + * kptr through context and then releases the reference
> + */
> +static void refcounted(void)
> +{
> +	struct struct_ops_refcounted *skel;
> +
> +	skel = struct_ops_refcounted__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
> +		return;
> +
> +	struct_ops_refcounted__destroy(skel);
> +}
> +
> +/* Test that the verifier rejects a program that acquires a referenced
> + * kptr through context without releasing the reference
> + */
> +static void refcounted_fail__ref_leak(void)
> +{
> +	struct struct_ops_refcounted_fail__ref_leak *skel;
> +
> +	skel = struct_ops_refcounted_fail__ref_leak__open_and_load();
> +	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
> +		return;
> +
> +	struct_ops_refcounted_fail__ref_leak__destroy(skel);
> +}
> +
> +/* Test that the verifier rejects a program that contains a global
> + * subprogram with referenced kptr arguments
> + */
> +static void refcounted_fail__global_subprog(void)
> +{
> +	struct struct_ops_refcounted_fail__global_subprog *skel;
> +
> +	skel = struct_ops_refcounted_fail__global_subprog__open_and_load();
> +	if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
> +		return;
> +
> +	struct_ops_refcounted_fail__global_subprog__destroy(skel);
> +}
> +
> +void test_struct_ops_refcounted(void)
> +{
> +	if (test__start_subtest("refcounted"))
> +		refcounted();
> +	if (test__start_subtest("refcounted_fail__ref_leak"))
> +		refcounted_fail__ref_leak();
> +	if (test__start_subtest("refcounted_fail__global_subprog"))
> +		refcounted_fail__global_subprog();
> +}
> +
> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
> new file mode 100644
> index 000000000000..2c1326668b92
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
> @@ -0,0 +1,67 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../bpf_testmod/bpf_testmod.h"
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern void bpf_task_release(struct task_struct *p) __ksym;
> +
> +/* This is a test BPF program that uses struct_ops to access a referenced
> + * kptr argument. This is a test for the verifier to ensure that it
> + * 1) recognizes the task as a referenced object (i.e., ref_obj_id > 0), and
> + * 2) the same reference can be acquired from multiple paths as long as it
> + *    has not been released.
> + *
> + * test_refcounted() is equivalent to the C code below. It is written in assembly
> + * to avoid reads from task (i.e., getting referenced kptrs to task) being merged
> + * into a single path by the compiler.
> + *
> + * int test_refcounted(int dummy, struct task_struct *task)
> + * {
> + *         if (dummy % 2)
> + *                 bpf_task_release(task);
> + *         else
> + *                 bpf_task_release(task);
> + *         return 0;
> + * }
> + */
> +SEC("struct_ops/test_refcounted")
> +int test_refcounted(unsigned long long *ctx)
> +{
> +	asm volatile ("					\
> +	/* r6 = dummy */				\
> +	r6 = *(u64 *)(r1 + 0x0);			\
> +	/* if (r6 & 0x1 != 0) */			\
> +	r6 &= 0x1;					\
> +	if r6 == 0 goto l0_%=;				\
> +	/* r1 = task */					\
> +	r1 = *(u64 *)(r1 + 0x8);			\
> +	call %[bpf_task_release];			\
> +	goto l1_%=;					\
> +l0_%=:	/* r1 = task */					\
> +	r1 = *(u64 *)(r1 + 0x8);			\
> +	call %[bpf_task_release];			\
> +l1_%=:	/* return 0 */					\
> +"	:
> +	: __imm(bpf_task_release)
> +	: __clobber_all);
> +	return 0;
> +}

You can use the clang nomerge attribute to prevent the two bpf_task_release(task) calls from being merged. For example,

$ cat t.c
struct task_struct {
         int a;
         int b;
         int d[20];
};


__attribute__((nomerge)) extern void bpf_task_release(struct task_struct *task);

int test_refcounted(int dummy, struct task_struct *task)
{
         if (dummy % 2)
                 bpf_task_release(task);
         else
                 bpf_task_release(task);
         return 0;
}

$ clang --version
clang version 19.1.5 (https://github.com/llvm/llvm-project.git ab4b5a2db582958af1ee308a790cfdb42bd24720)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/yhs/work/llvm-project/llvm/build.19/Release/bin
$ clang --target=bpf -O2 -mcpu=v3 -S t.c
$ cat t.s
         .text
         .file   "t.c"
         .globl  test_refcounted                 # -- Begin function test_refcounted
         .p2align        3
         .type   test_refcounted,@function
test_refcounted:                        # @test_refcounted
# %bb.0:
         w1 &= 1
         if w1 == 0 goto LBB0_2
# %bb.1:
         r1 = r2
         call bpf_task_release
         goto LBB0_3
LBB0_2:
         r1 = r2
         call bpf_task_release
LBB0_3:
         w0 = 0
         exit
.Lfunc_end0:
         .size   test_refcounted, .Lfunc_end0-test_refcounted
                                         # -- End function
         .addrsig
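
So the selftest could stay in plain C, e.g. (a sketch; the __success annotation
and redeclaring the __ksym with nomerge are assumptions):

	__attribute__((nomerge)) extern void bpf_task_release(struct task_struct *p) __ksym;

	SEC("struct_ops/test_refcounted")
	__success
	int BPF_PROG(test_refcounted, int dummy, struct task_struct *task)
	{
		/* nomerge keeps the two release calls on separate paths, so
		 * both reads of the __ref arg are exercised by the verifier.
		 */
		if (dummy % 2)
			bpf_task_release(task);
		else
			bpf_task_release(task);
		return 0;
	}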

> +
> +/* BTF FUNC records are not generated for kfuncs referenced
> + * from inline assembly. These records are necessary for
> + * libbpf to link the program. The function below is a hack
> + * to ensure that BTF FUNC records are generated.
> + */
> +void __btf_root(void)
> +{
> +	bpf_task_release(NULL);
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_refcounted = {
> +	.test_refcounted = (void *)test_refcounted,
> +};
> +
> +
> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> new file mode 100644
> index 000000000000..c7e84e63b053
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> @@ -0,0 +1,32 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../bpf_testmod/bpf_testmod.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern void bpf_task_release(struct task_struct *p) __ksym;
> +
> +__noinline int subprog_release(__u64 *ctx __arg_ctx)
> +{
> +	struct task_struct *task = (struct task_struct *)ctx[1];
> +	int dummy = (int)ctx[0];
> +
> +	bpf_task_release(task);
> +
> +	return dummy + 1;
> +}
> +
> +SEC("struct_ops/test_refcounted")
> +int test_refcounted(unsigned long long *ctx)
> +{
> +	struct task_struct *task = (struct task_struct *)ctx[1];
> +
> +	bpf_task_release(task);
> +
> +	return subprog_release(ctx);
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_ref_acquire = {
> +	.test_refcounted = (void *)test_refcounted,
> +};
> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> new file mode 100644
> index 000000000000..6e82859eb187
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> @@ -0,0 +1,17 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../bpf_testmod/bpf_testmod.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("struct_ops/test_refcounted")
> +int BPF_PROG(test_refcounted, int dummy,
> +	     struct task_struct *task)
> +{
> +	return 0;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_ref_acquire = {
> +	.test_refcounted = (void *)test_refcounted,
> +};


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-13 23:29 ` [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
  2024-12-18 17:11   ` Amery Hung
@ 2024-12-19  7:37   ` Martin KaFai Lau
  2024-12-20  0:32     ` Amery Hung
  1 sibling, 1 reply; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-19  7:37 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, ameryhung

On 12/13/24 3:29 PM, Amery Hung wrote:
> Add basic kfuncs for working on skb in qdisc.
> 
> Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> in .enqueue, where a to_free skb list is available from the kernel to defer
> the release. bpf_kfree_skb() should be used elsewhere. It is also used
> in bpf_obj_free_fields() when cleaning up skbs in maps and collections.
> 
> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> to build flow-based queueing algorithms.
> 
> Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> 
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>   net/sched/bpf_qdisc.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 76 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index a2e2db29e5fc..28959424eab0 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -106,6 +106,67 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
>   	return 0;
>   }
>   
> +__bpf_kfunc_start_defs();
> +
> +/* bpf_skb_get_hash - Get the flow hash of an skb.
> + * @skb: The skb to get the flow hash from.
> + */
> +__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
> +{
> +	return skb_get_hash(skb);
> +}
> +
> +/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
> + * @skb: The skb whose reference is to be released and dropped.
> + */
> +__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
> +{
> +	kfree_skb(skb);
> +}
> +
> +/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
> + * @skb: The skb whose reference is to be released and dropped.
> + * @to_free_list: The list of skbs to be dropped.
> + */
> +__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> +				    struct bpf_sk_buff_ptr *to_free_list)
> +{
> +	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +#define BPF_QDISC_KFUNC_xxx \
> +	BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
> +	BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
> +	BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
> +
> +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> +#define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
> +BPF_QDISC_KFUNC_xxx
> +#undef BPF_QDISC_KFUNC
> +BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> +
> +#define BPF_QDISC_KFUNC(name, _) BTF_ID_LIST_SINGLE(name##_ids, func, name)


> +BPF_QDISC_KFUNC_xxx
> +#undef BPF_QDISC_KFUNC
> +
> +static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{
> +	if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
> +		if (strcmp(prog->aux->attach_func_name, "enqueue"))

The kfunc is registered for all BPF_PROG_TYPE_STRUCT_OPS. Checking func_name 
alone is not enough, e.g. another future struct_ops may have the "enqueue" ops.

Checking the btf type of "struct Qdisc_ops" is better. Something like the 
following (untested):

diff --git i/include/linux/bpf.h w/include/linux/bpf.h
index c81ac98db439..cf3133f81e7f 100644
--- i/include/linux/bpf.h
+++ w/include/linux/bpf.h
@@ -1809,6 +1809,7 @@ struct bpf_struct_ops {
  	void *cfi_stubs;
  	struct module *owner;
  	const char *name;
+	const struct btf_type *type;
  	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
  };

diff --git i/kernel/bpf/bpf_struct_ops.c w/kernel/bpf/bpf_struct_ops.c
index d9e0af00580b..5c2ca5a84384 100644
--- i/kernel/bpf/bpf_struct_ops.c
+++ w/kernel/bpf/bpf_struct_ops.c
@@ -432,6 +432,8 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc 
*st_ops_desc,
  		goto errout;
  	}

+	st_ops->type = t;
+
  	return 0;

  errout:
diff --git i/net/sched/bpf_qdisc.c w/net/sched/bpf_qdisc.c
index 1caa9f696d2d..94e45ea59fef 100644
--- i/net/sched/bpf_qdisc.c
+++ w/net/sched/bpf_qdisc.c
@@ -250,6 +250,11 @@ BPF_QDISC_KFUNC_xxx

  static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
  {
+
+	if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
+						 prog->aux->attach_btf_id))
+		return -EACCES;
+
  	if (kfunc_id == bpf_qdisc_skb_drop_ids[0]) {
  		if (strcmp(prog->aux->attach_func_name, "enqueue"))
  			return -EACCES;


st_ops->type (and a few others) was refactored into bpf_struct_ops_desc when
adding the kernel module support. I think adding st_ops->type back should be enough.

Also, a bit of bike shedding here: looking at patch 7 and patch 8, which limit a
set of kfuncs to a particular ops, I think using btf_id_set_contains() is more
in line with other verifier usages.

BTF_SET_START(qdisc_enqueue_kfunc_set)
BTF_ID(func, bpf_qdisc_skb_drop)
BTF_ID(func, bpf_qdisc_watchdog_schedule)
BTF_SET_END(qdisc_enqueue_kfunc_set)

BTF_SET_START(qdisc_dequeue_kfunc_set)
BTF_ID(func, bpf_qdisc_bstats_update)
BTF_ID(func, bpf_qdisc_watchdog_schedule)
BTF_SET_END(qdisc_dequeue_kfunc_set)

BTF_SET_START(qdisc_common_kfunc_set)
BTF_ID(func, bpf_skb_get_hash)
BTF_ID(func, bpf_kfree_skb)
BTF_SET_END(qdisc_common_kfunc_set)
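
With these sets, the filter could then look roughly like this (untested
sketch; it assumes bpf_dynptr_from_skb also moves into
qdisc_common_kfunc_set and that the bpf_Qdisc_ops.type check above is in
place):

static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
						 prog->aux->attach_btf_id))
		return -EACCES;

	/* Kfuncs usable from any Qdisc_ops op. */
	if (btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id))
		return 0;

	/* Op-specific kfuncs must match the attach point. */
	if (!strcmp(prog->aux->attach_func_name, "enqueue"))
		return btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id) ?
		       0 : -EACCES;

	if (!strcmp(prog->aux->attach_func_name, "dequeue"))
		return btf_id_set_contains(&qdisc_dequeue_kfunc_set, kfunc_id) ?
		       0 : -EACCES;

	/* Any other op only gets the common set. */
	return -EACCES;
}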

> +			return -EACCES;
> +
> +	return 0;
> +}
> +
> +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &bpf_qdisc_kfunc_ids,
> +	.filter = bpf_qdisc_kfunc_filter,
> +};
> +
>   static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
>   	.get_func_proto		= bpf_qdisc_get_func_proto,
>   	.is_valid_access	= bpf_qdisc_is_valid_access,
> @@ -209,6 +270,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
>   
>   static int __init bpf_qdisc_kfunc_init(void)
>   {
> -	return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +	int ret;
> +	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> +		{
> +			.btf_id       = bpf_sk_buff_ids[0],
> +			.kfunc_btf_id = bpf_kfree_skb_ids[0]
> +		},
> +	};
> +
> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> +	ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> +						 ARRAY_SIZE(skb_kfunc_dtors),
> +						 THIS_MODULE);
> +	ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +
> +	return ret;
>   }
>   late_initcall(bpf_qdisc_kfunc_init);



* Re: [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-19  3:40   ` Yonghong Song
@ 2024-12-19 20:49     ` Amery Hung
  0 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-19 20:49 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Amery Hung, netdev, bpf, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On Wed, Dec 18, 2024 at 7:41 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
> > Test referenced kptr acquired through struct_ops argument tagged with
> > "__ref". The success case checks whether 1) a reference to the correct
> > type is acquired, and 2) the referenced kptr argument can be accessed in
> > multiple paths as long as it hasn't been released. In the fail cases,
> > we first confirm that a referenced kptr acquired through a struct_ops
> > argument is not allowed to be leaked. Then, we make sure this new
> > referenced kptr acquiring mechanism does not accidentally allow referenced
> > kptrs to flow into global subprograms through their arguments.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  7 ++
> >   .../selftests/bpf/bpf_testmod/bpf_testmod.h   |  2 +
> >   .../prog_tests/test_struct_ops_refcounted.c   | 58 ++++++++++++++++
> >   .../bpf/progs/struct_ops_refcounted.c         | 67 +++++++++++++++++++
> >   ...ruct_ops_refcounted_fail__global_subprog.c | 32 +++++++++
> >   .../struct_ops_refcounted_fail__ref_leak.c    | 17 +++++
> >   6 files changed, 183 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> >   create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > index 987d41af71d2..244234546ae2 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > @@ -1135,10 +1135,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
> >       return 0;
> >   }
> >
> > +static int bpf_testmod_ops__test_refcounted(int dummy,
> > +                                         struct task_struct *task__ref)
> > +{
> > +     return 0;
> > +}
> > +
> >   static struct bpf_testmod_ops __bpf_testmod_ops = {
> >       .test_1 = bpf_testmod_test_1,
> >       .test_2 = bpf_testmod_test_2,
> >       .test_maybe_null = bpf_testmod_ops__test_maybe_null,
> > +     .test_refcounted = bpf_testmod_ops__test_refcounted,
> >   };
> >
> >   struct bpf_struct_ops bpf_bpf_testmod_ops = {
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > index fb7dff47597a..0e31586c1353 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
> > @@ -36,6 +36,8 @@ struct bpf_testmod_ops {
> >       /* Used to test nullable arguments. */
> >       int (*test_maybe_null)(int dummy, struct task_struct *task);
> >       int (*unsupported_ops)(void);
> > +     /* Used to test ref_acquired arguments. */
> > +     int (*test_refcounted)(int dummy, struct task_struct *task);
> >
> >       /* The following fields are used to test shadow copies. */
> >       char onebyte;
> > diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
> > new file mode 100644
> > index 000000000000..976df951b700
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
> > @@ -0,0 +1,58 @@
> > +#include <test_progs.h>
> > +
> > +#include "struct_ops_refcounted.skel.h"
> > +#include "struct_ops_refcounted_fail__ref_leak.skel.h"
> > +#include "struct_ops_refcounted_fail__global_subprog.skel.h"
> > +
> > +/* Test that the verifier accepts a program that first acquires a referenced
> > + * kptr through context and then releases the reference
> > + */
> > +static void refcounted(void)
> > +{
> > +     struct struct_ops_refcounted *skel;
> > +
> > +     skel = struct_ops_refcounted__open_and_load();
> > +     if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
> > +             return;
> > +
> > +     struct_ops_refcounted__destroy(skel);
> > +}
> > +
> > +/* Test that the verifier rejects a program that acquires a referenced
> > + * kptr through context without releasing the reference
> > + */
> > +static void refcounted_fail__ref_leak(void)
> > +{
> > +     struct struct_ops_refcounted_fail__ref_leak *skel;
> > +
> > +     skel = struct_ops_refcounted_fail__ref_leak__open_and_load();
> > +     if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
> > +             return;
> > +
> > +     struct_ops_refcounted_fail__ref_leak__destroy(skel);
> > +}
> > +
> > +/* Test that the verifier rejects a program that contains a global
> > + * subprogram with referenced kptr arguments
> > + */
> > +static void refcounted_fail__global_subprog(void)
> > +{
> > +     struct struct_ops_refcounted_fail__global_subprog *skel;
> > +
> > +     skel = struct_ops_refcounted_fail__global_subprog__open_and_load();
> > +     if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
> > +             return;
> > +
> > +     struct_ops_refcounted_fail__global_subprog__destroy(skel);
> > +}
> > +
> > +void test_struct_ops_refcounted(void)
> > +{
> > +     if (test__start_subtest("refcounted"))
> > +             refcounted();
> > +     if (test__start_subtest("refcounted_fail__ref_leak"))
> > +             refcounted_fail__ref_leak();
> > +     if (test__start_subtest("refcounted_fail__global_subprog"))
> > +             refcounted_fail__global_subprog();
> > +}
> > +
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
> > new file mode 100644
> > index 000000000000..2c1326668b92
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
> > @@ -0,0 +1,67 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../bpf_testmod/bpf_testmod.h"
> > +#include "bpf_misc.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +extern void bpf_task_release(struct task_struct *p) __ksym;
> > +
> > +/* This is a test BPF program that uses struct_ops to access a referenced
> > + * kptr argument. This is a test for the verifier to ensure that it
> > + * 1) recognizes the task as a referenced object (i.e., ref_obj_id > 0), and
> > + * 2) the same reference can be acquired from multiple paths as long as it
> > + *    has not been released.
> > + *
> > + * test_refcounted() is equivalent to the C code below. It is written in assembly
> > + * to avoid reads from task (i.e., getting referenced kptrs to task) being merged
> > + * into a single path by the compiler.
> > + *
> > + * int test_refcounted(int dummy, struct task_struct *task)
> > + * {
> > + *         if (dummy % 2)
> > + *                 bpf_task_release(task);
> > + *         else
> > + *                 bpf_task_release(task);
> > + *         return 0;
> > + * }
> > + */
> > +SEC("struct_ops/test_refcounted")
> > +int test_refcounted(unsigned long long *ctx)
> > +{
> > +     asm volatile ("                                 \
> > +     /* r6 = dummy */                                \
> > +     r6 = *(u64 *)(r1 + 0x0);                        \
> > +     /* if (r6 & 0x1 != 0) */                        \
> > +     r6 &= 0x1;                                      \
> > +     if r6 == 0 goto l0_%=;                          \
> > +     /* r1 = task */                                 \
> > +     r1 = *(u64 *)(r1 + 0x8);                        \
> > +     call %[bpf_task_release];                       \
> > +     goto l1_%=;                                     \
> > +l0_%=:       /* r1 = task */                                 \
> > +     r1 = *(u64 *)(r1 + 0x8);                        \
> > +     call %[bpf_task_release];                       \
> > +l1_%=:       /* return 0 */                                  \
> > +"    :
> > +     : __imm(bpf_task_release)
> > +     : __clobber_all);
> > +     return 0;
> > +}
>
> You can use clang nomerge attribute to prevent bpf_task_release(task) merging. For example,
>

Thanks for the info! That simplifies this test a lot. I will change it
in the next version.
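
With nomerge, the whole test should reduce to plain C, roughly (untested
sketch of what the next version could look like):

__attribute__((nomerge)) extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("struct_ops/test_refcounted")
int BPF_PROG(test_refcounted, int dummy, struct task_struct *task)
{
	if (dummy % 2)
		bpf_task_release(task);
	else
		bpf_task_release(task);
	return 0;
}

That should also let me drop the __btf_root() hack, since the kfunc is
then referenced from C rather than from inline assembly.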

> $ cat t.c
> struct task_struct {
>          int a;
>          int b;
>          int d[20];
> };
>
>
> __attribute__((nomerge)) extern void bpf_task_release(struct task_struct *task);
>
> int test_refcounted(int dummy, struct task_struct *task)
> {
>          if (dummy % 2)
>                  bpf_task_release(task);
>          else
>                  bpf_task_release(task);
>          return 0;
> }
>
> $ clang --version
> clang version 19.1.5 (https://github.com/llvm/llvm-project.git ab4b5a2db582958af1ee308a790cfdb42bd24720)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> InstalledDir: /home/yhs/work/llvm-project/llvm/build.19/Release/bin
> $ clang --target=bpf -O2 -mcpu=v3 -S t.c
> $ cat t.s
>          .text
>          .file   "t.c"
>          .globl  test_refcounted                 # -- Begin function test_refcounted
>          .p2align        3
>          .type   test_refcounted,@function
> test_refcounted:                        # @test_refcounted
> # %bb.0:
>          w1 &= 1
>          if w1 == 0 goto LBB0_2
> # %bb.1:
>          r1 = r2
>          call bpf_task_release
>          goto LBB0_3
> LBB0_2:
>          r1 = r2
>          call bpf_task_release
> LBB0_3:
>          w0 = 0
>          exit
> .Lfunc_end0:
>          .size   test_refcounted, .Lfunc_end0-test_refcounted
>                                          # -- End function
>          .addrsig
>
> > +
> > +/* BTF FUNC records are not generated for kfuncs referenced
> > + * from inline assembly. These records are necessary for
> > + * libbpf to link the program. The function below is a hack
> > + * to ensure that BTF FUNC records are generated.
> > + */
> > +void __btf_root(void)
> > +{
> > +     bpf_task_release(NULL);
> > +}
> > +
> > +SEC(".struct_ops.link")
> > +struct bpf_testmod_ops testmod_refcounted = {
> > +     .test_refcounted = (void *)test_refcounted,
> > +};
> > +
> > +
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> > new file mode 100644
> > index 000000000000..c7e84e63b053
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> > @@ -0,0 +1,32 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../bpf_testmod/bpf_testmod.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +extern void bpf_task_release(struct task_struct *p) __ksym;
> > +
> > +__noinline int subprog_release(__u64 *ctx __arg_ctx)
> > +{
> > +     struct task_struct *task = (struct task_struct *)ctx[1];
> > +     int dummy = (int)ctx[0];
> > +
> > +     bpf_task_release(task);
> > +
> > +     return dummy + 1;
> > +}
> > +
> > +SEC("struct_ops/test_refcounted")
> > +int test_refcounted(unsigned long long *ctx)
> > +{
> > +     struct task_struct *task = (struct task_struct *)ctx[1];
> > +
> > +     bpf_task_release(task);
> > +
> > +     return subprog_release(ctx);
> > +}
> > +
> > +SEC(".struct_ops.link")
> > +struct bpf_testmod_ops testmod_ref_acquire = {
> > +     .test_refcounted = (void *)test_refcounted,
> > +};
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> > new file mode 100644
> > index 000000000000..6e82859eb187
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
> > @@ -0,0 +1,17 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../bpf_testmod/bpf_testmod.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +SEC("struct_ops/test_refcounted")
> > +int BPF_PROG(test_refcounted, int dummy,
> > +          struct task_struct *task)
> > +{
> > +     return 0;
> > +}
> > +
> > +SEC(".struct_ops.link")
> > +struct bpf_testmod_ops testmod_ref_acquire = {
> > +     .test_refcounted = (void *)test_refcounted,
> > +};
>


* Re: [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-18 16:57     ` Amery Hung
@ 2024-12-19 23:06       ` Martin KaFai Lau
  0 siblings, 0 replies; 35+ messages in thread
From: Martin KaFai Lau @ 2024-12-19 23:06 UTC (permalink / raw)
  To: Amery Hung
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On 12/18/24 8:57 AM, Amery Hung wrote:
>> At first glance, the ref_obj_id assignment looks racy because ctx_arg_info
>> is shared by different bpf progs that may be verified in parallel. On second
>> thought, this should be fine because it should always end up having the same
>> ref_obj_id for the same arg-no, right? Not sure if KCSAN can understand this
>> without using READ/WRITE_ONCE, but adding READ/WRITE_ONCE when using
>> ref_obj_id will be quite puzzling when reading the verifier code. Any better idea?
>>
> It looks like ref_obj_id cannot be reused (id always comes from
> ++env->id_gen), and these will be the earliest references to acquire.
> So, maybe we can assume the ref_obj_id without needing to store it in
> ctx_arg_info? E.g., the first __ref argument's ref_obj_id is always 1.

That seems reasonable to me. Then ctx_arg_info can stay read-only after the very 
first initialization during bpf_struct_ops_desc_init().
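
i.e. something like this (purely illustrative, not actual verifier code;
the helper name is made up):

/* The i-th __ref ctx arg is the i-th reference the prog acquires, and
 * ref ids come from ++env->id_gen starting at 0, so the first acquired
 * reference always gets id 1.
 */
static u32 ref_ctx_arg_obj_id(int nth_ref_arg)
{
	return nth_ref_arg + 1;	/* 0-based: first __ref arg => ref_obj_id 1 */
}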


* Re: [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-19  7:37   ` Martin KaFai Lau
@ 2024-12-20  0:32     ` Amery Hung
  0 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-20  0:32 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On Wed, Dec 18, 2024 at 11:37 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
> > Add basic kfuncs for working on skb in qdisc.
> >
> > Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > in .enqueue, where a to_free skb list is available from the kernel to defer
> > the release. bpf_kfree_skb() should be used elsewhere. It is also used
> > in bpf_obj_free_fields() when cleaning up skbs in maps and collections.
> >
> > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > to build flow-based queueing algorithms.
> >
> > Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   net/sched/bpf_qdisc.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> > index a2e2db29e5fc..28959424eab0 100644
> > --- a/net/sched/bpf_qdisc.c
> > +++ b/net/sched/bpf_qdisc.c
> > @@ -106,6 +106,67 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
> >       return 0;
> >   }
> >
> > +__bpf_kfunc_start_defs();
> > +
> > +/* bpf_skb_get_hash - Get the flow hash of an skb.
> > + * @skb: The skb to get the flow hash from.
> > + */
> > +__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
> > +{
> > +     return skb_get_hash(skb);
> > +}
> > +
> > +/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
> > + * @skb: The skb whose reference is to be released and dropped.
> > + */
> > +__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
> > +{
> > +     kfree_skb(skb);
> > +}
> > +
> > +/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
> > + * @skb: The skb whose reference is to be released and dropped.
> > + * @to_free_list: The list of skbs to be dropped.
> > + */
> > +__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> > +                                 struct bpf_sk_buff_ptr *to_free_list)
> > +{
> > +     __qdisc_drop(skb, (struct sk_buff **)to_free_list);
> > +}
> > +
> > +__bpf_kfunc_end_defs();
> > +
> > +#define BPF_QDISC_KFUNC_xxx \
> > +     BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
> > +     BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
> > +     BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
> > +
> > +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> > +#define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
> > +BPF_QDISC_KFUNC_xxx
> > +#undef BPF_QDISC_KFUNC
> > +BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
> > +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> > +
> > +#define BPF_QDISC_KFUNC(name, _) BTF_ID_LIST_SINGLE(name##_ids, func, name)
>
>
> > +BPF_QDISC_KFUNC_xxx
> > +#undef BPF_QDISC_KFUNC
> > +
> > +static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> > +{
> > +     if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
> > +             if (strcmp(prog->aux->attach_func_name, "enqueue"))
>
> The kfunc is registered for all BPF_PROG_TYPE_STRUCT_OPS. Checking func_name
> alone is not enough, e.g. another future struct_ops may have the "enqueue" ops.
>
> Checking the btf type of "struct Qdisc_ops" is better. Something like the
> following (untested):
>

Got it. I will add a struct_ops type check in the filter.

> diff --git i/include/linux/bpf.h w/include/linux/bpf.h
> index c81ac98db439..cf3133f81e7f 100644
> --- i/include/linux/bpf.h
> +++ w/include/linux/bpf.h
> @@ -1809,6 +1809,7 @@ struct bpf_struct_ops {
>         void *cfi_stubs;
>         struct module *owner;
>         const char *name;
> +       const struct btf_type *type;
>         struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
>   };
>
> diff --git i/kernel/bpf/bpf_struct_ops.c w/kernel/bpf/bpf_struct_ops.c
> index d9e0af00580b..5c2ca5a84384 100644
> --- i/kernel/bpf/bpf_struct_ops.c
> +++ w/kernel/bpf/bpf_struct_ops.c
> @@ -432,6 +432,8 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc
> *st_ops_desc,
>                 goto errout;
>         }
>
> +       st_ops->type = t;
> +
>         return 0;
>
>   errout:
> diff --git i/net/sched/bpf_qdisc.c w/net/sched/bpf_qdisc.c
> index 1caa9f696d2d..94e45ea59fef 100644
> --- i/net/sched/bpf_qdisc.c
> +++ w/net/sched/bpf_qdisc.c
> @@ -250,6 +250,11 @@ BPF_QDISC_KFUNC_xxx
>
>   static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
>   {
> +
> +       if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
> +                                                prog->aux->attach_btf_id))
> +               return -EACCES;
> +
>         if (kfunc_id == bpf_qdisc_skb_drop_ids[0]) {
>                 if (strcmp(prog->aux->attach_func_name, "enqueue"))
>                         return -EACCES;
>
>
> st_ops->type (and a few others) was refactored into bpf_struct_ops_desc when
> adding the kernel module support. I think adding st_ops->type back should be enough.
>
> Also, a bit of bike shedding here: looking at patch 7 and patch 8, which limit a
> set of kfuncs to a particular ops, I think using btf_id_set_contains() is more
> in line with other verifier usages.
>
> BTF_SET_START(qdisc_enqueue_kfunc_set)
> BTF_ID(func, bpf_qdisc_skb_drop)
> BTF_ID(func, bpf_qdisc_watchdog_schedule)
> BTF_SET_END(qdisc_enqueue_kfunc_set)
>
> BTF_SET_START(qdisc_dequeue_kfunc_set)
> BTF_ID(func, bpf_qdisc_bstats_update)
> BTF_ID(func, bpf_qdisc_watchdog_schedule)
> BTF_SET_END(qdisc_dequeue_kfunc_set)
>
> BTF_SET_START(qdisc_common_kfunc_set)
> BTF_ID(func, bpf_skb_get_hash)
> BTF_ID(func, bpf_kfree_skb)
> BTF_SET_END(qdisc_common_kfunc_set)
>

I will change the style of the per-ops kfunc availability check to the
one you suggested.
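
For reference, the user-side view of these kfuncs in an enqueue looks
roughly like this (abridged sketch; the fifo/fq selftests later in the
series are the authoritative examples):

#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01

SEC("struct_ops/bpf_fifo_enqueue")
int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	u32 hash = bpf_skb_get_hash(skb);	/* flow key, if classifying */

	if (sch->q.qlen >= sch->limit) {
		/* Inside .enqueue, drop via the deferred free list. */
		bpf_qdisc_skb_drop(skb, to_free);
		return NET_XMIT_DROP;
	}

	/* ... stash the skb kptr, e.g. in a map value, keyed by hash ... */
	return NET_XMIT_SUCCESS;
}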

Thanks,
Amery

> > +                     return -EACCES;
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> > +     .owner = THIS_MODULE,
> > +     .set   = &bpf_qdisc_kfunc_ids,
> > +     .filter = bpf_qdisc_kfunc_filter,
> > +};
> > +
> >   static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> >       .get_func_proto         = bpf_qdisc_get_func_proto,
> >       .is_valid_access        = bpf_qdisc_is_valid_access,
> > @@ -209,6 +270,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
> >
> >   static int __init bpf_qdisc_kfunc_init(void)
> >   {
> > -     return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > +     int ret;
> > +     const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> > +             {
> > +                     .btf_id       = bpf_sk_buff_ids[0],
> > +                     .kfunc_btf_id = bpf_kfree_skb_ids[0]
> > +             },
> > +     };
> > +
> > +     ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> > +     ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> > +                                              ARRAY_SIZE(skb_kfunc_dtors),
> > +                                              THIS_MODULE);
> > +     ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > +
> > +     return ret;
> >   }
> >   late_initcall(bpf_qdisc_kfunc_init);
>


* Re: [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer
  2024-12-19  1:16   ` Martin KaFai Lau
@ 2024-12-20 19:24     ` Amery Hung
  0 siblings, 0 replies; 35+ messages in thread
From: Amery Hung @ 2024-12-20 19:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, netdev, daniel, andrii, alexei.starovoitov,
	martin.lau, sinquersw, toke, jhs, jiri, stfomichev,
	ekarani.silvestre, yangpeihao, xiyou.wangcong, yepeilin.cs

On Wed, Dec 18, 2024 at 5:16 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/13/24 3:29 PM, Amery Hung wrote:
> > Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
> > the execution of the qdisc through the kfunc bpf_qdisc_watchdog_schedule().
> > It can be useful for building traffic shaping scheduling algorithms, where
> > the time the next packet will be dequeued is known.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   include/net/sch_generic.h |  4 +++
> >   net/sched/bpf_qdisc.c     | 51 ++++++++++++++++++++++++++++++++++++++-
> >   net/sched/sch_api.c       | 11 +++++++++
> >   net/sched/sch_generic.c   |  8 ++++++
> >   4 files changed, 73 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 5d74fa7e694c..6a252b1b0680 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -1357,4 +1357,8 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
> >               msleep(1);
> >   }
> >
> > +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
> > +void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
> > +void bpf_qdisc_reset_post_op(struct Qdisc *sch);
> > +
> >   #endif
> > diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> > index 28959424eab0..7c155207fe1e 100644
> > --- a/net/sched/bpf_qdisc.c
> > +++ b/net/sched/bpf_qdisc.c
> > @@ -8,6 +8,10 @@
> >
> >   static struct bpf_struct_ops bpf_Qdisc_ops;
> >
> > +struct bpf_sched_data {
> > +     struct qdisc_watchdog watchdog;
> > +};
> > +
> >   struct bpf_sk_buff_ptr {
> >       struct sk_buff *skb;
> >   };
> > @@ -17,6 +21,32 @@ static int bpf_qdisc_init(struct btf *btf)
> >       return 0;
> >   }
> >
> > +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
> > +                       struct netlink_ext_ack *extack)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +
> > +     qdisc_watchdog_init(&q->watchdog, sch);
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(bpf_qdisc_init_pre_op);
> > +
> > +void bpf_qdisc_reset_post_op(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +
> > +     qdisc_watchdog_cancel(&q->watchdog);
> > +}
> > +EXPORT_SYMBOL(bpf_qdisc_reset_post_op);
> > +
> > +void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +
> > +     qdisc_watchdog_cancel(&q->watchdog);
> > +}
> > +EXPORT_SYMBOL(bpf_qdisc_destroy_post_op);
>
> These feel like the candidates for the ".gen_prologue" and ".gen_epilogue". Then
> the changes to sch_api.c is not needed.
>

I will switch to gen_prologue and gen_epilogue in the next version.
Thank you so much for working on this.
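
Something like this, perhaps (untested sketch; it assumes the init
helper is reworked to take only the Qdisc pointer, and glosses over the
prologue calling convention details):

static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
				  const struct bpf_prog *prog)
{
	struct bpf_insn *insn = insn_buf;

	if (strcmp(prog->aux->attach_func_name, "init"))
		return 0;

	/* Save ctx, call the watchdog init helper with sch, restore ctx. */
	*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);	/* sch */
	*insn++ = BPF_EMIT_CALL(bpf_qdisc_init_pre_op);
	*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
	*insn++ = prog->insnsi[0];	/* first insn of the original prog */

	return insn - insn_buf;
}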

> > +
> >   static const struct bpf_func_proto *
> >   bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
> >                        const struct bpf_prog *prog)
> > @@ -134,12 +164,25 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> >       __qdisc_drop(skb, (struct sk_buff **)to_free_list);
> >   }
> >
> > +/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
> > + * @sch: The qdisc to be scheduled.
> > + * @expire: The expiry time of the timer.
> > + * @delta_ns: The slack range of the timer.
> > + */
> > +__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +
> > +     qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
> > +}
> > +
> >   __bpf_kfunc_end_defs();
> >
> >   #define BPF_QDISC_KFUNC_xxx \
> >       BPF_QDISC_KFUNC(bpf_skb_get_hash, KF_TRUSTED_ARGS) \
> >       BPF_QDISC_KFUNC(bpf_kfree_skb, KF_RELEASE) \
> >       BPF_QDISC_KFUNC(bpf_qdisc_skb_drop, KF_RELEASE) \
> > +     BPF_QDISC_KFUNC(bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS) \
> >
> >   BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> >   #define BPF_QDISC_KFUNC(name, flag) BTF_ID_FLAGS(func, name, flag)
> > @@ -154,9 +197,14 @@ BPF_QDISC_KFUNC_xxx
> >
> >   static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> >   {
> > -     if (kfunc_id == bpf_qdisc_skb_drop_ids[0])
> > +     if (kfunc_id == bpf_qdisc_skb_drop_ids[0]) {
> >               if (strcmp(prog->aux->attach_func_name, "enqueue"))
> >                       return -EACCES;
> > +     } else if (kfunc_id == bpf_qdisc_watchdog_schedule_ids[0]) {
> > +             if (strcmp(prog->aux->attach_func_name, "enqueue") &&
> > +                 strcmp(prog->aux->attach_func_name, "dequeue"))
> > +                     return -EACCES;
> > +     }
> >
> >       return 0;
> >   }
> > @@ -189,6 +237,7 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
> >       case offsetof(struct Qdisc_ops, priv_size):
> >               if (uqdisc_ops->priv_size)
> >                       return -EINVAL;
> > +             qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
>
> ah. ok. The priv_size case is still needed.
>
>
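
As an aside, the intended use from the BPF side would be roughly
(sketch; next_tx_time is a hypothetical global maintained by the
program's enqueue path):

u64 next_tx_time;

SEC("struct_ops/bpf_shaper_dequeue")
struct sk_buff *BPF_PROG(bpf_shaper_dequeue, struct Qdisc *sch)
{
	u64 now = bpf_ktime_get_ns();

	if (now < next_tx_time) {
		/* Not due yet: re-arm so dequeue runs again at next_tx_time. */
		bpf_qdisc_watchdog_schedule(sch, next_tx_time, 0);
		return NULL;
	}

	/* ... pop and return the head skb ... */
	return NULL;
}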


Thread overview: 35+ messages
2024-12-13 23:29 [PATCH bpf-next v1 00/13] bpf qdisc Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 01/13] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
2024-12-18  0:58   ` Martin KaFai Lau
2024-12-18  1:24     ` Alexei Starovoitov
2024-12-18 16:09       ` Amery Hung
2024-12-18 17:20         ` Alexei Starovoitov
2024-12-18  1:44     ` Jakub Kicinski
2024-12-18 16:57     ` Amery Hung
2024-12-19 23:06       ` Martin KaFai Lau
2024-12-13 23:29 ` [PATCH bpf-next v1 02/13] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
2024-12-18  1:17   ` Martin KaFai Lau
2024-12-18 16:10     ` Amery Hung
2024-12-19  3:40   ` Yonghong Song
2024-12-19 20:49     ` Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 03/13] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
2024-12-18 22:29   ` Martin KaFai Lau
2024-12-13 23:29 ` [PATCH bpf-next v1 04/13] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 05/13] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
2024-12-14  4:51   ` Cong Wang
2024-12-18 23:37   ` Martin KaFai Lau
2024-12-13 23:29 ` [PATCH bpf-next v1 06/13] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
2024-12-18 17:11   ` Amery Hung
2024-12-19  7:37   ` Martin KaFai Lau
2024-12-20  0:32     ` Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 07/13] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
2024-12-19  1:16   ` Martin KaFai Lau
2024-12-20 19:24     ` Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 08/13] bpf: net_sched: Support updating bstats Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 09/13] bpf: net_sched: Support updating qstats Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 10/13] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 11/13] libbpf: Support creating and destroying qdisc Amery Hung
2024-12-17 18:32   ` Andrii Nakryiko
2024-12-17 19:08     ` Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 12/13] selftests: Add a basic fifo qdisc test Amery Hung
2024-12-13 23:29 ` [PATCH bpf-next v1 13/13] selftests: Add a bpf fq qdisc to selftest Amery Hung
