* [RFC PATCH v8 00/20] bpf qdisc
@ 2024-05-10 19:23 Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs Amery Hung
` (19 more replies)
0 siblings, 20 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This is v8 of the bpf qdisc patchset. While I would like to do more
testing and performance evaluation, I think posting it now may help
discussions in the upcoming LSF/MM/BPF.
* Overview *
This series supports implementing a qdisc using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement the core qdisc logic in bpf and takes care of
the mundane parts for them. In addition, the ability to easily communicate
between the qdisc and other components will open up opportunities for
new applications and optimizations.
After the discussion on the previous patchset [0], we switched to
struct_ops to take advantage of its benefits and avoid introducing a new
abstraction to users. In addition, three changes to bpf are introduced to
make bpf qdisc easier to program and more performant.
* struct_ops changes *
To make struct_ops work better with bpf qdisc, two new changes are
introduced to bpf specifically for struct_ops programs. First, we
introduce a "__ref_acquired" suffix for arguments in stub functions [1] in
patches 1-2. It allows Qdisc_ops->enqueue to acquire a referenced kptr
to an skb exactly once. Through the reference tracking mechanism in the
verifier, we can make sure that the acquired skb will either be enqueued
or dropped, and that no duplicate references can be acquired.
Then, we allow a referenced kptr to be returned from a struct_ops program
(i.e., the reference is allowed to "leak" through the return value) so
that an skb can be returned naturally from dequeue. This is done and
tested in patches 3 and 4.
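On the enqueue side, this enables programs roughly like the sketch below
(illustrative only: the exact Qdisc_ops signature exposed to bpf and the
drop kfunc come from later patches in the series; bpf_qdisc_skb_drop()
is an assumed name here):

  SEC("struct_ops/bpf_fifo_enqueue")
  int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
               struct sk_buff **to_free)
  {
          /* skb enters the program as a referenced kptr (ref_obj_id > 0).
           * The verifier rejects the program unless the skb is either
           * enqueued into a collection or released (dropped) on every
           * path.
           */
          if (sch->q.qlen >= sch->limit) {
                  bpf_qdisc_skb_drop(skb, to_free); /* assumed drop kfunc */
                  return 1; /* NET_XMIT_DROP */
          }
          /* ... push the skb into a bpf list/rbtree, see next section ... */
          return 0; /* NET_XMIT_SUCCESS */
  }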
* Support adding skb to bpf graph *
Allowing users to enqueue an skb directly into a bpf collection improves
both the programming experience and the performance of qdiscs. In the
previous patchset (v7), the user had to allocate a local object, exchange
an skb kptr into the object, and then add the object to a collection
during enqueue. The memory allocation in the fast path hurt performance.
To allow adding an skb to a collection, we first introduce support for
adding kernel objects to bpf lists and rbtrees (patches 5-8). Then, we
introduce exclusive-ownership graph nodes so that 1) an rb node fits into
an skb, and 2) a list node and an rb node can coexist in a union in the
skb (patches 9-12).
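Roughly, the enqueue fast path changes as follows (an illustrative
fragment: locking and error handling are omitted, and the name of the
list node field embedded in struct sk_buff ("bpf_list" below) is an
assumption, introduced by the exclusive-node patches):

  /* v7: wrap the skb in a local object before queueing */
  struct skb_node {
          struct sk_buff __kptr *skb;
          struct bpf_list_node node;
  };

  struct skb_node *n = bpf_obj_new(typeof(*n)); /* alloc in the fast path */
  if (n) {
          skb = bpf_kptr_xchg(&n->skb, skb);
          bpf_list_push_back(&q, &n->node);
  }

  /* v8: queue the skb itself through its embedded node */
  bpf_list_push_back(&q, &skb->bpf_list);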
We evaluated the benefit of direct skb queueing by comparing the
throughput of simple fifo qdiscs implemented with the v7 and v8
patchsets. Both qdiscs use a bpf list as the fifo. The v8 fifo is
included in the selftests. The v7 fifo is identical in terms of queueing
logic, but requires additional bpf_obj_new() and bpf_kptr_xchg() calls to
enqueue a local object containing an skb kptr. The test uses iperf3 to
send and receive traffic over the qdisc attached to the loopback device
for 1 minute, and we repeated it five times. The result is shown below:
                                    Average throughput   stdev
  fifo with indirect queueing (v7)  40.4 Gbps            0.91 Gbps
  fifo with direct queueing (v8)    43.5 Gbps            0.24 Gbps
This part of the patchset (patches 5-12) is less tested and the approach
may be over-complicated, so I would especially like to gather feedback on
it before going further.
* Miscellaneous notes *
Finally, this patchset is based on
34c58c89feb3 (Merge branch 'gve-ring-size-changes') in net-next.
The fq example in the selftests requires bpf support for exchanging a
kptr into an allocated object (local kptr), which Dave Marchevsky
developed and sent to me as an off-list patchset.
Todo:
- Add more bpf testcases
- Add testcases for bpf_skb_tc_classify and other qdisc ops
- Add kfunc access control
- Add support for statistics
- Remove the requirement of explicit skb->dev restoration
- Look into more ops in Qdisc_ops
- Support updating Qdisc_ops
[0] https://lore.kernel.org/netdev/cover.1705432850.git.amery.hung@bytedance.com/
---
v8: Implement support of bpf qdisc using struct_ops
Allow struct_ops to acquire referenced kptr via argument
Allow struct_ops to release and return referenced kptr
Support enqueuing sk_buff to bpf_rbtree/list
Move examples from samples to selftests
Add a classful qdisc selftest
v7: Reference skb using kptr to sk_buff instead of __sk_buff
Use the new bpf rbtree/list for skb queues
Add reset and init programs
Add a bpf fq qdisc sample
Add a bpf netem qdisc sample
v6: switch to kptr based approach
v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
implement flow map as map-in-map
rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
clean up eBPF qdisc program context
v4: get rid of PIFO, use rbtree directly
v3: move priority queue from sch_bpf to skb map
introduce skb map and its helpers
introduce bpf_skb_classify()
use netdevice notifier to reset skb's
Rebase on latest bpf-next
v2: Rebase on latest net-next
Make the code more complete (but still incomplete)
Amery Hung (20):
bpf: Support passing referenced kptr to struct_ops programs
selftests/bpf: Test referenced kptr arguments of struct_ops programs
bpf: Allow struct_ops prog to return referenced kptr
selftests/bpf: Test returning kptr from struct_ops programs
bpf: Generate btf_struct_metas for kernel BTF
bpf: Recognize kernel types as graph values
bpf: Allow adding kernel objects to collections
selftests/bpf: Test adding kernel object to bpf graph
bpf: Find special BTF fields in union
bpf: Introduce exclusive-ownership list and rbtree nodes
bpf: Allow adding exclusive nodes to bpf list and rbtree
selftests/bpf: Modify linked_list tests to work with macro-ified
removes
bpf: net_sched: Support implementation of Qdisc_ops in bpf
bpf: net_sched: Add bpf qdisc kfuncs
bpf: net_sched: Allow more optional methods in Qdisc_ops
libbpf: Support creating and destroying qdisc
selftests: Add a basic fifo qdisc test
selftests: Add a bpf fq qdisc to selftest
selftests: Add a bpf netem qdisc to selftest
selftests: Add a prio bpf qdisc
include/linux/bpf.h | 30 +-
include/linux/bpf_verifier.h | 8 +-
include/linux/btf.h | 5 +-
include/linux/rbtree_types.h | 4 +
include/linux/skbuff.h | 2 +
include/linux/types.h | 4 +
include/net/sch_generic.h | 8 +
kernel/bpf/bpf_struct_ops.c | 17 +-
kernel/bpf/btf.c | 255 +++++-
kernel/bpf/helpers.c | 63 +-
kernel/bpf/syscall.c | 22 +-
kernel/bpf/verifier.c | 185 +++-
net/sched/Makefile | 4 +
net/sched/bpf_qdisc.c | 788 ++++++++++++++++++
net/sched/sch_api.c | 19 +-
net/sched/sch_generic.c | 11 +-
tools/lib/bpf/libbpf.h | 5 +-
tools/lib/bpf/netlink.c | 20 +-
.../testing/selftests/bpf/bpf_experimental.h | 59 +-
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 29 +
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 11 +
.../selftests/bpf/prog_tests/bpf_qdisc.c | 259 ++++++
.../selftests/bpf/prog_tests/linked_list.c | 6 +-
.../prog_tests/test_struct_ops_kptr_return.c | 87 ++
.../prog_tests/test_struct_ops_ref_acquire.c | 58 ++
.../selftests/bpf/progs/bpf_qdisc_common.h | 23 +
.../selftests/bpf/progs/bpf_qdisc_fifo.c | 83 ++
.../selftests/bpf/progs/bpf_qdisc_fq.c | 660 +++++++++++++++
.../selftests/bpf/progs/bpf_qdisc_netem.c | 236 ++++++
.../selftests/bpf/progs/bpf_qdisc_prio.c | 112 +++
.../testing/selftests/bpf/progs/linked_list.c | 15 +
.../testing/selftests/bpf/progs/linked_list.h | 8 +
.../selftests/bpf/progs/linked_list_fail.c | 46 +-
.../bpf/progs/struct_ops_kptr_return.c | 24 +
...uct_ops_kptr_return_fail__invalid_scalar.c | 24 +
.../struct_ops_kptr_return_fail__local_kptr.c | 30 +
...uct_ops_kptr_return_fail__nonzero_offset.c | 23 +
.../struct_ops_kptr_return_fail__wrong_type.c | 28 +
.../bpf/progs/struct_ops_ref_acquire.c | 27 +
.../progs/struct_ops_ref_acquire_dup_ref.c | 24 +
.../progs/struct_ops_ref_acquire_ref_leak.c | 19 +
41 files changed, 3216 insertions(+), 125 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_prio.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
--
2.20.1
* [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-16 23:59 ` Kumar Kartikeya Dwivedi
2024-05-10 19:23 ` [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of " Amery Hung
` (18 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch supports struct_ops programs that acquire referenced kptrs
through arguments. In Qdisc_ops, an skb is passed to ".enqueue" as the
first argument. The qdisc becomes the sole owner of the skb and must
enqueue or drop it. This matches the referenced kptr semantics in bpf.
However, the existing practice of acquiring a referenced kptr via a kfunc
with KF_ACQUIRE does not work well in this case: calling the kfunc
repeatedly would allow the user to acquire multiple references, while
there should be only one reference to a unique skb in a qdisc.

The solution is to make a struct_ops program automatically acquire a
referenced kptr through a tagged argument in the stub function. When an
argument is tagged with "__ref_acquired" (suggestion for a better
name?), a referenced kptr (ref_obj_id > 0) is acquired automatically when
entering the program. In addition, only the first read of the argument is
allowed, and it yields the referenced kptr.
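For example, the bpf_testmod stub added in the next patch tags its task
argument so that the implementing bpf program receives it as an
already-acquired reference:

        static int bpf_testmod_ops__test_ref_acquire(int dummy,
                        struct task_struct *task__ref_acquired)
        {
                return 0;
        }

A program implementing test_ref_acquire() then sees 'task' with
ref_obj_id > 0 and has to release it (e.g., with bpf_task_release())
before returning.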
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
kernel/bpf/btf.c | 10 +++++++++-
kernel/bpf/verifier.c | 16 +++++++++++++---
4 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9c6a7b8ff963..6aabca1581fe 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
struct {
struct btf *btf;
u32 btf_id;
+ u32 ref_obj_id;
};
};
struct bpf_verifier_log *log; /* for verbose logs */
@@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
enum bpf_reg_type reg_type;
struct btf *btf;
u32 btf_id;
+ u32 ref_obj_id;
+ bool ref_acquired;
};
struct btf_mod_pair {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 86c7884abaf8..bca8e5936846 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
}
#define MAYBE_NULL_SUFFIX "__nullable"
+#define REF_ACQUIRED_SUFFIX "__ref_acquired"
#define MAX_STUB_NAME 128
/* Return the type info of a stub function, if it exists.
@@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
struct bpf_struct_ops_arg_info *arg_info)
{
const struct btf_type *stub_func_proto, *pointed_type;
+ bool is_nullable = false, is_ref_acquired = false;
const struct btf_param *stub_args, *args;
struct bpf_ctx_arg_aux *info, *info_buf;
u32 nargs, arg_no, info_cnt = 0;
@@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
/* Skip arguments that is not suffixed with
* "__nullable".
*/
- if (!btf_param_match_suffix(btf, &stub_args[arg_no],
- MAYBE_NULL_SUFFIX))
+ is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
+ MAYBE_NULL_SUFFIX);
+ is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
+ REF_ACQUIRED_SUFFIX);
+ if (!(is_nullable || is_ref_acquired))
continue;
/* Should be a pointer to struct */
@@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
}
/* Fill the information of the new argument */
- info->reg_type =
- PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
info->btf_id = arg_btf_id;
info->btf = btf;
info->offset = offset;
+ if (is_nullable) {
+ info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
+ } else if (is_ref_acquired) {
+ info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
+ info->ref_acquired = true;
+ }
info++;
info_cnt++;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8c95392214ed..e462fb4a4598 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
/* this is a pointer to another type */
for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
- const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
+ struct bpf_ctx_arg_aux *ctx_arg_info =
+ (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
if (ctx_arg_info->offset == off) {
if (!ctx_arg_info->btf_id) {
@@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
return false;
}
+ if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
+ bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
+ return false;
+ }
+
info->reg_type = ctx_arg_info->reg_type;
info->btf = ctx_arg_info->btf ? : btf_vmlinux;
info->btf_id = ctx_arg_info->btf_id;
+ info->ref_obj_id = ctx_arg_info->ref_obj_id;
+ ctx_arg_info->ref_obj_id = 0;
return true;
}
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9f867fca9fbe..06a6edd306fd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5557,7 +5557,7 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
/* check access to 'struct bpf_context' fields. Supports fixed offsets only */
static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
enum bpf_access_type t, enum bpf_reg_type *reg_type,
- struct btf **btf, u32 *btf_id)
+ struct btf **btf, u32 *btf_id, u32 *ref_obj_id)
{
struct bpf_insn_access_aux info = {
.reg_type = *reg_type,
@@ -5578,6 +5578,7 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
if (base_type(*reg_type) == PTR_TO_BTF_ID) {
*btf = info.btf;
*btf_id = info.btf_id;
+ *ref_obj_id = info.ref_obj_id;
} else {
env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
}
@@ -6833,7 +6834,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
} else if (reg->type == PTR_TO_CTX) {
enum bpf_reg_type reg_type = SCALAR_VALUE;
struct btf *btf = NULL;
- u32 btf_id = 0;
+ u32 btf_id = 0, ref_obj_id = 0;
if (t == BPF_WRITE && value_regno >= 0 &&
is_pointer_value(env, value_regno)) {
@@ -6846,7 +6847,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
return err;
err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
- &btf_id);
+ &btf_id, &ref_obj_id);
if (err)
verbose_linfo(env, insn_idx, "; ");
if (!err && t == BPF_READ && value_regno >= 0) {
@@ -6870,6 +6871,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
if (base_type(reg_type) == PTR_TO_BTF_ID) {
regs[value_regno].btf = btf;
regs[value_regno].btf_id = btf_id;
+ regs[value_regno].ref_obj_id = ref_obj_id;
}
}
regs[value_regno].type = reg_type;
@@ -20426,6 +20428,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
{
bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
struct bpf_subprog_info *sub = subprog_info(env, subprog);
+ struct bpf_ctx_arg_aux *ctx_arg_info;
struct bpf_verifier_state *state;
struct bpf_reg_state *regs;
int ret, i;
@@ -20533,6 +20536,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
mark_reg_known_zero(env, regs, BPF_REG_1);
}
+ if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+ ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
+ for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
+ if (ctx_arg_info[i].ref_acquired)
+ ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
+ }
+
ret = do_check(env);
out:
/* check for NULL is necessary, since cur_state can be freed inside
--
2.20.1
* [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-10 21:33 ` Kui-Feng Lee
2024-05-10 19:23 ` [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
` (17 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
A reference is automatically acquired for a referenced kptr argument
annotated with "__ref_acquired" in the stub function of a struct_ops
program. It must be released and cannot be acquired more than once.

The test first checks whether a reference to the correct type is acquired
in "ref_acquire". Then, we check whether the verifier correctly rejects a
program that fails to release the reference (i.e., a reference leak) in
"ref_acquire_ref_leak". Finally, we check that the reference can only be
acquired once through the argument in "ref_acquire_dup_ref".
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 7 +++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 2 +
.../prog_tests/test_struct_ops_ref_acquire.c | 58 +++++++++++++++++++
.../bpf/progs/struct_ops_ref_acquire.c | 27 +++++++++
.../progs/struct_ops_ref_acquire_dup_ref.c | 24 ++++++++
.../progs/struct_ops_ref_acquire_ref_leak.c | 19 ++++++
6 files changed, 137 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..64dcab25b539 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -594,10 +594,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
return 0;
}
+static int bpf_testmod_ops__test_ref_acquire(int dummy,
+ struct task_struct *task__ref_acquired)
+{
+ return 0;
+}
+
static struct bpf_testmod_ops __bpf_testmod_ops = {
.test_1 = bpf_testmod_test_1,
.test_2 = bpf_testmod_test_2,
.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+ .test_ref_acquire = bpf_testmod_ops__test_ref_acquire,
};
struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index 23fa1872ee67..a0233990fb0e 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -35,6 +35,8 @@ struct bpf_testmod_ops {
void (*test_2)(int a, int b);
/* Used to test nullable arguments. */
int (*test_maybe_null)(int dummy, struct task_struct *task);
+ /* Used to test ref_acquired arguments. */
+ int (*test_ref_acquire)(int dummy, struct task_struct *task);
/* The following fields are used to test shadow copies. */
char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
new file mode 100644
index 000000000000..779287a00ed8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
@@ -0,0 +1,58 @@
+#include <test_progs.h>
+
+#include "struct_ops_ref_acquire.skel.h"
+#include "struct_ops_ref_acquire_ref_leak.skel.h"
+#include "struct_ops_ref_acquire_dup_ref.skel.h"
+
+/* Test that the verifier accepts a program that acquires a referenced
+ * kptr and releases the reference
+ */
+static void ref_acquire(void)
+{
+ struct struct_ops_ref_acquire *skel;
+
+ skel = struct_ops_ref_acquire__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+ return;
+
+ struct_ops_ref_acquire__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that acquires a referenced
+ * kptr without releasing the reference
+ */
+static void ref_acquire_ref_leak(void)
+{
+ struct struct_ops_ref_acquire_ref_leak *skel;
+
+ skel = struct_ops_ref_acquire_ref_leak__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
+ return;
+
+ struct_ops_ref_acquire_ref_leak__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that tries to acquire a
+ * reference twice
+ */
+static void ref_acquire_dup_ref(void)
+{
+ struct struct_ops_ref_acquire_dup_ref *skel;
+
+ skel = struct_ops_ref_acquire_dup_ref__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
+ return;
+
+ struct_ops_ref_acquire_dup_ref__destroy(skel);
+}
+
+void test_struct_ops_ref_acquire(void)
+{
+ if (test__start_subtest("ref_acquire"))
+ ref_acquire();
+ if (test__start_subtest("ref_acquire_ref_leak"))
+ ref_acquire_ref_leak();
+ if (test__start_subtest("ref_acquire_dup_ref"))
+ ref_acquire_dup_ref();
+}
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
new file mode 100644
index 000000000000..bae342db0fdb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
@@ -0,0 +1,27 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This is a test BPF program that uses struct_ops to access a referenced
+ * kptr argument. This is a test for the verifier to ensure that it recognizes
+ * the task as a referenced object (i.e., ref_obj_id > 0).
+ */
+SEC("struct_ops/test_ref_acquire")
+int BPF_PROG(test_ref_acquire, int dummy,
+ struct task_struct *task)
+{
+ bpf_task_release(task);
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+ .test_ref_acquire = (void *)test_ref_acquire,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
new file mode 100644
index 000000000000..489db98a47fb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
@@ -0,0 +1,24 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+SEC("struct_ops/test_ref_acquire")
+int BPF_PROG(test_ref_acquire, int dummy,
+ struct task_struct *task)
+{
+ bpf_task_release(task);
+ bpf_task_release(task);
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+ .test_ref_acquire = (void *)test_ref_acquire,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
new file mode 100644
index 000000000000..c5b9a1d748a1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
@@ -0,0 +1,19 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/test_ref_acquire")
+int BPF_PROG(test_ref_acquire, int dummy,
+ struct task_struct *task)
+{
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+ .test_ref_acquire = (void *)test_ref_acquire,
+};
+
+
--
2.20.1
* [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of " Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-17 2:06 ` Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 04/20] selftests/bpf: Test returning kptr from struct_ops programs Amery Hung
` (16 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch allows a struct_ops program to return a referenced kptr if the
struct_ops member has a pointer return type. To make sure the pointer
returned to the kernel is valid, it needs to be referenced and must
originally come from the kernel. That is, it should be acquired through
kfuncs or struct_ops "__ref_acquired" arguments, not allocated locally.
In addition, returning a null pointer is allowed, so the kernel caller of
the struct_ops function consuming the pointer needs to handle a potential
null pointer.

The first use case will be Qdisc_ops::dequeue, where a qdisc returns a
pointer to the skb to be dequeued.

To achieve this, we first allow a referenced object to leak through the
return value if it is in the return register and its type matches the
return type of the function. Then, we check whether the to-be-returned
pointer is valid in check_return_code().
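For example (mirroring the selftest added in the next patch), a
struct_ops member with a pointer return type can now hand a referenced
kptr received through a "__ref_acquired" argument back to the kernel, or
return NULL:

        SEC("struct_ops/test_kptr_return")
        struct task_struct *BPF_PROG(test_kptr_return, int dummy,
                                     struct task_struct *task,
                                     struct cgroup *cgrp)
        {
                return task; /* reference transferred to the kernel caller */
        }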
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 46 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 06a6edd306fd..2d4a55ead85b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10081,16 +10081,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
{
+ enum bpf_prog_type type = resolve_prog_type(env->prog);
u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
+ struct bpf_reg_state *reg = reg_state(env, regno);
struct bpf_func_state *state = cur_func(env);
+ const struct bpf_prog *prog = env->prog;
+ const struct btf_type *ret_type = NULL;
bool refs_lingering = false;
+ struct btf *btf;
int i;
if (!exception_exit && state->frameno && !state->in_callback_fn)
return 0;
+ if (type == BPF_PROG_TYPE_STRUCT_OPS &&
+ reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
+ btf = bpf_prog_get_target_btf(prog);
+ ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
+ if (reg->btf_id != ret_type->type) {
+ verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
+ btf_type_name(reg->btf, reg->btf_id),
+ btf_type_name(btf, ret_type->type));
+ return -EINVAL;
+ }
+ }
+
for (i = 0; i < state->acquired_refs; i++) {
if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
continue;
+ if (ret_type && reg->ref_obj_id == state->refs[i].id)
+ continue;
verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
state->refs[i].id, state->refs[i].insn_idx);
refs_lingering = true;
@@ -15395,12 +15415,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
const char *exit_ctx = "At program exit";
struct tnum enforce_attach_type_range = tnum_unknown;
const struct bpf_prog *prog = env->prog;
- struct bpf_reg_state *reg;
+ struct bpf_reg_state *reg = reg_state(env, regno);
struct bpf_retval_range range = retval_range(0, 1);
enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
int err;
struct bpf_func_state *frame = env->cur_state->frame[0];
const bool is_subprog = frame->subprogno;
+ struct btf *btf = bpf_prog_get_target_btf(prog);
+ bool st_ops_ret_is_kptr = false;
+ const struct btf_type *t;
/* LSM and struct_ops func-ptr's return type could be "void" */
if (!is_subprog || frame->in_exception_callback_fn) {
@@ -15409,10 +15432,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
if (prog->expected_attach_type == BPF_LSM_CGROUP)
/* See below, can be 0 or 0-1 depending on hook. */
break;
- fallthrough;
+ if (!prog->aux->attach_func_proto->type)
+ return 0;
+ break;
case BPF_PROG_TYPE_STRUCT_OPS:
if (!prog->aux->attach_func_proto->type)
return 0;
+
+ t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
+ if (btf_type_is_ptr(t)) {
+ /* Allow struct_ops programs to return kptr or null if
+ * the return type is a pointer type.
+ * check_reference_leak has ensured the returning kptr
+ * matches the type of the function prototype and is
+ * the only leaking reference. Thus, we can safely return
+ * if the pointer is in its unmodified form
+ */
+ if (reg->type & PTR_TO_BTF_ID)
+ return __check_ptr_off_reg(env, reg, regno, false);
+ st_ops_ret_is_kptr = true;
+ }
break;
default:
break;
@@ -15434,8 +15473,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
return -EACCES;
}
- reg = cur_regs(env) + regno;
-
if (frame->in_async_callback_fn) {
/* enforce return zero from async callbacks like timer */
exit_ctx = "At async callback return";
@@ -15522,6 +15559,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
case BPF_PROG_TYPE_NETFILTER:
range = retval_range(NF_DROP, NF_ACCEPT);
break;
+ case BPF_PROG_TYPE_STRUCT_OPS:
+ if (!st_ops_ret_is_kptr)
+ return 0;
+ range = retval_range(0, 0);
+ break;
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
--
2.20.1
* [RFC PATCH v8 04/20] selftests/bpf: Test returning kptr from struct_ops programs
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (2 preceding siblings ...)
2024-05-10 19:23 ` [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 05/20] bpf: Generate btf_struct_metas for kernel BTF Amery Hung
` (15 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
Test struct_ops programs that return a kptr. The verifier should only
allow programs to return NULL or a non-local kptr of the correct type.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 8 ++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 4 +
.../prog_tests/test_struct_ops_kptr_return.c | 87 +++++++++++++++++++
.../bpf/progs/struct_ops_kptr_return.c | 24 +++++
...uct_ops_kptr_return_fail__invalid_scalar.c | 24 +++++
.../struct_ops_kptr_return_fail__local_kptr.c | 30 +++++++
...uct_ops_kptr_return_fail__nonzero_offset.c | 23 +++++
.../struct_ops_kptr_return_fail__wrong_type.c | 28 ++++++
8 files changed, 228 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 64dcab25b539..097a8d1c2ef8 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -600,11 +600,19 @@ static int bpf_testmod_ops__test_ref_acquire(int dummy,
return 0;
}
+static struct task_struct *
+bpf_testmod_ops__test_kptr_return(int dummy, struct task_struct *task__ref_acquired,
+ struct cgroup *cgrp)
+{
+ return NULL;
+}
+
static struct bpf_testmod_ops __bpf_testmod_ops = {
.test_1 = bpf_testmod_test_1,
.test_2 = bpf_testmod_test_2,
.test_maybe_null = bpf_testmod_ops__test_maybe_null,
.test_ref_acquire = bpf_testmod_ops__test_ref_acquire,
+ .test_kptr_return = bpf_testmod_ops__test_kptr_return,
};
struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index a0233990fb0e..6d24e1307b64 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
struct task_struct;
+struct cgroup;
struct bpf_testmod_test_read_ctx {
char *buf;
@@ -37,6 +38,9 @@ struct bpf_testmod_ops {
int (*test_maybe_null)(int dummy, struct task_struct *task);
/* Used to test ref_acquired arguments. */
int (*test_ref_acquire)(int dummy, struct task_struct *task);
+ /* Used to test returning kptr. */
+ struct task_struct *(*test_kptr_return)(int dummy, struct task_struct *task,
+ struct cgroup *cgrp);
/* The following fields are used to test shadow copies. */
char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
new file mode 100644
index 000000000000..bc2fac39215a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
@@ -0,0 +1,87 @@
+#include <test_progs.h>
+
+#include "struct_ops_kptr_return.skel.h"
+#include "struct_ops_kptr_return_fail__wrong_type.skel.h"
+#include "struct_ops_kptr_return_fail__invalid_scalar.skel.h"
+#include "struct_ops_kptr_return_fail__nonzero_offset.skel.h"
+#include "struct_ops_kptr_return_fail__local_kptr.skel.h"
+
+/* Test that the verifier accepts a program that acquires a referenced
+ * kptr and releases the reference through return
+ */
+static void kptr_return(void)
+{
+ struct struct_ops_kptr_return *skel;
+
+ skel = struct_ops_kptr_return__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+ return;
+
+ struct_ops_kptr_return__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a kptr of the
+ * wrong type
+ */
+static void kptr_return_fail__wrong_type(void)
+{
+ struct struct_ops_kptr_return_fail__wrong_type *skel;
+
+ skel = struct_ops_kptr_return_fail__wrong_type__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__wrong_type__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__wrong_type__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a non-null scalar */
+static void kptr_return_fail__invalid_scalar(void)
+{
+ struct struct_ops_kptr_return_fail__invalid_scalar *skel;
+
+ skel = struct_ops_kptr_return_fail__invalid_scalar__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__invalid_scalar__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__invalid_scalar__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns kptr with non-zero offset */
+static void kptr_return_fail__nonzero_offset(void)
+{
+ struct struct_ops_kptr_return_fail__nonzero_offset *skel;
+
+ skel = struct_ops_kptr_return_fail__nonzero_offset__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__nonzero_offset__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__nonzero_offset__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns local kptr */
+static void kptr_return_fail__local_kptr(void)
+{
+ struct struct_ops_kptr_return_fail__local_kptr *skel;
+
+ skel = struct_ops_kptr_return_fail__local_kptr__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__local_kptr__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__local_kptr__destroy(skel);
+}
+
+void test_struct_ops_kptr_return(void)
+{
+ if (test__start_subtest("kptr_return"))
+ kptr_return();
+ if (test__start_subtest("kptr_return_fail__wrong_type"))
+ kptr_return_fail__wrong_type();
+ if (test__start_subtest("kptr_return_fail__invalid_scalar"))
+ kptr_return_fail__invalid_scalar();
+ if (test__start_subtest("kptr_return_fail__nonzero_offset"))
+ kptr_return_fail__nonzero_offset();
+ if (test__start_subtest("kptr_return_fail__local_kptr"))
+ kptr_return_fail__local_kptr();
+}
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
new file mode 100644
index 000000000000..34933a88e1f9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
@@ -0,0 +1,24 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * allow a referenced kptr (acquired with "ref_acquired") to be leaked through return.
+ */
+SEC("struct_ops/test_kptr_return")
+struct task_struct *BPF_PROG(test_kptr_return, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ return task;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_kptr_return = (void *)test_kptr_return,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
new file mode 100644
index 000000000000..d479e3377496
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
@@ -0,0 +1,24 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a non-zero scalar value.
+ */
+SEC("struct_ops/test_kptr_return")
+struct task_struct *BPF_PROG(test_kptr_return, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ bpf_task_release(task);
+ return (struct task_struct *)1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_kptr_return = (void *)test_kptr_return,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
new file mode 100644
index 000000000000..9266987798ca
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
@@ -0,0 +1,30 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a local kptr.
+ */
+SEC("struct_ops/test_kptr_return")
+struct task_struct *BPF_PROG(test_kptr_return, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ struct task_struct *t;
+
+ t = bpf_obj_new(typeof(*task));
+ if (!t)
+ return task;
+
+ return t;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_kptr_return = (void *)test_kptr_return,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
new file mode 100644
index 000000000000..1a369e9839f3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
@@ -0,0 +1,23 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a modified referenced kptr.
+ */
+SEC("struct_ops/test_kptr_return")
+struct task_struct *BPF_PROG(test_kptr_return, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ return (struct task_struct *)&task->jobctl;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_kptr_return = (void *)test_kptr_return,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
new file mode 100644
index 000000000000..4128ea0b77f1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
@@ -0,0 +1,28 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a referenced kptr of the wrong type.
+ */
+SEC("struct_ops/test_kptr_return")
+struct task_struct *BPF_PROG(test_kptr_return, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ struct task_struct *ret;
+
+ ret = (struct task_struct *)bpf_cgroup_acquire(cgrp);
+ bpf_task_release(task);
+
+ return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_kptr_return = (void *)test_kptr_return,
+};
--
2.20.1
* [RFC PATCH v8 05/20] bpf: Generate btf_struct_metas for kernel BTF
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (3 preceding siblings ...)
2024-05-10 19:23 ` [RFC PATCH v8 04/20] selftests/bpf: Test returning kptr from struct_ops programs Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 06/20] bpf: Recognize kernel types as graph values Amery Hung
` (14 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
Currently, only program BTF from the user may contain special BTF fields
(e.g., bpf_list_head, bpf_spin_lock, and bpf_timer). To support adding
kernel objects to collections, we will need special BTF fields (i.e.,
graph nodes) in kernel structures as well. This patch takes the first
step by finding these fields and building metadata for kernel BTF.

Unlike parsing program BTF, where we go through all types, an allowlist
specifying the kernel structures that contain special BTF fields is used.
This avoids wasting time parsing the majority of kernel types, which do
not have any special BTF fields.
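The allowlist added in this patch only carries an "unused" placeholder.
Once a kernel structure such as sk_buff gains a graph node later in the
series, it would be listed explicitly, e.g. (illustrative only, not part
of this patch):

        /* kernel structures with special BTF fields */
        static const char *kstructs_with_special_btf[] = {
                "sk_buff",
        };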
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
kernel/bpf/btf.c | 63 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 59 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index e462fb4a4598..5ee6ccc2fab7 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5380,6 +5380,11 @@ static const char *alloc_obj_fields[] = {
"bpf_refcount",
};
+/* kernel structures with special BTF fields */
+static const char *kstructs_with_special_btf[] = {
+ "unused",
+};
+
static struct btf_struct_metas *
btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
{
@@ -5391,6 +5396,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
} _arr;
} aof;
struct btf_struct_metas *tab = NULL;
+ bool btf_is_base_kernel;
int i, n, id, ret;
BUILD_BUG_ON(offsetof(struct btf_id_set, cnt) != 0);
@@ -5412,16 +5418,25 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
return NULL;
sort(&aof.set.ids, aof.set.cnt, sizeof(aof.set.ids[0]), btf_id_cmp_func, NULL);
- n = btf_nr_types(btf);
+ btf_is_base_kernel = btf_is_kernel(btf) && !btf_is_module(btf);
+ n = btf_is_base_kernel ? ARRAY_SIZE(kstructs_with_special_btf) : btf_nr_types(btf);
for (i = 1; i < n; i++) {
struct btf_struct_metas *new_tab;
const struct btf_member *member;
struct btf_struct_meta *type;
struct btf_record *record;
const struct btf_type *t;
- int j, tab_cnt;
+ int j, tab_cnt, id;
- t = btf_type_by_id(btf, i);
+ id = btf_is_base_kernel ?
+ btf_find_by_name_kind(btf, kstructs_with_special_btf[i],
+ BTF_KIND_STRUCT) : i;
+ if (id < 0) {
+ ret = -EINVAL;
+ goto free;
+ }
+
+ t = btf_type_by_id(btf, id);
if (!t) {
ret = -EINVAL;
goto free;
@@ -5449,7 +5464,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
tab = new_tab;
type = &tab->types[tab->cnt];
- type->btf_id = i;
+ type->btf_id = id;
record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
BPF_KPTR, t->size);
@@ -5967,6 +5982,7 @@ BTF_ID(struct, bpf_ctx_convert)
struct btf *btf_parse_vmlinux(void)
{
+ struct btf_struct_metas *struct_meta_tab;
struct btf_verifier_env *env = NULL;
struct bpf_verifier_log *log;
struct btf *btf = NULL;
@@ -6009,6 +6025,23 @@ struct btf *btf_parse_vmlinux(void)
if (err)
goto errout;
+ struct_meta_tab = btf_parse_struct_metas(&env->log, btf);
+ if (IS_ERR(struct_meta_tab)) {
+ err = PTR_ERR(struct_meta_tab);
+ goto errout;
+ }
+ btf->struct_meta_tab = struct_meta_tab;
+
+ if (struct_meta_tab) {
+ int i;
+
+ for (i = 0; i < struct_meta_tab->cnt; i++) {
+ err = btf_check_and_fixup_fields(struct_meta_tab->types[i].record);
+ if (err < 0)
+ goto errout_meta;
+ }
+ }
+
/* btf_parse_vmlinux() runs under bpf_verifier_lock */
bpf_ctx_convert.t = btf_type_by_id(btf, bpf_ctx_convert_btf_id[0]);
@@ -6021,6 +6054,8 @@ struct btf *btf_parse_vmlinux(void)
btf_verifier_env_free(env);
return btf;
+errout_meta:
+ btf_free_struct_meta_tab(btf);
errout:
btf_verifier_env_free(env);
if (btf) {
@@ -6034,6 +6069,7 @@ struct btf *btf_parse_vmlinux(void)
static struct btf *btf_parse_module(const char *module_name, const void *data, unsigned int data_size)
{
+ struct btf_struct_metas *struct_meta_tab;
struct btf_verifier_env *env = NULL;
struct bpf_verifier_log *log;
struct btf *btf = NULL, *base_btf;
@@ -6091,10 +6127,29 @@ static struct btf *btf_parse_module(const char *module_name, const void *data, u
if (err)
goto errout;
+ struct_meta_tab = btf_parse_struct_metas(&env->log, btf);
+ if (IS_ERR(struct_meta_tab)) {
+ err = PTR_ERR(struct_meta_tab);
+ goto errout;
+ }
+ btf->struct_meta_tab = struct_meta_tab;
+
+ if (struct_meta_tab) {
+ int i;
+
+ for (i = 0; i < struct_meta_tab->cnt; i++) {
+ err = btf_check_and_fixup_fields(struct_meta_tab->types[i].record);
+ if (err < 0)
+ goto errout_meta;
+ }
+ }
+
btf_verifier_env_free(env);
refcount_set(&btf->refcnt, 1);
return btf;
+errout_meta:
+ btf_free_struct_meta_tab(btf);
errout:
btf_verifier_env_free(env);
if (btf) {
--
2.20.1
* [RFC PATCH v8 06/20] bpf: Recognize kernel types as graph values
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (4 preceding siblings ...)
2024-05-10 19:23 ` [RFC PATCH v8 05/20] bpf: Generate btf_struct_metas for kernel BTF Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 07/20] bpf: Allow adding kernel objects to collections Amery Hung
` (13 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch teaches bpf to recognize graphs that contain kernel objects as
graph values. BPF programs can use a new BTF declaration tag,
"contains_kptr", to signal that the value of a graph is a kernel type.
"contains_kptr" follows the same annotation format as "contains".

For the implementation, when the value is a kernel type, we use the
kernel BTF for the node and root as well, so that we don't need to match
the same type across different BTFs. Since graph values can now be kernel
types, we can no longer assume that the BTF comes from a program when
finding and parsing graph nodes and roots. Therefore, we record the BTF
of a node in btf_field_info and use it later.

No kernel object can be added to bpf graphs yet. In later patches, we
will teach the verifier to allow moving kptrs into and out of
collections.
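For example, with the matching __contains_kptr() macro added to
bpf_experimental.h in this patch, a program could declare a list whose
values are skbs (the node field name inside struct sk_buff is an
assumption here; it is added later in the series):

        #define __contains_kptr(name, node) \
                __attribute__((btf_decl_tag("contains_kptr:" #name ":" #node)))

        struct bpf_list_head q __contains_kptr(sk_buff, bpf_list);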
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/btf.h | 4 +-
kernel/bpf/btf.c | 49 ++++++++++++-------
kernel/bpf/syscall.c | 2 +-
.../testing/selftests/bpf/bpf_experimental.h | 1 +
4 files changed, 36 insertions(+), 20 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f9e56fd12a9f..2579b8a51172 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -219,7 +219,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
u32 expected_offset, u32 expected_size);
struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
u32 field_mask, u32 value_size);
-int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
+int btf_check_and_fixup_fields(struct btf_record *rec);
bool btf_type_is_void(const struct btf_type *t);
s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p);
@@ -569,7 +569,7 @@ static inline int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dt
{
return 0;
}
-static inline struct btf_struct_meta *btf_find_struct_meta(const struct btf *btf, u32 btf_id)
+static inline struct btf_struct_meta *btf_find_struct_meta(u32 btf_id)
{
return NULL;
}
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5ee6ccc2fab7..37fb6143da79 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3296,6 +3296,7 @@ struct btf_field_info {
struct {
const char *node_name;
u32 value_btf_id;
+ const struct btf *btf;
} graph_root;
};
};
@@ -3405,7 +3406,9 @@ btf_find_graph_root(const struct btf *btf, const struct btf_type *pt,
enum btf_field_type head_type)
{
const char *node_field_name;
+ bool value_is_kptr = false;
const char *value_type;
+ struct btf *kptr_btf;
s32 id;
if (!__btf_type_is_struct(t))
@@ -3413,15 +3416,26 @@ btf_find_graph_root(const struct btf *btf, const struct btf_type *pt,
if (t->size != sz)
return BTF_FIELD_IGNORE;
value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:");
- if (IS_ERR(value_type))
- return -EINVAL;
+ if (!IS_ERR(value_type))
+ goto found;
+ value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains_kptr:");
+ if (!IS_ERR(value_type)) {
+ value_is_kptr = true;
+ goto found;
+ }
+ return -EINVAL;
+found:
node_field_name = strstr(value_type, ":");
if (!node_field_name)
return -EINVAL;
value_type = kstrndup(value_type, node_field_name - value_type, GFP_KERNEL | __GFP_NOWARN);
if (!value_type)
return -ENOMEM;
- id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
+ if (value_is_kptr)
+ id = bpf_find_btf_id(value_type, BTF_KIND_STRUCT, &kptr_btf);
+ else
+ id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
+
kfree(value_type);
if (id < 0)
return id;
@@ -3431,6 +3445,7 @@ btf_find_graph_root(const struct btf *btf, const struct btf_type *pt,
info->type = head_type;
info->off = off;
info->graph_root.value_btf_id = id;
+ info->graph_root.btf = value_is_kptr ? kptr_btf : btf;
info->graph_root.node_name = node_field_name;
return BTF_FIELD_FOUND;
}
@@ -3722,13 +3737,13 @@ static int btf_parse_kptr(const struct btf *btf, struct btf_field *field,
return ret;
}
-static int btf_parse_graph_root(const struct btf *btf,
- struct btf_field *field,
+static int btf_parse_graph_root(struct btf_field *field,
struct btf_field_info *info,
const char *node_type_name,
size_t node_type_align)
{
const struct btf_type *t, *n = NULL;
+ const struct btf *btf = info->graph_root.btf;
const struct btf_member *member;
u32 offset;
int i;
@@ -3766,18 +3781,16 @@ static int btf_parse_graph_root(const struct btf *btf,
return 0;
}
-static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
- struct btf_field_info *info)
+static int btf_parse_list_head(struct btf_field *field, struct btf_field_info *info)
{
- return btf_parse_graph_root(btf, field, info, "bpf_list_node",
- __alignof__(struct bpf_list_node));
+ return btf_parse_graph_root(field, info, "bpf_list_node",
+ __alignof__(struct bpf_list_node));
}
-static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field,
- struct btf_field_info *info)
+static int btf_parse_rb_root(struct btf_field *field, struct btf_field_info *info)
{
- return btf_parse_graph_root(btf, field, info, "bpf_rb_node",
- __alignof__(struct bpf_rb_node));
+ return btf_parse_graph_root(field, info, "bpf_rb_node",
+ __alignof__(struct bpf_rb_node));
}
static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
@@ -3859,12 +3872,12 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
goto end;
break;
case BPF_LIST_HEAD:
- ret = btf_parse_list_head(btf, &rec->fields[i], &info_arr[i]);
+ ret = btf_parse_list_head(&rec->fields[i], &info_arr[i]);
if (ret < 0)
goto end;
break;
case BPF_RB_ROOT:
- ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i]);
+ ret = btf_parse_rb_root(&rec->fields[i], &info_arr[i]);
if (ret < 0)
goto end;
break;
@@ -3901,7 +3914,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
return ERR_PTR(ret);
}
-int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
+int btf_check_and_fixup_fields(struct btf_record *rec)
{
int i;
@@ -3917,11 +3930,13 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
return 0;
for (i = 0; i < rec->cnt; i++) {
struct btf_struct_meta *meta;
+ struct btf *btf;
u32 btf_id;
if (!(rec->fields[i].type & BPF_GRAPH_ROOT))
continue;
btf_id = rec->fields[i].graph_root.value_btf_id;
+ btf = rec->fields[i].graph_root.btf;
meta = btf_find_struct_meta(btf, btf_id);
if (!meta)
return -EFAULT;
@@ -5630,7 +5645,7 @@ static struct btf *btf_parse(const union bpf_attr *attr, bpfptr_t uattr, u32 uat
int i;
for (i = 0; i < struct_meta_tab->cnt; i++) {
- err = btf_check_and_fixup_fields(btf, struct_meta_tab->types[i].record);
+ err = btf_check_and_fixup_fields(struct_meta_tab->types[i].record);
if (err < 0)
goto errout_meta;
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e44c276e8617..9e93d48efe19 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1157,7 +1157,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
}
}
- ret = btf_check_and_fixup_fields(btf, map->record);
+ ret = btf_check_and_fixup_fields(map->record);
if (ret < 0)
goto free_map_tab;
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index a5b9df38c162..a4da75df819c 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -7,6 +7,7 @@
#include <bpf/bpf_core_read.h>
#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node)))
+#define __contains_kptr(name, node) __attribute__((btf_decl_tag("contains_kptr:" #name ":" #node)))
/* Description
* Allocates an object of the type represented by 'local_type_id' in
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 07/20] bpf: Allow adding kernel objects to collections
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (5 preceding siblings ...)
2024-05-10 19:23 ` [RFC PATCH v8 06/20] bpf: Recognize kernel types as graph values Amery Hung
@ 2024-05-10 19:23 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 08/20] selftests/bpf: Test adding kernel object to bpf graph Amery Hung
` (12 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:23 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
To allow adding kernel objects to and removing them from collections, teach
the verifier that a graph node may live in an object referenced by a trusted
kptr, not only in a local object. In addition, a kernel graph value removed
from a collection should still be returned as a trusted kptr.
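For illustration, the following sketch shows the kind of program this change
enables. It mirrors the selftest added in the next patch; the acquire kfunc
and the lock/head declarations are selftest/test-module helpers, not part of
this patch:
	/* declarations as in the selftest header (next patch) */
	private(C) struct bpf_spin_lock glock3;
	private(C) struct bpf_list_head ghead3 __contains_kptr(bpf_testmod_linked_list_obj, node);
	struct bpf_testmod_linked_list_obj *bpf_kfunc_call_test_acq_linked_list_obj(void) __ksym;
	SEC("tc")
	int push_kernel_obj(void *ctx)
	{
		struct bpf_testmod_linked_list_obj *f;
		f = bpf_kfunc_call_test_acq_linked_list_obj(); /* trusted, referenced kptr */
		if (!f)
			return 0;
		bpf_spin_lock(&glock3);
		bpf_list_push_back(&ghead3, &f->node); /* now accepted by the verifier */
		bpf_spin_unlock(&glock3);
		return 0;
	}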
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/bpf_verifier.h | 8 +++++++-
kernel/bpf/verifier.c | 18 ++++++++++++------
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7cb1b75eee38..edb306ef4c61 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -864,9 +864,15 @@ static inline bool type_is_ptr_alloc_obj(u32 type)
return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
}
+static inline bool type_is_ptr_trusted(u32 type)
+{
+ return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & PTR_TRUSTED;
+}
+
static inline bool type_is_non_owning_ref(u32 type)
{
- return type_is_ptr_alloc_obj(type) && type_flag(type) & NON_OWN_REF;
+ return (type_is_ptr_alloc_obj(type) || type_is_ptr_trusted(type)) &&
+ type_flag(type) & NON_OWN_REF;
}
static inline bool type_is_pkt_pointer(enum bpf_reg_type type)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2d4a55ead85b..f01d2b876a2e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -413,7 +413,8 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
if (reg->type == PTR_TO_MAP_VALUE) {
rec = reg->map_ptr->record;
- } else if (type_is_ptr_alloc_obj(reg->type)) {
+ } else if (type_is_ptr_alloc_obj(reg->type) || type_is_ptr_trusted(reg->type) ||
+ reg->type == PTR_TO_BTF_ID) {
meta = btf_find_struct_meta(reg->btf, reg->btf_id);
if (meta)
rec = meta->record;
@@ -1860,7 +1861,8 @@ static void mark_reg_graph_node(struct bpf_reg_state *regs, u32 regno,
struct btf_field_graph_root *ds_head)
{
__mark_reg_known_zero(®s[regno]);
- regs[regno].type = PTR_TO_BTF_ID | MEM_ALLOC;
+ regs[regno].type = btf_is_kernel(ds_head->btf) ? PTR_TO_BTF_ID | PTR_TRUSTED :
+ PTR_TO_BTF_ID | MEM_ALLOC;
regs[regno].btf = ds_head->btf;
regs[regno].btf_id = ds_head->value_btf_id;
regs[regno].off = ds_head->node_offset;
@@ -11931,8 +11933,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return ret;
break;
case KF_ARG_PTR_TO_LIST_NODE:
- if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
- verbose(env, "arg#%d expected pointer to allocated object\n", i);
+ if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC) &&
+ reg->type != (PTR_TO_BTF_ID | PTR_TRUSTED) &&
+ reg->type != PTR_TO_BTF_ID) {
+ verbose(env, "arg#%d expected pointer to allocated object or trusted pointer\n", i);
return -EINVAL;
}
if (!reg->ref_obj_id) {
@@ -11954,8 +11958,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return -EINVAL;
}
} else {
- if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
- verbose(env, "arg#%d expected pointer to allocated object\n", i);
+ if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC) &&
+ reg->type != (PTR_TO_BTF_ID | PTR_TRUSTED) &&
+ reg->type != PTR_TO_BTF_ID) {
+ verbose(env, "arg#%d expected pointer to allocated object or trusted pointer\n", i);
return -EINVAL;
}
if (!reg->ref_obj_id) {
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 08/20] selftests/bpf: Test adding kernel object to bpf graph
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (6 preceding siblings ...)
2024-05-10 19:23 ` [RFC PATCH v8 07/20] bpf: Allow adding kernel objects to collections Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 09/20] bpf: Find special BTF fields in union Amery Hung
` (11 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch tests bpf graphs storing kernel objects.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 14 +++++++++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 5 ++++
.../selftests/bpf/prog_tests/linked_list.c | 6 ++--
.../testing/selftests/bpf/progs/linked_list.c | 15 ++++++++++
.../testing/selftests/bpf/progs/linked_list.h | 8 +++++
.../selftests/bpf/progs/linked_list_fail.c | 29 +++++++++++++++++++
6 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 097a8d1c2ef8..90dda6335c04 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -494,6 +494,18 @@ __bpf_kfunc static u32 bpf_kfunc_call_test_static_unused_arg(u32 arg, u32 unused
return arg;
}
+__bpf_kfunc static struct bpf_testmod_linked_list_obj *
+bpf_kfunc_call_test_acq_linked_list_obj(void)
+{
+ return kzalloc(sizeof(struct bpf_testmod_linked_list_obj), GFP_ATOMIC);
+}
+
+__bpf_kfunc static void
+bpf_kfunc_call_test_rel_linked_list_obj(struct bpf_testmod_linked_list_obj *obj)
+{
+ kvfree(obj);
+}
+
BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -520,6 +532,8 @@ BTF_ID_FLAGS(func, bpf_kfunc_call_test_ref, KF_TRUSTED_ARGS | KF_RCU)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_destructive, KF_DESTRUCTIVE)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_static_unused_arg)
BTF_ID_FLAGS(func, bpf_kfunc_call_test_offset)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_acq_linked_list_obj, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_kfunc_call_test_rel_linked_list_obj, KF_RELEASE)
BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf)
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index 6d24e1307b64..77c36fc016e3 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -99,4 +99,9 @@ struct bpf_testmod_ops2 {
int (*test_1)(void);
};
+struct bpf_testmod_linked_list_obj {
+ int val;
+ struct bpf_list_node node;
+};
+
#endif /* _BPF_TESTMOD_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
index 2fb89de63bd2..813c2e9a2346 100644
--- a/tools/testing/selftests/bpf/prog_tests/linked_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -80,8 +80,8 @@ static struct {
{ "direct_write_node", "direct access to bpf_list_node is disallowed" },
{ "use_after_unlock_push_front", "invalid mem access 'scalar'" },
{ "use_after_unlock_push_back", "invalid mem access 'scalar'" },
- { "double_push_front", "arg#1 expected pointer to allocated object" },
- { "double_push_back", "arg#1 expected pointer to allocated object" },
+ { "double_push_front", "arg#1 expected pointer to allocated object or trusted pointer" },
+ { "double_push_back", "arg#1 expected pointer to allocated object or trusted pointer" },
{ "no_node_value_type", "bpf_list_node not found at offset=0" },
{ "incorrect_value_type",
"operation on bpf_list_head expects arg#1 bpf_list_node at offset=48 in struct foo, "
@@ -96,6 +96,8 @@ static struct {
{ "incorrect_head_off2", "bpf_list_head not found at offset=1" },
{ "pop_front_off", "off 48 doesn't point to 'struct bpf_spin_lock' that is at 40" },
{ "pop_back_off", "off 48 doesn't point to 'struct bpf_spin_lock' that is at 40" },
+ { "direct_write_node_kernel", "" },
+ { "push_local_node_to_kptr_list", "operation on bpf_list_head expects arg#1 bpf_list_node at offset=8 in struct bpf_testmod_linked_list_obj, but arg is at offset=8 in struct bpf_testmod_linked_list_obj" },
};
static void test_linked_list_fail_prog(const char *prog_name, const char *err_msg)
diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
index 26205ca80679..148ec67feaf7 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.c
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -378,4 +378,19 @@ int global_list_in_list(void *ctx)
return test_list_in_list(&glock, &ghead);
}
+SEC("tc")
+int push_to_kptr_list(void *ctx)
+{
+ struct bpf_testmod_linked_list_obj *f;
+
+ f = bpf_kfunc_call_test_acq_linked_list_obj();
+ if (!f)
+ return 0;
+
+ bpf_spin_lock(&glock3);
+ bpf_list_push_back(&ghead3, &f->node);
+ bpf_spin_unlock(&glock3);
+ return 0;
+}
+
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/linked_list.h b/tools/testing/selftests/bpf/progs/linked_list.h
index c0f3609a7ffa..14bd92cfdb6f 100644
--- a/tools/testing/selftests/bpf/progs/linked_list.h
+++ b/tools/testing/selftests/bpf/progs/linked_list.h
@@ -5,6 +5,7 @@
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"
+#include "../bpf_testmod/bpf_testmod.h"
struct bar {
struct bpf_list_node node;
@@ -52,5 +53,12 @@ struct {
private(A) struct bpf_spin_lock glock;
private(A) struct bpf_list_head ghead __contains(foo, node2);
private(B) struct bpf_spin_lock glock2;
+private(C) struct bpf_spin_lock glock3;
+private(C) struct bpf_list_head ghead3 __contains_kptr(bpf_testmod_linked_list_obj, node);
+
+struct bpf_testmod_linked_list_obj *bpf_kfunc_call_test_acq_linked_list_obj(void) __ksym;
+void bpf_kfunc_call_test_rel_linked_list_obj(struct bpf_testmod_linked_list_obj *obj) __ksym;
+struct bpf_testmod_rb_tree_obj *bpf_kfunc_call_test_acq_rb_tree_obj(void) __ksym;
+void bpf_kfunc_call_test_rel_rb_tree_obj(struct bpf_testmod_rb_tree_obj *obj) __ksym;
#endif
diff --git a/tools/testing/selftests/bpf/progs/linked_list_fail.c b/tools/testing/selftests/bpf/progs/linked_list_fail.c
index 6438982b928b..5f8063ecc448 100644
--- a/tools/testing/selftests/bpf/progs/linked_list_fail.c
+++ b/tools/testing/selftests/bpf/progs/linked_list_fail.c
@@ -609,4 +609,33 @@ int pop_back_off(void *ctx)
return pop_ptr_off((void *)bpf_list_pop_back);
}
+SEC("?tc")
+int direct_write_node_kernel(void *ctx)
+{
+ struct bpf_testmod_linked_list_obj *f;
+
+ f = bpf_kfunc_call_test_acq_linked_list_obj();
+ if (!f)
+ return 0;
+
+ *(__u64 *)&f->node = 0;
+ bpf_kfunc_call_test_rel_linked_list_obj(f);
+ return 0;
+}
+
+SEC("?tc")
+int push_local_node_to_kptr_list(void *ctx)
+{
+ struct bpf_testmod_linked_list_obj *f;
+
+ f = bpf_obj_new(typeof(*f));
+ if (!f)
+ return 0;
+
+ bpf_spin_lock(&glock3);
+ bpf_list_push_back(&ghead3, &f->node);
+ bpf_spin_unlock(&glock3);
+ return 0;
+}
+
char _license[] SEC("license") = "GPL";
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 09/20] bpf: Find special BTF fields in union
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (7 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 08/20] selftests/bpf: Test adding kernel object to bpf graph Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-16 23:37 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 10/20] bpf: Introduce exclusive-ownership list and rbtree nodes Amery Hung
` (10 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch looks into unions when parsing BTF. We would like to support
adding an skb to bpf collections, but due to space constraints the bpf
graph node in sk_buff has to live in a union. Therefore, BTF parsing now
also searches union members when looking for special BTF fields such as
graph nodes.
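As an illustration, this is the kind of layout the parser now has to handle
(simplified from the sk_buff change made later in the series; the exclusive
node types themselves are only introduced in the next patch):
	struct sk_buff {
		union {
			struct rb_node			rbnode;
			struct list_head		list;
			struct llist_node		ll_node;
			struct bpf_list_excl_node	bpf_list;	/* added later in the series */
			struct bpf_rb_excl_node		bpf_rbnode;	/* added later in the series */
		};
		/* ... other members omitted ... */
	};
The graph node is no longer a direct member of the struct, so
btf_find_struct_field() has to descend into the union to find it.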
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
kernel/bpf/btf.c | 74 +++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 64 insertions(+), 10 deletions(-)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 37fb6143da79..25a5dc840ac3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3305,7 +3305,7 @@ static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
u32 off, int sz, enum btf_field_type field_type,
struct btf_field_info *info)
{
- if (!__btf_type_is_struct(t))
+ if (!btf_type_is_struct(t))
return BTF_FIELD_IGNORE;
if (t->size != sz)
return BTF_FIELD_IGNORE;
@@ -3497,6 +3497,24 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
return type;
}
+static int btf_get_union_field_types(const struct btf *btf, const struct btf_type *u,
+ u32 field_mask, u32 *seen_mask, int *align, int *sz)
+{
+ int i, field_type, field_types = 0;
+ const struct btf_member *member;
+ const struct btf_type *t;
+
+ for_each_member(i, u, member) {
+ t = btf_type_by_id(btf, member->type);
+ field_type = btf_get_field_type(__btf_name_by_offset(btf, t->name_off),
+ field_mask, seen_mask, align, sz);
+ if (field_type == 0 || field_type == BPF_KPTR_REF)
+ continue;
+ field_types = field_types | field_type;
+ }
+ return field_types;
+}
+
#undef field_mask_test_name
static int btf_find_struct_field(const struct btf *btf,
@@ -3512,8 +3530,12 @@ static int btf_find_struct_field(const struct btf *btf,
const struct btf_type *member_type = btf_type_by_id(btf,
member->type);
- field_type = btf_get_field_type(__btf_name_by_offset(btf, member_type->name_off),
- field_mask, &seen_mask, &align, &sz);
+ field_type = BTF_INFO_KIND(member_type->info) == BTF_KIND_UNION ?
+ btf_get_union_field_types(btf, member_type, field_mask,
+ &seen_mask, &align, &sz) :
+ btf_get_field_type(__btf_name_by_offset(btf, member_type->name_off),
+ field_mask, &seen_mask, &align, &sz);
+
if (field_type == 0)
continue;
if (field_type < 0)
@@ -3521,8 +3543,7 @@ static int btf_find_struct_field(const struct btf *btf,
off = __btf_member_bit_offset(t, member);
if (off % 8)
- /* valid C code cannot generate such BTF */
- return -EINVAL;
+ continue;
off /= 8;
if (off % align)
continue;
@@ -3737,6 +3758,20 @@ static int btf_parse_kptr(const struct btf *btf, struct btf_field *field,
return ret;
}
+static const struct btf_type *
+btf_find_member_by_name(const struct btf *btf, const struct btf_type *t,
+ const char *member_name)
+{
+ const struct btf_member *member;
+ int i;
+
+ for_each_member(i, t, member) {
+ if (!strcmp(member_name, __btf_name_by_offset(btf, member->name_off)))
+ return btf_type_by_id(btf, member->type);
+ }
+ return NULL;
+}
+
static int btf_parse_graph_root(struct btf_field *field,
struct btf_field_info *info,
const char *node_type_name,
@@ -3754,18 +3789,27 @@ static int btf_parse_graph_root(struct btf_field *field,
* verify its type.
*/
for_each_member(i, t, member) {
- if (strcmp(info->graph_root.node_name,
- __btf_name_by_offset(btf, member->name_off)))
+ const struct btf_type *member_type = btf_type_by_id(btf, member->type);
+
+ if (BTF_INFO_KIND(member_type->info) == BTF_KIND_UNION) {
+ member_type = btf_find_member_by_name(btf, member_type,
+ info->graph_root.node_name);
+ if (!member_type)
+ continue;
+ } else if (strcmp(info->graph_root.node_name,
+ __btf_name_by_offset(btf, member->name_off))) {
continue;
+ }
+
/* Invalid BTF, two members with same name */
if (n)
return -EINVAL;
- n = btf_type_by_id(btf, member->type);
+ n = member_type;
if (!__btf_type_is_struct(n))
return -EINVAL;
if (strcmp(node_type_name, __btf_name_by_offset(btf, n->name_off)))
return -EINVAL;
- offset = __btf_member_bit_offset(n, member);
+ offset = __btf_member_bit_offset(member_type, member);
if (offset % 8)
return -EINVAL;
offset /= 8;
@@ -5440,7 +5484,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
const struct btf_member *member;
struct btf_struct_meta *type;
struct btf_record *record;
- const struct btf_type *t;
+ const struct btf_type *t, *member_type;
int j, tab_cnt, id;
id = btf_is_base_kernel ?
@@ -5462,6 +5506,16 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
cond_resched();
for_each_member(j, t, member) {
+ member_type = btf_type_by_id(btf, member->type);
+ if (BTF_INFO_KIND(member_type->info) == BTF_KIND_UNION) {
+ const struct btf_member *umember;
+ int k;
+
+ for_each_member(k, member_type, umember) {
+ if (btf_id_set_contains(&aof.set, umember->type))
+ goto parse;
+ }
+ }
if (btf_id_set_contains(&aof.set, member->type))
goto parse;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 10/20] bpf: Introduce exclusive-ownership list and rbtree nodes
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (8 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 09/20] bpf: Find special BTF fields in union Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 11/20] bpf: Allow adding exclusive nodes to bpf list and rbtree Amery Hung
` (9 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch reintroduces the semantics of exclusive ownership of a
reference. The main motivation is to save space and avoid changing kernel
structure layouts. Existing bpf graph nodes add an additional owner field
to list_head and rb_node to safely support shared ownership of a
reference. The previous patch supports adding kernel objects to
collections by embedding bpf_list_node or bpf_rb_node in a kernel
structure, the same way as in user-defined local objects. However, the
layout of some kernel objects has been optimized throughout the years and
cannot be easily changed. For example, a bpf_rb_node cannot be added to
the union at offset=0 in sk_buff since bpf_rb_node is larger than the
other members. Exclusive ownership solves the problem: "owner" is no
longer needed, so both graph nodes can sit at the same offset.
To achieve this, bpf_list_excl_node and bpf_rb_excl_node are first
introduced. They simply wrap list_head and rb_node, and serve as
annotations in BTF. Then, when parsing BTF, we make sure that they cannot
coexist with bpf_refcount, bpf_list_node, or bpf_rb_node in the same
structure. This prevents the user from acquiring more than one reference
to an object containing an exclusive node.
No exclusive node can be added to a collection yet. We will teach the
verifier to accept exclusive nodes as valid graph nodes and then skip the
ownership checks in graph kfuncs.
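For illustration, two hypothetical layouts (not taken from the patch) under
the rules added here:
	/* accepted: exclusive list and rbtree nodes may share a union, since
	 * without the owner field they are no larger than list_head/rb_node.
	 */
	struct excl_obj {
		union {
			struct bpf_list_excl_node lnode;
			struct bpf_rb_excl_node rnode;
		};
		int data;
	};
	/* rejected when parsing BTF: an exclusive node cannot coexist with
	 * bpf_refcount (or with bpf_list_node/bpf_rb_node) in the same
	 * structure, so at most one reference to such an object can exist.
	 */
	struct bad_obj {
		struct bpf_list_excl_node lnode;
		struct bpf_refcount ref;
		int data;
	};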
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/bpf.h | 27 ++++++++++++---
include/linux/rbtree_types.h | 4 +++
include/linux/types.h | 4 +++
kernel/bpf/btf.c | 64 +++++++++++++++++++++++++++++++++---
kernel/bpf/syscall.c | 20 +++++++++--
5 files changed, 108 insertions(+), 11 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6aabca1581fe..49c29c823fb3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -197,11 +197,16 @@ enum btf_field_type {
BPF_KPTR = BPF_KPTR_UNREF | BPF_KPTR_REF | BPF_KPTR_PERCPU,
BPF_LIST_HEAD = (1 << 5),
BPF_LIST_NODE = (1 << 6),
- BPF_RB_ROOT = (1 << 7),
- BPF_RB_NODE = (1 << 8),
- BPF_GRAPH_NODE = BPF_RB_NODE | BPF_LIST_NODE,
+ BPF_LIST_EXCL_NODE = (1 << 7),
+ BPF_RB_ROOT = (1 << 8),
+ BPF_RB_NODE = (1 << 9),
+ BPF_RB_EXCL_NODE = (1 << 10),
+ BPF_GRAPH_EXCL_NODE = BPF_RB_EXCL_NODE | BPF_LIST_EXCL_NODE,
+ BPF_GRAPH_NODE = BPF_RB_NODE | BPF_LIST_NODE |
+ BPF_RB_EXCL_NODE | BPF_LIST_EXCL_NODE,
BPF_GRAPH_ROOT = BPF_RB_ROOT | BPF_LIST_HEAD,
- BPF_REFCOUNT = (1 << 9),
+ BPF_GRAPH_NODE_OR_ROOT = BPF_GRAPH_NODE | BPF_GRAPH_ROOT,
+ BPF_REFCOUNT = (1 << 11),
};
typedef void (*btf_dtor_kfunc_t)(void *);
@@ -321,10 +326,14 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
return "bpf_list_head";
case BPF_LIST_NODE:
return "bpf_list_node";
+ case BPF_LIST_EXCL_NODE:
+ return "bpf_list_excl_node";
case BPF_RB_ROOT:
return "bpf_rb_root";
case BPF_RB_NODE:
return "bpf_rb_node";
+ case BPF_RB_EXCL_NODE:
+ return "bpf_rb_excl_node";
case BPF_REFCOUNT:
return "bpf_refcount";
default:
@@ -348,10 +357,14 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
return sizeof(struct bpf_list_head);
case BPF_LIST_NODE:
return sizeof(struct bpf_list_node);
+ case BPF_LIST_EXCL_NODE:
+ return sizeof(struct bpf_list_excl_node);
case BPF_RB_ROOT:
return sizeof(struct bpf_rb_root);
case BPF_RB_NODE:
return sizeof(struct bpf_rb_node);
+ case BPF_RB_EXCL_NODE:
+ return sizeof(struct bpf_rb_excl_node);
case BPF_REFCOUNT:
return sizeof(struct bpf_refcount);
default:
@@ -375,10 +388,14 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
return __alignof__(struct bpf_list_head);
case BPF_LIST_NODE:
return __alignof__(struct bpf_list_node);
+ case BPF_LIST_EXCL_NODE:
+ return __alignof__(struct bpf_list_excl_node);
case BPF_RB_ROOT:
return __alignof__(struct bpf_rb_root);
case BPF_RB_NODE:
return __alignof__(struct bpf_rb_node);
+ case BPF_RB_EXCL_NODE:
+ return __alignof__(struct bpf_rb_excl_node);
case BPF_REFCOUNT:
return __alignof__(struct bpf_refcount);
default:
@@ -396,10 +413,12 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
refcount_set((refcount_t *)addr, 1);
break;
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
RB_CLEAR_NODE((struct rb_node *)addr);
break;
case BPF_LIST_HEAD:
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
INIT_LIST_HEAD((struct list_head *)addr);
break;
case BPF_RB_ROOT:
diff --git a/include/linux/rbtree_types.h b/include/linux/rbtree_types.h
index 45b6ecde3665..fc5185991fb1 100644
--- a/include/linux/rbtree_types.h
+++ b/include/linux/rbtree_types.h
@@ -28,6 +28,10 @@ struct rb_root_cached {
struct rb_node *rb_leftmost;
};
+struct bpf_rb_excl_node {
+ struct rb_node rb_node;
+};
+
#define RB_ROOT (struct rb_root) { NULL, }
#define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
diff --git a/include/linux/types.h b/include/linux/types.h
index 2bc8766ba20c..71429cd80ce2 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -202,6 +202,10 @@ struct hlist_node {
struct hlist_node *next, **pprev;
};
+struct bpf_list_excl_node {
+ struct list_head list_head;
+};
+
struct ustat {
__kernel_daddr_t f_tfree;
#ifdef CONFIG_ARCH_32BIT_USTAT_F_TINODE
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 25a5dc840ac3..a641c716e0fa 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3484,6 +3484,8 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root");
field_mask_test_name(BPF_RB_NODE, "bpf_rb_node");
field_mask_test_name(BPF_REFCOUNT, "bpf_refcount");
+ field_mask_test_name(BPF_LIST_EXCL_NODE, "bpf_list_excl_node");
+ field_mask_test_name(BPF_RB_EXCL_NODE, "bpf_rb_excl_node");
/* Only return BPF_KPTR when all other types with matchable names fail */
if (field_mask & BPF_KPTR) {
@@ -3504,6 +3506,8 @@ static int btf_get_union_field_types(const struct btf *btf, const struct btf_typ
const struct btf_member *member;
const struct btf_type *t;
+ field_mask &= BPF_GRAPH_EXCL_NODE;
+
for_each_member(i, u, member) {
t = btf_type_by_id(btf, member->type);
field_type = btf_get_field_type(__btf_name_by_offset(btf, t->name_off),
@@ -3552,13 +3556,28 @@ static int btf_find_struct_field(const struct btf *btf,
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_REFCOUNT:
ret = btf_find_struct(btf, member_type, off, sz, field_type,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
return ret;
break;
+ case BPF_GRAPH_EXCL_NODE:
+ ret = btf_find_struct(btf, member_type, off, sz,
+ BPF_LIST_EXCL_NODE,
+ idx < info_cnt ? &info[idx] : &tmp);
+ if (ret < 0)
+ return ret;
+ ++idx;
+ ret = btf_find_struct(btf, member_type, off, sz,
+ BPF_RB_EXCL_NODE,
+ idx < info_cnt ? &info[idx] : &tmp);
+ if (ret < 0)
+ return ret;
+ break;
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
case BPF_KPTR_PERCPU:
@@ -3619,7 +3638,9 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_REFCOUNT:
ret = btf_find_struct(btf, var_type, off, sz, field_type,
idx < info_cnt ? &info[idx] : &tmp);
@@ -3827,14 +3848,24 @@ static int btf_parse_graph_root(struct btf_field *field,
static int btf_parse_list_head(struct btf_field *field, struct btf_field_info *info)
{
- return btf_parse_graph_root(field, info, "bpf_list_node",
- __alignof__(struct bpf_list_node));
+ int err;
+
+ err = btf_parse_graph_root(field, info, "bpf_list_node",
+ __alignof__(struct bpf_list_node));
+
+ return err ? btf_parse_graph_root(field, info, "bpf_list_excl_node",
+ __alignof__(struct bpf_list_excl_node)) : 0;
}
static int btf_parse_rb_root(struct btf_field *field, struct btf_field_info *info)
{
- return btf_parse_graph_root(field, info, "bpf_rb_node",
- __alignof__(struct bpf_rb_node));
+ int err;
+
+ err = btf_parse_graph_root(field, info, "bpf_rb_node",
+ __alignof__(struct bpf_rb_node));
+
+ return err ? btf_parse_graph_root(field, info, "bpf_rb_excl_node",
+ __alignof__(struct bpf_rb_excl_node)) : 0;
}
static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
@@ -3864,6 +3895,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
return NULL;
cnt = ret;
+
/* This needs to be kzalloc to zero out padding and unused fields, see
* comment in btf_record_equal.
*/
@@ -3881,7 +3913,9 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
ret = -EFAULT;
goto end;
}
- if (info_arr[i].off < next_off) {
+ if (info_arr[i].off < next_off &&
+ !(info_arr[i].off == info_arr[i - 1].off &&
+ (info_arr[i].type | info_arr[i - 1].type) == BPF_GRAPH_EXCL_NODE)) {
ret = -EEXIST;
goto end;
}
@@ -3925,6 +3959,8 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
if (ret < 0)
goto end;
break;
+ case BPF_LIST_EXCL_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_LIST_NODE:
case BPF_RB_NODE:
break;
@@ -3949,6 +3985,21 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
goto end;
}
+ if (rec->refcount_off >= 0 &&
+ (btf_record_has_field(rec, BPF_LIST_EXCL_NODE) ||
+ btf_record_has_field(rec, BPF_RB_EXCL_NODE))) {
+ ret = -EINVAL;
+ goto end;
+ }
+
+ if ((btf_record_has_field(rec, BPF_LIST_EXCL_NODE) ||
+ btf_record_has_field(rec, BPF_RB_EXCL_NODE)) &&
+ (btf_record_has_field(rec, BPF_LIST_NODE) ||
+ btf_record_has_field(rec, BPF_RB_NODE))) {
+ ret = -EINVAL;
+ goto end;
+ }
+
sort_r(rec->fields, rec->cnt, sizeof(struct btf_field), btf_field_cmp,
NULL, rec);
@@ -5434,8 +5485,10 @@ static const char *alloc_obj_fields[] = {
"bpf_spin_lock",
"bpf_list_head",
"bpf_list_node",
+ "bpf_list_excl_node",
"bpf_rb_root",
"bpf_rb_node",
+ "bpf_rb_excl_node",
"bpf_refcount",
};
@@ -5536,6 +5589,7 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
type->btf_id = id;
record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
+ BPF_LIST_EXCL_NODE | BPF_RB_EXCL_NODE |
BPF_KPTR, t->size);
/* The record cannot be unset, treat it as an error if so */
if (IS_ERR_OR_NULL(record)) {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9e93d48efe19..25fad6293720 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -528,13 +528,23 @@ struct btf_field *btf_record_find(const struct btf_record *rec, u32 offset,
u32 field_mask)
{
struct btf_field *field;
+ u32 i;
if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & field_mask))
return NULL;
field = bsearch(&offset, rec->fields, rec->cnt, sizeof(rec->fields[0]), btf_field_cmp);
- if (!field || !(field->type & field_mask))
+ if (!field)
return NULL;
- return field;
+ if (field->type & field_mask)
+ return field;
+ if (field->type & BPF_GRAPH_EXCL_NODE && field_mask & BPF_GRAPH_EXCL_NODE) {
+ i = field - rec->fields;
+ if (i > 0 && (field - 1)->type & field_mask)
+ return field - 1;
+ if (i < rec->cnt - 1 && (field + 1)->type & field_mask)
+ return field + 1;
+ }
+ return NULL;
}
void btf_record_free(struct btf_record *rec)
@@ -554,8 +564,10 @@ void btf_record_free(struct btf_record *rec)
break;
case BPF_LIST_HEAD:
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
case BPF_RB_ROOT:
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
@@ -603,8 +615,10 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
break;
case BPF_LIST_HEAD:
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
case BPF_RB_ROOT:
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_REFCOUNT:
@@ -711,7 +725,9 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
break;
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
case BPF_RB_NODE:
+ case BPF_RB_EXCL_NODE:
case BPF_REFCOUNT:
break;
default:
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 11/20] bpf: Allow adding exclusive nodes to bpf list and rbtree
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (9 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 10/20] bpf: Introduce exclusive-ownership list and rbtree nodes Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 12/20] selftests/bpf: Modify linked_list tests to work with macro-ified removes Amery Hung
` (8 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch first teaches the verifier to accept exclusive nodes
(bpf_list_excl_node and bpf_rb_excl_node) as valid graph nodes.
Graph kfuncs can now skip ownership tracking and checks for graphs
containing exclusive nodes, since we already make sure that an exclusive
node cannot be owned by more than one collection at the same time.
Graph kfuncs will use struct_meta to tell whether a node is exclusive or
not. Therefore, we pass struct_meta as an additional argument to the graph
remove kfuncs and let the verifier fix up the instruction.
The first user of exclusive-ownership nodes is sk_buff. In bpf qdisc, an
sk_buff can now be enqueued into either a bpf_list or a bpf_rbtree. This
significantly simplifies how users write the code and improves qdisc
performance, as we no longer need to allocate local objects to store skb
kptrs.
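For reference, a minimal sketch of the intended fast path (illustrative only;
private() and __contains_kptr() are the selftest helpers, container_of() is
defined inline for self-containedness, and "skb" stands for the referenced
skb kptr that an enqueue struct_ops program receives later in the series):
	#ifndef container_of
	#define container_of(ptr, type, member) \
		((type *)((void *)(ptr) - __builtin_offsetof(type, member)))
	#endif
	private(FIFO) struct bpf_spin_lock q_lock;
	private(FIFO) struct bpf_list_head q_head __contains_kptr(sk_buff, bpf_list);
	/* enqueue: queue the skb directly via its embedded exclusive node;
	 * no bpf_obj_new()/bpf_kptr_xchg() on the fast path.
	 */
	static int fifo_enqueue_skb(struct sk_buff *skb)
	{
		bpf_spin_lock(&q_lock);
		bpf_list_excl_push_back(&q_head, &skb->bpf_list);
		bpf_spin_unlock(&q_lock);
		return 0;
	}
	/* dequeue: pop the exclusive node and convert it back to the skb */
	static struct sk_buff *fifo_dequeue_skb(void)
	{
		struct bpf_list_excl_node *n;
		bpf_spin_lock(&q_lock);
		n = bpf_list_excl_pop_front(&q_head);
		bpf_spin_unlock(&q_lock);
		if (!n)
			return NULL;
		return container_of(n, struct sk_buff, bpf_list);
	}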
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/skbuff.h | 2 +
kernel/bpf/btf.c | 1 +
kernel/bpf/helpers.c | 63 +++++++----
kernel/bpf/verifier.c | 101 ++++++++++++++----
.../testing/selftests/bpf/bpf_experimental.h | 58 +++++++++-
5 files changed, 180 insertions(+), 45 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03ea36a82cdd..fefc82542a3c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -871,6 +871,8 @@ struct sk_buff {
struct rb_node rbnode; /* used in netem, ip4 defrag, and tcp stack */
struct list_head list;
struct llist_node ll_node;
+ struct bpf_list_excl_node bpf_list;
+ struct bpf_rb_excl_node bpf_rbnode;
};
struct sock *sk;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a641c716e0fa..6a9c1671c8f4 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5495,6 +5495,7 @@ static const char *alloc_obj_fields[] = {
/* kernel structures with special BTF fields*/
static const char *kstructs_with_special_btf[] = {
"unused",
+ "sk_buff",
};
static struct btf_struct_metas *
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 70655cec452c..7acdd8899304 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1988,6 +1988,9 @@ static int __bpf_list_add(struct bpf_list_node_kern *node,
bool tail, struct btf_record *rec, u64 off)
{
struct list_head *n = &node->list_head, *h = (void *)head;
+ bool exclusive;
+
+ exclusive = btf_record_has_field(rec, BPF_LIST_EXCL_NODE);
/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
* called on its fields, so init here
@@ -1998,14 +2001,15 @@ static int __bpf_list_add(struct bpf_list_node_kern *node,
/* node->owner != NULL implies !list_empty(n), no need to separately
* check the latter
*/
- if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
+ if (!exclusive && cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
/* Only called from BPF prog, no need to migrate_disable */
__bpf_obj_drop_impl((void *)n - off, rec, false);
return -EINVAL;
}
tail ? list_add_tail(n, h) : list_add(n, h);
- WRITE_ONCE(node->owner, head);
+ if (!exclusive)
+ WRITE_ONCE(node->owner, head);
return 0;
}
@@ -2030,10 +2034,14 @@ __bpf_kfunc int bpf_list_push_back_impl(struct bpf_list_head *head,
return __bpf_list_add(n, head, true, meta ? meta->record : NULL, off);
}
-static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tail)
+static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head,
+ struct btf_record *rec, bool tail)
{
struct list_head *n, *h = (void *)head;
struct bpf_list_node_kern *node;
+ bool exclusive;
+
+ exclusive = btf_record_has_field(rec, BPF_LIST_EXCL_NODE);
/* If list_head was 0-initialized by map, bpf_obj_init_field wasn't
* called on its fields, so init here
@@ -2045,40 +2053,55 @@ static struct bpf_list_node *__bpf_list_del(struct bpf_list_head *head, bool tai
n = tail ? h->prev : h->next;
node = container_of(n, struct bpf_list_node_kern, list_head);
- if (WARN_ON_ONCE(READ_ONCE(node->owner) != head))
+ if (!exclusive && WARN_ON_ONCE(READ_ONCE(node->owner) != head))
return NULL;
list_del_init(n);
- WRITE_ONCE(node->owner, NULL);
+ if (!exclusive)
+ WRITE_ONCE(node->owner, NULL);
return (struct bpf_list_node *)n;
}
-__bpf_kfunc struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
+__bpf_kfunc struct bpf_list_node *bpf_list_pop_front_impl(struct bpf_list_head *head,
+ void *meta__ign)
{
- return __bpf_list_del(head, false);
+ struct btf_struct_meta *meta = meta__ign;
+
+ return __bpf_list_del(head, meta ? meta->record : NULL, false);
}
-__bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
+__bpf_kfunc struct bpf_list_node *bpf_list_pop_back_impl(struct bpf_list_head *head,
+ void *meta__ign)
{
- return __bpf_list_del(head, true);
+ struct btf_struct_meta *meta = meta__ign;
+
+ return __bpf_list_del(head, meta ? meta->record : NULL, true);
}
-__bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
- struct bpf_rb_node *node)
+__bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove_impl(struct bpf_rb_root *root,
+ struct bpf_rb_node *node,
+ void *meta__ign)
{
struct bpf_rb_node_kern *node_internal = (struct bpf_rb_node_kern *)node;
struct rb_root_cached *r = (struct rb_root_cached *)root;
struct rb_node *n = &node_internal->rb_node;
+ struct btf_struct_meta *meta = meta__ign;
+ struct btf_record *rec;
+ bool exclusive;
+
+ rec = meta ? meta->record : NULL;
+ exclusive = btf_record_has_field(rec, BPF_RB_EXCL_NODE);
/* node_internal->owner != root implies either RB_EMPTY_NODE(n) or
* n is owned by some other tree. No need to check RB_EMPTY_NODE(n)
*/
- if (READ_ONCE(node_internal->owner) != root)
+ if (!exclusive && READ_ONCE(node_internal->owner) != root)
return NULL;
rb_erase_cached(n, r);
RB_CLEAR_NODE(n);
- WRITE_ONCE(node_internal->owner, NULL);
+ if (!exclusive)
+ WRITE_ONCE(node_internal->owner, NULL);
return (struct bpf_rb_node *)n;
}
@@ -2093,11 +2116,14 @@ static int __bpf_rbtree_add(struct bpf_rb_root *root,
struct rb_node *parent = NULL, *n = &node->rb_node;
bpf_callback_t cb = (bpf_callback_t)less;
bool leftmost = true;
+ bool exclusive;
+
+ exclusive = btf_record_has_field(rec, BPF_RB_EXCL_NODE);
/* node->owner != NULL implies !RB_EMPTY_NODE(n), no need to separately
* check the latter
*/
- if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
+ if (!exclusive && cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
/* Only called from BPF prog, no need to migrate_disable */
__bpf_obj_drop_impl((void *)n - off, rec, false);
return -EINVAL;
@@ -2115,7 +2141,8 @@ static int __bpf_rbtree_add(struct bpf_rb_root *root,
rb_link_node(n, parent, link);
rb_insert_color_cached(n, (struct rb_root_cached *)root, leftmost);
- WRITE_ONCE(node->owner, root);
+ if (!exclusive)
+ WRITE_ONCE(node->owner, root);
return 0;
}
@@ -2562,11 +2589,11 @@ BTF_ID_FLAGS(func, bpf_percpu_obj_drop_impl, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_refcount_acquire_impl, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
BTF_ID_FLAGS(func, bpf_list_push_front_impl)
BTF_ID_FLAGS(func, bpf_list_push_back_impl)
-BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
-BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_pop_front_impl, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_pop_back_impl, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
-BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_rbtree_remove_impl, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_rbtree_add_impl)
BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f01d2b876a2e..ffab9b6048cd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11005,13 +11005,13 @@ enum special_kfunc_type {
KF_bpf_refcount_acquire_impl,
KF_bpf_list_push_front_impl,
KF_bpf_list_push_back_impl,
- KF_bpf_list_pop_front,
- KF_bpf_list_pop_back,
+ KF_bpf_list_pop_front_impl,
+ KF_bpf_list_pop_back_impl,
KF_bpf_cast_to_kern_ctx,
KF_bpf_rdonly_cast,
KF_bpf_rcu_read_lock,
KF_bpf_rcu_read_unlock,
- KF_bpf_rbtree_remove,
+ KF_bpf_rbtree_remove_impl,
KF_bpf_rbtree_add_impl,
KF_bpf_rbtree_first,
KF_bpf_dynptr_from_skb,
@@ -11031,11 +11031,11 @@ BTF_ID(func, bpf_obj_drop_impl)
BTF_ID(func, bpf_refcount_acquire_impl)
BTF_ID(func, bpf_list_push_front_impl)
BTF_ID(func, bpf_list_push_back_impl)
-BTF_ID(func, bpf_list_pop_front)
-BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_pop_front_impl)
+BTF_ID(func, bpf_list_pop_back_impl)
BTF_ID(func, bpf_cast_to_kern_ctx)
BTF_ID(func, bpf_rdonly_cast)
-BTF_ID(func, bpf_rbtree_remove)
+BTF_ID(func, bpf_rbtree_remove_impl)
BTF_ID(func, bpf_rbtree_add_impl)
BTF_ID(func, bpf_rbtree_first)
BTF_ID(func, bpf_dynptr_from_skb)
@@ -11057,13 +11057,13 @@ BTF_ID(func, bpf_obj_drop_impl)
BTF_ID(func, bpf_refcount_acquire_impl)
BTF_ID(func, bpf_list_push_front_impl)
BTF_ID(func, bpf_list_push_back_impl)
-BTF_ID(func, bpf_list_pop_front)
-BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_pop_front_impl)
+BTF_ID(func, bpf_list_pop_back_impl)
BTF_ID(func, bpf_cast_to_kern_ctx)
BTF_ID(func, bpf_rdonly_cast)
BTF_ID(func, bpf_rcu_read_lock)
BTF_ID(func, bpf_rcu_read_unlock)
-BTF_ID(func, bpf_rbtree_remove)
+BTF_ID(func, bpf_rbtree_remove_impl)
BTF_ID(func, bpf_rbtree_add_impl)
BTF_ID(func, bpf_rbtree_first)
BTF_ID(func, bpf_dynptr_from_skb)
@@ -11382,14 +11382,14 @@ static bool is_bpf_list_api_kfunc(u32 btf_id)
{
return btf_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
btf_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
- btf_id == special_kfunc_list[KF_bpf_list_pop_front] ||
- btf_id == special_kfunc_list[KF_bpf_list_pop_back];
+ btf_id == special_kfunc_list[KF_bpf_list_pop_front_impl] ||
+ btf_id == special_kfunc_list[KF_bpf_list_pop_back_impl];
}
static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
{
return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl] ||
- btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ btf_id == special_kfunc_list[KF_bpf_rbtree_remove_impl] ||
btf_id == special_kfunc_list[KF_bpf_rbtree_first];
}
@@ -11448,11 +11448,13 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
switch (node_field_type) {
case BPF_LIST_NODE:
+ case BPF_LIST_EXCL_NODE:
ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_back_impl]);
break;
case BPF_RB_NODE:
- ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ case BPF_RB_EXCL_NODE:
+ ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_remove_impl] ||
kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl]);
break;
default:
@@ -11515,6 +11517,9 @@ __process_kf_arg_ptr_to_graph_root(struct bpf_verifier_env *env,
return -EFAULT;
}
*head_field = field;
+ meta->arg_btf = field->graph_root.btf;
+ meta->arg_btf_id = field->graph_root.value_btf_id;
+
return 0;
}
@@ -11603,18 +11608,30 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
struct bpf_kfunc_call_arg_meta *meta)
{
- return __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
- BPF_LIST_HEAD, BPF_LIST_NODE,
- &meta->arg_list_head.field);
+ int err;
+
+ err = __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
+ BPF_LIST_HEAD, BPF_LIST_NODE,
+ &meta->arg_list_head.field);
+
+ return err ? __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
+ BPF_LIST_HEAD, BPF_LIST_EXCL_NODE,
+ &meta->arg_list_head.field) : 0;
}
static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
struct bpf_kfunc_call_arg_meta *meta)
{
- return __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
- BPF_RB_ROOT, BPF_RB_NODE,
- &meta->arg_rbtree_root.field);
+ int err;
+
+ err = __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
+ BPF_RB_ROOT, BPF_RB_NODE,
+ &meta->arg_rbtree_root.field);
+
+ return err ? __process_kf_arg_ptr_to_graph_node(env, reg, regno, meta,
+ BPF_RB_ROOT, BPF_RB_EXCL_NODE,
+ &meta->arg_rbtree_root.field) : 0;
}
/*
@@ -11948,7 +11965,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return ret;
break;
case KF_ARG_PTR_TO_RB_NODE:
- if (meta->func_id == special_kfunc_list[KF_bpf_rbtree_remove]) {
+ if (meta->func_id == special_kfunc_list[KF_bpf_rbtree_remove_impl]) {
if (!type_is_non_owning_ref(reg->type) || reg->ref_obj_id) {
verbose(env, "rbtree_remove node input must be non-owning ref\n");
return -EINVAL;
@@ -12255,6 +12272,11 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
}
}
+ if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front_impl] ||
+ meta.func_id == special_kfunc_list[KF_bpf_list_pop_back_impl] ||
+ meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove_impl])
+ insn_aux->kptr_struct_meta = btf_find_struct_meta(meta.arg_btf, meta.arg_btf_id);
+
if (meta.func_id == special_kfunc_list[KF_bpf_throw]) {
if (!bpf_jit_supports_exceptions()) {
verbose(env, "JIT does not support calling kfunc %s#%d\n",
@@ -12386,12 +12408,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
insn_aux->kptr_struct_meta =
btf_find_struct_meta(meta.arg_btf,
meta.arg_btf_id);
- } else if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front] ||
- meta.func_id == special_kfunc_list[KF_bpf_list_pop_back]) {
+ } else if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front_impl] ||
+ meta.func_id == special_kfunc_list[KF_bpf_list_pop_back_impl]) {
struct btf_field *field = meta.arg_list_head.field;
mark_reg_graph_node(regs, BPF_REG_0, &field->graph_root);
- } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove_impl] ||
meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
struct btf_field *field = meta.arg_rbtree_root.field;
@@ -19526,6 +19548,21 @@ static void __fixup_collection_insert_kfunc(struct bpf_insn_aux_data *insn_aux,
*cnt = 4;
}
+static void __fixup_collection_remove_kfunc(struct bpf_insn_aux_data *insn_aux,
+ u16 struct_meta_reg,
+ struct bpf_insn *insn,
+ struct bpf_insn *insn_buf,
+ int *cnt)
+{
+ struct btf_struct_meta *kptr_struct_meta = insn_aux->kptr_struct_meta;
+ struct bpf_insn addr[2] = { BPF_LD_IMM64(struct_meta_reg, (long)kptr_struct_meta) };
+
+ insn_buf[0] = addr[0];
+ insn_buf[1] = addr[1];
+ insn_buf[2] = *insn;
+ *cnt = 3;
+}
+
static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
struct bpf_insn *insn_buf, int insn_idx, int *cnt)
{
@@ -19614,6 +19651,24 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
node_offset_reg, insn, insn_buf, cnt);
+ } else if (desc->func_id == special_kfunc_list[KF_bpf_list_pop_back_impl] ||
+ desc->func_id == special_kfunc_list[KF_bpf_list_pop_front_impl] ||
+ desc->func_id == special_kfunc_list[KF_bpf_rbtree_remove_impl]) {
+ struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
+ int struct_meta_reg = BPF_REG_2;
+
+ /* rbtree_remove has extra 'node' arg, so args-to-fixup are in diff regs */
+ if (desc->func_id == special_kfunc_list[KF_bpf_rbtree_remove_impl])
+ struct_meta_reg = BPF_REG_3;
+
+ if (!kptr_struct_meta) {
+ verbose(env, "verifier internal error: kptr_struct_meta expected at insn_idx %d\n",
+ insn_idx);
+ return -EFAULT;
+ }
+
+ __fixup_collection_remove_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
+ insn, insn_buf, cnt);
} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index a4da75df819c..27f6d1fec793 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -91,22 +91,34 @@ extern int bpf_list_push_back_impl(struct bpf_list_head *head,
* Returns
* Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
*/
-extern struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+extern struct bpf_list_node *bpf_list_pop_front_impl(struct bpf_list_head *head,
+ void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_list_pop_front_impl */
+#define bpf_list_pop_front(head) bpf_list_pop_front_impl(head, NULL)
/* Description
* Remove the entry at the end of the BPF linked list.
* Returns
* Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
*/
-extern struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+extern struct bpf_list_node *bpf_list_pop_back_impl(struct bpf_list_head *head,
+ void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_list_pop_back_impl */
+#define bpf_list_pop_back(head) bpf_list_pop_back_impl(head, NULL)
/* Description
* Remove 'node' from rbtree with root 'root'
* Returns
* Pointer to the removed node, or NULL if 'root' didn't contain 'node'
*/
-extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
- struct bpf_rb_node *node) __ksym;
+extern struct bpf_rb_node *bpf_rbtree_remove_impl(struct bpf_rb_root *root,
+ struct bpf_rb_node *node,
+ void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_rbtree_remove_impl */
+#define bpf_rbtree_remove(head, node) bpf_rbtree_remove_impl(head, node, NULL)
/* Description
* Add 'node' to rbtree with root 'root' using comparator 'less'
@@ -132,6 +144,44 @@ extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *nod
*/
extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
+/* Convenience single-ownership graph functions */
+int bpf_list_excl_push_front(struct bpf_list_head *head, struct bpf_list_excl_node *node)
+{
+ return bpf_list_push_front(head, (struct bpf_list_node *)node);
+}
+
+int bpf_list_excl_push_back(struct bpf_list_head *head, struct bpf_list_excl_node *node)
+{
+ return bpf_list_push_back(head, (struct bpf_list_node *)node);
+}
+
+struct bpf_list_excl_node *bpf_list_excl_pop_front(struct bpf_list_head *head)
+{
+ return (struct bpf_list_excl_node *)bpf_list_pop_front(head);
+}
+
+struct bpf_list_excl_node *bpf_list_excl_pop_back(struct bpf_list_head *head)
+{
+ return (struct bpf_list_excl_node *)bpf_list_pop_back(head);
+}
+
+struct bpf_rb_excl_node *bpf_rbtree_excl_remove(struct bpf_rb_root *root,
+ struct bpf_rb_excl_node *node)
+{
+ return (struct bpf_rb_excl_node *)bpf_rbtree_remove(root, (struct bpf_rb_node *)node);
+}
+
+int bpf_rbtree_excl_add(struct bpf_rb_root *root, struct bpf_rb_excl_node *node,
+ bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
+{
+ return bpf_rbtree_add(root, (struct bpf_rb_node *)node, less);
+}
+
+struct bpf_rb_excl_node *bpf_rbtree_excl_first(struct bpf_rb_root *root)
+{
+ return (struct bpf_rb_excl_node *)bpf_rbtree_first(root);
+}
+
/* Description
* Allocates a percpu object of the type represented by 'local_type_id' in
* program BTF. User may use the bpf_core_type_id_local macro to pass the
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 12/20] selftests/bpf: Modify linked_list tests to work with macro-ified removes
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (10 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 11/20] bpf: Allow adding exclusive nodes to bpf list and rbtree Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 13/20] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
` (7 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
Since a hidden argument is added to the bpf list remove kfuncs and
bpf_list_pop_front/back are now macro-ified, modify the selftests so that
they still compile.
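Concretely, with the macros added in the previous patch, e.g.
	#define bpf_list_pop_front(head) bpf_list_pop_front_impl(head, NULL)
bpf_list_pop_front is no longer a function whose address can be taken, so
tests that did
	void (*p)(void *) = (void *)&bpf_list_##op;
now invoke the macro directly instead.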
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/progs/linked_list_fail.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/linked_list_fail.c b/tools/testing/selftests/bpf/progs/linked_list_fail.c
index 5f8063ecc448..d260f80ea64d 100644
--- a/tools/testing/selftests/bpf/progs/linked_list_fail.c
+++ b/tools/testing/selftests/bpf/progs/linked_list_fail.c
@@ -49,8 +49,7 @@
int test##_missing_lock_##op(void *ctx) \
{ \
INIT; \
- void (*p)(void *) = (void *)&bpf_list_##op; \
- p(hexpr); \
+ bpf_list_##op(hexpr); \
return 0; \
}
@@ -96,9 +95,8 @@ CHECK(inner_map, push_back, &iv->head, &f->node2);
int test##_incorrect_lock_##op(void *ctx) \
{ \
INIT; \
- void (*p)(void *) = (void *)&bpf_list_##op; \
bpf_spin_lock(lexpr); \
- p(hexpr); \
+ bpf_list_##op(hexpr); \
return 0; \
}
@@ -576,7 +574,7 @@ int incorrect_head_off2(void *ctx)
}
static __always_inline
-int pop_ptr_off(void *(*op)(void *head))
+int pop_ptr_off(bool pop_front)
{
struct {
struct bpf_list_head head __contains(foo, node2);
@@ -588,7 +586,10 @@ int pop_ptr_off(void *(*op)(void *head))
if (!p)
return 0;
bpf_spin_lock(&p->lock);
- n = op(&p->head);
+ if (pop_front)
+ n = bpf_list_pop_front(&p->head);
+ else
+ n = bpf_list_pop_back(&p->head);
bpf_spin_unlock(&p->lock);
if (!n)
@@ -600,13 +601,13 @@ int pop_ptr_off(void *(*op)(void *head))
SEC("?tc")
int pop_front_off(void *ctx)
{
- return pop_ptr_off((void *)bpf_list_pop_front);
+ return pop_ptr_off(true);
}
SEC("?tc")
int pop_back_off(void *ctx)
{
- return pop_ptr_off((void *)bpf_list_pop_back);
+ return pop_ptr_off(false);
}
SEC("?tc")
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 13/20] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (11 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 12/20] selftests/bpf: Modify linked_list tests to work with macro-ified removes Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
` (6 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch enables users to implement a qdisc using bpf. The last few
patches in this series have prepared struct_ops to support the core
methods in Qdisc_ops. Recent advancements in bpf, such as local objects,
bpf list, and bpf rbtree, also provide powerful and flexible building
blocks for realizing sophisticated scheduling algorithms. Therefore, in
this patch, we start allowing a qdisc to be implemented using bpf
struct_ops. Users can implement .enqueue and .dequeue of Qdisc_ops in bpf
and register the qdisc dynamically into the kernel.
To further make bpf qdisc easy to use, a qdisc watchdog and a class hash
table are included by default. They are taken care of by the bpf qdisc
infrastructure through predefined Qdisc_ops and Qdisc_class_ops methods.
In the next few patches, kfuncs will be introduced so that users can
actually make use of them, and more ops will be supported.
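As a rough sketch of the user-facing shape this enables (illustrative only;
the concrete fifo selftest, the exact enqueue/dequeue signatures, and the
kfuncs for dropping skbs are added later in the series):
	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>
	SEC("struct_ops")
	int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
		     struct bpf_sk_buff_ptr *to_free)
	{
		/* queue the skb, or hand it to to_free when over limit */
		return 0; /* NET_XMIT_SUCCESS */
	}
	SEC("struct_ops")
	struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
	{
		/* return a previously queued skb, or NULL if the qdisc is empty */
		return NULL;
	}
	SEC(".struct_ops")
	struct Qdisc_ops fifo = {
		.enqueue	= (void *)bpf_fifo_enqueue,
		.dequeue	= (void *)bpf_fifo_dequeue,
		.id		= "bpf_fifo",
	};
	char _license[] SEC("license") = "GPL";
The struct_ops map is registered by libbpf at load time; once registered,
the qdisc is expected to be selectable under the id it provides.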
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/btf.h | 1 +
kernel/bpf/btf.c | 2 +-
net/sched/Makefile | 4 +
net/sched/bpf_qdisc.c | 563 ++++++++++++++++++++++++++++++++++++++++
net/sched/sch_api.c | 7 +-
net/sched/sch_generic.c | 3 +-
6 files changed, 575 insertions(+), 5 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 2579b8a51172..2d01a921f604 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -520,6 +520,7 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
const char *btf_name_by_offset(const struct btf *btf, u32 offset);
struct btf *btf_parse_vmlinux(void);
struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
const struct bpf_prog *prog);
u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6a9c1671c8f4..edfaba046427 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6304,7 +6304,7 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
return btf_type_is_int(t);
}
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
int off)
{
const struct btf_param *args;
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..2094e6e74158 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -63,6 +63,10 @@ obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+ifeq ($(CONFIG_BPF_JIT),y)
+obj-$(CONFIG_BPF_SYSCALL) += bpf_qdisc.o
+endif
+
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
obj-$(CONFIG_NET_CLS_FW) += cls_fw.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..53e9b0f1fbd8
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,563 @@
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+static u32 unsupported_ops[] = {
+ offsetof(struct Qdisc_ops, init),
+ offsetof(struct Qdisc_ops, reset),
+ offsetof(struct Qdisc_ops, destroy),
+ offsetof(struct Qdisc_ops, change),
+ offsetof(struct Qdisc_ops, attach),
+ offsetof(struct Qdisc_ops, change_real_num_tx),
+ offsetof(struct Qdisc_ops, dump),
+ offsetof(struct Qdisc_ops, dump_stats),
+ offsetof(struct Qdisc_ops, ingress_block_set),
+ offsetof(struct Qdisc_ops, egress_block_set),
+ offsetof(struct Qdisc_ops, ingress_block_get),
+ offsetof(struct Qdisc_ops, egress_block_get),
+};
+
+struct sch_bpf_class {
+ struct Qdisc_class_common common;
+ struct Qdisc *qdisc;
+
+ unsigned int drops;
+ unsigned int overlimits;
+ struct gnet_stats_basic_sync bstats;
+};
+
+struct bpf_sched_data {
+ struct tcf_proto __rcu *filter_list; /* optional external classifier */
+ struct tcf_block *block;
+ struct Qdisc_class_hash clhash;
+ struct qdisc_watchdog watchdog;
+};
+
+struct bpf_sk_buff_ptr {
+ struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+ struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+ struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+ if (new)
+ *old = qdisc_replace(sch, new, &cl->qdisc);
+ return 0;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+ return cl->qdisc;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ struct Qdisc_class_common *clc;
+
+ clc = qdisc_class_find(&q->clhash, classid);
+ if (!clc)
+ return NULL;
+ return container_of(clc, struct sch_bpf_class, common);
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+ return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+ u32 parentid, struct nlattr **tca,
+ unsigned long *arg,
+ struct netlink_ext_ack *extack)
+{
+ struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ if (!cl) {
+ if (classid == 0 || TC_H_MAJ(classid ^ sch->handle) != 0 ||
+ sch_bpf_find(sch, classid))
+ return -EINVAL;
+
+ cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+ if (!cl)
+ return -ENOBUFS;
+
+ cl->common.classid = classid;
+ gnet_stats_basic_sync_init(&cl->bstats);
+ qdisc_class_hash_insert(&q->clhash, &cl->common);
+ }
+
+ qdisc_class_hash_grow(sch, &q->clhash);
+ *arg = (unsigned long)cl;
+ return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+ struct netlink_ext_ack *extack)
+{
+ struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_class_hash_remove(&q->clhash, &cl->common);
+ if (cl->qdisc)
+ qdisc_put(cl->qdisc);
+ return 0;
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ if (cl)
+ return NULL;
+ return q->block;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+ u32 classid)
+{
+ return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+ struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+ struct gnet_stats_queue qs = {
+ .drops = cl->drops,
+ .overlimits = cl->overlimits,
+ };
+ __u32 qlen = 0;
+
+ if (cl->qdisc)
+ qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+ else
+ qlen = 0;
+
+ if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
+ gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+ return -1;
+ return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ struct sch_bpf_class *cl;
+ unsigned int i;
+
+ if (arg->stop)
+ return;
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+ if (arg->count < arg->skip) {
+ arg->count++;
+ continue;
+ }
+ if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+ arg->stop = 1;
+ return;
+ }
+ arg->count++;
+ }
+ }
+}
+
+static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ int err;
+
+ qdisc_watchdog_init(&q->watchdog, sch);
+
+ err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+ if (err)
+ return err;
+
+ err = qdisc_class_hash_init(&q->clhash);
+ if (err < 0)
+ return err;
+
+ return 0;
+}
+
+static void bpf_qdisc_reset_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ struct sch_bpf_class *cl;
+ unsigned int i;
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+ if (cl->qdisc)
+ qdisc_reset(cl->qdisc);
+ }
+ }
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void bpf_qdisc_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
+{
+ if (cl->qdisc)
+ qdisc_put(cl->qdisc);
+ kfree(cl);
+}
+
+static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ struct sch_bpf_class *cl;
+ struct hlist_node *next;
+ unsigned int i;
+
+ qdisc_watchdog_cancel(&q->watchdog);
+ tcf_block_put(q->block);
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry_safe(cl, next, &q->clhash.hash[i],
+ common.hnode) {
+ qdisc_class_hash_remove(&q->clhash,
+ &cl->common);
+ bpf_qdisc_destroy_class(sch, cl);
+ }
+ }
+
+ qdisc_class_hash_destroy(&q->clhash);
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+ .graft = sch_bpf_graft,
+ .leaf = sch_bpf_leaf,
+ .find = sch_bpf_search,
+ .change = sch_bpf_change_class,
+ .delete = sch_bpf_delete,
+ .tcf_block = sch_bpf_tcf_block,
+ .bind_tcf = sch_bpf_bind,
+ .unbind_tcf = sch_bpf_unbind,
+ .dump = sch_bpf_dump_class,
+ .dump_stats = sch_bpf_dump_class_stats,
+ .walk = sch_bpf_walk,
+};
+
+static const struct bpf_func_proto *
+bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
+ const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ default:
+ return bpf_base_func_proto(func_id, prog);
+ }
+}
+
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ struct btf *btf = prog->aux->attach_btf;
+ u32 arg;
+
+ arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+ if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+ if (arg == 2) {
+ info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+ info->btf = btf;
+ info->btf_id = bpf_sk_buff_ptr_ids[0];
+ return true;
+ }
+ }
+
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ const struct btf_type *t, *skbt;
+ size_t end;
+
+ skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t != skbt) {
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+ }
+
+ switch (off) {
+ case offsetof(struct sk_buff, tstamp):
+ end = offsetofend(struct sk_buff, tstamp);
+ break;
+ case offsetof(struct sk_buff, priority):
+ end = offsetofend(struct sk_buff, priority);
+ break;
+ case offsetof(struct sk_buff, mark):
+ end = offsetofend(struct sk_buff, mark);
+ break;
+ case offsetof(struct sk_buff, queue_mapping):
+ end = offsetofend(struct sk_buff, queue_mapping);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, tc_classid);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+ offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+ data[QDISC_CB_PRIV_LEN - 1]):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+ break;
+ case offsetof(struct sk_buff, tc_index):
+ end = offsetofend(struct sk_buff, tc_index);
+ break;
+ default:
+ bpf_log(log, "no write support to sk_buff at off %d\n", off);
+ return -EACCES;
+ }
+
+ if (off + size > end) {
+ bpf_log(log,
+ "write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+ off, size, end);
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+ .get_func_proto = bpf_qdisc_get_func_proto,
+ .is_valid_access = bpf_qdisc_is_valid_access,
+ .btf_struct_access = bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct Qdisc_ops *uqdisc_ops;
+ struct Qdisc_ops *qdisc_ops;
+ u32 moff;
+
+ uqdisc_ops = (const struct Qdisc_ops *)udata;
+ qdisc_ops = (struct Qdisc_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct Qdisc_ops, cl_ops):
+ if (uqdisc_ops->cl_ops)
+ return -EINVAL;
+
+ qdisc_ops->cl_ops = &sch_bpf_class_ops;
+ return 1;
+ case offsetof(struct Qdisc_ops, priv_size):
+ if (uqdisc_ops->priv_size)
+ return -EINVAL;
+ qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
+ return 1;
+ case offsetof(struct Qdisc_ops, init):
+ qdisc_ops->init = bpf_qdisc_init_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, reset):
+ qdisc_ops->reset = bpf_qdisc_reset_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, destroy):
+ qdisc_ops->destroy = bpf_qdisc_destroy_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, peek):
+ if (!uqdisc_ops->peek)
+ qdisc_ops->peek = qdisc_peek_dequeued;
+ return 1;
+ case offsetof(struct Qdisc_ops, id):
+ if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+ sizeof(qdisc_ops->id)) <= 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool is_unsupported(u32 member_offset)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
+ if (member_offset == unsupported_ops[i])
+ return true;
+ }
+
+ return false;
+}
+
+static int bpf_qdisc_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
+ return -ENOTSUPP;
+ return 0;
+}
+
+static int bpf_qdisc_validate(void *kdata)
+{
+ return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata)
+{
+ return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata)
+{
+ return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref_acquired, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__attach(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change_tx_queue_len(struct Qdisc *sch, unsigned int new_len)
+{
+ return 0;
+}
+
+static void Qdisc_ops__change_real_num_tx(struct Qdisc *sch, unsigned int new_real_tx)
+{
+}
+
+static int Qdisc_ops__dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ return 0;
+}
+
+static int Qdisc_ops__dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ return 0;
+}
+
+static void Qdisc_ops__ingress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static void Qdisc_ops__egress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static u32 Qdisc_ops__ingress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static u32 Qdisc_ops__egress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+ .enqueue = Qdisc_ops__enqueue,
+ .dequeue = Qdisc_ops__dequeue,
+ .peek = Qdisc_ops__peek,
+ .init = Qdisc_ops__init,
+ .reset = Qdisc_ops__reset,
+ .destroy = Qdisc_ops__destroy,
+ .change = Qdisc_ops__change,
+ .attach = Qdisc_ops__attach,
+ .change_tx_queue_len = Qdisc_ops__change_tx_queue_len,
+ .change_real_num_tx = Qdisc_ops__change_real_num_tx,
+ .dump = Qdisc_ops__dump,
+ .dump_stats = Qdisc_ops__dump_stats,
+ .ingress_block_set = Qdisc_ops__ingress_block_set,
+ .egress_block_set = Qdisc_ops__egress_block_set,
+ .ingress_block_get = Qdisc_ops__ingress_block_get,
+ .egress_block_get = Qdisc_ops__egress_block_get,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+ .verifier_ops = &bpf_qdisc_verifier_ops,
+ .reg = bpf_qdisc_reg,
+ .unreg = bpf_qdisc_unreg,
+ .check_member = bpf_qdisc_check_member,
+ .init_member = bpf_qdisc_init_member,
+ .init = bpf_qdisc_init,
+ .validate = bpf_qdisc_validate,
+ .name = "Qdisc_ops",
+ .cfi_stubs = &__bpf_ops_qdisc_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+ return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 65e05b0c98e4..3b5ada5830cd 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
#include <linux/hrtimer.h>
#include <linux/slab.h>
#include <linux/hashtable.h>
+#include <linux/bpf.h>
#include <net/net_namespace.h>
#include <net/sock.h>
@@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
read_lock(&qdisc_mod_lock);
for (q = qdisc_base; q; q = q->next) {
if (nla_strcmp(kind, q->id) == 0) {
- if (!try_module_get(q->owner))
+ if (!bpf_try_module_get(q, q->owner))
q = NULL;
break;
}
@@ -1282,7 +1283,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
/* We will try again qdisc_lookup_ops,
* so don't keep a reference.
*/
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err = -EAGAIN;
goto err_out;
}
@@ -1392,7 +1393,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
netdev_put(dev, &sch->dev_tracker);
qdisc_free(sch);
err_out2:
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err_out:
*errp = err;
return NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff5336493777..f4343653db0f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
#include <linux/if_vlan.h>
#include <linux/skb_array.h>
#include <linux/if_macvlan.h>
+#include <linux/bpf.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>
#include <net/dst.h>
@@ -1067,7 +1068,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
if (ops->destroy)
ops->destroy(qdisc);
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
netdev_put(dev, &qdisc->dev_tracker);
trace_qdisc_destroy(qdisc);
--
2.20.1
* [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (12 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 13/20] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-22 23:55 ` Martin KaFai Lau
2024-05-10 19:24 ` [RFC PATCH v8 15/20] bpf: net_sched: Allow more optional methods in Qdisc_ops Amery Hung
` (5 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch adds kfuncs for working on skbs and manipulating child
classes/qdiscs.
Both bpf_qdisc_skb_drop() and bpf_skb_release() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue, where a to_free skb list is available from the kernel to
defer the release; bpf_skb_release() should be used everywhere else. The
latter is also used in bpf_obj_free_fields() when cleaning up skbs in maps
and collections.
For bpf_qdisc_enqueue() and bpf_qdisc_dequeue(), the kfuncs that pass an
skb between the current qdisc and a child qdisc, a classid is used to
refer to a specific child qdisc instead of a "struct Qdisc *" so that it
is impossible to recursively enqueue or dequeue an skb to the qdisc
itself. More specifically, while we could make bpf_qdisc_find_class()
return a pointer to a child qdisc and use it in the enqueue or dequeue
kfuncs instead of a classid, it would be hard to make sure the pointer
does not point to the current qdisc, which would cause indefinite
recursive calls.
bpf_qdisc_create_child() is introduced to make deployment easier and
more robust. It can be called in .init to populate the class hierarchy
the scheduling algorithm expects. This saves extra tc calls and prevents
user errors when creating classes. An example can be found in the bpf prio
qdisc in the selftests.
bpf_skb_set_dev() is temporarily added to restore skb->dev after removing
an skb from a collection, since we cannot rely on the user to always
call it after every removal. This will be addressed in the next revision.
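Putting the new kfuncs together, a classful qdisc's enqueue and dequeue
can be sketched roughly as below (a simplified, hypothetical variant of
the bpf prio selftest; minor numbers 1 and 2 are assumed to have been
created with bpf_qdisc_create_child() in .init):

  /* same as TC_H_MAKE() from uapi/linux/pkt_sched.h */
  #define TC_H_MAKE(maj, min) (((maj) & 0xFFFF0000U) | ((min) & 0x0000FFFFU))

  SEC("struct_ops/bpf_prio_enqueue")
  int BPF_PROG(bpf_prio_enqueue, struct sk_buff *skb, struct Qdisc *sch,
               struct bpf_sk_buff_ptr *to_free)
  {
          /* choose a child class by classid, never by struct Qdisc * */
          u32 classid = TC_H_MAKE(sch->handle,
                                  (skb->priority & TC_PRIO_MAX) ? 2 : 1);

          if (!bpf_qdisc_find_class(sch, classid)) {
                  bpf_qdisc_skb_drop(skb, to_free); /* releases the skb */
                  return NET_XMIT_DROP;
          }

          return bpf_qdisc_enqueue(skb, sch, classid, to_free);
  }

  SEC("struct_ops/bpf_prio_dequeue")
  struct sk_buff *BPF_PROG(bpf_prio_dequeue, struct Qdisc *sch)
  {
          struct sk_buff *skb;

          skb = bpf_qdisc_dequeue(sch, TC_H_MAKE(sch->handle, 1));
          if (skb)
                  return skb;

          return bpf_qdisc_dequeue(sch, TC_H_MAKE(sch->handle, 2));
  }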
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
net/sched/bpf_qdisc.c | 239 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 238 insertions(+), 1 deletion(-)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 53e9b0f1fbd8..2a40452c2c9a 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -358,6 +358,229 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
return 0;
}
+__bpf_kfunc_start_defs();
+
+/* bpf_skb_set_dev - A temporary kfunc to restore skb->dev after removing an
+ * skb from collections.
+ * @skb: The skb whose dev will be restored.
+ * @sch: The qdisc the skb belongs to.
+ */
+__bpf_kfunc void bpf_skb_set_dev(struct sk_buff *skb, struct Qdisc *sch)
+{
+ skb->dev = qdisc_dev(sch);
+}
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+ return skb_get_hash(skb);
+}
+
+/* bpf_skb_release - Release an skb reference acquired on an skb immediately.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
+{
+ consume_skb(skb);
+}
+
+/* bpf_qdisc_skb_drop - Add an skb to be dropped later to a list.
+ * @skb: The skb on which a reference is being released and dropped.
+ * @to_free_list: The list of skbs to be dropped.
+ */
+__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
+ struct bpf_sk_buff_ptr *to_free_list)
+{
+ __qdisc_drop(skb, (struct sk_buff **)to_free_list);
+}
+
+/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
+ * @sch: The qdisc to be scheduled.
+ * @expire: The expiry time of the timer.
+ * @delta_ns: The slack range of the timer.
+ */
+__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
+}
+
+/* bpf_skb_tc_classify - Classify an skb using an existing filter referred
+ * to by the specified handle on the net device of index ifindex.
+ * @skb: The skb to be classified.
+ * @handle: The handle of the filter to be referenced.
+ * @ifindex: The ifindex of the net device where the filter is attached.
+ *
+ * Returns a 64-bit integer containing the tc action verdict and the classid,
+ * created as classid << 32 | action.
+ */
+__bpf_kfunc u64 bpf_skb_tc_classify(struct sk_buff *skb, int ifindex, u32 handle)
+{
+ struct net *net = dev_net(skb->dev);
+ const struct Qdisc_class_ops *cops;
+ struct tcf_result res = {};
+ struct tcf_block *block;
+ struct tcf_chain *chain;
+ struct net_device *dev;
+ int result = TC_ACT_OK;
+ unsigned long cl = 0;
+ struct Qdisc *q;
+
+ rcu_read_lock();
+ dev = dev_get_by_index_rcu(net, ifindex);
+ if (!dev)
+ goto out;
+ q = qdisc_lookup_rcu(dev, handle);
+ if (!q)
+ goto out;
+
+ cops = q->ops->cl_ops;
+ if (!cops)
+ goto out;
+ if (!cops->tcf_block)
+ goto out;
+ if (TC_H_MIN(handle)) {
+ cl = cops->find(q, handle);
+ if (cl == 0)
+ goto out;
+ }
+ block = cops->tcf_block(q, cl, NULL);
+ if (!block)
+ goto out;
+
+ for (chain = tcf_get_next_chain(block, NULL);
+ chain;
+ chain = tcf_get_next_chain(block, chain)) {
+ struct tcf_proto *tp;
+
+ for (tp = tcf_get_next_proto(chain, NULL);
+ tp; tp = tcf_get_next_proto(chain, tp)) {
+
+ result = tcf_classify(skb, NULL, tp, &res, false);
+ if (result >= 0) {
+ switch (result) {
+ case TC_ACT_QUEUED:
+ case TC_ACT_STOLEN:
+ case TC_ACT_TRAP:
+ fallthrough;
+ case TC_ACT_SHOT:
+ rcu_read_unlock();
+ return result;
+ }
+ }
+ }
+ }
+out:
+ rcu_read_unlock();
+ return (res.class << 32 | result);
+}
+
+/* bpf_qdisc_create_child - Create a default child qdisc during init.
+ * A qdisc can use this kfunc to populate the desired class topology during
+ * initialization without relying on the user to do this correctly. A default
+ * pfifo will be added to the child class.
+ *
+ * @sch: The parent qdisc of the to-be-created child qdisc.
+ * @min: The minor number of the child qdisc.
+ * @extack: Netlink extended ACK report.
+ */
+__bpf_kfunc int bpf_qdisc_create_child(struct Qdisc *sch, u32 min,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+ struct sch_bpf_class *cl;
+ struct Qdisc *new_q;
+
+ cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+ if (!cl)
+ return -ENOMEM;
+
+ cl->common.classid = TC_H_MAKE(sch->handle, TC_H_MIN(min));
+
+ new_q = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+ TC_H_MAKE(sch->handle, min), extack);
+ if (!new_q)
+ return -ENOMEM;
+
+ cl->qdisc = new_q;
+
+ qdisc_class_hash_insert(&q->clhash, &cl->common);
+ qdisc_hash_add(new_q, true);
+ return 0;
+}
+
+/* bpf_qdisc_find_class - Check if a specific class exists in a qdisc.
+ * @sch: The qdisc the class belongs to.
+ * @classid: The classid of the class.
+ */
+__bpf_kfunc bool bpf_qdisc_find_class(struct Qdisc *sch, u32 classid)
+{
+ struct sch_bpf_class *cl = sch_bpf_find(sch, classid);
+
+ if (!cl || !cl->qdisc)
+ return false;
+
+ return true;
+}
+
+/* bpf_qdisc_enqueue - Enqueue an skb into a child qdisc.
+ * @skb: The skb to be enqueued into another qdisc.
+ * @sch: The qdisc the skb currently belongs to.
+ * @classid: The handle of the child qdisc where the skb will be enqueued.
+ * @to_free_list: The list of skbs where a to-be-dropped skb will be added to.
+ */
+__bpf_kfunc int bpf_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, u32 classid,
+ struct bpf_sk_buff_ptr *to_free_list)
+{
+ struct sch_bpf_class *cl = sch_bpf_find(sch, classid);
+
+ if (!cl || !cl->qdisc)
+ return qdisc_drop(skb, sch, (struct sk_buff **)to_free_list);
+
+ return qdisc_enqueue(skb, cl->qdisc, (struct sk_buff **)to_free_list);
+}
+
+/* bpf_qdisc_dequeue - Dequeue an skb from a child qdisc.
+ * @sch: The parent qdisc of the child qdisc.
+ * @classid: The handle of the child qdisc where we try to dequeue an skb.
+ */
+__bpf_kfunc struct sk_buff *bpf_qdisc_dequeue(struct Qdisc *sch, u32 classid)
+{
+ struct sch_bpf_class *cl = sch_bpf_find(sch, classid);
+
+ if (!cl || !cl->qdisc)
+ return NULL;
+
+ return cl->qdisc->dequeue(cl->qdisc);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_skb_set_dev)
+BTF_ID_FLAGS(func, bpf_skb_get_hash)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
+BTF_ID_FLAGS(func, bpf_skb_tc_classify)
+BTF_ID_FLAGS(func, bpf_qdisc_create_child)
+BTF_ID_FLAGS(func, bpf_qdisc_find_class)
+BTF_ID_FLAGS(func, bpf_qdisc_enqueue, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_dequeue, KF_ACQUIRE | KF_RET_NULL)
+BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_qdisc_kfunc_ids,
+};
+
+BTF_ID_LIST(skb_kfunc_dtor_ids)
+BTF_ID(struct, sk_buff)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+
static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
.get_func_proto = bpf_qdisc_get_func_proto,
.is_valid_access = bpf_qdisc_is_valid_access,
@@ -558,6 +781,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
static int __init bpf_qdisc_kfunc_init(void)
{
- return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+ int ret;
+ const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+ {
+ .btf_id = skb_kfunc_dtor_ids[0],
+ .kfunc_btf_id = skb_kfunc_dtor_ids[1]
+ },
+ };
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
+ ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+ ARRAY_SIZE(skb_kfunc_dtors),
+ THIS_MODULE);
+ ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+
+ return ret;
}
late_initcall(bpf_qdisc_kfunc_init);
--
2.20.1
* [RFC PATCH v8 15/20] bpf: net_sched: Allow more optional methods in Qdisc_ops
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (13 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 16/20] libbpf: Support creating and destroying qdisc Amery Hung
` (4 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
So far, init, reset, and destroy are implemented by the bpf qdisc infra
as fixed methods that set up and tear down the watchdog and the class
hash table as needed. This patch allows users to supply these three ops
themselves to perform additional work alongside the predefined methods.
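As a hypothetical sketch (not taken verbatim from the selftests), a
classful bpf qdisc can now provide its own .init that runs after the
predefined setup and builds its class topology with
bpf_qdisc_create_child() from the previous patch:

  SEC("struct_ops/bpf_prio_init")
  int BPF_PROG(bpf_prio_init, struct Qdisc *sch, struct nlattr *opt,
               struct netlink_ext_ack *extack)
  {
          int err;

          /* the watchdog and the class hash table have already been set
           * up by bpf_qdisc_init_pre_op() before this program runs
           */
          err = bpf_qdisc_create_child(sch, 1, extack);
          if (err)
                  return err;

          return bpf_qdisc_create_child(sch, 2, extack);
  }

  SEC(".struct_ops")
  struct Qdisc_ops prio = {
          .enqueue = (void *)bpf_prio_enqueue,
          .dequeue = (void *)bpf_prio_dequeue,
          .init    = (void *)bpf_prio_init,
          .id      = "bpf_prio",
  };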
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/net/sch_generic.h | 8 ++++++++
net/sched/bpf_qdisc.c | 22 +++++-----------------
net/sched/sch_api.c | 12 +++++++++++-
net/sched/sch_generic.c | 8 ++++++++
4 files changed, 32 insertions(+), 18 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 76db6be16083..71e54cfa0d41 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1356,4 +1356,12 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
msleep(1);
}
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+extern const struct Qdisc_class_ops sch_bpf_class_ops;
+
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
+void bpf_qdisc_reset_post_op(struct Qdisc *sch);
+#endif
+
#endif
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 2a40452c2c9a..cb9088d0571a 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -9,9 +9,6 @@
static struct bpf_struct_ops bpf_Qdisc_ops;
static u32 unsupported_ops[] = {
- offsetof(struct Qdisc_ops, init),
- offsetof(struct Qdisc_ops, reset),
- offsetof(struct Qdisc_ops, destroy),
offsetof(struct Qdisc_ops, change),
offsetof(struct Qdisc_ops, attach),
offsetof(struct Qdisc_ops, change_real_num_tx),
@@ -191,8 +188,8 @@ static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
}
}
-static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
- struct netlink_ext_ack *extack)
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
{
struct bpf_sched_data *q = qdisc_priv(sch);
int err;
@@ -210,7 +207,7 @@ static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
return 0;
}
-static void bpf_qdisc_reset_op(struct Qdisc *sch)
+void bpf_qdisc_reset_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
struct sch_bpf_class *cl;
@@ -233,7 +230,7 @@ static void bpf_qdisc_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
kfree(cl);
}
-static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
struct sch_bpf_class *cl;
@@ -255,7 +252,7 @@ static void bpf_qdisc_destroy_op(struct Qdisc *sch)
qdisc_class_hash_destroy(&q->clhash);
}
-static const struct Qdisc_class_ops sch_bpf_class_ops = {
+const struct Qdisc_class_ops sch_bpf_class_ops = {
.graft = sch_bpf_graft,
.leaf = sch_bpf_leaf,
.find = sch_bpf_search,
@@ -611,15 +608,6 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
return -EINVAL;
qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
return 1;
- case offsetof(struct Qdisc_ops, init):
- qdisc_ops->init = bpf_qdisc_init_op;
- return 1;
- case offsetof(struct Qdisc_ops, reset):
- qdisc_ops->reset = bpf_qdisc_reset_op;
- return 1;
- case offsetof(struct Qdisc_ops, destroy):
- qdisc_ops->destroy = bpf_qdisc_destroy_op;
- return 1;
case offsetof(struct Qdisc_ops, peek):
if (!uqdisc_ops->peek)
qdisc_ops->peek = qdisc_peek_dequeued;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 3b5ada5830cd..a81ceee55755 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1249,7 +1249,6 @@ static int qdisc_block_indexes_set(struct Qdisc *sch, struct nlattr **tca,
Parameters are passed via opt.
*/
-
static struct Qdisc *qdisc_create(struct net_device *dev,
struct netdev_queue *dev_queue,
u32 parent, u32 handle,
@@ -1352,6 +1351,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
rcu_assign_pointer(sch->stab, stab);
}
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (ops->cl_ops == &sch_bpf_class_ops) {
+ err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
+ if (err != 0)
+ goto err_out4;
+ }
+#endif
if (ops->init) {
err = ops->init(sch, tca[TCA_OPTIONS], extack);
if (err != 0)
@@ -1388,6 +1394,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
*/
if (ops->destroy)
ops->destroy(sch);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (ops->cl_ops == &sch_bpf_class_ops)
+ bpf_qdisc_destroy_post_op(sch);
+#endif
qdisc_put_stab(rtnl_dereference(sch->stab));
err_out3:
netdev_put(dev, &sch->dev_tracker);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f4343653db0f..385ae2974f00 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1024,6 +1024,10 @@ void qdisc_reset(struct Qdisc *qdisc)
if (ops->reset)
ops->reset(qdisc);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (ops->cl_ops == &sch_bpf_class_ops)
+ bpf_qdisc_reset_post_op(qdisc);
+#endif
__skb_queue_purge(&qdisc->gso_skb);
__skb_queue_purge(&qdisc->skb_bad_txq);
@@ -1067,6 +1071,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
if (ops->destroy)
ops->destroy(qdisc);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (ops->cl_ops == &sch_bpf_class_ops)
+ bpf_qdisc_destroy_post_op(qdisc);
+#endif
bpf_module_put(ops, ops->owner);
netdev_put(dev, &qdisc->dev_tracker);
--
2.20.1
* [RFC PATCH v8 16/20] libbpf: Support creating and destroying qdisc
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (14 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 15/20] bpf: net_sched: Allow more optional methods in Qdisc_ops Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test Amery Hung
` (3 subsequent siblings)
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This patch extends libbpf to support adding and removing qdiscs other
than the clsact qdisc. In bpf_tc_hook_create() and bpf_tc_hook_destroy(),
a user can set "attach_point" to BPF_TC_QDISC and then specify the qdisc
kind with "qdisc".
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
tools/lib/bpf/libbpf.h | 5 ++++-
tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index f88ab50c0229..2da4bc6f0cc1 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1234,6 +1234,7 @@ enum bpf_tc_attach_point {
BPF_TC_INGRESS = 1 << 0,
BPF_TC_EGRESS = 1 << 1,
BPF_TC_CUSTOM = 1 << 2,
+ BPF_TC_QDISC = 1 << 3,
};
#define BPF_TC_PARENT(a, b) \
@@ -1248,9 +1249,11 @@ struct bpf_tc_hook {
int ifindex;
enum bpf_tc_attach_point attach_point;
__u32 parent;
+ __u32 handle;
+ char *qdisc;
size_t :0;
};
-#define bpf_tc_hook__last_field parent
+#define bpf_tc_hook__last_field qdisc
struct bpf_tc_opts {
size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 68a2def17175..72db8c0add21 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
}
-typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
+typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, struct bpf_tc_hook *hook);
-static int clsact_config(struct libbpf_nla_req *req)
+static int clsact_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
{
req->tc.tcm_parent = TC_H_CLSACT;
req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
@@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
}
+static int qdisc_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
+{
+ char *qdisc = OPTS_GET(hook, qdisc, NULL);
+
+ req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
+ req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
+
+ return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
+}
+
static int attach_point_to_config(struct bpf_tc_hook *hook,
qdisc_config_t *config)
{
@@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
return 0;
case BPF_TC_CUSTOM:
return -EOPNOTSUPP;
+ case BPF_TC_QDISC:
+ *config = &qdisc_config;
+ return 0;
default:
return -EINVAL;
}
@@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
req.tc.tcm_family = AF_UNSPEC;
req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
- ret = config(&req);
+ ret = config(&req, hook);
if (ret < 0)
return ret;
@@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
case BPF_TC_INGRESS:
case BPF_TC_EGRESS:
return libbpf_err(__bpf_tc_detach(hook, NULL, true));
+ case BPF_TC_QDISC:
case BPF_TC_INGRESS | BPF_TC_EGRESS:
return libbpf_err(tc_qdisc_delete(hook));
case BPF_TC_CUSTOM:
--
2.20.1
* [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (15 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 16/20] libbpf: Support creating and destroying qdisc Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-21 3:15 ` Stanislav Fomichev
2024-05-10 19:24 ` [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest Amery Hung
` (2 subsequent siblings)
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This selftest shows a bare minimum fifo qdisc, which simply enqueues skbs
into the back of a bpf list and dequeues from the front of the list.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 161 ++++++++++++++++++
.../selftests/bpf/progs/bpf_qdisc_common.h | 23 +++
.../selftests/bpf/progs/bpf_qdisc_fifo.c | 83 +++++++++
3 files changed, 267 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
new file mode 100644
index 000000000000..295d0216e70f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -0,0 +1,161 @@
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <test_progs.h>
+
+#include "network_helpers.h"
+#include "bpf_qdisc_fifo.skel.h"
+
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
+#define LO_IFINDEX 1
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static int stop;
+
+static void *server(void *arg)
+{
+ int lfd = (int)(long)arg, err = 0, fd;
+ ssize_t nr_sent = 0, bytes = 0;
+ char batch[1500];
+
+ fd = accept(lfd, NULL, NULL);
+ while (fd == -1) {
+ if (errno == EINTR)
+ continue;
+ err = -errno;
+ goto done;
+ }
+
+ if (settimeo(fd, 0)) {
+ err = -errno;
+ goto done;
+ }
+
+ while (bytes < total_bytes && !READ_ONCE(stop)) {
+ nr_sent = send(fd, &batch,
+ MIN(total_bytes - bytes, sizeof(batch)), 0);
+ if (nr_sent == -1 && errno == EINTR)
+ continue;
+ if (nr_sent == -1) {
+ err = -errno;
+ break;
+ }
+ bytes += nr_sent;
+ }
+
+ ASSERT_EQ(bytes, total_bytes, "send");
+
+done:
+ if (fd >= 0)
+ close(fd);
+ if (err) {
+ WRITE_ONCE(stop, 1);
+ return ERR_PTR(err);
+ }
+ return NULL;
+}
+
+static void do_test(char *qdisc)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+ .attach_point = BPF_TC_QDISC,
+ .parent = TC_H_ROOT,
+ .handle = 0x8000000,
+ .qdisc = qdisc);
+ struct sockaddr_in6 sa6 = {};
+ ssize_t nr_recv = 0, bytes = 0;
+ int lfd = -1, fd = -1;
+ pthread_t srv_thread;
+ socklen_t addrlen = sizeof(sa6);
+ void *thread_ret;
+ char batch[1500];
+ int err;
+
+ WRITE_ONCE(stop, 0);
+
+ err = bpf_tc_hook_create(&hook);
+ if (!ASSERT_OK(err, "attach qdisc"))
+ return;
+
+ lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+ if (!ASSERT_NEQ(lfd, -1, "socket")) {
+ bpf_tc_hook_destroy(&hook);
+ return;
+ }
+
+ fd = socket(AF_INET6, SOCK_STREAM, 0);
+ if (!ASSERT_NEQ(fd, -1, "socket")) {
+ bpf_tc_hook_destroy(&hook);
+ close(lfd);
+ return;
+ }
+
+ if (settimeo(lfd, 0) || settimeo(fd, 0))
+ goto done;
+
+ err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+ if (!ASSERT_NEQ(err, -1, "getsockname"))
+ goto done;
+
+ /* connect to server */
+ err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+ if (!ASSERT_NEQ(err, -1, "connect"))
+ goto done;
+
+ err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto done;
+
+ /* recv total_bytes */
+ while (bytes < total_bytes && !READ_ONCE(stop)) {
+ nr_recv = recv(fd, &batch,
+ MIN(total_bytes - bytes, sizeof(batch)), 0);
+ if (nr_recv == -1 && errno == EINTR)
+ continue;
+ if (nr_recv == -1)
+ break;
+ bytes += nr_recv;
+ }
+
+ ASSERT_EQ(bytes, total_bytes, "recv");
+
+ WRITE_ONCE(stop, 1);
+ pthread_join(srv_thread, &thread_ret);
+ ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
+
+done:
+ close(lfd);
+ close(fd);
+
+ bpf_tc_hook_destroy(&hook);
+ return;
+}
+
+static void test_fifo(void)
+{
+ struct bpf_qdisc_fifo *fifo_skel;
+ struct bpf_link *link;
+
+ fifo_skel = bpf_qdisc_fifo__open_and_load();
+ if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fifo__destroy(fifo_skel);
+ return;
+ }
+
+ do_test("bpf_fifo");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+void test_bpf_qdisc(void)
+{
+ if (test__start_subtest("fifo"))
+ test_fifo();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
new file mode 100644
index 000000000000..96ab357de28e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -0,0 +1,23 @@
+#ifndef _BPF_QDISC_COMMON_H
+#define _BPF_QDISC_COMMON_H
+
+#define NET_XMIT_SUCCESS 0x00
+#define NET_XMIT_DROP 0x01 /* skb dropped */
+#define NET_XMIT_CN 0x02 /* congestion notification */
+
+#define TC_PRIO_CONTROL 7
+#define TC_PRIO_MAX 15
+
+void bpf_skb_set_dev(struct sk_buff *skb, struct Qdisc *sch) __ksym;
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_skb_release(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+bool bpf_qdisc_find_class(struct Qdisc *sch, u32 classid) __ksym;
+int bpf_qdisc_create_child(struct Qdisc *sch, u32 min,
+ struct netlink_ext_ack *extack) __ksym;
+int bpf_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, u32 classid,
+ struct bpf_sk_buff_ptr *to_free_list) __ksym;
+struct sk_buff *bpf_qdisc_dequeue(struct Qdisc *sch, u32 classid) __ksym;
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
new file mode 100644
index 000000000000..433fd9c3639c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
@@ -0,0 +1,83 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(B) struct bpf_spin_lock q_fifo_lock;
+private(B) struct bpf_list_head q_fifo __contains_kptr(sk_buff, bpf_list);
+
+unsigned int q_limit = 1000;
+unsigned int q_qlen = 0;
+
+SEC("struct_ops/bpf_fifo_enqueue")
+int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ q_qlen++;
+ if (q_qlen > q_limit) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+ }
+
+ bpf_spin_lock(&q_fifo_lock);
+ bpf_list_excl_push_back(&q_fifo, &skb->bpf_list);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ return NET_XMIT_SUCCESS;
+}
+
+SEC("struct_ops/bpf_fifo_dequeue")
+struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
+{
+ struct sk_buff *skb;
+ struct bpf_list_excl_node *node;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_excl_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+ if (!node)
+ return NULL;
+
+ skb = container_of(node, struct sk_buff, bpf_list);
+ bpf_skb_set_dev(skb, sch);
+ q_qlen--;
+
+ return skb;
+}
+
+static int reset_fifo(u32 index, void *ctx)
+{
+ struct bpf_list_excl_node *node;
+ struct sk_buff *skb;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_excl_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ if (!node) {
+ return 1;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_list);
+ bpf_skb_release(skb);
+ return 0;
+}
+
+SEC("struct_ops/bpf_fifo_reset")
+void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
+{
+ bpf_loop(q_qlen, reset_fifo, NULL, 0);
+ q_qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fifo = {
+ .enqueue = (void *)bpf_fifo_enqueue,
+ .dequeue = (void *)bpf_fifo_dequeue,
+ .reset = (void *)bpf_fifo_reset,
+ .id = "bpf_fifo",
+};
+
--
2.20.1
* [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (16 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-24 6:24 ` Martin KaFai Lau
2024-05-10 19:24 ` [RFC PATCH v8 19/20] selftests: Add a bpf netem " Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 20/20] selftests: Add a prio bpf qdisc Amery Hung
19 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This test implements a more sophisticated qdisc using bpf. The bpf fair-
queueing (fq) qdisc gives each flow an equal chance to transmit data. It
also respects the skb's timestamp for rate limiting. The implementation
does not prevent hash collisions between flows, nor does it recycle flows.
The bpf fq also demonstrates communicating packet drop information to a
bpf clsact EDT rate limiter through bpf maps. With this information, the
rate limiter can compensate for the delay caused by packet drops in the
qdisc to maintain the throughput.
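The rate limiter itself is not part of this series, but a hypothetical
sketch of its side of the protocol is shown below. It assumes the limiter
shares the pinned rate_map and comp_map defined in the fq program, and
keeps its own per-destination departure time in next_tstamp_map (an
assumed map, for illustration only); the accumulated compensation is
consumed to pull the next departure time forward:

  /* hypothetical helper on the clsact EDT pacer side */
  static __u64 edt_next_tstamp(__u32 daddr, __u32 pkt_len)
  {
          __u64 *rate, *next, *comp_ns, now, tstamp, comp;

          now = bpf_ktime_get_ns();
          rate = bpf_map_lookup_elem(&rate_map, &daddr);
          next = bpf_map_lookup_elem(&next_tstamp_map, &daddr);
          comp_ns = bpf_map_lookup_elem(&comp_map, &daddr);
          if (!rate || !next || !comp_ns)
                  return now;

          /* consume the delay the qdisc accumulated for dropped packets */
          comp = __sync_lock_test_and_set(comp_ns, 0);
          tstamp = *next > now ? *next : now;
          tstamp = tstamp - now > comp ? tstamp - comp : now;
          *next = tstamp + (__u64)pkt_len * NSEC_PER_SEC / *rate;

          return tstamp;
  }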
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 24 +
.../selftests/bpf/progs/bpf_qdisc_fq.c | 660 ++++++++++++++++++
2 files changed, 684 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 295d0216e70f..394bf5a4adae 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -4,6 +4,7 @@
#include "network_helpers.h"
#include "bpf_qdisc_fifo.skel.h"
+#include "bpf_qdisc_fq.skel.h"
#ifndef ENOTSUPP
#define ENOTSUPP 524
@@ -154,8 +155,31 @@ static void test_fifo(void)
bpf_qdisc_fifo__destroy(fifo_skel);
}
+static void test_fq(void)
+{
+ struct bpf_qdisc_fq *fq_skel;
+ struct bpf_link *link;
+
+ fq_skel = bpf_qdisc_fq__open_and_load();
+ if (!ASSERT_OK_PTR(fq_skel, "bpf_qdisc_fq__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fq_skel->maps.fq);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fq__destroy(fq_skel);
+ return;
+ }
+
+ do_test("bpf_fq");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fq__destroy(fq_skel);
+}
+
void test_bpf_qdisc(void)
{
if (test__start_subtest("fifo"))
test_fifo();
+ if (test__start_subtest("fq"))
+ test_fq();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
new file mode 100644
index 000000000000..5118237da9e4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
@@ -0,0 +1,660 @@
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define NSEC_PER_USEC 1000L
+#define NSEC_PER_SEC 1000000000L
+#define PSCHED_MTU (64 * 1024 + 14)
+
+#define NUM_QUEUE_LOG 10
+#define NUM_QUEUE (1 << NUM_QUEUE_LOG)
+#define PRIO_QUEUE (NUM_QUEUE + 1)
+#define COMP_DROP_PKT_DELAY 1
+#define THROTTLED 0xffffffffffffffff
+
+/* fq configuration */
+__u64 q_flow_refill_delay = 40 * 10000; //40us
+__u64 q_horizon = 10ULL * NSEC_PER_SEC;
+__u32 q_initial_quantum = 10 * PSCHED_MTU;
+__u32 q_quantum = 2 * PSCHED_MTU;
+__u32 q_orphan_mask = 1023;
+__u32 q_flow_plimit = 100;
+__u32 q_plimit = 10000;
+__u32 q_timer_slack = 10 * NSEC_PER_USEC;
+bool q_horizon_drop = true;
+
+bool q_compensate_tstamp;
+bool q_random_drop;
+
+unsigned long time_next_delayed_flow = ~0ULL;
+unsigned long unthrottle_latency_ns = 0ULL;
+unsigned long ktime_cache = 0;
+unsigned long dequeue_now;
+unsigned int fq_qlen = 0;
+
+struct fq_flow_node {
+ u32 hash;
+ int credit;
+ u32 qlen;
+ u32 socket_hash;
+ u64 age;
+ u64 time_next_packet;
+ struct bpf_list_node list_node;
+ struct bpf_rb_node rb_node;
+ struct bpf_rb_root queue __contains_kptr(sk_buff, bpf_rbnode);
+ struct bpf_spin_lock lock;
+ struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+ bool dequeued;
+ u64 expire;
+};
+
+struct fq_stashed_flow {
+ struct fq_flow_node __kptr *flow;
+};
+
+struct stashed_skb {
+ struct sk_buff __kptr *skb;
+};
+
+/* [NUM_QUEUE] for TC_PRIO_CONTROL
+ * [0, NUM_QUEUE - 1] for other flows
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct fq_stashed_flow);
+ __uint(max_entries, NUM_QUEUE + 1);
+} fq_stashed_flows SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u32);
+ __type(value, __u64);
+ __uint(pinning, LIBBPF_PIN_BY_NAME);
+ __uint(max_entries, 16);
+} rate_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u32);
+ __type(value, __u64);
+ __uint(pinning, LIBBPF_PIN_BY_NAME);
+ __uint(max_entries, 16);
+} comp_map SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+private(D) struct bpf_spin_lock fq_stashed_skb_lock;
+private(D) struct bpf_list_head fq_stashed_skb __contains_kptr(sk_buff, bpf_list);
+
+static __always_inline bool bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+ void *ret;
+
+ ret = bpf_kptr_xchg(map_val, ptr);
+ if (ret) { //unexpected
+ bpf_obj_drop(ret);
+ return false;
+ }
+ return true;
+}
+
+static __always_inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
+{
+ return (struct qdisc_skb_cb *)skb->cb;
+}
+
+static __always_inline int hash64(u64 val, int bits)
+{
+ return val * 0x61C8864680B583EBull >> (64 - bits);
+}
+
+static bool skb_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct sk_buff *skb_a;
+ struct sk_buff *skb_b;
+
+ skb_a = container_of(a, struct sk_buff, bpf_rbnode);
+ skb_b = container_of(b, struct sk_buff, bpf_rbnode);
+
+ return skb_a->tstamp < skb_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct fq_flow_node *flow_a;
+ struct fq_flow_node *flow_b;
+
+ flow_a = container_of(a, struct fq_flow_node, rb_node);
+ flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+ return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static __always_inline void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_front(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+}
+
+static __always_inline void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_back(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+}
+
+static __always_inline bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+ struct bpf_list_node *node;
+
+ bpf_spin_lock(lock);
+ node = bpf_list_pop_front(head);
+ if (node) {
+ bpf_list_push_front(head, node);
+ bpf_spin_unlock(lock);
+ return false;
+ }
+ bpf_spin_unlock(lock);
+
+ return true;
+}
+
+static __always_inline void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+ flow->age = bpf_jiffies64();
+ bpf_obj_drop(flow);
+}
+
+static __always_inline bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+ return flow->age != 0 && flow->age != THROTTLED;
+}
+
+static __always_inline bool fq_flow_is_throttled(struct fq_flow_node *flow)
+{
+ return flow->age == THROTTLED;
+}
+
+static __always_inline bool sk_listener(struct sock *sk)
+{
+ return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static __always_inline int
+fq_classify(struct sk_buff *skb, u32 *hash, struct fq_stashed_flow **sflow,
+ bool *connected, u32 *sk_hash)
+{
+ struct fq_flow_node *flow;
+ struct sock *sk = skb->sk;
+
+ *connected = false;
+
+ if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+ *hash = PRIO_QUEUE;
+ } else {
+ if (!sk || sk_listener(sk)) {
+ *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+ *sk_hash = (*sk_hash << 1 | 1);
+ } else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+ *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+ *sk_hash = (*sk_hash << 1 | 1);
+ } else {
+ *sk_hash = sk->__sk_common.skc_hash;
+ *connected = true;
+ }
+ *hash = hash64(*sk_hash, NUM_QUEUE_LOG);
+ }
+
+ *sflow = bpf_map_lookup_elem(&fq_stashed_flows, hash);
+ if (!*sflow)
+ return -1; //unexpected
+
+ if ((*sflow)->flow)
+ return 0;
+
+ flow = bpf_obj_new(typeof(*flow));
+ if (!flow)
+ return -1;
+
+ flow->hash = *hash;
+ flow->credit = q_initial_quantum;
+ flow->qlen = 0;
+ flow->age = 1UL;
+ flow->time_next_packet = 0;
+
+ bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+
+ return 0;
+}
+
+static __always_inline bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+ return (s64)skb->tstamp > (s64)(ktime_cache + q_horizon);
+}
+
+SEC("struct_ops/bpf_fq_enqueue")
+int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);
+ u64 time_to_send, jiffies, delay_ns, *comp_ns, *rate;
+ struct fq_flow_node *flow = NULL, *flow_copy;
+ struct fq_stashed_flow *sflow;
+ u32 hash, daddr, sk_hash;
+ bool connected;
+
+ if (q_random_drop & (bpf_get_prandom_u32() > ~0U * 0.90))
+ goto drop;
+
+ if (fq_qlen >= q_plimit)
+ goto drop;
+
+ if (!skb->tstamp) {
+ time_to_send = ktime_cache = bpf_ktime_get_ns();
+ } else {
+ if (fq_packet_beyond_horizon(skb)) {
+ ktime_cache = bpf_ktime_get_ns();
+ if (fq_packet_beyond_horizon(skb)) {
+ if (q_horizon_drop)
+ goto drop;
+
+ skb->tstamp = ktime_cache + q_horizon;
+ }
+ }
+ time_to_send = skb->tstamp;
+ }
+
+ if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
+ goto drop;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ goto drop; //unexpected
+
+ if (hash != PRIO_QUEUE) {
+ if (connected && flow->socket_hash != sk_hash) {
+ flow->credit = q_initial_quantum;
+ flow->socket_hash = sk_hash;
+ if (fq_flow_is_throttled(flow)) {
+ /* mark the flow as undetached. The reference to the
+ * throttled flow in fq_delayed will be removed later.
+ */
+ flow_copy = bpf_refcount_acquire(flow);
+ flow_copy->age = 0;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
+ }
+ flow->time_next_packet = 0ULL;
+ }
+
+ if (flow->qlen >= q_flow_plimit) {
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+ goto drop;
+ }
+
+ if (fq_flow_is_detached(flow)) {
+ if (connected)
+ flow->socket_hash = sk_hash;
+
+ flow_copy = bpf_refcount_acquire(flow);
+
+ jiffies = bpf_jiffies64();
+ if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
+ if (flow_copy->credit < q_quantum)
+ flow_copy->credit = q_quantum;
+ }
+ flow_copy->age = 0;
+ fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
+ }
+ }
+
+ skb->tstamp = time_to_send;
+
+ bpf_spin_lock(&flow->lock);
+ bpf_rbtree_excl_add(&flow->queue, &skb->bpf_rbnode, skb_tstamp_less);
+ bpf_spin_unlock(&flow->lock);
+
+ flow->qlen++;
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ fq_qlen++;
+ return NET_XMIT_SUCCESS;
+
+drop:
+ if (q_compensate_tstamp) {
+ bpf_probe_read_kernel(&daddr, sizeof(daddr), &iph->daddr);
+ rate = bpf_map_lookup_elem(&rate_map, &daddr);
+ comp_ns = bpf_map_lookup_elem(&comp_map, &daddr);
+ if (rate && comp_ns) {
+ delay_ns = (u64)qdisc_skb_cb(skb)->pkt_len * NSEC_PER_SEC / (*rate);
+ __sync_fetch_and_add(comp_ns, delay_ns);
+ }
+ }
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, bool *unset_all)
+{
+ struct bpf_rb_node *node = NULL;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_delayed_lock);
+
+ node = bpf_rbtree_first(&fq_delayed);
+ if (!node) {
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+ if (!*unset_all && flow->time_next_packet > dequeue_now) {
+ time_next_delayed_flow = flow->time_next_packet;
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+ bpf_spin_unlock(&fq_delayed_lock);
+
+ if (!node)
+ return 1; //unexpected
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+
+ /* the flow was recycled during enqueue() */
+ if (flow->age != THROTTLED) {
+ bpf_obj_drop(flow);
+ return 0;
+ }
+
+ flow->age = 0;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+
+ return 0;
+}
+
+static __always_inline void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+ flow->age = THROTTLED;
+
+ if (time_next_delayed_flow > flow->time_next_packet)
+ time_next_delayed_flow = flow->time_next_packet;
+
+ bpf_spin_lock(&fq_delayed_lock);
+ bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+ bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static __always_inline void fq_check_throttled(void)
+{
+ bool unset_all = false;
+ unsigned long sample;
+
+ if (time_next_delayed_flow > dequeue_now)
+ return;
+
+ sample = (unsigned long)(dequeue_now - time_next_delayed_flow);
+ unthrottle_latency_ns -= unthrottle_latency_ns >> 3;
+ unthrottle_latency_ns += sample >> 3;
+
+ time_next_delayed_flow = ~0ULL;
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+}
+
+static __always_inline void stash_skb(struct sk_buff *skb)
+{
+ bpf_spin_lock(&fq_stashed_skb_lock);
+ bpf_list_excl_push_back(&fq_stashed_skb, &skb->bpf_list);
+ bpf_spin_unlock(&fq_stashed_skb_lock);
+}
+
+static __always_inline struct sk_buff *get_stashed_skb()
+{
+ struct bpf_list_excl_node *node;
+ struct sk_buff *skb;
+
+ bpf_spin_lock(&fq_stashed_skb_lock);
+ node = bpf_list_excl_pop_front(&fq_stashed_skb);
+ bpf_spin_unlock(&fq_stashed_skb_lock);
+ if (!node)
+ return NULL;
+
+ skb = container_of(node, struct sk_buff, bpf_list);
+ return skb;
+}
+
+static int
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+ u64 time_next_packet, time_to_send;
+ struct bpf_rb_excl_node *rb_node;
+ struct sk_buff *skb = NULL;
+ struct bpf_list_head *head;
+ struct bpf_list_node *node;
+ struct bpf_spin_lock *lock;
+ struct fq_flow_node *flow;
+ bool is_empty;
+
+ head = &fq_new_flows;
+ lock = &fq_new_flows_lock;
+ bpf_spin_lock(&fq_new_flows_lock);
+ node = bpf_list_pop_front(&fq_new_flows);
+ bpf_spin_unlock(&fq_new_flows_lock);
+ if (!node) {
+ head = &fq_old_flows;
+ lock = &fq_old_flows_lock;
+ bpf_spin_lock(&fq_old_flows_lock);
+ node = bpf_list_pop_front(&fq_old_flows);
+ bpf_spin_unlock(&fq_old_flows_lock);
+ if (!node) {
+ if (time_next_delayed_flow != ~0ULL)
+ ctx->expire = time_next_delayed_flow;
+ return 1;
+ }
+ }
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ if (flow->credit <= 0) {
+ flow->credit += q_quantum;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+ return 0;
+ }
+
+ bpf_spin_lock(&flow->lock);
+ rb_node = bpf_rbtree_excl_first(&flow->queue);
+ if (!rb_node) {
+ bpf_spin_unlock(&flow->lock);
+ is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+ if (head == &fq_new_flows && !is_empty)
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+ else
+ fq_flow_set_detached(flow);
+
+ return 0;
+ }
+
+ skb = container_of(rb_node, struct sk_buff, bpf_rbnode);
+ time_to_send = skb->tstamp;
+
+ time_next_packet = (time_to_send > flow->time_next_packet) ?
+ time_to_send : flow->time_next_packet;
+ if (dequeue_now < time_next_packet) {
+ bpf_spin_unlock(&flow->lock);
+ flow->time_next_packet = time_next_packet;
+ fq_flow_set_throttled(flow);
+ return 0;
+ }
+
+ rb_node = bpf_rbtree_excl_remove(&flow->queue, rb_node);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!rb_node) {
+ fq_flows_add_tail(head, lock, flow);
+ return 0; //unexpected
+ }
+
+ skb = container_of(rb_node, struct sk_buff, bpf_rbnode);
+
+ flow->credit -= qdisc_skb_cb(skb)->pkt_len;
+ flow->qlen--;
+ fq_qlen--;
+
+ ctx->dequeued = true;
+ stash_skb(skb);
+
+ fq_flows_add_head(head, lock, flow);
+
+ return 1;
+}
+
+static __always_inline struct sk_buff *fq_dequeue_prio(void)
+{
+ struct fq_flow_node *flow = NULL;
+ struct fq_stashed_flow *sflow;
+ struct sk_buff *skb = NULL;
+ struct bpf_rb_excl_node *node;
+ u32 hash = NUM_QUEUE;
+
+ sflow = bpf_map_lookup_elem(&fq_stashed_flows, &hash);
+ if (!sflow)
+ return NULL; //unexpected
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ return NULL;
+
+ bpf_spin_lock(&flow->lock);
+ node = bpf_rbtree_excl_first(&flow->queue);
+ if (!node) {
+ bpf_spin_unlock(&flow->lock);
+ goto xchg_flow_back;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ node = bpf_rbtree_excl_remove(&flow->queue, &skb->bpf_rbnode);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!node) {
+ skb = NULL;
+ goto xchg_flow_back;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ fq_qlen--;
+
+xchg_flow_back:
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_fq_dequeue")
+struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
+{
+ struct dequeue_nonprio_ctx cb_ctx = {};
+ struct sk_buff *skb = NULL;
+
+ skb = fq_dequeue_prio();
+ if (skb) {
+ bpf_skb_set_dev(skb, sch);
+ return skb;
+ }
+
+ ktime_cache = dequeue_now = bpf_ktime_get_ns();
+ fq_check_throttled();
+ bpf_loop(q_plimit, fq_dequeue_nonprio_flows, &cb_ctx, 0);
+
+ skb = get_stashed_skb();
+
+ if (skb) {
+ bpf_skb_set_dev(skb, sch);
+ return skb;
+ }
+
+ if (cb_ctx.expire)
+ bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q_timer_slack);
+
+ return NULL;
+}
+
+static int
+fq_reset_flows(u32 index, void *ctx)
+{
+ struct bpf_list_node *node;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_new_flows_lock);
+ node = bpf_list_pop_front(&fq_new_flows);
+ bpf_spin_unlock(&fq_new_flows_lock);
+ if (!node) {
+ bpf_spin_lock(&fq_old_flows_lock);
+ node = bpf_list_pop_front(&fq_old_flows);
+ bpf_spin_unlock(&fq_old_flows_lock);
+ if (!node)
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ bpf_obj_drop(flow);
+
+ return 0;
+}
+
+static int
+fq_reset_stashed_flows(u32 index, void *ctx)
+{
+ struct fq_flow_node *flow = NULL;
+ struct fq_stashed_flow *sflow;
+
+ sflow = bpf_map_lookup_elem(&fq_stashed_flows, &index);
+ if (!sflow)
+ return 0;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (flow)
+ bpf_obj_drop(flow);
+
+ return 0;
+}
+
+SEC("struct_ops/bpf_fq_reset")
+void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
+{
+ bool unset_all = true;
+ fq_qlen = 0;
+ bpf_loop(NUM_QUEUE + 1, fq_reset_stashed_flows, NULL, 0);
+ bpf_loop(NUM_QUEUE, fq_reset_flows, NULL, 0);
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+ return;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fq = {
+ .enqueue = (void *)bpf_fq_enqueue,
+ .dequeue = (void *)bpf_fq_dequeue,
+ .reset = (void *)bpf_fq_reset,
+ .id = "bpf_fq",
+};
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 19/20] selftests: Add a bpf netem qdisc to selftest
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (17 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 20/20] selftests: Add a prio bpf qdisc Amery Hung
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This test implements a simple network emulator qdisc that simulates
packet drops (loss) and delay. The qdisc uses the Gilbert-Elliott model to
decide packet drops. When used with the mq qdisc, the bpf netem qdiscs on
different tx queues maintain a global state machine using a bpf map.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 30 +++
.../selftests/bpf/progs/bpf_qdisc_netem.c | 236 ++++++++++++++++++
2 files changed, 266 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
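A note on usage (not part of the patch): the Gilbert-Elliott state is shared
through the g_clg_state array map, so it would be seeded from userspace
before sending traffic. A rough sketch of what that could look like inside
test_netem(), assuming a matching struct clg_state definition on the
userspace side; the probabilities are made-up examples scaled to the u32
range that loss_gilb_ell() compares against bpf_get_prandom_u32():
	struct clg_state st = {
		.state = 1,                           /* GOOD_STATE; value assumed to match bpf_qdisc_common.h */
		.a1 = (__u32)(0.05 * 0xffffffffULL),  /* transition good -> bad when r1 < a1 */
		.a2 = (__u32)(0.20 * 0xffffffffULL),  /* transition bad -> good when r1 < a2 */
		.a3 = (__u32)(0.70 * 0xffffffffULL),  /* in bad state, drop when r2 > a3 */
		.a4 = (__u32)(0.001 * 0xffffffffULL), /* in good state, drop when r2 < a4 */
	};
	__u32 key = 0;

	bpf_map_update_elem(bpf_map__fd(netem_skel->maps.g_clg_state),
			    &key, &st, BPF_ANY);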
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 394bf5a4adae..ec9c0d166e89 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -6,6 +6,13 @@
#include "bpf_qdisc_fifo.skel.h"
#include "bpf_qdisc_fq.skel.h"
+struct crndstate {
+ u32 last;
+ u32 rho;
+};
+
+#include "bpf_qdisc_netem.skel.h"
+
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif
@@ -176,10 +183,33 @@ static void test_fq(void)
bpf_qdisc_fq__destroy(fq_skel);
}
+static void test_netem(void)
+{
+ struct bpf_qdisc_netem *netem_skel;
+ struct bpf_link *link;
+
+ netem_skel = bpf_qdisc_netem__open_and_load();
+ if (!ASSERT_OK_PTR(netem_skel, "bpf_qdisc_netem__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(netem_skel->maps.netem);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_netem__destroy(netem_skel);
+ return;
+ }
+
+ do_test("bpf_netem");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_netem__destroy(netem_skel);
+}
+
void test_bpf_qdisc(void)
{
if (test__start_subtest("fifo"))
test_fifo();
if (test__start_subtest("fq"))
test_fq();
+ if (test__start_subtest("netem"))
+ test_netem();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
new file mode 100644
index 000000000000..c1df73cdbd3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
@@ -0,0 +1,236 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock t_root_lock;
+private(A) struct bpf_rb_root t_root __contains_kptr(sk_buff, bpf_rbnode);
+
+int q_loss_model = CLG_GILB_ELL;
+unsigned int q_limit = 1000;
+signed long q_latency = 0;
+signed long q_jitter = 0;
+unsigned int q_loss = 1;
+unsigned int q_qlen = 0;
+
+struct crndstate q_loss_cor = {.last = 0, .rho = 0,};
+struct crndstate q_delay_cor = {.last = 0, .rho = 0,};
+
+struct clg_state {
+ u64 state;
+ u32 a1;
+ u32 a2;
+ u32 a3;
+ u32 a4;
+ u32 a5;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct clg_state);
+ __uint(max_entries, 1);
+} g_clg_state SEC(".maps");
+
+static bool skb_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct sk_buff *skb_a;
+ struct sk_buff *skb_b;
+
+ skb_a = container_of(a, struct sk_buff, bpf_rbnode);
+ skb_b = container_of(b, struct sk_buff, bpf_rbnode);
+
+ return skb_a->tstamp < skb_b->tstamp;
+}
+
+static __always_inline u32 get_crandom(struct crndstate *state)
+{
+ u64 value, rho;
+ unsigned long answer;
+
+ if (!state || state->rho == 0) /* no correlation */
+ return bpf_get_prandom_u32();
+
+ value = bpf_get_prandom_u32();
+ rho = (u64)state->rho + 1;
+ answer = (value * ((1ull<<32) - rho) + state->last * rho) >> 32;
+ state->last = answer;
+ return answer;
+}
+
+static __always_inline s64 tabledist(s64 mu, s32 sigma, struct crndstate *state)
+{
+ u32 rnd;
+
+ if (sigma == 0)
+ return mu;
+
+ rnd = get_crandom(state);
+
+ /* default uniform distribution */
+ return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
+}
+
+static __always_inline bool loss_gilb_ell(void)
+{
+ struct clg_state *clg;
+ u32 r1, r2, key = 0;
+ bool ret = false;
+
+ clg = bpf_map_lookup_elem(&g_clg_state, &key);
+ if (!clg)
+ return false;
+
+ r1 = bpf_get_prandom_u32();
+ r2 = bpf_get_prandom_u32();
+
+ switch (clg->state) {
+ case GOOD_STATE:
+ if (r1 < clg->a1)
+ __sync_val_compare_and_swap(&clg->state,
+ GOOD_STATE, BAD_STATE);
+ if (r2 < clg->a4)
+ ret = true;
+ break;
+ case BAD_STATE:
+ if (r1 < clg->a2)
+ __sync_val_compare_and_swap(&clg->state,
+ BAD_STATE, GOOD_STATE);
+ if (r2 > clg->a3)
+ ret = true;
+ }
+
+ return ret;
+}
+
+static __always_inline bool loss_event(void)
+{
+ switch (q_loss_model) {
+ case CLG_RANDOM:
+ return q_loss && q_loss >= get_crandom(&q_loss_cor);
+ case CLG_GILB_ELL:
+ return loss_gilb_ell();
+ }
+
+ return false;
+}
+
+SEC("struct_ops/bpf_netem_enqueue")
+int BPF_PROG(bpf_netem_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ int count = 1;
+ s64 delay = 0;
+ u64 now;
+
+ if (loss_event())
+ --count;
+
+ if (count == 0) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ }
+
+ q_qlen++;
+ if (q_qlen > q_limit) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+ }
+
+ delay = tabledist(q_latency, q_jitter, &q_delay_cor);
+ now = bpf_ktime_get_ns();
+ skb->tstamp = now + delay;
+
+ bpf_spin_lock(&t_root_lock);
+ bpf_rbtree_excl_add(&t_root, &skb->bpf_rbnode, skb_tstamp_less);
+ bpf_spin_unlock(&t_root_lock);
+
+ return NET_XMIT_SUCCESS;
+}
+
+SEC("struct_ops/bpf_netem_dequeue")
+struct sk_buff *BPF_PROG(bpf_netem_dequeue, struct Qdisc *sch)
+{
+ struct bpf_rb_excl_node *node;
+ struct sk_buff *skb;
+ u64 now, tstamp;
+
+ now = bpf_ktime_get_ns();
+
+ bpf_spin_lock(&t_root_lock);
+ node = bpf_rbtree_excl_first(&t_root);
+ if (!node) {
+ bpf_spin_unlock(&t_root_lock);
+ return NULL;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ tstamp = skb->tstamp;
+ if (tstamp <= now) {
+ node = bpf_rbtree_excl_remove(&t_root, node);
+ bpf_spin_unlock(&t_root_lock);
+
+ if (!node)
+ return NULL;
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ bpf_skb_set_dev(skb, sch);
+ q_qlen--;
+ return skb;
+ }
+
+ bpf_spin_unlock(&t_root_lock);
+ bpf_qdisc_watchdog_schedule(sch, tstamp, 0);
+ return NULL;
+}
+
+SEC("struct_ops/bpf_netem_init")
+int BPF_PROG(bpf_netem_init, struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static int reset_queue(u32 index, void *ctx)
+{
+ struct bpf_rb_excl_node *node;
+ struct sk_buff *skb;
+
+ bpf_spin_lock(&t_root_lock);
+ node = bpf_rbtree_excl_first(&t_root);
+ if (!node) {
+ bpf_spin_unlock(&t_root_lock);
+ return 1;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ node = bpf_rbtree_excl_remove(&t_root, node);
+ bpf_spin_unlock(&t_root_lock);
+
+ if (!node)
+ return 1;
+
+ skb = container_of(node, struct sk_buff, bpf_rbnode);
+ bpf_skb_release(skb);
+ return 0;
+}
+
+SEC("struct_ops/bpf_netem_reset")
+void BPF_PROG(bpf_netem_reset, struct Qdisc *sch)
+{
+ bpf_loop(q_limit, reset_queue, NULL, 0);
+ q_qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops netem = {
+ .enqueue = (void *)bpf_netem_enqueue,
+ .dequeue = (void *)bpf_netem_dequeue,
+ .init = (void *)bpf_netem_init,
+ .reset = (void *)bpf_netem_reset,
+ .id = "bpf_netem",
+};
+
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [RFC PATCH v8 20/20] selftests: Add a prio bpf qdisc
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
` (18 preceding siblings ...)
2024-05-10 19:24 ` [RFC PATCH v8 19/20] selftests: Add a bpf netem " Amery Hung
@ 2024-05-10 19:24 ` Amery Hung
19 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-10 19:24 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, ameryhung
This test implements a classful qdisc using bpf. The prio qdisc, like
its native counterpart, has 16 bands. An skb is classified into a band
based on its priority, and during dequeue the band with the lowest
priority value is tried first (e.g., with qdisc handle 800:, an skb with
priority 3 is enqueued to class 800:3). The bpf prio qdisc populates the
classes with pfifo qdiscs during initialization, and the test later
replaces them with fq qdiscs. A direct queue backed by a bpf list makes
sure traffic keeps flowing even if the qdiscs in all bands are removed.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 52 +++++++-
.../selftests/bpf/progs/bpf_qdisc_prio.c | 112 ++++++++++++++++++
2 files changed, 160 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_prio.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index ec9c0d166e89..e1e80fb3c52d 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -2,9 +2,11 @@
#include <linux/rtnetlink.h>
#include <test_progs.h>
+#include "netlink_helpers.h"
#include "network_helpers.h"
#include "bpf_qdisc_fifo.skel.h"
#include "bpf_qdisc_fq.skel.h"
+#include "bpf_qdisc_prio.skel.h"
struct crndstate {
u32 last;
@@ -65,7 +67,7 @@ static void *server(void *arg)
return NULL;
}
-static void do_test(char *qdisc)
+static void do_test(char *qdisc, int (*setup)(void))
{
DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
.attach_point = BPF_TC_QDISC,
@@ -87,6 +89,12 @@ static void do_test(char *qdisc)
if (!ASSERT_OK(err, "attach qdisc"))
return;
+ if (setup) {
+ err = setup();
+ if (!ASSERT_OK(err, "setup qdisc"))
+ return;
+ }
+
lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
if (!ASSERT_NEQ(lfd, -1, "socket")) {
bpf_tc_hook_destroy(&hook);
@@ -156,7 +164,7 @@ static void test_fifo(void)
return;
}
- do_test("bpf_fifo");
+ do_test("bpf_fifo", NULL);
bpf_link__destroy(link);
bpf_qdisc_fifo__destroy(fifo_skel);
@@ -177,7 +185,7 @@ static void test_fq(void)
return;
}
- do_test("bpf_fq");
+ do_test("bpf_fq", NULL);
bpf_link__destroy(link);
bpf_qdisc_fq__destroy(fq_skel);
@@ -198,12 +206,46 @@ static void test_netem(void)
return;
}
- do_test("bpf_netem");
+ do_test("bpf_netem", NULL);
bpf_link__destroy(link);
bpf_qdisc_netem__destroy(netem_skel);
}
+static int setup_prio_bands(void)
+{
+ char cmd[128];
+ int i;
+
+ for (i = 1; i <= 16; i++) {
+ snprintf(cmd, sizeof(cmd), "tc qdisc add dev lo parent 800:%x handle %x0: fq", i, i);
+ if (!ASSERT_OK(system(cmd), cmd))
+ return -1;
+ }
+ return 0;
+}
+
+static void test_prio_qdisc(void)
+{
+ struct bpf_qdisc_prio *prio_skel;
+ struct bpf_link *link;
+
+ prio_skel = bpf_qdisc_prio__open_and_load();
+ if (!ASSERT_OK_PTR(prio_skel, "bpf_qdisc_prio__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(prio_skel->maps.prio);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_prio__destroy(prio_skel);
+ return;
+ }
+
+ do_test("bpf_prio", &setup_prio_bands);
+
+ bpf_link__destroy(link);
+ bpf_qdisc_prio__destroy(prio_skel);
+}
+
void test_bpf_qdisc(void)
{
if (test__start_subtest("fifo"))
@@ -212,4 +254,6 @@ void test_bpf_qdisc(void)
test_fq();
if (test__start_subtest("netem"))
test_netem();
+ if (test__start_subtest("prio"))
+ test_prio_qdisc();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_prio.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_prio.c
new file mode 100644
index 000000000000..9a7797a7ed9d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_prio.c
@@ -0,0 +1,112 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(B) struct bpf_spin_lock direct_queue_lock;
+private(B) struct bpf_list_head direct_queue __contains_kptr(sk_buff, bpf_list);
+
+unsigned int q_limit = 1000;
+unsigned int q_qlen = 0;
+
+SEC("struct_ops/bpf_prio_enqueue")
+int BPF_PROG(bpf_prio_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ u32 classid = sch->handle | (skb->priority & TC_PRIO_MAX);
+
+ if (bpf_qdisc_find_class(sch, classid))
+ return bpf_qdisc_enqueue(skb, sch, classid, to_free);
+
+ q_qlen++;
+ if (q_qlen > q_limit) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+ }
+
+ bpf_spin_lock(&direct_queue_lock);
+ bpf_list_excl_push_back(&direct_queue, &skb->bpf_list);
+ bpf_spin_unlock(&direct_queue_lock);
+
+ return NET_XMIT_SUCCESS;
+}
+
+SEC("struct_ops/bpf_prio_dequeue")
+struct sk_buff *BPF_PROG(bpf_prio_dequeue, struct Qdisc *sch)
+{
+ struct bpf_list_excl_node *node;
+ struct sk_buff *skb;
+ u32 i, classid;
+
+ bpf_spin_lock(&direct_queue_lock);
+ node = bpf_list_excl_pop_front(&direct_queue);
+ bpf_spin_unlock(&direct_queue_lock);
+ if (!node) {
+ for (i = 0; i <= TC_PRIO_MAX; i++) {
+ classid = sch->handle | i;
+ skb = bpf_qdisc_dequeue(sch, classid);
+ if (skb)
+ return skb;
+ }
+ return NULL;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_list);
+ bpf_skb_set_dev(skb, sch);
+ q_qlen--;
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_prio_init")
+int BPF_PROG(bpf_prio_init, struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ int i, err;
+
+ for (i = 1; i <= TC_PRIO_MAX + 1; i++) {
+ err = bpf_qdisc_create_child(sch, i, extack);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+static int reset_direct_queue(u32 index, void *ctx)
+{
+ struct bpf_list_excl_node *node;
+ struct sk_buff *skb;
+
+ bpf_spin_lock(&direct_queue_lock);
+ node = bpf_list_excl_pop_front(&direct_queue);
+ bpf_spin_unlock(&direct_queue_lock);
+
+ if (!node) {
+ return 1;
+ }
+
+ skb = container_of(node, struct sk_buff, bpf_list);
+ bpf_skb_release(skb);
+ return 0;
+}
+
+SEC("struct_ops/bpf_prio_reset")
+void BPF_PROG(bpf_prio_reset, struct Qdisc *sch)
+{
+ bpf_loop(q_qlen, reset_direct_queue, NULL, 0);
+ q_qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops prio = {
+ .enqueue = (void *)bpf_prio_enqueue,
+ .dequeue = (void *)bpf_prio_dequeue,
+ .init = (void *)bpf_prio_init,
+ .reset = (void *)bpf_prio_reset,
+ .id = "bpf_prio",
+};
+
--
2.20.1
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-10 19:23 ` [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of " Amery Hung
@ 2024-05-10 21:33 ` Kui-Feng Lee
2024-05-10 22:16 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Kui-Feng Lee @ 2024-05-10 21:33 UTC (permalink / raw)
To: Amery Hung, netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, toke, jhs, jiri, sdf,
xiyou.wangcong, yepeilin.cs
On 5/10/24 12:23, Amery Hung wrote:
> A reference is automatically acquired for a referenced kptr argument
> annotated via the stub function with "__ref_acquired" in a struct_ops
> program. It must be released and cannot be acquired more than once.
>
> The test first checks whether a reference to the correct type is acquired
> in "ref_acquire". Then, we check if the verifier correctly rejects the
> program that fails to release the reference (i.e., reference leak) in
> "ref_acquire_ref_leak". Finally, we check if the reference can be only
> acquired once through the argument in "ref_acquire_dup_ref".
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> .../selftests/bpf/bpf_testmod/bpf_testmod.c | 7 +++
> .../selftests/bpf/bpf_testmod/bpf_testmod.h | 2 +
> .../prog_tests/test_struct_ops_ref_acquire.c | 58 +++++++++++++++++++
> .../bpf/progs/struct_ops_ref_acquire.c | 27 +++++++++
> .../progs/struct_ops_ref_acquire_dup_ref.c | 24 ++++++++
> .../progs/struct_ops_ref_acquire_ref_leak.c | 19 ++++++
> 6 files changed, 137 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
> create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
> create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
>
>
... skipped ...
> +
> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> new file mode 100644
> index 000000000000..bae342db0fdb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> @@ -0,0 +1,27 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../bpf_testmod/bpf_testmod.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +void bpf_task_release(struct task_struct *p) __ksym;
> +
> +/* This is a test BPF program that uses struct_ops to access a referenced
> + * kptr argument. This is a test for the verifier to ensure that it recongnizes
> + * the task as a referenced object (i.e., ref_obj_id > 0).
> + */
> +SEC("struct_ops/test_ref_acquire")
> +int BPF_PROG(test_ref_acquire, int dummy,
> + struct task_struct *task)
> +{
> + bpf_task_release(task);
This looks weird to me.
According to what you mentioned in patch 1, the purpose is to prevent
acquiring multiple references. So, is it possible to return NULL from the
acquire function if it has already returned a reference before?
> +
> + return 0;
> +}
> +
> +
... skipped ...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-10 21:33 ` Kui-Feng Lee
@ 2024-05-10 22:16 ` Amery Hung
2024-05-16 23:14 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-10 22:16 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, May 10, 2024 at 2:33 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 5/10/24 12:23, Amery Hung wrote:
> > A reference is automatically acquired for a referenced kptr argument
> > annotated via the stub function with "__ref_acquired" in a struct_ops
> > program. It must be released and cannot be acquired more than once.
> >
> > The test first checks whether a reference to the correct type is acquired
> > in "ref_acquire". Then, we check if the verifier correctly rejects the
> > program that fails to release the reference (i.e., reference leak) in
> > "ref_acquire_ref_leak". Finally, we check if the reference can be only
> > acquired once through the argument in "ref_acquire_dup_ref".
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > .../selftests/bpf/bpf_testmod/bpf_testmod.c | 7 +++
> > .../selftests/bpf/bpf_testmod/bpf_testmod.h | 2 +
> > .../prog_tests/test_struct_ops_ref_acquire.c | 58 +++++++++++++++++++
> > .../bpf/progs/struct_ops_ref_acquire.c | 27 +++++++++
> > .../progs/struct_ops_ref_acquire_dup_ref.c | 24 ++++++++
> > .../progs/struct_ops_ref_acquire_ref_leak.c | 19 ++++++
> > 6 files changed, 137 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_ref_acquire.c
> > create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> > create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_dup_ref.c
> > create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_ref_acquire_ref_leak.c
> >
> >
> ... skipped ...
> > +
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> > new file mode 100644
> > index 000000000000..bae342db0fdb
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_ref_acquire.c
> > @@ -0,0 +1,27 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../bpf_testmod/bpf_testmod.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +void bpf_task_release(struct task_struct *p) __ksym;
> > +
> > +/* This is a test BPF program that uses struct_ops to access a referenced
> > + * kptr argument. This is a test for the verifier to ensure that it recongnizes
> > + * the task as a referenced object (i.e., ref_obj_id > 0).
> > + */
> > +SEC("struct_ops/test_ref_acquire")
> > +int BPF_PROG(test_ref_acquire, int dummy,
> > + struct task_struct *task)
> > +{
> > + bpf_task_release(task);
>
> This looks weird for me.
>
> According to what you mentioned in the patch 1, the purpose is to
> prevent acquiring multiple references from happening. So, is it possible
> to return NULL from the acquire function if having returned a reference
> before?
The purpose of ref_acquired is to allow acquiring a referenced kptr from a
struct_ops argument just once. Whether multiple references can be
acquired/duplicated later is, I think, orthogonal.
In bpf qdisc, we ensure a unique reference to the skb through ref_acquired
and the fact that there is no bpf_refcount field in sk_buff (so users
cannot call bpf_refcount_acquire()).
In this case, it is true that programs like the one below will be able to
get multiple references to the task (is this the scenario you have in
mind?). Thus, if users want to enforce the unique-reference semantic, they
need to make bpf_task_acquire() unavailable as well.
SEC("struct_ops/test_ref_acquire")
int BPF_PROG(test_ref_acquire, int dummy,
struct task_struct *task)
{
struct task_struct task2;
task2 = bpf_task_acquire(task);
bpf_task_release(task);
if (task2)
bpf_task_release(task2);
return 0;
}
>
>
> > +
> > + return 0;
> > +}
> > +
> > +
> ... skipped ...
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-10 22:16 ` Amery Hung
@ 2024-05-16 23:14 ` Amery Hung
2024-05-16 23:43 ` Martin KaFai Lau
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-16 23:14 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs
I thought about patches 1-4 a bit more after the discussion at LSF/MM/BPF
and I think we should keep what "ref_acquired" does, but maybe rename it to
"ref_moved".
We discussed the lifecycle of the skb in a qdisc and the changes to
struct_ops and bpf semantics. In short, at the beginning of .enqueue, the
kernel passes ownership of an skb to the qdisc. We do not increase the
skb's reference count since this is an ownership transfer, not the kernel
and the qdisc both holding references to the skb. (A counterexample can be
found in RFC v7; see how weird the skb release kfuncs look there [0].) The
skb must be either enqueued or dropped. Then, in .dequeue, an skb is
removed from the queue and its ownership is returned to the kernel.
A referenced kptr in bpf already carries this ownership semantic. Thus,
what we need here is to enable struct_ops programs to get a referenced
kptr from an argument and to return a referenced kptr (achieved via
patches 1-4).
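For context, the annotation lives in the stub function that backs Qdisc_ops,
with the suffix appended to the parameter name just like the existing
__nullable convention. A rough sketch of what the tagged stub might look
like (stub name and exact signature assumed here; the real one lives in the
qdisc patch of this series):
	/* BTF stub only, never called at runtime */
	static int Qdisc_ops__enqueue(struct sk_buff *skb__ref_acquired,
				      struct Qdisc *sch,
				      struct bpf_sk_buff_ptr *to_free)
	{
		return 0;
	}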
Proper handling of referenced objects is important for safety reasons.
In the case of bpf qdisc, there are three problematic situations, listed
below; the referenced kptr mechanism takes care of (1) and (2).
(1) .enqueue neither enqueueing nor dropping the skb, causing a reference
leak
(2) .dequeue returning a made-up, invalid skb pointer to the kernel
(3) If bpf qdisc ops can duplicate skb references, multiple references to
the same skb can exist. If we enqueue these references into collections
and dequeue one of them, skb->dev is restored when the skb is removed
from the collection; since "dev" and "rbnode" share the same memory, any
copy still linked in a collection is left with an invalid skb->rbnode
(see the layout sketch below).
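To make (3) concrete, the overlap comes from sk_buff keeping its queueing
linkage and the device pointer in the same union, roughly like this
(simplified from include/linux/skbuff.h; patches 9-12 are what allow the
exclusive bpf list/rbtree nodes to live in this union as well):
	struct sk_buff {
		union {
			struct {
				struct sk_buff		*next;
				struct sk_buff		*prev;
				union {
					struct net_device	*dev;
					unsigned long		dev_scratch;
				};
			};
			struct rb_node		rbnode;	/* shares memory with next/prev/dev */
			struct list_head	list;
		};
		/* ... */
	};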
A discussion point was about introducing and enforcing a unique-reference
semantic (PTR_UNIQUE) to mitigate (3). After giving it more thought, I
think we should keep "ref_acquired" and be careful about kernel-side
implementations that could return a referenced kptr. Taking a step back,
(3) is only problematic because I assumed the kfunc merely increases the
skb's reference count (i.e., skb_get()). It could have been done safely
using skb_copy() or maybe pskb_copy(). In other words, it is a kernel
implementation issue, not a verifier issue. Besides, the verifier has no
knowledge of what a kfunc flagged KF_ACQUIRE does internally.
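For instance, instead of a kfunc that just bumps the refcount, a kernel-side
acquire kfunc could hand out an independent copy so no two bpf references
ever alias the same skb. A rough sketch (hypothetical kfunc name, no error
handling or kfunc flag wiring shown):
	__bpf_kfunc struct sk_buff *bpf_skb_acquire_copy(struct sk_buff *skb)
	{
		/* duplicates the skb instead of taking another reference,
		 * so the dev/rbnode aliasing issue in (3) cannot happen
		 */
		return pskb_copy(skb, GFP_ATOMIC);
	}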
In v8, we try to do this safely by only allowing the "ref_acquired"-
annotated argument to be read once. Since the argument passed to
struct_ops never changes during a single invocation, it always references
the same kernel object. Therefore, reading it more than once and returning
multiple references shouldn't be allowed. Maybe "ref_moved" is a more
precise annotation label, hinting that ownership is transferred.
[0] https://lore.kernel.org/netdev/2d31261b245828d09d2f80e0953e911a9c38573a.1705432850.git.amery.hung@bytedance.com/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 09/20] bpf: Find special BTF fields in union
2024-05-10 19:24 ` [RFC PATCH v8 09/20] bpf: Find special BTF fields in union Amery Hung
@ 2024-05-16 23:37 ` Amery Hung
0 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-16 23:37 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs
The implementation supporting adding an skb to collections is flaky, as
Kui-Feng pointed out in an offline discussion. Basically, supporting
special BTF fields in unions needs more care.
I will defer patches 5-12 to another patchset after the first bpf qdisc
patchset lands. While the performance of qdiscs implemented with the first
series will not be as good, this will make the patchset easier to review.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-16 23:14 ` Amery Hung
@ 2024-05-16 23:43 ` Martin KaFai Lau
2024-05-17 0:54 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-16 23:43 UTC (permalink / raw)
To: Amery Hung, Kui-Feng Lee
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs
On 5/16/24 4:14 PM, Amery Hung wrote:
> I thought about patch 1-4 a bit more after the discussion in LSFMMBPF and
> I think we should keep what "ref_acquried" does, but maybe rename it to
> "ref_moved".
>
> We discussed the lifecycle of skb in qdisc and changes to struct_ops and
> bpf semantics. In short, At the beginning of .enqueue, the kernel passes
> the ownership of an skb to a qdisc. We do not increase the reference count
> of skb since this is an ownership transfer, not kernel and qdisc both
> holding references to the skb. (The counterexample can be found in RFC v7.
> See how weird skb release kfuncs look[0]). The skb should be either
> enqueued or dropped. Then, in .dequeue, an skb will be removed from the
> queue and the ownership will be returned to the kernel.
>
> Referenced kptr in bpf already carries the semantic of ownership. Thus,
> what we need here is to enable struct_ops programs to get a referenced
> kptr from the argument and returning referenced kptr (achieved via patch
> 1-4).
>
> Proper handling of referenced objects is important for safety reasons.
> In the case of bpf qdisc, there are three problematic situations as listed
> below, and referenced kptr has taken care of (1) and (2).
>
> (1) .enqueue not enqueueing nor dropping the skb, causing reference leak
>
> (2) .dequeue making up an invalid skb ptr and returning to kernel
>
> (3) If bpf qdisc operators can duplicate skb references, multiple
> references to the same skb can be present. If we enqueue these
> references to a collection and dequeue one, since skb->dev will be
> restored after the skb is removed from the collection, other skb in
> the collection will then have invalid skb->rbnode as "dev" and "rbnode"
> share the same memory.
>
> A discussion point was about introducing and enforcing a unique reference
> semantic (PTR_UNIQUE) to mitigate (3). After giving it more thoughts, I
> think we should keep "ref_acquired", and be careful about kernel-side
> implementation that could return referenced kptr. Taking a step back, (3)
> is only problematic because I made an assumption that the kfunc only
> increases the reference count of skb (i.e., skb_get()). It could have been
> done safely using skb_copy() or maybe pskb_copy(). In other words, it is a
> kernel implementation issue, and not a verifier issue. Besides, the
> verifier has no knowledge about what a kfunc with KF_ACQUIRE does
> internally whatsoever.
>
> In v8, we try to do this safely by only allowing reading "ref_acquired"-
> annotated argument once. Since the argument passed to struct_ops never
> changes when during a single invocation, it will always be referencing the
> same kernel object. Therefore, reading more than once and returning
> mulitple references shouldn't be allowed. Maybe "ref_moved" is a more
> precise annotation label, hinting that the ownership is transferred.
The part that no skb acquire kfunc should be available to the qdisc
struct_ops prog is understood. I think the commit message just needs to be
clarified, and the "It must be released and cannot be acquired more than
once" part removed.
>
> [0] https://lore.kernel.org/netdev/2d31261b245828d09d2f80e0953e911a9c38573a.1705432850.git.amery.hung@bytedance.com/
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-10 19:23 ` [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs Amery Hung
@ 2024-05-16 23:59 ` Kumar Kartikeya Dwivedi
2024-05-17 0:17 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-05-16 23:59 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, 10 May 2024 at 21:24, Amery Hung <ameryhung@gmail.com> wrote:
>
> This patch supports struct_ops programs that acqurie referenced kptrs
> throguh arguments. In Qdisc_ops, an skb is passed to ".enqueue" in the
> first argument. The qdisc becomes the sole owner of the skb and must
> enqueue or drop the skb. This matches the referenced kptr semantic
> in bpf. However, the existing practice of acquiring a referenced kptr via
> a kfunc with KF_ACQUIRE does not play well in this case. Calling kfuncs
> repeatedly allows the user to acquire multiple references, while there
> should be only one reference to a unique skb in a qdisc.
>
> The solutioin is to make a struct_ops program automatically acquire a
> referenced kptr through a tagged argument in the stub function. When
> tagged with "__ref_acquired" (suggestion for a better name?), an
> reference kptr (ref_obj_id > 0) will be acquired automatically when
> entering the program. In addition, only the first read to the arguement
> is allowed and it will yeild a referenced kptr.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> include/linux/bpf.h | 3 +++
> kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
> kernel/bpf/btf.c | 10 +++++++++-
> kernel/bpf/verifier.c | 16 +++++++++++++---
> 4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 9c6a7b8ff963..6aabca1581fe 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
> struct {
> struct btf *btf;
> u32 btf_id;
> + u32 ref_obj_id;
> };
> };
> struct bpf_verifier_log *log; /* for verbose logs */
> @@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
> enum bpf_reg_type reg_type;
> struct btf *btf;
> u32 btf_id;
> + u32 ref_obj_id;
> + bool ref_acquired;
> };
>
> struct btf_mod_pair {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 86c7884abaf8..bca8e5936846 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
> }
>
> #define MAYBE_NULL_SUFFIX "__nullable"
> +#define REF_ACQUIRED_SUFFIX "__ref_acquired"
> #define MAX_STUB_NAME 128
>
> /* Return the type info of a stub function, if it exists.
> @@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
> struct bpf_struct_ops_arg_info *arg_info)
> {
> const struct btf_type *stub_func_proto, *pointed_type;
> + bool is_nullable = false, is_ref_acquired = false;
> const struct btf_param *stub_args, *args;
> struct bpf_ctx_arg_aux *info, *info_buf;
> u32 nargs, arg_no, info_cnt = 0;
> @@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
> /* Skip arguments that is not suffixed with
> * "__nullable".
> */
> - if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> - MAYBE_NULL_SUFFIX))
> + is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> + MAYBE_NULL_SUFFIX);
> + is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
> + REF_ACQUIRED_SUFFIX);
> + if (!(is_nullable || is_ref_acquired))
> continue;
>
> /* Should be a pointer to struct */
> @@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
> }
>
> /* Fill the information of the new argument */
> - info->reg_type =
> - PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> info->btf_id = arg_btf_id;
> info->btf = btf;
> info->offset = offset;
> + if (is_nullable) {
> + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> + } else if (is_ref_acquired) {
> + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> + info->ref_acquired = true;
> + }
>
> info++;
> info_cnt++;
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 8c95392214ed..e462fb4a4598 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
>
> /* this is a pointer to another type */
> for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
> - const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
> + struct bpf_ctx_arg_aux *ctx_arg_info =
> + (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
>
> if (ctx_arg_info->offset == off) {
> if (!ctx_arg_info->btf_id) {
> @@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> return false;
> }
>
> + if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
> + bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
> + return false;
> + }
> +
> info->reg_type = ctx_arg_info->reg_type;
> info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> info->btf_id = ctx_arg_info->btf_id;
> + info->ref_obj_id = ctx_arg_info->ref_obj_id;
> + ctx_arg_info->ref_obj_id = 0;
> return true;
I think this is fragile. What if the compiler produces two independent
paths in the program which read the skb pointer once?
Technically, the program is still reading the skb pointer once at runtime.
Then you will reset ref_obj_id to 0 when exploring one path, and assign 0
in the other one, causing errors.
ctx_arg_info appears to be global for the program.
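For example (hypothetical program, hypothetical enqueue_path_* helpers),
something like this reads the skb pointer once per path at runtime, but
after inlining the compiler may well sink the ctx load into each branch,
i.e. two load instructions in the program text:
	SEC("struct_ops/enqueue")
	int BPF_PROG(enqueue, struct sk_buff *skb, struct Qdisc *sch,
		     struct bpf_sk_buff_ptr *to_free)
	{
		if (bpf_get_prandom_u32() & 1)
			return enqueue_path_a(skb, to_free);	/* only read of skb on this path */
		return enqueue_path_b(skb, to_free);		/* only read of skb on this path */
	}
With the current approach, whichever branch the verifier explores second
would see ref_obj_id already cleared and get rejected.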
I think the better way would be to check if ref_obj_id is still part
of the reference state.
If the ref_obj_id has already been dropped from reference_state, then
any loads should get ref_obj_id = 0.
That would happen when dropping or enqueueing the skb into qdisc,
which would (I presume) do release_reference_state(ref_obj_id).
Otherwise, all of them can share the same ref_obj_id. You won't have
to implement "can only read once" logic,
and when you enqueue stuff in the qdisc, all identical copies produced
from different load instructions will be invalidated.
Same ref_obj_id == unique ownership of the same object.
You can already have multiple copies through rX = rY, multiple ctx
loads of skb will produce a similar verifier state.
So, on entry, assign ctx_arg_info->ref_obj_id uniquely, then on each load:
	if (reference_state.find(ctx_arg_info->ref_obj_id))
		info->ref_obj_id = ctx_arg_info->ref_obj_id;
	else
		info->ref_obj_id = 0;
Let me know if I missed something.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-16 23:59 ` Kumar Kartikeya Dwivedi
@ 2024-05-17 0:17 ` Amery Hung
2024-05-17 0:23 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-17 0:17 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Thu, May 16, 2024 at 4:59 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 10 May 2024 at 21:24, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > This patch supports struct_ops programs that acqurie referenced kptrs
> > throguh arguments. In Qdisc_ops, an skb is passed to ".enqueue" in the
> > first argument. The qdisc becomes the sole owner of the skb and must
> > enqueue or drop the skb. This matches the referenced kptr semantic
> > in bpf. However, the existing practice of acquiring a referenced kptr via
> > a kfunc with KF_ACQUIRE does not play well in this case. Calling kfuncs
> > repeatedly allows the user to acquire multiple references, while there
> > should be only one reference to a unique skb in a qdisc.
> >
> > The solutioin is to make a struct_ops program automatically acquire a
> > referenced kptr through a tagged argument in the stub function. When
> > tagged with "__ref_acquired" (suggestion for a better name?), an
> > reference kptr (ref_obj_id > 0) will be acquired automatically when
> > entering the program. In addition, only the first read to the arguement
> > is allowed and it will yeild a referenced kptr.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > include/linux/bpf.h | 3 +++
> > kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
> > kernel/bpf/btf.c | 10 +++++++++-
> > kernel/bpf/verifier.c | 16 +++++++++++++---
> > 4 files changed, 38 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 9c6a7b8ff963..6aabca1581fe 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
> > struct {
> > struct btf *btf;
> > u32 btf_id;
> > + u32 ref_obj_id;
> > };
> > };
> > struct bpf_verifier_log *log; /* for verbose logs */
> > @@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
> > enum bpf_reg_type reg_type;
> > struct btf *btf;
> > u32 btf_id;
> > + u32 ref_obj_id;
> > + bool ref_acquired;
> > };
> >
> > struct btf_mod_pair {
> > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > index 86c7884abaf8..bca8e5936846 100644
> > --- a/kernel/bpf/bpf_struct_ops.c
> > +++ b/kernel/bpf/bpf_struct_ops.c
> > @@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
> > }
> >
> > #define MAYBE_NULL_SUFFIX "__nullable"
> > +#define REF_ACQUIRED_SUFFIX "__ref_acquired"
> > #define MAX_STUB_NAME 128
> >
> > /* Return the type info of a stub function, if it exists.
> > @@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
> > struct bpf_struct_ops_arg_info *arg_info)
> > {
> > const struct btf_type *stub_func_proto, *pointed_type;
> > + bool is_nullable = false, is_ref_acquired = false;
> > const struct btf_param *stub_args, *args;
> > struct bpf_ctx_arg_aux *info, *info_buf;
> > u32 nargs, arg_no, info_cnt = 0;
> > @@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
> > /* Skip arguments that is not suffixed with
> > * "__nullable".
> > */
> > - if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > - MAYBE_NULL_SUFFIX))
> > + is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > + MAYBE_NULL_SUFFIX);
> > + is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
> > + REF_ACQUIRED_SUFFIX);
> > + if (!(is_nullable || is_ref_acquired))
> > continue;
> >
> > /* Should be a pointer to struct */
> > @@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
> > }
> >
> > /* Fill the information of the new argument */
> > - info->reg_type =
> > - PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > info->btf_id = arg_btf_id;
> > info->btf = btf;
> > info->offset = offset;
> > + if (is_nullable) {
> > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > + } else if (is_ref_acquired) {
> > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > + info->ref_acquired = true;
> > + }
> >
> > info++;
> > info_cnt++;
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 8c95392214ed..e462fb4a4598 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> >
> > /* this is a pointer to another type */
> > for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
> > - const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
> > + struct bpf_ctx_arg_aux *ctx_arg_info =
> > + (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
> >
> > if (ctx_arg_info->offset == off) {
> > if (!ctx_arg_info->btf_id) {
> > @@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > return false;
> > }
> >
> > + if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
> > + bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
> > + return false;
> > + }
> > +
> > info->reg_type = ctx_arg_info->reg_type;
> > info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > info->btf_id = ctx_arg_info->btf_id;
> > + info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > + ctx_arg_info->ref_obj_id = 0;
> > return true;
>
> I think this is fragile. What if the compiler produces two independent
> paths in the program which read the skb pointer once?
> Technically, the program is still reading the skb pointer once at runtime.
> Then you will reset ref_obj_id to 0 when exploring one, and assign as
> 0 in the other one, causing errors.
> ctx_arg_info appears to be global for the program.
>
> I think the better way would be to check if ref_obj_id is still part
> of the reference state.
> If the ref_obj_id has already been dropped from reference_state, then
> any loads should get ref_obj_id = 0.
> That would happen when dropping or enqueueing the skb into qdisc,
> which would (I presume) do release_reference_state(ref_obj_id).
> Otherwise, all of them can share the same ref_obj_id. You won't have
> to implement "can only read once" logic,
> and when you enqueue stuff in the qdisc, all identical copies produced
> from different load instructions will be invalidated.
> Same ref_obj_id == unique ownership of the same object.
> You can already have multiple copies through rX = rY, multiple ctx
> loads of skb will produce a similar verifier state.
>
> So, on entry, assign ctx_arg_info->ref_obj_id uniquely, then on each load:
> if reference_state.find(ctx_arg_info->ref_obj_id) == true; then
> info->ref_obj_id = ctx_arg_info->ref_obj_id; else info->ref_obj_id =
> 0;
>
> Let me know if I missed something.
You are right. The current approach will falsely reject valid programs,
and your suggestion makes sense.
Thanks,
Amery
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-17 0:17 ` Amery Hung
@ 2024-05-17 0:23 ` Kumar Kartikeya Dwivedi
2024-05-17 1:22 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-05-17 0:23 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, 17 May 2024 at 02:17, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 4:59 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 10 May 2024 at 21:24, Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > This patch supports struct_ops programs that acqurie referenced kptrs
> > > throguh arguments. In Qdisc_ops, an skb is passed to ".enqueue" in the
> > > first argument. The qdisc becomes the sole owner of the skb and must
> > > enqueue or drop the skb. This matches the referenced kptr semantic
> > > in bpf. However, the existing practice of acquiring a referenced kptr via
> > > a kfunc with KF_ACQUIRE does not play well in this case. Calling kfuncs
> > > repeatedly allows the user to acquire multiple references, while there
> > > should be only one reference to a unique skb in a qdisc.
> > >
> > > The solutioin is to make a struct_ops program automatically acquire a
> > > referenced kptr through a tagged argument in the stub function. When
> > > tagged with "__ref_acquired" (suggestion for a better name?), an
> > > reference kptr (ref_obj_id > 0) will be acquired automatically when
> > > entering the program. In addition, only the first read to the arguement
> > > is allowed and it will yeild a referenced kptr.
> > >
> > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > ---
> > > include/linux/bpf.h | 3 +++
> > > kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
> > > kernel/bpf/btf.c | 10 +++++++++-
> > > kernel/bpf/verifier.c | 16 +++++++++++++---
> > > 4 files changed, 38 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 9c6a7b8ff963..6aabca1581fe 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
> > > struct {
> > > struct btf *btf;
> > > u32 btf_id;
> > > + u32 ref_obj_id;
> > > };
> > > };
> > > struct bpf_verifier_log *log; /* for verbose logs */
> > > @@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
> > > enum bpf_reg_type reg_type;
> > > struct btf *btf;
> > > u32 btf_id;
> > > + u32 ref_obj_id;
> > > + bool ref_acquired;
> > > };
> > >
> > > struct btf_mod_pair {
> > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > index 86c7884abaf8..bca8e5936846 100644
> > > --- a/kernel/bpf/bpf_struct_ops.c
> > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > @@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
> > > }
> > >
> > > #define MAYBE_NULL_SUFFIX "__nullable"
> > > +#define REF_ACQUIRED_SUFFIX "__ref_acquired"
> > > #define MAX_STUB_NAME 128
> > >
> > > /* Return the type info of a stub function, if it exists.
> > > @@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
> > > struct bpf_struct_ops_arg_info *arg_info)
> > > {
> > > const struct btf_type *stub_func_proto, *pointed_type;
> > > + bool is_nullable = false, is_ref_acquired = false;
> > > const struct btf_param *stub_args, *args;
> > > struct bpf_ctx_arg_aux *info, *info_buf;
> > > u32 nargs, arg_no, info_cnt = 0;
> > > @@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
> > > /* Skip arguments that is not suffixed with
> > > * "__nullable".
> > > */
> > > - if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > > - MAYBE_NULL_SUFFIX))
> > > + is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > + MAYBE_NULL_SUFFIX);
> > > + is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > + REF_ACQUIRED_SUFFIX);
> > > + if (!(is_nullable || is_ref_acquired))
> > > continue;
> > >
> > > /* Should be a pointer to struct */
> > > @@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
> > > }
> > >
> > > /* Fill the information of the new argument */
> > > - info->reg_type =
> > > - PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > info->btf_id = arg_btf_id;
> > > info->btf = btf;
> > > info->offset = offset;
> > > + if (is_nullable) {
> > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > + } else if (is_ref_acquired) {
> > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > > + info->ref_acquired = true;
> > > + }
> > >
> > > info++;
> > > info_cnt++;
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index 8c95392214ed..e462fb4a4598 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > >
> > > /* this is a pointer to another type */
> > > for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
> > > - const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
> > > + struct bpf_ctx_arg_aux *ctx_arg_info =
> > > + (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
> > >
> > > if (ctx_arg_info->offset == off) {
> > > if (!ctx_arg_info->btf_id) {
> > > @@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > return false;
> > > }
> > >
> > > + if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
> > > + bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
> > > + return false;
> > > + }
> > > +
> > > info->reg_type = ctx_arg_info->reg_type;
> > > info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > > info->btf_id = ctx_arg_info->btf_id;
> > > + info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > > + ctx_arg_info->ref_obj_id = 0;
> > > return true;
> >
> > I think this is fragile. What if the compiler produces two independent
> > paths in the program which read the skb pointer once?
> > Technically, the program is still reading the skb pointer once at runtime.
> > Then you will reset ref_obj_id to 0 when exploring one, and assign as
> > 0 in the other one, causing errors.
> > ctx_arg_info appears to be global for the program.
> >
> > I think the better way would be to check if ref_obj_id is still part
> > of the reference state.
> > If the ref_obj_id has already been dropped from reference_state, then
> > any loads should get ref_obj_id = 0.
> > That would happen when dropping or enqueueing the skb into qdisc,
> > which would (I presume) do release_reference_state(ref_obj_id).
> > Otherwise, all of them can share the same ref_obj_id. You won't have
> > to implement "can only read once" logic,
> > and when you enqueue stuff in the qdisc, all identical copies produced
> > from different load instructions will be invalidated.
> > Same ref_obj_id == unique ownership of the same object.
> > You can already have multiple copies through rX = rY, multiple ctx
> > loads of skb will produce a similar verifier state.
> >
> > So, on entry, assign ctx_arg_info->ref_obj_id uniquely, then on each load:
> > if reference_state.find(ctx_arg_info->ref_obj_id) == true; then
> > info->ref_obj_id = ctx_arg_info->ref_obj_id; else info->ref_obj_id =
> > 0;
> >
> > Let me know if I missed something.
>
> You are right. The current approach will falsely reject valid programs,
> and your suggestion makes sense.
Also, I wonder whether, once the ref_obj_id has been released, we should
mark the loaded register as an unknown scalar rather than as an skb with
ref_obj_id = 0?
Otherwise it will still take PTR_TO_BTF_ID | PTR_TRUSTED as its reg_type,
and I think the verifier will permit reads even if ref_obj_id = 0.
This will surely be bad once the skb has been dropped/enqueued, since the
program should no longer be able to read such memory.
>
> Thanks,
> Amery
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-16 23:43 ` Martin KaFai Lau
@ 2024-05-17 0:54 ` Amery Hung
2024-05-17 1:07 ` Martin KaFai Lau
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-17 0:54 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Kui-Feng Lee, netdev, bpf, yangpeihao, daniel, andrii, martin.lau,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Thu, May 16, 2024 at 4:45 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 5/16/24 4:14 PM, Amery Hung wrote:
> > I thought about patch 1-4 a bit more after the discussion in LSFMMBPF and
> > I think we should keep what "ref_acquired" does, but maybe rename it to
> > "ref_moved".
> >
> > We discussed the lifecycle of skb in qdisc and changes to struct_ops and
> > bpf semantics. In short, at the beginning of .enqueue, the kernel passes
> > the ownership of an skb to a qdisc. We do not increase the reference count
> > of skb since this is an ownership transfer, not kernel and qdisc both
> > holding references to the skb. (The counterexample can be found in RFC v7.
> > See how weird skb release kfuncs look[0]). The skb should be either
> > enqueued or dropped. Then, in .dequeue, an skb will be removed from the
> > queue and the ownership will be returned to the kernel.
> >
> > A referenced kptr in bpf already carries the semantic of ownership. Thus,
> > what we need here is to enable struct_ops programs to get a referenced
> > kptr from the argument and to return a referenced kptr (achieved via patch
> > 1-4).
> >
> > Proper handling of referenced objects is important for safety reasons.
> > In the case of bpf qdisc, there are three problematic situations as listed
> > below, and referenced kptr has taken care of (1) and (2).
> >
> > (1) .enqueue not enqueueing nor dropping the skb, causing reference leak
> >
> > (2) .dequeue making up an invalid skb ptr and returning to kernel
> >
> > (3) If bpf qdisc operators can duplicate skb references, multiple
> > references to the same skb can be present. If we enqueue these
> > references into a collection and dequeue one, then, since skb->dev is
> > restored after the skb is removed from the collection, the other
> > references in the collection will have an invalid skb->rbnode, as
> > "dev" and "rbnode" share the same memory.
> >
> > A discussion point was about introducing and enforcing a unique reference
> > semantic (PTR_UNIQUE) to mitigate (3). After giving it more thought, I
> > think we should keep "ref_acquired", and be careful about kernel-side
> > implementation that could return referenced kptr. Taking a step back, (3)
> > is only problematic because I made an assumption that the kfunc only
> > increases the reference count of skb (i.e., skb_get()). It could have been
> > done safely using skb_copy() or maybe pskb_copy(). In other words, it is a
> > kernel implementation issue, and not a verifier issue. Besides, the
> > verifier has no knowledge about what a kfunc with KF_ACQUIRE does
> > internally whatsoever.
> >
> > In v8, we try to do this safely by only allowing the "ref_acquired"-
> > annotated argument to be read once. Since the argument passed to struct_ops
> > never changes during a single invocation, it always references the same
> > kernel object. Therefore, reading it more than once and returning multiple
> > references shouldn't be allowed. Maybe "ref_moved" is a more precise
> > annotation label, hinting that the ownership is transferred.
>
> The part that no skb acquire kfunc should be available to the qdisc struct_ops
> prog is understood. I think it just needs to clarify the commit message and
> remove the "It must be released and cannot be acquired more than once" part.
>
Got it. I will improve the clarity of the commit message.
In addition, I will also remove "struct_ops_ref_acquire_dup_ref.c", as
whether duplicate references can be acquired through a kfunc is out of
scope (it should be taken care of by the struct_ops implementer). Actually,
this testcase should load, and it does load...
As for the name, do you have any thoughts?
Thanks,
Amery
>
> >
> > [0] https://lore.kernel.org/netdev/2d31261b245828d09d2f80e0953e911a9c38573a.1705432850.git.amery.hung@bytedance.com/
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-05-17 0:54 ` Amery Hung
@ 2024-05-17 1:07 ` Martin KaFai Lau
0 siblings, 0 replies; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-17 1:07 UTC (permalink / raw)
To: Amery Hung
Cc: Kui-Feng Lee, netdev, bpf, yangpeihao, daniel, andrii, martin.lau,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 5/16/24 5:54 PM, Amery Hung wrote:
>> The part that no skb acquire kfunc should be available to the qdisc struct_ops
>> prog is understood. I think it just needs to clarify the commit message and
>> remove the "It must be released and cannot be acquired more than once" part.
>>
> Got it. I will improve the clarity of the commit message.
>
> In addition, I will also remove "struct_ops_ref_acquire_dup_ref.c", as
> whether duplicate references can be acquired through a kfunc is out of
> scope (it should be taken care of by the struct_ops implementer). Actually,
> this testcase should load, and it does load...
>
> As for the name, do you have any thoughts?
Naming is hard... :(
Maybe just keep it short, just "__ref"?
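For illustration, whichever name wins ("__ref_acquired", "__ref_moved", or
"__ref"), the tag only shows up as a parameter suffix in the stub function; the
names below are placeholders, not code from this series:
/* Hypothetical enqueue stub: the suffix on the first parameter is what
 * prepare_arg_info() matches to hand the program a referenced kptr for
 * the skb.  Stub name and suffix are placeholders.
 */
static int bpf_qdisc_enqueue_stub(struct sk_buff *skb__ref, struct Qdisc *sch,
				  struct sk_buff **to_free)
{
	return 0;
}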
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-17 0:23 ` Kumar Kartikeya Dwivedi
@ 2024-05-17 1:22 ` Amery Hung
2024-05-17 2:00 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-17 1:22 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Thu, May 16, 2024 at 5:24 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 17 May 2024 at 02:17, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > On Thu, May 16, 2024 at 4:59 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Fri, 10 May 2024 at 21:24, Amery Hung <ameryhung@gmail.com> wrote:
> > > >
> > > > This patch supports struct_ops programs that acquire referenced kptrs
> > > > through arguments. In Qdisc_ops, an skb is passed to ".enqueue" in the
> > > > first argument. The qdisc becomes the sole owner of the skb and must
> > > > enqueue or drop the skb. This matches the referenced kptr semantic
> > > > in bpf. However, the existing practice of acquiring a referenced kptr via
> > > > a kfunc with KF_ACQUIRE does not play well in this case. Calling kfuncs
> > > > repeatedly allows the user to acquire multiple references, while there
> > > > should be only one reference to a unique skb in a qdisc.
> > > >
> > > > The solution is to make a struct_ops program automatically acquire a
> > > > referenced kptr through a tagged argument in the stub function. When
> > > > tagged with "__ref_acquired" (suggestion for a better name?), an
> > > > reference kptr (ref_obj_id > 0) will be acquired automatically when
> > > > entering the program. In addition, only the first read to the arguement
> > > > is allowed and it will yeild a referenced kptr.
> > > >
> > > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > > ---
> > > > include/linux/bpf.h | 3 +++
> > > > kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
> > > > kernel/bpf/btf.c | 10 +++++++++-
> > > > kernel/bpf/verifier.c | 16 +++++++++++++---
> > > > 4 files changed, 38 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 9c6a7b8ff963..6aabca1581fe 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
> > > > struct {
> > > > struct btf *btf;
> > > > u32 btf_id;
> > > > + u32 ref_obj_id;
> > > > };
> > > > };
> > > > struct bpf_verifier_log *log; /* for verbose logs */
> > > > @@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
> > > > enum bpf_reg_type reg_type;
> > > > struct btf *btf;
> > > > u32 btf_id;
> > > > + u32 ref_obj_id;
> > > > + bool ref_acquired;
> > > > };
> > > >
> > > > struct btf_mod_pair {
> > > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > > index 86c7884abaf8..bca8e5936846 100644
> > > > --- a/kernel/bpf/bpf_struct_ops.c
> > > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > > @@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
> > > > }
> > > >
> > > > #define MAYBE_NULL_SUFFIX "__nullable"
> > > > +#define REF_ACQUIRED_SUFFIX "__ref_acquired"
> > > > #define MAX_STUB_NAME 128
> > > >
> > > > /* Return the type info of a stub function, if it exists.
> > > > @@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
> > > > struct bpf_struct_ops_arg_info *arg_info)
> > > > {
> > > > const struct btf_type *stub_func_proto, *pointed_type;
> > > > + bool is_nullable = false, is_ref_acquired = false;
> > > > const struct btf_param *stub_args, *args;
> > > > struct bpf_ctx_arg_aux *info, *info_buf;
> > > > u32 nargs, arg_no, info_cnt = 0;
> > > > @@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
> > > > /* Skip arguments that is not suffixed with
> > > > * "__nullable".
> > > > */
> > > > - if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > - MAYBE_NULL_SUFFIX))
> > > > + is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > + MAYBE_NULL_SUFFIX);
> > > > + is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > + REF_ACQUIRED_SUFFIX);
> > > > + if (!(is_nullable || is_ref_acquired))
> > > > continue;
> > > >
> > > > /* Should be a pointer to struct */
> > > > @@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
> > > > }
> > > >
> > > > /* Fill the information of the new argument */
> > > > - info->reg_type =
> > > > - PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > > info->btf_id = arg_btf_id;
> > > > info->btf = btf;
> > > > info->offset = offset;
> > > > + if (is_nullable) {
> > > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > > + } else if (is_ref_acquired) {
> > > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > > > + info->ref_acquired = true;
> > > > + }
> > > >
> > > > info++;
> > > > info_cnt++;
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index 8c95392214ed..e462fb4a4598 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > >
> > > > /* this is a pointer to another type */
> > > > for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
> > > > - const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
> > > > + struct bpf_ctx_arg_aux *ctx_arg_info =
> > > > + (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
> > > >
> > > > if (ctx_arg_info->offset == off) {
> > > > if (!ctx_arg_info->btf_id) {
> > > > @@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > > return false;
> > > > }
> > > >
> > > > + if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
> > > > + bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
> > > > + return false;
> > > > + }
> > > > +
> > > > info->reg_type = ctx_arg_info->reg_type;
> > > > info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > > > info->btf_id = ctx_arg_info->btf_id;
> > > > + info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > > > + ctx_arg_info->ref_obj_id = 0;
> > > > return true;
> > >
> > > I think this is fragile. What if the compiler produces two independent
> > > paths in the program which read the skb pointer once?
> > > Technically, the program is still reading the skb pointer once at runtime.
> > > Then you will reset ref_obj_id to 0 when exploring one, and assign as
> > > 0 in the other one, causing errors.
> > > ctx_arg_info appears to be global for the program.
> > >
> > > I think the better way would be to check if ref_obj_id is still part
> > > of the reference state.
> > > If the ref_obj_id has already been dropped from reference_state, then
> > > any loads should get ref_obj_id = 0.
> > > That would happen when dropping or enqueueing the skb into qdisc,
> > > which would (I presume) do release_reference_state(ref_obj_id).
> > > Otherwise, all of them can share the same ref_obj_id. You won't have
> > > to implement "can only read once" logic,
> > > and when you enqueue stuff in the qdisc, all identical copies produced
> > > from different load instructions will be invalidated.
> > > Same ref_obj_id == unique ownership of the same object.
> > > You can already have multiple copies through rX = rY, multiple ctx
> > > loads of skb will produce a similar verifier state.
> > >
> > > So, on entry, assign ctx_arg_info->ref_obj_id uniquely, then on each load:
> > > if reference_state.find(ctx_arg_info->ref_obj_id) == true; then
> > > info->ref_obj_id = ctx_arg_info->ref_obj_id; else info->ref_obj_id =
> > > 0;
> > >
> > > Let me know if I missed something.
> >
> > You are right. The current approach will falsely reject valid programs,
> > and your suggestion makes sense.
>
> Also, I wonder whether when ref_obj_id has been released, we should
> mark the loaded register as unknown scalar, vs skb with ref_obj_id =
> 0?
> Otherwise right now it will take PTR_TO_BTF_ID | PTR_TRUSTED as
> reg_type, and I think verifier will permit reads even if ref_obj_id =
> 0.
If reference_state.find(ctx_arg_info->ref_obj_id) == false, I think we
should just return false from btf_ctx_access and reject the program
right away.
> This will surely be bad once skb is dropped/enqueued, since the
> program should no longer be able to read such memory.
>
> >
> > Thanks,
> > Amery
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs
2024-05-17 1:22 ` Amery Hung
@ 2024-05-17 2:00 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-05-17 2:00 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, 17 May 2024 at 03:22, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 5:24 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 17 May 2024 at 02:17, Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > On Thu, May 16, 2024 at 4:59 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Fri, 10 May 2024 at 21:24, Amery Hung <ameryhung@gmail.com> wrote:
> > > > >
> > > > > This patch supports struct_ops programs that acquire referenced kptrs
> > > > > through arguments. In Qdisc_ops, an skb is passed to ".enqueue" in the
> > > > > first argument. The qdisc becomes the sole owner of the skb and must
> > > > > enqueue or drop the skb. This matches the referenced kptr semantic
> > > > > in bpf. However, the existing practice of acquiring a referenced kptr via
> > > > > a kfunc with KF_ACQUIRE does not play well in this case. Calling kfuncs
> > > > > repeatedly allows the user to acquire multiple references, while there
> > > > > should be only one reference to a unique skb in a qdisc.
> > > > >
> > > > > The solution is to make a struct_ops program automatically acquire a
> > > > > referenced kptr through a tagged argument in the stub function. When
> > > > > tagged with "__ref_acquired" (suggestion for a better name?), a
> > > > > referenced kptr (ref_obj_id > 0) will be acquired automatically when
> > > > > entering the program. In addition, only the first read of the argument
> > > > > is allowed, and it will yield a referenced kptr.
> > > > >
> > > > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > > > ---
> > > > > include/linux/bpf.h | 3 +++
> > > > > kernel/bpf/bpf_struct_ops.c | 17 +++++++++++++----
> > > > > kernel/bpf/btf.c | 10 +++++++++-
> > > > > kernel/bpf/verifier.c | 16 +++++++++++++---
> > > > > 4 files changed, 38 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > > index 9c6a7b8ff963..6aabca1581fe 100644
> > > > > --- a/include/linux/bpf.h
> > > > > +++ b/include/linux/bpf.h
> > > > > @@ -914,6 +914,7 @@ struct bpf_insn_access_aux {
> > > > > struct {
> > > > > struct btf *btf;
> > > > > u32 btf_id;
> > > > > + u32 ref_obj_id;
> > > > > };
> > > > > };
> > > > > struct bpf_verifier_log *log; /* for verbose logs */
> > > > > @@ -1416,6 +1417,8 @@ struct bpf_ctx_arg_aux {
> > > > > enum bpf_reg_type reg_type;
> > > > > struct btf *btf;
> > > > > u32 btf_id;
> > > > > + u32 ref_obj_id;
> > > > > + bool ref_acquired;
> > > > > };
> > > > >
> > > > > struct btf_mod_pair {
> > > > > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> > > > > index 86c7884abaf8..bca8e5936846 100644
> > > > > --- a/kernel/bpf/bpf_struct_ops.c
> > > > > +++ b/kernel/bpf/bpf_struct_ops.c
> > > > > @@ -143,6 +143,7 @@ void bpf_struct_ops_image_free(void *image)
> > > > > }
> > > > >
> > > > > #define MAYBE_NULL_SUFFIX "__nullable"
> > > > > +#define REF_ACQUIRED_SUFFIX "__ref_acquired"
> > > > > #define MAX_STUB_NAME 128
> > > > >
> > > > > /* Return the type info of a stub function, if it exists.
> > > > > @@ -204,6 +205,7 @@ static int prepare_arg_info(struct btf *btf,
> > > > > struct bpf_struct_ops_arg_info *arg_info)
> > > > > {
> > > > > const struct btf_type *stub_func_proto, *pointed_type;
> > > > > + bool is_nullable = false, is_ref_acquired = false;
> > > > > const struct btf_param *stub_args, *args;
> > > > > struct bpf_ctx_arg_aux *info, *info_buf;
> > > > > u32 nargs, arg_no, info_cnt = 0;
> > > > > @@ -240,8 +242,11 @@ static int prepare_arg_info(struct btf *btf,
> > > > > /* Skip arguments that is not suffixed with
> > > > > * "__nullable".
> > > > > */
> > > > > - if (!btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > > - MAYBE_NULL_SUFFIX))
> > > > > + is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > > + MAYBE_NULL_SUFFIX);
> > > > > + is_ref_acquired = btf_param_match_suffix(btf, &stub_args[arg_no],
> > > > > + REF_ACQUIRED_SUFFIX);
> > > > > + if (!(is_nullable || is_ref_acquired))
> > > > > continue;
> > > > >
> > > > > /* Should be a pointer to struct */
> > > > > @@ -269,11 +274,15 @@ static int prepare_arg_info(struct btf *btf,
> > > > > }
> > > > >
> > > > > /* Fill the information of the new argument */
> > > > > - info->reg_type =
> > > > > - PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > > > info->btf_id = arg_btf_id;
> > > > > info->btf = btf;
> > > > > info->offset = offset;
> > > > > + if (is_nullable) {
> > > > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
> > > > > + } else if (is_ref_acquired) {
> > > > > + info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
> > > > > + info->ref_acquired = true;
> > > > > + }
> > > > >
> > > > > info++;
> > > > > info_cnt++;
> > > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > > index 8c95392214ed..e462fb4a4598 100644
> > > > > --- a/kernel/bpf/btf.c
> > > > > +++ b/kernel/bpf/btf.c
> > > > > @@ -6316,7 +6316,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > > >
> > > > > /* this is a pointer to another type */
> > > > > for (i = 0; i < prog->aux->ctx_arg_info_size; i++) {
> > > > > - const struct bpf_ctx_arg_aux *ctx_arg_info = &prog->aux->ctx_arg_info[i];
> > > > > + struct bpf_ctx_arg_aux *ctx_arg_info =
> > > > > + (struct bpf_ctx_arg_aux *)&prog->aux->ctx_arg_info[i];
> > > > >
> > > > > if (ctx_arg_info->offset == off) {
> > > > > if (!ctx_arg_info->btf_id) {
> > > > > @@ -6324,9 +6325,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> > > > > return false;
> > > > > }
> > > > >
> > > > > + if (ctx_arg_info->ref_acquired && !ctx_arg_info->ref_obj_id) {
> > > > > + bpf_log(log, "cannot acquire a reference to context argument offset %u\n", off);
> > > > > + return false;
> > > > > + }
> > > > > +
> > > > > info->reg_type = ctx_arg_info->reg_type;
> > > > > info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> > > > > info->btf_id = ctx_arg_info->btf_id;
> > > > > + info->ref_obj_id = ctx_arg_info->ref_obj_id;
> > > > > + ctx_arg_info->ref_obj_id = 0;
> > > > > return true;
> > > >
> > > > I think this is fragile. What if the compiler produces two independent
> > > > paths in the program which read the skb pointer once?
> > > > Technically, the program is still reading the skb pointer once at runtime.
> > > > Then you will reset ref_obj_id to 0 when exploring one, and assign as
> > > > 0 in the other one, causing errors.
> > > > ctx_arg_info appears to be global for the program.
> > > >
> > > > I think the better way would be to check if ref_obj_id is still part
> > > > of the reference state.
> > > > If the ref_obj_id has already been dropped from reference_state, then
> > > > any loads should get ref_obj_id = 0.
> > > > That would happen when dropping or enqueueing the skb into qdisc,
> > > > which would (I presume) do release_reference_state(ref_obj_id).
> > > > Otherwise, all of them can share the same ref_obj_id. You won't have
> > > > to implement "can only read once" logic,
> > > > and when you enqueue stuff in the qdisc, all identical copies produced
> > > > from different load instructions will be invalidated.
> > > > Same ref_obj_id == unique ownership of the same object.
> > > > You can already have multiple copies through rX = rY, multiple ctx
> > > > loads of skb will produce a similar verifier state.
> > > >
> > > > So, on entry, assign ctx_arg_info->ref_obj_id uniquely, then on each load:
> > > > if reference_state.find(ctx_arg_info->ref_obj_id) == true; then
> > > > info->ref_obj_id = ctx_arg_info->ref_obj_id; else info->ref_obj_id =
> > > > 0;
> > > >
> > > > Let me know if I missed something.
> > >
> > > You are right. The current approach will falsely reject valid programs,
> > > and your suggestion makes sense.
> >
> > Also, I wonder whether when ref_obj_id has been released, we should
> > mark the loaded register as unknown scalar, vs skb with ref_obj_id =
> > 0?
> > Otherwise right now it will take PTR_TO_BTF_ID | PTR_TRUSTED as
> > reg_type, and I think verifier will permit reads even if ref_obj_id =
> > 0.
>
> If reference_state.find(ctx_arg_info->ref_obj_id) == false, I think we
> should just return false from btf_ctx_access and reject the program
> right away.
>
Hm, yeah, that could be another option as well.
Might be better than returning a scalar and confusing people on usage later.
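A rough sketch of that direction, for the record (the helper name, and how
btf_ctx_access() reaches the verifier's reference state, are assumptions, not
code from this series):
/* ref_obj_id is assigned once when verification of the struct_ops
 * program starts; every ctx load of the tagged argument then either
 * inherits it or is rejected once the reference has been released.
 */
static bool ctx_arg_ref_live(struct bpf_func_state *state, u32 ref_obj_id)
{
	int i;

	for (i = 0; i < state->acquired_refs; i++)
		if (state->refs[i].id == ref_obj_id)
			return true;
	return false;
}

/* In btf_ctx_access(), instead of clearing ctx_arg_info->ref_obj_id on
 * the first load:
 *
 *	if (ctx_arg_info->ref_acquired) {
 *		if (!ctx_arg_ref_live(state, ctx_arg_info->ref_obj_id))
 *			return false;	// skb already enqueued/dropped
 *		info->ref_obj_id = ctx_arg_info->ref_obj_id;
 *	}
 *
 * All loads then share one ref_obj_id, so releasing it once invalidates
 * every copy of the pointer.
 */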
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr
2024-05-10 19:23 ` [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-05-17 2:06 ` Amery Hung
2024-05-17 5:30 ` Martin KaFai Lau
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-17 2:06 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, May 10, 2024 at 12:24 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> This patch allows a struct_ops program to return a referenced kptr
> if the struct_ops member has a pointer return type. To make sure the
> pointer returned to the kernel is valid, it needs to be referenced and
> originally come from the kernel. That is, it should be acquired
> through kfuncs or struct_ops "ref_acquired" arguments, but not allocated
> locally. Besides, a null pointer is allowed. Therefore, the kernel caller
> of the struct_ops function consuming the pointer needs to take care of
> the potential null pointer.
>
> The first use case will be Qdisc_ops::dequeue, where a qdisc returns a
> pointer to the skb to be dequeued.
>
> To achieve this, we first allow a reference object to leak through return
> if it is in the return register and the type matches the return type of the
> function. Then, we check whether the pointer to-be-returned is valid in
> check_return_code().
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 06a6edd306fd..2d4a55ead85b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10081,16 +10081,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>
> static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> {
> + enum bpf_prog_type type = resolve_prog_type(env->prog);
> + u32 regno = exception_exit? BPF_REG_1 : BPF_REG_0;
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_func_state *state = cur_func(env);
> + const struct bpf_prog *prog = env->prog;
> + const struct btf_type *ret_type = NULL;
> bool refs_lingering = false;
> + struct btf *btf;
> int i;
>
> if (!exception_exit && state->frameno && !state->in_callback_fn)
> return 0;
>
> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
> + btf = bpf_prog_get_target_btf(prog);
> + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (reg->btf_id != ret_type->type) {
> + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> + btf_type_name(reg->btf, reg->btf_id),
> + btf_type_name(btf, ret_type->type));
> + return -EINVAL;
> + }
> + }
> +
> for (i = 0; i < state->acquired_refs; i++) {
> if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> continue;
> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
> + continue;
> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> state->refs[i].id, state->refs[i].insn_idx);
> refs_lingering = true;
> @@ -15395,12 +15415,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> const char *exit_ctx = "At program exit";
> struct tnum enforce_attach_type_range = tnum_unknown;
> const struct bpf_prog *prog = env->prog;
> - struct bpf_reg_state *reg;
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_retval_range range = retval_range(0, 1);
> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> int err;
> struct bpf_func_state *frame = env->cur_state->frame[0];
> const bool is_subprog = frame->subprogno;
> + struct btf *btf = bpf_prog_get_target_btf(prog);
> + bool st_ops_ret_is_kptr = false;
> + const struct btf_type *t;
>
> /* LSM and struct_ops func-ptr's return type could be "void" */
> if (!is_subprog || frame->in_exception_callback_fn) {
> @@ -15409,10 +15432,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> if (prog->expected_attach_type == BPF_LSM_CGROUP)
> /* See below, can be 0 or 0-1 depending on hook. */
> break;
> - fallthrough;
> + if (!prog->aux->attach_func_proto->type)
> + return 0;
> + break;
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (!prog->aux->attach_func_proto->type)
> return 0;
> +
> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (btf_type_is_ptr(t)) {
> + /* Allow struct_ops programs to return kptr or null if
> + * the return type is a pointer type.
> + * check_reference_leak has ensured the returning kptr
> + * matches the type of the function prototype and is
> + * the only leaking reference. Thus, we can safely return
> + * if the pointer is in its unmodified form
> + */
> + if (reg->type & PTR_TO_BTF_ID)
> + return __check_ptr_off_reg(env, reg, regno, false);
> + st_ops_ret_is_kptr = true;
> + }
> break;
> default:
> break;
> @@ -15434,8 +15473,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> return -EACCES;
> }
>
> - reg = cur_regs(env) + regno;
> -
> if (frame->in_async_callback_fn) {
> /* enforce return zero from async callbacks like timer */
> exit_ctx = "At async callback return";
> @@ -15522,6 +15559,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> case BPF_PROG_TYPE_NETFILTER:
> range = retval_range(NF_DROP, NF_ACCEPT);
> break;
> + case BPF_PROG_TYPE_STRUCT_OPS:
> + if (!st_ops_ret_is_kptr)
> + return 0;
> + range = retval_range(0, 0);
> + break;
Arguments and return values of helpers and kfuncs, where we transition
between a bpf program and the kernel, can be tagged so that we can do
proper checks. struct_ops shares a similar property in that sense,
but currently lacks the ability to tag the return value.
One discussion point was that, here, we assume the returned referenced kptr
"may be null" because that's what Qdisc expects. I think it would make
sense to only allow it when the return is explicitly tagged with
MAY_BE_NULL. How about doing so in the stub function name?
> case BPF_PROG_TYPE_EXT:
> /* freplace program can return anything as its return value
> * depends on the to-be-replaced kernel func or bpf program.
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr
2024-05-17 2:06 ` Amery Hung
@ 2024-05-17 5:30 ` Martin KaFai Lau
0 siblings, 0 replies; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-17 5:30 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw, toke, jhs,
jiri, sdf, xiyou.wangcong, yepeilin.cs, netdev
On 5/16/24 7:06 PM, Amery Hung wrote:
> Arguments and the return for helpers and kfuncs, where we transition
> between bpf program and kernel, can be tagged, so that we can do
> proper checks. struct_ops shares the similar property in that sense,
> but currently lacks the ability to tag the return.
>
> A discussion was that, here we assume the returning referenced kptr
> "may be null" because that's what Qdisc expects.
As a return value of a kernel function, it usually needs to be able to signal an
error. I was thinking "may be null" is actually the common case if ERR_PTR is not used.
> I think it would make sense to only allow it when the return is explicitly
> tagged with MAY_BE_NULL. How about doing so in the stub function name?
I think this is something internal and qdisc is the only case now. The default
is something that could be improved later as a followup and is not necessarily a blocker?
But, yeah, if it is obvious how to make the return ptr expectation/tagging
consistent with what __nullable means for the argument, it would be nice. Tagging
the stub function name like "__ret_null"? That should work. I think it will need
some plumbing from bpf_struct_ops.c to the verifier here.
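Roughly what that could look like at the stub level; the "__ret_null" suffix and
the stub naming are only illustrations of the idea being floated here, nothing
in this series implements it:
/* Hypothetical: tag the dequeue stub's name so bpf_struct_ops.c can
 * record that the returned pointer may be NULL, mirroring what the
 * "__nullable" suffix means for arguments.
 */
static struct sk_buff *bpf_qdisc_dequeue_stub__ret_null(struct Qdisc *sch)
{
	return NULL;
}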
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test
2024-05-10 19:24 ` [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test Amery Hung
@ 2024-05-21 3:15 ` Stanislav Fomichev
2024-05-21 15:03 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Stanislav Fomichev @ 2024-05-21 3:15 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, xiyou.wangcong, yepeilin.cs
On 05/10, Amery Hung wrote:
> This selftest shows a bare minimum fifo qdisc, which simply enqueues skbs
> into the back of a bpf list and dequeues from the front of the list.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> .../selftests/bpf/prog_tests/bpf_qdisc.c | 161 ++++++++++++++++++
> .../selftests/bpf/progs/bpf_qdisc_common.h | 23 +++
> .../selftests/bpf/progs/bpf_qdisc_fifo.c | 83 +++++++++
> 3 files changed, 267 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> new file mode 100644
> index 000000000000..295d0216e70f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> @@ -0,0 +1,161 @@
> +#include <linux/pkt_sched.h>
> +#include <linux/rtnetlink.h>
> +#include <test_progs.h>
> +
> +#include "network_helpers.h"
> +#include "bpf_qdisc_fifo.skel.h"
> +
> +#ifndef ENOTSUPP
> +#define ENOTSUPP 524
> +#endif
> +
> +#define LO_IFINDEX 1
> +
> +static const unsigned int total_bytes = 10 * 1024 * 1024;
> +static int stop;
> +
> +static void *server(void *arg)
> +{
> + int lfd = (int)(long)arg, err = 0, fd;
> + ssize_t nr_sent = 0, bytes = 0;
> + char batch[1500];
> +
> + fd = accept(lfd, NULL, NULL);
> + while (fd == -1) {
> + if (errno == EINTR)
> + continue;
> + err = -errno;
> + goto done;
> + }
> +
> + if (settimeo(fd, 0)) {
> + err = -errno;
> + goto done;
> + }
> +
> + while (bytes < total_bytes && !READ_ONCE(stop)) {
> + nr_sent = send(fd, &batch,
> + MIN(total_bytes - bytes, sizeof(batch)), 0);
> + if (nr_sent == -1 && errno == EINTR)
> + continue;
> + if (nr_sent == -1) {
> + err = -errno;
> + break;
> + }
> + bytes += nr_sent;
> + }
> +
> + ASSERT_EQ(bytes, total_bytes, "send");
> +
> +done:
> + if (fd >= 0)
> + close(fd);
> + if (err) {
> + WRITE_ONCE(stop, 1);
> + return ERR_PTR(err);
> + }
> + return NULL;
> +}
> +
> +static void do_test(char *qdisc)
> +{
> + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
> + .attach_point = BPF_TC_QDISC,
> + .parent = TC_H_ROOT,
> + .handle = 0x8000000,
> + .qdisc = qdisc);
> + struct sockaddr_in6 sa6 = {};
> + ssize_t nr_recv = 0, bytes = 0;
> + int lfd = -1, fd = -1;
> + pthread_t srv_thread;
> + socklen_t addrlen = sizeof(sa6);
> + void *thread_ret;
> + char batch[1500];
> + int err;
> +
> + WRITE_ONCE(stop, 0);
> +
> + err = bpf_tc_hook_create(&hook);
> + if (!ASSERT_OK(err, "attach qdisc"))
> + return;
> +
> + lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> + if (!ASSERT_NEQ(lfd, -1, "socket")) {
> + bpf_tc_hook_destroy(&hook);
> + return;
> + }
> +
> + fd = socket(AF_INET6, SOCK_STREAM, 0);
> + if (!ASSERT_NEQ(fd, -1, "socket")) {
> + bpf_tc_hook_destroy(&hook);
> + close(lfd);
> + return;
> + }
> +
> + if (settimeo(lfd, 0) || settimeo(fd, 0))
> + goto done;
> +
> + err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
> + if (!ASSERT_NEQ(err, -1, "getsockname"))
> + goto done;
> +
> + /* connect to server */
> + err = connect(fd, (struct sockaddr *)&sa6, addrlen);
> + if (!ASSERT_NEQ(err, -1, "connect"))
> + goto done;
> +
> + err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
> + if (!ASSERT_OK(err, "pthread_create"))
> + goto done;
> +
> + /* recv total_bytes */
> + while (bytes < total_bytes && !READ_ONCE(stop)) {
> + nr_recv = recv(fd, &batch,
> + MIN(total_bytes - bytes, sizeof(batch)), 0);
> + if (nr_recv == -1 && errno == EINTR)
> + continue;
> + if (nr_recv == -1)
> + break;
> + bytes += nr_recv;
> + }
> +
> + ASSERT_EQ(bytes, total_bytes, "recv");
> +
> + WRITE_ONCE(stop, 1);
> + pthread_join(srv_thread, &thread_ret);
> + ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
> +
> +done:
> + close(lfd);
> + close(fd);
> +
> + bpf_tc_hook_destroy(&hook);
> + return;
> +}
> +
> +static void test_fifo(void)
> +{
> + struct bpf_qdisc_fifo *fifo_skel;
> + struct bpf_link *link;
> +
> + fifo_skel = bpf_qdisc_fifo__open_and_load();
> + if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
> + return;
> +
> + link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
> + if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
> + bpf_qdisc_fifo__destroy(fifo_skel);
> + return;
> + }
> +
> + do_test("bpf_fifo");
> +
> + bpf_link__destroy(link);
> + bpf_qdisc_fifo__destroy(fifo_skel);
> +}
> +
> +void test_bpf_qdisc(void)
> +{
> + if (test__start_subtest("fifo"))
> + test_fifo();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> new file mode 100644
> index 000000000000..96ab357de28e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> @@ -0,0 +1,23 @@
> +#ifndef _BPF_QDISC_COMMON_H
> +#define _BPF_QDISC_COMMON_H
> +
> +#define NET_XMIT_SUCCESS 0x00
> +#define NET_XMIT_DROP 0x01 /* skb dropped */
> +#define NET_XMIT_CN 0x02 /* congestion notification */
> +
> +#define TC_PRIO_CONTROL 7
> +#define TC_PRIO_MAX 15
> +
> +void bpf_skb_set_dev(struct sk_buff *skb, struct Qdisc *sch) __ksym;
> +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
> +void bpf_skb_release(struct sk_buff *p) __ksym;
> +void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
> +void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
> +bool bpf_qdisc_find_class(struct Qdisc *sch, u32 classid) __ksym;
> +int bpf_qdisc_create_child(struct Qdisc *sch, u32 min,
> + struct netlink_ext_ack *extack) __ksym;
> +int bpf_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, u32 classid,
> + struct bpf_sk_buff_ptr *to_free_list) __ksym;
> +struct sk_buff *bpf_qdisc_dequeue(struct Qdisc *sch, u32 classid) __ksym;
> +
> +#endif
> diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> new file mode 100644
> index 000000000000..433fd9c3639c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> @@ -0,0 +1,83 @@
> +#include <vmlinux.h>
> +#include "bpf_experimental.h"
> +#include "bpf_qdisc_common.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> +
> +private(B) struct bpf_spin_lock q_fifo_lock;
> +private(B) struct bpf_list_head q_fifo __contains_kptr(sk_buff, bpf_list);
> +
> +unsigned int q_limit = 1000;
> +unsigned int q_qlen = 0;
> +
> +SEC("struct_ops/bpf_fifo_enqueue")
> +int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> + struct bpf_sk_buff_ptr *to_free)
> +{
> + q_qlen++;
> + if (q_qlen > q_limit) {
> + bpf_qdisc_skb_drop(skb, to_free);
> + return NET_XMIT_DROP;
> + }
[..]
> + bpf_spin_lock(&q_fifo_lock);
> + bpf_list_excl_push_back(&q_fifo, &skb->bpf_list);
> + bpf_spin_unlock(&q_fifo_lock);
Can you also expand a bit on the locking here and elsewhere? And how it
interplays with TCQ_F_NOLOCK?
As I mentioned at lsfmmbpf, I don't think there is a lot of similar
locking in the existing C implementations? So why do we need it here?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test
2024-05-21 3:15 ` Stanislav Fomichev
@ 2024-05-21 15:03 ` Amery Hung
2024-05-21 17:57 ` Stanislav Fomichev
0 siblings, 1 reply; 46+ messages in thread
From: Amery Hung @ 2024-05-21 15:03 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, xiyou.wangcong, yepeilin.cs
On Mon, May 20, 2024 at 8:15 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 05/10, Amery Hung wrote:
> > This selftest shows a bare minimum fifo qdisc, which simply enqueues skbs
> > into the back of a bpf list and dequeues from the front of the list.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > .../selftests/bpf/prog_tests/bpf_qdisc.c | 161 ++++++++++++++++++
> > .../selftests/bpf/progs/bpf_qdisc_common.h | 23 +++
> > .../selftests/bpf/progs/bpf_qdisc_fifo.c | 83 +++++++++
> > 3 files changed, 267 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > new file mode 100644
> > index 000000000000..295d0216e70f
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > @@ -0,0 +1,161 @@
> > +#include <linux/pkt_sched.h>
> > +#include <linux/rtnetlink.h>
> > +#include <test_progs.h>
> > +
> > +#include "network_helpers.h"
> > +#include "bpf_qdisc_fifo.skel.h"
> > +
> > +#ifndef ENOTSUPP
> > +#define ENOTSUPP 524
> > +#endif
> > +
> > +#define LO_IFINDEX 1
> > +
> > +static const unsigned int total_bytes = 10 * 1024 * 1024;
> > +static int stop;
> > +
> > +static void *server(void *arg)
> > +{
> > + int lfd = (int)(long)arg, err = 0, fd;
> > + ssize_t nr_sent = 0, bytes = 0;
> > + char batch[1500];
> > +
> > + fd = accept(lfd, NULL, NULL);
> > + while (fd == -1) {
> > + if (errno == EINTR)
> > + continue;
> > + err = -errno;
> > + goto done;
> > + }
> > +
> > + if (settimeo(fd, 0)) {
> > + err = -errno;
> > + goto done;
> > + }
> > +
> > + while (bytes < total_bytes && !READ_ONCE(stop)) {
> > + nr_sent = send(fd, &batch,
> > + MIN(total_bytes - bytes, sizeof(batch)), 0);
> > + if (nr_sent == -1 && errno == EINTR)
> > + continue;
> > + if (nr_sent == -1) {
> > + err = -errno;
> > + break;
> > + }
> > + bytes += nr_sent;
> > + }
> > +
> > + ASSERT_EQ(bytes, total_bytes, "send");
> > +
> > +done:
> > + if (fd >= 0)
> > + close(fd);
> > + if (err) {
> > + WRITE_ONCE(stop, 1);
> > + return ERR_PTR(err);
> > + }
> > + return NULL;
> > +}
> > +
> > +static void do_test(char *qdisc)
> > +{
> > + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
> > + .attach_point = BPF_TC_QDISC,
> > + .parent = TC_H_ROOT,
> > + .handle = 0x8000000,
> > + .qdisc = qdisc);
> > + struct sockaddr_in6 sa6 = {};
> > + ssize_t nr_recv = 0, bytes = 0;
> > + int lfd = -1, fd = -1;
> > + pthread_t srv_thread;
> > + socklen_t addrlen = sizeof(sa6);
> > + void *thread_ret;
> > + char batch[1500];
> > + int err;
> > +
> > + WRITE_ONCE(stop, 0);
> > +
> > + err = bpf_tc_hook_create(&hook);
> > + if (!ASSERT_OK(err, "attach qdisc"))
> > + return;
> > +
> > + lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> > + if (!ASSERT_NEQ(lfd, -1, "socket")) {
> > + bpf_tc_hook_destroy(&hook);
> > + return;
> > + }
> > +
> > + fd = socket(AF_INET6, SOCK_STREAM, 0);
> > + if (!ASSERT_NEQ(fd, -1, "socket")) {
> > + bpf_tc_hook_destroy(&hook);
> > + close(lfd);
> > + return;
> > + }
> > +
> > + if (settimeo(lfd, 0) || settimeo(fd, 0))
> > + goto done;
> > +
> > + err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
> > + if (!ASSERT_NEQ(err, -1, "getsockname"))
> > + goto done;
> > +
> > + /* connect to server */
> > + err = connect(fd, (struct sockaddr *)&sa6, addrlen);
> > + if (!ASSERT_NEQ(err, -1, "connect"))
> > + goto done;
> > +
> > + err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
> > + if (!ASSERT_OK(err, "pthread_create"))
> > + goto done;
> > +
> > + /* recv total_bytes */
> > + while (bytes < total_bytes && !READ_ONCE(stop)) {
> > + nr_recv = recv(fd, &batch,
> > + MIN(total_bytes - bytes, sizeof(batch)), 0);
> > + if (nr_recv == -1 && errno == EINTR)
> > + continue;
> > + if (nr_recv == -1)
> > + break;
> > + bytes += nr_recv;
> > + }
> > +
> > + ASSERT_EQ(bytes, total_bytes, "recv");
> > +
> > + WRITE_ONCE(stop, 1);
> > + pthread_join(srv_thread, &thread_ret);
> > + ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
> > +
> > +done:
> > + close(lfd);
> > + close(fd);
> > +
> > + bpf_tc_hook_destroy(&hook);
> > + return;
> > +}
> > +
> > +static void test_fifo(void)
> > +{
> > + struct bpf_qdisc_fifo *fifo_skel;
> > + struct bpf_link *link;
> > +
> > + fifo_skel = bpf_qdisc_fifo__open_and_load();
> > + if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
> > + return;
> > +
> > + link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
> > + if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
> > + bpf_qdisc_fifo__destroy(fifo_skel);
> > + return;
> > + }
> > +
> > + do_test("bpf_fifo");
> > +
> > + bpf_link__destroy(link);
> > + bpf_qdisc_fifo__destroy(fifo_skel);
> > +}
> > +
> > +void test_bpf_qdisc(void)
> > +{
> > + if (test__start_subtest("fifo"))
> > + test_fifo();
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > new file mode 100644
> > index 000000000000..96ab357de28e
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > @@ -0,0 +1,23 @@
> > +#ifndef _BPF_QDISC_COMMON_H
> > +#define _BPF_QDISC_COMMON_H
> > +
> > +#define NET_XMIT_SUCCESS 0x00
> > +#define NET_XMIT_DROP 0x01 /* skb dropped */
> > +#define NET_XMIT_CN 0x02 /* congestion notification */
> > +
> > +#define TC_PRIO_CONTROL 7
> > +#define TC_PRIO_MAX 15
> > +
> > +void bpf_skb_set_dev(struct sk_buff *skb, struct Qdisc *sch) __ksym;
> > +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
> > +void bpf_skb_release(struct sk_buff *p) __ksym;
> > +void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
> > +void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
> > +bool bpf_qdisc_find_class(struct Qdisc *sch, u32 classid) __ksym;
> > +int bpf_qdisc_create_child(struct Qdisc *sch, u32 min,
> > + struct netlink_ext_ack *extack) __ksym;
> > +int bpf_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, u32 classid,
> > + struct bpf_sk_buff_ptr *to_free_list) __ksym;
> > +struct sk_buff *bpf_qdisc_dequeue(struct Qdisc *sch, u32 classid) __ksym;
> > +
> > +#endif
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> > new file mode 100644
> > index 000000000000..433fd9c3639c
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> > @@ -0,0 +1,83 @@
> > +#include <vmlinux.h>
> > +#include "bpf_experimental.h"
> > +#include "bpf_qdisc_common.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> > +
> > +private(B) struct bpf_spin_lock q_fifo_lock;
> > +private(B) struct bpf_list_head q_fifo __contains_kptr(sk_buff, bpf_list);
> > +
> > +unsigned int q_limit = 1000;
> > +unsigned int q_qlen = 0;
> > +
> > +SEC("struct_ops/bpf_fifo_enqueue")
> > +int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> > + struct bpf_sk_buff_ptr *to_free)
> > +{
> > + q_qlen++;
> > + if (q_qlen > q_limit) {
> > + bpf_qdisc_skb_drop(skb, to_free);
> > + return NET_XMIT_DROP;
> > + }
>
> [..]
>
> > + bpf_spin_lock(&q_fifo_lock);
> > + bpf_list_excl_push_back(&q_fifo, &skb->bpf_list);
> > + bpf_spin_unlock(&q_fifo_lock);
>
> Can you also expand a bit on the locking here and elsewhere? And how it
> interplays with TCQ_F_NOLOCK?
>
> As I mentioned at lsfmmbpf, I don't think there is a lot of similar
> locking in the existing C implementations? So why do we need it here?
The locks are required to prevent catastrophic concurrent accesses to
bpf graphs. The verifier checks 1) that there is a spin_lock in the
same struct as the list head or rbtree root, and 2) that the lock is held
when accessing the list or rbtree.
Since we have the safety guarantee provided by the verifier, I think
there is an opportunity to allow qdisc users to set TCQ_F_NOLOCK. I will
check whether the qdisc kfuncs are TCQ_F_NOLOCK-safe though. Let me know if I
missed anything.
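As a concrete illustration of that constraint (a sketch, not a selftest in the
series, reusing the declarations from bpf_qdisc_common.h and the private()
macro from the fifo qdisc above), the same push without taking the paired lock
is rejected at load time:
private(B) struct bpf_spin_lock q_fifo_lock;
private(B) struct bpf_list_head q_fifo __contains_kptr(sk_buff, bpf_list);

SEC("struct_ops/bpf_fifo_enqueue_nolock")
int BPF_PROG(bpf_fifo_enqueue_nolock, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	/* Missing bpf_spin_lock(&q_fifo_lock) around the list access:
	 * the verifier refuses to load this program because q_fifo is
	 * paired with q_fifo_lock in the same datasec.
	 */
	bpf_list_excl_push_back(&q_fifo, &skb->bpf_list);
	return NET_XMIT_SUCCESS;
}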
Thanks,
Amery
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test
2024-05-21 15:03 ` Amery Hung
@ 2024-05-21 17:57 ` Stanislav Fomichev
0 siblings, 0 replies; 46+ messages in thread
From: Stanislav Fomichev @ 2024-05-21 17:57 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, xiyou.wangcong, yepeilin.cs
On Tue, May 21, 2024 at 8:03 AM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 8:15 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 05/10, Amery Hung wrote:
> > > This selftest shows a bare minimum fifo qdisc, which simply enqueues skbs
> > > into the back of a bpf list and dequeues from the front of the list.
> > >
> > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > ---
> > > .../selftests/bpf/prog_tests/bpf_qdisc.c | 161 ++++++++++++++++++
> > > .../selftests/bpf/progs/bpf_qdisc_common.h | 23 +++
> > > .../selftests/bpf/progs/bpf_qdisc_fifo.c | 83 +++++++++
> > > 3 files changed, 267 insertions(+)
> > > create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > > create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > > create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > > new file mode 100644
> > > index 000000000000..295d0216e70f
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
> > > @@ -0,0 +1,161 @@
> > > +#include <linux/pkt_sched.h>
> > > +#include <linux/rtnetlink.h>
> > > +#include <test_progs.h>
> > > +
> > > +#include "network_helpers.h"
> > > +#include "bpf_qdisc_fifo.skel.h"
> > > +
> > > +#ifndef ENOTSUPP
> > > +#define ENOTSUPP 524
> > > +#endif
> > > +
> > > +#define LO_IFINDEX 1
> > > +
> > > +static const unsigned int total_bytes = 10 * 1024 * 1024;
> > > +static int stop;
> > > +
> > > +static void *server(void *arg)
> > > +{
> > > + int lfd = (int)(long)arg, err = 0, fd;
> > > + ssize_t nr_sent = 0, bytes = 0;
> > > + char batch[1500];
> > > +
> > > + fd = accept(lfd, NULL, NULL);
> > > + while (fd == -1) {
> > > + if (errno == EINTR)
> > > + continue;
> > > + err = -errno;
> > > + goto done;
> > > + }
> > > +
> > > + if (settimeo(fd, 0)) {
> > > + err = -errno;
> > > + goto done;
> > > + }
> > > +
> > > + while (bytes < total_bytes && !READ_ONCE(stop)) {
> > > + nr_sent = send(fd, &batch,
> > > + MIN(total_bytes - bytes, sizeof(batch)), 0);
> > > + if (nr_sent == -1 && errno == EINTR)
> > > + continue;
> > > + if (nr_sent == -1) {
> > > + err = -errno;
> > > + break;
> > > + }
> > > + bytes += nr_sent;
> > > + }
> > > +
> > > + ASSERT_EQ(bytes, total_bytes, "send");
> > > +
> > > +done:
> > > + if (fd >= 0)
> > > + close(fd);
> > > + if (err) {
> > > + WRITE_ONCE(stop, 1);
> > > + return ERR_PTR(err);
> > > + }
> > > + return NULL;
> > > +}
> > > +
> > > +static void do_test(char *qdisc)
> > > +{
> > > + DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
> > > + .attach_point = BPF_TC_QDISC,
> > > + .parent = TC_H_ROOT,
> > > + .handle = 0x8000000,
> > > + .qdisc = qdisc);
> > > + struct sockaddr_in6 sa6 = {};
> > > + ssize_t nr_recv = 0, bytes = 0;
> > > + int lfd = -1, fd = -1;
> > > + pthread_t srv_thread;
> > > + socklen_t addrlen = sizeof(sa6);
> > > + void *thread_ret;
> > > + char batch[1500];
> > > + int err;
> > > +
> > > + WRITE_ONCE(stop, 0);
> > > +
> > > + err = bpf_tc_hook_create(&hook);
> > > + if (!ASSERT_OK(err, "attach qdisc"))
> > > + return;
> > > +
> > > + lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> > > + if (!ASSERT_NEQ(lfd, -1, "socket")) {
> > > + bpf_tc_hook_destroy(&hook);
> > > + return;
> > > + }
> > > +
> > > + fd = socket(AF_INET6, SOCK_STREAM, 0);
> > > + if (!ASSERT_NEQ(fd, -1, "socket")) {
> > > + bpf_tc_hook_destroy(&hook);
> > > + close(lfd);
> > > + return;
> > > + }
> > > +
> > > + if (settimeo(lfd, 0) || settimeo(fd, 0))
> > > + goto done;
> > > +
> > > + err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
> > > + if (!ASSERT_NEQ(err, -1, "getsockname"))
> > > + goto done;
> > > +
> > > + /* connect to server */
> > > + err = connect(fd, (struct sockaddr *)&sa6, addrlen);
> > > + if (!ASSERT_NEQ(err, -1, "connect"))
> > > + goto done;
> > > +
> > > + err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
> > > + if (!ASSERT_OK(err, "pthread_create"))
> > > + goto done;
> > > +
> > > + /* recv total_bytes */
> > > + while (bytes < total_bytes && !READ_ONCE(stop)) {
> > > + nr_recv = recv(fd, &batch,
> > > + MIN(total_bytes - bytes, sizeof(batch)), 0);
> > > + if (nr_recv == -1 && errno == EINTR)
> > > + continue;
> > > + if (nr_recv == -1)
> > > + break;
> > > + bytes += nr_recv;
> > > + }
> > > +
> > > + ASSERT_EQ(bytes, total_bytes, "recv");
> > > +
> > > + WRITE_ONCE(stop, 1);
> > > + pthread_join(srv_thread, &thread_ret);
> > > + ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
> > > +
> > > +done:
> > > + close(lfd);
> > > + close(fd);
> > > +
> > > + bpf_tc_hook_destroy(&hook);
> > > + return;
> > > +}
> > > +
> > > +static void test_fifo(void)
> > > +{
> > > + struct bpf_qdisc_fifo *fifo_skel;
> > > + struct bpf_link *link;
> > > +
> > > + fifo_skel = bpf_qdisc_fifo__open_and_load();
> > > + if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
> > > + return;
> > > +
> > > + link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
> > > + if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
> > > + bpf_qdisc_fifo__destroy(fifo_skel);
> > > + return;
> > > + }
> > > +
> > > + do_test("bpf_fifo");
> > > +
> > > + bpf_link__destroy(link);
> > > + bpf_qdisc_fifo__destroy(fifo_skel);
> > > +}
> > > +
> > > +void test_bpf_qdisc(void)
> > > +{
> > > + if (test__start_subtest("fifo"))
> > > + test_fifo();
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > > new file mode 100644
> > > index 000000000000..96ab357de28e
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > > @@ -0,0 +1,23 @@
> > > +#ifndef _BPF_QDISC_COMMON_H
> > > +#define _BPF_QDISC_COMMON_H
> > > +
> > > +#define NET_XMIT_SUCCESS 0x00
> > > +#define NET_XMIT_DROP 0x01 /* skb dropped */
> > > +#define NET_XMIT_CN 0x02 /* congestion notification */
> > > +
> > > +#define TC_PRIO_CONTROL 7
> > > +#define TC_PRIO_MAX 15
> > > +
> > > +void bpf_skb_set_dev(struct sk_buff *skb, struct Qdisc *sch) __ksym;
> > > +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
> > > +void bpf_skb_release(struct sk_buff *p) __ksym;
> > > +void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
> > > +void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
> > > +bool bpf_qdisc_find_class(struct Qdisc *sch, u32 classid) __ksym;
> > > +int bpf_qdisc_create_child(struct Qdisc *sch, u32 min,
> > > + struct netlink_ext_ack *extack) __ksym;
> > > +int bpf_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, u32 classid,
> > > + struct bpf_sk_buff_ptr *to_free_list) __ksym;
> > > +struct sk_buff *bpf_qdisc_dequeue(struct Qdisc *sch, u32 classid) __ksym;
> > > +
> > > +#endif
> > > diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> > > new file mode 100644
> > > index 000000000000..433fd9c3639c
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
> > > @@ -0,0 +1,83 @@
> > > +#include <vmlinux.h>
> > > +#include "bpf_experimental.h"
> > > +#include "bpf_qdisc_common.h"
> > > +
> > > +char _license[] SEC("license") = "GPL";
> > > +
> > > +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> > > +
> > > +private(B) struct bpf_spin_lock q_fifo_lock;
> > > +private(B) struct bpf_list_head q_fifo __contains_kptr(sk_buff, bpf_list);
> > > +
> > > +unsigned int q_limit = 1000;
> > > +unsigned int q_qlen = 0;
> > > +
> > > +SEC("struct_ops/bpf_fifo_enqueue")
> > > +int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> > > + struct bpf_sk_buff_ptr *to_free)
> > > +{
> > > + q_qlen++;
> > > + if (q_qlen > q_limit) {
> > > + bpf_qdisc_skb_drop(skb, to_free);
> > > + return NET_XMIT_DROP;
> > > + }
> >
> > [..]
> >
> > > + bpf_spin_lock(&q_fifo_lock);
> > > + bpf_list_excl_push_back(&q_fifo, &skb->bpf_list);
> > > + bpf_spin_unlock(&q_fifo_lock);
> >
> > Can you also expand a bit on the locking here and elsewhere? And how it
> > interplays with TCQ_F_NOLOCK?
> >
> > As I mentioned at lsfmmbpf, I don't think there is a lot of similar
> > locking in the existing C implementations? So why do we need it here?
>
> The locks are required to prevent catastrophic concurrent accesses to
> bpf graphs. The verifier will check 1) if there is a spin_lock in the
> same struct with a list head or rbtree root, and 2) the lock is held
> when accessing the list or rbtree.
>
> Since we have the safety guarantee provided by the verifier, I think
> there is an opportunity to allow qdisc users to set TCQ_F_NOLOCK. I will
> check if qdisc kfuncs are TCQ_F_NOLOCK safe though. Let me know if I
> missed anything.
Ah, so these locking constraints come from the verifier....
In this case, yes, it would be nice to have special treatment for bpf
qdisc (or maybe allow passing TCQ_F_NOLOCK somehow). If the verifier
enforces the locking for the underlying data structures, we should try
to remove the ones from the generic qdisc layer.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs
2024-05-10 19:24 ` [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
@ 2024-05-22 23:55 ` Martin KaFai Lau
2024-05-23 1:06 ` Amery Hung
0 siblings, 1 reply; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-22 23:55 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 5/10/24 12:24 PM, Amery Hung wrote:
> +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> +BTF_ID_FLAGS(func, bpf_skb_set_dev)
> +BTF_ID_FLAGS(func, bpf_skb_get_hash)
> +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
Thanks for working on the bpf qdisc!
I want to see if we can shrink the set and focus on the core pieces first.
The above kfuncs look ok. bpf_skb_set_dev() will need some thought, but my
understanding is that it would also not be needed if the patch set did not
reuse the rb_node in the sk_buff?
> +BTF_ID_FLAGS(func, bpf_skb_tc_classify)
> +BTF_ID_FLAGS(func, bpf_qdisc_create_child)
> +BTF_ID_FLAGS(func, bpf_qdisc_find_class)
> +BTF_ID_FLAGS(func, bpf_qdisc_enqueue, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_dequeue, KF_ACQUIRE | KF_RET_NULL)
How about starting with classless qdiscs first?
I also wonder if the class/hierarchy can be implemented with
bpf_map/bpf_rb_root/bpf_list_head alone. That aside, the patch set shows that
a classful qdisc is something tangible with kfuncs. The classless qdisc bpf
support does not seem to depend on it. Unless there is something on the
classful side that really needs to be finalized at this point, I would leave
it out of the core pieces for now and focus on classless. Does that make
sense?
> +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> + .owner = THIS_MODULE,
> + .set = &bpf_qdisc_kfunc_ids,
> +};
> +
> +BTF_ID_LIST(skb_kfunc_dtor_ids)
> +BTF_ID(struct, sk_buff)
> +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> +
> static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> .get_func_proto = bpf_qdisc_get_func_proto,
> .is_valid_access = bpf_qdisc_is_valid_access,
> @@ -558,6 +781,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
>
> static int __init bpf_qdisc_kfunc_init(void)
> {
> - return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> + int ret;
> + const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> + {
> + .btf_id = skb_kfunc_dtor_ids[0],
> + .kfunc_btf_id = skb_kfunc_dtor_ids[1]
> + },
> + };
> +
> + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> + ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> + ARRAY_SIZE(skb_kfunc_dtors),
> + THIS_MODULE);
> + ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +
> + return ret;
> }
> late_initcall(bpf_qdisc_kfunc_init);
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs
2024-05-22 23:55 ` Martin KaFai Lau
@ 2024-05-23 1:06 ` Amery Hung
0 siblings, 0 replies; 46+ messages in thread
From: Amery Hung @ 2024-05-23 1:06 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Wed, May 22, 2024 at 4:56 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 5/10/24 12:24 PM, Amery Hung wrote:
> > +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> > +BTF_ID_FLAGS(func, bpf_skb_set_dev)
> > +BTF_ID_FLAGS(func, bpf_skb_get_hash)
> > +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
>
> Thanks for working on the bpf qdisc!
>
> I want to see if we can shrink the set and focus on the core pieces first.
>
> The above kfuncs look ok. bpf_skb_set_dev() will need some thoughts but my
> understanding is that it is also not needed if the patch set did not reuse the
> rb_node in the sk_buff?
Correct. I will remove this kfunc and fall back to the v7 approach
(allocating local objects to hold skb kptrs) in the next version.
Support for adding skbs natively to bpf graphs can come at a later
time.
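For the record, the v7-style indirect enqueue I will go back to looks
roughly like this (sketch only; it reuses the includes, the private()
macro, and the kfuncs from the fifo selftest in this series, and trims
error handling):

/* local object that owns the skb kptr while it sits in the list */
struct skb_node {
	struct sk_buff __kptr *skb;
	struct bpf_list_node node;
};

private(B) struct bpf_spin_lock q_fifo_lock;
private(B) struct bpf_list_head q_fifo __contains(skb_node, node);

SEC("struct_ops/bpf_fifo_enqueue")
int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	struct skb_node *skbn;

	skbn = bpf_obj_new(typeof(*skbn));
	if (!skbn) {
		bpf_qdisc_skb_drop(skb, to_free);
		return NET_XMIT_DROP;
	}

	/* move the referenced skb kptr into the local object */
	skb = bpf_kptr_xchg(&skbn->skb, skb);
	if (skb)
		bpf_qdisc_skb_drop(skb, to_free);

	bpf_spin_lock(&q_fifo_lock);
	bpf_list_push_back(&q_fifo, &skbn->node);
	bpf_spin_unlock(&q_fifo_lock);

	return NET_XMIT_SUCCESS;
}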
>
> > +BTF_ID_FLAGS(func, bpf_skb_tc_classify)
> > +BTF_ID_FLAGS(func, bpf_qdisc_create_child)
> > +BTF_ID_FLAGS(func, bpf_qdisc_find_class)
> > +BTF_ID_FLAGS(func, bpf_qdisc_enqueue, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_dequeue, KF_ACQUIRE | KF_RET_NULL)
>
> How about starting with classless qdisc first?
>
> I also wonder if the class/hierarchy can be implemented in the
> bpf_map/bpf_rb_root/bpf_list_head alone. That aside, the patch set shows that
> classful qdisc is something tangible with kfuncs. The classless qdisc bpf
> support does not seem to depend on it. Unless there is something on the classful
> side that really needs to be finalized at this point, I would leave it out from
> the core pieces for now and focus on classless. Does it make sense?
>
That totally makes sense! I will simplify the patchset in the next version
by making it classless. As you said, with bpf maps and graphs,
sophisticated and hierarchical queues can already be implemented in a
single bpf qdisc.
Just to sum up, to make the patchset landable, I will:
1) fix and keep the first 4 struct_ops patches that support acquiring/
   returning referenced kptrs
2) defer the support for adding skbs to bpf graphs
3) defer Qdisc_class_ops and the related kfuncs
> > +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> > +
> > +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> > + .owner = THIS_MODULE,
> > + .set = &bpf_qdisc_kfunc_ids,
> > +};
> > +
> > +BTF_ID_LIST(skb_kfunc_dtor_ids)
> > +BTF_ID(struct, sk_buff)
> > +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> > +
> > static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> > .get_func_proto = bpf_qdisc_get_func_proto,
> > .is_valid_access = bpf_qdisc_is_valid_access,
> > @@ -558,6 +781,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
> >
> > static int __init bpf_qdisc_kfunc_init(void)
> > {
> > - return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > + int ret;
> > + const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> > + {
> > + .btf_id = skb_kfunc_dtor_ids[0],
> > + .kfunc_btf_id = skb_kfunc_dtor_ids[1]
> > + },
> > + };
> > +
> > + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> > + ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> > + ARRAY_SIZE(skb_kfunc_dtors),
> > + THIS_MODULE);
> > + ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > +
> > + return ret;
> > }
> > late_initcall(bpf_qdisc_kfunc_init);
>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-10 19:24 ` [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2024-05-24 6:24 ` Martin KaFai Lau
2024-05-24 7:40 ` Toke Høiland-Jørgensen
2024-05-24 19:33 ` Alexei Starovoitov
0 siblings, 2 replies; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-24 6:24 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 5/10/24 12:24 PM, Amery Hung wrote:
> This test implements a more sophisticated qdisc using bpf. The bpf fair-
> queueing (fq) qdisc gives each flow an equal chance to transmit data. It
> also respects the timestamp of skb for rate limiting. The implementation
> does not prevent hash collision of flows nor does it recycle flows.
Did you hit some issue handling flow collisions (just curious whether there
are missing pieces needed to do this)?
> The bpf fq also takes the chance to communicate packet drop information
> with a bpf clsact EDT rate limiter using bpf maps. With the info, the
> rate limiter can compensate for the delay caused by packet drops in qdisc
> to maintain the throughput.
>
> diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
> new file mode 100644
> index 000000000000..5118237da9e4
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
> @@ -0,0 +1,660 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include "bpf_experimental.h"
> +#include "bpf_qdisc_common.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define NSEC_PER_USEC 1000L
> +#define NSEC_PER_SEC 1000000000L
> +#define PSCHED_MTU (64 * 1024 + 14)
> +
> +#define NUM_QUEUE_LOG 10
> +#define NUM_QUEUE (1 << NUM_QUEUE_LOG)
> +#define PRIO_QUEUE (NUM_QUEUE + 1)
> +#define COMP_DROP_PKT_DELAY 1
> +#define THROTTLED 0xffffffffffffffff
> +
> +/* fq configuration */
> +__u64 q_flow_refill_delay = 40 * 10000; //40us
> +__u64 q_horizon = 10ULL * NSEC_PER_SEC;
> +__u32 q_initial_quantum = 10 * PSCHED_MTU;
> +__u32 q_quantum = 2 * PSCHED_MTU;
> +__u32 q_orphan_mask = 1023;
> +__u32 q_flow_plimit = 100;
> +__u32 q_plimit = 10000;
> +__u32 q_timer_slack = 10 * NSEC_PER_USEC;
> +bool q_horizon_drop = true;
> +
> +bool q_compensate_tstamp;
> +bool q_random_drop;
> +
> +unsigned long time_next_delayed_flow = ~0ULL;
> +unsigned long unthrottle_latency_ns = 0ULL;
> +unsigned long ktime_cache = 0;
> +unsigned long dequeue_now;
> +unsigned int fq_qlen = 0;
I suspect some of these globals would be more natural if they were stored
privately in an individual Qdisc instance, i.e. qdisc_priv(), e.g. as in the
sch_mq setup.
A high-level idea is to allow the SEC(".struct_ops.link") to specify its own
Qdisc_ops.priv_size.
The bpf prog could use it as a simple u8 array to write anything into, but
the verifier can't learn much from that. It would be more useful if it could
work like map_value(s) to the verifier, so that the verifier can also see
bpf_rb_root/bpf_list_head/bpf_spin_lock, etc.
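To make it concrete, the kind of thing I have in mind would let the prog
declare something like the following and have the verifier treat the priv
area like a map value (entirely hypothetical: neither the typed priv area
nor the bpf_qdisc_priv() accessor exists in this series; fq_flow_node is the
struct from the fq selftest quoted just below):

/* hypothetical: the verifier would see this like a map value, and
 * sizeof(struct bpf_fq_priv) would become Qdisc_ops.priv_size
 */
struct bpf_fq_priv {
	struct bpf_spin_lock lock;
	struct bpf_list_head new_flows __contains(fq_flow_node, list_node);
	u64 time_next_delayed_flow;
	u32 qlen;
};

/* hypothetical kfunc returning the typed priv area of this sch */
struct bpf_fq_priv *bpf_qdisc_priv(struct Qdisc *sch) __ksym;

/* e.g. in .enqueue, per-instance state would replace the globals */
static int fq_enqueue_flow(struct Qdisc *sch, struct fq_flow_node *flow)
{
	struct bpf_fq_priv *priv = bpf_qdisc_priv(sch);

	if (!priv) {
		bpf_obj_drop(flow);
		return -1;
	}

	priv->qlen++;
	bpf_spin_lock(&priv->lock);
	bpf_list_push_back(&priv->new_flows, &flow->list_node);
	bpf_spin_unlock(&priv->lock);
	return 0;
}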
> +
> +struct fq_flow_node {
> + u32 hash;
> + int credit;
> + u32 qlen;
> + u32 socket_hash;
> + u64 age;
> + u64 time_next_packet;
> + struct bpf_list_node list_node;
> + struct bpf_rb_node rb_node;
> + struct bpf_rb_root queue __contains_kptr(sk_buff, bpf_rbnode);
> + struct bpf_spin_lock lock;
> + struct bpf_refcount refcount;
> +};
> +
> +struct dequeue_nonprio_ctx {
> + bool dequeued;
> + u64 expire;
> +};
> +
> +struct fq_stashed_flow {
> + struct fq_flow_node __kptr *flow;
> +};
> +
> +struct stashed_skb {
> + struct sk_buff __kptr *skb;
> +};
> +
> +/* [NUM_QUEUE] for TC_PRIO_CONTROL
> + * [0, NUM_QUEUE - 1] for other flows
> + */
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __type(key, __u32);
> + __type(value, struct fq_stashed_flow);
> + __uint(max_entries, NUM_QUEUE + 1);
> +} fq_stashed_flows SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __type(key, __u32);
> + __type(value, __u64);
> + __uint(pinning, LIBBPF_PIN_BY_NAME);
> + __uint(max_entries, 16);
> +} rate_map SEC(".maps");
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __type(key, __u32);
> + __type(value, __u64);
> + __uint(pinning, LIBBPF_PIN_BY_NAME);
> + __uint(max_entries, 16);
> +} comp_map SEC(".maps");
> +
> +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> +
> +private(A) struct bpf_spin_lock fq_delayed_lock;
> +private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
> +
> +private(B) struct bpf_spin_lock fq_new_flows_lock;
> +private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
> +
> +private(C) struct bpf_spin_lock fq_old_flows_lock;
> +private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
> +
> +private(D) struct bpf_spin_lock fq_stashed_skb_lock;
> +private(D) struct bpf_list_head fq_stashed_skb __contains_kptr(sk_buff, bpf_list);
[ ... ]
> +SEC("struct_ops/bpf_fq_enqueue")
> +int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> + struct bpf_sk_buff_ptr *to_free)
> +{
> + struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);
> + u64 time_to_send, jiffies, delay_ns, *comp_ns, *rate;
> + struct fq_flow_node *flow = NULL, *flow_copy;
> + struct fq_stashed_flow *sflow;
> + u32 hash, daddr, sk_hash;
> + bool connected;
> +
> + if (q_random_drop & (bpf_get_prandom_u32() > ~0U * 0.90))
> + goto drop;
> +
> + if (fq_qlen >= q_plimit)
> + goto drop;
> +
> + if (!skb->tstamp) {
> + time_to_send = ktime_cache = bpf_ktime_get_ns();
> + } else {
> + if (fq_packet_beyond_horizon(skb)) {
> + ktime_cache = bpf_ktime_get_ns();
> + if (fq_packet_beyond_horizon(skb)) {
> + if (q_horizon_drop)
> + goto drop;
> +
> + skb->tstamp = ktime_cache + q_horizon;
> + }
> + }
> + time_to_send = skb->tstamp;
> + }
> +
> + if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
> + goto drop;
> +
> + flow = bpf_kptr_xchg(&sflow->flow, flow);
> + if (!flow)
> + goto drop; //unexpected
> +
> + if (hash != PRIO_QUEUE) {
> + if (connected && flow->socket_hash != sk_hash) {
> + flow->credit = q_initial_quantum;
> + flow->socket_hash = sk_hash;
> + if (fq_flow_is_throttled(flow)) {
> + /* mark the flow as undetached. The reference to the
> + * throttled flow in fq_delayed will be removed later.
> + */
> + flow_copy = bpf_refcount_acquire(flow);
> + flow_copy->age = 0;
> + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
> + }
> + flow->time_next_packet = 0ULL;
> + }
> +
> + if (flow->qlen >= q_flow_plimit) {
> + bpf_kptr_xchg_back(&sflow->flow, flow);
> + goto drop;
> + }
> +
> + if (fq_flow_is_detached(flow)) {
> + if (connected)
> + flow->socket_hash = sk_hash;
> +
> + flow_copy = bpf_refcount_acquire(flow);
> +
> + jiffies = bpf_jiffies64();
> + if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
> + if (flow_copy->credit < q_quantum)
> + flow_copy->credit = q_quantum;
> + }
> + flow_copy->age = 0;
> + fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
> + }
> + }
> +
> + skb->tstamp = time_to_send;
> +
> + bpf_spin_lock(&flow->lock);
> + bpf_rbtree_excl_add(&flow->queue, &skb->bpf_rbnode, skb_tstamp_less);
> + bpf_spin_unlock(&flow->lock);
> +
> + flow->qlen++;
> + bpf_kptr_xchg_back(&sflow->flow, flow);
> +
> + fq_qlen++;
> + return NET_XMIT_SUCCESS;
> +
> +drop:
> + if (q_compensate_tstamp) {
> + bpf_probe_read_kernel(&daddr, sizeof(daddr), &iph->daddr);
> + rate = bpf_map_lookup_elem(&rate_map, &daddr);
> + comp_ns = bpf_map_lookup_elem(&comp_map, &daddr);
> + if (rate && comp_ns) {
> + delay_ns = (u64)qdisc_skb_cb(skb)->pkt_len * NSEC_PER_SEC / (*rate);
> + __sync_fetch_and_add(comp_ns, delay_ns);
> + }
> + }
> + bpf_qdisc_skb_drop(skb, to_free);
> + return NET_XMIT_DROP;
> +}
[ ... ]
> +SEC("struct_ops/bpf_fq_dequeue")
> +struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
> +{
> + struct dequeue_nonprio_ctx cb_ctx = {};
> + struct sk_buff *skb = NULL;
> +
> + skb = fq_dequeue_prio();
> + if (skb) {
> + bpf_skb_set_dev(skb, sch);
> + return skb;
> + }
> +
> + ktime_cache = dequeue_now = bpf_ktime_get_ns();
> + fq_check_throttled();
> + bpf_loop(q_plimit, fq_dequeue_nonprio_flows, &cb_ctx, 0);
> +
> + skb = get_stashed_skb();
> +
> + if (skb) {
> + bpf_skb_set_dev(skb, sch);
> + return skb;
> + }
> +
> + if (cb_ctx.expire)
> + bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q_timer_slack);
> +
> + return NULL;
> +}
The enqueue and dequeue are using bpf maps (e.g. an arraymap) or global vars
(also backed by an arraymap). Potentially, a map can be shared by different
qdisc instances (sch), and those instances could also be attached to
different net devices. Not sure if there is a potential issue here, e.g. the
bpf_fq_reset below, or a bpf prog dequeuing an skb with a different skb->dev.
> +
> +static int
> +fq_reset_flows(u32 index, void *ctx)
> +{
> + struct bpf_list_node *node;
> + struct fq_flow_node *flow;
> +
> + bpf_spin_lock(&fq_new_flows_lock);
> + node = bpf_list_pop_front(&fq_new_flows);
> + bpf_spin_unlock(&fq_new_flows_lock);
> + if (!node) {
> + bpf_spin_lock(&fq_old_flows_lock);
> + node = bpf_list_pop_front(&fq_old_flows);
> + bpf_spin_unlock(&fq_old_flows_lock);
> + if (!node)
> + return 1;
> + }
> +
> + flow = container_of(node, struct fq_flow_node, list_node);
> + bpf_obj_drop(flow);
> +
> + return 0;
> +}
> +
> +static int
> +fq_reset_stashed_flows(u32 index, void *ctx)
> +{
> + struct fq_flow_node *flow = NULL;
> + struct fq_stashed_flow *sflow;
> +
> + sflow = bpf_map_lookup_elem(&fq_stashed_flows, &index);
> + if (!sflow)
> + return 0;
> +
> + flow = bpf_kptr_xchg(&sflow->flow, flow);
> + if (flow)
> + bpf_obj_drop(flow);
> +
> + return 0;
> +}
> +
> +SEC("struct_ops/bpf_fq_reset")
> +void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
> +{
> + bool unset_all = true;
> + fq_qlen = 0;
> + bpf_loop(NUM_QUEUE + 1, fq_reset_stashed_flows, NULL, 0);
> + bpf_loop(NUM_QUEUE, fq_reset_flows, NULL, 0);
> + bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
I am not sure we can depend on a bpf prog to do cleanup/reset. What if it
fails to drop some skbs, which could hold up resources like sk/dev/netns?
A quick thought is that the struct_ops knows all of its bpf progs and each
prog tracks the maps it uses in prog->aux->used_maps, so the kernel could
clean them up. However, a map may still be in use by other Qdisc instances.
It may be easier if skbs could only be enqueued somewhere in qdisc_priv(),
so that each qdisc cleans up its own qdisc_priv during reset.
> + return;
> +}
> +
> +SEC(".struct_ops")
> +struct Qdisc_ops fq = {
> + .enqueue = (void *)bpf_fq_enqueue,
> + .dequeue = (void *)bpf_fq_dequeue,
> + .reset = (void *)bpf_fq_reset,
> + .id = "bpf_fq",
> +};
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-24 6:24 ` Martin KaFai Lau
@ 2024-05-24 7:40 ` Toke Høiland-Jørgensen
2024-05-26 1:08 ` Martin KaFai Lau
2024-05-24 19:33 ` Alexei Starovoitov
1 sibling, 1 reply; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-24 7:40 UTC (permalink / raw)
To: Martin KaFai Lau, Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
Martin KaFai Lau <martin.lau@linux.dev> writes:
> [ ... ]
>
>> +SEC("struct_ops/bpf_fq_dequeue")
>> +struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
>> +{
>> + struct dequeue_nonprio_ctx cb_ctx = {};
>> + struct sk_buff *skb = NULL;
>> +
>> + skb = fq_dequeue_prio();
>> + if (skb) {
>> + bpf_skb_set_dev(skb, sch);
>> + return skb;
>> + }
>> +
>> + ktime_cache = dequeue_now = bpf_ktime_get_ns();
>> + fq_check_throttled();
>> + bpf_loop(q_plimit, fq_dequeue_nonprio_flows, &cb_ctx, 0);
>> +
>> + skb = get_stashed_skb();
>> +
>> + if (skb) {
>> + bpf_skb_set_dev(skb, sch);
>> + return skb;
>> + }
>> +
>> + if (cb_ctx.expire)
>> + bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q_timer_slack);
>> +
>> + return NULL;
>> +}
>
> The enqueue and dequeue are using the bpf map (e.g. arraymap) or global var
> (also an arraymap). Potentially, the map can be shared by different qdisc
> instances (sch) and they could be attached to different net devices also. Not
> sure if there is a potential issue? e.g. the bpf_fq_reset below.
> or a bpf prog dequeue a skb with a different skb->dev.
I think behaviour like this is potentially quite interesting and will
allow some neat optimisations (skipping a redirect to a different
interface and just directly enqueueing it to a different place comes to
mind). However, as you point out it may lead to weird things like a
mismatched skb->dev, so if we allow this we should make sure that the
kernel will disallow (or fix) such behaviour. I'm not sure how difficult
that is; for instance, I *think* the mismatched skb->dev can lead to
bugs, but I'm not quite sure.
So maybe it's better to disallow "crossing over" like this, and relax
that restriction later if/when we have a concrete use case?
-Toke
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-24 6:24 ` Martin KaFai Lau
2024-05-24 7:40 ` Toke Høiland-Jørgensen
@ 2024-05-24 19:33 ` Alexei Starovoitov
2024-05-24 20:54 ` Martin KaFai Lau
1 sibling, 1 reply; 46+ messages in thread
From: Alexei Starovoitov @ 2024-05-24 19:33 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Amery Hung, Network Development, bpf, yangpeihao, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Kui-Feng Lee,
Toke Høiland-Jørgensen, Jamal Hadi Salim, Jiri Pirko,
Stanislav Fomichev, Cong Wang, Peilin Ye
On Thu, May 23, 2024 at 11:25 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> > +
> > +unsigned long time_next_delayed_flow = ~0ULL;
> > +unsigned long unthrottle_latency_ns = 0ULL;
> > +unsigned long ktime_cache = 0;
> > +unsigned long dequeue_now;
> > +unsigned int fq_qlen = 0;
>
> I suspect some of these globals may be more natural if it is stored private to
> an individual Qdisc instance. i.e. qdisc_priv(). e.g. in the sch_mq setup.
>
> A high level idea is to allow the SEC(".struct_ops.link") to specify its own
> Qdisc_ops.priv_size.
>
> The bpf prog could use it as a simple u8 array memory area to write anything but
> the verifier can't learn a lot from it. It will be more useful if it can work
> like map_value(s) to the verifier such that the verifier can also see the
> bpf_rb_root/bpf_list_head/bpf_spin_lock...etc.
Qdisc_ops.priv_size is too qdisc specific.
imo using globals here is fine. bpf prog can use hash map or arena
to store per-netdev or per-qdisc data.
The less custom things the better.
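For instance, keying per-instance state by the qdisc handle would already
keep instances from stepping on each other. A sketch (it assumes sch->handle
is readable from the struct_ops prog and that the usual map helpers are
available there; names are illustrative):

struct fq_instance_state {
	u64 time_next_delayed_flow;
	u64 dequeue_now;
	u32 qlen;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u32);			/* qdisc handle */
	__type(value, struct fq_instance_state);
	__uint(max_entries, 256);
} fq_state SEC(".maps");

static struct fq_instance_state *fq_state_of(struct Qdisc *sch)
{
	struct fq_instance_state init = {};
	__u32 key = sch->handle;

	/* create on first use, then look it up */
	bpf_map_update_elem(&fq_state, &key, &init, BPF_NOEXIST);
	return bpf_map_lookup_elem(&fq_state, &key);
}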
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-24 19:33 ` Alexei Starovoitov
@ 2024-05-24 20:54 ` Martin KaFai Lau
0 siblings, 0 replies; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-24 20:54 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Amery Hung, Network Development, bpf, yangpeihao, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Kui-Feng Lee,
Toke Høiland-Jørgensen, Jamal Hadi Salim, Jiri Pirko,
Stanislav Fomichev, Cong Wang, Peilin Ye
On 5/24/24 12:33 PM, Alexei Starovoitov wrote:
> On Thu, May 23, 2024 at 11:25 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>>> +
>>> +unsigned long time_next_delayed_flow = ~0ULL;
>>> +unsigned long unthrottle_latency_ns = 0ULL;
>>> +unsigned long ktime_cache = 0;
>>> +unsigned long dequeue_now;
>>> +unsigned int fq_qlen = 0;
>>
>> I suspect some of these globals may be more natural if it is stored private to
>> an individual Qdisc instance. i.e. qdisc_priv(). e.g. in the sch_mq setup.
>>
>> A high level idea is to allow the SEC(".struct_ops.link") to specify its own
>> Qdisc_ops.priv_size.
>>
>> The bpf prog could use it as a simple u8 array memory area to write anything but
>> the verifier can't learn a lot from it. It will be more useful if it can work
>> like map_value(s) to the verifier such that the verifier can also see the
>> bpf_rb_root/bpf_list_head/bpf_spin_lock...etc.
>
> Qdisc_ops.priv_size is too qdisc specific.
Instead of priv_size, maybe something like a bpf_local_storage for Qdisc is
closer to how other kernel objects (sk/task/cgrp) are doing it now, like
bpf_sk_storage that goes away with the sk. It needs a storage that goes away
with the Qdisc.
I was thinking of using priv_size to work like a bpf_local_storage without
the pointer array indirection, by pre-defining all the map_values it wants
to store in the Qdisc, so the total size of the pre-defined map_values would
be the priv_size. However, I haven't thought through what it should look
like from bpf_prog.c down to the kernel. It would be an optimization of the
bpf_local_storage.
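To make the usage side concrete, it would mirror sk_storage, roughly like
below (entirely hypothetical: neither BPF_MAP_TYPE_QDISC_STORAGE nor
bpf_qdisc_storage_get() exists anywhere; the shape is copied from
bpf_sk_storage):

struct fq_qdisc_storage {
	u64 time_next_delayed_flow;
	u32 qlen;
};

struct {
	__uint(type, BPF_MAP_TYPE_QDISC_STORAGE);	/* hypothetical map type */
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct fq_qdisc_storage);
} qdisc_storage SEC(".maps");

/* hypothetical kfunc, modeled on bpf_sk_storage_get(): storage is created
 * on demand and freed together with the Qdisc
 */
struct fq_qdisc_storage *
bpf_qdisc_storage_get(void *map, struct Qdisc *sch, void *value,
		      __u64 flags) __ksym;

/* usage from .enqueue/.dequeue */
static struct fq_qdisc_storage *fq_storage(struct Qdisc *sch)
{
	return bpf_qdisc_storage_get(&qdisc_storage, sch, NULL,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);
}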
> imo using globals here is fine. bpf prog can use hash map or arena
> to store per-netdev or per-qdisc data.
> The less custom things the better.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-24 7:40 ` Toke Høiland-Jørgensen
@ 2024-05-26 1:08 ` Martin KaFai Lau
2024-05-27 10:09 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 46+ messages in thread
From: Martin KaFai Lau @ 2024-05-26 1:08 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 5/24/24 12:40 AM, Toke Høiland-Jørgensen wrote:
> I think behaviour like this is potentially quite interesting and will
> allow some neat optimisations (skipping a redirect to a different
> interface and just directly enqueueing it to a different place comes to
hmm... I am not sure it is a good/safe optimization. From looking at
skb_do_redirect, there are quite a few things bypassed from
__dev_queue_xmit up to the final dequeue of the redirected dev. I don't
know whether all of them are dev independent.
> mind). However, as you point out it may lead to weird things like a
> mismatched skb->dev, so if we allow this we should make sure that the
> kernel will disallow (or fix) such behaviour.
I have been thinking about the skb->dev "fix", but the thought was
originally for the bpf_skb_set_dev() use case in patch 14.
Note that the struct_ops ".dequeue" is actually realized by a fentry
trampoline (call it the fentry ".dequeue"). Maybe an extra fexit ".dequeue"
could be used here. The fexit ".dequeue" would be called after the fentry
".dequeue", and it has the function arguments (sch here, which has the
correct dev) and the return value (skb) from the fentry ".dequeue". This
will be an extra call (to the fexit ".dequeue") and very specific to this
use case, but it may be the least evil solution I can think of now...
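A sketch of the kernel-side post-condition fixup I have in mind (the name is
made up and nothing like this exists in the series; it just makes the
skb->dev fixup unconditional instead of relying on the prog to call
bpf_skb_set_dev()):

/* would run from the fexit ".dequeue" trampoline, with the original
 * argument (sch) and the bpf prog's return value (skb)
 */
static void bpf_qdisc_dequeue_fixup(struct Qdisc *sch, struct sk_buff *skb)
{
	if (skb && skb->dev != qdisc_dev(sch))
		skb->dev = qdisc_dev(sch);
}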
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest
2024-05-26 1:08 ` Martin KaFai Lau
@ 2024-05-27 10:09 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-05-27 10:09 UTC (permalink / raw)
To: Martin KaFai Lau, Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
Martin KaFai Lau <martin.lau@linux.dev> writes:
> On 5/24/24 12:40 AM, Toke Høiland-Jørgensen wrote:
>> I think behaviour like this is potentially quite interesting and will
>> allow some neat optimisations (skipping a redirect to a different
>> interface and just directly enqueueing it to a different place comes to
>
> hmm... I am not sure it is a good/safe optimization. From looking at
> skb_do_redirect, there are quite a few things bypassed from
> __dev_queue_xmit up to the final dequeue of the redirected dev. I don't
> know whether all of them are dev independent.
There are certainly footguns, but as long as they are of the "break the
data path" variety and not the "immediately crash the kernel" variety
that may be OK. After all, you can already do plenty of convoluted
things with BPF that will break things. And glancing through the
redirect code, nothing immediately jumps out as something that will
definitely crash, AFAICT.
However, it does feel a bit risky, so I am also totally fine with
disallowing this until someone comes up with a concrete use case where
it would be beneficial :)
>> mind). However, as you point out it may lead to weird things like a
>> mismatched skb->dev, so if we allow this we should make sure that the
>> kernel will disallow (or fix) such behaviour.
>
> Have been thinking about the skb->dev "fix" but the thought is originally for
> the bpf_skb_set_dev() use case in patch 14.
>
> Note that the struct_ops ".dequeue" is actually realized by a fentry trampoline
> (call it fentry ".dequeue"). May be using an extra fexit ".dequeue" here. The
> fexit ".dequeue" will be called after the fentry ".dequeue". The fexit
> ".dequeue" has the function arguments (sch here that has the correct dev) and
> the return value (skb) from the fentry ".dequeue". This will be an extra call
> (to the fexit ".dequeue") and very specific to this use case but may be the less
> evil solution I can think of now...
That's an interesting idea, certainly! Relying on fexit functions
to do specific sanity checks/fixups after a BPF program has run
(enforcing/checking post-conditions, basically) does not seem totally
crazy to me, and may have other applications :)
-Toke
^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread
Thread overview: 46+ messages
2024-05-10 19:23 [RFC PATCH v8 00/20] bpf qdisc Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 01/20] bpf: Support passing referenced kptr to struct_ops programs Amery Hung
2024-05-16 23:59 ` Kumar Kartikeya Dwivedi
2024-05-17 0:17 ` Amery Hung
2024-05-17 0:23 ` Kumar Kartikeya Dwivedi
2024-05-17 1:22 ` Amery Hung
2024-05-17 2:00 ` Kumar Kartikeya Dwivedi
2024-05-10 19:23 ` [RFC PATCH v8 02/20] selftests/bpf: Test referenced kptr arguments of " Amery Hung
2024-05-10 21:33 ` Kui-Feng Lee
2024-05-10 22:16 ` Amery Hung
2024-05-16 23:14 ` Amery Hung
2024-05-16 23:43 ` Martin KaFai Lau
2024-05-17 0:54 ` Amery Hung
2024-05-17 1:07 ` Martin KaFai Lau
2024-05-10 19:23 ` [RFC PATCH v8 03/20] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
2024-05-17 2:06 ` Amery Hung
2024-05-17 5:30 ` Martin KaFai Lau
2024-05-10 19:23 ` [RFC PATCH v8 04/20] selftests/bpf: Test returning kptr from struct_ops programs Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 05/20] bpf: Generate btf_struct_metas for kernel BTF Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 06/20] bpf: Recognize kernel types as graph values Amery Hung
2024-05-10 19:23 ` [RFC PATCH v8 07/20] bpf: Allow adding kernel objects to collections Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 08/20] selftests/bpf: Test adding kernel object to bpf graph Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 09/20] bpf: Find special BTF fields in union Amery Hung
2024-05-16 23:37 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 10/20] bpf: Introduce exclusive-ownership list and rbtree nodes Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 11/20] bpf: Allow adding exclusive nodes to bpf list and rbtree Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 12/20] selftests/bpf: Modify linked_list tests to work with macro-ified removes Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 13/20] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 14/20] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
2024-05-22 23:55 ` Martin KaFai Lau
2024-05-23 1:06 ` Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 15/20] bpf: net_sched: Allow more optional methods in Qdisc_ops Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 16/20] libbpf: Support creating and destroying qdisc Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 17/20] selftests: Add a basic fifo qdisc test Amery Hung
2024-05-21 3:15 ` Stanislav Fomichev
2024-05-21 15:03 ` Amery Hung
2024-05-21 17:57 ` Stanislav Fomichev
2024-05-10 19:24 ` [RFC PATCH v8 18/20] selftests: Add a bpf fq qdisc to selftest Amery Hung
2024-05-24 6:24 ` Martin KaFai Lau
2024-05-24 7:40 ` Toke Høiland-Jørgensen
2024-05-26 1:08 ` Martin KaFai Lau
2024-05-27 10:09 ` Toke Høiland-Jørgensen
2024-05-24 19:33 ` Alexei Starovoitov
2024-05-24 20:54 ` Martin KaFai Lau
2024-05-10 19:24 ` [RFC PATCH v8 19/20] selftests: Add a bpf netem " Amery Hung
2024-05-10 19:24 ` [RFC PATCH v8 20/20] selftests: Add a prio bpf qdisc Amery Hung