netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v2 00/14] bpf qdisc
@ 2024-12-20 19:55 Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
                   ` (15 more replies)
  0 siblings, 16 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

Hi all,

This patchset aims to support implementing qdiscs using bpf struct_ops.
This version takes a step back and only implements the minimum support
for bpf qdisc: 1) adding skbs to bpf_list and bpf_rbtree directly and
2) classful qdisc support are deferred to future patchsets.

* Overview *

This series supports implementing qdiscs using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement the core qdisc logic in bpf while the mundane
parts are handled for them. In addition, the ability to easily communicate
between the qdisc and other components will also open up opportunities for
new applications and optimizations.

* struct_ops changes *

To make struct_ops work better with bpf qdisc, two new changes are
introduced to bpf specifically for struct_ops programs. First, we
introduce the "__ref" suffix for arguments in stub functions in patches
1-2. It allows Qdisc_ops->enqueue to acquire a unique referenced kptr to
the skb argument. Through the reference object tracking mechanism in
the verifier, we can make sure that the acquired skb will be either
enqueued or dropped. Besides, no duplicate references can be acquired.
Second, we allow a referenced kptr to be returned from struct_ops
programs so that we can return an skb naturally. This is done and tested
in patches 3 and 4.
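
To give a rough idea of how these two mechanisms surface to a bpf qdisc,
here is a minimal sketch (program shapes only; the kfunc declaration and
the NET_XMIT_* constants are assumed to come from the selftests'
bpf_qdisc_common.h added later in this series):

void bpf_qdisc_skb_drop(struct sk_buff *skb,
			struct bpf_sk_buff_ptr *to_free) __ksym;

/* .enqueue receives the skb as a referenced kptr; the verifier makes sure
 * it is either queued somewhere or dropped before the program exits.
 */
SEC("struct_ops/sketch_enqueue")
int BPF_PROG(sketch_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	bpf_qdisc_skb_drop(skb, to_free); /* releases the acquired reference */
	return NET_XMIT_DROP;
}

/* .dequeue may return either NULL or a referenced skb kptr */
SEC("struct_ops/sketch_dequeue")
struct sk_buff *BPF_PROG(sketch_dequeue, struct Qdisc *sch)
{
	return NULL;
}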

* Performance of bpf qdisc *

We tested several bpf qdiscs included in the selftests and their in-tree
counterparts to give a sense of the performance of qdiscs implemented
in bpf.

The implementation of bpf_fq is fairly complex and slightly different
from fq, so in the discussion below we only compare the two fifo qdiscs.
bpf_fq implements the same fair queueing algorithm as fq, but without
flow hash collision avoidance and garbage collection of inactive flows.
bpf_fifo uses a single bpf_list as a queue instead of the three priority
queues used in pfifo_fast. The time complexity of the two fifos should
nevertheless be similar since the queue selection time is negligible.
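
For reference, the single queue in bpf_fifo is structured roughly as below
(a simplified sketch of the fifo selftest added later in this series; the
private() macro and the __kptr/__contains annotations follow the usual bpf
selftest conventions):

struct skb_node {
	struct sk_buff __kptr *skb;
	struct bpf_list_node node;
};

#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))

/* A single lock-protected bpf_list holds every enqueued skb, each wrapped
 * in an allocated skb_node, since enqueueing skbs into bpf_list directly
 * is deferred to a future patchset.
 */
private(A) struct bpf_spin_lock q_fifo_lock;
private(A) struct bpf_list_head fifo __contains(skb_node, node);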

Test setup:

    client -> qdisc ------------->  server
    ~~~~~~~~~~~~~~~                 ~~~~~~
    nested VM1 @ DC1               VM2 @ DC2

Throughput: iperf3 -t 600, 5 times

      Qdisc        Average (GBits/sec)
    ----------     -------------------
    pfifo_fast       12.52 ± 0.26
    bpf_fifo         11.72 ± 0.32 
    fq               10.24 ± 0.13
    bpf_fq           11.92 ± 0.64 

Latency: sockperf pp --tcp -t 600, 5 times

      Qdisc        Average (usec)
    ----------     --------------
    pfifo_fast      244.58 ± 7.93
    bpf_fifo        244.92 ± 15.22
    fq              234.30 ± 19.25
    bpf_fq          221.34 ± 10.76

Looking at the two fifo qdiscs, the 6.4% drop in throughput of the bpf
implementation is consistent with the previous observation (v8 throughput
test on a loopback device). This should be mitigated in the future by
supporting adding skbs to bpf_list or bpf_rbtree directly.

* Clean up skb in bpf qdisc during reset *

The current implementation relies on bpf qdisc implementors to correctly
release skbs in queues (bpf graphs or maps) in .reset, which might not be
a safe thing to do. The solution, as Martin has suggested, would be to
support private data in struct_ops. This can also help simplify the
implementation of qdiscs that work with mq. For example, qdiscs in the
selftests mostly use global data. Therefore, even if users add multiple
qdisc instances under mq, they would still share the same queue.


---
v2: Rebase to bpf-next/master

    Patch 1-4
        Remove the use of ctx_arg_info->ref_obj_id when acquiring referenced kptr from struct_ops arg
        Improve type comparison when checking kptr return from struct_ops
        Simplify selftests with test_loader and nomerge attribute
    Patch 5
        Remove redundant checks in qdisc_init
        Disallow tail_call
    Patch 6
        Improve kfunc ops availability filter by
        i) Checking struct_ops->type
        ii) Defining op-specific kfunc set
    Patch 7
        Search and add bpf_kfunc_desc after gen_prologue/epilogue
    Patch 8
        Use gen_prologue/epilogue to init/cancel watchdog timer
    Patch 12
        Mark read-only func arg and struct member const in libbpf

v1:
    Fix struct_ops referenced kptr acquire/return mechanisms
    Allow creating dynptr from skb
    Add bpf qdisc kfunc filter
    Support updating bstats and qstats
    Update qdiscs in selftest to update stats
    Add gc, handle hash collision and fix bugs in fq_bpf
    Link: https://lore.kernel.org/bpf/20241213232958.2388301-1-amery.hung@bytedance.com/

past RFCs

v9: Drop classful qdisc operations and kfuncs
    Drop support of enqueuing skb directly to bpf_rbtree/list
    Link: https://lore.kernel.org/bpf/20240714175130.4051012-1-amery.hung@bytedance.com/

v8: Implement support of bpf qdisc using struct_ops
    Allow struct_ops to acquire referenced kptr via argument
    Allow struct_ops to release and return referenced kptr
    Support enqueuing sk_buff to bpf_rbtree/list
    Move examples from samples to selftests
    Add a classful qdisc selftest
    Link: https://lore.kernel.org/netdev/20240510192412.3297104-15-amery.hung@bytedance.com/

v7: Reference skb using kptr to sk_buff instead of __sk_buff
    Use the new bpf rbtree/link to for skb queues
    Add reset and init programs
    Add a bpf fq qdisc sample
    Add a bpf netem qdisc sample
    Link: https://lore.kernel.org/netdev/cover.1705432850.git.amery.hung@bytedance.com/

v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Amery Hung (14):
  bpf: Support getting referenced kptr from struct_ops argument
  selftests/bpf: Test referenced kptr arguments of struct_ops programs
  bpf: Allow struct_ops prog to return referenced kptr
  selftests/bpf: Test returning referenced kptr from struct_ops programs
  bpf: net_sched: Support implementation of Qdisc_ops in bpf
  bpf: net_sched: Add basic bpf qdisc kfuncs
  bpf: Search and add kfuncs in struct_ops prologue and epilogue
  bpf: net_sched: Add a qdisc watchdog timer
  bpf: net_sched: Support updating bstats
  bpf: net_sched: Support updating qstats
  bpf: net_sched: Allow writing to more Qdisc members
  libbpf: Support creating and destroying qdisc
  selftests: Add a basic fifo qdisc test
  selftests: Add a bpf fq qdisc to selftest

 include/linux/bpf.h                           |   3 +
 include/linux/btf.h                           |   1 +
 include/linux/filter.h                        |  10 +
 kernel/bpf/bpf_struct_ops.c                   |  40 +-
 kernel/bpf/btf.c                              |   7 +-
 kernel/bpf/verifier.c                         |  98 ++-
 net/sched/Kconfig                             |  12 +
 net/sched/Makefile                            |   1 +
 net/sched/bpf_qdisc.c                         | 443 +++++++++++
 net/sched/sch_api.c                           |   7 +-
 net/sched/sch_generic.c                       |   3 +-
 tools/lib/bpf/libbpf.h                        |   5 +-
 tools/lib/bpf/netlink.c                       |  20 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/bpf_qdisc.c      | 185 +++++
 .../prog_tests/test_struct_ops_kptr_return.c  |  16 +
 .../prog_tests/test_struct_ops_refcounted.c   |  12 +
 .../selftests/bpf/progs/bpf_qdisc_common.h    |  27 +
 .../selftests/bpf/progs/bpf_qdisc_fifo.c      | 117 +++
 .../selftests/bpf/progs/bpf_qdisc_fq.c        | 726 ++++++++++++++++++
 .../bpf/progs/struct_ops_kptr_return.c        |  30 +
 ...uct_ops_kptr_return_fail__invalid_scalar.c |  26 +
 .../struct_ops_kptr_return_fail__local_kptr.c |  34 +
 ...uct_ops_kptr_return_fail__nonzero_offset.c |  25 +
 .../struct_ops_kptr_return_fail__wrong_type.c |  30 +
 .../bpf/progs/struct_ops_refcounted.c         |  31 +
 ...ruct_ops_refcounted_fail__global_subprog.c |  37 +
 .../struct_ops_refcounted_fail__ref_leak.c    |  22 +
 .../selftests/bpf/test_kmods/bpf_testmod.c    |  15 +
 .../selftests/bpf/test_kmods/bpf_testmod.h    |   6 +
 30 files changed, 1964 insertions(+), 26 deletions(-)
 create mode 100644 net/sched/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c

-- 
2.47.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-23  9:57   ` Eduard Zingerman
  2024-12-20 19:55 ` [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Allow struct_ops programs to acquire referenced kptrs from arguments
by directly reading the argument.

The verifier will acquire a reference for a struct_ops argument tagged
with "__ref" in the stub function at the beginning of the main program.
The user will be able to access the referenced kptr directly by reading
the context as long as it has not been released by the program.

This new mechanism to acquire referenced kptrs (compared to the existing
"kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
In the first use case, Qdisc_ops, an skb is passed to .enqueue as the
first argument. This mechanism provides a natural way for users to get a
referenced kptr in the .enqueue struct_ops program and makes sure that the
qdisc will always enqueue or drop the skb.
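
For example, on the kernel side, tagging the argument in the stub function
is all that is needed (this is the bpf_testmod stub added in the next
patch); the struct_ops program can then read the task pointer from its
context and must release it, e.g. with bpf_task_release(), before exiting:

static int bpf_testmod_ops__test_refcounted(int dummy,
					    struct task_struct *task__ref)
{
	return 0;
}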

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/bpf.h         |  2 ++
 kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
 kernel/bpf/btf.c            |  3 ++-
 kernel/bpf/verifier.c       | 37 ++++++++++++++++++++++++++++++++++---
 4 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index feda0ce90f5a..2556f8043276 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -968,6 +968,7 @@ struct bpf_insn_access_aux {
 		struct {
 			struct btf *btf;
 			u32 btf_id;
+			u32 ref_obj_id;
 		};
 	};
 	struct bpf_verifier_log *log; /* for verbose logs */
@@ -1481,6 +1482,7 @@ struct bpf_ctx_arg_aux {
 	enum bpf_reg_type reg_type;
 	struct btf *btf;
 	u32 btf_id;
+	bool refcounted;
 };
 
 struct btf_mod_pair {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 606efe32485a..d9e0af00580b 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -146,6 +146,7 @@ void bpf_struct_ops_image_free(void *image)
 }
 
 #define MAYBE_NULL_SUFFIX "__nullable"
+#define REFCOUNTED_SUFFIX "__ref"
 #define MAX_STUB_NAME 128
 
 /* Return the type info of a stub function, if it exists.
@@ -207,9 +208,11 @@ static int prepare_arg_info(struct btf *btf,
 			    struct bpf_struct_ops_arg_info *arg_info)
 {
 	const struct btf_type *stub_func_proto, *pointed_type;
+	bool is_nullable = false, is_refcounted = false;
 	const struct btf_param *stub_args, *args;
 	struct bpf_ctx_arg_aux *info, *info_buf;
 	u32 nargs, arg_no, info_cnt = 0;
+	const char *suffix;
 	u32 arg_btf_id;
 	int offset;
 
@@ -241,12 +244,19 @@ static int prepare_arg_info(struct btf *btf,
 	info = info_buf;
 	for (arg_no = 0; arg_no < nargs; arg_no++) {
 		/* Skip arguments that is not suffixed with
-		 * "__nullable".
+		 * "__nullable or __ref".
 		 */
-		if (!btf_param_match_suffix(btf, &stub_args[arg_no],
-					    MAYBE_NULL_SUFFIX))
+		is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
+						     MAYBE_NULL_SUFFIX);
+		is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
+						       REFCOUNTED_SUFFIX);
+		if (!is_nullable && !is_refcounted)
 			continue;
 
+		if (is_nullable)
+			suffix = MAYBE_NULL_SUFFIX;
+		else if (is_refcounted)
+			suffix = REFCOUNTED_SUFFIX;
 		/* Should be a pointer to struct */
 		pointed_type = btf_type_resolve_ptr(btf,
 						    args[arg_no].type,
@@ -254,7 +264,7 @@ static int prepare_arg_info(struct btf *btf,
 		if (!pointed_type ||
 		    !btf_type_is_struct(pointed_type)) {
 			pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
-				st_ops_name, member_name, MAYBE_NULL_SUFFIX);
+				st_ops_name, member_name, suffix);
 			goto err_out;
 		}
 
@@ -272,11 +282,15 @@ static int prepare_arg_info(struct btf *btf,
 		}
 
 		/* Fill the information of the new argument */
-		info->reg_type =
-			PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
 		info->btf_id = arg_btf_id;
 		info->btf = btf;
 		info->offset = offset;
+		if (is_nullable) {
+			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
+		} else if (is_refcounted) {
+			info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
+			info->refcounted = true;
+		}
 
 		info++;
 		info_cnt++;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 28246c59e12e..c2f4f84e539d 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6546,7 +6546,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 	const struct btf_param *args;
 	bool ptr_err_raw_tp = false;
 	const char *tag_value;
-	u32 nr_args, arg;
+	u32 nr_args, arg, nr_ref_args = 0;
 	int i, ret;
 
 	if (off % 8) {
@@ -6682,6 +6682,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
 			info->reg_type = ctx_arg_info->reg_type;
 			info->btf = ctx_arg_info->btf ? : btf_vmlinux;
 			info->btf_id = ctx_arg_info->btf_id;
+			info->ref_obj_id = ctx_arg_info->refcounted ? ++nr_ref_args : 0;
 			return true;
 		}
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f27274e933e5..26305571e377 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1542,6 +1542,17 @@ static void release_reference_state(struct bpf_verifier_state *state, int idx)
 	return;
 }
 
+static bool find_reference_state(struct bpf_verifier_state *state, int ptr_id)
+{
+	int i;
+
+	for (i = 0; i < state->acquired_refs; i++)
+		if (state->refs[i].id == ptr_id)
+			return true;
+
+	return false;
+}
+
 static int release_lock_state(struct bpf_verifier_state *state, int type, int id, void *ptr)
 {
 	int i;
@@ -5980,7 +5991,8 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
 /* check access to 'struct bpf_context' fields.  Supports fixed offsets only */
 static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
 			    enum bpf_access_type t, enum bpf_reg_type *reg_type,
-			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx)
+			    struct btf **btf, u32 *btf_id, bool *is_retval, bool is_ldsx,
+			    u32 *ref_obj_id)
 {
 	struct bpf_insn_access_aux info = {
 		.reg_type = *reg_type,
@@ -6002,8 +6014,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 		*is_retval = info.is_retval;
 
 		if (base_type(*reg_type) == PTR_TO_BTF_ID) {
+			if (info.ref_obj_id &&
+			    !find_reference_state(env->cur_state, info.ref_obj_id)) {
+				verbose(env, "invalid bpf_context access off=%d. Reference may already be released\n",
+					off);
+				return -EACCES;
+			}
+
 			*btf = info.btf;
 			*btf_id = info.btf_id;
+			*ref_obj_id = info.ref_obj_id;
 		} else {
 			env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
 		}
@@ -7369,7 +7389,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		struct bpf_retval_range range;
 		enum bpf_reg_type reg_type = SCALAR_VALUE;
 		struct btf *btf = NULL;
-		u32 btf_id = 0;
+		u32 btf_id = 0, ref_obj_id = 0;
 
 		if (t == BPF_WRITE && value_regno >= 0 &&
 		    is_pointer_value(env, value_regno)) {
@@ -7382,7 +7402,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			return err;
 
 		err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
-				       &btf_id, &is_retval, is_ldsx);
+				       &btf_id, &is_retval, is_ldsx, &ref_obj_id);
 		if (err)
 			verbose_linfo(env, insn_idx, "; ");
 		if (!err && t == BPF_READ && value_regno >= 0) {
@@ -7413,6 +7433,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 				if (base_type(reg_type) == PTR_TO_BTF_ID) {
 					regs[value_regno].btf = btf;
 					regs[value_regno].btf_id = btf_id;
+					regs[value_regno].ref_obj_id = ref_obj_id;
 				}
 			}
 			regs[value_regno].type = reg_type;
@@ -22161,6 +22182,16 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
 		mark_reg_known_zero(env, regs, BPF_REG_1);
 	}
 
+	/* Acquire references for struct_ops program arguments tagged with "__ref".
+	 * These should be the earliest references acquired. btf_ctx_access() will
+	 * assume the ref_obj_id of the n-th __ref-tagged argument to be n.
+	 */
+	if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+		for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
+			if (env->prog->aux->ctx_arg_info[i].refcounted)
+				acquire_reference(env, 0);
+	}
+
 	ret = do_check(env);
 out:
 	/* check for NULL is necessary, since cur_state can be freed inside
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-23  9:57   ` Eduard Zingerman
  2024-12-20 19:55 ` [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Test referenced kptr acquired through a struct_ops argument tagged with
"__ref". The success case checks whether 1) a reference to the correct
type is acquired, and 2) the referenced kptr argument can be accessed in
multiple paths as long as it hasn't been released. In the fail cases,
we first confirm that a referenced kptr acquired through a struct_ops
argument is not allowed to be leaked. Then, we make sure this new
referenced kptr acquiring mechanism does not accidentally allow referenced
kptrs to flow into global subprograms through their arguments.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../prog_tests/test_struct_ops_refcounted.c   | 12 ++++++
 .../bpf/progs/struct_ops_refcounted.c         | 31 ++++++++++++++++
 ...ruct_ops_refcounted_fail__global_subprog.c | 37 +++++++++++++++++++
 .../struct_ops_refcounted_fail__ref_leak.c    | 22 +++++++++++
 .../selftests/bpf/test_kmods/bpf_testmod.c    |  7 ++++
 .../selftests/bpf/test_kmods/bpf_testmod.h    |  2 +
 6 files changed, 111 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
new file mode 100644
index 000000000000..e290a2f6db95
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
@@ -0,0 +1,12 @@
+#include <test_progs.h>
+
+#include "struct_ops_refcounted.skel.h"
+#include "struct_ops_refcounted_fail__ref_leak.skel.h"
+#include "struct_ops_refcounted_fail__global_subprog.skel.h"
+
+void test_struct_ops_refcounted(void)
+{
+	RUN_TESTS(struct_ops_refcounted);
+	RUN_TESTS(struct_ops_refcounted_fail__ref_leak);
+	RUN_TESTS(struct_ops_refcounted_fail__global_subprog);
+}
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
new file mode 100644
index 000000000000..76dcb6089d7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
@@ -0,0 +1,31 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+__attribute__((nomerge)) extern void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This is a test BPF program that uses struct_ops to access a referenced
+ * kptr argument. This is a test for the verifier to ensure that it
+ * 1) recognizes the task as a referenced object (i.e., ref_obj_id > 0), and
+ * 2) the same reference can be acquired from multiple paths as long as it
+ *    has not been released.
+ */
+SEC("struct_ops/test_refcounted")
+int BPF_PROG(refcounted, int dummy, struct task_struct *task)
+{
+	if (dummy == 1)
+		bpf_task_release(task);
+	else
+		bpf_task_release(task);
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_refcounted = {
+	.test_refcounted = (void *)refcounted,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
new file mode 100644
index 000000000000..43493a7ead39
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
@@ -0,0 +1,37 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern void bpf_task_release(struct task_struct *p) __ksym;
+
+__noinline int subprog_release(__u64 *ctx __arg_ctx)
+{
+	struct task_struct *task = (struct task_struct *)ctx[1];
+	int dummy = (int)ctx[0];
+
+	bpf_task_release(task);
+
+	return dummy + 1;
+}
+
+/* Test that the verifier rejects a program that contains a global
+ * subprogram with referenced kptr arguments
+ */
+SEC("struct_ops/test_refcounted")
+__failure __msg("invalid bpf_context access off=8. Reference may already be released")
+int refcounted_fail__global_subprog(unsigned long long *ctx)
+{
+	struct task_struct *task = (struct task_struct *)ctx[1];
+
+	bpf_task_release(task);
+
+	return subprog_release(ctx);
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+	.test_refcounted = (void *)refcounted_fail__global_subprog,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
new file mode 100644
index 000000000000..e945b1a04294
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
@@ -0,0 +1,22 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Test that the verifier rejects a program that acquires a referenced
+ * kptr through context without releasing the reference
+ */
+SEC("struct_ops/test_refcounted")
+__failure __msg("Unreleased reference id=1 alloc_insn=0")
+int BPF_PROG(refcounted_fail__ref_leak, int dummy,
+	     struct task_struct *task)
+{
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+	.test_refcounted = (void *)refcounted_fail__ref_leak,
+};
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
index cc9dde507aba..802cbd871035 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
@@ -1176,10 +1176,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
 	return 0;
 }
 
+static int bpf_testmod_ops__test_refcounted(int dummy,
+					    struct task_struct *task__ref)
+{
+	return 0;
+}
+
 static struct bpf_testmod_ops __bpf_testmod_ops = {
 	.test_1 = bpf_testmod_test_1,
 	.test_2 = bpf_testmod_test_2,
 	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+	.test_refcounted = bpf_testmod_ops__test_refcounted,
 };
 
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
index 356803d1c10e..c57b2f9dab10 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
@@ -36,6 +36,8 @@ struct bpf_testmod_ops {
 	/* Used to test nullable arguments. */
 	int (*test_maybe_null)(int dummy, struct task_struct *task);
 	int (*unsupported_ops)(void);
+	/* Used to test ref_acquired arguments. */
+	int (*test_refcounted)(int dummy, struct task_struct *task);
 
 	/* The following fields are used to test shadow copies. */
 	char onebyte;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-15 15:25   ` Ming Lei
  2025-01-23  9:57   ` Eduard Zingerman
  2024-12-20 19:55 ` [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Allow a struct_ops program to return a referenced kptr if the struct_ops
operator's return type is a struct pointer. To make sure the returned
pointer continues to be valid in the kernel, several constraints are
required:

1) The type of the pointer must match the return type
2) The pointer originally comes from the kernel (not locally allocated)
3) The pointer is in its unmodified form

Implementation-wise, a referenced kptr first needs to be allowed to _leak_
in check_reference_leak() if it is in the return register. Then, in
check_return_code(), constraints 1-3 are checked. During struct_ops
registration, a check is also added to warn about operators with a
non-struct pointer return type.

In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
pointer to be returned when there is no skb to be dequeued, we also allow
a scalar value equal to NULL to be returned.

In the future when there is a struct_ops user that always expects a valid
pointer to be returned from an operator, we may extend tagging to the
return value. We can tell the verifier to only allow NULL pointer return
if the return value is tagged with MAY_BE_NULL.
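
As an illustration of the accepted shape, the selftest added in the next
patch returns either the referenced task kptr (acquired via the "__ref"
argument mechanism from the previous patches) or a scalar NULL:

void bpf_task_release(struct task_struct *p) __ksym;

SEC("struct_ops/test_return_ref_kptr")
struct task_struct *BPF_PROG(kptr_return, int dummy,
			     struct task_struct *task, struct cgroup *cgrp)
{
	if (dummy % 2) {
		bpf_task_release(task);
		return NULL;	/* a scalar NULL is accepted */
	}
	return task;		/* unmodified, correctly typed, referenced */
}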

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 kernel/bpf/bpf_struct_ops.c | 12 +++++++++++-
 kernel/bpf/verifier.c       | 36 ++++++++++++++++++++++++++++++++----
 2 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index d9e0af00580b..27d4a170df84 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -386,7 +386,7 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 	st_ops_desc->value_type = btf_type_by_id(btf, value_id);
 
 	for_each_member(i, t, member) {
-		const struct btf_type *func_proto;
+		const struct btf_type *func_proto, *ret_type;
 
 		mname = btf_name_by_offset(btf, member->name_off);
 		if (!*mname) {
@@ -409,6 +409,16 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 		if (!func_proto)
 			continue;
 
+		if (func_proto->type) {
+			ret_type = btf_type_resolve_ptr(btf, func_proto->type, NULL);
+			if (ret_type && !__btf_type_is_struct(ret_type)) {
+				pr_warn("func ptr %s in struct %s returns non-struct pointer, which is not supported\n",
+					mname, st_ops->name);
+				err = -EOPNOTSUPP;
+				goto errout;
+			}
+		}
+
 		if (btf_distill_func_proto(log, btf,
 					   func_proto, mname,
 					   &st_ops->func_models[i])) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 26305571e377..0e6a3c4daa7d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10707,6 +10707,8 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
 {
 	struct bpf_verifier_state *state = env->cur_state;
+	enum bpf_prog_type type = resolve_prog_type(env->prog);
+	struct bpf_reg_state *reg = reg_state(env, BPF_REG_0);
 	bool refs_lingering = false;
 	int i;
 
@@ -10716,6 +10718,12 @@ static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exi
 	for (i = 0; i < state->acquired_refs; i++) {
 		if (state->refs[i].type != REF_TYPE_PTR)
 			continue;
+		/* Allow struct_ops programs to return a referenced kptr back to
+		 * kernel. Type checks are performed later in check_return_code.
+		 */
+		if (type == BPF_PROG_TYPE_STRUCT_OPS && !exception_exit &&
+		    reg->ref_obj_id == state->refs[i].id)
+			continue;
 		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
 			state->refs[i].id, state->refs[i].insn_idx);
 		refs_lingering = true;
@@ -16320,13 +16328,14 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 	const char *exit_ctx = "At program exit";
 	struct tnum enforce_attach_type_range = tnum_unknown;
 	const struct bpf_prog *prog = env->prog;
-	struct bpf_reg_state *reg;
+	struct bpf_reg_state *reg = reg_state(env, regno);
 	struct bpf_retval_range range = retval_range(0, 1);
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
 	int err;
 	struct bpf_func_state *frame = env->cur_state->frame[0];
 	const bool is_subprog = frame->subprogno;
 	bool return_32bit = false;
+	const struct btf_type *reg_type, *ret_type = NULL;
 
 	/* LSM and struct_ops func-ptr's return type could be "void" */
 	if (!is_subprog || frame->in_exception_callback_fn) {
@@ -16335,10 +16344,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 			if (prog->expected_attach_type == BPF_LSM_CGROUP)
 				/* See below, can be 0 or 0-1 depending on hook. */
 				break;
-			fallthrough;
+			if (!prog->aux->attach_func_proto->type)
+				return 0;
+			break;
 		case BPF_PROG_TYPE_STRUCT_OPS:
 			if (!prog->aux->attach_func_proto->type)
 				return 0;
+
+			if (frame->in_exception_callback_fn)
+				break;
+
+			/* Allow a struct_ops program to return a referenced kptr if it
+			 * matches the operator's return type and is in its unmodified
+			 * form. A scalar zero (i.e., a null pointer) is also allowed.
+			 */
+			reg_type = reg->btf ? btf_type_by_id(reg->btf, reg->btf_id) : NULL;
+			ret_type = btf_type_resolve_ptr(prog->aux->attach_btf,
+							prog->aux->attach_func_proto->type,
+							NULL);
+			if (ret_type && ret_type == reg_type && reg->ref_obj_id)
+				return __check_ptr_off_reg(env, reg, regno, false);
 			break;
 		default:
 			break;
@@ -16360,8 +16385,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 		return -EACCES;
 	}
 
-	reg = cur_regs(env) + regno;
-
 	if (frame->in_async_callback_fn) {
 		/* enforce return zero from async callbacks like timer */
 		exit_ctx = "At async callback return";
@@ -16460,6 +16483,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
 	case BPF_PROG_TYPE_NETFILTER:
 		range = retval_range(NF_DROP, NF_ACCEPT);
 		break;
+	case BPF_PROG_TYPE_STRUCT_OPS:
+		if (!ret_type)
+			return 0;
+		range = retval_range(0, 0);
+		break;
 	case BPF_PROG_TYPE_EXT:
 		/* freplace program can return anything as its return value
 		 * depends on the to-be-replaced kernel func or bpf program.
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (2 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-23  9:58   ` Eduard Zingerman
  2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Test struct_ops programs returning referenced kptrs. When the return type
of a struct_ops operator is a pointer to struct, the verifier should
only allow programs that return a scalar NULL or a non-local kptr with the
correct type in its unmodified form.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../prog_tests/test_struct_ops_kptr_return.c  | 16 +++++++++
 .../bpf/progs/struct_ops_kptr_return.c        | 30 ++++++++++++++++
 ...uct_ops_kptr_return_fail__invalid_scalar.c | 26 ++++++++++++++
 .../struct_ops_kptr_return_fail__local_kptr.c | 34 +++++++++++++++++++
 ...uct_ops_kptr_return_fail__nonzero_offset.c | 25 ++++++++++++++
 .../struct_ops_kptr_return_fail__wrong_type.c | 30 ++++++++++++++++
 .../selftests/bpf/test_kmods/bpf_testmod.c    |  8 +++++
 .../selftests/bpf/test_kmods/bpf_testmod.h    |  4 +++
 8 files changed, 173 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
new file mode 100644
index 000000000000..467cc72a3588
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
@@ -0,0 +1,16 @@
+#include <test_progs.h>
+
+#include "struct_ops_kptr_return.skel.h"
+#include "struct_ops_kptr_return_fail__wrong_type.skel.h"
+#include "struct_ops_kptr_return_fail__invalid_scalar.skel.h"
+#include "struct_ops_kptr_return_fail__nonzero_offset.skel.h"
+#include "struct_ops_kptr_return_fail__local_kptr.skel.h"
+
+void test_struct_ops_kptr_return(void)
+{
+	RUN_TESTS(struct_ops_kptr_return);
+	RUN_TESTS(struct_ops_kptr_return_fail__wrong_type);
+	RUN_TESTS(struct_ops_kptr_return_fail__invalid_scalar);
+	RUN_TESTS(struct_ops_kptr_return_fail__nonzero_offset);
+	RUN_TESTS(struct_ops_kptr_return_fail__local_kptr);
+}
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
new file mode 100644
index 000000000000..36386b3c23a1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
@@ -0,0 +1,30 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning referenced kptr. The verifier should
+ * allow a referenced kptr or a NULL pointer to be returned. A referenced kptr to task
+ * here is acquired automatically as the task argument is tagged with "__ref".
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(kptr_return, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	if (dummy % 2) {
+		bpf_task_release(task);
+		return NULL;
+	}
+	return task;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)kptr_return,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
new file mode 100644
index 000000000000..caeea158ef69
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
@@ -0,0 +1,26 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning referenced kptr. The verifier should
+ * reject programs returning a non-zero scalar value.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+__failure __msg("At program exit the register R0 has smin=1 smax=1 should have been in [0, 0]")
+struct task_struct *BPF_PROG(kptr_return_fail__invalid_scalar, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	bpf_task_release(task);
+	return (struct task_struct *)1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)kptr_return_fail__invalid_scalar,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
new file mode 100644
index 000000000000..b8b4f05c3d7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
@@ -0,0 +1,34 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_experimental.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning referenced kptr. The verifier should
+ * reject programs returning a local kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+__failure __msg("At program exit the register R0 is not a known value (ptr_or_null_)")
+struct task_struct *BPF_PROG(kptr_return_fail__local_kptr, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	struct task_struct *t;
+
+	bpf_task_release(task);
+
+	t = bpf_obj_new(typeof(*task));
+	if (!t)
+		return NULL;
+
+	return t;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)kptr_return_fail__local_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
new file mode 100644
index 000000000000..7ddeb28c2329
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
@@ -0,0 +1,25 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning referenced kptr. The verifier should
+ * reject programs returning a modified referenced kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+__failure __msg("dereference of modified trusted_ptr_ ptr R0 off={{[0-9]+}} disallowed")
+struct task_struct *BPF_PROG(kptr_return_fail__nonzero_offset, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	return (struct task_struct *)&task->jobctl;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)kptr_return_fail__nonzero_offset,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
new file mode 100644
index 000000000000..6a2dd5367802
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
@@ -0,0 +1,30 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../test_kmods/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning referenced kptr. The verifier should
+ * reject programs returning a referenced kptr of the wrong type.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+__failure __msg("At program exit the register R0 is not a known value (ptr_or_null_)")
+struct task_struct *BPF_PROG(kptr_return_fail__wrong_type, int dummy,
+			     struct task_struct *task, struct cgroup *cgrp)
+{
+	struct task_struct *ret;
+
+	ret = (struct task_struct *)bpf_cgroup_acquire(cgrp);
+	bpf_task_release(task);
+
+	return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+	.test_return_ref_kptr = (void *)kptr_return_fail__wrong_type,
+};
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
index 802cbd871035..89dc502de9d4 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.c
@@ -1182,11 +1182,19 @@ static int bpf_testmod_ops__test_refcounted(int dummy,
 	return 0;
 }
 
+static struct task_struct *
+bpf_testmod_ops__test_return_ref_kptr(int dummy, struct task_struct *task__ref,
+				      struct cgroup *cgrp)
+{
+	return NULL;
+}
+
 static struct bpf_testmod_ops __bpf_testmod_ops = {
 	.test_1 = bpf_testmod_test_1,
 	.test_2 = bpf_testmod_test_2,
 	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
 	.test_refcounted = bpf_testmod_ops__test_refcounted,
+	.test_return_ref_kptr = bpf_testmod_ops__test_return_ref_kptr,
 };
 
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
index c57b2f9dab10..c9fab51f16e2 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct cgroup;
 
 struct bpf_testmod_test_read_ctx {
 	char *buf;
@@ -38,6 +39,9 @@ struct bpf_testmod_ops {
 	int (*unsupported_ops)(void);
 	/* Used to test ref_acquired arguments. */
 	int (*test_refcounted)(int dummy, struct task_struct *task);
+	/* Used to test returning referenced kptr. */
+	struct task_struct *(*test_return_ref_kptr)(int dummy, struct task_struct *task,
+						    struct cgroup *cgrp);
 
 	/* The following fields are used to test shadow copies. */
 	char onebyte;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (3 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-09 15:00   ` Amery Hung
                     ` (2 more replies)
  2024-12-20 19:55 ` [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
                   ` (10 subsequent siblings)
  15 siblings, 3 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Enable users to implement a classless qdisc using bpf. The previous
patches in this series have prepared struct_ops to support the core
operators in Qdisc_ops. Recent advancements in bpf, such as allocated
objects, bpf list and bpf rbtree, have also provided powerful and flexible
building blocks to realize sophisticated scheduling algorithms. Therefore,
in this patch, we start allowing qdiscs to be implemented using bpf
struct_ops. Users can implement Qdisc_ops.{enqueue, dequeue, init, reset,
destroy} in bpf and register the qdisc dynamically into the kernel.
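
From the bpf program's point of view, a qdisc is then declared as a
struct_ops map, roughly as below (a sketch following the fifo selftest
added later in this series; the program names are illustrative, and the
.id string is what tc later refers to when attaching the qdisc):

SEC(".struct_ops.link")
struct Qdisc_ops fifo = {
	.enqueue   = (void *)bpf_fifo_enqueue,
	.dequeue   = (void *)bpf_fifo_dequeue,
	.init      = (void *)bpf_fifo_init,
	.reset     = (void *)bpf_fifo_reset,
	.destroy   = (void *)bpf_fifo_destroy,
	.id        = "bpf_fifo",
};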

Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/btf.h     |   1 +
 kernel/bpf/btf.c        |   4 +-
 net/sched/Kconfig       |  12 +++
 net/sched/Makefile      |   1 +
 net/sched/bpf_qdisc.c   | 207 ++++++++++++++++++++++++++++++++++++++++
 net/sched/sch_api.c     |   7 +-
 net/sched/sch_generic.c |   3 +-
 7 files changed, 229 insertions(+), 6 deletions(-)
 create mode 100644 net/sched/bpf_qdisc.c

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 4214e76c9168..eb16218fdf52 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -563,6 +563,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
 const char *btf_str_by_offset(const struct btf *btf, u32 offset);
 struct btf *btf_parse_vmlinux(void);
 struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
 u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
 			       const struct bpf_prog *prog);
 u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index c2f4f84e539d..78476cebefe3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6375,8 +6375,8 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
 	return btf_type_is_int(t);
 }
 
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
-			   int off)
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+		    int off)
 {
 	const struct btf_param *args;
 	const struct btf_type *t;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..ccd0255da5a5 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,18 @@ config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	bool "BPF-based Qdisc"
+	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+	help
+	  This option allows BPF-based queueing disciplines. With BPF struct_ops,
+	  users can implement supported operators in Qdisc_ops using BPF programs.
+	  The queue holding skb can be built with BPF maps or graphs.
+
+	  Say Y here if you want to use BPF-based Qdisc.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..904d784902d1 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= bpf_qdisc.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..4b7d013f4f5c
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,207 @@
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+struct bpf_sk_buff_ptr {
+	struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+	return 0;
+}
+
+static const struct bpf_func_proto *
+bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
+			 const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_tail_call:
+		return NULL;
+	default:
+		return bpf_base_func_proto(func_id, prog);
+	}
+}
+
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	struct btf *btf = prog->aux->attach_btf;
+	u32 arg;
+
+	arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+		if (arg == 2 && type == BPF_READ) {
+			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+			info->btf = btf;
+			info->btf_id = bpf_sk_buff_ptr_ids[0];
+			return true;
+		}
+	}
+
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+					const struct bpf_reg_state *reg,
+					int off, int size)
+{
+	const struct btf_type *t, *skbt;
+	size_t end;
+
+	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t != skbt) {
+		bpf_log(log, "only read is supported\n");
+		return -EACCES;
+	}
+
+	switch (off) {
+	case offsetof(struct sk_buff, tstamp):
+		end = offsetofend(struct sk_buff, tstamp);
+		break;
+	case offsetof(struct sk_buff, priority):
+		end = offsetofend(struct sk_buff, priority);
+		break;
+	case offsetof(struct sk_buff, mark):
+		end = offsetofend(struct sk_buff, mark);
+		break;
+	case offsetof(struct sk_buff, queue_mapping):
+		end = offsetofend(struct sk_buff, queue_mapping);
+		break;
+	case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+		end = offsetof(struct sk_buff, cb) +
+		      offsetofend(struct qdisc_skb_cb, tc_classid);
+		break;
+	case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+	     offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+						     data[QDISC_CB_PRIV_LEN - 1]):
+		end = offsetof(struct sk_buff, cb) +
+		      offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+		break;
+	case offsetof(struct sk_buff, tc_index):
+		end = offsetofend(struct sk_buff, tc_index);
+		break;
+	default:
+		bpf_log(log, "no write support to sk_buff at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+	.get_func_proto		= bpf_qdisc_get_func_proto,
+	.is_valid_access	= bpf_qdisc_is_valid_access,
+	.btf_struct_access	= bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+				 const struct btf_member *member,
+				 void *kdata, const void *udata)
+{
+	const struct Qdisc_ops *uqdisc_ops;
+	struct Qdisc_ops *qdisc_ops;
+	u32 moff;
+
+	uqdisc_ops = (const struct Qdisc_ops *)udata;
+	qdisc_ops = (struct Qdisc_ops *)kdata;
+
+	moff = __btf_member_bit_offset(t, member) / 8;
+	switch (moff) {
+	case offsetof(struct Qdisc_ops, peek):
+		qdisc_ops->peek = qdisc_peek_dequeued;
+		return 0;
+	case offsetof(struct Qdisc_ops, id):
+		if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+				     sizeof(qdisc_ops->id)) <= 0)
+			return -EINVAL;
+		return 1;
+	}
+
+	return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
+{
+	return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
+{
+	return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
+			      struct sk_buff **to_free)
+{
+	return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+	return NULL;
+}
+
+static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
+{
+	return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+			   struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+	.enqueue = Qdisc_ops__enqueue,
+	.dequeue = Qdisc_ops__dequeue,
+	.peek = Qdisc_ops__peek,
+	.init = Qdisc_ops__init,
+	.reset = Qdisc_ops__reset,
+	.destroy = Qdisc_ops__destroy,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+	.verifier_ops = &bpf_qdisc_verifier_ops,
+	.reg = bpf_qdisc_reg,
+	.unreg = bpf_qdisc_unreg,
+	.init_member = bpf_qdisc_init_member,
+	.init = bpf_qdisc_init,
+	.name = "Qdisc_ops",
+	.cfi_stubs = &__bpf_ops_qdisc_ops,
+	.owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+	return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 300430b8c4d2..b35c73c82342 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
 #include <linux/hrtimer.h>
 #include <linux/slab.h>
 #include <linux/hashtable.h>
+#include <linux/bpf.h>
 
 #include <net/net_namespace.h>
 #include <net/sock.h>
@@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
 		read_lock(&qdisc_mod_lock);
 		for (q = qdisc_base; q; q = q->next) {
 			if (nla_strcmp(kind, q->id) == 0) {
-				if (!try_module_get(q->owner))
+				if (!bpf_try_module_get(q, q->owner))
 					q = NULL;
 				break;
 			}
@@ -1287,7 +1288,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 				/* We will try again qdisc_lookup_ops,
 				 * so don't keep a reference.
 				 */
-				module_put(ops->owner);
+				bpf_module_put(ops, ops->owner);
 				err = -EAGAIN;
 				goto err_out;
 			}
@@ -1398,7 +1399,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	netdev_put(dev, &sch->dev_tracker);
 	qdisc_free(sch);
 err_out2:
-	module_put(ops->owner);
+	bpf_module_put(ops, ops->owner);
 err_out:
 	*errp = err;
 	return NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 38ec18f73de4..1e770ec251a0 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
 #include <linux/if_vlan.h>
 #include <linux/skb_array.h>
 #include <linux/if_macvlan.h>
+#include <linux/bpf.h>
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -1083,7 +1084,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 		ops->destroy(qdisc);
 
 	lockdep_unregister_key(&qdisc->root_lock_key);
-	module_put(ops->owner);
+	bpf_module_put(ops, ops->owner);
 	netdev_put(dev, &qdisc->dev_tracker);
 
 	trace_qdisc_destroy(qdisc);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (4 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-10  0:24   ` Martin KaFai Lau
  2024-12-20 19:55 ` [PATCH bpf-next v2 07/14] bpf: Search and add kfuncs in struct_ops prologue and epilogue Amery Hung
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Add basic kfuncs for working with skbs in a bpf qdisc.

Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue, where a to_free skb list is provided by the kernel to defer
the release. bpf_kfree_skb() should be used everywhere else. It is also
used in bpf_obj_free_fields() when cleaning up skbs stored in maps and
collections.

bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.

Finally, allow users to create a read-only dynptr via bpf_dynptr_from_skb().
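
As a rough illustration (a sketch, not part of this patch), an enqueue
program might combine these kfuncs like so; the limit check and the
flow stash are placeholders, and NET_XMIT_* are the usual netdev
return codes:

  SEC("struct_ops/example_enqueue")
  int BPF_PROG(example_enqueue, struct sk_buff *skb, struct Qdisc *sch,
               struct bpf_sk_buff_ptr *to_free)
  {
          u32 hash = bpf_skb_get_hash(skb);  /* pick a per-flow queue */

          if (sch->q.qlen >= sch->limit) {
                  /* In .enqueue, drop via the deferred to_free list. */
                  bpf_qdisc_skb_drop(skb, to_free);
                  return NET_XMIT_DROP;
          }

          /* ... stash the skb in a flow queue keyed by hash ... */
          return NET_XMIT_SUCCESS;
  }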

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/bpf.h         |  1 +
 kernel/bpf/bpf_struct_ops.c |  2 +
 net/sched/bpf_qdisc.c       | 92 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2556f8043276..87ecee12af21 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1809,6 +1809,7 @@ struct bpf_struct_ops {
 	void *cfi_stubs;
 	struct module *owner;
 	const char *name;
+	const struct btf_type *type;
 	struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS];
 };
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 27d4a170df84..65542d8f064c 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -442,6 +442,8 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
 		goto errout;
 	}
 
+	st_ops->type = t;
+
 	return 0;
 
 errout:
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 4b7d013f4f5c..1c92bfcc3847 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -108,6 +108,79 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 	return 0;
 }
 
+__bpf_kfunc_start_defs();
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+	return skb_get_hash(skb);
+}
+
+/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
+ * @skb: The skb whose reference to be released and dropped.
+ */
+__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
+ * @skb: The skb whose reference to be released and dropped.
+ * @to_free_list: The list of skbs to be dropped.
+ */
+__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
+				    struct bpf_sk_buff_ptr *to_free_list)
+{
+	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(qdisc_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(qdisc_kfunc_ids)
+
+BTF_SET_START(qdisc_common_kfunc_set)
+BTF_ID(func, bpf_skb_get_hash)
+BTF_ID(func, bpf_kfree_skb)
+BTF_SET_END(qdisc_common_kfunc_set)
+
+BTF_SET_START(qdisc_enqueue_kfunc_set)
+BTF_ID(func, bpf_qdisc_skb_drop)
+BTF_SET_END(qdisc_enqueue_kfunc_set)
+
+static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
+						 prog->aux->attach_btf_id))
+		return 0;
+
+	/* Skip the check when prog->attach_func_name is not yet available
+	 * during check_cfg().
+	 */
+	if (!btf_id_set8_contains(&qdisc_kfunc_ids, kfunc_id) ||
+	    !prog->aux->attach_func_name)
+		return 0;
+
+	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+		if (btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
+		       return 0;
+	}
+
+	return btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id) ? 0 : -EACCES;
+}
+
+static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &qdisc_kfunc_ids,
+	.filter = bpf_qdisc_kfunc_filter,
+};
+
 static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
 	.get_func_proto		= bpf_qdisc_get_func_proto,
 	.is_valid_access	= bpf_qdisc_is_valid_access,
@@ -200,8 +273,25 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
 	.owner = THIS_MODULE,
 };
 
+BTF_ID_LIST(bpf_sk_buff_dtor_ids)
+BTF_ID(func, bpf_kfree_skb)
+
 static int __init bpf_qdisc_kfunc_init(void)
 {
-	return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+	int ret;
+	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+		{
+			.btf_id       = bpf_sk_buff_ids[0],
+			.kfunc_btf_id = bpf_sk_buff_dtor_ids[0]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
+	ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+						 ARRAY_SIZE(skb_kfunc_dtors),
+						 THIS_MODULE);
+	ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+
+	return ret;
 }
 late_initcall(bpf_qdisc_kfunc_init);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 07/14] bpf: Search and add kfuncs in struct_ops prologue and epilogue
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (5 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Currently, add_kfunc_call() is only invoked once before the main
verification loop. Therefore, the verifier cannot find the
bpf_kfunc_btf_tab of a new kfunc call that does not appear in the
user-defined struct_ops programs but is introduced by gen_prologue or
gen_epilogue during do_misc_fixup(). Fix this by searching for kfuncs
in the instruction buffers used for patching and adding them to
prog->aux->kfunc_tab.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 kernel/bpf/verifier.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0e6a3c4daa7d..949812d955ca 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3214,6 +3214,21 @@ bpf_jit_find_kfunc_model(const struct bpf_prog *prog,
 	return res ? &res->func_model : NULL;
 }
 
+static int add_kfunc_in_insns(struct bpf_verifier_env *env,
+			      struct bpf_insn *insn, int cnt)
+{
+	int i, ret;
+
+	for (i = 0; i < cnt; i++, insn++) {
+		if (bpf_pseudo_kfunc_call(insn)) {
+			ret = add_kfunc_call(env, insn->imm, insn->off);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	return 0;
+}
+
 static int add_subprog_and_kfunc(struct bpf_verifier_env *env)
 {
 	struct bpf_subprog_info *subprog = env->subprog_info;
@@ -20278,7 +20293,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 {
 	struct bpf_subprog_info *subprogs = env->subprog_info;
 	const struct bpf_verifier_ops *ops = env->ops;
-	int i, cnt, size, ctx_field_size, delta = 0, epilogue_cnt = 0;
+	int i, cnt, size, ctx_field_size, ret, delta = 0, epilogue_cnt = 0;
 	const int insn_cnt = env->prog->len;
 	struct bpf_insn *epilogue_buf = env->epilogue_buf;
 	struct bpf_insn *insn_buf = env->insn_buf;
@@ -20307,6 +20322,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 				return -ENOMEM;
 			env->prog = new_prog;
 			delta += cnt - 1;
+
+			ret = add_kfunc_in_insns(env, epilogue_buf, epilogue_cnt - 1);
+			if (ret < 0)
+				return ret;
 		}
 	}
 
@@ -20327,6 +20346,10 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 
 			env->prog = new_prog;
 			delta += cnt - 1;
+
+			ret = add_kfunc_in_insns(env, insn_buf, cnt - 1);
+			if (ret < 0)
+				return ret;
 		}
 	}
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (6 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 07/14] bpf: Search and add kfuncs in struct_ops prologue and epilogue Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-09  0:20   ` Martin KaFai Lau
  2024-12-20 19:55 ` [PATCH bpf-next v2 09/14] bpf: net_sched: Support updating bstats Amery Hung
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of the qdisc through the bpf_qdisc_watchdog_schedule()
kfunc. It is useful for building traffic shaping algorithms, where the
time at which the next packet will be dequeued is known.
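
As a rough sketch (not part of this patch), a rate-limiting dequeue
helper could arm the watchdog when the head packet is not yet due;
time_to_send would normally come from the queued skb:

  static int maybe_throttle(struct Qdisc *sch, u64 time_to_send)
  {
          u64 now = bpf_ktime_get_ns();

          if (now < time_to_send) {
                  /* expire is an absolute time, delta_ns is the timer slack */
                  bpf_qdisc_watchdog_schedule(sch, time_to_send, 10 * 1000);
                  return 1;  /* caller returns NULL from .dequeue */
          }
          return 0;
  }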

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/filter.h | 10 +++++
 net/sched/bpf_qdisc.c  | 92 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 0477254bc2d3..3bc9b741a120 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -469,6 +469,16 @@ static inline bool insn_is_cast_user(const struct bpf_insn *insn)
 		.off   = 0,					\
 		.imm   = BPF_CALL_IMM(FUNC) })
 
+/* Kfunc call */
+
+#define BPF_CALL_KFUNC(OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_CALL,			\
+		.dst_reg = 0,					\
+		.src_reg = BPF_PSEUDO_KFUNC_CALL,		\
+		.off   = OFF,					\
+		.imm   = IMM })
+
 /* Raw code statement block */
 
 #define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 1c92bfcc3847..bbe7aded6f24 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -8,6 +8,10 @@
 
 static struct bpf_struct_ops bpf_Qdisc_ops;
 
+struct bpf_sched_data {
+	struct qdisc_watchdog watchdog;
+};
+
 struct bpf_sk_buff_ptr {
 	struct sk_buff *skb;
 };
@@ -108,6 +112,46 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 	return 0;
 }
 
+BTF_ID_LIST(bpf_qdisc_init_prologue_ids)
+BTF_ID(func, bpf_qdisc_init_prologue)
+
+static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
+				  const struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	if (strcmp(prog->aux->attach_func_name, "init"))
+		return 0;
+
+	*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
+	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
+	*insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_init_prologue_ids[0]);
+	*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
+	*insn++ = prog->insnsi[0];
+
+	return insn - insn_buf;
+}
+
+BTF_ID_LIST(bpf_qdisc_reset_destroy_epilogue_ids)
+BTF_ID(func, bpf_qdisc_reset_destroy_epilogue)
+
+static int bpf_qdisc_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog,
+				  s16 ctx_stack_off)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	if (strcmp(prog->aux->attach_func_name, "reset") &&
+	    strcmp(prog->aux->attach_func_name, "destroy"))
+		return 0;
+
+	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off);
+	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
+	*insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_reset_destroy_epilogue_ids[0]);
+	*insn++ = BPF_EXIT_INSN();
+
+	return insn - insn_buf;
+}
+
 __bpf_kfunc_start_defs();
 
 /* bpf_skb_get_hash - Get the flow hash of an skb.
@@ -136,6 +180,36 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
 	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
 }
 
+/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
+ * @sch: The qdisc to be scheduled.
+ * @expire: The expiry time of the timer.
+ * @delta_ns: The slack range of the timer.
+ */
+__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
+}
+
+/* bpf_qdisc_init_prologue - Hidden kfunc called in prologue of .init. */
+__bpf_kfunc void bpf_qdisc_init_prologue(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+}
+
+/* bpf_qdisc_reset_destroy_epilogue - Hidden kfunc called in epilogue of .reset
+ * and .destroy
+ */
+__bpf_kfunc void bpf_qdisc_reset_destroy_epilogue(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(qdisc_kfunc_ids)
@@ -143,6 +217,9 @@ BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_init_prologue, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_reset_destroy_epilogue, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(qdisc_kfunc_ids)
 
 BTF_SET_START(qdisc_common_kfunc_set)
@@ -152,8 +229,13 @@ BTF_SET_END(qdisc_common_kfunc_set)
 
 BTF_SET_START(qdisc_enqueue_kfunc_set)
 BTF_ID(func, bpf_qdisc_skb_drop)
+BTF_ID(func, bpf_qdisc_watchdog_schedule)
 BTF_SET_END(qdisc_enqueue_kfunc_set)
 
+BTF_SET_START(qdisc_dequeue_kfunc_set)
+BTF_ID(func, bpf_qdisc_watchdog_schedule)
+BTF_SET_END(qdisc_dequeue_kfunc_set)
+
 static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 {
 	if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
@@ -170,6 +252,9 @@ static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
 		if (btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
 		       return 0;
+	} else if (!strcmp(prog->aux->attach_func_name, "dequeue")) {
+		if (btf_id_set_contains(&qdisc_dequeue_kfunc_set, kfunc_id))
+		       return 0;
 	}
 
 	return btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id) ? 0 : -EACCES;
@@ -185,6 +270,8 @@ static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
 	.get_func_proto		= bpf_qdisc_get_func_proto,
 	.is_valid_access	= bpf_qdisc_is_valid_access,
 	.btf_struct_access	= bpf_qdisc_btf_struct_access,
+	.gen_prologue		= bpf_qdisc_gen_prologue,
+	.gen_epilogue		= bpf_qdisc_gen_epilogue,
 };
 
 static int bpf_qdisc_init_member(const struct btf_type *t,
@@ -200,6 +287,11 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
 
 	moff = __btf_member_bit_offset(t, member) / 8;
 	switch (moff) {
+	case offsetof(struct Qdisc_ops, priv_size):
+		if (uqdisc_ops->priv_size)
+			return -EINVAL;
+		qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
+		return 1;
 	case offsetof(struct Qdisc_ops, peek):
 		qdisc_ops->peek = qdisc_peek_dequeued;
 		return 0;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 09/14] bpf: net_sched: Support updating bstats
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (7 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 10/14] bpf: net_sched: Support updating qstats Amery Hung
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is
only available in .dequeue programs.
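
For example (a sketch mirroring the selftests later in this series),
a dequeue program calls it once an skb has been picked:

  /* in .dequeue, right before returning the chosen skb */
  bpf_qdisc_bstats_update(sch, skb);
  return skb;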

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index bbe7aded6f24..39f01daed48a 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -210,6 +210,15 @@ __bpf_kfunc void bpf_qdisc_reset_destroy_epilogue(struct Qdisc *sch)
 	qdisc_watchdog_cancel(&q->watchdog);
 }
 
+/* bpf_qdisc_bstats_update - Update Qdisc basic statistics
+ * @sch: The qdisc from which an skb is dequeued.
+ * @skb: The skb to be dequeued.
+ */
+__bpf_kfunc void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb)
+{
+	bstats_update(&sch->bstats, skb);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(qdisc_kfunc_ids)
@@ -220,6 +229,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_qdisc_init_prologue, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_qdisc_reset_destroy_epilogue, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_bstats_update, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(qdisc_kfunc_ids)
 
 BTF_SET_START(qdisc_common_kfunc_set)
@@ -234,6 +244,7 @@ BTF_SET_END(qdisc_enqueue_kfunc_set)
 
 BTF_SET_START(qdisc_dequeue_kfunc_set)
 BTF_ID(func, bpf_qdisc_watchdog_schedule)
+BTF_ID(func, bpf_qdisc_bstats_update)
 BTF_SET_END(qdisc_dequeue_kfunc_set)
 
 static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 10/14] bpf: net_sched: Support updating qstats
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (8 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 09/14] bpf: net_sched: Support updating bstats Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 11/14] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Allow bpf qdisc programs to update Qdisc qstats directly through BTF
struct access.
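
For example (a sketch; qdisc_pkt_len() is the small helper used by the
selftests to read qdisc_skb_cb(skb)->pkt_len), programs can then keep
the qstats counters in sync:

  /* .enqueue, after accepting the skb */
  sch->qstats.backlog += qdisc_pkt_len(skb);

  /* .enqueue, when dropping */
  sch->qstats.drops++;

  /* .dequeue, after removing an skb */
  sch->qstats.backlog -= qdisc_pkt_len(skb);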

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 53 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 8 deletions(-)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 39f01daed48a..04ad3676448f 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -33,6 +33,7 @@ bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
 	}
 }
 
+BTF_ID_LIST_SINGLE(bpf_qdisc_ids, struct, Qdisc)
 BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
 BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
 
@@ -57,20 +58,37 @@ static bool bpf_qdisc_is_valid_access(int off, int size,
 	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
 }
 
-static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
-					const struct bpf_reg_state *reg,
-					int off, int size)
+static int bpf_qdisc_qdisc_access(struct bpf_verifier_log *log,
+				  const struct bpf_reg_state *reg,
+				  int off, int size)
 {
-	const struct btf_type *t, *skbt;
 	size_t end;
 
-	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
-	t = btf_type_by_id(reg->btf, reg->btf_id);
-	if (t != skbt) {
-		bpf_log(log, "only read is supported\n");
+	switch (off) {
+	case offsetof(struct Qdisc, qstats) ... offsetofend(struct Qdisc, qstats) - 1:
+		end = offsetofend(struct Qdisc, qstats);
+		break;
+	default:
+		bpf_log(log, "no write support to Qdisc at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of Qdisc ended at %zu\n",
+			off, size, end);
 		return -EACCES;
 	}
 
+	return 0;
+}
+
+static int bpf_qdisc_sk_buff_access(struct bpf_verifier_log *log,
+				    const struct bpf_reg_state *reg,
+				    int off, int size)
+{
+	size_t end;
+
 	switch (off) {
 	case offsetof(struct sk_buff, tstamp):
 		end = offsetofend(struct sk_buff, tstamp);
@@ -112,6 +130,25 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
 	return 0;
 }
 
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+				       const struct bpf_reg_state *reg,
+				       int off, int size)
+{
+	const struct btf_type *t, *skbt, *qdisct;
+
+	skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+	qdisct = btf_type_by_id(reg->btf, bpf_qdisc_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+
+	if (t == skbt)
+		return bpf_qdisc_sk_buff_access(log, reg, off, size);
+	else if (t == qdisct)
+		return bpf_qdisc_qdisc_access(log, reg, off, size);
+
+	bpf_log(log, "only read is supported\n");
+	return -EACCES;
+}
+
 BTF_ID_LIST(bpf_qdisc_init_prologue_ids)
 BTF_ID(func, bpf_qdisc_init_prologue)
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 11/14] bpf: net_sched: Allow writing to more Qdisc members
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (9 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 10/14] bpf: net_sched: Support updating qstats Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 12/14] libbpf: Support creating and destroying qdisc Amery Hung
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Allow bpf qdisc programs to write to Qdisc->limit and Qdisc->q.qlen.
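
For example (a sketch, as in the selftests later in this series):

  /* .init: pick a default queue limit */
  sch->limit = 1000;

  /* .enqueue */
  if (sch->q.qlen == sch->limit)
          goto drop;
  sch->q.qlen++;

  /* .dequeue */
  sch->q.qlen--;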

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/bpf_qdisc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 04ad3676448f..925624c47c3e 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -65,6 +65,12 @@ static int bpf_qdisc_qdisc_access(struct bpf_verifier_log *log,
 	size_t end;
 
 	switch (off) {
+	case offsetof(struct Qdisc, limit):
+		end = offsetofend(struct Qdisc, limit);
+		break;
+	case offsetof(struct Qdisc, q) + offsetof(struct qdisc_skb_head, qlen):
+		end = offsetof(struct Qdisc, q) + offsetofend(struct qdisc_skb_head, qlen);
+		break;
 	case offsetof(struct Qdisc, qstats) ... offsetofend(struct Qdisc, qstats) - 1:
 		end = offsetofend(struct Qdisc, qstats);
 		break;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 12/14] libbpf: Support creating and destroying qdisc
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (10 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 11/14] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2024-12-20 19:55 ` [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test Amery Hung
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

Extend struct bpf_tc_hook with a handle, a qdisc name, and a new attach
type, BPF_TC_QDISC, so that users can add or remove any specified qdisc,
not just clsact.
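
Usage looks roughly like this (a sketch; the ifindex, handle and qdisc
name are placeholders):

  DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                      .ifindex = 1,
                      .attach_point = BPF_TC_QDISC,
                      .parent = TC_H_ROOT,
                      .handle = 0x8000000,
                      .qdisc = "bpf_fifo");

  err = bpf_tc_hook_create(&hook);   /* add the qdisc */
  ...
  bpf_tc_hook_destroy(&hook);        /* remove it again */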

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 tools/lib/bpf/libbpf.h  |  5 ++++-
 tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index d45807103565..062ed3f273a1 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1268,6 +1268,7 @@ enum bpf_tc_attach_point {
 	BPF_TC_INGRESS = 1 << 0,
 	BPF_TC_EGRESS  = 1 << 1,
 	BPF_TC_CUSTOM  = 1 << 2,
+	BPF_TC_QDISC   = 1 << 3,
 };
 
 #define BPF_TC_PARENT(a, b) 	\
@@ -1282,9 +1283,11 @@ struct bpf_tc_hook {
 	int ifindex;
 	enum bpf_tc_attach_point attach_point;
 	__u32 parent;
+	__u32 handle;
+	const char *qdisc;
 	size_t :0;
 };
-#define bpf_tc_hook__last_field parent
+#define bpf_tc_hook__last_field qdisc
 
 struct bpf_tc_opts {
 	size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 68a2def17175..c997e69d507f 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
 }
 
 
-typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
+typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook);
 
-static int clsact_config(struct libbpf_nla_req *req)
+static int clsact_config(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook)
 {
 	req->tc.tcm_parent = TC_H_CLSACT;
 	req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
@@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
 	return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
 }
 
+static int qdisc_config(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook)
+{
+	const char *qdisc = OPTS_GET(hook, qdisc, NULL);
+
+	req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
+	req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
+
+	return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
+}
+
 static int attach_point_to_config(struct bpf_tc_hook *hook,
 				  qdisc_config_t *config)
 {
@@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
 		return 0;
 	case BPF_TC_CUSTOM:
 		return -EOPNOTSUPP;
+	case BPF_TC_QDISC:
+		*config = &qdisc_config;
+		return 0;
 	default:
 		return -EINVAL;
 	}
@@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
 	req.tc.tcm_family  = AF_UNSPEC;
 	req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
 
-	ret = config(&req);
+	ret = config(&req, hook);
 	if (ret < 0)
 		return ret;
 
@@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
 	case BPF_TC_INGRESS:
 	case BPF_TC_EGRESS:
 		return libbpf_err(__bpf_tc_detach(hook, NULL, true));
+	case BPF_TC_QDISC:
 	case BPF_TC_INGRESS | BPF_TC_EGRESS:
 		return libbpf_err(tc_qdisc_delete(hook));
 	case BPF_TC_CUSTOM:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (11 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 12/14] libbpf: Support creating and destroying qdisc Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-10  0:05   ` Martin KaFai Lau
  2024-12-20 19:55 ` [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest Amery Hung
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

This selftest shows a bare-minimum fifo qdisc, which simply enqueues skbs
at the back of a bpf list and dequeues them from the front of the list.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/bpf_qdisc.c      | 161 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_qdisc_common.h    |  27 +++
 .../selftests/bpf/progs/bpf_qdisc_fifo.c      | 117 +++++++++++++
 4 files changed, 306 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c378d5d07e02..6b0cab55bd2d 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -71,6 +71,7 @@ CONFIG_NET_IPGRE=y
 CONFIG_NET_IPGRE_DEMUX=y
 CONFIG_NET_IPIP=y
 CONFIG_NET_MPLS_GSO=y
+CONFIG_NET_SCH_BPF=y
 CONFIG_NET_SCH_FQ=y
 CONFIG_NET_SCH_INGRESS=y
 CONFIG_NET_SCHED=y
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
new file mode 100644
index 000000000000..295d0216e70f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -0,0 +1,161 @@
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <test_progs.h>
+
+#include "network_helpers.h"
+#include "bpf_qdisc_fifo.skel.h"
+
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
+#define LO_IFINDEX 1
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static int stop;
+
+static void *server(void *arg)
+{
+	int lfd = (int)(long)arg, err = 0, fd;
+	ssize_t nr_sent = 0, bytes = 0;
+	char batch[1500];
+
+	fd = accept(lfd, NULL, NULL);
+	while (fd == -1) {
+		if (errno == EINTR)
+			continue;
+		err = -errno;
+		goto done;
+	}
+
+	if (settimeo(fd, 0)) {
+		err = -errno;
+		goto done;
+	}
+
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_sent = send(fd, &batch,
+			       MIN(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_sent == -1 && errno == EINTR)
+			continue;
+		if (nr_sent == -1) {
+			err = -errno;
+			break;
+		}
+		bytes += nr_sent;
+	}
+
+	ASSERT_EQ(bytes, total_bytes, "send");
+
+done:
+	if (fd >= 0)
+		close(fd);
+	if (err) {
+		WRITE_ONCE(stop, 1);
+		return ERR_PTR(err);
+	}
+	return NULL;
+}
+
+static void do_test(char *qdisc)
+{
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+			    .attach_point = BPF_TC_QDISC,
+			    .parent = TC_H_ROOT,
+			    .handle = 0x8000000,
+			    .qdisc = qdisc);
+	struct sockaddr_in6 sa6 = {};
+	ssize_t nr_recv = 0, bytes = 0;
+	int lfd = -1, fd = -1;
+	pthread_t srv_thread;
+	socklen_t addrlen = sizeof(sa6);
+	void *thread_ret;
+	char batch[1500];
+	int err;
+
+	WRITE_ONCE(stop, 0);
+
+	err = bpf_tc_hook_create(&hook);
+	if (!ASSERT_OK(err, "attach qdisc"))
+		return;
+
+	lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_NEQ(lfd, -1, "socket")) {
+		bpf_tc_hook_destroy(&hook);
+		return;
+	}
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (!ASSERT_NEQ(fd, -1, "socket")) {
+		bpf_tc_hook_destroy(&hook);
+		close(lfd);
+		return;
+	}
+
+	if (settimeo(lfd, 0) || settimeo(fd, 0))
+		goto done;
+
+	err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+	if (!ASSERT_NEQ(err, -1, "getsockname"))
+		goto done;
+
+	/* connect to server */
+	err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+	if (!ASSERT_NEQ(err, -1, "connect"))
+		goto done;
+
+	err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+	if (!ASSERT_OK(err, "pthread_create"))
+		goto done;
+
+	/* recv total_bytes */
+	while (bytes < total_bytes && !READ_ONCE(stop)) {
+		nr_recv = recv(fd, &batch,
+			       MIN(total_bytes - bytes, sizeof(batch)), 0);
+		if (nr_recv == -1 && errno == EINTR)
+			continue;
+		if (nr_recv == -1)
+			break;
+		bytes += nr_recv;
+	}
+
+	ASSERT_EQ(bytes, total_bytes, "recv");
+
+	WRITE_ONCE(stop, 1);
+	pthread_join(srv_thread, &thread_ret);
+	ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
+
+done:
+	close(lfd);
+	close(fd);
+
+	bpf_tc_hook_destroy(&hook);
+	return;
+}
+
+static void test_fifo(void)
+{
+	struct bpf_qdisc_fifo *fifo_skel;
+	struct bpf_link *link;
+
+	fifo_skel = bpf_qdisc_fifo__open_and_load();
+	if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+		bpf_qdisc_fifo__destroy(fifo_skel);
+		return;
+	}
+
+	do_test("bpf_fifo");
+
+	bpf_link__destroy(link);
+	bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+void test_bpf_qdisc(void)
+{
+	if (test__start_subtest("fifo"))
+		test_fifo();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
new file mode 100644
index 000000000000..62a778f94908
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -0,0 +1,27 @@
+#ifndef _BPF_QDISC_COMMON_H
+#define _BPF_QDISC_COMMON_H
+
+#define NET_XMIT_SUCCESS        0x00
+#define NET_XMIT_DROP           0x01    /* skb dropped                  */
+#define NET_XMIT_CN             0x02    /* congestion notification      */
+
+#define TC_PRIO_CONTROL  7
+#define TC_PRIO_MAX      15
+
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_kfree_skb(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
+
+static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
+{
+	return (struct qdisc_skb_cb *)skb->cb;
+}
+
+static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb)
+{
+	return qdisc_skb_cb(skb)->pkt_len;
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
new file mode 100644
index 000000000000..705e7da325da
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
@@ -0,0 +1,117 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct skb_node {
+	struct sk_buff __kptr * skb;
+	struct bpf_list_node node;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock q_fifo_lock;
+private(A) struct bpf_list_head q_fifo __contains(skb_node, node);
+
+SEC("struct_ops/bpf_fifo_enqueue")
+int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+	     struct bpf_sk_buff_ptr *to_free)
+{
+	struct skb_node *skbn;
+	u32 pkt_len;
+
+	if (sch->q.qlen == sch->limit)
+		goto drop;
+
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn)
+		goto drop;
+
+	pkt_len = qdisc_pkt_len(skb);
+
+	sch->q.qlen++;
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (skb)
+		bpf_qdisc_skb_drop(skb, to_free);
+
+	bpf_spin_lock(&q_fifo_lock);
+	bpf_list_push_back(&q_fifo, &skbn->node);
+	bpf_spin_unlock(&q_fifo_lock);
+
+	sch->qstats.backlog += pkt_len;
+	return NET_XMIT_SUCCESS;
+drop:
+	bpf_qdisc_skb_drop(skb, to_free);
+	return NET_XMIT_DROP;
+}
+
+SEC("struct_ops/bpf_fifo_dequeue")
+struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
+{
+	struct bpf_list_node *node;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+
+	bpf_spin_lock(&q_fifo_lock);
+	node = bpf_list_pop_front(&q_fifo);
+	bpf_spin_unlock(&q_fifo_lock);
+	if (!node)
+		return NULL;
+
+	skbn = container_of(node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+	if (!skb)
+		return NULL;
+
+	sch->qstats.backlog -= qdisc_pkt_len(skb);
+	bpf_qdisc_bstats_update(sch, skb);
+	sch->q.qlen--;
+
+	return skb;
+}
+
+SEC("struct_ops/bpf_fifo_init")
+int BPF_PROG(bpf_fifo_init, struct Qdisc *sch, struct nlattr *opt,
+	     struct netlink_ext_ack *extack)
+{
+	sch->limit = 1000;
+	return 0;
+}
+
+SEC("struct_ops/bpf_fifo_reset")
+void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
+{
+	struct bpf_list_node *node;
+	struct skb_node *skbn;
+	int i;
+
+	bpf_for(i, 0, sch->q.qlen) {
+		struct sk_buff *skb = NULL;
+
+		bpf_spin_lock(&q_fifo_lock);
+		node = bpf_list_pop_front(&q_fifo);
+		bpf_spin_unlock(&q_fifo_lock);
+
+		if (!node)
+			break;
+
+		skbn = container_of(node, struct skb_node, node);
+		skb = bpf_kptr_xchg(&skbn->skb, skb);
+		if (skb)
+			bpf_kfree_skb(skb);
+		bpf_obj_drop(skbn);
+	}
+	sch->q.qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fifo = {
+	.enqueue   = (void *)bpf_fifo_enqueue,
+	.dequeue   = (void *)bpf_fifo_dequeue,
+	.init      = (void *)bpf_fifo_init,
+	.reset     = (void *)bpf_fifo_reset,
+	.id        = "bpf_fifo",
+};
+
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (12 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test Amery Hung
@ 2024-12-20 19:55 ` Amery Hung
  2025-01-09 23:36   ` Martin KaFai Lau
  2025-01-02 17:29 ` [PATCH bpf-next v2 00/14] bpf qdisc Toke Høiland-Jørgensen
  2025-01-10  1:43 ` Martin KaFai Lau
  15 siblings, 1 reply; 34+ messages in thread
From: Amery Hung @ 2024-12-20 19:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

From: Amery Hung <amery.hung@bytedance.com>

This test implements a more sophisticated qdisc using bpf. The bpf fair-
queueing (fq) qdisc gives each flow an equal chance to transmit data. It
also respects the skb timestamp (skb->tstamp) for rate limiting.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 .../selftests/bpf/prog_tests/bpf_qdisc.c      |  24 +
 .../selftests/bpf/progs/bpf_qdisc_fq.c        | 726 ++++++++++++++++++
 2 files changed, 750 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 295d0216e70f..394bf5a4adae 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -4,6 +4,7 @@
 
 #include "network_helpers.h"
 #include "bpf_qdisc_fifo.skel.h"
+#include "bpf_qdisc_fq.skel.h"
 
 #ifndef ENOTSUPP
 #define ENOTSUPP 524
@@ -154,8 +155,31 @@ static void test_fifo(void)
 	bpf_qdisc_fifo__destroy(fifo_skel);
 }
 
+static void test_fq(void)
+{
+	struct bpf_qdisc_fq *fq_skel;
+	struct bpf_link *link;
+
+	fq_skel = bpf_qdisc_fq__open_and_load();
+	if (!ASSERT_OK_PTR(fq_skel, "bpf_qdisc_fq__open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(fq_skel->maps.fq);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+		bpf_qdisc_fq__destroy(fq_skel);
+		return;
+	}
+
+	do_test("bpf_fq");
+
+	bpf_link__destroy(link);
+	bpf_qdisc_fq__destroy(fq_skel);
+}
+
 void test_bpf_qdisc(void)
 {
 	if (test__start_subtest("fifo"))
 		test_fifo();
+	if (test__start_subtest("fq"))
+		test_fq();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
new file mode 100644
index 000000000000..2af2e39f9ed7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
@@ -0,0 +1,726 @@
+#include <vmlinux.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define NSEC_PER_USEC 1000L
+#define NSEC_PER_SEC 1000000000L
+
+#define NUM_QUEUE (1 << 20)
+
+struct fq_bpf_data {
+	u32 quantum;
+	u32 initial_quantum;
+	u32 flow_refill_delay;
+	u32 flow_plimit;
+	u64 horizon;
+	u32 orphan_mask;
+	u32 timer_slack;
+	u64 time_next_delayed_flow;
+	u64 unthrottle_latency_ns;
+	u8 horizon_drop;
+	u32 new_flow_cnt;
+	u32 old_flow_cnt;
+	u64 ktime_cache;
+};
+
+enum {
+	CLS_RET_PRIO	= 0,
+	CLS_RET_NONPRIO = 1,
+	CLS_RET_ERR	= 2,
+};
+
+struct skb_node {
+	u64 tstamp;
+	struct sk_buff __kptr * skb;
+	struct bpf_rb_node node;
+};
+
+struct fq_flow_node {
+	int credit;
+	u32 qlen;
+	u64 age;
+	u64 time_next_packet;
+	struct bpf_list_node list_node;
+	struct bpf_rb_node rb_node;
+	struct bpf_rb_root queue __contains(skb_node, node);
+	struct bpf_spin_lock lock;
+	struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+	bool stop_iter;
+	u64 expire;
+	u64 now;
+};
+
+struct remove_flows_ctx {
+	bool gc_only;
+	u32 reset_cnt;
+	u32 reset_max;
+};
+
+struct unset_throttled_flows_ctx {
+	bool unset_all;
+	u64 now;
+};
+
+struct fq_stashed_flow {
+	struct fq_flow_node __kptr * flow;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, struct fq_stashed_flow);
+	__uint(max_entries, NUM_QUEUE);
+} fq_nonprio_flows SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u64);
+	__type(value, struct fq_stashed_flow);
+	__uint(max_entries, 1);
+} fq_prio_flows SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+private(D) struct fq_bpf_data q;
+
+/* Wrapper for bpf_kptr_xchg that expects NULL dst */
+static void bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+	void *ret;
+
+	ret = bpf_kptr_xchg(map_val, ptr);
+	if (ret)
+		bpf_obj_drop(ret);
+}
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct skb_node *skbn_a;
+	struct skb_node *skbn_b;
+
+	skbn_a = container_of(a, struct skb_node, node);
+	skbn_b = container_of(b, struct skb_node, node);
+
+	return skbn_a->tstamp < skbn_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct fq_flow_node *flow_a;
+	struct fq_flow_node *flow_b;
+
+	flow_a = container_of(a, struct fq_flow_node, rb_node);
+	flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+	return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_front(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+	*flow_cnt += 1;
+}
+
+static void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_back(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+	*flow_cnt += 1;
+}
+
+static void
+fq_flows_remove_front(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		      struct bpf_list_node **node, u32 *flow_cnt)
+{
+	bpf_spin_lock(lock);
+	*node = bpf_list_pop_front(head);
+	bpf_spin_unlock(lock);
+	*flow_cnt -= 1;
+}
+
+static bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+	struct bpf_list_node *node;
+
+	bpf_spin_lock(lock);
+	node = bpf_list_pop_front(head);
+	if (node) {
+		bpf_list_push_front(head, node);
+		bpf_spin_unlock(lock);
+		return false;
+	}
+	bpf_spin_unlock(lock);
+
+	return true;
+}
+
+/* flow->age is used to denote the state of the flow (not-detached, detached, throttled)
+ * as well as the timestamp when the flow is detached.
+ *
+ * 0: not-detached
+ * 1 - (~0ULL-1): detached
+ * ~0ULL: throttled
+ */
+static void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+	flow->age = bpf_jiffies64();
+}
+
+static bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+	return flow->age != 0 && flow->age != ~0ULL;
+}
+
+static bool sk_listener(struct sock *sk)
+{
+	return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static void fq_gc(void);
+
+static int fq_new_flow(void *flow_map, struct fq_stashed_flow **sflow, u64 hash)
+{
+	struct fq_stashed_flow tmp = {};
+	struct fq_flow_node *flow;
+	int ret;
+
+	flow = bpf_obj_new(typeof(*flow));
+	if (!flow)
+		return -ENOMEM;
+
+	flow->credit = q.initial_quantum,
+	flow->qlen = 0,
+	flow->age = 1,
+	flow->time_next_packet = 0,
+
+	ret = bpf_map_update_elem(flow_map, &hash, &tmp, 0);
+	if (ret == -ENOMEM) {
+		fq_gc();
+		bpf_map_update_elem(&fq_nonprio_flows, &hash, &tmp, 0);
+	}
+
+	*sflow = bpf_map_lookup_elem(flow_map, &hash);
+	if (!*sflow) {
+		bpf_obj_drop(flow);
+		return -ENOMEM;
+	}
+
+	bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+	return 0;
+}
+
+static int
+fq_classify(struct sk_buff *skb, struct fq_stashed_flow **sflow)
+{
+	struct sock *sk = skb->sk;
+	int ret = CLS_RET_NONPRIO;
+	u64 hash = 0;
+
+	if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+		*sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+		ret = CLS_RET_PRIO;
+	} else {
+		if (!sk || sk_listener(sk)) {
+			hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+			/* Avoid collision with an existing flow hash, which
+			 * only uses the lower 32 bits of hash, by setting the
+			 * upper half of hash to 1.
+			 */
+			hash |= (1ULL << 32);
+		} else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+			hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+			hash |= (1ULL << 32);
+		} else {
+			hash = sk->__sk_common.skc_hash;
+		}
+		*sflow = bpf_map_lookup_elem(&fq_nonprio_flows, &hash);
+	}
+
+	if (!*sflow)
+		ret = fq_new_flow(&fq_nonprio_flows, sflow, hash) < 0 ?
+		      CLS_RET_ERR : CLS_RET_NONPRIO;
+
+	return ret;
+}
+
+static bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+	return (s64)skb->tstamp > (s64)(q.ktime_cache + q.horizon);
+}
+
+SEC("struct_ops/bpf_fq_enqueue")
+int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+	     struct bpf_sk_buff_ptr *to_free)
+{
+	struct fq_flow_node *flow = NULL, *flow_copy;
+	struct fq_stashed_flow *sflow;
+	u64 time_to_send, jiffies;
+	struct skb_node *skbn;
+	int ret;
+
+	if (sch->q.qlen >= sch->limit)
+		goto drop;
+
+	if (!skb->tstamp) {
+		time_to_send = q.ktime_cache = bpf_ktime_get_ns();
+	} else {
+		if (fq_packet_beyond_horizon(skb)) {
+			q.ktime_cache = bpf_ktime_get_ns();
+			if (fq_packet_beyond_horizon(skb)) {
+				if (q.horizon_drop)
+					goto drop;
+
+				skb->tstamp = q.ktime_cache + q.horizon;
+			}
+		}
+		time_to_send = skb->tstamp;
+	}
+
+	ret = fq_classify(skb, &sflow);
+	if (ret == CLS_RET_ERR)
+		goto drop;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		goto drop;
+
+	if (ret == CLS_RET_NONPRIO) {
+		if (flow->qlen >= q.flow_plimit) {
+			bpf_kptr_xchg_back(&sflow->flow, flow);
+			goto drop;
+		}
+
+		if (fq_flow_is_detached(flow)) {
+			flow_copy = bpf_refcount_acquire(flow);
+
+			jiffies = bpf_jiffies64();
+			if ((s64)(jiffies - (flow_copy->age + q.flow_refill_delay)) > 0) {
+				if (flow_copy->credit < q.quantum)
+					flow_copy->credit = q.quantum;
+			}
+			flow_copy->age = 0;
+			fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy,
+					  &q.new_flow_cnt);
+		}
+	}
+
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn) {
+		bpf_kptr_xchg_back(&sflow->flow, flow);
+		goto drop;
+	}
+
+	skbn->tstamp = skb->tstamp = time_to_send;
+
+	sch->qstats.backlog += qdisc_pkt_len(skb);
+
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (skb)
+		bpf_qdisc_skb_drop(skb, to_free);
+
+	bpf_spin_lock(&flow->lock);
+	bpf_rbtree_add(&flow->queue, &skbn->node, skbn_tstamp_less);
+	bpf_spin_unlock(&flow->lock);
+
+	flow->qlen++;
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	sch->q.qlen++;
+	return NET_XMIT_SUCCESS;
+
+drop:
+	bpf_qdisc_skb_drop(skb, to_free);
+	sch->qstats.drops++;
+	return NET_XMIT_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, struct unset_throttled_flows_ctx *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct fq_flow_node *flow;
+
+	bpf_spin_lock(&fq_delayed_lock);
+
+	node = bpf_rbtree_first(&fq_delayed);
+	if (!node) {
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+	if (!ctx->unset_all && flow->time_next_packet > ctx->now) {
+		q.time_next_delayed_flow = flow->time_next_packet;
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+	bpf_spin_unlock(&fq_delayed_lock);
+
+	if (!node)
+		return 1;
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+	flow->age = 0;
+	fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+
+	return 0;
+}
+
+static void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+	flow->age = ~0ULL;
+
+	if (q.time_next_delayed_flow > flow->time_next_packet)
+		q.time_next_delayed_flow = flow->time_next_packet;
+
+	bpf_spin_lock(&fq_delayed_lock);
+	bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+	bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static void fq_check_throttled(u64 now)
+{
+	struct unset_throttled_flows_ctx ctx = {
+		.unset_all = false,
+		.now = now,
+	};
+	unsigned long sample;
+
+	if (q.time_next_delayed_flow > now)
+		return;
+
+	sample = (unsigned long)(now - q.time_next_delayed_flow);
+	q.unthrottle_latency_ns -= q.unthrottle_latency_ns >> 3;
+	q.unthrottle_latency_ns += sample >> 3;
+
+	q.time_next_delayed_flow = ~0ULL;
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &ctx, 0);
+}
+
+static struct sk_buff*
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+	u64 time_next_packet, time_to_send;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+	struct bpf_list_head *head;
+	struct bpf_list_node *node;
+	struct bpf_spin_lock *lock;
+	struct fq_flow_node *flow;
+	struct skb_node *skbn;
+	bool is_empty;
+	u32 *cnt;
+
+	if (q.new_flow_cnt) {
+		head = &fq_new_flows;
+		lock = &fq_new_flows_lock;
+		cnt = &q.new_flow_cnt;
+	} else if (q.old_flow_cnt) {
+		head = &fq_old_flows;
+		lock = &fq_old_flows_lock;
+		cnt = &q.old_flow_cnt;
+	} else {
+		if (q.time_next_delayed_flow != ~0ULL)
+			ctx->expire = q.time_next_delayed_flow;
+		goto break_loop;
+	}
+
+	fq_flows_remove_front(head, lock, &node, cnt);
+	if (!node)
+		goto break_loop;
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	if (flow->credit <= 0) {
+		flow->credit += q.quantum;
+		fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+		return NULL;
+	}
+
+	bpf_spin_lock(&flow->lock);
+	rb_node = bpf_rbtree_first(&flow->queue);
+	if (!rb_node) {
+		bpf_spin_unlock(&flow->lock);
+		is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+		if (head == &fq_new_flows && !is_empty) {
+			fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+		} else {
+			fq_flow_set_detached(flow);
+			bpf_obj_drop(flow);
+		}
+		return NULL;
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	time_to_send = skbn->tstamp;
+
+	time_next_packet = (time_to_send > flow->time_next_packet) ?
+		time_to_send : flow->time_next_packet;
+	if (ctx->now < time_next_packet) {
+		bpf_spin_unlock(&flow->lock);
+		flow->time_next_packet = time_next_packet;
+		fq_flow_set_throttled(flow);
+		return NULL;
+	}
+
+	rb_node = bpf_rbtree_remove(&flow->queue, rb_node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!rb_node)
+		goto add_flow_and_break;
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+
+	if (!skb)
+		goto add_flow_and_break;
+
+	flow->credit -= qdisc_skb_cb(skb)->pkt_len;
+	flow->qlen--;
+
+add_flow_and_break:
+	fq_flows_add_head(head, lock, flow, cnt);
+
+break_loop:
+	ctx->stop_iter = true;
+	return skb;
+}
+
+static struct sk_buff *fq_dequeue_prio(void)
+{
+	struct fq_flow_node *flow = NULL;
+	struct fq_stashed_flow *sflow;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+	u64 hash = 0;
+
+	sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+	if (!sflow)
+		return NULL;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		return NULL;
+
+	bpf_spin_lock(&flow->lock);
+	rb_node = bpf_rbtree_first(&flow->queue);
+	if (!rb_node) {
+		bpf_spin_unlock(&flow->lock);
+		goto out;
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!rb_node)
+		goto out;
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+
+out:
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	return skb;
+}
+
+SEC("struct_ops/bpf_fq_dequeue")
+struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
+{
+	struct dequeue_nonprio_ctx cb_ctx = {};
+	struct sk_buff *skb = NULL;
+	int i;
+
+	if (!sch->q.qlen)
+		goto out;
+
+	skb = fq_dequeue_prio();
+	if (skb)
+		goto dequeue;
+
+	q.ktime_cache = cb_ctx.now = bpf_ktime_get_ns();
+	fq_check_throttled(q.ktime_cache);
+	bpf_for(i, 0, sch->limit) {
+		skb = fq_dequeue_nonprio_flows(i, &cb_ctx);
+		if (cb_ctx.stop_iter)
+			break;
+	};
+
+	if (skb) {
+dequeue:
+		sch->q.qlen--;
+		sch->qstats.backlog -= qdisc_pkt_len(skb);
+		bpf_qdisc_bstats_update(sch, skb);
+		return skb;
+	}
+
+	if (cb_ctx.expire)
+		bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q.timer_slack);
+out:
+	return NULL;
+}
+
+static int fq_remove_flows_in_list(u32 index, void *ctx)
+{
+	struct bpf_list_node *node;
+	struct fq_flow_node *flow;
+
+	bpf_spin_lock(&fq_new_flows_lock);
+	node = bpf_list_pop_front(&fq_new_flows);
+	bpf_spin_unlock(&fq_new_flows_lock);
+	if (!node) {
+		bpf_spin_lock(&fq_old_flows_lock);
+		node = bpf_list_pop_front(&fq_old_flows);
+		bpf_spin_unlock(&fq_old_flows_lock);
+		if (!node)
+			return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	bpf_obj_drop(flow);
+
+	return 0;
+}
+
+extern unsigned CONFIG_HZ __kconfig;
+
+/* limit number of collected flows per round */
+#define FQ_GC_MAX 8
+#define FQ_GC_AGE (3*CONFIG_HZ)
+
+static bool fq_gc_candidate(struct fq_flow_node *flow)
+{
+	u64 jiffies = bpf_jiffies64();
+
+	return fq_flow_is_detached(flow) &&
+	       ((s64)(jiffies - (flow->age + FQ_GC_AGE)) > 0);
+}
+
+static int
+fq_remove_flows(struct bpf_map *flow_map, u64 *hash,
+		struct fq_stashed_flow *sflow, struct remove_flows_ctx *ctx)
+{
+	struct fq_flow_node *flow = NULL;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (flow) {
+		if (!ctx->gc_only || fq_gc_candidate(flow)) {
+			bpf_obj_drop(flow);
+			ctx->reset_cnt++;
+		} else {
+			bpf_kptr_xchg_back(&sflow->flow, flow);
+		}
+	}
+
+	return ctx->reset_cnt < ctx->reset_max ? 0 : 1;
+}
+
+static void fq_gc(void)
+{
+	struct remove_flows_ctx cb_ctx = {
+		.gc_only = true,
+		.reset_cnt = 0,
+		.reset_max = FQ_GC_MAX,
+	};
+
+	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &cb_ctx, 0);
+}
+
+SEC("struct_ops/bpf_fq_reset")
+void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
+{
+	struct unset_throttled_flows_ctx utf_ctx = {
+		.unset_all = true,
+	};
+	struct remove_flows_ctx rf_ctx = {
+		.gc_only = false,
+		.reset_cnt = 0,
+		.reset_max = NUM_QUEUE,
+	};
+	struct fq_stashed_flow *sflow;
+	u64 hash = 0;
+
+	sch->q.qlen = 0;
+	sch->qstats.backlog = 0;
+
+	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &rf_ctx, 0);
+
+	rf_ctx.reset_cnt = 0;
+	bpf_for_each_map_elem(&fq_prio_flows, fq_remove_flows, &rf_ctx, 0);
+	fq_new_flow(&fq_prio_flows, &sflow, hash);
+
+	bpf_loop(NUM_QUEUE, fq_remove_flows_in_list, NULL, 0);
+	q.new_flow_cnt = 0;
+	q.old_flow_cnt = 0;
+
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &utf_ctx, 0);
+
+	return;
+}
+
+SEC("struct_ops/bpf_fq_init")
+int BPF_PROG(bpf_fq_init, struct Qdisc *sch, struct nlattr *opt,
+	     struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = sch->dev_queue->dev;
+	u32 psched_mtu = dev->mtu + dev->hard_header_len;
+	struct fq_stashed_flow *sflow;
+	u64 hash = 0;
+
+	if (fq_new_flow(&fq_prio_flows, &sflow, hash) < 0)
+		return -ENOMEM;
+
+	sch->limit = 10000;
+	q.initial_quantum = 10 * psched_mtu;
+	q.quantum = 2 * psched_mtu;
+	q.flow_refill_delay = 40;
+	q.flow_plimit = 100;
+	q.horizon = 10ULL * NSEC_PER_SEC;
+	q.horizon_drop = 1;
+	q.orphan_mask = 1024 - 1;
+	q.timer_slack = 10 * NSEC_PER_USEC;
+	q.time_next_delayed_flow = ~0ULL;
+	q.unthrottle_latency_ns = 0ULL;
+	q.new_flow_cnt = 0;
+	q.old_flow_cnt = 0;
+
+	return 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fq = {
+	.enqueue   = (void *)bpf_fq_enqueue,
+	.dequeue   = (void *)bpf_fq_dequeue,
+	.reset     = (void *)bpf_fq_reset,
+	.init      = (void *)bpf_fq_init,
+	.id        = "bpf_fq",
+};
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 00/14] bpf qdisc
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (13 preceding siblings ...)
  2024-12-20 19:55 ` [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2025-01-02 17:29 ` Toke Høiland-Jørgensen
  2025-01-10  1:43 ` Martin KaFai Lau
  15 siblings, 0 replies; 34+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-01-02 17:29 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, ameryhung, amery.hung

Amery Hung <ameryhung@gmail.com> writes:

> Hi all,
>
> This patchset aims to support implementing qdisc using bpf struct_ops.
> This version takes a step back and only implements the minimum support
> for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
> directly and 2) classful qdisc are deferred to future patchsets.
>
> * Overview *
>
> This series supports implementing qdisc using bpf struct_ops. bpf qdisc
> aims to be a flexible and easy-to-use infrastructure that allows users to
> quickly experiment with different scheduling algorithms/policies. It only
> requires users to implement core qdisc logic using bpf and implements the
> mundane part for them. In addition, the ability to easily communicate
> between qdisc and other components will also bring new opportunities for
> new applications and optimizations.

This is very cool, and I'm thrilled to see this work getting closer to
being merged! :)

> * struct_ops changes *
>
> To make struct_ops works better with bpf qdisc, two new changes are
> introduced to bpf specifically for struct_ops programs. Frist, we
> introduce "__ref" postfix for arguments in stub functions in patch 1-2.
> It allows Qdisc_ops->enqueue to acquire an unique referenced kptr to the
> skb argument. Through the reference object tracking mechanism in
> the verifier, we can make sure that the acquired skb will be either
> enqueued or dropped. Besides, no duplicate references can be acquired.
> Then, we allow a referenced kptr to be returned from struct_ops programs
> so that we can return an skb naturally. This is done and tested in patch 3
> and 4.
>
> * Performance of bpf qdisc *
>
> We tested several bpf qdiscs included in the selftests and their in-tree
> counterparts to give you a sense of the performance of qdisc implemented
> in bpf.
>
> The implementation of bpf_fq is fairly complex and slightly different from
> fq so later we only compare the two fifo qdiscs. bpf_fq implements the
> same fair queueing algorithm in fq, but without flow hash collision
> avoidance and garbage collection of inactive flows. bpf_fifo uses a single
> bpf_list as a queue instead of three queues for different priorities in
> pfifo_fast. The time complexity of fifo however should be similar since the
> queue selection time is negligible.
>
> Test setup:
>
>     client -> qdisc ------------->  server
>     ~~~~~~~~~~~~~~~                 ~~~~~~
>     nested VM1 @ DC1               VM2 @ DC2
>
> Throghput: iperf3 -t 600, 5 times
>
>       Qdisc        Average (GBits/sec)
>     ----------     -------------------
>     pfifo_fast       12.52 ± 0.26
>     bpf_fifo         11.72 ± 0.32 
>     fq               10.24 ± 0.13
>     bpf_fq           11.92 ± 0.64 
>
> Latency: sockperf pp --tcp -t 600, 5 times
>
>       Qdisc        Average (usec)
>     ----------     --------------
>     pfifo_fast      244.58 ± 7.93
>     bpf_fifo        244.92 ± 15.22
>     fq              234.30 ± 19.25
>     bpf_fq          221.34 ± 10.76
>
> Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
> implementatioin is consistent with previous observation (v8 throughput
> test on a loopback device). This should be able to be mitigated by
> supporting adding skb to bpf_list or bpf_rbtree directly in the future.

This looks pretty decent!

> * Clean up skb in bpf qdisc during reset *
>
> The current implementation relies on bpf qdisc implementors to correctly
> release skbs in queues (bpf graphs or maps) in .reset, which might not be
> a safe thing to do. The solution as Martin has suggested would be
> supporting private data in struct_ops. This can also help simplifying
> implementation of qdisc that works with mq. For examples, qdiscs in the
> selftest mostly use global data. Therefore, even if user add multiple
> qdisc instances under mq, they would still share the same queue.

So is the plan to fix this before merging this series? Seems like a bit
of a footgun, otherwise?

-Toke


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer
  2024-12-20 19:55 ` [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
@ 2025-01-09  0:20   ` Martin KaFai Lau
  2025-01-09 15:00     ` Amery Hung
  0 siblings, 1 reply; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-09  0:20 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On 12/20/24 11:55 AM, Amery Hung wrote:
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index 1c92bfcc3847..bbe7aded6f24 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -8,6 +8,10 @@
>   
>   static struct bpf_struct_ops bpf_Qdisc_ops;
>   
> +struct bpf_sched_data {
> +	struct qdisc_watchdog watchdog;
> +};
> +
>   struct bpf_sk_buff_ptr {
>   	struct sk_buff *skb;
>   };
> @@ -108,6 +112,46 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
>   	return 0;
>   }
>   
> +BTF_ID_LIST(bpf_qdisc_init_prologue_ids)
> +BTF_ID(func, bpf_qdisc_init_prologue)
> +
> +static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
> +				  const struct bpf_prog *prog)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +
> +	if (strcmp(prog->aux->attach_func_name, "init"))
> +		return 0;
> +
> +	*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
> +	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
> +	*insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_init_prologue_ids[0]);

I was wondering if patch 7 would still be needed if BPF_EMIT_CALL() and
BPF_CALL_1() were used instead, so that it looks more like a bpf helper rather
than a kfunc. I tried but failed at the
"fn = env->ops->get_func_proto(insn->imm, env->prog);" in do_misc_fixups().
I think the change in patch 7 is simple enough compared to getting
get_func_proto() to work for this case.

> +	*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
> +	*insn++ = prog->insnsi[0];
> +
> +	return insn - insn_buf;
> +}
> +
> +BTF_ID_LIST(bpf_qdisc_reset_destroy_epilogue_ids)
> +BTF_ID(func, bpf_qdisc_reset_destroy_epilogue)
> +
> +static int bpf_qdisc_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog,
> +				  s16 ctx_stack_off)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +
> +	if (strcmp(prog->aux->attach_func_name, "reset") &&
> +	    strcmp(prog->aux->attach_func_name, "destroy"))
> +		return 0;
> +
> +	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off);
> +	*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
> +	*insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_reset_destroy_epilogue_ids[0]);
> +	*insn++ = BPF_EXIT_INSN();
> +
> +	return insn - insn_buf;
> +}
> +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2025-01-09 15:00   ` Amery Hung
  2025-01-10  0:28   ` Martin KaFai Lau
  2025-01-10  1:20   ` Jakub Kicinski
  2 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2025-01-09 15:00 UTC (permalink / raw)
  To: netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, amery.hung

@Cong, I am addressing your comments from v1 here. First, bpf qdisc will
be disabled by default (CONFIG_SCH_BPF=n). Second, I think it makes more
sense to keep the changes to sch_generic.c and sch_api.c as part of this
commit, since bpf qdisc is built on top of struct_ops (similar to what bpf
congestion control did).

Please let me know if you have any other feedback. I'd also appreciate
it if you could approve this set.

Thanks!
Amery


On Fri, Dec 20, 2024 at 11:56 AM Amery Hung <ameryhung@gmail.com> wrote:
>
> From: Amery Hung <amery.hung@bytedance.com>
>
> Enable users to implement a classless qdisc using bpf. The last few
> patches in this series has prepared struct_ops to support core operators
> in Qdisc_ops. The recent advancement in bpf such as allocated
> objects, bpf list and bpf rbtree has also provided powerful and flexible
> building blocks to realize sophisticated scheduling algorithms. Therefore,
> in this patch, we start allowing qdisc to be implemented using bpf
> struct_ops. Users can implement Qdisc_ops.{enqueue, dequeue, init, reset,
> and .destroy in Qdisc_ops in bpf and register the qdisc dynamically into
> the kernel.
>
> Co-developed-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>  include/linux/btf.h     |   1 +
>  kernel/bpf/btf.c        |   4 +-
>  net/sched/Kconfig       |  12 +++
>  net/sched/Makefile      |   1 +
>  net/sched/bpf_qdisc.c   | 207 ++++++++++++++++++++++++++++++++++++++++
>  net/sched/sch_api.c     |   7 +-
>  net/sched/sch_generic.c |   3 +-
>  7 files changed, 229 insertions(+), 6 deletions(-)
>  create mode 100644 net/sched/bpf_qdisc.c
>
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 4214e76c9168..eb16218fdf52 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -563,6 +563,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
>  const char *btf_str_by_offset(const struct btf *btf, u32 offset);
>  struct btf *btf_parse_vmlinux(void);
>  struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
> +u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
>  u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
>                                const struct bpf_prog *prog);
>  u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index c2f4f84e539d..78476cebefe3 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -6375,8 +6375,8 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
>         return btf_type_is_int(t);
>  }
>
> -static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
> -                          int off)
> +u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
> +                   int off)
>  {
>         const struct btf_param *args;
>         const struct btf_type *t;
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 8180d0c12fce..ccd0255da5a5 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -403,6 +403,18 @@ config NET_SCH_ETS
>
>           If unsure, say N.
>
> +config NET_SCH_BPF
> +       bool "BPF-based Qdisc"
> +       depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
> +       help
> +         This option allows BPF-based queueing disiplines. With BPF struct_ops,
> +         users can implement supported operators in Qdisc_ops using BPF programs.
> +         The queue holding skb can be built with BPF maps or graphs.
> +
> +         Say Y here if you want to use BPF-based Qdisc.
> +
> +         If unsure, say N.
> +
>  menuconfig NET_SCH_DEFAULT
>         bool "Allow override default queue discipline"
>         help
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 82c3f78ca486..904d784902d1 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)  += sch_fq_pie.o
>  obj-$(CONFIG_NET_SCH_CBS)      += sch_cbs.o
>  obj-$(CONFIG_NET_SCH_ETF)      += sch_etf.o
>  obj-$(CONFIG_NET_SCH_TAPRIO)   += sch_taprio.o
> +obj-$(CONFIG_NET_SCH_BPF)      += bpf_qdisc.o
>
>  obj-$(CONFIG_NET_CLS_U32)      += cls_u32.o
>  obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> new file mode 100644
> index 000000000000..4b7d013f4f5c
> --- /dev/null
> +++ b/net/sched/bpf_qdisc.c
> @@ -0,0 +1,207 @@
> +#include <linux/types.h>
> +#include <linux/bpf_verifier.h>
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/filter.h>
> +#include <net/pkt_sched.h>
> +#include <net/pkt_cls.h>
> +
> +static struct bpf_struct_ops bpf_Qdisc_ops;
> +
> +struct bpf_sk_buff_ptr {
> +       struct sk_buff *skb;
> +};
> +
> +static int bpf_qdisc_init(struct btf *btf)
> +{
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
> +                        const struct bpf_prog *prog)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_tail_call:
> +               return NULL;
> +       default:
> +               return bpf_base_func_proto(func_id, prog);
> +       }
> +}
> +
> +BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
> +BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
> +
> +static bool bpf_qdisc_is_valid_access(int off, int size,
> +                                     enum bpf_access_type type,
> +                                     const struct bpf_prog *prog,
> +                                     struct bpf_insn_access_aux *info)
> +{
> +       struct btf *btf = prog->aux->attach_btf;
> +       u32 arg;
> +
> +       arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
> +       if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
> +               if (arg == 2 && type == BPF_READ) {
> +                       info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> +                       info->btf = btf;
> +                       info->btf_id = bpf_sk_buff_ptr_ids[0];
> +                       return true;
> +               }
> +       }
> +
> +       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
> +                                       const struct bpf_reg_state *reg,
> +                                       int off, int size)
> +{
> +       const struct btf_type *t, *skbt;
> +       size_t end;
> +
> +       skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
> +       t = btf_type_by_id(reg->btf, reg->btf_id);
> +       if (t != skbt) {
> +               bpf_log(log, "only read is supported\n");
> +               return -EACCES;
> +       }
> +
> +       switch (off) {
> +       case offsetof(struct sk_buff, tstamp):
> +               end = offsetofend(struct sk_buff, tstamp);
> +               break;
> +       case offsetof(struct sk_buff, priority):
> +               end = offsetofend(struct sk_buff, priority);
> +               break;
> +       case offsetof(struct sk_buff, mark):
> +               end = offsetofend(struct sk_buff, mark);
> +               break;
> +       case offsetof(struct sk_buff, queue_mapping):
> +               end = offsetofend(struct sk_buff, queue_mapping);
> +               break;
> +       case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
> +               end = offsetof(struct sk_buff, cb) +
> +                     offsetofend(struct qdisc_skb_cb, tc_classid);
> +               break;
> +       case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
> +            offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
> +                                                    data[QDISC_CB_PRIV_LEN - 1]):
> +               end = offsetof(struct sk_buff, cb) +
> +                     offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
> +               break;
> +       case offsetof(struct sk_buff, tc_index):
> +               end = offsetofend(struct sk_buff, tc_index);
> +               break;
> +       default:
> +               bpf_log(log, "no write support to sk_buff at off %d\n", off);
> +               return -EACCES;
> +       }
> +
> +       if (off + size > end) {
> +               bpf_log(log,
> +                       "write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
> +                       off, size, end);
> +               return -EACCES;
> +       }
> +
> +       return 0;
> +}
> +
> +static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> +       .get_func_proto         = bpf_qdisc_get_func_proto,
> +       .is_valid_access        = bpf_qdisc_is_valid_access,
> +       .btf_struct_access      = bpf_qdisc_btf_struct_access,
> +};
> +
> +static int bpf_qdisc_init_member(const struct btf_type *t,
> +                                const struct btf_member *member,
> +                                void *kdata, const void *udata)
> +{
> +       const struct Qdisc_ops *uqdisc_ops;
> +       struct Qdisc_ops *qdisc_ops;
> +       u32 moff;
> +
> +       uqdisc_ops = (const struct Qdisc_ops *)udata;
> +       qdisc_ops = (struct Qdisc_ops *)kdata;
> +
> +       moff = __btf_member_bit_offset(t, member) / 8;
> +       switch (moff) {
> +       case offsetof(struct Qdisc_ops, peek):
> +               qdisc_ops->peek = qdisc_peek_dequeued;
> +               return 0;
> +       case offsetof(struct Qdisc_ops, id):
> +               if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
> +                                    sizeof(qdisc_ops->id)) <= 0)
> +                       return -EINVAL;
> +               return 1;
> +       }
> +
> +       return 0;
> +}
> +
> +static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
> +{
> +       return register_qdisc(kdata);
> +}
> +
> +static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
> +{
> +       return unregister_qdisc(kdata);
> +}
> +
> +static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
> +                             struct sk_buff **to_free)
> +{
> +       return 0;
> +}
> +
> +static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
> +{
> +       return NULL;
> +}
> +
> +static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
> +{
> +       return NULL;
> +}
> +
> +static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
> +                          struct netlink_ext_ack *extack)
> +{
> +       return 0;
> +}
> +
> +static void Qdisc_ops__reset(struct Qdisc *sch)
> +{
> +}
> +
> +static void Qdisc_ops__destroy(struct Qdisc *sch)
> +{
> +}
> +
> +static struct Qdisc_ops __bpf_ops_qdisc_ops = {
> +       .enqueue = Qdisc_ops__enqueue,
> +       .dequeue = Qdisc_ops__dequeue,
> +       .peek = Qdisc_ops__peek,
> +       .init = Qdisc_ops__init,
> +       .reset = Qdisc_ops__reset,
> +       .destroy = Qdisc_ops__destroy,
> +};
> +
> +static struct bpf_struct_ops bpf_Qdisc_ops = {
> +       .verifier_ops = &bpf_qdisc_verifier_ops,
> +       .reg = bpf_qdisc_reg,
> +       .unreg = bpf_qdisc_unreg,
> +       .init_member = bpf_qdisc_init_member,
> +       .init = bpf_qdisc_init,
> +       .name = "Qdisc_ops",
> +       .cfi_stubs = &__bpf_ops_qdisc_ops,
> +       .owner = THIS_MODULE,
> +};
> +
> +static int __init bpf_qdisc_kfunc_init(void)
> +{
> +       return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +}
> +late_initcall(bpf_qdisc_kfunc_init);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 300430b8c4d2..b35c73c82342 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -25,6 +25,7 @@
>  #include <linux/hrtimer.h>
>  #include <linux/slab.h>
>  #include <linux/hashtable.h>
> +#include <linux/bpf.h>
>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
> @@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
>                 read_lock(&qdisc_mod_lock);
>                 for (q = qdisc_base; q; q = q->next) {
>                         if (nla_strcmp(kind, q->id) == 0) {
> -                               if (!try_module_get(q->owner))
> +                               if (!bpf_try_module_get(q, q->owner))
>                                         q = NULL;
>                                 break;
>                         }
> @@ -1287,7 +1288,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>                                 /* We will try again qdisc_lookup_ops,
>                                  * so don't keep a reference.
>                                  */
> -                               module_put(ops->owner);
> +                               bpf_module_put(ops, ops->owner);
>                                 err = -EAGAIN;
>                                 goto err_out;
>                         }
> @@ -1398,7 +1399,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>         netdev_put(dev, &sch->dev_tracker);
>         qdisc_free(sch);
>  err_out2:
> -       module_put(ops->owner);
> +       bpf_module_put(ops, ops->owner);
>  err_out:
>         *errp = err;
>         return NULL;
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 38ec18f73de4..1e770ec251a0 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -24,6 +24,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/skb_array.h>
>  #include <linux/if_macvlan.h>
> +#include <linux/bpf.h>
>  #include <net/sch_generic.h>
>  #include <net/pkt_sched.h>
>  #include <net/dst.h>
> @@ -1083,7 +1084,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
>                 ops->destroy(qdisc);
>
>         lockdep_unregister_key(&qdisc->root_lock_key);
> -       module_put(ops->owner);
> +       bpf_module_put(ops, ops->owner);
>         netdev_put(dev, &qdisc->dev_tracker);
>
>         trace_qdisc_destroy(qdisc);
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer
  2025-01-09  0:20   ` Martin KaFai Lau
@ 2025-01-09 15:00     ` Amery Hung
  0 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2025-01-09 15:00 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On Wed, Jan 8, 2025 at 4:20 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/20/24 11:55 AM, Amery Hung wrote:
> > diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> > index 1c92bfcc3847..bbe7aded6f24 100644
> > --- a/net/sched/bpf_qdisc.c
> > +++ b/net/sched/bpf_qdisc.c
> > @@ -8,6 +8,10 @@
> >
> >   static struct bpf_struct_ops bpf_Qdisc_ops;
> >
> > +struct bpf_sched_data {
> > +     struct qdisc_watchdog watchdog;
> > +};
> > +
> >   struct bpf_sk_buff_ptr {
> >       struct sk_buff *skb;
> >   };
> > @@ -108,6 +112,46 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
> >       return 0;
> >   }
> >
> > +BTF_ID_LIST(bpf_qdisc_init_prologue_ids)
> > +BTF_ID(func, bpf_qdisc_init_prologue)
> > +
> > +static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
> > +                               const struct bpf_prog *prog)
> > +{
> > +     struct bpf_insn *insn = insn_buf;
> > +
> > +     if (strcmp(prog->aux->attach_func_name, "init"))
> > +             return 0;
> > +
> > +     *insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
> > +     *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
> > +     *insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_init_prologue_ids[0]);
>
> I was wondering if patch 7 is needed if BPF_EMIT_CALL() and BPF_CALL_1() were
> used, so it looks more like a bpf helper instead of kfunc. I tried but failed at
> the "fn = env->ops->get_func_proto(insn->imm, env->prog);" in do_misc_fixups().
> I think the change in patch 7 is simple enough instead of getting
> get_func_proto() works for this case.
>

Yeah. Making BPF_EMIT_CALL + BPF_CALL_xxx work in
gen_prologue/epilogue doesn't seem trivial to me at least when
compared with BPF_CALL_KFUNC.

> > +     *insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
> > +     *insn++ = prog->insnsi[0];
> > +
> > +     return insn - insn_buf;
> > +}
> > +
> > +BTF_ID_LIST(bpf_qdisc_reset_destroy_epilogue_ids)
> > +BTF_ID(func, bpf_qdisc_reset_destroy_epilogue)
> > +
> > +static int bpf_qdisc_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog,
> > +                               s16 ctx_stack_off)
> > +{
> > +     struct bpf_insn *insn = insn_buf;
> > +
> > +     if (strcmp(prog->aux->attach_func_name, "reset") &&
> > +         strcmp(prog->aux->attach_func_name, "destroy"))
> > +             return 0;
> > +
> > +     *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off);
> > +     *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
> > +     *insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_reset_destroy_epilogue_ids[0]);
> > +     *insn++ = BPF_EXIT_INSN();
> > +
> > +     return insn - insn_buf;
> > +}
> > +

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest
  2024-12-20 19:55 ` [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2025-01-09 23:36   ` Martin KaFai Lau
  0 siblings, 0 replies; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-09 23:36 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On 12/20/24 11:55 AM, Amery Hung wrote:
> +static int fq_new_flow(void *flow_map, struct fq_stashed_flow **sflow, u64 hash)
> +{
> +	struct fq_stashed_flow tmp = {};
> +	struct fq_flow_node *flow;
> +	int ret;
> +
> +	flow = bpf_obj_new(typeof(*flow));
> +	if (!flow)
> +		return -ENOMEM;
> +
> +	flow->credit = q.initial_quantum,
> +	flow->qlen = 0,
> +	flow->age = 1,
> +	flow->time_next_packet = 0,
> +
> +	ret = bpf_map_update_elem(flow_map, &hash, &tmp, 0);
> +	if (ret == -ENOMEM) {

-E2BIG also needs to be checked to handle the case where the flow_map is full
(a possible adjustment is sketched after the quoted hunk below).

> +		fq_gc();
> +		bpf_map_update_elem(&fq_nonprio_flows, &hash, &tmp, 0);
> +	}
> +
> +	*sflow = bpf_map_lookup_elem(flow_map, &hash);
> +	if (!*sflow) {
> +		bpf_obj_drop(flow);
> +		return -ENOMEM;
> +	}
> +
> +	bpf_kptr_xchg_back(&(*sflow)->flow, flow);
> +	return 0;
> +}
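
A minimal sketch of the adjusted check (assuming gc is also the intended
recovery when the map is full):

	ret = bpf_map_update_elem(flow_map, &hash, &tmp, 0);
	if (ret == -ENOMEM || ret == -E2BIG) {
		fq_gc();
		bpf_map_update_elem(&fq_nonprio_flows, &hash, &tmp, 0);
	}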

[ ... ]

> +static int
> +fq_remove_flows(struct bpf_map *flow_map, u64 *hash,
> +		struct fq_stashed_flow *sflow, struct remove_flows_ctx *ctx)
> +{
> +	struct fq_flow_node *flow = NULL;
> +
> +	flow = bpf_kptr_xchg(&sflow->flow, flow);
> +	if (flow) {
> +		if (!ctx->gc_only || fq_gc_candidate(flow)) {
> +			bpf_obj_drop(flow);

afaik, the hash value (i.e. sflow here) is still in the flow_map.

Instead of xchg and then drop, I think this should be
bpf_map_delete_elem(flow_map, hash), which deletes the sflow value from the
flow_map and also takes care of the bpf_obj_drop() (a sketch follows the
quoted hunk below).

> +			ctx->reset_cnt++;
> +		} else {
> +			bpf_kptr_xchg_back(&sflow->flow, flow);
> +		}
> +	}
> +
> +	return ctx->reset_cnt < ctx->reset_max ? 0 : 1;
> +}
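
Roughly, something like this (illustrative only; it keeps the xchg just to
evaluate fq_gc_candidate(), which may or may not be what you want):

	if (!ctx->gc_only || fq_gc_candidate(flow)) {
		/* put the flow back and delete the whole element;
		 * deleting the element also frees the stashed kptr
		 */
		bpf_kptr_xchg_back(&sflow->flow, flow);
		bpf_map_delete_elem(flow_map, hash);
		ctx->reset_cnt++;
	} else {
		bpf_kptr_xchg_back(&sflow->flow, flow);
	}
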
> +
> +static void fq_gc(void)
> +{
> +	struct remove_flows_ctx cb_ctx = {
> +		.gc_only = true,
> +		.reset_cnt = 0,
> +		.reset_max = FQ_GC_MAX,
> +	};
> +
> +	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &cb_ctx, 0);
> +}
> +
> +SEC("struct_ops/bpf_fq_reset")
> +void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
> +{
> +	struct unset_throttled_flows_ctx utf_ctx = {
> +		.unset_all = true,
> +	};
> +	struct remove_flows_ctx rf_ctx = {
> +		.gc_only = false,
> +		.reset_cnt = 0,
> +		.reset_max = NUM_QUEUE,
> +	};
> +	struct fq_stashed_flow *sflow;
> +	u64 hash = 0;
> +
> +	sch->q.qlen = 0;
> +	sch->qstats.backlog = 0;
> +
> +	bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &rf_ctx, 0);
> +
> +	rf_ctx.reset_cnt = 0;
> +	bpf_for_each_map_elem(&fq_prio_flows, fq_remove_flows, &rf_ctx, 0);
> +	fq_new_flow(&fq_prio_flows, &sflow, hash);
> +
> +	bpf_loop(NUM_QUEUE, fq_remove_flows_in_list, NULL, 0);
> +	q.new_flow_cnt = 0;
> +	q.old_flow_cnt = 0;
> +
> +	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &utf_ctx, 0);
> +
> +	return;
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test
  2024-12-20 19:55 ` [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test Amery Hung
@ 2025-01-10  0:05   ` Martin KaFai Lau
  0 siblings, 0 replies; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-10  0:05 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On 12/20/24 11:55 AM, Amery Hung wrote:
> +static void do_test(char *qdisc)
> +{
> +	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
> +			    .attach_point = BPF_TC_QDISC,
> +			    .parent = TC_H_ROOT,
> +			    .handle = 0x8000000,
> +			    .qdisc = qdisc);
> +	struct sockaddr_in6 sa6 = {};
> +	ssize_t nr_recv = 0, bytes = 0;
> +	int lfd = -1, fd = -1;
> +	pthread_t srv_thread;
> +	socklen_t addrlen = sizeof(sa6);
> +	void *thread_ret;
> +	char batch[1500];
> +	int err;
> +
> +	WRITE_ONCE(stop, 0);
> +
> +	err = bpf_tc_hook_create(&hook);
> +	if (!ASSERT_OK(err, "attach qdisc"))
> +		return;
> +
> +	lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> +	if (!ASSERT_NEQ(lfd, -1, "socket")) {

nit. ASSERT_OK_FD.

> +		bpf_tc_hook_destroy(&hook);
> +		return;
> +	}
> +
> +	fd = socket(AF_INET6, SOCK_STREAM, 0);
> +	if (!ASSERT_NEQ(fd, -1, "socket")) {
> +		bpf_tc_hook_destroy(&hook);
> +		close(lfd);
> +		return;
> +	}
> +
> +	if (settimeo(lfd, 0) || settimeo(fd, 0))
> +		goto done;
> +
> +	err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
> +	if (!ASSERT_NEQ(err, -1, "getsockname"))
> +		goto done;
> +
> +	/* connect to server */
> +	err = connect(fd, (struct sockaddr *)&sa6, addrlen);

Instead of socket/getsockname/connect, "fd = connect_to_fd(lfd);" from
network_helpers.c should do.

The above settimeo({lfd,fd}) is not needed either. The helpers from
network_helpers.c should have already set the default timeout.
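
Something along these lines (a sketch; assumes the usual start_server() and
connect_to_fd() signatures from network_helpers.h and the ASSERT_OK_FD macro):

	lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
	if (!ASSERT_OK_FD(lfd, "start_server")) {
		bpf_tc_hook_destroy(&hook);
		return;
	}

	/* replaces socket() + getsockname() + connect() and the
	 * settimeo() calls above
	 */
	fd = connect_to_fd(lfd, 0);
	if (!ASSERT_OK_FD(fd, "connect_to_fd")) {
		bpf_tc_hook_destroy(&hook);
		close(lfd);
		return;
	}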

> +	if (!ASSERT_NEQ(err, -1, "connect"))
> +		goto done;
> +
> +	err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
> +	if (!ASSERT_OK(err, "pthread_create"))
> +		goto done;
> +
> +	/* recv total_bytes */
> +	while (bytes < total_bytes && !READ_ONCE(stop)) {
> +		nr_recv = recv(fd, &batch,
> +			       MIN(total_bytes - bytes, sizeof(batch)), 0);
> +		if (nr_recv == -1 && errno == EINTR)
> +			continue;
> +		if (nr_recv == -1)
> +			break;
> +		bytes += nr_recv;
> +	}
> +
> +	ASSERT_EQ(bytes, total_bytes, "recv");
> +
> +	WRITE_ONCE(stop, 1);
> +	pthread_join(srv_thread, &thread_ret);
> +	ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
> +
> +done:
> +	close(lfd);
> +	close(fd);
> +
> +	bpf_tc_hook_destroy(&hook);
> +	return;
> +}
> +
> +static void test_fifo(void)
> +{
> +	struct bpf_qdisc_fifo *fifo_skel;
> +	struct bpf_link *link;
> +
> +	fifo_skel = bpf_qdisc_fifo__open_and_load();
> +	if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
> +		return;
> +
> +	link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
> +	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
> +		bpf_qdisc_fifo__destroy(fifo_skel);
> +		return;
> +	}
> +
> +	do_test("bpf_fifo");
> +
> +	bpf_link__destroy(link);
> +	bpf_qdisc_fifo__destroy(fifo_skel);
> +}
> +
> +void test_bpf_qdisc(void)
> +{

Run the whole test under its own netns; netns_new("bpf_qdisc_ns", true) from
test_progs.c should do.
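
E.g. (a sketch, assuming the netns_obj helpers netns_new()/netns_free()):

	void test_bpf_qdisc(void)
	{
		struct netns_obj *netns;

		netns = netns_new("bpf_qdisc_ns", true);
		if (!ASSERT_OK_PTR(netns, "netns_new"))
			return;

		if (test__start_subtest("fifo"))
			test_fifo();

		netns_free(netns);
	}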

> +	if (test__start_subtest("fifo"))
> +		test_fifo();
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs
  2024-12-20 19:55 ` [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2025-01-10  0:24   ` Martin KaFai Lau
  2025-01-10 18:00     ` Amery Hung
  0 siblings, 1 reply; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-10  0:24 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On 12/20/24 11:55 AM, Amery Hung wrote:
> +BTF_KFUNCS_START(qdisc_kfunc_ids)
> +BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(qdisc_kfunc_ids)
> +
> +BTF_SET_START(qdisc_common_kfunc_set)
> +BTF_ID(func, bpf_skb_get_hash)
> +BTF_ID(func, bpf_kfree_skb)

I think bpf_dynptr_from_skb should also be here.
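
I.e. something like:

	BTF_SET_START(qdisc_common_kfunc_set)
	BTF_ID(func, bpf_skb_get_hash)
	BTF_ID(func, bpf_kfree_skb)
	BTF_ID(func, bpf_dynptr_from_skb)
	BTF_SET_END(qdisc_common_kfunc_set)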

> +BTF_SET_END(qdisc_common_kfunc_set)
> +
> +BTF_SET_START(qdisc_enqueue_kfunc_set)
> +BTF_ID(func, bpf_qdisc_skb_drop)
> +BTF_SET_END(qdisc_enqueue_kfunc_set)
> +
> +static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> +{
> +	if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
> +						 prog->aux->attach_btf_id))
> +		return 0;
> +
> +	/* Skip the check when prog->attach_func_name is not yet available
> +	 * during check_cfg().

Instead of using attach_func_name, it is better to directly use the
prog->expected_attach_type provided by user space. It is essentially the
member index of "struct Qdisc_ops". Take a look at prog_ops_moff() in
bpf_tcp_ca.c.
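
For example (a sketch mirroring prog_ops_moff() in bpf_tcp_ca.c; the
qdisc_ops_moff() name is made up, and bpf_Qdisc_ops.type is taken from the
quoted filter above):

	static u32 qdisc_ops_moff(const struct bpf_prog *prog)
	{
		const struct btf_member *m;
		const struct btf_type *t;
		u32 midx;

		midx = prog->expected_attach_type;
		t = bpf_Qdisc_ops.type;		/* struct Qdisc_ops */
		m = &btf_type_member(t)[midx];

		return __btf_member_bit_offset(t, m) / 8;
	}

The filter could then compare, e.g.,
"qdisc_ops_moff(prog) == offsetof(struct Qdisc_ops, enqueue)" instead of
doing strcmp() on attach_func_name.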

> +	 */
> +	if (!btf_id_set8_contains(&qdisc_kfunc_ids, kfunc_id) ||
> +	    !prog->aux->attach_func_name)
> +		return 0;
> +
> +	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
> +		if (btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
> +		       return 0;
> +	}
> +
> +	return btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id) ? 0 : -EACCES;
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
  2025-01-09 15:00   ` Amery Hung
@ 2025-01-10  0:28   ` Martin KaFai Lau
  2025-01-10  1:20   ` Jakub Kicinski
  2 siblings, 0 replies; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-10  0:28 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On 12/20/24 11:55 AM, Amery Hung wrote:
> +static const struct bpf_func_proto *
> +bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
> +			 const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_tail_call:

Since it needs a respin, a comment would be useful here to explain why
tail_call is excluded.

> +		return NULL;
> +	default:
> +		return bpf_base_func_proto(func_id, prog);
> +	}
> +}


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf
  2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
  2025-01-09 15:00   ` Amery Hung
  2025-01-10  0:28   ` Martin KaFai Lau
@ 2025-01-10  1:20   ` Jakub Kicinski
  2 siblings, 0 replies; 34+ messages in thread
From: Jakub Kicinski @ 2025-01-10  1:20 UTC (permalink / raw)
  To: Amery Hung
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On Fri, 20 Dec 2024 11:55:31 -0800 Amery Hung wrote:
> From: Amery Hung <amery.hung@bytedance.com>
> 
> Enable users to implement a classless qdisc using bpf. The last few
> patches in this series has prepared struct_ops to support core operators
> in Qdisc_ops. The recent advancement in bpf such as allocated
> objects, bpf list and bpf rbtree has also provided powerful and flexible
> building blocks to realize sophisticated scheduling algorithms. Therefore,
> in this patch, we start allowing qdisc to be implemented using bpf
> struct_ops. Users can implement Qdisc_ops.{enqueue, dequeue, init, reset,
> and .destroy in Qdisc_ops in bpf and register the qdisc dynamically into
> the kernel.

Are you making sure the BPF Qdisc can only be attached as TC_H_ROOT, 
and has no cl_ops? Sorry IDK much about the bpf ops glue.

It'd certainly be good if Jamal or Eric had a look since they handle
all the qdisc security bugs. The qdisc abstraction is an order of
magnitude more leaky than TCP CC ops :(

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 00/14] bpf qdisc
  2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
                   ` (14 preceding siblings ...)
  2025-01-02 17:29 ` [PATCH bpf-next v2 00/14] bpf qdisc Toke Høiland-Jørgensen
@ 2025-01-10  1:43 ` Martin KaFai Lau
  15 siblings, 0 replies; 34+ messages in thread
From: Martin KaFai Lau @ 2025-01-10  1:43 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On 12/20/24 11:55 AM, Amery Hung wrote:
> The implementation of bpf_fq is fairly complex and slightly different from
> fq so later we only compare the two fifo qdiscs. bpf_fq implements the
> same fair queueing algorithm in fq, but without flow hash collision
> avoidance and garbage collection of inactive flows. bpf_fifo uses a single

For hash collision, I think you meant >1 tcp_socks having the same hash in
patch 14? This could probably be detected by adding the sk pointer value to
the bpf-map key. Not asking to change patch 14 though.
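
For example, a composite key along these lines (illustrative only, not
something this series needs to adopt):

	struct fq_flow_key {
		__u64 hash;
		__u64 sk;	/* (__u64)(long)skb->sk, 0 for orphaned skbs */
	};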

For garbage collection, I think patch 14 has it, but yes, it is iterating the
bpf map, so it is not as quick as doing gc while searching for the sk in the
rbtree. I think the only missing piece is being able to iterate the
bpf_rb_root, i.e. being able to directly search left and right of a
bpf_rb_node.

> bpf_list as a queue instead of three queues for different priorities in
> pfifo_fast. The time complexity of fifo however should be similar since the
> queue selection time is negligible.
> 
> Test setup:
> 
>      client -> qdisc ------------->  server
>      ~~~~~~~~~~~~~~~                 ~~~~~~
>      nested VM1 @ DC1               VM2 @ DC2
> 
> Throghput: iperf3 -t 600, 5 times
> 
>        Qdisc        Average (GBits/sec)
>      ----------     -------------------
>      pfifo_fast       12.52 ± 0.26
>      bpf_fifo         11.72 ± 0.32
>      fq               10.24 ± 0.13
>      bpf_fq           11.92 ± 0.64
> 
> Latency: sockperf pp --tcp -t 600, 5 times
> 
>        Qdisc        Average (usec)
>      ----------     --------------
>      pfifo_fast      244.58 ± 7.93
>      bpf_fifo        244.92 ± 15.22
>      fq              234.30 ± 19.25
>      bpf_fq          221.34 ± 10.76
> 
> Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
> implementatioin is consistent with previous observation (v8 throughput
> test on a loopback device). This should be able to be mitigated by
> supporting adding skb to bpf_list or bpf_rbtree directly in the future.
> 
> * Clean up skb in bpf qdisc during reset *
> 
> The current implementation relies on bpf qdisc implementors to correctly
> release skbs in queues (bpf graphs or maps) in .reset, which might not be
> a safe thing to do. The solution as Martin has suggested would be
> supporting private data in struct_ops. This can also help simplifying
> implementation of qdisc that works with mq. For examples, qdiscs in the
> selftest mostly use global data. Therefore, even if user add multiple
> qdisc instances under mq, they would still share the same queue.

Although not as nice as priv_data, I think an mq setup with a dedicated queue
per qdisc instance can be done with bpf map-in-map.
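
A rough sketch of the idea (names and sizes are made up, and this assumes a
kptr-holding value type is acceptable in an inner map):

	struct flow_map {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, NUM_QUEUE);
		__type(key, __u64);
		__type(value, struct fq_stashed_flow);
	} flows_q0 SEC(".maps");

	struct {
		__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
		__uint(max_entries, 16);	/* one inner map per tx queue */
		__type(key, __u32);
		__array(values, struct flow_map);
	} per_queue_flows SEC(".maps") = {
		.values = { [0] = &flows_q0 },
	};

.enqueue/.dequeue would then look up the inner map with skb->queue_mapping
(or some other per-instance id) before touching the flow tables.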

For the cleanup part, it is similar to how the bpf kptr is cleaned up, either 
the bpf program frees it or the bpf infra will eventually clean it up during the 
bpf map destruction.

For priv_data, I think it could be a useful addition to the bpf_struct_ops. 
Meaning it should also work for struct_ops other than Qdisc_ops. Then all 
destruction and free could be done more automatically and seamlessly.

imo, the above improvements can be iterated later on top of the core pieces of 
this set.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs
  2025-01-10  0:24   ` Martin KaFai Lau
@ 2025-01-10 18:00     ` Amery Hung
  0 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2025-01-10 18:00 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs

On Thu, Jan 9, 2025 at 4:24 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 12/20/24 11:55 AM, Amery Hung wrote:
> > +BTF_KFUNCS_START(qdisc_kfunc_ids)
> > +BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
> > +BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
> > +BTF_KFUNCS_END(qdisc_kfunc_ids)
> > +
> > +BTF_SET_START(qdisc_common_kfunc_set)
> > +BTF_ID(func, bpf_skb_get_hash)
> > +BTF_ID(func, bpf_kfree_skb)
>
> I think bpf_dynptr_from_skb should also be here.

Good catch

>
> > +BTF_SET_END(qdisc_common_kfunc_set)
> > +
> > +BTF_SET_START(qdisc_enqueue_kfunc_set)
> > +BTF_ID(func, bpf_qdisc_skb_drop)
> > +BTF_SET_END(qdisc_enqueue_kfunc_set)
> > +
> > +static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
> > +{
> > +     if (bpf_Qdisc_ops.type != btf_type_by_id(prog->aux->attach_btf,
> > +                                              prog->aux->attach_btf_id))
> > +             return 0;
> > +
> > +     /* Skip the check when prog->attach_func_name is not yet available
> > +      * during check_cfg().
>
> Instead of using attach_func_name, it is better to directly use the
> prog->expected_attach_type provided by the user space. It is essentially the
> member index of the "struct Qdisc_ops". Take a look at the prog_ops_moff() in
> bpf_tcp_ca.c.

I will replace all places that use attach_func_name to identify the ops
with prog_ops_moff.

Thanks,
Amery

>
> > +      */
> > +     if (!btf_id_set8_contains(&qdisc_kfunc_ids, kfunc_id) ||
> > +         !prog->aux->attach_func_name)
> > +             return 0;
> > +
> > +     if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
> > +             if (btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
> > +                    return 0;
> > +     }
> > +
> > +     return btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id) ? 0 : -EACCES;
> > +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr
  2024-12-20 19:55 ` [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2025-01-15 15:25   ` Ming Lei
  2025-01-23  9:57   ` Eduard Zingerman
  1 sibling, 0 replies; 34+ messages in thread
From: Ming Lei @ 2025-01-15 15:25 UTC (permalink / raw)
  To: Amery Hung
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

Hello Amery,

On Fri, Dec 20, 2024 at 11:55:29AM -0800, Amery Hung wrote:
> From: Amery Hung <amery.hung@bytedance.com>
> 
> Allow a struct_ops program to return a referenced kptr if the struct_ops
> operator's return type is a struct pointer. To make sure the returned
> pointer continues to be valid in the kernel, several constraints are
> required:
> 
> 1) The type of the pointer must matches the return type
> 2) The pointer originally comes from the kernel (not locally allocated)
> 3) The pointer is in its unmodified form
> 
> Implementation wise, a referenced kptr first needs to be allowed to _leak_
> in check_reference_leak() if it is in the return register. Then, in
> check_return_code(), constraints 1-3 are checked. During struct_ops
> registration, a check is also added to warn about operators with
> non-struct pointer return.
> 
> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
> pointer to be returned when there is no skb to be dequeued, we will allow
> a scalar value with value equals to NULL to be returned.
> 
> In the future when there is a struct_ops user that always expects a valid
> pointer to be returned from an operator, we may extend tagging to the
> return value. We can tell the verifier to only allow NULL pointer return
> if the return value is tagged with MAY_BE_NULL.
> 
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>

I feel this patchset is very useful. As Alexei shared, it can help to
mark the two ublk bpf aio kfuncs[1] as KF_ACQ/REL to avoid leaking a bpf aio.

[1] https://lore.kernel.org/linux-block/20250107120417.1237392-1-tom.leiming@gmail.com/

So I tried to test it with the ublk bpf patchset; however, it looks like it
fails the ublk bpf selftests. The test does succeed without applying this
patch, with ublk built as a module.

- apply the 1st and the 3rd(this one) patch
- apply ublk bpf patchset[1]
- build kernel & reboot
- make -C tools/testing/selftests TARGETS=ublk run_tests

make: Entering directory '/root/git/linux/tools/testing/selftests'
make[1]: Entering directory '/root/git/linux/tools/testing/selftests/ublk'
  GEN      vmlinux.h
  CLNG-BPF ublk_loop.bpf.o
  GEN-SKEL ublk_loop.skel.h
  CLNG-BPF ublk_null.bpf.o
  GEN-SKEL ublk_null.skel.h
  CLNG-BPF ublk_stripe.bpf.o
  GEN-SKEL ublk_stripe.skel.h
  CC       ublk_bpf.o
  BINARY   ublk_bpf
rm /root/git/linux/tools/testing/selftests/ublk/ublk_bpf.o
make[1]: Leaving directory '/root/git/linux/tools/testing/selftests/ublk'
make[1]: Entering directory '/root/git/linux/tools/testing/selftests/ublk'
TAP version 13
1..8
# timeout set to 45
# selftests: ublk: test_null_01.sh
# null_01 : [PASS]
ok 1 selftests: ublk: test_null_01.sh
# timeout set to 45
# selftests: ublk: test_null_02.sh
# libbpf: struct_ops init_kern: struct ublk_bpf_ops data is not found in struct bpf_struct_ops_ublk_bpf_ops
# libbpf: failed to load object 'ublk_null.bpf.o'
# fail to load bpf obj from ublk_null.bpf.o
# fail to register bpf prog null ublk_null.bpf.o
not ok 2 selftests: ublk: test_null_02.sh # exit=255



Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument
  2024-12-20 19:55 ` [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2025-01-23  9:57   ` Eduard Zingerman
  2025-01-23 19:41     ` Amery Hung
  0 siblings, 1 reply; 34+ messages in thread
From: Eduard Zingerman @ 2025-01-23  9:57 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, amery.hung

On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:
> From: Amery Hung <amery.hung@bytedance.com>
> 
> Allows struct_ops programs to acquire referenced kptrs from arguments
> by directly reading the argument.
> 
> The verifier will acquire a reference for a struct_ops argument tagged
> with "__ref" in the stub function at the beginning of the main program.
> The user will be able to access the referenced kptr directly by reading
> the context as long as it has not been released by the program.
> 
> This new mechanism to acquire referenced kptr (compared to the existing
> "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> first argument. This mechanism provides a natural way for users to get a
> referenced kptr in the .enqueue struct_ops programs and makes sure that a
> qdisc will always enqueue or drop the skb.
> 
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---

Hi Amery,

Sorry for joining so late in the review process.
I decided to take a look at the verifier-related changes.
Overall the patch looks good to me,
but I dislike the part allocating parameter ids.

[...]

> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 28246c59e12e..c2f4f84e539d 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -6682,6 +6682,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
>  			info->reg_type = ctx_arg_info->reg_type;
>  			info->btf = ctx_arg_info->btf ? : btf_vmlinux;
>  			info->btf_id = ctx_arg_info->btf_id;
> +			info->ref_obj_id = ctx_arg_info->refcounted ? ++nr_ref_args : 0;
>  			return true;
>  		}
>  	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f27274e933e5..26305571e377 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c

[...]

> @@ -22161,6 +22182,16 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
>  		mark_reg_known_zero(env, regs, BPF_REG_1);
>  	}
>  
> +	/* Acquire references for struct_ops program arguments tagged with "__ref".
> +	 * These should be the earliest references acquired. btf_ctx_access() will
> +	 * assume the ref_obj_id of the n-th __ref-tagged argument to be n.
> +	 */
> +	if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> +		for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> +			if (env->prog->aux->ctx_arg_info[i].refcounted)
> +				acquire_reference(env, 0);
> +	}
> +
>  	ret = do_check(env);
>  out:
>  	/* check for NULL is necessary, since cur_state can be freed inside

I think it would be cleaner if:
- each program would own its instance of 'env->prog->aux->ctx_arg_info';
- a ref_obj_id field would be added to 'struct bpf_ctx_arg_aux' (see the
  sketch below);
- parameter ids would be allocated in do_check_common(), but without
  reliance on being first to allocate.
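
E.g. (a sketch; the existing fields are elided):

	struct bpf_ctx_arg_aux {
		/* ... existing fields (offset, reg_type, btf, btf_id,
		 * refcounted, ...) ...
		 */
		u32 ref_obj_id;	/* id pre-allocated for a __ref-tagged arg */
	};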

Or add some rigour to this thing and e.g. make env->id_gen signed
and declare an enum of special ids like:

  enum special_ids {
  	STRUCT_OPS_CTX_PARAM_0 = -1,
  	STRUCT_OPS_CTX_PARAM_1 = -2,
  	STRUCT_OPS_CTX_PARAM_2 = -3,
  	...
  }

and update the loop above as:

	if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
		for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
			if (env->prog->aux->ctx_arg_info[i].refcounted)
            	/* imagined function that acquires an id with specific value */
				acquire_special_reference(env, 0, STRUCT_OPS_CTX_PARAM_0 - i /* desired id */);
	}

wdyt?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2024-12-20 19:55 ` [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
@ 2025-01-23  9:57   ` Eduard Zingerman
  2025-01-24  0:04     ` Amery Hung
  0 siblings, 1 reply; 34+ messages in thread
From: Eduard Zingerman @ 2025-01-23  9:57 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, amery.hung

On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> new file mode 100644
> index 000000000000..43493a7ead39
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> @@ -0,0 +1,37 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../test_kmods/bpf_testmod.h"
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern void bpf_task_release(struct task_struct *p) __ksym;
> +
> +__noinline int subprog_release(__u64 *ctx __arg_ctx)
> +{
> +	struct task_struct *task = (struct task_struct *)ctx[1];
> +	int dummy = (int)ctx[0];
> +
> +	bpf_task_release(task);
> +
> +	return dummy + 1;
> +}
> +
> +/* Test that the verifier rejects a program that contains a global
> + * subprogram with referenced kptr arguments
> + */
> +SEC("struct_ops/test_refcounted")

Nit: I'd add a __msg("Validating subprog_release() func#1...")
     before the error message match, just to make sure that
     error is reported when subprog_release() is verified.
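
I.e. something like:

	__failure
	__msg("Validating subprog_release() func#1...")
	__msg("invalid bpf_context access off=8. Reference may already be released")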

> +__failure __msg("invalid bpf_context access off=8. Reference may already be released")
> +int refcounted_fail__global_subprog(unsigned long long *ctx)
> +{
> +	struct task_struct *task = (struct task_struct *)ctx[1];
> +
> +	bpf_task_release(task);
> +
> +	return subprog_release(ctx);
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_ref_acquire = {
> +	.test_refcounted = (void *)refcounted_fail__global_subprog,
> +};

[...]


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr
  2024-12-20 19:55 ` [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
  2025-01-15 15:25   ` Ming Lei
@ 2025-01-23  9:57   ` Eduard Zingerman
  2025-01-23 18:19     ` Eduard Zingerman
  1 sibling, 1 reply; 34+ messages in thread
From: Eduard Zingerman @ 2025-01-23  9:57 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, amery.hung

On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:

[...]

> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index d9e0af00580b..27d4a170df84 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -386,7 +386,7 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
>  	st_ops_desc->value_type = btf_type_by_id(btf, value_id);
>  
>  	for_each_member(i, t, member) {
> -		const struct btf_type *func_proto;
> +		const struct btf_type *func_proto, *ret_type;
>  
>  		mname = btf_name_by_offset(btf, member->name_off);
>  		if (!*mname) {
> @@ -409,6 +409,16 @@ int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
>  		if (!func_proto)
>  			continue;
>  
> +		if (func_proto->type) {
> +			ret_type = btf_type_resolve_ptr(btf, func_proto->type, NULL);
> +			if (ret_type && !__btf_type_is_struct(ret_type)) {
> +				pr_warn("func ptr %s in struct %s returns non-struct pointer, which is not supported\n",
> +					mname, st_ops->name);
> +				err = -EOPNOTSUPP;
> +				goto errout;
> +			}
> +		}
> +

This limitation seems unnecessary if returning a reference is only
allowed for parameters marked with __ref.

>  		if (btf_distill_func_proto(log, btf,
>  					   func_proto, mname,
>  					   &st_ops->func_models[i])) {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 26305571e377..0e6a3c4daa7d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10707,6 +10707,8 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>  static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
>  {
>  	struct bpf_verifier_state *state = env->cur_state;
> +	enum bpf_prog_type type = resolve_prog_type(env->prog);
> +	struct bpf_reg_state *reg = reg_state(env, BPF_REG_0);
>  	bool refs_lingering = false;
>  	int i;
>  
> @@ -10716,6 +10718,12 @@ static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exi
>  	for (i = 0; i < state->acquired_refs; i++) {
>  		if (state->refs[i].type != REF_TYPE_PTR)
>  			continue;
> +		/* Allow struct_ops programs to return a referenced kptr back to
> +		 * kernel. Type checks are performed later in check_return_code.
> +		 */
> +		if (type == BPF_PROG_TYPE_STRUCT_OPS && !exception_exit &&
> +		    reg->ref_obj_id == state->refs[i].id)
> +			continue;
>  		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
>  			state->refs[i].id, state->refs[i].insn_idx);
>  		refs_lingering = true;
> @@ -16320,13 +16328,14 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>  	const char *exit_ctx = "At program exit";
>  	struct tnum enforce_attach_type_range = tnum_unknown;
>  	const struct bpf_prog *prog = env->prog;
> -	struct bpf_reg_state *reg;
> +	struct bpf_reg_state *reg = reg_state(env, regno);
>  	struct bpf_retval_range range = retval_range(0, 1);
>  	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>  	int err;
>  	struct bpf_func_state *frame = env->cur_state->frame[0];
>  	const bool is_subprog = frame->subprogno;
>  	bool return_32bit = false;
> +	const struct btf_type *reg_type, *ret_type = NULL;
>  
>  	/* LSM and struct_ops func-ptr's return type could be "void" */
>  	if (!is_subprog || frame->in_exception_callback_fn) {
> @@ -16335,10 +16344,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>  			if (prog->expected_attach_type == BPF_LSM_CGROUP)
>  				/* See below, can be 0 or 0-1 depending on hook. */
>  				break;
> -			fallthrough;
> +			if (!prog->aux->attach_func_proto->type)
> +				return 0;
> +			break;
>  		case BPF_PROG_TYPE_STRUCT_OPS:
>  			if (!prog->aux->attach_func_proto->type)
>  				return 0;
> +
> +			if (frame->in_exception_callback_fn)
> +				break;
> +
> +			/* Allow a struct_ops program to return a referenced kptr if it
> +			 * matches the operator's return type and is in its unmodified
> +			 * form. A scalar zero (i.e., a null pointer) is also allowed.
> +			 */
> +			reg_type = reg->btf ? btf_type_by_id(reg->btf, reg->btf_id) : NULL;
> +			ret_type = btf_type_resolve_ptr(prog->aux->attach_btf,
> +							prog->aux->attach_func_proto->type,
> +							NULL);

This does not enforce the kernel provenance of the pointer.
See my comment for the next patch for an example.

I think such a return should only be allowed for parameters marked with
the __ref suffix. If so, the pointer provenance check would just compare
the reg->ref_obj_id value with the known ids of __ref arguments.
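
For example, something along these lines (a rough sketch only;
is_struct_ops_ref_arg_id() is a hypothetical helper, not existing
verifier code):

	/* only accept a pointer that came in via a __ref-tagged argument */
	if (reg->ref_obj_id &&
	    is_struct_ops_ref_arg_id(env, reg->ref_obj_id))
		return __check_ptr_off_reg(env, reg, regno, false);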

> +			if (ret_type && ret_type == reg_type && reg->ref_obj_id)
> +				return __check_ptr_off_reg(env, reg, regno, false);
>  			break;
>  		default:
>  			break;
> @@ -16360,8 +16385,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>  		return -EACCES;
>  	}
>  
> -	reg = cur_regs(env) + regno;
> -
>  	if (frame->in_async_callback_fn) {
>  		/* enforce return zero from async callbacks like timer */
>  		exit_ctx = "At async callback return";
> @@ -16460,6 +16483,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>  	case BPF_PROG_TYPE_NETFILTER:
>  		range = retval_range(NF_DROP, NF_ACCEPT);
>  		break;
> +	case BPF_PROG_TYPE_STRUCT_OPS:
> +		if (!ret_type)
> +			return 0;
> +		range = retval_range(0, 0);
> +		break;
>  	case BPF_PROG_TYPE_EXT:
>  		/* freplace program can return anything as its return value
>  		 * depends on the to-be-replaced kernel func or bpf program.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs
  2024-12-20 19:55 ` [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
@ 2025-01-23  9:58   ` Eduard Zingerman
  0 siblings, 0 replies; 34+ messages in thread
From: Eduard Zingerman @ 2025-01-23  9:58 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs, amery.hung

On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:

[...]

> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
> new file mode 100644
> index 000000000000..b8b4f05c3d7f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
> @@ -0,0 +1,34 @@
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include "../test_kmods/bpf_testmod.h"
> +#include "bpf_experimental.h"
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
> +void bpf_task_release(struct task_struct *p) __ksym;
> +
> +/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
> + * reject programs returning a local kptr.
> + */
> +SEC("struct_ops/test_return_ref_kptr")
> +__failure __msg("At program exit the register R0 is not a known value (ptr_or_null_)")
> +struct task_struct *BPF_PROG(kptr_return_fail__local_kptr, int dummy,
> +			     struct task_struct *task, struct cgroup *cgrp)
> +{
> +	struct task_struct *t;
> +
> +	bpf_task_release(task);
> +
> +	t = bpf_obj_new(typeof(*task));

The return type (and btf id) of bpf_obj_new is 'void *',
but here a function with a return type (and btf id) of
'struct task_struct *' is necessary.
For example:

-       t = bpf_obj_new(typeof(*task));
+       t = bpf_task_from_pid(32);

However, if this replacement is done, the test does
not signal a verification error because of insufficient
checks in verifier.c:check_return_code().

> +	if (!t)
> +		return NULL;
> +
> +	return t;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_kptr_return = {
> +	.test_return_ref_kptr = (void *)kptr_return_fail__local_kptr,
> +};

[...]


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr
  2025-01-23  9:57   ` Eduard Zingerman
@ 2025-01-23 18:19     ` Eduard Zingerman
  0 siblings, 0 replies; 34+ messages in thread
From: Eduard Zingerman @ 2025-01-23 18:19 UTC (permalink / raw)
  To: Amery Hung, netdev
  Cc: bpf, daniel, andrii, alexei.starovoitov, martin.lau, sinquersw,
	toke, jhs, jiri, stfomichev, ekarani.silvestre, yangpeihao,
	xiyou.wangcong, yepeilin.cs

On Thu, 2025-01-23 at 01:57 -0800, Eduard Zingerman wrote:

[...]

> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 26305571e377..0e6a3c4daa7d 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -10707,6 +10707,8 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
> >  static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> >  {
> >  	struct bpf_verifier_state *state = env->cur_state;
> > +	enum bpf_prog_type type = resolve_prog_type(env->prog);
> > +	struct bpf_reg_state *reg = reg_state(env, BPF_REG_0);
> >  	bool refs_lingering = false;
> >  	int i;
> >  
> > @@ -10716,6 +10718,12 @@ static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exi
> >  	for (i = 0; i < state->acquired_refs; i++) {
> >  		if (state->refs[i].type != REF_TYPE_PTR)
> >  			continue;
> > +		/* Allow struct_ops programs to return a referenced kptr back to
> > +		 * kernel. Type checks are performed later in check_return_code.
> > +		 */
> > +		if (type == BPF_PROG_TYPE_STRUCT_OPS && !exception_exit &&
> > +		    reg->ref_obj_id == state->refs[i].id)
> > +			continue;
> >  		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> >  			state->refs[i].id, state->refs[i].insn_idx);
> >  		refs_lingering = true;
> > @@ -16320,13 +16328,14 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >  	const char *exit_ctx = "At program exit";
> >  	struct tnum enforce_attach_type_range = tnum_unknown;
> >  	const struct bpf_prog *prog = env->prog;
> > -	struct bpf_reg_state *reg;
> > +	struct bpf_reg_state *reg = reg_state(env, regno);
> >  	struct bpf_retval_range range = retval_range(0, 1);
> >  	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> >  	int err;
> >  	struct bpf_func_state *frame = env->cur_state->frame[0];
> >  	const bool is_subprog = frame->subprogno;
> >  	bool return_32bit = false;
> > +	const struct btf_type *reg_type, *ret_type = NULL;
> >  
> >  	/* LSM and struct_ops func-ptr's return type could be "void" */
> >  	if (!is_subprog || frame->in_exception_callback_fn) {
> > @@ -16335,10 +16344,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >  			if (prog->expected_attach_type == BPF_LSM_CGROUP)
> >  				/* See below, can be 0 or 0-1 depending on hook. */
> >  				break;
> > -			fallthrough;
> > +			if (!prog->aux->attach_func_proto->type)
> > +				return 0;
> > +			break;
> >  		case BPF_PROG_TYPE_STRUCT_OPS:
> >  			if (!prog->aux->attach_func_proto->type)
> >  				return 0;
> > +
> > +			if (frame->in_exception_callback_fn)
> > +				break;
> > +
> > +			/* Allow a struct_ops program to return a referenced kptr if it
> > +			 * matches the operator's return type and is in its unmodified
> > +			 * form. A scalar zero (i.e., a null pointer) is also allowed.
> > +			 */
> > +			reg_type = reg->btf ? btf_type_by_id(reg->btf, reg->btf_id) : NULL;
> > +			ret_type = btf_type_resolve_ptr(prog->aux->attach_btf,
> > +							prog->aux->attach_func_proto->type,
> > +							NULL);
> 
> This does not enforce the kernel provenance of the pointer.
> See my comment for the next patch for an example.
> 
> I think such a return should only be allowed for parameters marked with
> the __ref suffix. If so, the pointer provenance check would just compare
> the reg->ref_obj_id value with the known ids of __ref arguments.

After looking at the selftests in patch #13, it appears that I
misunderstood the line:

"2) The pointer originally comes from the kernel (not locally allocated)"

And the only thing you'd like to exclude here is 'bpf_obj_new'.
In which case my comments for patches #3,4 are invalid.
Sorry for the noise.

[...]


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument
  2025-01-23  9:57   ` Eduard Zingerman
@ 2025-01-23 19:41     ` Amery Hung
  0 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2025-01-23 19:41 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On Thu, Jan 23, 2025 at 1:57 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:
> > From: Amery Hung <amery.hung@bytedance.com>
> >
> > Allows struct_ops programs to acquire referenced kptrs from arguments
> > by directly reading the argument.
> >
> > The verifier will acquire a reference for a struct_ops argument tagged
> > with "__ref" in the stub function in the beginning of the main program.
> > The user will be able to access the referenced kptr directly by reading
> > the context as long as it has not been released by the program.
> >
> > This new mechanism to acquire a referenced kptr (compared to the existing
> > "kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
> > In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
> > first argument. This mechanism provides a natural way for users to get a
> > referenced kptr in the .enqueue struct_ops programs and makes sure that a
> > qdisc will always enqueue or drop the skb.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
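
For illustration, the minimal intended usage of a __ref-tagged argument
looks roughly like this (a sketch mirroring the test_refcounted selftest
hook used elsewhere in this series; the program and map names here are
illustrative):

	#include <vmlinux.h>
	#include <bpf/bpf_tracing.h>
	#include "../test_kmods/bpf_testmod.h"

	char _license[] SEC("license") = "GPL";

	void bpf_task_release(struct task_struct *p) __ksym;

	SEC("struct_ops/test_refcounted")
	int BPF_PROG(refcounted, int dummy, struct task_struct *task)
	{
		/* 'task' is a referenced kptr: the verifier acquires the
		 * reference at program entry because the stub tags this
		 * argument with __ref, so the program must release it
		 * exactly once on every path.
		 */
		bpf_task_release(task);
		return 0;
	}

	SEC(".struct_ops.link")
	struct bpf_testmod_ops testmod_refcounted = {
		.test_refcounted = (void *)refcounted,
	};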
>
> Hi Amery,
>
Sorry for joining so late in the review process.
I decided to take a look at the verifier-related changes.
> Overall the patch looks good to me,
> but I dislike the part allocating parameter ids.
>
> [...]
>
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 28246c59e12e..c2f4f84e539d 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -6682,6 +6682,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
> >                       info->reg_type = ctx_arg_info->reg_type;
> >                       info->btf = ctx_arg_info->btf ? : btf_vmlinux;
> >                       info->btf_id = ctx_arg_info->btf_id;
> > +                     info->ref_obj_id = ctx_arg_info->refcounted ? ++nr_ref_args : 0;
> >                       return true;
> >               }
> >       }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index f27274e933e5..26305571e377 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
>
> [...]
>
> > @@ -22161,6 +22182,16 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> >               mark_reg_known_zero(env, regs, BPF_REG_1);
> >       }
> >
> > +     /* Acquire references for struct_ops program arguments tagged with "__ref".
> > +      * These should be the earliest references acquired. btf_ctx_access() will
> > +      * assume the ref_obj_id of the n-th __ref-tagged argument to be n.
> > +      */
> > +     if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > +             for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > +                     if (env->prog->aux->ctx_arg_info[i].refcounted)
> > +                             acquire_reference(env, 0);
> > +     }
> > +
> >       ret = do_check(env);
> >  out:
> >       /* check for NULL is necessary, since cur_state can be freed inside
>
> I think it would be cleaner if:
> - each program would own its instance of 'env->prog->aux->ctx_arg_info';
> - ref_obj_id field would be added to 'struct bpf_ctx_arg_aux';
> - parameter ids would be allocated in do_check_common(), but without
>   reliance on being first to allocate.

This is very similar to what v1 is doing, but I missed the first part.
Replacing assignments of prog->aux->ctx_arg_info with a deep copy should
make ctx_arg_info a per-program thing. I agree with you that creating
this assumption about ref_obj_id is unnecessary and I will change it.
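
Roughly, I imagine something like this (a sketch; only the fields visible
in this patch are certain, the rest is tentative, and it assumes
acquire_reference() returns the new reference id and that ctx_arg_info is
a writable per-program copy):

	struct bpf_ctx_arg_aux {
		u32 offset;
		enum bpf_reg_type reg_type;
		struct btf *btf;
		u32 btf_id;
		bool refcounted;
		u32 ref_obj_id;	/* filled in per program during verification */
	};

	/* in do_check_common(), without relying on allocation order */
	for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++) {
		struct bpf_ctx_arg_aux *aux = &env->prog->aux->ctx_arg_info[i];

		if (aux->refcounted)
			aux->ref_obj_id = acquire_reference(env, 0);
	}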

https://lore.kernel.org/bpf/20241213232958.2388301-1-amery.hung@bytedance.com/

Thanks,
Amery

>
> Or add some rigour to this thing and e.g. make env->id_gen signed
> and declare an enum of special ids like:
>
>   enum special_ids {
>         STRUCT_OPS_CTX_PARAM_0 = -1,
>         STRUCT_OPS_CTX_PARAM_1 = -2,
>         STRUCT_OPS_CTX_PARAM_2 = -3,
>         ...
>   }
>
> and update the loop above as:
>
>         if (!subprog && env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
>                 for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
>                         if (env->prog->aux->ctx_arg_info[i].refcounted)
>                 /* imagined function that acquires an id with specific value */
>                                 acquire_special_reference(env, 0, STRUCT_OPS_CTX_PARAM_0 - i /* desired id */);
>         }
>
> wdyt?
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs
  2025-01-23  9:57   ` Eduard Zingerman
@ 2025-01-24  0:04     ` Amery Hung
  0 siblings, 0 replies; 34+ messages in thread
From: Amery Hung @ 2025-01-24  0:04 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: netdev, bpf, daniel, andrii, alexei.starovoitov, martin.lau,
	sinquersw, toke, jhs, jiri, stfomichev, ekarani.silvestre,
	yangpeihao, xiyou.wangcong, yepeilin.cs, amery.hung

On Thu, Jan 23, 2025 at 1:57 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2024-12-20 at 11:55 -0800, Amery Hung wrote:
>
> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
>
> [...]
>
> > diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> > new file mode 100644
> > index 000000000000..43493a7ead39
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__global_subprog.c
> > @@ -0,0 +1,37 @@
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include "../test_kmods/bpf_testmod.h"
> > +#include "bpf_misc.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +extern void bpf_task_release(struct task_struct *p) __ksym;
> > +
> > +__noinline int subprog_release(__u64 *ctx __arg_ctx)
> > +{
> > +     struct task_struct *task = (struct task_struct *)ctx[1];
> > +     int dummy = (int)ctx[0];
> > +
> > +     bpf_task_release(task);
> > +
> > +     return dummy + 1;
> > +}
> > +
> > +/* Test that the verifier rejects a program that contains a global
> > + * subprogram with referenced kptr arguments
> > + */
> > +SEC("struct_ops/test_refcounted")
>
> Nit: I'd add a __msg("Validating subprog_release() func#1...")
>      before the error message match, just to make sure that the
>      error is reported when subprog_release() is verified.
>

I see. I will add the msg match.

> > +__failure __msg("invalid bpf_context access off=8. Reference may already be released")
> > +int refcounted_fail__global_subprog(unsigned long long *ctx)
> > +{
> > +     struct task_struct *task = (struct task_struct *)ctx[1];
> > +
> > +     bpf_task_release(task);
> > +
> > +     return subprog_release(ctx);
> > +}
> > +
> > +SEC(".struct_ops.link")
> > +struct bpf_testmod_ops testmod_ref_acquire = {
> > +     .test_refcounted = (void *)refcounted_fail__global_subprog,
> > +};
>
> [...]
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-01-24  0:04 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-20 19:55 [PATCH bpf-next v2 00/14] bpf qdisc Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 01/14] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
2025-01-23  9:57   ` Eduard Zingerman
2025-01-23 19:41     ` Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 02/14] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
2025-01-23  9:57   ` Eduard Zingerman
2025-01-24  0:04     ` Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 03/14] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
2025-01-15 15:25   ` Ming Lei
2025-01-23  9:57   ` Eduard Zingerman
2025-01-23 18:19     ` Eduard Zingerman
2024-12-20 19:55 ` [PATCH bpf-next v2 04/14] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
2025-01-23  9:58   ` Eduard Zingerman
2024-12-20 19:55 ` [PATCH bpf-next v2 05/14] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
2025-01-09 15:00   ` Amery Hung
2025-01-10  0:28   ` Martin KaFai Lau
2025-01-10  1:20   ` Jakub Kicinski
2024-12-20 19:55 ` [PATCH bpf-next v2 06/14] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
2025-01-10  0:24   ` Martin KaFai Lau
2025-01-10 18:00     ` Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 07/14] bpf: Search and add kfuncs in struct_ops prologue and epilogue Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 08/14] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
2025-01-09  0:20   ` Martin KaFai Lau
2025-01-09 15:00     ` Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 09/14] bpf: net_sched: Support updating bstats Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 10/14] bpf: net_sched: Support updating qstats Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 11/14] bpf: net_sched: Allow writing to more Qdisc members Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 12/14] libbpf: Support creating and destroying qdisc Amery Hung
2024-12-20 19:55 ` [PATCH bpf-next v2 13/14] selftests: Add a basic fifo qdisc test Amery Hung
2025-01-10  0:05   ` Martin KaFai Lau
2024-12-20 19:55 ` [PATCH bpf-next v2 14/14] selftests: Add a bpf fq qdisc to selftest Amery Hung
2025-01-09 23:36   ` Martin KaFai Lau
2025-01-02 17:29 ` [PATCH bpf-next v2 00/14] bpf qdisc Toke Høiland-Jørgensen
2025-01-10  1:43 ` Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).