* [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-24 0:32 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 02/11] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
` (12 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Allow struct_ops programs to acquire referenced kptrs from arguments
by directly reading the argument.
The verifier will automatically acquire a reference for a struct_ops
argument tagged with "__ref" in the stub function. The user will be
able to access the referenced kptr directly by reading the context
as long as it has not been released by the program.
This new mechanism to acquire a referenced kptr (compared to the existing
"kfunc with KF_ACQUIRE") is introduced for ergonomic and semantic reasons.
In the first use case, Qdisc_ops, an skb is passed to .enqueue in the
first argument. The qdisc becomes the sole owner of the skb and must
enqueue or drop the skb. Representing skbs in bpf qdisc as referenced
kptrs makes sure 1) qdisc will always enqueue or drop the skb in .enqueue,
and 2) qdisc cannot make up invalid skb pointers in .dequeue. The new
mechanism provides a natural way for users to get a referenced kptr in
struct_ops programs.
More importantly, we would also like to make sure that there is only a
single reference to the same skb in qdisc. Since in the future,
skb->rbnode will be utilized to support adding skb to bpf list and rbtree,
allowing multiple references may lead to racy accesses to this field when
the user adds references to the skb to different bpf graphs. The new
mechanism provides a better way to enforce such unique pointer semantics
than forbidding users from calling a KF_ACQUIRE kfunc multiple times.
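For example, with the test op added to bpf_testmod in the next patch, the
usage looks roughly like this (a minimal sketch; the C program below is a
simplified, illustrative counterpart of the assembly selftest):

  /* kernel-side stub: tag the argument with "__ref" */
  static int bpf_testmod_ops__test_refcounted(int dummy,
                                              struct task_struct *task__ref)
  {
          return 0;
  }

  /* bpf side: reading the tagged argument from the context yields a
   * referenced kptr that must be released before the program exits
   */
  extern void bpf_task_release(struct task_struct *p) __ksym;

  SEC("struct_ops/test_refcounted")
  int BPF_PROG(test_refcounted, int dummy, struct task_struct *task)
  {
          bpf_task_release(task);
          return 0;
  }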
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/bpf_struct_ops.c | 26 ++++++++++++++++++++------
kernel/bpf/btf.c | 1 +
kernel/bpf/verifier.c | 34 +++++++++++++++++++++++++++++++---
4 files changed, 55 insertions(+), 9 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cc460786da9b..3891e45ded4e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -924,6 +924,7 @@ struct bpf_insn_access_aux {
struct {
struct btf *btf;
u32 btf_id;
+ u32 ref_obj_id;
};
};
struct bpf_verifier_log *log; /* for verbose logs */
@@ -1427,6 +1428,8 @@ struct bpf_ctx_arg_aux {
enum bpf_reg_type reg_type;
struct btf *btf;
u32 btf_id;
+ u32 ref_obj_id;
+ bool refcounted;
};
struct btf_mod_pair {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 0d515ec57aa5..05f16f21981e 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -145,6 +145,7 @@ void bpf_struct_ops_image_free(void *image)
}
#define MAYBE_NULL_SUFFIX "__nullable"
+#define REFCOUNTED_SUFFIX "__ref"
#define MAX_STUB_NAME 128
/* Return the type info of a stub function, if it exists.
@@ -206,9 +207,11 @@ static int prepare_arg_info(struct btf *btf,
struct bpf_struct_ops_arg_info *arg_info)
{
const struct btf_type *stub_func_proto, *pointed_type;
+ bool is_nullable = false, is_refcounted = false;
const struct btf_param *stub_args, *args;
struct bpf_ctx_arg_aux *info, *info_buf;
u32 nargs, arg_no, info_cnt = 0;
+ const char *suffix;
u32 arg_btf_id;
int offset;
@@ -240,12 +243,19 @@ static int prepare_arg_info(struct btf *btf,
info = info_buf;
for (arg_no = 0; arg_no < nargs; arg_no++) {
/* Skip arguments that is not suffixed with
- * "__nullable".
+ * "__nullable or __ref".
*/
- if (!btf_param_match_suffix(btf, &stub_args[arg_no],
- MAYBE_NULL_SUFFIX))
+ is_nullable = btf_param_match_suffix(btf, &stub_args[arg_no],
+ MAYBE_NULL_SUFFIX);
+ is_refcounted = btf_param_match_suffix(btf, &stub_args[arg_no],
+ REFCOUNTED_SUFFIX);
+ if (!is_nullable && !is_refcounted)
continue;
+ if (is_nullable)
+ suffix = MAYBE_NULL_SUFFIX;
+ else if (is_refcounted)
+ suffix = REFCOUNTED_SUFFIX;
/* Should be a pointer to struct */
pointed_type = btf_type_resolve_ptr(btf,
args[arg_no].type,
@@ -253,7 +263,7 @@ static int prepare_arg_info(struct btf *btf,
if (!pointed_type ||
!btf_type_is_struct(pointed_type)) {
pr_warn("stub function %s__%s has %s tagging to an unsupported type\n",
- st_ops_name, member_name, MAYBE_NULL_SUFFIX);
+ st_ops_name, member_name, suffix);
goto err_out;
}
@@ -271,11 +281,15 @@ static int prepare_arg_info(struct btf *btf,
}
/* Fill the information of the new argument */
- info->reg_type =
- PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
info->btf_id = arg_btf_id;
info->btf = btf;
info->offset = offset;
+ if (is_nullable) {
+ info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID | PTR_MAYBE_NULL;
+ } else if (is_refcounted) {
+ info->reg_type = PTR_TRUSTED | PTR_TO_BTF_ID;
+ info->refcounted = true;
+ }
info++;
info_cnt++;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index de15e8b12fae..52be35b30308 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6516,6 +6516,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
info->reg_type = ctx_arg_info->reg_type;
info->btf = ctx_arg_info->btf ? : btf_vmlinux;
info->btf_id = ctx_arg_info->btf_id;
+ info->ref_obj_id = ctx_arg_info->ref_obj_id;
return true;
}
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 37053cc4defe..f614ab283c37 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1367,6 +1367,17 @@ static int release_reference_state(struct bpf_func_state *state, int ptr_id)
return -EINVAL;
}
+static bool find_reference_state(struct bpf_func_state *state, int ptr_id)
+{
+ int i;
+
+ for (i = 0; i < state->acquired_refs; i++)
+ if (state->refs[i].id == ptr_id)
+ return true;
+
+ return false;
+}
+
static void free_func_state(struct bpf_func_state *state)
{
if (!state)
@@ -5587,7 +5598,7 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off,
/* check access to 'struct bpf_context' fields. Supports fixed offsets only */
static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, int size,
enum bpf_access_type t, enum bpf_reg_type *reg_type,
- struct btf **btf, u32 *btf_id)
+ struct btf **btf, u32 *btf_id, u32 *ref_obj_id)
{
struct bpf_insn_access_aux info = {
.reg_type = *reg_type,
@@ -5606,8 +5617,16 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
*reg_type = info.reg_type;
if (base_type(*reg_type) == PTR_TO_BTF_ID) {
+ if (info.ref_obj_id &&
+ !find_reference_state(cur_func(env), info.ref_obj_id)) {
+ verbose(env, "bpf_context off=%d ref_obj_id=%d is no longer valid\n",
+ off, info.ref_obj_id);
+ return -EACCES;
+ }
+
*btf = info.btf;
*btf_id = info.btf_id;
+ *ref_obj_id = info.ref_obj_id;
} else {
env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size;
}
@@ -6878,7 +6897,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
} else if (reg->type == PTR_TO_CTX) {
enum bpf_reg_type reg_type = SCALAR_VALUE;
struct btf *btf = NULL;
- u32 btf_id = 0;
+ u32 btf_id = 0, ref_obj_id = 0;
if (t == BPF_WRITE && value_regno >= 0 &&
is_pointer_value(env, value_regno)) {
@@ -6891,7 +6910,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
return err;
err = check_ctx_access(env, insn_idx, off, size, t, &reg_type, &btf,
- &btf_id);
+ &btf_id, &ref_obj_id);
if (err)
verbose_linfo(env, insn_idx, "; ");
if (!err && t == BPF_READ && value_regno >= 0) {
@@ -6915,6 +6934,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
if (base_type(reg_type) == PTR_TO_BTF_ID) {
regs[value_regno].btf = btf;
regs[value_regno].btf_id = btf_id;
+ regs[value_regno].ref_obj_id = ref_obj_id;
}
}
regs[value_regno].type = reg_type;
@@ -20897,6 +20917,7 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
{
bool pop_log = !(env->log.level & BPF_LOG_LEVEL2);
struct bpf_subprog_info *sub = subprog_info(env, subprog);
+ struct bpf_ctx_arg_aux *ctx_arg_info;
struct bpf_verifier_state *state;
struct bpf_reg_state *regs;
int ret, i;
@@ -21004,6 +21025,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
mark_reg_known_zero(env, regs, BPF_REG_1);
}
+ if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
+ ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
+ for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
+ if (ctx_arg_info[i].refcounted)
+ ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
+ }
+
ret = do_check(env);
out:
/* check for NULL is necessary, since cur_state can be freed inside
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument
2024-07-14 17:51 ` [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2024-07-24 0:32 ` Martin KaFai Lau
2024-07-24 17:00 ` Amery Hung
0 siblings, 1 reply; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-24 0:32 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> @@ -21004,6 +21025,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> mark_reg_known_zero(env, regs, BPF_REG_1);
> }
>
> + if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> + ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> + for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> + if (ctx_arg_info[i].refcounted)
> + ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
> + }
> +
I think this will miss a case when passing the struct_ops prog ctx (i.e. "__u64
*ctx") to a global subprog. Something like this:
__noinline int subprog_release(__u64 *ctx __arg_ctx)
{
struct task_struct *task = (struct task_struct *)ctx[1];
int dummy = (int)ctx[0];
bpf_task_release(task);
return dummy + 1;
}
SEC("struct_ops/subprog_ref")
__failure
int test_subprog_ref(__u64 *ctx)
{
struct task_struct *task = (struct task_struct *)ctx[1];
bpf_task_release(task);
return subprog_release(ctx);;
}
SEC(".struct_ops.link")
struct bpf_testmod_ops subprog_ref = {
.test_refcounted = (void *)test_subprog_ref,
};
A quick thought is, I think tracking the ctx's ref id in the env->cur_state may
not be the correct place.
[ Just want to bring up what I have noticed so far. I will stop at here for
today and will continue. ]
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument
2024-07-24 0:32 ` Martin KaFai Lau
@ 2024-07-24 17:00 ` Amery Hung
2024-07-25 1:28 ` Martin KaFai Lau
0 siblings, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-24 17:00 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On Tue, Jul 23, 2024 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/14/24 10:51 AM, Amery Hung wrote:
> > @@ -21004,6 +21025,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
> > mark_reg_known_zero(env, regs, BPF_REG_1);
> > }
> >
> > + if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
> > + ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
> > + for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
> > + if (ctx_arg_info[i].refcounted)
> > + ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
> > + }
> > +
>
> I think this will miss a case when passing the struct_ops prog ctx (i.e. "__u64
> *ctx") to a global subprog. Something like this:
>
> __noinline int subprog_release(__u64 *ctx __arg_ctx)
> {
> struct task_struct *task = (struct task_struct *)ctx[1];
> int dummy = (int)ctx[0];
>
> bpf_task_release(task);
>
> return dummy + 1;
> }
>
> SEC("struct_ops/subprog_ref")
> __failure
> int test_subprog_ref(__u64 *ctx)
> {
> struct task_struct *task = (struct task_struct *)ctx[1];
>
> bpf_task_release(task);
>
> return subprog_release(ctx);;
> }
>
> SEC(".struct_ops.link")
> struct bpf_testmod_ops subprog_ref = {
> .test_refcounted = (void *)test_subprog_ref,
> };
>
Thanks for pointing this out. The test did fail.
> A quick thought is, I think tracking the ctx's ref id in the env->cur_state may
> not be the correct place.
I think it is a bit tricky because subprogs are checked independently
and their state is folded (i.e., there can be multiple edges from the
main program to a subprog).
Maybe the verifier can rewrite the program: set the refcounted ctx slot to
NULL when releasing the reference. Then, in do_check_common(), if it is a
global subprog, we mark the refcounted ctx as PTR_MAYBE_NULL to force a
runtime check. How does that sound?
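With that, a global subprog like the one in your example would need a
runtime NULL check before releasing. A rough sketch of the intended
effect (not tested):

  __noinline int subprog_release(__u64 *ctx __arg_ctx)
  {
          struct task_struct *task = (struct task_struct *)ctx[1];

          /* with the refcounted ctx slot marked PTR_MAYBE_NULL in global
           * subprogs, the verifier would reject bpf_task_release() unless
           * the program checks for NULL first
           */
          if (!task)
                  return 0;

          bpf_task_release(task);
          return 0;
  }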
>
> [ Just want to bring up what I have noticed so far. I will stop at here for
> today and will continue. ]
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument
2024-07-24 17:00 ` Amery Hung
@ 2024-07-25 1:28 ` Martin KaFai Lau
0 siblings, 0 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-25 1:28 UTC (permalink / raw)
To: Amery Hung, alexei.starovoitov
Cc: bpf, netdev, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 7/24/24 10:00 AM, Amery Hung wrote:
> On Tue, Jul 23, 2024 at 5:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 7/14/24 10:51 AM, Amery Hung wrote:
>>> @@ -21004,6 +21025,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
>>> mark_reg_known_zero(env, regs, BPF_REG_1);
>>> }
>>>
>>> + if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS) {
>>> + ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
>>> + for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
>>> + if (ctx_arg_info[i].refcounted)
>>> + ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
>>> + }
>>> +
>>
>> I think this will miss a case when passing the struct_ops prog ctx (i.e. "__u64
>> *ctx") to a global subprog. Something like this:
>>
>> __noinline int subprog_release(__u64 *ctx __arg_ctx)
>> {
>> struct task_struct *task = (struct task_struct *)ctx[1];
>> int dummy = (int)ctx[0];
>>
>> bpf_task_release(task);
>>
>> return dummy + 1;
>> }
>>
>> SEC("struct_ops/subprog_ref")
>> __failure
>> int test_subprog_ref(__u64 *ctx)
>> {
>> struct task_struct *task = (struct task_struct *)ctx[1];
>>
>> bpf_task_release(task);
>>
>> return subprog_release(ctx);;
>> }
>>
>> SEC(".struct_ops.link")
>> struct bpf_testmod_ops subprog_ref = {
>> .test_refcounted = (void *)test_subprog_ref,
>> };
>>
>
> Thanks for pointing this out. The test did fail.
>
>> A quick thought is, I think tracking the ctx's ref id in the env->cur_state may
>> not be the correct place.
>
> I think it is a bit tricky because subprogs are checked independently
> and their state is folded (i.e., there can be multiple edges from the
> main program to a subprog).
>
> Maybe the verifier can rewrite the program: set the refcounted ctx to
> NULL when releasing reference. Then, in do_check_common(), if it is a
> global subprog, we mark refcounted ctx as PTR_MAYBE_NULL to force a
> runtime check. How does it sound?
I don't know how to get the ctx pointer to patch the code. It is not always in r1.
A case like this should still break even with the PTR_MAYBE_NULL marking in
both the main prog and the subprogs (I haven't tried this one myself):
SEC("struct_ops/subprog_ref")
int test_subprog_ref(__u64 *ctx)
{
struct task_struct *task = (struct task_struct *)ctx[1];
if (task) {
subprog_release(ctx);
bpf_task_release(task);
}
return 0;
}
afaik, the global subprog is checked independently from the main prog and it
does not know the state of the main prog. Take a look at the subprog_is_global()
case in the check_func_call().
How about only acquire_reference_state() for the main prog? Yes, the global
subprog cannot do the bpf_kptr_xchg() and bpf_qdisc_skb_drop() but it can still
read the skb. The non-global subprog (static) should work though (please test).
I don't have a better idea. Maybe Alexei can provide some guidance here?
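To make the idea concrete, a rough sketch against the hunk quoted above
(untested; whether gating on subprog 0 is sufficient is exactly the open
question):

  /* only treat the "__ref" ctx slots as acquired references when
   * verifying the main prog; a global subprog would then see the
   * argument as a trusted pointer it cannot release
   */
  if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS && !subprog) {
          ctx_arg_info = (struct bpf_ctx_arg_aux *)env->prog->aux->ctx_arg_info;
          for (i = 0; i < env->prog->aux->ctx_arg_info_size; i++)
                  if (ctx_arg_info[i].refcounted)
                          ctx_arg_info[i].ref_obj_id = acquire_reference_state(env, 0);
  }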
^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v9 02/11] selftests/bpf: Test referenced kptr arguments of struct_ops programs
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
` (11 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Test referenced kptrs acquired through struct_ops arguments tagged with
"__ref". The success case checks whether 1) a reference to the correct
type is acquired, and 2) the referenced kptr argument can be accessed in
multiple paths as long as it hasn't been released. In the fail case,
we confirm that a referenced kptr acquired through a struct_ops argument,
just like ones acquired via kfuncs, cannot be leaked.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 7 ++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 2 +
.../prog_tests/test_struct_ops_refcounted.c | 41 ++++++++++++
.../bpf/progs/struct_ops_refcounted.c | 67 +++++++++++++++++++
.../struct_ops_refcounted_fail__ref_leak.c | 17 +++++
5 files changed, 134 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index f8962a1dd397..316a4c3d3a88 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -916,10 +916,17 @@ static int bpf_testmod_ops__test_maybe_null(int dummy,
return 0;
}
+static int bpf_testmod_ops__test_refcounted(int dummy,
+ struct task_struct *task__ref)
+{
+ return 0;
+}
+
static struct bpf_testmod_ops __bpf_testmod_ops = {
.test_1 = bpf_testmod_test_1,
.test_2 = bpf_testmod_test_2,
.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+ .test_refcounted = bpf_testmod_ops__test_refcounted,
};
struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index 23fa1872ee67..bfef5f382d01 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -35,6 +35,8 @@ struct bpf_testmod_ops {
void (*test_2)(int a, int b);
/* Used to test nullable arguments. */
int (*test_maybe_null)(int dummy, struct task_struct *task);
+ /* Used to test ref_acquired arguments. */
+ int (*test_refcounted)(int dummy, struct task_struct *task);
/* The following fields are used to test shadow copies. */
char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
new file mode 100644
index 000000000000..c463b46538d2
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_refcounted.c
@@ -0,0 +1,41 @@
+#include <test_progs.h>
+
+#include "struct_ops_refcounted.skel.h"
+#include "struct_ops_refcounted_fail__ref_leak.skel.h"
+
+/* Test that the verifier accepts a program that acquires a referenced
+ * kptr and releases the reference
+ */
+static void refcounted(void)
+{
+ struct struct_ops_refcounted *skel;
+
+ skel = struct_ops_refcounted__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+ return;
+
+ struct_ops_refcounted__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that acquires a referenced
+ * kptr without releasing the reference
+ */
+static void refcounted_fail__ref_leak(void)
+{
+ struct struct_ops_refcounted_fail__ref_leak *skel;
+
+ skel = struct_ops_refcounted_fail__ref_leak__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__open_and_load"))
+ return;
+
+ struct_ops_refcounted_fail__ref_leak__destroy(skel);
+}
+
+void test_struct_ops_refcounted(void)
+{
+ if (test__start_subtest("refcounted"))
+ refcounted();
+ if (test__start_subtest("refcounted_fail__ref_leak"))
+ refcounted_fail__ref_leak();
+}
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
new file mode 100644
index 000000000000..2c1326668b92
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted.c
@@ -0,0 +1,67 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This is a test BPF program that uses struct_ops to access a referenced
+ * kptr argument. This is a test for the verifier to ensure that it
+ * 1) recognizes the task as a referenced object (i.e., ref_obj_id > 0), and
+ * 2) the same reference can be acquired from multiple paths as long as it
+ * has not been released.
+ *
+ * test_refcounted() is equivalent to the C code below. It is written in assembly
+ * to avoid reads from task (i.e., getting referenced kptrs to task) being merged
+ * into a single path by the compiler.
+ *
+ * int test_refcounted(int dummy, struct task_struct *task)
+ * {
+ * if (dummy % 2)
+ * bpf_task_release(task);
+ * else
+ * bpf_task_release(task);
+ * return 0;
+ * }
+ */
+SEC("struct_ops/test_refcounted")
+int test_refcounted(unsigned long long *ctx)
+{
+ asm volatile (" \
+ /* r6 = dummy */ \
+ r6 = *(u64 *)(r1 + 0x0); \
+ /* if (r6 & 0x1 != 0) */ \
+ r6 &= 0x1; \
+ if r6 == 0 goto l0_%=; \
+ /* r1 = task */ \
+ r1 = *(u64 *)(r1 + 0x8); \
+ call %[bpf_task_release]; \
+ goto l1_%=; \
+l0_%=: /* r1 = task */ \
+ r1 = *(u64 *)(r1 + 0x8); \
+ call %[bpf_task_release]; \
+l1_%=: /* return 0 */ \
+" :
+ : __imm(bpf_task_release)
+ : __clobber_all);
+ return 0;
+}
+
+/* BTF FUNC records are not generated for kfuncs referenced
+ * from inline assembly. These records are necessary for
+ * libbpf to link the program. The function below is a hack
+ * to ensure that BTF FUNC records are generated.
+ */
+void __btf_root(void)
+{
+ bpf_task_release(NULL);
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_refcounted = {
+ .test_refcounted = (void *)test_refcounted,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
new file mode 100644
index 000000000000..6e82859eb187
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_refcounted_fail__ref_leak.c
@@ -0,0 +1,17 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/test_refcounted")
+int BPF_PROG(test_refcounted, int dummy,
+ struct task_struct *task)
+{
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_ref_acquire = {
+ .test_refcounted = (void *)test_refcounted,
+};
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 01/11] bpf: Support getting referenced kptr from struct_ops argument Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 02/11] selftests/bpf: Test referenced kptr arguments of struct_ops programs Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-24 5:36 ` Kui-Feng Lee
2024-07-24 23:57 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 04/11] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
` (10 subsequent siblings)
13 siblings, 2 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Allow a struct_ops program to return a referenced kptr if the struct_ops
operator has a pointer to struct as its return type. To make sure the
returned pointer continues to be valid in the kernel, several
constraints are required:
1) The type of the pointer must match the return type
2) The pointer originally comes from the kernel (not locally allocated)
3) The pointer is in its unmodified form
In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
pointer to be returned when there is no skb to be dequeued, we will allow
a scalar value equal to NULL to be returned.
In the future when there is a struct_ops user that always expects a valid
pointer to be returned from an operator, we may extend tagging to the
return value. We can tell the verifier to only allow NULL pointer return
if the return value is tagged with MAY_BE_NULL.
The check is split into two parts since check_reference_leak() happens
before check_return_code(). We first allow a reference object to leak
through the return if it is in the return register and the type matches the
return type. Then, we check whether the to-be-returned pointer is valid in
check_return_code().
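As an illustration, a sketch of what a pointer-returning op could look
like on the bpf side (the op name is hypothetical; it combines a "__ref"
argument from the first patch with the return rules above):

  extern void bpf_task_release(struct task_struct *p) __ksym;

  /* hypothetical op:
   *   struct task_struct *(*test_return_ref)(int dummy,
   *                                           struct task_struct *task__ref);
   */
  SEC("struct_ops/test_return_ref")
  struct task_struct *BPF_PROG(test_return_ref, int dummy,
                               struct task_struct *task)
  {
          if (dummy)
                  /* unmodified referenced kptr of the matching type */
                  return task;

          bpf_task_release(task);
          /* a scalar NULL is also accepted */
          return NULL;
  }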
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 46 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f614ab283c37..e7f356098902 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
{
+ enum bpf_prog_type type = resolve_prog_type(env->prog);
+ u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
+ struct bpf_reg_state *reg = reg_state(env, regno);
struct bpf_func_state *state = cur_func(env);
+ const struct bpf_prog *prog = env->prog;
+ const struct btf_type *ret_type = NULL;
bool refs_lingering = false;
+ struct btf *btf;
int i;
if (!exception_exit && state->frameno && !state->in_callback_fn)
return 0;
+ if (type == BPF_PROG_TYPE_STRUCT_OPS &&
+ reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
+ btf = bpf_prog_get_target_btf(prog);
+ ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
+ if (reg->btf_id != ret_type->type) {
+ verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
+ btf_type_name(reg->btf, reg->btf_id),
+ btf_type_name(btf, ret_type->type));
+ return -EINVAL;
+ }
+ }
+
for (i = 0; i < state->acquired_refs; i++) {
if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
continue;
+ if (ret_type && reg->ref_obj_id == state->refs[i].id)
+ continue;
verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
state->refs[i].id, state->refs[i].insn_idx);
refs_lingering = true;
@@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
const char *exit_ctx = "At program exit";
struct tnum enforce_attach_type_range = tnum_unknown;
const struct bpf_prog *prog = env->prog;
- struct bpf_reg_state *reg;
+ struct bpf_reg_state *reg = reg_state(env, regno);
struct bpf_retval_range range = retval_range(0, 1);
enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
int err;
struct bpf_func_state *frame = env->cur_state->frame[0];
const bool is_subprog = frame->subprogno;
+ struct btf *btf = bpf_prog_get_target_btf(prog);
+ bool st_ops_ret_is_kptr = false;
+ const struct btf_type *t;
/* LSM and struct_ops func-ptr's return type could be "void" */
if (!is_subprog || frame->in_exception_callback_fn) {
@@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
if (prog->expected_attach_type == BPF_LSM_CGROUP)
/* See below, can be 0 or 0-1 depending on hook. */
break;
- fallthrough;
+ if (!prog->aux->attach_func_proto->type)
+ return 0;
+ break;
case BPF_PROG_TYPE_STRUCT_OPS:
if (!prog->aux->attach_func_proto->type)
return 0;
+
+ t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
+ if (btf_type_is_ptr(t)) {
+ /* Allow struct_ops programs to return kptr or null if
+ * the return type is a pointer type.
+ * check_reference_leak has ensured the returning kptr
+ * matches the type of the function prototype and is
+ * the only leaking reference. Thus, we can safely return
+ * if the pointer is in its unmodified form
+ */
+ if (reg->type & PTR_TO_BTF_ID)
+ return __check_ptr_off_reg(env, reg, regno, false);
+ st_ops_ret_is_kptr = true;
+ }
break;
default:
break;
@@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
return -EACCES;
}
- reg = cur_regs(env) + regno;
-
if (frame->in_async_callback_fn) {
/* enforce return zero from async callbacks like timer */
exit_ctx = "At async callback return";
@@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
case BPF_PROG_TYPE_NETFILTER:
range = retval_range(NF_DROP, NF_ACCEPT);
break;
+ case BPF_PROG_TYPE_STRUCT_OPS:
+ if (!st_ops_ret_is_kptr)
+ return 0;
+ range = retval_range(0, 0);
+ break;
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-14 17:51 ` [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-07-24 5:36 ` Kui-Feng Lee
2024-07-24 18:27 ` Kui-Feng Lee
2024-07-24 20:44 ` Amery Hung
2024-07-24 23:57 ` Martin KaFai Lau
1 sibling, 2 replies; 42+ messages in thread
From: Kui-Feng Lee @ 2024-07-24 5:36 UTC (permalink / raw)
To: Amery Hung, netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 7/14/24 10:51, Amery Hung wrote:
> Allow a struct_ops program to return a referenced kptr if the struct_ops
> operator has pointer to struct as the return type. To make sure the
> returned pointer continues to be valid in the kernel, several
> constraints are required:
>
> 1) The type of the pointer must matches the return type
> 2) The pointer originally comes from the kernel (not locally allocated)
> 3) The pointer is in its unmodified form
>
> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
> pointer to be returned when there is no skb to be dequeued, we will allow
> a scalar value with value equals to NULL to be returned.
>
> In the future when there is a struct_ops user that always expects a valid
> pointer to be returned from an operator, we may extend tagging to the
> return value. We can tell the verifier to only allow NULL pointer return
> if the return value is tagged with MAY_BE_NULL.
>
> The check is split into two parts since check_reference_leak() happens
> before check_return_code(). We first allow a reference object to leak
> through return if it is in the return register and the type matches the
> return type. Then, we check whether the pointer to-be-returned is valid in
> check_return_code().
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f614ab283c37..e7f356098902 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>
> static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> {
> + enum bpf_prog_type type = resolve_prog_type(env->prog);
> + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_func_state *state = cur_func(env);
> + const struct bpf_prog *prog = env->prog;
> + const struct btf_type *ret_type = NULL;
> bool refs_lingering = false;
> + struct btf *btf;
> int i;
>
> if (!exception_exit && state->frameno && !state->in_callback_fn)
> return 0;
>
> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
> + btf = bpf_prog_get_target_btf(prog);
> + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (reg->btf_id != ret_type->type) {
> + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> + btf_type_name(reg->btf, reg->btf_id),
> + btf_type_name(btf, ret_type->type));
> + return -EINVAL;
> + }
> + }
> +
> for (i = 0; i < state->acquired_refs; i++) {
> if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> continue;
> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
> + continue;
Is it possible having two kptrs that both are in the returned type
passing into a function?
> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> state->refs[i].id, state->refs[i].insn_idx);
> refs_lingering = true;
> @@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> const char *exit_ctx = "At program exit";
> struct tnum enforce_attach_type_range = tnum_unknown;
> const struct bpf_prog *prog = env->prog;
> - struct bpf_reg_state *reg;
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_retval_range range = retval_range(0, 1);
> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> int err;
> struct bpf_func_state *frame = env->cur_state->frame[0];
> const bool is_subprog = frame->subprogno;
> + struct btf *btf = bpf_prog_get_target_btf(prog);
> + bool st_ops_ret_is_kptr = false;
> + const struct btf_type *t;
>
> /* LSM and struct_ops func-ptr's return type could be "void" */
> if (!is_subprog || frame->in_exception_callback_fn) {
> @@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> if (prog->expected_attach_type == BPF_LSM_CGROUP)
> /* See below, can be 0 or 0-1 depending on hook. */
> break;
> - fallthrough;
> + if (!prog->aux->attach_func_proto->type)
> + return 0;
> + break;
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (!prog->aux->attach_func_proto->type)
> return 0;
> +
> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (btf_type_is_ptr(t)) {
> + /* Allow struct_ops programs to return kptr or null if
> + * the return type is a pointer type.
> + * check_reference_leak has ensured the returning kptr
> + * matches the type of the function prototype and is
> + * the only leaking reference. Thus, we can safely return
> + * if the pointer is in its unmodified form
> + */
> + if (reg->type & PTR_TO_BTF_ID)
> + return __check_ptr_off_reg(env, reg, regno, false);
> + st_ops_ret_is_kptr = true;
> + }
> break;
> default:
> break;
> @@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> return -EACCES;
> }
>
> - reg = cur_regs(env) + regno;
> -
> if (frame->in_async_callback_fn) {
> /* enforce return zero from async callbacks like timer */
> exit_ctx = "At async callback return";
> @@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> case BPF_PROG_TYPE_NETFILTER:
> range = retval_range(NF_DROP, NF_ACCEPT);
> break;
> + case BPF_PROG_TYPE_STRUCT_OPS:
> + if (!st_ops_ret_is_kptr)
> + return 0;
> + range = retval_range(0, 0);
> + break;
> case BPF_PROG_TYPE_EXT:
> /* freplace program can return anything as its return value
> * depends on the to-be-replaced kernel func or bpf program.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-24 5:36 ` Kui-Feng Lee
@ 2024-07-24 18:27 ` Kui-Feng Lee
2024-07-24 20:44 ` Amery Hung
1 sibling, 0 replies; 42+ messages in thread
From: Kui-Feng Lee @ 2024-07-24 18:27 UTC (permalink / raw)
To: Amery Hung, netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 7/23/24 22:36, Kui-Feng Lee wrote:
>
>
> On 7/14/24 10:51, Amery Hung wrote:
>> Allow a struct_ops program to return a referenced kptr if the struct_ops
>> operator has pointer to struct as the return type. To make sure the
>> returned pointer continues to be valid in the kernel, several
>> constraints are required:
>>
>> 1) The type of the pointer must matches the return type
>> 2) The pointer originally comes from the kernel (not locally allocated)
>> 3) The pointer is in its unmodified form
>>
>> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
>> pointer to be returned when there is no skb to be dequeued, we will allow
>> a scalar value with value equals to NULL to be returned.
>>
>> In the future when there is a struct_ops user that always expects a valid
>> pointer to be returned from an operator, we may extend tagging to the
>> return value. We can tell the verifier to only allow NULL pointer return
>> if the return value is tagged with MAY_BE_NULL.
>>
>> The check is split into two parts since check_reference_leak() happens
>> before check_return_code(). We first allow a reference object to leak
>> through return if it is in the return register and the type matches the
>> return type. Then, we check whether the pointer to-be-returned is
>> valid in
>> check_return_code().
>>
>> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
>> ---
>> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 46 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f614ab283c37..e7f356098902 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env
>> *env, struct bpf_call_arg_meta *meta,
>> static int check_reference_leak(struct bpf_verifier_env *env, bool
>> exception_exit)
>> {
>> + enum bpf_prog_type type = resolve_prog_type(env->prog);
>> + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
>> + struct bpf_reg_state *reg = reg_state(env, regno);
>> struct bpf_func_state *state = cur_func(env);
>> + const struct bpf_prog *prog = env->prog;
>> + const struct btf_type *ret_type = NULL;
>> bool refs_lingering = false;
>> + struct btf *btf;
>> int i;
>> if (!exception_exit && state->frameno && !state->in_callback_fn)
>> return 0;
>> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
>> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
>> + btf = bpf_prog_get_target_btf(prog);
>> + ret_type = btf_type_by_id(btf,
>> prog->aux->attach_func_proto->type);
>> + if (reg->btf_id != ret_type->type) {
>> + verbose(env, "Return kptr type, struct %s, doesn't match
>> function prototype, struct %s\n",
>> + btf_type_name(reg->btf, reg->btf_id),
>> + btf_type_name(btf, ret_type->type));
>> + return -EINVAL;
>> + }
>> + }
>> +
>> for (i = 0; i < state->acquired_refs; i++) {
>> if (!exception_exit && state->in_callback_fn &&
>> state->refs[i].callback_ref != state->frameno)
>> continue;
>> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
>> + continue;
>
> Is it possible having two kptrs that both are in the returned type
> passing into a function?
Does it work to remove the ref pointed by reg0 from state
at the location that handles BPF_EXIT in do_check()?
>
>
>> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
>> state->refs[i].id, state->refs[i].insn_idx);
>> refs_lingering = true;
>> @@ -15677,12 +15697,15 @@ static int check_return_code(struct
>> bpf_verifier_env *env, int regno, const char
>> const char *exit_ctx = "At program exit";
>> struct tnum enforce_attach_type_range = tnum_unknown;
>> const struct bpf_prog *prog = env->prog;
>> - struct bpf_reg_state *reg;
>> + struct bpf_reg_state *reg = reg_state(env, regno);
>> struct bpf_retval_range range = retval_range(0, 1);
>> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>> int err;
>> struct bpf_func_state *frame = env->cur_state->frame[0];
>> const bool is_subprog = frame->subprogno;
>> + struct btf *btf = bpf_prog_get_target_btf(prog);
>> + bool st_ops_ret_is_kptr = false;
>> + const struct btf_type *t;
>> /* LSM and struct_ops func-ptr's return type could be "void" */
>> if (!is_subprog || frame->in_exception_callback_fn) {
>> @@ -15691,10 +15714,26 @@ static int check_return_code(struct
>> bpf_verifier_env *env, int regno, const char
>> if (prog->expected_attach_type == BPF_LSM_CGROUP)
>> /* See below, can be 0 or 0-1 depending on hook. */
>> break;
>> - fallthrough;
>> + if (!prog->aux->attach_func_proto->type)
>> + return 0;
>> + break;
>> case BPF_PROG_TYPE_STRUCT_OPS:
>> if (!prog->aux->attach_func_proto->type)
>> return 0;
>> +
>> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
>> + if (btf_type_is_ptr(t)) {
>> + /* Allow struct_ops programs to return kptr or null if
>> + * the return type is a pointer type.
>> + * check_reference_leak has ensured the returning kptr
>> + * matches the type of the function prototype and is
>> + * the only leaking reference. Thus, we can safely
>> return
>> + * if the pointer is in its unmodified form
>> + */
>> + if (reg->type & PTR_TO_BTF_ID)
>> + return __check_ptr_off_reg(env, reg, regno, false);
>> + st_ops_ret_is_kptr = true;
>> + }
>> break;
>> default:
>> break;
>> @@ -15716,8 +15755,6 @@ static int check_return_code(struct
>> bpf_verifier_env *env, int regno, const char
>> return -EACCES;
>> }
>> - reg = cur_regs(env) + regno;
>> -
>> if (frame->in_async_callback_fn) {
>> /* enforce return zero from async callbacks like timer */
>> exit_ctx = "At async callback return";
>> @@ -15804,6 +15841,11 @@ static int check_return_code(struct
>> bpf_verifier_env *env, int regno, const char
>> case BPF_PROG_TYPE_NETFILTER:
>> range = retval_range(NF_DROP, NF_ACCEPT);
>> break;
>> + case BPF_PROG_TYPE_STRUCT_OPS:
>> + if (!st_ops_ret_is_kptr)
>> + return 0;
>> + range = retval_range(0, 0);
>> + break;
>> case BPF_PROG_TYPE_EXT:
>> /* freplace program can return anything as its return value
>> * depends on the to-be-replaced kernel func or bpf program.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-24 5:36 ` Kui-Feng Lee
2024-07-24 18:27 ` Kui-Feng Lee
@ 2024-07-24 20:44 ` Amery Hung
2024-07-26 18:22 ` Kui-Feng Lee
1 sibling, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-24 20:44 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: netdev, bpf, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Tue, Jul 23, 2024 at 10:36 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 7/14/24 10:51, Amery Hung wrote:
> > Allow a struct_ops program to return a referenced kptr if the struct_ops
> > operator has pointer to struct as the return type. To make sure the
> > returned pointer continues to be valid in the kernel, several
> > constraints are required:
> >
> > 1) The type of the pointer must matches the return type
> > 2) The pointer originally comes from the kernel (not locally allocated)
> > 3) The pointer is in its unmodified form
> >
> > In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
> > pointer to be returned when there is no skb to be dequeued, we will allow
> > a scalar value with value equals to NULL to be returned.
> >
> > In the future when there is a struct_ops user that always expects a valid
> > pointer to be returned from an operator, we may extend tagging to the
> > return value. We can tell the verifier to only allow NULL pointer return
> > if the return value is tagged with MAY_BE_NULL.
> >
> > The check is split into two parts since check_reference_leak() happens
> > before check_return_code(). We first allow a reference object to leak
> > through return if it is in the return register and the type matches the
> > return type. Then, we check whether the pointer to-be-returned is valid in
> > check_return_code().
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 46 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index f614ab283c37..e7f356098902 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
> >
> > static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> > {
> > + enum bpf_prog_type type = resolve_prog_type(env->prog);
> > + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
> > + struct bpf_reg_state *reg = reg_state(env, regno);
> > struct bpf_func_state *state = cur_func(env);
> > + const struct bpf_prog *prog = env->prog;
> > + const struct btf_type *ret_type = NULL;
> > bool refs_lingering = false;
> > + struct btf *btf;
> > int i;
> >
> > if (!exception_exit && state->frameno && !state->in_callback_fn)
> > return 0;
> >
> > + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
> > + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
> > + btf = bpf_prog_get_target_btf(prog);
> > + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> > + if (reg->btf_id != ret_type->type) {
> > + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> > + btf_type_name(reg->btf, reg->btf_id),
> > + btf_type_name(btf, ret_type->type));
> > + return -EINVAL;
> > + }
> > + }
> > +
> > for (i = 0; i < state->acquired_refs; i++) {
> > if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> > continue;
> > + if (ret_type && reg->ref_obj_id == state->refs[i].id)
> > + continue;
>
> Is it possible having two kptrs that both are in the returned type
> passing into a function?
>
Just to make sure I understand the question correctly: Are you asking
what would happen here if a struct_ops operator has the following
signature?
struct foo *xxx_ops__dummy_op(struct foo *foo_a__ref, struct foo *foo_b__ref)
>
> > verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> > state->refs[i].id, state->refs[i].insn_idx);
> > refs_lingering = true;
> > @@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> > const char *exit_ctx = "At program exit";
> > struct tnum enforce_attach_type_range = tnum_unknown;
> > const struct bpf_prog *prog = env->prog;
> > - struct bpf_reg_state *reg;
> > + struct bpf_reg_state *reg = reg_state(env, regno);
> > struct bpf_retval_range range = retval_range(0, 1);
> > enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> > int err;
> > struct bpf_func_state *frame = env->cur_state->frame[0];
> > const bool is_subprog = frame->subprogno;
> > + struct btf *btf = bpf_prog_get_target_btf(prog);
> > + bool st_ops_ret_is_kptr = false;
> > + const struct btf_type *t;
> >
> > /* LSM and struct_ops func-ptr's return type could be "void" */
> > if (!is_subprog || frame->in_exception_callback_fn) {
> > @@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> > if (prog->expected_attach_type == BPF_LSM_CGROUP)
> > /* See below, can be 0 or 0-1 depending on hook. */
> > break;
> > - fallthrough;
> > + if (!prog->aux->attach_func_proto->type)
> > + return 0;
> > + break;
> > case BPF_PROG_TYPE_STRUCT_OPS:
> > if (!prog->aux->attach_func_proto->type)
> > return 0;
> > +
> > + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> > + if (btf_type_is_ptr(t)) {
> > + /* Allow struct_ops programs to return kptr or null if
> > + * the return type is a pointer type.
> > + * check_reference_leak has ensured the returning kptr
> > + * matches the type of the function prototype and is
> > + * the only leaking reference. Thus, we can safely return
> > + * if the pointer is in its unmodified form
> > + */
> > + if (reg->type & PTR_TO_BTF_ID)
> > + return __check_ptr_off_reg(env, reg, regno, false);
> > + st_ops_ret_is_kptr = true;
> > + }
> > break;
> > default:
> > break;
> > @@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> > return -EACCES;
> > }
> >
> > - reg = cur_regs(env) + regno;
> > -
> > if (frame->in_async_callback_fn) {
> > /* enforce return zero from async callbacks like timer */
> > exit_ctx = "At async callback return";
> > @@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> > case BPF_PROG_TYPE_NETFILTER:
> > range = retval_range(NF_DROP, NF_ACCEPT);
> > break;
> > + case BPF_PROG_TYPE_STRUCT_OPS:
> > + if (!st_ops_ret_is_kptr)
> > + return 0;
> > + range = retval_range(0, 0);
> > + break;
> > case BPF_PROG_TYPE_EXT:
> > /* freplace program can return anything as its return value
> > * depends on the to-be-replaced kernel func or bpf program.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-24 20:44 ` Amery Hung
@ 2024-07-26 18:22 ` Kui-Feng Lee
2024-07-26 22:45 ` Amery Hung
0 siblings, 1 reply; 42+ messages in thread
From: Kui-Feng Lee @ 2024-07-26 18:22 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On 7/24/24 13:44, Amery Hung wrote:
> On Tue, Jul 23, 2024 at 10:36 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 7/14/24 10:51, Amery Hung wrote:
>>> Allow a struct_ops program to return a referenced kptr if the struct_ops
>>> operator has pointer to struct as the return type. To make sure the
>>> returned pointer continues to be valid in the kernel, several
>>> constraints are required:
>>>
>>> 1) The type of the pointer must matches the return type
>>> 2) The pointer originally comes from the kernel (not locally allocated)
>>> 3) The pointer is in its unmodified form
>>>
>>> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
>>> pointer to be returned when there is no skb to be dequeued, we will allow
>>> a scalar value with value equals to NULL to be returned.
>>>
>>> In the future when there is a struct_ops user that always expects a valid
>>> pointer to be returned from an operator, we may extend tagging to the
>>> return value. We can tell the verifier to only allow NULL pointer return
>>> if the return value is tagged with MAY_BE_NULL.
>>>
>>> The check is split into two parts since check_reference_leak() happens
>>> before check_return_code(). We first allow a reference object to leak
>>> through return if it is in the return register and the type matches the
>>> return type. Then, we check whether the pointer to-be-returned is valid in
>>> check_return_code().
>>>
>>> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
>>> ---
>>> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
>>> 1 file changed, 46 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index f614ab283c37..e7f356098902 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>>>
>>> static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
>>> {
>>> + enum bpf_prog_type type = resolve_prog_type(env->prog);
>>> + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
>>> + struct bpf_reg_state *reg = reg_state(env, regno);
>>> struct bpf_func_state *state = cur_func(env);
>>> + const struct bpf_prog *prog = env->prog;
>>> + const struct btf_type *ret_type = NULL;
>>> bool refs_lingering = false;
>>> + struct btf *btf;
>>> int i;
>>>
>>> if (!exception_exit && state->frameno && !state->in_callback_fn)
>>> return 0;
>>>
>>> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
>>> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
>>> + btf = bpf_prog_get_target_btf(prog);
>>> + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
>>> + if (reg->btf_id != ret_type->type) {
>>> + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
>>> + btf_type_name(reg->btf, reg->btf_id),
>>> + btf_type_name(btf, ret_type->type));
>>> + return -EINVAL;
>>> + }
>>> + }
>>> +
>>> for (i = 0; i < state->acquired_refs; i++) {
>>> if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
>>> continue;
>>> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
>>> + continue;
>>
>> Is it possible having two kptrs that both are in the returned type
>> passing into a function?
>>
>
> Just to make sure I understand the question correctly: Are you asking
> what would happen here if a struct_ops operator has the following
> signature?
>
> struct foo *xxx_ops__dummy_op(struct foo *foo_a__ref, struct foo *foo_b__ref)
Right! What would happen to this case? Could one of them leak without
being detected?
>
>>
>>> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
>>> state->refs[i].id, state->refs[i].insn_idx);
>>> refs_lingering = true;
>>> @@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>>> const char *exit_ctx = "At program exit";
>>> struct tnum enforce_attach_type_range = tnum_unknown;
>>> const struct bpf_prog *prog = env->prog;
>>> - struct bpf_reg_state *reg;
>>> + struct bpf_reg_state *reg = reg_state(env, regno);
>>> struct bpf_retval_range range = retval_range(0, 1);
>>> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>>> int err;
>>> struct bpf_func_state *frame = env->cur_state->frame[0];
>>> const bool is_subprog = frame->subprogno;
>>> + struct btf *btf = bpf_prog_get_target_btf(prog);
>>> + bool st_ops_ret_is_kptr = false;
>>> + const struct btf_type *t;
>>>
>>> /* LSM and struct_ops func-ptr's return type could be "void" */
>>> if (!is_subprog || frame->in_exception_callback_fn) {
>>> @@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>>> if (prog->expected_attach_type == BPF_LSM_CGROUP)
>>> /* See below, can be 0 or 0-1 depending on hook. */
>>> break;
>>> - fallthrough;
>>> + if (!prog->aux->attach_func_proto->type)
>>> + return 0;
>>> + break;
>>> case BPF_PROG_TYPE_STRUCT_OPS:
>>> if (!prog->aux->attach_func_proto->type)
>>> return 0;
>>> +
>>> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
>>> + if (btf_type_is_ptr(t)) {
>>> + /* Allow struct_ops programs to return kptr or null if
>>> + * the return type is a pointer type.
>>> + * check_reference_leak has ensured the returning kptr
>>> + * matches the type of the function prototype and is
>>> + * the only leaking reference. Thus, we can safely return
>>> + * if the pointer is in its unmodified form
>>> + */
>>> + if (reg->type & PTR_TO_BTF_ID)
>>> + return __check_ptr_off_reg(env, reg, regno, false);
>>> + st_ops_ret_is_kptr = true;
>>> + }
>>> break;
>>> default:
>>> break;
>>> @@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>>> return -EACCES;
>>> }
>>>
>>> - reg = cur_regs(env) + regno;
>>> -
>>> if (frame->in_async_callback_fn) {
>>> /* enforce return zero from async callbacks like timer */
>>> exit_ctx = "At async callback return";
>>> @@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
>>> case BPF_PROG_TYPE_NETFILTER:
>>> range = retval_range(NF_DROP, NF_ACCEPT);
>>> break;
>>> + case BPF_PROG_TYPE_STRUCT_OPS:
>>> + if (!st_ops_ret_is_kptr)
>>> + return 0;
>>> + range = retval_range(0, 0);
>>> + break;
>>> case BPF_PROG_TYPE_EXT:
>>> /* freplace program can return anything as its return value
>>> * depends on the to-be-replaced kernel func or bpf program.
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-26 18:22 ` Kui-Feng Lee
@ 2024-07-26 22:45 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-26 22:45 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: netdev, bpf, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Fri, Jul 26, 2024 at 11:22 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 7/24/24 13:44, Amery Hung wrote:
> > On Tue, Jul 23, 2024 at 10:36 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
> >>
> >>
> >>
> >> On 7/14/24 10:51, Amery Hung wrote:
> >>> Allow a struct_ops program to return a referenced kptr if the struct_ops
> >>> operator has a pointer to struct as the return type. To make sure the
> >>> returned pointer continues to be valid in the kernel, several
> >>> constraints are required:
> >>>
> >>> 1) The type of the pointer must match the return type
> >>> 2) The pointer originally comes from the kernel (not locally allocated)
> >>> 3) The pointer is in its unmodified form
> >>>
> >>> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
> >>> pointer to be returned when there is no skb to be dequeued, we will allow
> >>> a scalar value equal to NULL to be returned.
> >>>
> >>> In the future when there is a struct_ops user that always expects a valid
> >>> pointer to be returned from an operator, we may extend tagging to the
> >>> return value. We can tell the verifier to only allow NULL pointer return
> >>> if the return value is tagged with MAY_BE_NULL.
> >>>
> >>> The check is split into two parts since check_reference_leak() happens
> >>> before check_return_code(). We first allow a reference object to leak
> >>> through return if it is in the return register and the type matches the
> >>> return type. Then, we check whether the pointer to-be-returned is valid in
> >>> check_return_code().
> >>>
> >>> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> >>> ---
> >>> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
> >>> 1 file changed, 46 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>> index f614ab283c37..e7f356098902 100644
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
> >>>
> >>> static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> >>> {
> >>> + enum bpf_prog_type type = resolve_prog_type(env->prog);
> >>> + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
> >>> + struct bpf_reg_state *reg = reg_state(env, regno);
> >>> struct bpf_func_state *state = cur_func(env);
> >>> + const struct bpf_prog *prog = env->prog;
> >>> + const struct btf_type *ret_type = NULL;
> >>> bool refs_lingering = false;
> >>> + struct btf *btf;
> >>> int i;
> >>>
> >>> if (!exception_exit && state->frameno && !state->in_callback_fn)
> >>> return 0;
> >>>
> >>> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
> >>> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
> >>> + btf = bpf_prog_get_target_btf(prog);
> >>> + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> >>> + if (reg->btf_id != ret_type->type) {
> >>> + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> >>> + btf_type_name(reg->btf, reg->btf_id),
> >>> + btf_type_name(btf, ret_type->type));
> >>> + return -EINVAL;
> >>> + }
> >>> + }
> >>> +
> >>> for (i = 0; i < state->acquired_refs; i++) {
> >>> if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> >>> continue;
> >>> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
> >>> + continue;
> >>
> >> Is it possible to have two kptrs that are both of the returned type
> >> passed into a function?
> >>
> >
> > Just to make sure I understand the question correctly: Are you asking
> > what would happen here if a struct_ops operator has the following
> > signature?
> >
> > struct foo *xxx_ops__dummy_op(struct foo *foo_a__ref, struct foo *foo_b__ref)
>
> Right! What would happen in this case? Could one of them leak without
> being detected?
>
There will be a ref_obj_id for foo_a and another one for foo_b when we
enter the program (patch 1). Then, in the for loop in
check_reference_leak(), reg->ref_obj_id should just match one of
those, and all others will still be viewed as reference leaks.
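For example, with the hypothetical signature above (struct foo, dummy_op, and
bpf_foo_release() are all made up here just to illustrate), the verifier would
accept the first program below and still flag the second one:

void bpf_foo_release(struct foo *f) __ksym; /* hypothetical release kfunc */

SEC("struct_ops/dummy_op")
struct foo *BPF_PROG(dummy_op_ok, struct foo *foo_a, struct foo *foo_b)
{
	bpf_foo_release(foo_b);	/* release the reference we do not return */
	return foo_a;		/* this reference may leak through the return */
}

SEC("struct_ops/dummy_op")
struct foo *BPF_PROG(dummy_op_leak, struct foo *foo_a, struct foo *foo_b)
{
	/* foo_b's ref_obj_id never matches the returned register, so
	 * check_reference_leak() still reports "Unreleased reference" for it.
	 */
	return foo_a;
}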
> >
> >>
> >>> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> >>> state->refs[i].id, state->refs[i].insn_idx);
> >>> refs_lingering = true;
> >>> @@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >>> const char *exit_ctx = "At program exit";
> >>> struct tnum enforce_attach_type_range = tnum_unknown;
> >>> const struct bpf_prog *prog = env->prog;
> >>> - struct bpf_reg_state *reg;
> >>> + struct bpf_reg_state *reg = reg_state(env, regno);
> >>> struct bpf_retval_range range = retval_range(0, 1);
> >>> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> >>> int err;
> >>> struct bpf_func_state *frame = env->cur_state->frame[0];
> >>> const bool is_subprog = frame->subprogno;
> >>> + struct btf *btf = bpf_prog_get_target_btf(prog);
> >>> + bool st_ops_ret_is_kptr = false;
> >>> + const struct btf_type *t;
> >>>
> >>> /* LSM and struct_ops func-ptr's return type could be "void" */
> >>> if (!is_subprog || frame->in_exception_callback_fn) {
> >>> @@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >>> if (prog->expected_attach_type == BPF_LSM_CGROUP)
> >>> /* See below, can be 0 or 0-1 depending on hook. */
> >>> break;
> >>> - fallthrough;
> >>> + if (!prog->aux->attach_func_proto->type)
> >>> + return 0;
> >>> + break;
> >>> case BPF_PROG_TYPE_STRUCT_OPS:
> >>> if (!prog->aux->attach_func_proto->type)
> >>> return 0;
> >>> +
> >>> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> >>> + if (btf_type_is_ptr(t)) {
> >>> + /* Allow struct_ops programs to return kptr or null if
> >>> + * the return type is a pointer type.
> >>> + * check_reference_leak has ensured the returning kptr
> >>> + * matches the type of the function prototype and is
> >>> + * the only leaking reference. Thus, we can safely return
> >>> + * if the pointer is in its unmodified form
> >>> + */
> >>> + if (reg->type & PTR_TO_BTF_ID)
> >>> + return __check_ptr_off_reg(env, reg, regno, false);
> >>> + st_ops_ret_is_kptr = true;
> >>> + }
> >>> break;
> >>> default:
> >>> break;
> >>> @@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >>> return -EACCES;
> >>> }
> >>>
> >>> - reg = cur_regs(env) + regno;
> >>> -
> >>> if (frame->in_async_callback_fn) {
> >>> /* enforce return zero from async callbacks like timer */
> >>> exit_ctx = "At async callback return";
> >>> @@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> >>> case BPF_PROG_TYPE_NETFILTER:
> >>> range = retval_range(NF_DROP, NF_ACCEPT);
> >>> break;
> >>> + case BPF_PROG_TYPE_STRUCT_OPS:
> >>> + if (!st_ops_ret_is_kptr)
> >>> + return 0;
> >>> + range = retval_range(0, 0);
> >>> + break;
> >>> case BPF_PROG_TYPE_EXT:
> >>> /* freplace program can return anything as its return value
> >>> * depends on the to-be-replaced kernel func or bpf program.
* Re: [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr
2024-07-14 17:51 ` [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
2024-07-24 5:36 ` Kui-Feng Lee
@ 2024-07-24 23:57 ` Martin KaFai Lau
1 sibling, 0 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-24 23:57 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> Allow a struct_ops program to return a referenced kptr if the struct_ops
> operator has a pointer to struct as the return type. To make sure the
> returned pointer continues to be valid in the kernel, several
> constraints are required:
>
> 1) The type of the pointer must match the return type
> 2) The pointer originally comes from the kernel (not locally allocated)
> 3) The pointer is in its unmodified form
>
> In addition, since the first user, Qdisc_ops::dequeue, allows a NULL
> pointer to be returned when there is no skb to be dequeued, we will allow
> a scalar value equal to NULL to be returned.
>
> In the future when there is a struct_ops user that always expects a valid
> pointer to be returned from an operator, we may extend tagging to the
> return value. We can tell the verifier to only allow NULL pointer return
> if the return value is tagged with MAY_BE_NULL.
>
> The check is split into two parts since check_reference_leak() happens
> before check_return_code(). We first allow a reference object to leak
> through return if it is in the return register and the type matches the
> return type. Then, we check whether the pointer to-be-returned is valid in
> check_return_code().
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> kernel/bpf/verifier.c | 50 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f614ab283c37..e7f356098902 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10188,16 +10188,36 @@ record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
>
> static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
> {
> + enum bpf_prog_type type = resolve_prog_type(env->prog);
> + u32 regno = exception_exit ? BPF_REG_1 : BPF_REG_0;
hmm... Can reg_1 hold a PTR_TO_BTF_ID during bpf_throw()?
Besides, if I read how the current check_reference_leak() handles "exception_exit
== true" correctly, any leak is a leak. Does it need special handling for
struct_ops programs here when "exception_exit == true"?
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_func_state *state = cur_func(env);
> + const struct bpf_prog *prog = env->prog;
> + const struct btf_type *ret_type = NULL;
> bool refs_lingering = false;
> + struct btf *btf;
> int i;
>
> if (!exception_exit && state->frameno && !state->in_callback_fn)
> return 0;
>
> + if (type == BPF_PROG_TYPE_STRUCT_OPS &&
> + reg->type & PTR_TO_BTF_ID && reg->ref_obj_id) {
> + btf = bpf_prog_get_target_btf(prog);
> + ret_type = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (reg->btf_id != ret_type->type) {
> + verbose(env, "Return kptr type, struct %s, doesn't match function prototype, struct %s\n",
> + btf_type_name(reg->btf, reg->btf_id),
> + btf_type_name(btf, ret_type->type));
> + return -EINVAL;
> + }
> + }
> +
> for (i = 0; i < state->acquired_refs; i++) {
> if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
> continue;
> + if (ret_type && reg->ref_obj_id == state->refs[i].id)
> + continue;
> verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
> state->refs[i].id, state->refs[i].insn_idx);
> refs_lingering = true;
> @@ -15677,12 +15697,15 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> const char *exit_ctx = "At program exit";
> struct tnum enforce_attach_type_range = tnum_unknown;
> const struct bpf_prog *prog = env->prog;
> - struct bpf_reg_state *reg;
> + struct bpf_reg_state *reg = reg_state(env, regno);
> struct bpf_retval_range range = retval_range(0, 1);
> enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> int err;
> struct bpf_func_state *frame = env->cur_state->frame[0];
> const bool is_subprog = frame->subprogno;
> + struct btf *btf = bpf_prog_get_target_btf(prog);
> + bool st_ops_ret_is_kptr = false;
> + const struct btf_type *t;
>
> /* LSM and struct_ops func-ptr's return type could be "void" */
> if (!is_subprog || frame->in_exception_callback_fn) {
> @@ -15691,10 +15714,26 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> if (prog->expected_attach_type == BPF_LSM_CGROUP)
> /* See below, can be 0 or 0-1 depending on hook. */
> break;
> - fallthrough;
> + if (!prog->aux->attach_func_proto->type)
> + return 0;
> + break;
> case BPF_PROG_TYPE_STRUCT_OPS:
> if (!prog->aux->attach_func_proto->type)
> return 0;
> +
> + t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
> + if (btf_type_is_ptr(t)) {
> + /* Allow struct_ops programs to return kptr or null if
> + * the return type is a pointer type.
> + * check_reference_leak has ensured the returning kptr
> + * matches the type of the function prototype and is
It needs to ensure reg->ref_obj_id != 0 for a non-NULL pointer as well. Then it can
rely on check_reference_leak() for the type checking. I think
reg->ref_obj_id needs to be checked here anyway because the prog should not
return a non-refcounted PTR_TO_BTF_ID ptr.
It may be more straightforward (?) to move the type checking from
check_reference_leak() to check_return_code() here. Leave
check_reference_leak() to check for leaks and check_return_code() to check
the return value/ptr type.
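Something roughly like this, perhaps (a completely untested sketch of that idea,
under the BPF_PROG_TYPE_STRUCT_OPS case in check_return_code(); the verbose()
messages are placeholders):

	case BPF_PROG_TYPE_STRUCT_OPS:
		if (!prog->aux->attach_func_proto->type)
			return 0;
		t = btf_type_by_id(btf, prog->aux->attach_func_proto->type);
		if (btf_type_is_ptr(t)) {
			if (reg->type & PTR_TO_BTF_ID) {
				/* must be a referenced kptr, not just any trusted ptr */
				if (!reg->ref_obj_id) {
					verbose(env, "struct_ops must return a referenced kptr or NULL\n");
					return -EINVAL;
				}
				/* type check moved here from check_reference_leak() */
				if (reg->btf_id != t->type) {
					verbose(env, "returned kptr type doesn't match the op's return type\n");
					return -EINVAL;
				}
				return __check_ptr_off_reg(env, reg, regno, false);
			}
			/* otherwise only a scalar 0 (NULL) is acceptable below */
		}
		break;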
Another thing is...
> + * the only leaking reference. Thus, we can safely return
> + * if the pointer is in its unmodified form
> + */
> + if (reg->type & PTR_TO_BTF_ID)
> + return __check_ptr_off_reg(env, reg, regno, false);
> + st_ops_ret_is_kptr = true;
> + }
> break;
> default:
> break;
> @@ -15716,8 +15755,6 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> return -EACCES;
> }
>
> - reg = cur_regs(env) + regno;
> -
> if (frame->in_async_callback_fn) {
> /* enforce return zero from async callbacks like timer */
> exit_ctx = "At async callback return";
> @@ -15804,6 +15841,11 @@ static int check_return_code(struct bpf_verifier_env *env, int regno, const char
> case BPF_PROG_TYPE_NETFILTER:
> range = retval_range(NF_DROP, NF_ACCEPT);
> break;
> + case BPF_PROG_TYPE_STRUCT_OPS:
> + if (!st_ops_ret_is_kptr)
... can the changes added earlier in this function be done together here instead
of being glued in with "st_ops_ret_is_kptr"?
> + return 0;
> + range = retval_range(0, 0);
> + break;
> case BPF_PROG_TYPE_EXT:
> /* freplace program can return anything as its return value
> * depends on the to-be-replaced kernel func or bpf program.
* [RFC PATCH v9 04/11] selftests/bpf: Test returning referenced kptr from struct_ops programs
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (2 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 03/11] bpf: Allow struct_ops prog to return referenced kptr Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
` (9 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Test struct_ops programs returning a referenced kptr. When the return type
of a struct_ops operator is a pointer to struct, the verifier should
only allow programs that return a scalar NULL or a non-local kptr of the
correct type in its unmodified form.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/bpf_testmod/bpf_testmod.c | 8 ++
.../selftests/bpf/bpf_testmod/bpf_testmod.h | 4 +
.../prog_tests/test_struct_ops_kptr_return.c | 87 +++++++++++++++++++
.../bpf/progs/struct_ops_kptr_return.c | 29 +++++++
...uct_ops_kptr_return_fail__invalid_scalar.c | 24 +++++
.../struct_ops_kptr_return_fail__local_kptr.c | 30 +++++++
...uct_ops_kptr_return_fail__nonzero_offset.c | 23 +++++
.../struct_ops_kptr_return_fail__wrong_type.c | 28 ++++++
8 files changed, 233 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 316a4c3d3a88..c90bb3a5e86a 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -922,11 +922,19 @@ static int bpf_testmod_ops__test_refcounted(int dummy,
return 0;
}
+static struct task_struct *
+bpf_testmod_ops__test_return_ref_kptr(int dummy, struct task_struct *task__ref,
+ struct cgroup *cgrp)
+{
+ return NULL;
+}
+
static struct bpf_testmod_ops __bpf_testmod_ops = {
.test_1 = bpf_testmod_test_1,
.test_2 = bpf_testmod_test_2,
.test_maybe_null = bpf_testmod_ops__test_maybe_null,
.test_refcounted = bpf_testmod_ops__test_refcounted,
+ .test_return_ref_kptr = bpf_testmod_ops__test_return_ref_kptr,
};
struct bpf_struct_ops bpf_bpf_testmod_ops = {
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index bfef5f382d01..2289ecd38401 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
struct task_struct;
+struct cgroup;
struct bpf_testmod_test_read_ctx {
char *buf;
@@ -37,6 +38,9 @@ struct bpf_testmod_ops {
int (*test_maybe_null)(int dummy, struct task_struct *task);
/* Used to test ref_acquired arguments. */
int (*test_refcounted)(int dummy, struct task_struct *task);
+ /* Used to test returning referenced kptr. */
+ struct task_struct *(*test_return_ref_kptr)(int dummy, struct task_struct *task,
+ struct cgroup *cgrp);
/* The following fields are used to test shadow copies. */
char onebyte;
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
new file mode 100644
index 000000000000..bc2fac39215a
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_kptr_return.c
@@ -0,0 +1,87 @@
+#include <test_progs.h>
+
+#include "struct_ops_kptr_return.skel.h"
+#include "struct_ops_kptr_return_fail__wrong_type.skel.h"
+#include "struct_ops_kptr_return_fail__invalid_scalar.skel.h"
+#include "struct_ops_kptr_return_fail__nonzero_offset.skel.h"
+#include "struct_ops_kptr_return_fail__local_kptr.skel.h"
+
+/* Test that the verifier accepts a program that acquires a referenced
+ * kptr and releases the reference through return
+ */
+static void kptr_return(void)
+{
+ struct struct_ops_kptr_return *skel;
+
+ skel = struct_ops_kptr_return__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "struct_ops_module_open_and_load"))
+ return;
+
+ struct_ops_kptr_return__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a kptr of the
+ * wrong type
+ */
+static void kptr_return_fail__wrong_type(void)
+{
+ struct struct_ops_kptr_return_fail__wrong_type *skel;
+
+ skel = struct_ops_kptr_return_fail__wrong_type__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__wrong_type__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__wrong_type__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns a non-null scalar */
+static void kptr_return_fail__invalid_scalar(void)
+{
+ struct struct_ops_kptr_return_fail__invalid_scalar *skel;
+
+ skel = struct_ops_kptr_return_fail__invalid_scalar__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__invalid_scalar__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__invalid_scalar__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns kptr with non-zero offset */
+static void kptr_return_fail__nonzero_offset(void)
+{
+ struct struct_ops_kptr_return_fail__nonzero_offset *skel;
+
+ skel = struct_ops_kptr_return_fail__nonzero_offset__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__nonzero_offset__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__nonzero_offset__destroy(skel);
+}
+
+/* Test that the verifier rejects a program that returns local kptr */
+static void kptr_return_fail__local_kptr(void)
+{
+ struct struct_ops_kptr_return_fail__local_kptr *skel;
+
+ skel = struct_ops_kptr_return_fail__local_kptr__open_and_load();
+ if (ASSERT_ERR_PTR(skel, "struct_ops_module_fail__local_kptr__open_and_load"))
+ return;
+
+ struct_ops_kptr_return_fail__local_kptr__destroy(skel);
+}
+
+void test_struct_ops_kptr_return(void)
+{
+ if (test__start_subtest("kptr_return"))
+ kptr_return();
+ if (test__start_subtest("kptr_return_fail__wrong_type"))
+ kptr_return_fail__wrong_type();
+ if (test__start_subtest("kptr_return_fail__invalid_scalar"))
+ kptr_return_fail__invalid_scalar();
+ if (test__start_subtest("kptr_return_fail__nonzero_offset"))
+ kptr_return_fail__nonzero_offset();
+ if (test__start_subtest("kptr_return_fail__local_kptr"))
+ kptr_return_fail__local_kptr();
+}
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
new file mode 100644
index 000000000000..29b7719cd4c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return.c
@@ -0,0 +1,29 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * allow a referenced kptr or a NULL pointer to be returned. A referenced kptr to task
+ * here is acquired automatically as the task argument is tagged with "__ref".
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ if (dummy % 2) {
+ bpf_task_release(task);
+ return NULL;
+ }
+ return task;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
+
+
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
new file mode 100644
index 000000000000..d67982ba8224
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__invalid_scalar.c
@@ -0,0 +1,24 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a non-zero scalar value.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ bpf_task_release(task);
+ return (struct task_struct *)1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
new file mode 100644
index 000000000000..9a4247432539
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__local_kptr.c
@@ -0,0 +1,30 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a local kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ struct task_struct *t;
+
+ t = bpf_obj_new(typeof(*task));
+ if (!t)
+ return task;
+
+ return t;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
new file mode 100644
index 000000000000..5bb0b4029d11
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__nonzero_offset.c
@@ -0,0 +1,23 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a modified referenced kptr.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ return (struct task_struct *)&task->jobctl;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
new file mode 100644
index 000000000000..32365cb7af49
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_kptr_return_fail__wrong_type.c
@@ -0,0 +1,28 @@
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include "../bpf_testmod/bpf_testmod.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_acquire(struct cgroup *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+/* This tests struct_ops BPF programs returning a referenced kptr. The verifier should
+ * reject programs returning a referenced kptr of the wrong type.
+ */
+SEC("struct_ops/test_return_ref_kptr")
+struct task_struct *BPF_PROG(test_return_ref_kptr, int dummy,
+ struct task_struct *task, struct cgroup *cgrp)
+{
+ struct task_struct *ret;
+
+ ret = (struct task_struct *)bpf_cgroup_acquire(cgrp);
+ bpf_task_release(task);
+
+ return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_kptr_return = {
+ .test_return_ref_kptr = (void *)test_return_ref_kptr,
+};
--
2.20.1
* [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (3 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 04/11] selftests/bpf: Test returning referenced kptr from struct_ops programs Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-18 0:00 ` Amery Hung
2024-07-25 21:24 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
` (8 subsequent siblings)
13 siblings, 2 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Enable users to implement a classless qdisc using bpf. The last few
patches in this series have prepared struct_ops to support core operators
in Qdisc_ops. Recent advancements in bpf, such as allocated
objects, bpf list, and bpf rbtree, have also provided powerful and flexible
building blocks to realize sophisticated scheduling algorithms. Therefore,
in this patch, we start allowing qdiscs to be implemented using bpf
struct_ops. Users can implement .enqueue and .dequeue of Qdisc_ops in bpf
and register the qdisc dynamically into the kernel.
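As a rough illustration only (not part of this patch: the names below are made
up, and the placeholder bodies would still need the kfuncs added later in this
series to actually queue, drop, or hand back skbs), a bpf qdisc skeleton on the
BPF side could look like:

#include <vmlinux.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("struct_ops/bpf_fifo_enqueue")
int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	/* placeholder: a real program must queue or drop (release) the skb here */
	return 0; /* NET_XMIT_SUCCESS */
}

SEC("struct_ops/bpf_fifo_dequeue")
struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
{
	/* placeholder: return a previously queued skb, or NULL when empty */
	return NULL;
}

SEC(".struct_ops.link")
struct Qdisc_ops fifo = {
	.enqueue = (void *)bpf_fifo_enqueue,
	.dequeue = (void *)bpf_fifo_dequeue,
	.id      = "bpf_fifo",
};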
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/btf.h | 1 +
include/net/sch_generic.h | 1 +
kernel/bpf/btf.c | 2 +-
net/sched/Makefile | 4 +
net/sched/bpf_qdisc.c | 352 ++++++++++++++++++++++++++++++++++++++
net/sched/sch_api.c | 7 +-
net/sched/sch_generic.c | 3 +-
7 files changed, 365 insertions(+), 5 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
diff --git a/include/linux/btf.h b/include/linux/btf.h
index cffb43133c68..730ec304f787 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -562,6 +562,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
const char *btf_str_by_offset(const struct btf *btf, u32 offset);
struct btf *btf_parse_vmlinux(void);
struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
const struct bpf_prog *prog);
u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 79edd5b5e3c9..214ed2e34faa 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -95,6 +95,7 @@ struct Qdisc {
#define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
#define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
#define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_BPF 0x400 /* BPF qdisc */
u32 limit;
const struct Qdisc_ops *ops;
struct qdisc_size_table __rcu *stab;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 52be35b30308..059bcc365f10 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6314,7 +6314,7 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
return btf_type_is_int(t);
}
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
int off)
{
const struct btf_param *args;
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..2094e6e74158 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -63,6 +63,10 @@ obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+ifeq ($(CONFIG_BPF_JIT),y)
+obj-$(CONFIG_BPF_SYSCALL) += bpf_qdisc.o
+endif
+
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
obj-$(CONFIG_NET_CLS_FW) += cls_fw.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..a68fc115d8f8
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,352 @@
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+static u32 unsupported_ops[] = {
+ offsetof(struct Qdisc_ops, init),
+ offsetof(struct Qdisc_ops, reset),
+ offsetof(struct Qdisc_ops, destroy),
+ offsetof(struct Qdisc_ops, change),
+ offsetof(struct Qdisc_ops, attach),
+ offsetof(struct Qdisc_ops, change_real_num_tx),
+ offsetof(struct Qdisc_ops, dump),
+ offsetof(struct Qdisc_ops, dump_stats),
+ offsetof(struct Qdisc_ops, ingress_block_set),
+ offsetof(struct Qdisc_ops, egress_block_set),
+ offsetof(struct Qdisc_ops, ingress_block_get),
+ offsetof(struct Qdisc_ops, egress_block_get),
+};
+
+struct bpf_sched_data {
+ struct qdisc_watchdog watchdog;
+};
+
+struct bpf_sk_buff_ptr {
+ struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_init(&q->watchdog, sch);
+ return 0;
+}
+
+static void bpf_qdisc_reset_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static const struct bpf_func_proto *
+bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
+ const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ default:
+ return bpf_base_func_proto(func_id, prog);
+ }
+}
+
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ struct btf *btf = prog->aux->attach_btf;
+ u32 arg;
+
+ arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+ if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+ if (arg == 2) {
+ info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+ info->btf = btf;
+ info->btf_id = bpf_sk_buff_ptr_ids[0];
+ return true;
+ }
+ }
+
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ const struct btf_type *t, *skbt;
+ size_t end;
+
+ skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t != skbt) {
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+ }
+
+ switch (off) {
+ case offsetof(struct sk_buff, tstamp):
+ end = offsetofend(struct sk_buff, tstamp);
+ break;
+ case offsetof(struct sk_buff, priority):
+ end = offsetofend(struct sk_buff, priority);
+ break;
+ case offsetof(struct sk_buff, mark):
+ end = offsetofend(struct sk_buff, mark);
+ break;
+ case offsetof(struct sk_buff, queue_mapping):
+ end = offsetofend(struct sk_buff, queue_mapping);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, tc_classid);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+ offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+ data[QDISC_CB_PRIV_LEN - 1]):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+ break;
+ case offsetof(struct sk_buff, tc_index):
+ end = offsetofend(struct sk_buff, tc_index);
+ break;
+ default:
+ bpf_log(log, "no write support to sk_buff at off %d\n", off);
+ return -EACCES;
+ }
+
+ if (off + size > end) {
+ bpf_log(log,
+ "write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+ off, size, end);
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+ .get_func_proto = bpf_qdisc_get_func_proto,
+ .is_valid_access = bpf_qdisc_is_valid_access,
+ .btf_struct_access = bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct Qdisc_ops *uqdisc_ops;
+ struct Qdisc_ops *qdisc_ops;
+ u32 moff;
+
+ uqdisc_ops = (const struct Qdisc_ops *)udata;
+ qdisc_ops = (struct Qdisc_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct Qdisc_ops, priv_size):
+ if (uqdisc_ops->priv_size)
+ return -EINVAL;
+ qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
+ return 1;
+ case offsetof(struct Qdisc_ops, static_flags):
+ if (uqdisc_ops->static_flags)
+ return -EINVAL;
+ qdisc_ops->static_flags = TCQ_F_BPF;
+ return 1;
+ case offsetof(struct Qdisc_ops, init):
+ qdisc_ops->init = bpf_qdisc_init_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, reset):
+ qdisc_ops->reset = bpf_qdisc_reset_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, destroy):
+ qdisc_ops->destroy = bpf_qdisc_destroy_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, peek):
+ if (!uqdisc_ops->peek)
+ qdisc_ops->peek = qdisc_peek_dequeued;
+ return 1;
+ case offsetof(struct Qdisc_ops, id):
+ if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+ sizeof(qdisc_ops->id)) <= 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool is_unsupported(u32 member_offset)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
+ if (member_offset == unsupported_ops[i])
+ return true;
+ }
+
+ return false;
+}
+
+static int bpf_qdisc_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
+ return -ENOTSUPP;
+ return 0;
+}
+
+static int bpf_qdisc_validate(void *kdata)
+{
+ return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
+{
+ return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
+{
+ return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__attach(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change_tx_queue_len(struct Qdisc *sch, unsigned int new_len)
+{
+ return 0;
+}
+
+static void Qdisc_ops__change_real_num_tx(struct Qdisc *sch, unsigned int new_real_tx)
+{
+}
+
+static int Qdisc_ops__dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ return 0;
+}
+
+static int Qdisc_ops__dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ return 0;
+}
+
+static void Qdisc_ops__ingress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static void Qdisc_ops__egress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static u32 Qdisc_ops__ingress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static u32 Qdisc_ops__egress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+ .enqueue = Qdisc_ops__enqueue,
+ .dequeue = Qdisc_ops__dequeue,
+ .peek = Qdisc_ops__peek,
+ .init = Qdisc_ops__init,
+ .reset = Qdisc_ops__reset,
+ .destroy = Qdisc_ops__destroy,
+ .change = Qdisc_ops__change,
+ .attach = Qdisc_ops__attach,
+ .change_tx_queue_len = Qdisc_ops__change_tx_queue_len,
+ .change_real_num_tx = Qdisc_ops__change_real_num_tx,
+ .dump = Qdisc_ops__dump,
+ .dump_stats = Qdisc_ops__dump_stats,
+ .ingress_block_set = Qdisc_ops__ingress_block_set,
+ .egress_block_set = Qdisc_ops__egress_block_set,
+ .ingress_block_get = Qdisc_ops__ingress_block_get,
+ .egress_block_get = Qdisc_ops__egress_block_get,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+ .verifier_ops = &bpf_qdisc_verifier_ops,
+ .reg = bpf_qdisc_reg,
+ .unreg = bpf_qdisc_unreg,
+ .check_member = bpf_qdisc_check_member,
+ .init_member = bpf_qdisc_init_member,
+ .init = bpf_qdisc_init,
+ .validate = bpf_qdisc_validate,
+ .name = "Qdisc_ops",
+ .cfi_stubs = &__bpf_ops_qdisc_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+ return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 74afc210527d..5064b6d2d1ec 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
#include <linux/hrtimer.h>
#include <linux/slab.h>
#include <linux/hashtable.h>
+#include <linux/bpf.h>
#include <net/net_namespace.h>
#include <net/sock.h>
@@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
read_lock(&qdisc_mod_lock);
for (q = qdisc_base; q; q = q->next) {
if (nla_strcmp(kind, q->id) == 0) {
- if (!try_module_get(q->owner))
+ if (!bpf_try_module_get(q, q->owner))
q = NULL;
break;
}
@@ -1282,7 +1283,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
/* We will try again qdisc_lookup_ops,
* so don't keep a reference.
*/
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err = -EAGAIN;
goto err_out;
}
@@ -1393,7 +1394,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
netdev_put(dev, &sch->dev_tracker);
qdisc_free(sch);
err_out2:
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err_out:
*errp = err;
return NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2af24547a82c..76e4a6efd17c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
#include <linux/if_vlan.h>
#include <linux/skb_array.h>
#include <linux/if_macvlan.h>
+#include <linux/bpf.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>
#include <net/dst.h>
@@ -1077,7 +1078,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
ops->destroy(qdisc);
lockdep_unregister_key(&qdisc->root_lock_key);
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
netdev_put(dev, &qdisc->dev_tracker);
trace_qdisc_destroy(qdisc);
--
2.20.1
* [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2024-07-14 17:51 ` [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-07-18 0:00 ` Amery Hung
2024-07-25 21:24 ` Martin KaFai Lau
1 sibling, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-18 0:00 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Amery Hung <ameryhung@gmail.com>
Enable users to implement a classless qdisc using bpf. The last few
patches in this series have prepared struct_ops to support core operators
in Qdisc_ops. Recent advancements in bpf, such as allocated
objects, bpf list, and bpf rbtree, have also provided powerful and flexible
building blocks to realize sophisticated scheduling algorithms. Therefore,
in this patch, we start allowing qdiscs to be implemented using bpf
struct_ops. Users can implement .enqueue and .dequeue of Qdisc_ops in bpf
and register the qdisc dynamically into the kernel.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/linux/btf.h | 1 +
include/net/sch_generic.h | 1 +
kernel/bpf/btf.c | 2 +-
net/sched/Kconfig | 14 ++
net/sched/Makefile | 1 +
net/sched/bpf_qdisc.c | 352 ++++++++++++++++++++++++++++++++++++++
net/sched/sch_api.c | 7 +-
net/sched/sch_generic.c | 3 +-
8 files changed, 376 insertions(+), 5 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
diff --git a/include/linux/btf.h b/include/linux/btf.h
index cffb43133c68..730ec304f787 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -562,6 +562,7 @@ const char *btf_name_by_offset(const struct btf *btf, u32 offset);
const char *btf_str_by_offset(const struct btf *btf, u32 offset);
struct btf *btf_parse_vmlinux(void);
struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
const struct bpf_prog *prog);
u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 79edd5b5e3c9..214ed2e34faa 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -95,6 +95,7 @@ struct Qdisc {
#define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
#define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
#define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_BPF 0x400 /* BPF qdisc */
u32 limit;
const struct Qdisc_ops *ops;
struct qdisc_size_table __rcu *stab;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b188f51c6ce9..2e3ded4de2ea 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6314,7 +6314,7 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
return btf_type_is_int(t);
}
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
int off)
{
const struct btf_param *args;
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..8bfe063a851d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,20 @@ config NET_SCH_ETS
If unsure, say N.
+config NET_SCH_BPF
+ tristate "BPF-based Qdisc"
+ depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+ help
+	  This option allows a queueing discipline to be implemented using BPF
+ struct_ops.
+
+ Say Y here if you want to use BPF-based packet Qdisc.
+
+ To compile this code as a module, choose M here: the module
+ will be called sch_bpf.
+
+ If unsure, say N.
+
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..904d784902d1 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o
obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF) += bpf_qdisc.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..a68fc115d8f8
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,352 @@
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+static u32 unsupported_ops[] = {
+ offsetof(struct Qdisc_ops, init),
+ offsetof(struct Qdisc_ops, reset),
+ offsetof(struct Qdisc_ops, destroy),
+ offsetof(struct Qdisc_ops, change),
+ offsetof(struct Qdisc_ops, attach),
+ offsetof(struct Qdisc_ops, change_real_num_tx),
+ offsetof(struct Qdisc_ops, dump),
+ offsetof(struct Qdisc_ops, dump_stats),
+ offsetof(struct Qdisc_ops, ingress_block_set),
+ offsetof(struct Qdisc_ops, egress_block_set),
+ offsetof(struct Qdisc_ops, ingress_block_get),
+ offsetof(struct Qdisc_ops, egress_block_get),
+};
+
+struct bpf_sched_data {
+ struct qdisc_watchdog watchdog;
+};
+
+struct bpf_sk_buff_ptr {
+ struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_init(&q->watchdog, sch);
+ return 0;
+}
+
+static void bpf_qdisc_reset_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static const struct bpf_func_proto *
+bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
+ const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ default:
+ return bpf_base_func_proto(func_id, prog);
+ }
+}
+
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ struct btf *btf = prog->aux->attach_btf;
+ u32 arg;
+
+ arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+ if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
+ if (arg == 2) {
+ info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+ info->btf = btf;
+ info->btf_id = bpf_sk_buff_ptr_ids[0];
+ return true;
+ }
+ }
+
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ const struct btf_type *t, *skbt;
+ size_t end;
+
+ skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (t != skbt) {
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+ }
+
+ switch (off) {
+ case offsetof(struct sk_buff, tstamp):
+ end = offsetofend(struct sk_buff, tstamp);
+ break;
+ case offsetof(struct sk_buff, priority):
+ end = offsetofend(struct sk_buff, priority);
+ break;
+ case offsetof(struct sk_buff, mark):
+ end = offsetofend(struct sk_buff, mark);
+ break;
+ case offsetof(struct sk_buff, queue_mapping):
+ end = offsetofend(struct sk_buff, queue_mapping);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, tc_classid);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+ offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+ data[QDISC_CB_PRIV_LEN - 1]):
+ end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+ break;
+ case offsetof(struct sk_buff, tc_index):
+ end = offsetofend(struct sk_buff, tc_index);
+ break;
+ default:
+ bpf_log(log, "no write support to sk_buff at off %d\n", off);
+ return -EACCES;
+ }
+
+ if (off + size > end) {
+ bpf_log(log,
+ "write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+ off, size, end);
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+ .get_func_proto = bpf_qdisc_get_func_proto,
+ .is_valid_access = bpf_qdisc_is_valid_access,
+ .btf_struct_access = bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct Qdisc_ops *uqdisc_ops;
+ struct Qdisc_ops *qdisc_ops;
+ u32 moff;
+
+ uqdisc_ops = (const struct Qdisc_ops *)udata;
+ qdisc_ops = (struct Qdisc_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct Qdisc_ops, priv_size):
+ if (uqdisc_ops->priv_size)
+ return -EINVAL;
+ qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
+ return 1;
+ case offsetof(struct Qdisc_ops, static_flags):
+ if (uqdisc_ops->static_flags)
+ return -EINVAL;
+ qdisc_ops->static_flags = TCQ_F_BPF;
+ return 1;
+ case offsetof(struct Qdisc_ops, init):
+ qdisc_ops->init = bpf_qdisc_init_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, reset):
+ qdisc_ops->reset = bpf_qdisc_reset_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, destroy):
+ qdisc_ops->destroy = bpf_qdisc_destroy_op;
+ return 1;
+ case offsetof(struct Qdisc_ops, peek):
+ if (!uqdisc_ops->peek)
+ qdisc_ops->peek = qdisc_peek_dequeued;
+ return 1;
+ case offsetof(struct Qdisc_ops, id):
+ if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+ sizeof(qdisc_ops->id)) <= 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool is_unsupported(u32 member_offset)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
+ if (member_offset == unsupported_ops[i])
+ return true;
+ }
+
+ return false;
+}
+
+static int bpf_qdisc_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
+ return -ENOTSUPP;
+ return 0;
+}
+
+static int bpf_qdisc_validate(void *kdata)
+{
+ return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
+{
+ return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
+{
+ return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static struct sk_buff *Qdisc_ops__peek(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__attach(struct Qdisc *sch)
+{
+}
+
+static int Qdisc_ops__change_tx_queue_len(struct Qdisc *sch, unsigned int new_len)
+{
+ return 0;
+}
+
+static void Qdisc_ops__change_real_num_tx(struct Qdisc *sch, unsigned int new_real_tx)
+{
+}
+
+static int Qdisc_ops__dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ return 0;
+}
+
+static int Qdisc_ops__dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ return 0;
+}
+
+static void Qdisc_ops__ingress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static void Qdisc_ops__egress_block_set(struct Qdisc *sch, u32 block_index)
+{
+}
+
+static u32 Qdisc_ops__ingress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static u32 Qdisc_ops__egress_block_get(struct Qdisc *sch)
+{
+ return 0;
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+ .enqueue = Qdisc_ops__enqueue,
+ .dequeue = Qdisc_ops__dequeue,
+ .peek = Qdisc_ops__peek,
+ .init = Qdisc_ops__init,
+ .reset = Qdisc_ops__reset,
+ .destroy = Qdisc_ops__destroy,
+ .change = Qdisc_ops__change,
+ .attach = Qdisc_ops__attach,
+ .change_tx_queue_len = Qdisc_ops__change_tx_queue_len,
+ .change_real_num_tx = Qdisc_ops__change_real_num_tx,
+ .dump = Qdisc_ops__dump,
+ .dump_stats = Qdisc_ops__dump_stats,
+ .ingress_block_set = Qdisc_ops__ingress_block_set,
+ .egress_block_set = Qdisc_ops__egress_block_set,
+ .ingress_block_get = Qdisc_ops__ingress_block_get,
+ .egress_block_get = Qdisc_ops__egress_block_get,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+ .verifier_ops = &bpf_qdisc_verifier_ops,
+ .reg = bpf_qdisc_reg,
+ .unreg = bpf_qdisc_unreg,
+ .check_member = bpf_qdisc_check_member,
+ .init_member = bpf_qdisc_init_member,
+ .init = bpf_qdisc_init,
+ .validate = bpf_qdisc_validate,
+ .name = "Qdisc_ops",
+ .cfi_stubs = &__bpf_ops_qdisc_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+ return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 74afc210527d..5064b6d2d1ec 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
#include <linux/hrtimer.h>
#include <linux/slab.h>
#include <linux/hashtable.h>
+#include <linux/bpf.h>
#include <net/net_namespace.h>
#include <net/sock.h>
@@ -358,7 +359,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
read_lock(&qdisc_mod_lock);
for (q = qdisc_base; q; q = q->next) {
if (nla_strcmp(kind, q->id) == 0) {
- if (!try_module_get(q->owner))
+ if (!bpf_try_module_get(q, q->owner))
q = NULL;
break;
}
@@ -1282,7 +1283,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
/* We will try again qdisc_lookup_ops,
* so don't keep a reference.
*/
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err = -EAGAIN;
goto err_out;
}
@@ -1393,7 +1394,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
netdev_put(dev, &sch->dev_tracker);
qdisc_free(sch);
err_out2:
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err_out:
*errp = err;
return NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2af24547a82c..76e4a6efd17c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
#include <linux/if_vlan.h>
#include <linux/skb_array.h>
#include <linux/if_macvlan.h>
+#include <linux/bpf.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>
#include <net/dst.h>
@@ -1077,7 +1078,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
ops->destroy(qdisc);
lockdep_unregister_key(&qdisc->root_lock_key);
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
netdev_put(dev, &qdisc->dev_tracker);
trace_qdisc_destroy(qdisc);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2024-07-14 17:51 ` [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
2024-07-18 0:00 ` Amery Hung
@ 2024-07-25 21:24 ` Martin KaFai Lau
2024-07-31 4:09 ` Amery Hung
1 sibling, 1 reply; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-25 21:24 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> +static const struct bpf_func_proto *
> +bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
> + const struct bpf_prog *prog)
> +{
> + switch (func_id) {
Instead of an empty switch, it should be useful to provide the skb->data related
helper. It can start with read only dynptr first, the BPF_FUNC_dynptr_read
helper here.
Also, the kfuncs: bpf_dynptr_slice and bpf_dynptr_from_skb_rdonly.
> + default:
> + return bpf_base_func_proto(func_id, prog);
[ ... ]
> + }
> +}
> +
> +BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
> +BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
> +
> +static bool bpf_qdisc_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + struct btf *btf = prog->aux->attach_btf;
> + u32 arg;
> +
> + arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
> + if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
> + if (arg == 2) {
> + info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> + info->btf = btf;
> + info->btf_id = bpf_sk_buff_ptr_ids[0];
> + return true;
This will allow type == BPF_WRITE to ctx which should be rejected. The below
bpf_tracing_btf_ctx_access() could have rejected it.
> + }
> + }
> +
> + return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
[ ... ]
> +
> +static bool is_unsupported(u32 member_offset)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
> + if (member_offset == unsupported_ops[i])
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static int bpf_qdisc_check_member(const struct btf_type *t,
> + const struct btf_member *member,
> + const struct bpf_prog *prog)
> +{
> + if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
Note that the ".check_member" and the "is_unsupported" can be removed as you
also noticed on the recent unsupported ops cleanup patches.
> + return -ENOTSUPP;
> + return 0;
> +}
[ ... ]
> +static struct Qdisc_ops __bpf_ops_qdisc_ops = {
> + .enqueue = Qdisc_ops__enqueue,
> + .dequeue = Qdisc_ops__dequeue,
> + .peek = Qdisc_ops__peek,
> + .init = Qdisc_ops__init,
> + .reset = Qdisc_ops__reset,
> + .destroy = Qdisc_ops__destroy,
> + .change = Qdisc_ops__change,
> + .attach = Qdisc_ops__attach,
> + .change_tx_queue_len = Qdisc_ops__change_tx_queue_len,
> + .change_real_num_tx = Qdisc_ops__change_real_num_tx,
> + .dump = Qdisc_ops__dump,
> + .dump_stats = Qdisc_ops__dump_stats,
Similar to the above is_unsupported comment. The unsupported ops should be
removed from the cfi_stubs.
> + .ingress_block_set = Qdisc_ops__ingress_block_set,
> + .egress_block_set = Qdisc_ops__egress_block_set,
> + .ingress_block_get = Qdisc_ops__ingress_block_get,
> + .egress_block_get = Qdisc_ops__egress_block_get,
> +};
> +
> +static struct bpf_struct_ops bpf_Qdisc_ops = {
> + .verifier_ops = &bpf_qdisc_verifier_ops,
> + .reg = bpf_qdisc_reg,
> + .unreg = bpf_qdisc_unreg,
> + .check_member = bpf_qdisc_check_member,
> + .init_member = bpf_qdisc_init_member,
> + .init = bpf_qdisc_init,
> + .validate = bpf_qdisc_validate,
".validate" is optional. The empty "bpf_qdisc_validate" can be removed.
> + .name = "Qdisc_ops",
> + .cfi_stubs = &__bpf_ops_qdisc_ops,
> + .owner = THIS_MODULE,
> +};
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2024-07-25 21:24 ` Martin KaFai Lau
@ 2024-07-31 4:09 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-31 4:09 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On Thu, Jul 25, 2024 at 2:25 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/14/24 10:51 AM, Amery Hung wrote:
> > +static const struct bpf_func_proto *
> > +bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
> > + const struct bpf_prog *prog)
> > +{
> > + switch (func_id) {
>
> Instead of an empty switch, it should be useful to provide the skb->data related
> helper. It can start with read only dynptr first, the BPF_FUNC_dynptr_read
> helper here.
>
> Also, the kfuncs: bpf_dynptr_slice and bpf_dynptr_from_skb_rdonly.
>
I will add the helper and kfuncs and try them out.
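For reference, a rough sketch of how an enqueue program could read packet
headers through those kfuncs once they are registered for qdisc struct_ops
programs. The extern declarations, the cast below and the drop kfunc from
patch 6 are assumptions for illustration; license and struct_ops map
boilerplate are omitted:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* from include/linux/netdevice.h */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01

extern int bpf_dynptr_from_skb(struct __sk_buff *skb, u64 flags,
			       struct bpf_dynptr *ptr__uninit) __ksym;
extern void *bpf_dynptr_slice(const struct bpf_dynptr *p, u32 offset,
			      void *buffer, u32 buffer__szk) __ksym;
extern void bpf_qdisc_skb_drop(struct sk_buff *skb,
			       struct bpf_sk_buff_ptr *to_free) __ksym;

SEC("struct_ops")
int BPF_PROG(sketch_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	struct ethhdr eth_buf, *eth;
	struct bpf_dynptr ptr;

	if (bpf_dynptr_from_skb((struct __sk_buff *)skb, 0, &ptr))
		goto drop;

	/* read-only view of the start of the linear data */
	eth = bpf_dynptr_slice(&ptr, 0, &eth_buf, sizeof(eth_buf));
	if (!eth)
		goto drop;

	/* ... classify on eth->h_proto etc., then enqueue the skb ... */
	return NET_XMIT_SUCCESS;
drop:
	bpf_qdisc_skb_drop(skb, to_free);
	return NET_XMIT_DROP;
}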
> > + default:
> > + return bpf_base_func_proto(func_id, prog);
>
> [ ... ]
>
> > + }
> > +}
> > +
> > +BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
> > +BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
> > +
> > +static bool bpf_qdisc_is_valid_access(int off, int size,
> > + enum bpf_access_type type,
> > + const struct bpf_prog *prog,
> > + struct bpf_insn_access_aux *info)
> > +{
> > + struct btf *btf = prog->aux->attach_btf;
> > + u32 arg;
> > +
> > + arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
> > + if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
> > + if (arg == 2) {
> > + info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> > + info->btf = btf;
> > + info->btf_id = bpf_sk_buff_ptr_ids[0];
> > + return true;
>
> This will allow type == BPF_WRITE to ctx which should be rejected. The below
> bpf_tracing_btf_ctx_access() could have rejected it.
>
Right. I will check the access type of the "to_free" argument in .enqueue.
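A possible shape for that check in bpf_qdisc_is_valid_access() (a sketch
only, not the final code):

	arg = get_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
	if (!strcmp(prog->aux->attach_func_name, "enqueue")) {
		if (arg == 2) {
			/* the "to_free" argument must never be written directly */
			if (type != BPF_READ)
				return false;
			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
			info->btf = btf;
			info->btf_id = bpf_sk_buff_ptr_ids[0];
			return true;
		}
	}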
> > + }
> > + }
> > +
> > + return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > +}
> > +
>
> [ ... ]
>
> > +
> > +static bool is_unsupported(u32 member_offset)
> > +{
> > + unsigned int i;
> > +
> > + for (i = 0; i < ARRAY_SIZE(unsupported_ops); i++) {
> > + if (member_offset == unsupported_ops[i])
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static int bpf_qdisc_check_member(const struct btf_type *t,
> > + const struct btf_member *member,
> > + const struct bpf_prog *prog)
> > +{
> > + if (is_unsupported(__btf_member_bit_offset(t, member) / 8))
>
> Note that the ".check_member" and the "is_unsupported" can be removed as you
> also noticed on the recent unsupported ops cleanup patches.
Thanks for looping me in. I removed them when testing the series.
>
> > + return -ENOTSUPP;
> > + return 0;
> > +}
>
> [ ... ]
>
> > +static struct Qdisc_ops __bpf_ops_qdisc_ops = {
> > + .enqueue = Qdisc_ops__enqueue,
> > + .dequeue = Qdisc_ops__dequeue,
> > + .peek = Qdisc_ops__peek,
> > + .init = Qdisc_ops__init,
> > + .reset = Qdisc_ops__reset,
> > + .destroy = Qdisc_ops__destroy,
> > + .change = Qdisc_ops__change,
> > + .attach = Qdisc_ops__attach,
> > + .change_tx_queue_len = Qdisc_ops__change_tx_queue_len,
> > + .change_real_num_tx = Qdisc_ops__change_real_num_tx,
> > + .dump = Qdisc_ops__dump,
> > + .dump_stats = Qdisc_ops__dump_stats,
>
> Similar to the above is_unsupported comment. The unsupported ops should be
> removed from the cfi_stubs.
>
> > + .ingress_block_set = Qdisc_ops__ingress_block_set,
> > + .egress_block_set = Qdisc_ops__egress_block_set,
> > + .ingress_block_get = Qdisc_ops__ingress_block_get,
> > + .egress_block_get = Qdisc_ops__egress_block_get,
> > +};
> > +
> > +static struct bpf_struct_ops bpf_Qdisc_ops = {
> > + .verifier_ops = &bpf_qdisc_verifier_ops,
> > + .reg = bpf_qdisc_reg,
> > + .unreg = bpf_qdisc_unreg,
> > + .check_member = bpf_qdisc_check_member,
> > + .init_member = bpf_qdisc_init_member,
> > + .init = bpf_qdisc_init,
> > + .validate = bpf_qdisc_validate,
>
> ".validate" is optional. The empty "bpf_qdisc_validate" can be removed.
>
Got it.
> > + .name = "Qdisc_ops",
> > + .cfi_stubs = &__bpf_ops_qdisc_ops,
> > + .owner = THIS_MODULE,
> > +};
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (4 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 05/11] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-25 22:38 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops Amery Hung
` (7 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Add kfuncs for working on skb in qdisc.
Both bpf_qdisc_skb_drop() and bpf_skb_release() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue where a to_free skb list is available from kernel to defer
the release. Otherwise, bpf_skb_release() should be used elsewhere. It
is also used in bpf_obj_free_fields() when cleaning up skb in maps and
collections.
bpf_qdisc_watchdog_schedule() can be used to schedule the execution of the qdisc.
An example use case is to throttle a qdisc if the time to dequeue the
next packet is known.
bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.
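As an illustration, a .dequeue program could throttle the qdisc roughly like
this (a minimal sketch; how next_tstamp is maintained is an assumption, and
boilerplate is omitted):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

extern void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire,
					u64 delta_ns) __ksym;

u64 next_tstamp;	/* earliest departure time, maintained by .enqueue */

SEC("struct_ops")
struct sk_buff *BPF_PROG(sketch_dequeue, struct Qdisc *sch)
{
	u64 now = bpf_ktime_get_ns();

	if (next_tstamp > now) {
		/* not time yet: arm the watchdog and report no packet */
		bpf_qdisc_watchdog_schedule(sch, next_tstamp, 0);
		return NULL;
	}

	/* ... otherwise pop and return the next skb ... */
	return NULL;
}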
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
net/sched/bpf_qdisc.c | 74 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 73 insertions(+), 1 deletion(-)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index a68fc115d8f8..eff7559aa346 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -148,6 +148,64 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
return 0;
}
+__bpf_kfunc_start_defs();
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+ return skb_get_hash(skb);
+}
+
+/* bpf_skb_release - Release an skb reference acquired on an skb immediately.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
+{
+ consume_skb(skb);
+}
+
+/* bpf_qdisc_skb_drop - Add an skb to be dropped later to a list.
+ * @skb: The skb on which a reference is being released and dropped.
+ * @to_free_list: The list of skbs to be dropped.
+ */
+__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
+ struct bpf_sk_buff_ptr *to_free_list)
+{
+ __qdisc_drop(skb, (struct sk_buff **)to_free_list);
+}
+
+/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
+ * @sch: The qdisc to be scheduled.
+ * @expire: The expiry time of the timer.
+ * @delta_ns: The slack range of the timer.
+ */
+__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_skb_get_hash)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
+BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_qdisc_kfunc_ids,
+};
+
+BTF_ID_LIST(skb_kfunc_dtor_ids)
+BTF_ID(struct, sk_buff)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+
static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
.get_func_proto = bpf_qdisc_get_func_proto,
.is_valid_access = bpf_qdisc_is_valid_access,
@@ -347,6 +405,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
static int __init bpf_qdisc_kfunc_init(void)
{
- return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+ int ret;
+ const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+ {
+ .btf_id = skb_kfunc_dtor_ids[0],
+ .kfunc_btf_id = skb_kfunc_dtor_ids[1]
+ },
+ };
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
+ ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+ ARRAY_SIZE(skb_kfunc_dtors),
+ THIS_MODULE);
+ ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+
+ return ret;
}
late_initcall(bpf_qdisc_kfunc_init);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs
2024-07-14 17:51 ` [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
@ 2024-07-25 22:38 ` Martin KaFai Lau
2024-07-31 4:08 ` Amery Hung
0 siblings, 1 reply; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-25 22:38 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> Add kfuncs for working on skb in qdisc.
>
> Both bpf_qdisc_skb_drop() and bpf_skb_release() can be used to release
> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> in .enqueue where a to_free skb list is available from kernel to defer
Enforcing the bpf_qdisc_skb_drop() kfunc only available to the ".enqueue" is
achieved by the "struct bpf_sk_buff_ptr" pointer type only available to the
".enqueue" ops ?
> the release. Otherwise, bpf_skb_release() should be used elsewhere. It
> is also used in bpf_obj_free_fields() when cleaning up skb in maps and
> collections.
>
> bpf_qdisc_watchdog_schedule() can be used to schedule the execution of the qdisc.
> An example use case is to throttle a qdisc if the time to dequeue the
> next packet is known.
>
> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> to build flow-based queueing algorithms.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> net/sched/bpf_qdisc.c | 74 ++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 73 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index a68fc115d8f8..eff7559aa346 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -148,6 +148,64 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
> return 0;
> }
>
> +__bpf_kfunc_start_defs();
> +
> +/* bpf_skb_get_hash - Get the flow hash of an skb.
> + * @skb: The skb to get the flow hash from.
> + */
> +__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
> +{
> + return skb_get_hash(skb);
> +}
> +
> +/* bpf_skb_release - Release an skb reference acquired on an skb immediately.
> + * @skb: The skb on which a reference is being released.
> + */
> +__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
> +{
> + consume_skb(skb);
snippet from the comment of consume_skb():
* Functions identically to kfree_skb, but kfree_skb assumes that the frame
* is being dropped after a failure and notes that
consume_skb() has a different tracepoint from the kfree_skb also. It is better
not to confuse the tracing.
I think at least the Qdisc_ops.reset and the btf_id_dtor_kfunc don't fall into
the consume_skb(). May be useful to add the kfree_skb[_reason?]() kfunc also?
> +}
> +
> +/* bpf_qdisc_skb_drop - Add an skb to be dropped later to a list.
> + * @skb: The skb on which a reference is being released and dropped.
> + * @to_free_list: The list of skbs to be dropped.
> + */
> +__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> + struct bpf_sk_buff_ptr *to_free_list)
> +{
> + __qdisc_drop(skb, (struct sk_buff **)to_free_list);
> +}
> +
> +/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
> + * @sch: The qdisc to be scheduled.
> + * @expire: The expiry time of the timer.
> + * @delta_ns: The slack range of the timer.
> + */
> +__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
> +{
> + struct bpf_sched_data *q = qdisc_priv(sch);
> +
> + qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> +BTF_ID_FLAGS(func, bpf_skb_get_hash)
Add KF_TRUSTED_ARGS. Avoid cases like getting a skb from walking the skb->next
for now.
> +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
Also add KF_TRUSTED_ARGS here.
> +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> +
> +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> + .owner = THIS_MODULE,
> + .set = &bpf_qdisc_kfunc_ids,
> +};
> +
> +BTF_ID_LIST(skb_kfunc_dtor_ids)
> +BTF_ID(struct, sk_buff)
> +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> +
> static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> .get_func_proto = bpf_qdisc_get_func_proto,
> .is_valid_access = bpf_qdisc_is_valid_access,
> @@ -347,6 +405,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
>
> static int __init bpf_qdisc_kfunc_init(void)
> {
> - return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> + int ret;
> + const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> + {
> + .btf_id = skb_kfunc_dtor_ids[0],
> + .kfunc_btf_id = skb_kfunc_dtor_ids[1]
> + },
> + };
> +
> + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> + ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> + ARRAY_SIZE(skb_kfunc_dtors),
> + THIS_MODULE);
> + ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> +
> + return ret;
> }
> late_initcall(bpf_qdisc_kfunc_init);
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs
2024-07-25 22:38 ` Martin KaFai Lau
@ 2024-07-31 4:08 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-31 4:08 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, xiyou.wangcong,
yepeilin.cs
On Thu, Jul 25, 2024 at 3:39 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/14/24 10:51 AM, Amery Hung wrote:
> > Add kfuncs for working on skb in qdisc.
> >
> > Both bpf_qdisc_skb_drop() and bpf_skb_release() can be used to release
> > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > in .enqueue where a to_free skb list is available from kernel to defer
>
> Enforcing the bpf_qdisc_skb_drop() kfunc only available to the ".enqueue" is
> achieved by the "struct bpf_sk_buff_ptr" pointer type only available to the
> ".enqueue" ops ?
Yes. I assume it will be better to make this availability check
explicit using the .filter you mentioned.
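A rough sketch of such a filter, following the pattern of existing kfunc
filters (the exact checks below are assumptions):

BTF_ID_LIST(bpf_qdisc_skb_drop_fid)
BTF_ID(func, bpf_qdisc_skb_drop)

static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	/* only .enqueue gets a to_free list, so only it may drop into one */
	if (kfunc_id == bpf_qdisc_skb_drop_fid[0] &&
	    strcmp(prog->aux->attach_func_name, "enqueue"))
		return -EACCES;
	return 0;
}

static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
	.owner	= THIS_MODULE,
	.set	= &bpf_qdisc_kfunc_ids,
	.filter	= bpf_qdisc_kfunc_filter,
};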
>
> > the release. Otherwise, bpf_skb_release() should be used elsewhere. It
> > is also used in bpf_obj_free_fields() when cleaning up skb in maps and
> > collections.
> >
> > bpf_qdisc_watchdog_schedule() can be used to schedule the execution of the qdisc.
> > An example use case is to throttle a qdisc if the time to dequeue the
> > next packet is known.
> >
> > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > to build flow-based queueing algorithms.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > net/sched/bpf_qdisc.c | 74 ++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 73 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> > index a68fc115d8f8..eff7559aa346 100644
> > --- a/net/sched/bpf_qdisc.c
> > +++ b/net/sched/bpf_qdisc.c
> > @@ -148,6 +148,64 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
> > return 0;
> > }
> >
> > +__bpf_kfunc_start_defs();
> > +
> > +/* bpf_skb_get_hash - Get the flow hash of an skb.
> > + * @skb: The skb to get the flow hash from.
> > + */
> > +__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
> > +{
> > + return skb_get_hash(skb);
> > +}
> > +
> > +/* bpf_skb_release - Release an skb reference acquired on an skb immediately.
> > + * @skb: The skb on which a reference is being released.
> > + */
> > +__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
> > +{
> > + consume_skb(skb);
>
> snippet from the comment of consume_skb():
>
> * Functions identically to kfree_skb, but kfree_skb assumes that the frame
> * is being dropped after a failure and notes that
>
> consume_skb() has a different tracepoint from the kfree_skb also. It is better
> not to confuse the tracing.
>
> I think at least the Qdisc_ops.reset and the btf_id_dtor_kfunc don't fall into
> the consume_skb(). May be useful to add the kfree_skb[_reason?]() kfunc also?
>
I see. I will change bpf_skb_release() from using consume_skb() to
kfree_skb() (existing qdiscs are not using skb_drop_reason). The skb
cleanup mechanism when .reset is called can use rtnl_kfree_skbs().
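i.e. roughly (a sketch of the planned change, not the final patch):

__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
{
	kfree_skb(skb);
}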
> > +}
> > +
> > +/* bpf_qdisc_skb_drop - Add an skb to be dropped later to a list.
> > + * @skb: The skb on which a reference is being released and dropped.
> > + * @to_free_list: The list of skbs to be dropped.
> > + */
> > +__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
> > + struct bpf_sk_buff_ptr *to_free_list)
> > +{
> > + __qdisc_drop(skb, (struct sk_buff **)to_free_list);
> > +}
> > +
> > +/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
> > + * @sch: The qdisc to be scheduled.
> > + * @expire: The expiry time of the timer.
> > + * @delta_ns: The slack range of the timer.
> > + */
> > +__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
> > +{
> > + struct bpf_sched_data *q = qdisc_priv(sch);
> > +
> > + qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
> > +}
> > +
> > +__bpf_kfunc_end_defs();
> > +
> > +BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
> > +BTF_ID_FLAGS(func, bpf_skb_get_hash)
>
> Add KF_TRUSTED_ARGS. Avoid cases like getting a skb from walking the skb->next
> for now.
Good point. Will do.
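i.e. with both flags added, the ID set would look roughly like this (sketch):

BTF_KFUNCS_START(bpf_qdisc_kfunc_ids)
BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)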
>
> > +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule)
>
> Also add KF_TRUSTED_ARGS here.
>
> > +BTF_KFUNCS_END(bpf_qdisc_kfunc_ids)
> > +
> > +static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
> > + .owner = THIS_MODULE,
> > + .set = &bpf_qdisc_kfunc_ids,
> > +};
> > +
> > +BTF_ID_LIST(skb_kfunc_dtor_ids)
> > +BTF_ID(struct, sk_buff)
> > +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
> > +
> > static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
> > .get_func_proto = bpf_qdisc_get_func_proto,
> > .is_valid_access = bpf_qdisc_is_valid_access,
> > @@ -347,6 +405,20 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
> >
> > static int __init bpf_qdisc_kfunc_init(void)
> > {
> > - return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > + int ret;
> > + const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
> > + {
> > + .btf_id = skb_kfunc_dtor_ids[0],
> > + .kfunc_btf_id = skb_kfunc_dtor_ids[1]
> > + },
> > + };
> > +
> > + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
> > + ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
> > + ARRAY_SIZE(skb_kfunc_dtors),
> > + THIS_MODULE);
> > + ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
> > +
> > + return ret;
> > }
> > late_initcall(bpf_qdisc_kfunc_init);
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (5 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 06/11] bpf: net_sched: Add bpf qdisc kfuncs Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-18 0:01 ` Amery Hung
2024-07-26 1:15 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 08/11] libbpf: Support creating and destroying qdisc Amery Hung
` (6 subsequent siblings)
13 siblings, 2 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
So far, init, reset, and destroy are implemented by bpf qdisc infra as
fixed operators that manipulate the watchdog according to the occasion.
This patch allows users to implement these three operators to perform
desired work alongside the predefined ones.
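For example, a bpf qdisc can now supply its own .init alongside the watchdog
setup still done by the kernel (a minimal sketch; the program below is an
assumption for illustration):

SEC("struct_ops")
int BPF_PROG(sketch_init, struct Qdisc *sch, struct nlattr *opt,
	     struct netlink_ext_ack *extack)
{
	/* user-defined setup, e.g. initializing per-qdisc state in maps;
	 * the watchdog is still initialized by bpf_qdisc_init_pre_op()
	 */
	return 0;
}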
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/net/sch_generic.h | 6 ++++++
net/sched/bpf_qdisc.c | 20 ++++----------------
net/sched/sch_api.c | 11 +++++++++++
net/sched/sch_generic.c | 8 ++++++++
4 files changed, 29 insertions(+), 16 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 214ed2e34faa..3041782b7527 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1359,4 +1359,10 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
msleep(1);
}
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
+void bpf_qdisc_reset_post_op(struct Qdisc *sch);
+#endif
+
#endif
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index eff7559aa346..903b4eb54510 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -9,9 +9,6 @@
static struct bpf_struct_ops bpf_Qdisc_ops;
static u32 unsupported_ops[] = {
- offsetof(struct Qdisc_ops, init),
- offsetof(struct Qdisc_ops, reset),
- offsetof(struct Qdisc_ops, destroy),
offsetof(struct Qdisc_ops, change),
offsetof(struct Qdisc_ops, attach),
offsetof(struct Qdisc_ops, change_real_num_tx),
@@ -36,8 +33,8 @@ static int bpf_qdisc_init(struct btf *btf)
return 0;
}
-static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
- struct netlink_ext_ack *extack)
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
{
struct bpf_sched_data *q = qdisc_priv(sch);
@@ -45,14 +42,14 @@ static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
return 0;
}
-static void bpf_qdisc_reset_op(struct Qdisc *sch)
+void bpf_qdisc_reset_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
qdisc_watchdog_cancel(&q->watchdog);
}
-static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
@@ -235,15 +232,6 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
return -EINVAL;
qdisc_ops->static_flags = TCQ_F_BPF;
return 1;
- case offsetof(struct Qdisc_ops, init):
- qdisc_ops->init = bpf_qdisc_init_op;
- return 1;
- case offsetof(struct Qdisc_ops, reset):
- qdisc_ops->reset = bpf_qdisc_reset_op;
- return 1;
- case offsetof(struct Qdisc_ops, destroy):
- qdisc_ops->destroy = bpf_qdisc_destroy_op;
- return 1;
case offsetof(struct Qdisc_ops, peek):
if (!uqdisc_ops->peek)
qdisc_ops->peek = qdisc_peek_dequeued;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 5064b6d2d1ec..9fb9375e2793 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1352,6 +1352,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
rcu_assign_pointer(sch->stab, stab);
}
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (sch->flags & TCQ_F_BPF) {
+ err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
+ if (err != 0)
+ goto err_out4;
+ }
+#endif
if (ops->init) {
err = ops->init(sch, tca[TCA_OPTIONS], extack);
if (err != 0)
@@ -1388,6 +1395,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
*/
if (ops->destroy)
ops->destroy(sch);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (sch->flags & TCQ_F_BPF)
+ bpf_qdisc_destroy_post_op(sch);
+#endif
qdisc_put_stab(rtnl_dereference(sch->stab));
err_out3:
lockdep_unregister_key(&sch->root_lock_key);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 76e4a6efd17c..0ac05665c69f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1033,6 +1033,10 @@ void qdisc_reset(struct Qdisc *qdisc)
if (ops->reset)
ops->reset(qdisc);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (qdisc->flags & TCQ_F_BPF)
+ bpf_qdisc_reset_post_op(qdisc);
+#endif
__skb_queue_purge(&qdisc->gso_skb);
__skb_queue_purge(&qdisc->skb_bad_txq);
@@ -1076,6 +1080,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
if (ops->destroy)
ops->destroy(qdisc);
+#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
+ if (qdisc->flags & TCQ_F_BPF)
+ bpf_qdisc_destroy_post_op(qdisc);
+#endif
lockdep_unregister_key(&qdisc->root_lock_key);
bpf_module_put(ops, ops->owner);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-14 17:51 ` [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops Amery Hung
@ 2024-07-18 0:01 ` Amery Hung
2024-07-26 1:15 ` Martin KaFai Lau
1 sibling, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-18 0:01 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Amery Hung <ameryhung@gmail.com>
So far, init, reset, and destroy are implemented by bpf qdisc infra as
fixed operators that manipulate the watchdog according to the occasion.
This patch allows users to implement these three operators to perform
desired work alongside the predefined ones.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
include/net/sch_generic.h | 4 ++++
net/sched/bpf_qdisc.c | 23 +++++++----------------
net/sched/sch_api.c | 11 +++++++++++
net/sched/sch_generic.c | 8 ++++++++
4 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 214ed2e34faa..1ab9e91281c0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -1359,4 +1359,8 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
msleep(1);
}
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
+void bpf_qdisc_reset_post_op(struct Qdisc *sch);
+
#endif
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index eff7559aa346..0273b3f8f230 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -9,9 +9,6 @@
static struct bpf_struct_ops bpf_Qdisc_ops;
static u32 unsupported_ops[] = {
- offsetof(struct Qdisc_ops, init),
- offsetof(struct Qdisc_ops, reset),
- offsetof(struct Qdisc_ops, destroy),
offsetof(struct Qdisc_ops, change),
offsetof(struct Qdisc_ops, attach),
offsetof(struct Qdisc_ops, change_real_num_tx),
@@ -36,28 +33,31 @@ static int bpf_qdisc_init(struct btf *btf)
return 0;
}
-static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
- struct netlink_ext_ack *extack)
+int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
{
struct bpf_sched_data *q = qdisc_priv(sch);
qdisc_watchdog_init(&q->watchdog, sch);
return 0;
}
+EXPORT_SYMBOL(bpf_qdisc_init_pre_op);
-static void bpf_qdisc_reset_op(struct Qdisc *sch)
+void bpf_qdisc_reset_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
qdisc_watchdog_cancel(&q->watchdog);
}
+EXPORT_SYMBOL(bpf_qdisc_reset_post_op);
-static void bpf_qdisc_destroy_op(struct Qdisc *sch)
+void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
{
struct bpf_sched_data *q = qdisc_priv(sch);
qdisc_watchdog_cancel(&q->watchdog);
}
+EXPORT_SYMBOL(bpf_qdisc_destroy_post_op);
static const struct bpf_func_proto *
bpf_qdisc_get_func_proto(enum bpf_func_id func_id,
@@ -235,15 +235,6 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
return -EINVAL;
qdisc_ops->static_flags = TCQ_F_BPF;
return 1;
- case offsetof(struct Qdisc_ops, init):
- qdisc_ops->init = bpf_qdisc_init_op;
- return 1;
- case offsetof(struct Qdisc_ops, reset):
- qdisc_ops->reset = bpf_qdisc_reset_op;
- return 1;
- case offsetof(struct Qdisc_ops, destroy):
- qdisc_ops->destroy = bpf_qdisc_destroy_op;
- return 1;
case offsetof(struct Qdisc_ops, peek):
if (!uqdisc_ops->peek)
qdisc_ops->peek = qdisc_peek_dequeued;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 5064b6d2d1ec..6379edf94f69 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1352,6 +1352,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
rcu_assign_pointer(sch->stab, stab);
}
+#ifdef CONFIG_NET_SCH_BPF
+ if (sch->flags & TCQ_F_BPF) {
+ err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
+ if (err != 0)
+ goto err_out4;
+ }
+#endif
if (ops->init) {
err = ops->init(sch, tca[TCA_OPTIONS], extack);
if (err != 0)
@@ -1388,6 +1395,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
*/
if (ops->destroy)
ops->destroy(sch);
+#ifdef CONFIG_NET_SCH_BPF
+ if (sch->flags & TCQ_F_BPF)
+ bpf_qdisc_destroy_post_op(sch);
+#endif
qdisc_put_stab(rtnl_dereference(sch->stab));
err_out3:
lockdep_unregister_key(&sch->root_lock_key);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 76e4a6efd17c..0906d8a9f35d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1033,6 +1033,10 @@ void qdisc_reset(struct Qdisc *qdisc)
if (ops->reset)
ops->reset(qdisc);
+#ifdef CONFIG_NET_SCH_BPF
+ if (qdisc->flags & TCQ_F_BPF)
+ bpf_qdisc_reset_post_op(qdisc);
+#endif
__skb_queue_purge(&qdisc->gso_skb);
__skb_queue_purge(&qdisc->skb_bad_txq);
@@ -1076,6 +1080,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
if (ops->destroy)
ops->destroy(qdisc);
+#ifdef CONFIG_NET_SCH_BPF
+ if (qdisc->flags & TCQ_F_BPF)
+ bpf_qdisc_destroy_post_op(qdisc);
+#endif
lockdep_unregister_key(&qdisc->root_lock_key);
bpf_module_put(ops, ops->owner);
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-14 17:51 ` [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops Amery Hung
2024-07-18 0:01 ` Amery Hung
@ 2024-07-26 1:15 ` Martin KaFai Lau
2024-07-26 18:30 ` Martin KaFai Lau
2024-07-26 22:30 ` Amery Hung
1 sibling, 2 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-26 1:15 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> So far, init, reset, and destroy are implemented by bpf qdisc infra as
> fixed operators that manipulate the watchdog according to the occasion.
> This patch allows users to implement these three operators to perform
> desired work alongside the predefined ones.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
> include/net/sch_generic.h | 6 ++++++
> net/sched/bpf_qdisc.c | 20 ++++----------------
> net/sched/sch_api.c | 11 +++++++++++
> net/sched/sch_generic.c | 8 ++++++++
> 4 files changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 214ed2e34faa..3041782b7527 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -1359,4 +1359,10 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
> msleep(1);
> }
>
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
> +void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
> +void bpf_qdisc_reset_post_op(struct Qdisc *sch);
> +#endif
> +
> #endif
> diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> index eff7559aa346..903b4eb54510 100644
> --- a/net/sched/bpf_qdisc.c
> +++ b/net/sched/bpf_qdisc.c
> @@ -9,9 +9,6 @@
> static struct bpf_struct_ops bpf_Qdisc_ops;
>
> static u32 unsupported_ops[] = {
> - offsetof(struct Qdisc_ops, init),
> - offsetof(struct Qdisc_ops, reset),
> - offsetof(struct Qdisc_ops, destroy),
> offsetof(struct Qdisc_ops, change),
> offsetof(struct Qdisc_ops, attach),
> offsetof(struct Qdisc_ops, change_real_num_tx),
> @@ -36,8 +33,8 @@ static int bpf_qdisc_init(struct btf *btf)
> return 0;
> }
>
> -static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
> - struct netlink_ext_ack *extack)
> +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
> + struct netlink_ext_ack *extack)
> {
> struct bpf_sched_data *q = qdisc_priv(sch);
>
> @@ -45,14 +42,14 @@ static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
> return 0;
> }
>
> -static void bpf_qdisc_reset_op(struct Qdisc *sch)
> +void bpf_qdisc_reset_post_op(struct Qdisc *sch)
> {
> struct bpf_sched_data *q = qdisc_priv(sch);
>
> qdisc_watchdog_cancel(&q->watchdog);
> }
>
> -static void bpf_qdisc_destroy_op(struct Qdisc *sch)
> +void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
The reset_post_ops and destroy_post_op are identical. They only do
qdisc_watchdog_cancel().
> {
> struct bpf_sched_data *q = qdisc_priv(sch);
>
> @@ -235,15 +232,6 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
> return -EINVAL;
> qdisc_ops->static_flags = TCQ_F_BPF;
> return 1;
> - case offsetof(struct Qdisc_ops, init):
> - qdisc_ops->init = bpf_qdisc_init_op;
> - return 1;
> - case offsetof(struct Qdisc_ops, reset):
> - qdisc_ops->reset = bpf_qdisc_reset_op;
> - return 1;
> - case offsetof(struct Qdisc_ops, destroy):
> - qdisc_ops->destroy = bpf_qdisc_destroy_op;
> - return 1;
> case offsetof(struct Qdisc_ops, peek):
> if (!uqdisc_ops->peek)
> qdisc_ops->peek = qdisc_peek_dequeued;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 5064b6d2d1ec..9fb9375e2793 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1352,6 +1352,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> rcu_assign_pointer(sch->stab, stab);
> }
>
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> + if (sch->flags & TCQ_F_BPF) {
I can see the reason why this patch is needed. It is a few line changes and they
are not in the fast path... still weakly not excited about them but I know it
could be a personal preference.
I think at the very least, instead of adding a new TCQ_F_BPF, let see if the
"owner == BPF_MODULE_OWNER" test can be reused like how it is done in the
bpf_try_module_get().
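i.e. roughly (sketch):

	/* e.g. in __qdisc_destroy(), instead of testing a new TCQ_F_BPF flag */
	if (qdisc->ops->owner == BPF_MODULE_OWNER)
		bpf_qdisc_destroy_post_op(qdisc);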
A rough direction I am spinning...
The pre/post is mainly to initialize and cleanup the "struct bpf_sched_data"
before/after calling the bpf prog.
For the pre (init), there is a ".gen_prologue(...., const struct bpf_prog
*prog)" in the "bpf_verifier_ops". Take a look at the tc_cls_act_prologue().
It calls a BPF_FUNC_skb_pull_data helper. It potentially can call a kfunc
bpf_qdisc_watchdog_cancel. However, the gen_prologue is invoked too late in the
verifier for kfunc calling now. This will need some thoughts and works.
For the post (destroy,reset), there is no "gen_epilogue" now. If
bpf_qdisc_watchdog_schedule() is not allowed to be called in the ".reset" and
".destroy" bpf prog. I think it can be changed to pre also? There is a ".filter"
function in the "struct btf_kfunc_id_set" during the kfunc register.
> + err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
> + if (err != 0)
> + goto err_out4;
> + }
> +#endif
> if (ops->init) {
> err = ops->init(sch, tca[TCA_OPTIONS], extack);
> if (err != 0)
> @@ -1388,6 +1395,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> */
> if (ops->destroy)
> ops->destroy(sch);
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> + if (sch->flags & TCQ_F_BPF)
> + bpf_qdisc_destroy_post_op(sch);
> +#endif
> qdisc_put_stab(rtnl_dereference(sch->stab));
> err_out3:
> lockdep_unregister_key(&sch->root_lock_key);
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 76e4a6efd17c..0ac05665c69f 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1033,6 +1033,10 @@ void qdisc_reset(struct Qdisc *qdisc)
>
> if (ops->reset)
> ops->reset(qdisc);
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> + if (qdisc->flags & TCQ_F_BPF)
> + bpf_qdisc_reset_post_op(qdisc);
> +#endif
>
> __skb_queue_purge(&qdisc->gso_skb);
> __skb_queue_purge(&qdisc->skb_bad_txq);
> @@ -1076,6 +1080,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
>
> if (ops->destroy)
> ops->destroy(qdisc);
> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> + if (qdisc->flags & TCQ_F_BPF)
> + bpf_qdisc_destroy_post_op(qdisc);
> +#endif
>
> lockdep_unregister_key(&qdisc->root_lock_key);
> bpf_module_put(ops, ops->owner);
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-26 1:15 ` Martin KaFai Lau
@ 2024-07-26 18:30 ` Martin KaFai Lau
2024-07-26 22:30 ` Amery Hung
1 sibling, 0 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-26 18:30 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/25/24 6:15 PM, Martin KaFai Lau wrote:
>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>> index 5064b6d2d1ec..9fb9375e2793 100644
>> --- a/net/sched/sch_api.c
>> +++ b/net/sched/sch_api.c
>> @@ -1352,6 +1352,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>> rcu_assign_pointer(sch->stab, stab);
>> }
>> +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
>> + if (sch->flags & TCQ_F_BPF) {
>
> I can see the reason why this patch is needed. It is a few line changes and they
> are not in the fast path... still weakly not excited about them but I know it
> could be a personal preference.
>
> I think at the very least, instead of adding a new TCQ_F_BPF, let see if the
> "owner == BPF_MODULE_OWNER" test can be reused like how it is done in the
> bpf_try_module_get().
>
>
> A rough direction I am spinning...
>
> The pre/post is mainly to initialize and cleanup the "struct bpf_sched_data"
> before/after calling the bpf prog.
>
> For the pre (init), there is a ".gen_prologue(...., const struct bpf_prog
> *prog)" in the "bpf_verifier_ops". Take a look at the tc_cls_act_prologue().
> It calls a BPF_FUNC_skb_pull_data helper. It potentially can call a kfunc
> bpf_qdisc_watchdog_cancel. However, the gen_prologue is invoked too late in the
typo. The kfunc should be s/qdisc_watchdog_cancel/qdisc_watchdog_init/ for the pre.
> verifier for kfunc calling now. This will need some thoughts and works.
>
> For the post (destroy,reset), there is no "gen_epilogue" now. If
> bpf_qdisc_watchdog_schedule() is not allowed to be called in the ".reset" and
> ".destroy" bpf prog. I think it can be changed to pre also? There is a ".filter"
> function in the "struct btf_kfunc_id_set" during the kfunc register.
>
>> + err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
>> + if (err != 0)
>> + goto err_out4;
>> + }
^ permalink raw reply [flat|nested] 42+ messages in thread* Re: [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-26 1:15 ` Martin KaFai Lau
2024-07-26 18:30 ` Martin KaFai Lau
@ 2024-07-26 22:30 ` Amery Hung
2024-07-30 0:20 ` Martin KaFai Lau
1 sibling, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-26 22:30 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On Thu, Jul 25, 2024 at 6:15 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/14/24 10:51 AM, Amery Hung wrote:
> > So far, init, reset, and destroy are implemented by bpf qdisc infra as
> > fixed operators that manipulate the watchdog according to the occasion.
> > This patch allows users to implement these three operators to perform
> > desired work alongside the predefined ones.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> > include/net/sch_generic.h | 6 ++++++
> > net/sched/bpf_qdisc.c | 20 ++++----------------
> > net/sched/sch_api.c | 11 +++++++++++
> > net/sched/sch_generic.c | 8 ++++++++
> > 4 files changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 214ed2e34faa..3041782b7527 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -1359,4 +1359,10 @@ static inline void qdisc_synchronize(const struct Qdisc *q)
> > msleep(1);
> > }
> >
> > +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> > +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt, struct netlink_ext_ack *extack);
> > +void bpf_qdisc_destroy_post_op(struct Qdisc *sch);
> > +void bpf_qdisc_reset_post_op(struct Qdisc *sch);
> > +#endif
> > +
> > #endif
> > diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
> > index eff7559aa346..903b4eb54510 100644
> > --- a/net/sched/bpf_qdisc.c
> > +++ b/net/sched/bpf_qdisc.c
> > @@ -9,9 +9,6 @@
> > static struct bpf_struct_ops bpf_Qdisc_ops;
> >
> > static u32 unsupported_ops[] = {
> > - offsetof(struct Qdisc_ops, init),
> > - offsetof(struct Qdisc_ops, reset),
> > - offsetof(struct Qdisc_ops, destroy),
> > offsetof(struct Qdisc_ops, change),
> > offsetof(struct Qdisc_ops, attach),
> > offsetof(struct Qdisc_ops, change_real_num_tx),
> > @@ -36,8 +33,8 @@ static int bpf_qdisc_init(struct btf *btf)
> > return 0;
> > }
> >
> > -static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
> > - struct netlink_ext_ack *extack)
> > +int bpf_qdisc_init_pre_op(struct Qdisc *sch, struct nlattr *opt,
> > + struct netlink_ext_ack *extack)
> > {
> > struct bpf_sched_data *q = qdisc_priv(sch);
> >
> > @@ -45,14 +42,14 @@ static int bpf_qdisc_init_op(struct Qdisc *sch, struct nlattr *opt,
> > return 0;
> > }
> >
> > -static void bpf_qdisc_reset_op(struct Qdisc *sch)
> > +void bpf_qdisc_reset_post_op(struct Qdisc *sch)
> > {
> > struct bpf_sched_data *q = qdisc_priv(sch);
> >
> > qdisc_watchdog_cancel(&q->watchdog);
> > }
> >
> > -static void bpf_qdisc_destroy_op(struct Qdisc *sch)
> > +void bpf_qdisc_destroy_post_op(struct Qdisc *sch)
>
> The reset_post_ops and destroy_post_op are identical. They only do
> qdisc_watchdog_cancel().
>
> > {
> > struct bpf_sched_data *q = qdisc_priv(sch);
> >
> > @@ -235,15 +232,6 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
> > return -EINVAL;
> > qdisc_ops->static_flags = TCQ_F_BPF;
> > return 1;
> > - case offsetof(struct Qdisc_ops, init):
> > - qdisc_ops->init = bpf_qdisc_init_op;
> > - return 1;
> > - case offsetof(struct Qdisc_ops, reset):
> > - qdisc_ops->reset = bpf_qdisc_reset_op;
> > - return 1;
> > - case offsetof(struct Qdisc_ops, destroy):
> > - qdisc_ops->destroy = bpf_qdisc_destroy_op;
> > - return 1;
> > case offsetof(struct Qdisc_ops, peek):
> > if (!uqdisc_ops->peek)
> > qdisc_ops->peek = qdisc_peek_dequeued;
> > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > index 5064b6d2d1ec..9fb9375e2793 100644
> > --- a/net/sched/sch_api.c
> > +++ b/net/sched/sch_api.c
> > @@ -1352,6 +1352,13 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> > rcu_assign_pointer(sch->stab, stab);
> > }
> >
> > +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> > + if (sch->flags & TCQ_F_BPF) {
>
> I can see the reason why this patch is needed. It is a few line changes and they
> are not in the fast path... still weakly not excited about them but I know it
> could be a personal preference.
>
> I think at the very least, instead of adding a new TCQ_F_BPF, let see if the
> "owner == BPF_MODULE_OWNER" test can be reused like how it is done in the
> bpf_try_module_get().
>
Thanks for the suggestion. Will do.
>
> A rough direction I am spinning...
>
> The pre/post is mainly to initialize and cleanup the "struct bpf_sched_data"
> before/after calling the bpf prog.
>
> For the pre (init), there is a ".gen_prologue(...., const struct bpf_prog
> *prog)" in the "bpf_verifier_ops". Take a look at the tc_cls_act_prologue().
> It calls a BPF_FUNC_skb_pull_data helper. It potentially can call a kfunc
> bpf_qdisc_watchdog_cancel. However, the gen_prologue is invoked too late in the
> verifier for kfunc calling now. This will need some thoughts and works.
>
> For the post (destroy,reset), there is no "gen_epilogue" now. If
> bpf_qdisc_watchdog_schedule() is not allowed to be called in the ".reset" and
> ".destroy" bpf prog. I think it can be changed to pre also? There is a ".filter"
> function in the "struct btf_kfunc_id_set" during the kfunc register.
>
I can see how that would work. The ability to add prologue, epilogue
to struct_ops operators is one thing on my wish list.
Meanwhile, I am not sure whether that should be written in the kernel
or rewritten by the verifier. An argument for keeping it in the kernel
is that the prologue or epilogue can get quite complex and involves
many kernel structures not exposed to the bpf program (pre-defined ops
in Qdisc_ops in v8).
Maybe we can keep the current approach in the initial version as they
are not in the fast path, and then move to (gen_prologue,
gen_epilogue) once the plumbing is done?
> > + err = bpf_qdisc_init_pre_op(sch, tca[TCA_OPTIONS], extack);
> > + if (err != 0)
> > + goto err_out4;
> > + }
> > +#endif
> > if (ops->init) {
> > err = ops->init(sch, tca[TCA_OPTIONS], extack);
> > if (err != 0)
> > @@ -1388,6 +1395,10 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> > */
> > if (ops->destroy)
> > ops->destroy(sch);
> > +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> > + if (sch->flags & TCQ_F_BPF)
> > + bpf_qdisc_destroy_post_op(sch);
> > +#endif
> > qdisc_put_stab(rtnl_dereference(sch->stab));
> > err_out3:
> > lockdep_unregister_key(&sch->root_lock_key);
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 76e4a6efd17c..0ac05665c69f 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -1033,6 +1033,10 @@ void qdisc_reset(struct Qdisc *qdisc)
> >
> > if (ops->reset)
> > ops->reset(qdisc);
> > +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> > + if (qdisc->flags & TCQ_F_BPF)
> > + bpf_qdisc_reset_post_op(qdisc);
> > +#endif
> >
> > __skb_queue_purge(&qdisc->gso_skb);
> > __skb_queue_purge(&qdisc->skb_bad_txq);
> > @@ -1076,6 +1080,10 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
> >
> > if (ops->destroy)
> > ops->destroy(qdisc);
> > +#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_BPF_JIT)
> > + if (qdisc->flags & TCQ_F_BPF)
> > + bpf_qdisc_destroy_post_op(qdisc);
> > +#endif
> >
> > lockdep_unregister_key(&qdisc->root_lock_key);
> > bpf_module_put(ops, ops->owner);
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops
2024-07-26 22:30 ` Amery Hung
@ 2024-07-30 0:20 ` Martin KaFai Lau
0 siblings, 0 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-30 0:20 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, xiyou.wangcong,
yepeilin.cs, Stanislav Fomichev
On 7/26/24 3:30 PM, Amery Hung wrote:
>> The pre/post is mainly to initialize and cleanup the "struct bpf_sched_data"
>> before/after calling the bpf prog.
>>
>> For the pre (init), there is a ".gen_prologue(...., const struct bpf_prog
>> *prog)" in the "bpf_verifier_ops". Take a look at the tc_cls_act_prologue().
>> It calls a BPF_FUNC_skb_pull_data helper. It potentially can call a kfunc
>> bpf_qdisc_watchdog_cancel. However, the gen_prologue is invoked too late in the
>> verifier for kfunc calling now. This will need some thoughts and works.
>>
>> For the post (destroy,reset), there is no "gen_epilogue" now. If
>> bpf_qdisc_watchdog_schedule() is not allowed to be called in the ".reset" and
>> ".destroy" bpf prog. I think it can be changed to pre also? There is a ".filter"
>> function in the "struct btf_kfunc_id_set" during the kfunc register.
>>
> I can see how that would work. The ability to add prologue, epilogue
> to struct_ops operators is one thing on my wish list.
>
> Meanwhile, I am not sure whether that should be written in the kernel
> or rewritten by the verifier. An argument for keeping it in the kernel
> is that the prologue or epilogue can get quite complex and involves
> many kernel structures not exposed to the bpf program (pre-defined ops
> in Qdisc_ops in v8).
Can the v8 pre-defined ops be called as a kfunc? The qdisc_watchdog_cancel/init
in v9 could be a kfunc and called by pro/epilogue.
For checking and/or resetting skb->dev, it should be simple enough without
kfunc. e.g. when reusing the skb->rbnode in the future followup effort.
[ Unrelated to qdisc. bpf_tcp_ca can also use help to ensure the cwnd is sane. ]
>
> Maybe we can keep the current approach in the initial version as they
> are not in the fast path, and then move to (gen_prologue,
> gen_epilogue) once the plumbing is done?
Sure. It can be improved when things are ready.
I am trying some ideas on how to do gen_epilogue and will share when I get some
tests working.
^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v9 08/11] libbpf: Support creating and destroying qdisc
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (6 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 07/11] bpf: net_sched: Allow more optional operators in Qdisc_ops Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 09/11] selftests: Add a basic fifo qdisc test Amery Hung
` (5 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
Extend struct bpf_tc_hook with a handle, a qdisc name and a new attach type,
BPF_TC_QDISC, to allow users to add or remove any specified qdisc in addition
to clsact.
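For example, a user can create and later remove a root bpf qdisc like this
(a usage sketch; the qdisc name is whatever Qdisc_ops name was registered
via struct_ops):

static int run_with_bpf_qdisc(int ifindex, char *qdisc_name)
{
	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
			    .ifindex = ifindex,
			    .attach_point = BPF_TC_QDISC,
			    .parent = TC_H_ROOT,
			    .handle = 0x8000000,
			    .qdisc = qdisc_name);
	int err;

	err = bpf_tc_hook_create(&hook);
	if (err)
		return err;

	/* ... traffic runs through the bpf qdisc here ... */

	return bpf_tc_hook_destroy(&hook);
}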
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
tools/lib/bpf/libbpf.h | 5 ++++-
tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 64a6a3d323e3..f6329a901c9b 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1258,6 +1258,7 @@ enum bpf_tc_attach_point {
BPF_TC_INGRESS = 1 << 0,
BPF_TC_EGRESS = 1 << 1,
BPF_TC_CUSTOM = 1 << 2,
+ BPF_TC_QDISC = 1 << 3,
};
#define BPF_TC_PARENT(a, b) \
@@ -1272,9 +1273,11 @@ struct bpf_tc_hook {
int ifindex;
enum bpf_tc_attach_point attach_point;
__u32 parent;
+ __u32 handle;
+ char *qdisc;
size_t :0;
};
-#define bpf_tc_hook__last_field parent
+#define bpf_tc_hook__last_field qdisc
struct bpf_tc_opts {
size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 68a2def17175..72db8c0add21 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
}
-typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
+typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, struct bpf_tc_hook *hook);
-static int clsact_config(struct libbpf_nla_req *req)
+static int clsact_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
{
req->tc.tcm_parent = TC_H_CLSACT;
req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
@@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
}
+static int qdisc_config(struct libbpf_nla_req *req, struct bpf_tc_hook *hook)
+{
+ char *qdisc = OPTS_GET(hook, qdisc, NULL);
+
+ req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
+ req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
+
+ return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
+}
+
static int attach_point_to_config(struct bpf_tc_hook *hook,
qdisc_config_t *config)
{
@@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
return 0;
case BPF_TC_CUSTOM:
return -EOPNOTSUPP;
+ case BPF_TC_QDISC:
+ *config = &qdisc_config;
+ return 0;
default:
return -EINVAL;
}
@@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
req.tc.tcm_family = AF_UNSPEC;
req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
- ret = config(&req);
+ ret = config(&req, hook);
if (ret < 0)
return ret;
@@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
case BPF_TC_INGRESS:
case BPF_TC_EGRESS:
return libbpf_err(__bpf_tc_detach(hook, NULL, true));
+ case BPF_TC_QDISC:
case BPF_TC_INGRESS | BPF_TC_EGRESS:
return libbpf_err(tc_qdisc_delete(hook));
case BPF_TC_CUSTOM:
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [RFC PATCH v9 09/11] selftests: Add a basic fifo qdisc test
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (7 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 08/11] libbpf: Support creating and destroying qdisc Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-14 17:51 ` [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest Amery Hung
` (4 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
This selftest shows a bare minimum fifo qdisc, which simply enqueues skbs
into the back of a bpf list and dequeues from the front of the list.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 161 ++++++++++++++++++
.../selftests/bpf/progs/bpf_qdisc_common.h | 16 ++
.../selftests/bpf/progs/bpf_qdisc_fifo.c | 102 +++++++++++
3 files changed, 279 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
new file mode 100644
index 000000000000..295d0216e70f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -0,0 +1,161 @@
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <test_progs.h>
+
+#include "network_helpers.h"
+#include "bpf_qdisc_fifo.skel.h"
+
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
+#define LO_IFINDEX 1
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+static int stop;
+
+static void *server(void *arg)
+{
+ int lfd = (int)(long)arg, err = 0, fd;
+ ssize_t nr_sent = 0, bytes = 0;
+ char batch[1500];
+
+ fd = accept(lfd, NULL, NULL);
+ while (fd == -1) {
+ if (errno == EINTR)
+ continue;
+ err = -errno;
+ goto done;
+ }
+
+ if (settimeo(fd, 0)) {
+ err = -errno;
+ goto done;
+ }
+
+ while (bytes < total_bytes && !READ_ONCE(stop)) {
+ nr_sent = send(fd, &batch,
+ MIN(total_bytes - bytes, sizeof(batch)), 0);
+ if (nr_sent == -1 && errno == EINTR)
+ continue;
+ if (nr_sent == -1) {
+ err = -errno;
+ break;
+ }
+ bytes += nr_sent;
+ }
+
+ ASSERT_EQ(bytes, total_bytes, "send");
+
+done:
+ if (fd >= 0)
+ close(fd);
+ if (err) {
+ WRITE_ONCE(stop, 1);
+ return ERR_PTR(err);
+ }
+ return NULL;
+}
+
+static void do_test(char *qdisc)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+ .attach_point = BPF_TC_QDISC,
+ .parent = TC_H_ROOT,
+ .handle = 0x8000000,
+ .qdisc = qdisc);
+ struct sockaddr_in6 sa6 = {};
+ ssize_t nr_recv = 0, bytes = 0;
+ int lfd = -1, fd = -1;
+ pthread_t srv_thread;
+ socklen_t addrlen = sizeof(sa6);
+ void *thread_ret;
+ char batch[1500];
+ int err;
+
+ WRITE_ONCE(stop, 0);
+
+ err = bpf_tc_hook_create(&hook);
+ if (!ASSERT_OK(err, "attach qdisc"))
+ return;
+
+ lfd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+ if (!ASSERT_NEQ(lfd, -1, "socket")) {
+ bpf_tc_hook_destroy(&hook);
+ return;
+ }
+
+ fd = socket(AF_INET6, SOCK_STREAM, 0);
+ if (!ASSERT_NEQ(fd, -1, "socket")) {
+ bpf_tc_hook_destroy(&hook);
+ close(lfd);
+ return;
+ }
+
+ if (settimeo(lfd, 0) || settimeo(fd, 0))
+ goto done;
+
+ err = getsockname(lfd, (struct sockaddr *)&sa6, &addrlen);
+ if (!ASSERT_NEQ(err, -1, "getsockname"))
+ goto done;
+
+ /* connect to server */
+ err = connect(fd, (struct sockaddr *)&sa6, addrlen);
+ if (!ASSERT_NEQ(err, -1, "connect"))
+ goto done;
+
+ err = pthread_create(&srv_thread, NULL, server, (void *)(long)lfd);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto done;
+
+ /* recv total_bytes */
+ while (bytes < total_bytes && !READ_ONCE(stop)) {
+ nr_recv = recv(fd, &batch,
+ MIN(total_bytes - bytes, sizeof(batch)), 0);
+ if (nr_recv == -1 && errno == EINTR)
+ continue;
+ if (nr_recv == -1)
+ break;
+ bytes += nr_recv;
+ }
+
+ ASSERT_EQ(bytes, total_bytes, "recv");
+
+ WRITE_ONCE(stop, 1);
+ pthread_join(srv_thread, &thread_ret);
+ ASSERT_OK(IS_ERR(thread_ret), "thread_ret");
+
+done:
+ close(lfd);
+ close(fd);
+
+ bpf_tc_hook_destroy(&hook);
+ return;
+}
+
+static void test_fifo(void)
+{
+ struct bpf_qdisc_fifo *fifo_skel;
+ struct bpf_link *link;
+
+ fifo_skel = bpf_qdisc_fifo__open_and_load();
+ if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fifo__destroy(fifo_skel);
+ return;
+ }
+
+ do_test("bpf_fifo");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+void test_bpf_qdisc(void)
+{
+ if (test__start_subtest("fifo"))
+ test_fifo();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
new file mode 100644
index 000000000000..6ffefbd43f0c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -0,0 +1,16 @@
+#ifndef _BPF_QDISC_COMMON_H
+#define _BPF_QDISC_COMMON_H
+
+#define NET_XMIT_SUCCESS 0x00
+#define NET_XMIT_DROP 0x01 /* skb dropped */
+#define NET_XMIT_CN 0x02 /* congestion notification */
+
+#define TC_PRIO_CONTROL 7
+#define TC_PRIO_MAX 15
+
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_skb_release(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
new file mode 100644
index 000000000000..eb6272d36c77
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
@@ -0,0 +1,102 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct skb_node {
+ struct sk_buff __kptr *skb;
+ struct bpf_list_node node;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock q_fifo_lock;
+private(A) struct bpf_list_head q_fifo __contains(skb_node, node);
+
+unsigned int q_limit = 1000;
+unsigned int q_qlen = 0;
+
+SEC("struct_ops/bpf_fifo_enqueue")
+int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct skb_node *skbn;
+
+ if (q_qlen == q_limit)
+ goto drop;
+
+ skbn = bpf_obj_new(typeof(*skbn));
+ if (!skbn)
+ goto drop;
+
+ q_qlen++;
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb) //unexpected
+ bpf_qdisc_skb_drop(skb, to_free);
+
+ bpf_spin_lock(&q_fifo_lock);
+ bpf_list_push_back(&q_fifo, &skbn->node);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ return NET_XMIT_SUCCESS;
+drop:
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+}
+
+SEC("struct_ops/bpf_fifo_dequeue")
+struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
+{
+ struct bpf_list_node *node;
+ struct sk_buff *skb = NULL;
+ struct skb_node *skbn;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+ if (!node)
+ return NULL;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+ q_qlen--;
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_fifo_reset")
+void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
+{
+ struct bpf_list_node *node;
+ struct skb_node *skbn;
+ int i;
+
+ bpf_for(i, 0, q_qlen) {
+ struct sk_buff *skb = NULL;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ if (!node)
+ break;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_skb_release(skb);
+ bpf_obj_drop(skbn);
+ }
+ q_qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fifo = {
+ .enqueue = (void *)bpf_fifo_enqueue,
+ .dequeue = (void *)bpf_fifo_dequeue,
+ .reset = (void *)bpf_fifo_reset,
+ .id = "bpf_fifo",
+};
+
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (8 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 09/11] selftests: Add a basic fifo qdisc test Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-19 1:54 ` Martin KaFai Lau
2024-07-14 17:51 ` [RFC PATCH v9 11/11] selftests: Add a bpf netem " Amery Hung
` (3 subsequent siblings)
13 siblings, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
This test implements a more sophisticated qdisc in bpf. The bpf fair-
queueing (fq) qdisc gives each flow an equal chance to transmit data. It
also respects skb->tstamp for rate limiting (pacing). The implementation
does not handle flow hash collisions, nor does it recycle flows.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 24 +
.../selftests/bpf/progs/bpf_qdisc_fq.c | 623 ++++++++++++++++++
2 files changed, 647 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 295d0216e70f..394bf5a4adae 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -4,6 +4,7 @@
#include "network_helpers.h"
#include "bpf_qdisc_fifo.skel.h"
+#include "bpf_qdisc_fq.skel.h"
#ifndef ENOTSUPP
#define ENOTSUPP 524
@@ -154,8 +155,31 @@ static void test_fifo(void)
bpf_qdisc_fifo__destroy(fifo_skel);
}
+static void test_fq(void)
+{
+ struct bpf_qdisc_fq *fq_skel;
+ struct bpf_link *link;
+
+ fq_skel = bpf_qdisc_fq__open_and_load();
+ if (!ASSERT_OK_PTR(fq_skel, "bpf_qdisc_fq__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fq_skel->maps.fq);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fq__destroy(fq_skel);
+ return;
+ }
+
+ do_test("bpf_fq");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fq__destroy(fq_skel);
+}
+
void test_bpf_qdisc(void)
{
if (test__start_subtest("fifo"))
test_fifo();
+ if (test__start_subtest("fq"))
+ test_fq();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
new file mode 100644
index 000000000000..5debb045b6e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
@@ -0,0 +1,623 @@
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define NSEC_PER_USEC 1000L
+#define NSEC_PER_SEC 1000000000L
+#define PSCHED_MTU (64 * 1024 + 14)
+
+#define NUM_QUEUE_LOG 10
+#define NUM_QUEUE (1 << NUM_QUEUE_LOG)
+#define PRIO_QUEUE (NUM_QUEUE + 1)
+#define COMP_DROP_PKT_DELAY 1
+#define THROTTLED 0xffffffffffffffff
+
+/* fq configuration */
+__u64 q_flow_refill_delay = 40;
+__u64 q_horizon = 10ULL * NSEC_PER_SEC;
+__u32 q_initial_quantum = 10 * PSCHED_MTU;
+__u32 q_quantum = 2 * PSCHED_MTU;
+__u32 q_orphan_mask = 1023;
+__u32 q_flow_plimit = 100;
+__u32 q_plimit = 10000;
+__u32 q_timer_slack = 10 * NSEC_PER_USEC;
+bool q_horizon_drop = true;
+
+unsigned long time_next_delayed_flow = ~0ULL;
+unsigned long unthrottle_latency_ns = 0ULL;
+unsigned long ktime_cache = 0;
+unsigned long dequeue_now;
+unsigned int fq_qlen = 0;
+
+struct skb_node {
+ u64 tstamp;
+ struct sk_buff __kptr *skb;
+ struct bpf_rb_node node;
+};
+
+struct fq_flow_node {
+ u32 hash;
+ int credit;
+ u32 qlen;
+ u32 socket_hash;
+ u64 age;
+ u64 time_next_packet;
+ struct bpf_list_node list_node;
+ struct bpf_rb_node rb_node;
+ struct bpf_rb_root queue __contains(skb_node, node);
+ struct bpf_spin_lock lock;
+ struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+ bool stop_iter;
+ u64 expire;
+};
+
+struct fq_stashed_flow {
+ struct fq_flow_node __kptr *flow;
+};
+
+/* [NUM_QUEUE] for TC_PRIO_CONTROL
+ * [0, NUM_QUEUE - 1] for other flows
+ */
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct fq_stashed_flow);
+ __uint(max_entries, NUM_QUEUE + 1);
+} fq_stashed_flows SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+static bool bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+ void *ret;
+
+ ret = bpf_kptr_xchg(map_val, ptr);
+ if (ret) { //unexpected
+ bpf_obj_drop(ret);
+ return false;
+ }
+ return true;
+}
+
+static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
+{
+ return (struct qdisc_skb_cb *)skb->cb;
+}
+
+static int hash64(u64 val, int bits)
+{
+ return val * 0x61C8864680B583EBull >> (64 - bits);
+}
+
+static bool skb_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct skb_node *skbn_a;
+ struct skb_node *skbn_b;
+
+ skbn_a = container_of(a, struct skb_node, node);
+ skbn_b = container_of(b, struct skb_node, node);
+
+ return skbn_a->tstamp < skbn_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct fq_flow_node *flow_a;
+ struct fq_flow_node *flow_b;
+
+ flow_a = container_of(a, struct fq_flow_node, rb_node);
+ flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+ return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_front(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+}
+
+static void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_back(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+}
+
+static bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+ struct bpf_list_node *node;
+
+ bpf_spin_lock(lock);
+ node = bpf_list_pop_front(head);
+ if (node) {
+ bpf_list_push_front(head, node);
+ bpf_spin_unlock(lock);
+ return false;
+ }
+ bpf_spin_unlock(lock);
+
+ return true;
+}
+
+static void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+ flow->age = bpf_jiffies64();
+ bpf_obj_drop(flow);
+}
+
+static bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+ return flow->age != 0 && flow->age != THROTTLED;
+}
+
+static bool fq_flow_is_throttled(struct fq_flow_node *flow)
+{
+ return flow->age == THROTTLED;
+}
+
+static bool sk_listener(struct sock *sk)
+{
+ return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static int
+fq_classify(struct sk_buff *skb, u32 *hash, struct fq_stashed_flow **sflow,
+ bool *connected, u32 *sk_hash)
+{
+ struct fq_flow_node *flow;
+ struct sock *sk = skb->sk;
+
+ *connected = false;
+
+ if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+ *hash = PRIO_QUEUE;
+ } else {
+ if (!sk || sk_listener(sk)) {
+ *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+ *sk_hash = (*sk_hash << 1 | 1);
+ } else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+ *sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+ *sk_hash = (*sk_hash << 1 | 1);
+ } else {
+ *sk_hash = sk->__sk_common.skc_hash;
+ *connected = true;
+ }
+ *hash = hash64(*sk_hash, NUM_QUEUE_LOG);
+ }
+
+ *sflow = bpf_map_lookup_elem(&fq_stashed_flows, hash);
+ if (!*sflow)
+ return -1;
+
+ if ((*sflow)->flow)
+ return 0;
+
+ flow = bpf_obj_new(typeof(*flow));
+ if (!flow)
+ return -1;
+
+ flow->hash = *hash;
+ flow->credit = q_initial_quantum;
+ flow->qlen = 0;
+ flow->age = 1UL;
+ flow->time_next_packet = 0;
+
+ bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+
+ return 0;
+}
+
+static bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+ return (s64)skb->tstamp > (s64)(ktime_cache + q_horizon);
+}
+
+SEC("struct_ops/bpf_fq_enqueue")
+int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct fq_flow_node *flow = NULL, *flow_copy;
+ struct fq_stashed_flow *sflow;
+ u64 time_to_send, jiffies;
+ u32 hash, sk_hash;
+ struct skb_node *skbn;
+ bool connected;
+
+ if (fq_qlen >= q_plimit)
+ goto drop;
+
+ if (!skb->tstamp) {
+ time_to_send = ktime_cache = bpf_ktime_get_ns();
+ } else {
+ if (fq_packet_beyond_horizon(skb)) {
+ ktime_cache = bpf_ktime_get_ns();
+ if (fq_packet_beyond_horizon(skb)) {
+ if (q_horizon_drop)
+ goto drop;
+
+ skb->tstamp = ktime_cache + q_horizon;
+ }
+ }
+ time_to_send = skb->tstamp;
+ }
+
+ if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
+ goto drop;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ goto drop;
+
+ if (hash != PRIO_QUEUE) {
+ if (connected && flow->socket_hash != sk_hash) {
+ flow->credit = q_initial_quantum;
+ flow->socket_hash = sk_hash;
+ if (fq_flow_is_throttled(flow)) {
+ /* mark the flow as undetached. The reference to the
+ * throttled flow in fq_delayed will be removed later.
+ */
+ flow_copy = bpf_refcount_acquire(flow);
+ flow_copy->age = 0;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
+ }
+ flow->time_next_packet = 0ULL;
+ }
+
+ if (flow->qlen >= q_flow_plimit) {
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+ goto drop;
+ }
+
+ if (fq_flow_is_detached(flow)) {
+ if (connected)
+ flow->socket_hash = sk_hash;
+
+ flow_copy = bpf_refcount_acquire(flow);
+
+ jiffies = bpf_jiffies64();
+ if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
+ if (flow_copy->credit < q_quantum)
+ flow_copy->credit = q_quantum;
+ }
+ flow_copy->age = 0;
+ fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
+ }
+ }
+
+ skbn = bpf_obj_new(typeof(*skbn));
+ if (!skbn) {
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+ goto drop;
+ }
+
+ skbn->tstamp = skb->tstamp = time_to_send;
+
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_qdisc_skb_drop(skb, to_free);
+
+ bpf_spin_lock(&flow->lock);
+ bpf_rbtree_add(&flow->queue, &skbn->node, skb_tstamp_less);
+ bpf_spin_unlock(&flow->lock);
+
+ flow->qlen++;
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ fq_qlen++;
+ return NET_XMIT_SUCCESS;
+
+drop:
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, bool *unset_all)
+{
+ struct bpf_rb_node *node = NULL;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_delayed_lock);
+
+ node = bpf_rbtree_first(&fq_delayed);
+ if (!node) {
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+ if (!*unset_all && flow->time_next_packet > dequeue_now) {
+ time_next_delayed_flow = flow->time_next_packet;
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+ bpf_spin_unlock(&fq_delayed_lock);
+
+ if (!node)
+ return 1;
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+
+ /* the flow was recycled during enqueue() */
+ if (flow->age != THROTTLED) {
+ bpf_obj_drop(flow);
+ return 0;
+ }
+
+ flow->age = 0;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+
+ return 0;
+}
+
+static void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+ flow->age = THROTTLED;
+
+ if (time_next_delayed_flow > flow->time_next_packet)
+ time_next_delayed_flow = flow->time_next_packet;
+
+ bpf_spin_lock(&fq_delayed_lock);
+ bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+ bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static void fq_check_throttled(void)
+{
+ bool unset_all = false;
+ unsigned long sample;
+
+ if (time_next_delayed_flow > dequeue_now)
+ return;
+
+ sample = (unsigned long)(dequeue_now - time_next_delayed_flow);
+ unthrottle_latency_ns -= unthrottle_latency_ns >> 3;
+ unthrottle_latency_ns += sample >> 3;
+
+ time_next_delayed_flow = ~0ULL;
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+}
+
+static struct sk_buff*
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+ u64 time_next_packet, time_to_send;
+ struct bpf_rb_node *rb_node;
+ struct sk_buff *skb = NULL;
+ struct bpf_list_head *head;
+ struct bpf_list_node *node;
+ struct bpf_spin_lock *lock;
+ struct fq_flow_node *flow;
+ struct skb_node *skbn;
+ bool is_empty;
+
+ head = &fq_new_flows;
+ lock = &fq_new_flows_lock;
+ bpf_spin_lock(&fq_new_flows_lock);
+ node = bpf_list_pop_front(&fq_new_flows);
+ bpf_spin_unlock(&fq_new_flows_lock);
+ if (!node) {
+ head = &fq_old_flows;
+ lock = &fq_old_flows_lock;
+ bpf_spin_lock(&fq_old_flows_lock);
+ node = bpf_list_pop_front(&fq_old_flows);
+ bpf_spin_unlock(&fq_old_flows_lock);
+ if (!node) {
+ if (time_next_delayed_flow != ~0ULL)
+ ctx->expire = time_next_delayed_flow;
+ ctx->stop_iter = true;
+ return NULL;
+ }
+ }
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ if (flow->credit <= 0) {
+ flow->credit += q_quantum;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+ return NULL;
+ }
+
+ bpf_spin_lock(&flow->lock);
+ rb_node = bpf_rbtree_first(&flow->queue);
+ if (!rb_node) {
+ bpf_spin_unlock(&flow->lock);
+ is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+ if (head == &fq_new_flows && !is_empty)
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+ else
+ fq_flow_set_detached(flow);
+
+ return NULL;
+ }
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ time_to_send = skbn->tstamp;
+
+ time_next_packet = (time_to_send > flow->time_next_packet) ?
+ time_to_send : flow->time_next_packet;
+ if (dequeue_now < time_next_packet) {
+ bpf_spin_unlock(&flow->lock);
+ flow->time_next_packet = time_next_packet;
+ fq_flow_set_throttled(flow);
+ return NULL;
+ }
+
+ rb_node = bpf_rbtree_remove(&flow->queue, rb_node);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!rb_node)
+ goto out;
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+
+ if (!skb)
+ goto out;
+
+ flow->credit -= qdisc_skb_cb(skb)->pkt_len;
+ flow->qlen--;
+ fq_qlen--;
+
+ ctx->stop_iter = true;
+
+out:
+ fq_flows_add_head(head, lock, flow);
+ return skb;
+}
+
+static struct sk_buff *fq_dequeue_prio(void)
+{
+ struct fq_flow_node *flow = NULL;
+ struct fq_stashed_flow *sflow;
+ struct bpf_rb_node *rb_node;
+ struct sk_buff *skb = NULL;
+ struct skb_node *skbn;
+ u32 hash = NUM_QUEUE;
+
+ sflow = bpf_map_lookup_elem(&fq_stashed_flows, &hash);
+ if (!sflow)
+ return NULL;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ return NULL;
+
+ bpf_spin_lock(&flow->lock);
+ rb_node = bpf_rbtree_first(&flow->queue);
+ if (!rb_node) {
+ bpf_spin_unlock(&flow->lock);
+ goto xchg_flow_back;
+ }
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!rb_node) {
+ skb = NULL;
+ goto xchg_flow_back;
+ }
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+
+ fq_qlen--;
+
+xchg_flow_back:
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_fq_dequeue")
+struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
+{
+ struct dequeue_nonprio_ctx cb_ctx = {};
+ struct sk_buff *skb = NULL;
+ int i;
+
+ skb = fq_dequeue_prio();
+ if (skb)
+ return skb;
+
+ ktime_cache = dequeue_now = bpf_ktime_get_ns();
+ fq_check_throttled();
+ bpf_for(i, 0, q_plimit) {
+ skb = fq_dequeue_nonprio_flows(i, &cb_ctx);
+ if (cb_ctx.stop_iter)
+ break;
+ };
+
+ if (skb)
+ return skb;
+
+ if (cb_ctx.expire)
+ bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q_timer_slack);
+
+ return NULL;
+}
+
+static int
+fq_reset_flows(u32 index, void *ctx)
+{
+ struct bpf_list_node *node;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_new_flows_lock);
+ node = bpf_list_pop_front(&fq_new_flows);
+ bpf_spin_unlock(&fq_new_flows_lock);
+ if (!node) {
+ bpf_spin_lock(&fq_old_flows_lock);
+ node = bpf_list_pop_front(&fq_old_flows);
+ bpf_spin_unlock(&fq_old_flows_lock);
+ if (!node)
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ bpf_obj_drop(flow);
+
+ return 0;
+}
+
+static int
+fq_reset_stashed_flows(u32 index, void *ctx)
+{
+ struct fq_flow_node *flow = NULL;
+ struct fq_stashed_flow *sflow;
+
+ sflow = bpf_map_lookup_elem(&fq_stashed_flows, &index);
+ if (!sflow)
+ return 0;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (flow)
+ bpf_obj_drop(flow);
+
+ return 0;
+}
+
+SEC("struct_ops/bpf_fq_reset")
+void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
+{
+ bool unset_all = true;
+ fq_qlen = 0;
+ bpf_loop(NUM_QUEUE + 1, fq_reset_stashed_flows, NULL, 0);
+ bpf_loop(NUM_QUEUE, fq_reset_flows, NULL, 0);
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+ return;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fq = {
+ .enqueue = (void *)bpf_fq_enqueue,
+ .dequeue = (void *)bpf_fq_dequeue,
+ .reset = (void *)bpf_fq_reset,
+ .id = "bpf_fq",
+};
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest
2024-07-14 17:51 ` [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2024-07-19 1:54 ` Martin KaFai Lau
2024-07-19 18:20 ` Amery Hung
0 siblings, 1 reply; 42+ messages in thread
From: Martin KaFai Lau @ 2024-07-19 1:54 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
On 7/14/24 10:51 AM, Amery Hung wrote:
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __type(key, __u32);
> + __type(value, struct fq_stashed_flow);
> + __uint(max_entries, NUM_QUEUE + 1);
> +} fq_stashed_flows SEC(".maps");
> +
> +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> +
> +private(A) struct bpf_spin_lock fq_delayed_lock;
> +private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
> +
> +private(B) struct bpf_spin_lock fq_new_flows_lock;
> +private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
> +
> +private(C) struct bpf_spin_lock fq_old_flows_lock;
> +private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
Potentially, multiple qdisc instances will contend on these global locks. Do you
think it will be an issue in a setup where the root is mq with multiple fq(s)
below it, like mq => (fq1, fq2, fq3...)?
I guess it could be solved by storing them in the map's value so that each fq
instance uses its own lock and list/rb (?) to make it work like ".priv_size",
but more work would be needed in ".init". Not necessarily the top of the things
to tackle/optimize for now though.
[ ... ]
> +SEC("struct_ops/bpf_fq_enqueue")
> +int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> + struct bpf_sk_buff_ptr *to_free)
> +{
> + struct fq_flow_node *flow = NULL, *flow_copy;
> + struct fq_stashed_flow *sflow;
> + u64 time_to_send, jiffies;
> + u32 hash, sk_hash;
> + struct skb_node *skbn;
> + bool connected;
> +
> + if (fq_qlen >= q_plimit)
> + goto drop;
> +
> + if (!skb->tstamp) {
> + time_to_send = ktime_cache = bpf_ktime_get_ns();
> + } else {
> + if (fq_packet_beyond_horizon(skb)) {
> + ktime_cache = bpf_ktime_get_ns();
> + if (fq_packet_beyond_horizon(skb)) {
> + if (q_horizon_drop)
> + goto drop;
> +
> + skb->tstamp = ktime_cache + q_horizon;
> + }
> + }
> + time_to_send = skb->tstamp;
> + }
> +
> + if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
> + goto drop;
> +
> + flow = bpf_kptr_xchg(&sflow->flow, flow);
> + if (!flow)
> + goto drop;
> +
> + if (hash != PRIO_QUEUE) {
> + if (connected && flow->socket_hash != sk_hash) {
The commit message mentioned it does not handle hash collisions. Not a
request for now, I just want to understand if you hit some issues.
> + flow->credit = q_initial_quantum;
> + flow->socket_hash = sk_hash;
> + if (fq_flow_is_throttled(flow)) {
> + /* mark the flow as undetached. The reference to the
> + * throttled flow in fq_delayed will be removed later.
> + */
> + flow_copy = bpf_refcount_acquire(flow);
> + flow_copy->age = 0;
> + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
> + }
> + flow->time_next_packet = 0ULL;
> + }
> +
> + if (flow->qlen >= q_flow_plimit) {
> + bpf_kptr_xchg_back(&sflow->flow, flow);
> + goto drop;
> + }
> +
> + if (fq_flow_is_detached(flow)) {
> + if (connected)
> + flow->socket_hash = sk_hash;
> +
> + flow_copy = bpf_refcount_acquire(flow);
> +
> + jiffies = bpf_jiffies64();
> + if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
> + if (flow_copy->credit < q_quantum)
> + flow_copy->credit = q_quantum;
> + }
> + flow_copy->age = 0;
> + fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
> + }
> + }
> +
> + skbn = bpf_obj_new(typeof(*skbn));
> + if (!skbn) {
> > + bpf_kptr_xchg_back(&sflow->flow, flow);
Please post the patch that makes the bpf_kptr_xchg() work. It is easier if I can
try the selftests out.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest
2024-07-19 1:54 ` Martin KaFai Lau
@ 2024-07-19 18:20 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-19 18:20 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs, Dave Marchevsky
On Thu, Jul 18, 2024 at 6:54 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/14/24 10:51 AM, Amery Hung wrote:
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_ARRAY);
> > + __type(key, __u32);
> > + __type(value, struct fq_stashed_flow);
> > + __uint(max_entries, NUM_QUEUE + 1);
> > +} fq_stashed_flows SEC(".maps");
> > +
> > +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> > +
> > +private(A) struct bpf_spin_lock fq_delayed_lock;
> > +private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
> > +
> > +private(B) struct bpf_spin_lock fq_new_flows_lock;
> > +private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
> > +
> > +private(C) struct bpf_spin_lock fq_old_flows_lock;
> > +private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
>
> Potentially, multiple qdisc instances will content on these global locks. Do you
> think it will be an issue in setup like the root is mq and multiple fq(s) below
> the mq, like mq => (fq1, fq2, fq3...)?
>
> I guess it could be solved by storing them into the map's value and each fq
> instance uses its own lock and list/rb (?) to make it work like ".priv_size",
> but just more work is needed in ".init". Not necessary the top of the things to
> tackle/optimize for now though.
>
The examples in the selftests indeed do not work well for mq as they share
a global queue.
Just thinking on a higher level, a solution could be introducing some
semantics to bpf to annotate maps or graphs as private and backed by
per-qdisc privdata, so that users don't need to specify the qdisc every
time they access private data. In addition, maybe the reset mechanism
could piggyback on this.
Though it is not the highest priority, I think the final code in the
selftests should use independent queues under mq. I will fix it in a
future revision.
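For illustration, such a per-instance layout could look roughly like this with
today's primitives, keying the queue state by a per-qdisc index in an array map
so each instance serializes on its own lock; the key derivation from sch->handle
is just an assumption:

	struct fifo_instance {
		struct bpf_spin_lock lock;
		struct bpf_list_head head __contains(skb_node, node);
	};

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__type(key, __u32);
		__type(value, struct fifo_instance);
		__uint(max_entries, 16);	/* hypothetical: one slot per tx queue */
	} instances SEC(".maps");

	SEC("struct_ops/bpf_fifo_enqueue")
	int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
		     struct bpf_sk_buff_ptr *to_free)
	{
		/* hypothetical per-instance key derived from the qdisc handle */
		__u32 key = (sch->handle >> 16) & 0xf;
		struct fifo_instance *q;
		struct skb_node *skbn;

		q = bpf_map_lookup_elem(&instances, &key);
		if (!q)
			goto drop;

		skbn = bpf_obj_new(typeof(*skbn));
		if (!skbn)
			goto drop;

		skb = bpf_kptr_xchg(&skbn->skb, skb);
		if (skb)	/* unexpected */
			bpf_qdisc_skb_drop(skb, to_free);

		/* each qdisc instance serializes on its own lock */
		bpf_spin_lock(&q->lock);
		bpf_list_push_back(&q->head, &skbn->node);
		bpf_spin_unlock(&q->lock);

		return NET_XMIT_SUCCESS;
	drop:
		bpf_qdisc_skb_drop(skb, to_free);
		return NET_XMIT_DROP;
	}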
> [ ... ]
>
> > +SEC("struct_ops/bpf_fq_enqueue")
> > +int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> > + struct bpf_sk_buff_ptr *to_free)
> > +{
> > + struct fq_flow_node *flow = NULL, *flow_copy;
> > + struct fq_stashed_flow *sflow;
> > + u64 time_to_send, jiffies;
> > + u32 hash, sk_hash;
> > + struct skb_node *skbn;
> > + bool connected;
> > +
> > + if (fq_qlen >= q_plimit)
> > + goto drop;
> > +
> > + if (!skb->tstamp) {
> > + time_to_send = ktime_cache = bpf_ktime_get_ns();
> > + } else {
> > + if (fq_packet_beyond_horizon(skb)) {
> > + ktime_cache = bpf_ktime_get_ns();
> > + if (fq_packet_beyond_horizon(skb)) {
> > + if (q_horizon_drop)
> > + goto drop;
> > +
> > + skb->tstamp = ktime_cache + q_horizon;
> > + }
> > + }
> > + time_to_send = skb->tstamp;
> > + }
> > +
> > + if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
> > + goto drop;
> > +
> > + flow = bpf_kptr_xchg(&sflow->flow, flow);
> > + if (!flow)
> > + goto drop;
> > +
> > + if (hash != PRIO_QUEUE) {
> > + if (connected && flow->socket_hash != sk_hash) {
>
> The commit message mentioned it does not handle the hash collision. Not a
> request for now, I just want to understand if you hit some issues.
IIRC, when I used a hashmap for fq_stashed_flows, there were some false
negatives from the verifier. So I simplified the implementation by
rehashing the flow hash to 10 bits and using an arraymap instead. Let me
fix this and see if there are fundamental issues.
>
> > + flow->credit = q_initial_quantum;
> > + flow->socket_hash = sk_hash;
> > + if (fq_flow_is_throttled(flow)) {
> > + /* mark the flow as undetached. The reference to the
> > + * throttled flow in fq_delayed will be removed later.
> > + */
> > + flow_copy = bpf_refcount_acquire(flow);
> > + flow_copy->age = 0;
> > + fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
> > + }
> > + flow->time_next_packet = 0ULL;
> > + }
> > +
> > + if (flow->qlen >= q_flow_plimit) {
> > + bpf_kptr_xchg_back(&sflow->flow, flow);
> > + goto drop;
> > + }
> > +
> > + if (fq_flow_is_detached(flow)) {
> > + if (connected)
> > + flow->socket_hash = sk_hash;
> > +
> > + flow_copy = bpf_refcount_acquire(flow);
> > +
> > + jiffies = bpf_jiffies64();
> > + if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
> > + if (flow_copy->credit < q_quantum)
> > + flow_copy->credit = q_quantum;
> > + }
> > + flow_copy->age = 0;
> > + fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
> > + }
> > + }
> > +
> > + skbn = bpf_obj_new(typeof(*skbn));
> > + if (!skbn) {
> > + bpf_kptr_xchg_back(&sflow->flow, flow)
> Please post the patch that makes the bpf_kptr_xchg() work. It is easier if I can
> try the selftests out.
>
The offlist RFC patchset from Dave Marchevsky is now posted in reply to the
cover letter for anyone interested in trying it out. I am also copying Dave
here.
Thanks for reviewing,
Amery
^ permalink raw reply [flat|nested] 42+ messages in thread
* [RFC PATCH v9 11/11] selftests: Add a bpf netem qdisc to selftest
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (9 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 10/11] selftests: Add a bpf fq qdisc to selftest Amery Hung
@ 2024-07-14 17:51 ` Amery Hung
2024-07-17 10:13 ` [RFC PATCH v9 00/11] bpf qdisc Donald Hunter
` (2 subsequent siblings)
13 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-14 17:51 UTC (permalink / raw)
To: netdev
Cc: bpf, yangpeihao, daniel, andrii, alexei.starovoitov, martin.lau,
sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs,
ameryhung
This test implements a simple network emulator qdisc that simulates
packet loss and delay. The qdisc uses the Gilbert-Elliott model to
decide packet drops. When used with the mq qdisc, the bpf netem qdiscs
on different tx queues maintain a shared state machine through a global
bpf map.
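For illustration only (not part of the patch), userspace could seed the shared
Gilbert-Elliott state before attaching roughly as follows, assuming a matching
struct clg_state definition on the userspace side; the GOOD_STATE value and the
probabilities are made-up examples:

	static int set_gilbert_elliott(struct bpf_qdisc_netem *netem_skel)
	{
		struct clg_state state = {
			.state = 1,				/* assumed GOOD_STATE */
			.a1 = (__u32)(0.05 * UINT32_MAX),	/* P(good -> bad) */
			.a2 = (__u32)(0.95 * UINT32_MAX),	/* P(bad -> good) */
			.a3 = (__u32)(0.70 * UINT32_MAX),	/* transmit threshold in bad state */
			.a4 = (__u32)(0.001 * UINT32_MAX),	/* P(drop) in good state */
		};
		__u32 key = 0;

		/* g_clg_state is the single-entry array map shared by all instances */
		return bpf_map__update_elem(netem_skel->maps.g_clg_state,
					    &key, sizeof(key),
					    &state, sizeof(state), BPF_ANY);
	}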
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 30 ++
.../selftests/bpf/progs/bpf_qdisc_netem.c | 258 ++++++++++++++++++
2 files changed, 288 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 394bf5a4adae..ec9c0d166e89 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -6,6 +6,13 @@
#include "bpf_qdisc_fifo.skel.h"
#include "bpf_qdisc_fq.skel.h"
+struct crndstate {
+ u32 last;
+ u32 rho;
+};
+
+#include "bpf_qdisc_netem.skel.h"
+
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif
@@ -176,10 +183,33 @@ static void test_fq(void)
bpf_qdisc_fq__destroy(fq_skel);
}
+static void test_netem(void)
+{
+ struct bpf_qdisc_netem *netem_skel;
+ struct bpf_link *link;
+
+ netem_skel = bpf_qdisc_netem__open_and_load();
+ if (!ASSERT_OK_PTR(netem_skel, "bpf_qdisc_netem__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(netem_skel->maps.netem);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_netem__destroy(netem_skel);
+ return;
+ }
+
+ do_test("bpf_netem");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_netem__destroy(netem_skel);
+}
+
void test_bpf_qdisc(void)
{
if (test__start_subtest("fifo"))
test_fifo();
if (test__start_subtest("fq"))
test_fq();
+ if (test__start_subtest("netem"))
+ test_netem();
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
new file mode 100644
index 000000000000..39be88a5f16a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_netem.c
@@ -0,0 +1,258 @@
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+int q_loss_model = CLG_GILB_ELL;
+unsigned int q_limit = 1000;
+signed long q_latency = 0;
+signed long q_jitter = 0;
+unsigned int q_loss = 1;
+unsigned int q_qlen = 0;
+
+struct crndstate q_loss_cor = {.last = 0, .rho = 0,};
+struct crndstate q_delay_cor = {.last = 0, .rho = 0,};
+
+struct skb_node {
+ u64 tstamp;
+ struct sk_buff __kptr *skb;
+ struct bpf_rb_node node;
+};
+
+struct clg_state {
+ u64 state;
+ u32 a1;
+ u32 a2;
+ u32 a3;
+ u32 a4;
+ u32 a5;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, struct clg_state);
+ __uint(max_entries, 1);
+} g_clg_state SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock t_root_lock;
+private(A) struct bpf_rb_root t_root __contains(skb_node, node);
+
+static bool skb_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct skb_node *skbn_a;
+ struct skb_node *skbn_b;
+
+ skbn_a = container_of(a, struct skb_node, node);
+ skbn_b = container_of(b, struct skb_node, node);
+
+ return skbn_a->tstamp < skbn_b->tstamp;
+}
+
+static u32 get_crandom(struct crndstate *state)
+{
+ u64 value, rho;
+ unsigned long answer;
+
+ if (!state || state->rho == 0) /* no correlation */
+ return bpf_get_prandom_u32();
+
+ value = bpf_get_prandom_u32();
+ rho = (u64)state->rho + 1;
+ answer = (value * ((1ull<<32) - rho) + state->last * rho) >> 32;
+ state->last = answer;
+ return answer;
+}
+
+static s64 tabledist(s64 mu, s32 sigma, struct crndstate *state)
+{
+ u32 rnd;
+
+ if (sigma == 0)
+ return mu;
+
+ rnd = get_crandom(state);
+
+ /* default uniform distribution */
+ return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
+}
+
+static bool loss_gilb_ell(void)
+{
+ struct clg_state *clg;
+ u32 r1, r2, key = 0;
+ bool ret = false;
+
+ clg = bpf_map_lookup_elem(&g_clg_state, &key);
+ if (!clg)
+ return false;
+
+ r1 = bpf_get_prandom_u32();
+ r2 = bpf_get_prandom_u32();
+
+ switch (clg->state) {
+ case GOOD_STATE:
+ if (r1 < clg->a1)
+ __sync_val_compare_and_swap(&clg->state,
+ GOOD_STATE, BAD_STATE);
+ if (r2 < clg->a4)
+ ret = true;
+ break;
+ case BAD_STATE:
+ if (r1 < clg->a2)
+ __sync_val_compare_and_swap(&clg->state,
+ BAD_STATE, GOOD_STATE);
+ if (r2 > clg->a3)
+ ret = true;
+ }
+
+ return ret;
+}
+
+static bool loss_event(void)
+{
+ switch (q_loss_model) {
+ case CLG_RANDOM:
+ return q_loss && q_loss >= get_crandom(&q_loss_cor);
+ case CLG_GILB_ELL:
+ return loss_gilb_ell();
+ }
+
+ return false;
+}
+
+SEC("struct_ops/bpf_netem_enqueue")
+int BPF_PROG(bpf_netem_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct skb_node *skbn;
+ int count = 1;
+ s64 delay = 0;
+ u64 now;
+
+ if (loss_event())
+ --count;
+
+ if (count == 0) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ }
+
+ q_qlen++;
+ if (q_qlen > q_limit) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+ }
+
+ skbn = bpf_obj_new(typeof(*skbn));
+ if (!skbn) {
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+ }
+
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_qdisc_skb_drop(skb, to_free);
+
+ delay = tabledist(q_latency, q_jitter, &q_delay_cor);
+ now = bpf_ktime_get_ns();
+ skbn->tstamp = now + delay;
+
+ bpf_spin_lock(&t_root_lock);
+ bpf_rbtree_add(&t_root, &skbn->node, skb_tstamp_less);
+ bpf_spin_unlock(&t_root_lock);
+
+ return NET_XMIT_SUCCESS;
+}
+
+SEC("struct_ops/bpf_netem_dequeue")
+struct sk_buff *BPF_PROG(bpf_netem_dequeue, struct Qdisc *sch)
+{
+ struct sk_buff *skb = NULL;
+ struct bpf_rb_node *node;
+ struct skb_node *skbn;
+ u64 now, tstamp;
+
+ now = bpf_ktime_get_ns();
+
+ bpf_spin_lock(&t_root_lock);
+ node = bpf_rbtree_first(&t_root);
+ if (!node) {
+ bpf_spin_unlock(&t_root_lock);
+ return NULL;
+ }
+
+ skbn = container_of(node, struct skb_node, node);
+ tstamp = skbn->tstamp;
+ if (tstamp <= now) {
+ node = bpf_rbtree_remove(&t_root, node);
+ bpf_spin_unlock(&t_root_lock);
+
+ if (!node)
+ return NULL;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+
+ q_qlen--;
+ return skb;
+ }
+
+ bpf_spin_unlock(&t_root_lock);
+ bpf_qdisc_watchdog_schedule(sch, tstamp, 0);
+ return NULL;
+}
+
+SEC("struct_ops/bpf_netem_init")
+int BPF_PROG(bpf_netem_init, struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+SEC("struct_ops/bpf_netem_reset")
+void BPF_PROG(bpf_netem_reset, struct Qdisc *sch)
+{
+ struct bpf_rb_node *node;
+ struct skb_node *skbn;
+ int i;
+
+ bpf_for(i, 0, q_limit) {
+ struct sk_buff *skb = NULL;
+
+ bpf_spin_lock(&t_root_lock);
+ node = bpf_rbtree_first(&t_root);
+ if (!node) {
+ bpf_spin_unlock(&t_root_lock);
+ break;
+ }
+
+ skbn = container_of(node, struct skb_node, node);
+ node = bpf_rbtree_remove(&t_root, node);
+ bpf_spin_unlock(&t_root_lock);
+
+ if (!node)
+ continue;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_skb_release(skb);
+ bpf_obj_drop(skbn);
+ }
+ q_qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops netem = {
+ .enqueue = (void *)bpf_netem_enqueue,
+ .dequeue = (void *)bpf_netem_dequeue,
+ .init = (void *)bpf_netem_init,
+ .reset = (void *)bpf_netem_reset,
+ .id = "bpf_netem",
+};
+
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 00/11] bpf qdisc
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (10 preceding siblings ...)
2024-07-14 17:51 ` [RFC PATCH v9 11/11] selftests: Add a bpf netem " Amery Hung
@ 2024-07-17 10:13 ` Donald Hunter
2024-07-17 22:04 ` Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs Amery Hung
2024-07-23 0:19 ` [RFC PATCH v9 00/11] bpf qdisc Alexei Starovoitov
13 siblings, 1 reply; 42+ messages in thread
From: Donald Hunter @ 2024-07-17 10:13 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs
Amery Hung <ameryhung@gmail.com> writes:
> Hi all,
>
> This patchset aims to support implementing qdisc using bpf struct_ops.
> This version takes a step back and only implements the minimum support
> for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
> directly and 2) classful qdisc are deferred to future patchsets.
How do you build with this patchset?
I had to build with the following to get the selftests to build:
CONFIG_NET_SCH_NETEM=y
CONFIG_NET_FOU=y
> * Miscellaneous notes *
>
> The bpf qdiscs in selftest requires support of exchanging kptr into
> allocated objects (local kptr), which Dave Marchevsky developed and
> kindly sent me as off-list patchset.
It's impossible to try out this patchset without the kptr patches. Can
you include those patches here?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [RFC PATCH v9 00/11] bpf qdisc
2024-07-17 10:13 ` [RFC PATCH v9 00/11] bpf qdisc Donald Hunter
@ 2024-07-17 22:04 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-17 22:04 UTC (permalink / raw)
To: Donald Hunter
Cc: netdev, bpf, yangpeihao, daniel, andrii, alexei.starovoitov,
martin.lau, sinquersw, toke, jhs, jiri, sdf, xiyou.wangcong,
yepeilin.cs, Dave Marchevsky
On Wed, Jul 17, 2024 at 3:13 AM Donald Hunter <donald.hunter@gmail.com> wrote:
>
> Amery Hung <ameryhung@gmail.com> writes:
>
> > Hi all,
> >
> > This patchset aims to support implementing qdisc using bpf struct_ops.
> > This version takes a step back and only implements the minimum support
> > for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
> > directly and 2) classful qdisc are deferred to future patchsets.
>
> How do you build with this patchset?
>
> I had to build with the following to get the selftests to build:
>
> CONFIG_NET_SCH_NETEM=y
> CONFIG_NET_FOU=y
>
There are config issues in my code. bpf qdisc depends on
CONFIG_NET_SCHED. Therefore, I will create a config entry,
CONFIG_NET_SCH_BPF, for bpf qdisc and place it under CONFIG_NET_SCHED
in Kconfig. The selftests will then require CONFIG_NET_SCH_BPF to
build.
I will send the fixed patches in reply to the problematic ones.
Sorry for the inconvenience.
> > * Miscellaneous notes *
> >
> > The bpf qdiscs in selftest requires support of exchanging kptr into
> > allocated objects (local kptr), which Dave Marchevsky developed and
> > kindly sent me as off-list patchset.
>
> It's impossible to try out this patchset without the kptr patches. Can
> you include those patches here?
Will do.
^ permalink raw reply [flat|nested] 42+ messages in thread
* [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (11 preceding siblings ...)
2024-07-17 10:13 ` [RFC PATCH v9 00/11] bpf qdisc Donald Hunter
@ 2024-07-19 17:21 ` Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 2/4] bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST Amery Hung
` (2 more replies)
2024-07-23 0:19 ` [RFC PATCH v9 00/11] bpf qdisc Alexei Starovoitov
13 siblings, 3 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-19 17:21 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Dave Marchevsky <davemarchevsky@fb.com>
Currently btf_parse_fields is used in two places to create struct
btf_record's for structs: when looking at mapval type, and when looking
at any struct in program BTF. The former looks for kptr fields while the
latter does not. This patch modifies the btf_parse_fields call made when
looking at prog BTF struct types to search for kptrs as well.
Before this series there was no reason to search for kptrs in non-mapval
types: a referenced kptr needs some owner to guarantee resource cleanup,
and map values were the only owner that supported this. If a struct with
a kptr field were to have some non-kptr-aware owner, the kptr field
might not be properly cleaned up and result in resources leaking. Only
searching for kptr fields in mapval was a simple way to avoid this
problem.
In practice, though, searching for BPF_KPTR when populating
struct_meta_tab does not expose us to this risk, as struct_meta_tab is
only accessed through btf_find_struct_meta helper, and that helper is
only called in contexts where recognizing the kptr field is safe:
* PTR_TO_BTF_ID reg w/ MEM_ALLOC flag
* Such a reg is a local kptr and must be free'd via bpf_obj_drop,
which will correctly handle kptr field
* When handling specific kfuncs which either expect MEM_ALLOC input or
return MEM_ALLOC output (obj_{new,drop}, percpu_obj_{new,drop},
list+rbtree funcs, refcount_acquire)
* Will correctly handle kptr field for same reasons as above
* When looking at kptr pointee type
* Called by functions which implement "correct kptr resource
handling"
* In btf_check_and_fixup_fields
* Helper that ensures no ownership loops for lists and rbtrees,
doesn't care about kptr field existence
So we should be able to find BPF_KPTR fields in all prog BTF structs
without leaking resources.
Further patches in the series will build on this change to support
kptr_xchg into non-mapval local kptr. Without this change there would be
no kptr field found in such a type.
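For illustration (not part of this patch), the pattern the series ultimately
enables mirrors the bpf qdisc selftests: a kptr field declared in a prog-BTF
struct, with bpf_kptr_xchg() stashing a referenced skb into it:

	struct skb_node {
		struct sk_buff __kptr *skb;	/* kptr field inside a prog-BTF struct */
		struct bpf_list_node node;
	};

	/* in .enqueue: stash the referenced skb into the local kptr slot */
	skbn = bpf_obj_new(typeof(*skbn));
	if (!skbn)
		goto drop;

	skb = bpf_kptr_xchg(&skbn->skb, skb);
	if (skb)	/* unexpected: slot already held an skb */
		bpf_qdisc_skb_drop(skb, to_free);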
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/btf.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 520f49f422fe..967246ecd3cb 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5585,7 +5585,8 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
type = &tab->types[tab->cnt];
type->btf_id = i;
record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
- BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT, t->size);
+ BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
+ BPF_KPTR, t->size);
/* The record cannot be unset, treat it as an error if so */
if (IS_ERR_OR_NULL(record)) {
ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [OFFLIST RFC 2/4] bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST
2024-07-19 17:21 ` [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs Amery Hung
@ 2024-07-19 17:21 ` Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 4/4] selftests/bpf: Test bpf_kptr_xchg stashing " Amery Hung
2 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-19 17:21 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Dave Marchevsky <davemarchevsky@fb.com>
ARG_PTR_TO_KPTR is currently only used by the bpf_kptr_xchg helper.
Although it limits reg types for that helper's first arg to
PTR_TO_MAP_VALUE, any arbitrary mapval won't do: further custom
verification logic ensures that the mapval reg being xchgd-into is
pointing to a kptr field. If this is not the case, it's not safe to xchg
into that reg's pointee.
Let's rename the bpf_arg_type to more accurately describe the fairly
specific expectations that this arg type encodes.
This is a nonfunctional change.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
include/linux/bpf.h | 2 +-
kernel/bpf/helpers.c | 2 +-
kernel/bpf/verifier.c | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4f1d4a97b9d1..cc460786da9b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -743,7 +743,7 @@ enum bpf_arg_type {
ARG_PTR_TO_STACK, /* pointer to stack */
ARG_PTR_TO_CONST_STR, /* pointer to a null terminated read-only string */
ARG_PTR_TO_TIMER, /* pointer to bpf_timer */
- ARG_PTR_TO_KPTR, /* pointer to referenced kptr */
+ ARG_KPTR_XCHG_DEST, /* pointer to destination that kptrs are bpf_kptr_xchg'd into */
ARG_PTR_TO_DYNPTR, /* pointer to bpf_dynptr. See bpf_type_flag for dynptr type */
__BPF_ARG_TYPE_MAX,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b5f0adae8293..c038c3e03019 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1635,7 +1635,7 @@ static const struct bpf_func_proto bpf_kptr_xchg_proto = {
.gpl_only = false,
.ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
.ret_btf_id = BPF_PTR_POISON,
- .arg1_type = ARG_PTR_TO_KPTR,
+ .arg1_type = ARG_KPTR_XCHG_DEST,
.arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL | OBJ_RELEASE,
.arg2_btf_id = BPF_PTR_POISON,
};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8da132a1ef28..06ec18ee973c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8260,7 +8260,7 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
-static const struct bpf_reg_types kptr_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types kptr_xchg_dest_types = { .types = { PTR_TO_MAP_VALUE } };
static const struct bpf_reg_types dynptr_types = {
.types = {
PTR_TO_STACK,
@@ -8292,7 +8292,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
[ARG_PTR_TO_STACK] = &stack_ptr_types,
[ARG_PTR_TO_CONST_STR] = &const_str_ptr_types,
[ARG_PTR_TO_TIMER] = &timer_types,
- [ARG_PTR_TO_KPTR] = &kptr_types,
+ [ARG_KPTR_XCHG_DEST] = &kptr_xchg_dest_types,
[ARG_PTR_TO_DYNPTR] = &dynptr_types,
};
@@ -8892,7 +8892,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
return err;
break;
}
- case ARG_PTR_TO_KPTR:
+ case ARG_KPTR_XCHG_DEST:
err = process_kptr_func(env, regno, meta);
if (err)
return err;
--
2.20.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr
2024-07-19 17:21 ` [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 2/4] bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST Amery Hung
@ 2024-07-19 17:21 ` Amery Hung
2024-07-23 0:18 ` Alexei Starovoitov
2024-07-19 17:21 ` [OFFLIST RFC 4/4] selftests/bpf: Test bpf_kptr_xchg stashing " Amery Hung
2 siblings, 1 reply; 42+ messages in thread
From: Amery Hung @ 2024-07-19 17:21 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/verifier.c | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 06ec18ee973c..39929569ae58 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7664,29 +7664,38 @@ static int process_kptr_func(struct bpf_verifier_env *env, int regno,
struct bpf_call_arg_meta *meta)
{
struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
- struct bpf_map *map_ptr = reg->map_ptr;
struct btf_field *kptr_field;
+ struct bpf_map *map_ptr;
+ struct btf_record *rec;
u32 kptr_off;
+ if (type_is_ptr_alloc_obj(reg->type)) {
+ rec = reg_btf_record(reg);
+ } else { /* PTR_TO_MAP_VALUE */
+ map_ptr = reg->map_ptr;
+ if (!map_ptr->btf) {
+ verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
+ map_ptr->name);
+ return -EINVAL;
+ }
+ rec = map_ptr->record;
+ meta->map_ptr = map_ptr;
+ }
+
if (!tnum_is_const(reg->var_off)) {
verbose(env,
"R%d doesn't have constant offset. kptr has to be at the constant offset\n",
regno);
return -EINVAL;
}
- if (!map_ptr->btf) {
- verbose(env, "map '%s' has to have BTF in order to use bpf_kptr_xchg\n",
- map_ptr->name);
- return -EINVAL;
- }
- if (!btf_record_has_field(map_ptr->record, BPF_KPTR)) {
- verbose(env, "map '%s' has no valid kptr\n", map_ptr->name);
+
+ if (!btf_record_has_field(rec, BPF_KPTR)) {
+ verbose(env, "R%d has no valid kptr\n", regno);
return -EINVAL;
}
- meta->map_ptr = map_ptr;
kptr_off = reg->off + reg->var_off.value;
- kptr_field = btf_record_find(map_ptr->record, kptr_off, BPF_KPTR);
+ kptr_field = btf_record_find(rec, kptr_off, BPF_KPTR);
if (!kptr_field) {
verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
return -EACCES;
@@ -8260,7 +8269,12 @@ static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
static const struct bpf_reg_types const_str_ptr_types = { .types = { PTR_TO_MAP_VALUE } };
static const struct bpf_reg_types timer_types = { .types = { PTR_TO_MAP_VALUE } };
-static const struct bpf_reg_types kptr_xchg_dest_types = { .types = { PTR_TO_MAP_VALUE } };
+static const struct bpf_reg_types kptr_xchg_dest_types = {
+ .types = {
+ PTR_TO_MAP_VALUE,
+ PTR_TO_BTF_ID | MEM_ALLOC
+ }
+};
static const struct bpf_reg_types dynptr_types = {
.types = {
PTR_TO_STACK,
@@ -8331,7 +8345,7 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
if (base_type(arg_type) == ARG_PTR_TO_MEM)
type &= ~DYNPTR_TYPE_FLAG_MASK;
- if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type)) {
+ if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type) && regno > 1) {
type &= ~MEM_ALLOC;
type &= ~MEM_PERCPU;
}
@@ -8424,7 +8438,7 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
verbose(env, "verifier internal error: unimplemented handling of MEM_ALLOC\n");
return -EFAULT;
}
- if (meta->func_id == BPF_FUNC_kptr_xchg) {
+ if (meta->func_id == BPF_FUNC_kptr_xchg && regno > 1) {
if (map_kptr_match_type(env, meta->kptr_field, reg, regno))
return -EACCES;
}
@@ -8735,7 +8749,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
meta->release_regno = regno;
}
- if (reg->ref_obj_id) {
+ if (reg->ref_obj_id && base_type(arg_type) != ARG_KPTR_XCHG_DEST) {
if (meta->ref_obj_id) {
verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
regno, reg->ref_obj_id,
--
2.20.1
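The verifier changes above lift the PTR_TO_MAP_VALUE-only restriction on bpf_kptr_xchg()'s destination, so a kptr field inside a bpf_obj_new()'d object (a local kptr) can serve as the exchange slot as well. Below is a minimal BPF-program-side sketch of that pattern; the struct name and the acquire/release kfuncs are hypothetical placeholders, while bpf_obj_new()/bpf_obj_drop() are the wrappers from the selftests' bpf_experimental.h. The actual selftest added in patch 4/4 exercises this with prog_test_ref_kfunc and bpf_kfunc_call_test_acquire()/release().

/*
 * Sketch only: 'struct inner_obj' and its acquire/release kfuncs are
 * hypothetical stand-ins, not kfuncs from the kernel tree.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"	/* bpf_obj_new()/bpf_obj_drop() wrappers */

struct inner_obj;						/* hypothetical */
extern struct inner_obj *inner_obj_acquire(void) __ksym;	/* hypothetical KF_ACQUIRE kfunc */
extern void inner_obj_release(struct inner_obj *p) __ksym;	/* hypothetical KF_RELEASE kfunc */

/* Program-BTF struct; the __kptr field inside it is the xchg destination. */
struct node {
	struct inner_obj __kptr *stashed;
	long key;
};

static int stash_in_local_kptr(long key)
{
	struct inner_obj *inner, *old;
	struct node *n;

	inner = inner_obj_acquire();
	if (!inner)
		return 1;

	n = bpf_obj_new(typeof(*n));
	if (!n) {
		inner_obj_release(inner);	/* do not leak the reference */
		return 2;
	}
	n->key = key;

	/* The destination here is PTR_TO_BTF_ID | MEM_ALLOC rather than a
	 * map value; the verifier only accepts this with the patch above.
	 */
	old = bpf_kptr_xchg(&n->stashed, inner);
	if (old)				/* cannot happen for a fresh node */
		inner_obj_release(old);

	/* Take the stashed pointer back out and release everything. */
	old = bpf_kptr_xchg(&n->stashed, NULL);
	if (old)
		inner_obj_release(old);
	bpf_obj_drop(n);
	return 0;
}

SEC("tc")
int stash_prog(struct __sk_buff *skb)
{
	return stash_in_local_kptr(skb->len);
}

char _license[] SEC("license") = "GPL";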
* Re: [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr
2024-07-19 17:21 ` [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr Amery Hung
@ 2024-07-23 0:18 ` Alexei Starovoitov
2024-07-24 0:08 ` Amery Hung
0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2024-07-23 0:18 UTC (permalink / raw)
To: Amery Hung
Cc: andrii, bpf, daniel, jhs, jiri, martin.lau, netdev, sdf,
sinquersw, toke, xiyou.wangcong, yangpeihao, yepeilin.cs,
donald.hunter
On Fri, Jul 19, 2024 at 05:21:18PM +0000, Amery Hung wrote:
> From: Dave Marchevsky <davemarchevsky@fb.com>
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Amery,
please add your SOB after Dave's when you're sending patches like this.
Remove OFFLIST in subject... and resend cc-ing bpf@vger.
Add proper commit log.
> - if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type)) {
> + if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type) && regno > 1) {
I don't understand the point of regno > 1. Pls explain/add comment.
Patches 1 and 2 make sense.
* Re: [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr
2024-07-23 0:18 ` Alexei Starovoitov
@ 2024-07-24 0:08 ` Amery Hung
0 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-24 0:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: andrii, bpf, daniel, jhs, jiri, martin.lau, netdev, sdf,
sinquersw, toke, xiyou.wangcong, yangpeihao, yepeilin.cs,
donald.hunter
On Mon, Jul 22, 2024 at 5:18 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 19, 2024 at 05:21:18PM +0000, Amery Hung wrote:
> > From: Dave Marchevsky <davemarchevsky@fb.com>
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>
> Amery,
> please add your SOB after Dave's when you're sending patches like this.
>
> Remove OFFLIST in subject... and resend cc-ing bpf@vger.
>
> Add proper commit log.
Got it.
>
> > - if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type)) {
> > + if (meta->func_id == BPF_FUNC_kptr_xchg && type_is_alloc(type) && regno > 1) {
>
> I don't understand the point of regno > 1. Pls explain/add comment.
>
> Patches 1 and 2 make sense.
I believe this patchset is incomplete, and it will trigger a refcount
bug when running local_kptr_stash/local_kptr_stash_simple. I will
finish and resend this individual series.
Thank you,
Amery
* [OFFLIST RFC 4/4] selftests/bpf: Test bpf_kptr_xchg stashing into local kptr
2024-07-19 17:21 ` [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 2/4] bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST Amery Hung
2024-07-19 17:21 ` [OFFLIST RFC 3/4] bpf: Support bpf_kptr_xchg into local kptr Amery Hung
@ 2024-07-19 17:21 ` Amery Hung
2 siblings, 0 replies; 42+ messages in thread
From: Amery Hung @ 2024-07-19 17:21 UTC (permalink / raw)
To: ameryhung
Cc: alexei.starovoitov, andrii, bpf, daniel, jhs, jiri, martin.lau,
netdev, sdf, sinquersw, toke, xiyou.wangcong, yangpeihao,
yepeilin.cs, donald.hunter
From: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
.../selftests/bpf/progs/local_kptr_stash.c | 21 +++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/local_kptr_stash.c b/tools/testing/selftests/bpf/progs/local_kptr_stash.c
index 75043ffc5dad..8532abcae5c0 100644
--- a/tools/testing/selftests/bpf/progs/local_kptr_stash.c
+++ b/tools/testing/selftests/bpf/progs/local_kptr_stash.c
@@ -11,6 +11,7 @@
struct node_data {
long key;
long data;
+ struct prog_test_ref_kfunc __kptr *stashed_in_node;
struct bpf_rb_node node;
};
@@ -85,17 +86,33 @@ static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
static int create_and_stash(int idx, int val)
{
+ struct prog_test_ref_kfunc *inner;
struct map_value *mapval;
struct node_data *res;
+ unsigned long dummy;
mapval = bpf_map_lookup_elem(&some_nodes, &idx);
if (!mapval)
return 1;
+ dummy = 0;
+ inner = bpf_kfunc_call_test_acquire(&dummy);
+ if (!inner)
+ return 2;
+
res = bpf_obj_new(typeof(*res));
- if (!res)
- return 1;
+ if (!res) {
+ bpf_kfunc_call_test_release(inner);
+ return 3;
+ }
res->key = val;
+ inner = bpf_kptr_xchg(&res->stashed_in_node, inner);
+ if (inner) {
+ /* Should never happen, we just obj_new'd res */
+ bpf_kfunc_call_test_release(inner);
+ bpf_obj_drop(res);
+ return 4;
+ }
res = bpf_kptr_xchg(&mapval->node, res);
if (res)
--
2.20.1
* Re: [RFC PATCH v9 00/11] bpf qdisc
2024-07-14 17:51 [RFC PATCH v9 00/11] bpf qdisc Amery Hung
` (12 preceding siblings ...)
2024-07-19 17:21 ` [OFFLIST RFC 1/4] bpf: Search for kptrs in prog BTF structs Amery Hung
@ 2024-07-23 0:19 ` Alexei Starovoitov
13 siblings, 0 replies; 42+ messages in thread
From: Alexei Starovoitov @ 2024-07-23 0:19 UTC (permalink / raw)
To: Amery Hung
Cc: netdev, bpf, yangpeihao, daniel, andrii, martin.lau, sinquersw,
toke, jhs, jiri, sdf, xiyou.wangcong, yepeilin.cs
On Sun, Jul 14, 2024 at 05:51:19PM +0000, Amery Hung wrote:
> * Miscellaneous notes *
>
> The bpf qdiscs in the selftests require support for exchanging kptrs into
> allocated objects (local kptrs), which Dave Marchevsky developed and
> kindly sent me as an off-list patchset.
Since there is a dependency pls focus on landing those first.