* [PATCH bpf-next v7 00/10] bpf qdisc
@ 2025-04-09 21:45 Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 01/10] bpf: Prepare to reuse get_ctx_arg_idx Amery Hung
` (10 more replies)
0 siblings, 11 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:45 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
Hi all,
This patchset aims to support implementing qdisc using bpf struct_ops.
This version takes a step back and only implements the minimum support
for bpf qdisc; 1) support for adding skbs to bpf_list and bpf_rbtree
directly and 2) classful qdisc support are deferred to future
patchsets. In addition, we only allow attaching bpf qdisc to root or
mq for now. This is to prevent accidentally breaking existing classful
qdiscs that rely on data in a child qdisc. This limit may be lifted in
the future after careful inspection.
* Overview *
This series supports implementing qdisc using bpf struct_ops. bpf qdisc
aims to be a flexible and easy-to-use infrastructure that allows users to
quickly experiment with different scheduling algorithms/policies. It only
requires users to implement core qdisc logic using bpf and implements the
mundane part for them. In addition, the ability to easily communicate
between qdisc and other components will also bring new opportunities for
new applications and optimizations.
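As a rough sketch of the programming model (condensed from the fifo
selftest in patch 8; queue storage, locking, and error handling are
omitted), a qdisc is a set of struct_ops programs tied together by a
Qdisc_ops map:

  SEC("struct_ops/bpf_fifo_enqueue")
  int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
               struct bpf_sk_buff_ptr *to_free)
  {
          /* stash the skb kptr into a bpf_list/bpf_rbtree node here */
          return NET_XMIT_SUCCESS;
  }

  SEC("struct_ops/bpf_fifo_dequeue")
  struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
  {
          /* pop a node and return its skb kptr, or NULL if empty */
          return NULL;
  }

  SEC(".struct_ops")
  struct Qdisc_ops fifo = {
          .enqueue = (void *)bpf_fifo_enqueue,
          .dequeue = (void *)bpf_fifo_dequeue,
          .id      = "bpf_fifo",
  };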
* Performance of bpf qdisc *
This patchset includes two qdisc examples, bpf_fifo and bpf_fq, for
__testing__ purposes. For the performance tests, we compare the
selftests with their kernel counterparts to give you a sense of the
performance of qdiscs implemented in bpf.
The implementation of bpf_fq is fairly complex and slightly different
from fq, so later we only compare the two fifo qdiscs. bpf_fq
implements a scheduling algorithm similar to fq before commit
29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling") was
introduced. bpf_fifo uses a single bpf_list as a queue instead of the
three queues for different priorities in pfifo_fast. The time
complexity of the two fifos should nevertheless be similar since the
queue selection time is negligible.
Test setup:
client -> qdisc -------------> server
          ~~~~~~~~~~~~~~~      ~~~~~~
          nested VM1 @ DC1     VM2 @ DC2
Throughput: iperf3 -t 600, 5 times
Qdisc        Average (GBits/sec)
----------   -------------------
pfifo_fast   12.52 ± 0.26
bpf_fifo     11.72 ± 0.32
fq           10.24 ± 0.13
bpf_fq       11.92 ± 0.64
Latency: sockperf pp --tcp -t 600, 5 times
Qdisc        Average (usec)
----------   --------------
pfifo_fast   244.58 ± 7.93
bpf_fifo     244.92 ± 15.22
fq           234.30 ± 19.25
bpf_fq       221.34 ± 10.76
Looking at the two fifo qdiscs, the 6.4% drop in throughput in the bpf
implementation is consistent with the previous observation (v8
throughput test on a loopback device). This could be mitigated by
supporting adding skbs to bpf_list or bpf_rbtree directly in the
future.
* Clean up skb in bpf qdisc during reset *
The current implementation relies on bpf qdisc implementors to
correctly release skbs in queues (bpf graphs or maps) in .reset, which
might not be a safe thing to rely on. The solution, as Martin has
suggested, would be to support private data in struct_ops. This would
also help simplify the implementation of qdiscs that work with mq. For
example, the qdiscs in the selftest mostly use global data; even if
users add multiple qdisc instances under mq, they would still share
the same queue.
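For reference, the pattern implementors currently have to follow in
.reset (lifted from the fifo selftest in patch 8) is to drain the
collection and free every stashed skb by hand:

  SEC("struct_ops/bpf_fifo_reset")
  void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
  {
          struct bpf_list_node *node;
          struct skb_node *skbn;
          int i;

          bpf_for(i, 0, sch->q.qlen) {
                  struct sk_buff *skb = NULL;

                  bpf_spin_lock(&q_fifo_lock);
                  node = bpf_list_pop_front(&q_fifo);
                  bpf_spin_unlock(&q_fifo_lock);
                  if (!node)
                          break;

                  skbn = container_of(node, struct skb_node, node);
                  skb = bpf_kptr_xchg(&skbn->skb, skb);
                  if (skb)
                          bpf_kfree_skb(skb); /* forgetting this leaks skbs */
                  bpf_obj_drop(skbn);
          }
          sch->q.qlen = 0;
  }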
---
v7:
* Rebase to bpf-next/master and resolve a slight merge conflict due to
commit 2eb6c6a34cb1 ("net: move replay logic to tc_modify_qdisc")
moving a module_put(). No functional change from v6.
Link to v6: https://lore.kernel.org/bpf/20250319215358.2287371-1-ameryhung@gmail.com/
v6:
* Improve kfunc filter implementation
- Drop old patch 2 that introduces a function to get member offset
of struct_ops program
- Include patch 1 from Juntong Deng from [0], which populates
prog->aux->st_ops and attach_st_ops_member_off once and early in
the verification
- Adopt flag-based kfunc filter proposed in sched_ext [0]
* Refactor bpf_qdisc_btf_struct_access
- Merge old patch 7, which allows writes to Qdisc->qstats with
patch 3 (Support implementation of Qdisc_ops in bpf)
- Share checks of bpf_qdisc_sk_buff_access() and
bpf_qdisc_qdisc_access() in bpf_qdisc_btf_struct_access()
* Explain what the bytecode in gen_prologue/epilogue does in the comment
* Merge old patch 8, which simply allows writes to Qdisc->q.qlen and
Qdisc->limit, with patch 3 (Support implementation of Qdisc_ops in
bpf)
* Minor fixes
- Remove bpf_qdisc_get_func_proto that is no longer needed as tail
call is disabled universally for struct_ops with __ref args
- Remove .peek's cfi_stub as it is not supported
* Add SPDX license identifier
* Add Acked-by from Toke
[0] https://lore.kernel.org/all/AM6PR03MB50806070E3D56208DDB8131699C22@AM6PR03MB5080.eurprd03.prod.outlook.com/
Link to v5: https://lore.kernel.org/bpf/20250313190309.2545711-1-ameryhung@gmail.com/
v5:
* Rebase to bpf-next/master
* Remove prerequisite bpf patches that have landed
* Add Acked-by from Cong
v4:
* Rebase to bpf-next/master
* Move the root/mq attaching check to bpf_qdisc.c
(now patch 15)
* Replace netdevsim with veth for attaching mq
v3:
* Rebase to bpf-next/master
* Remove the requirement in the verifier that "__ref arguments must
acquire ref_obj_id first" by making each prog keep a copy of
arg_ctx_info and saving ref_obj_id in it.
* Generalize prog_ops_moff() to work with any struct_ops (now called
bpf_struct_ops_prog_moff())
* Use bpf_struct_ops_prog_moff(prog) instead of
prog->aux->attach_func_name to infer the ops of a program
* Limit attach to root and mq for now and add corresponding selftests
* Simplify qdisc selftests with network_helper
* Fix fq_remove_flow() not deleting the stashed flow
v2: Rebase to bpf-next/master
Patch 1-4
Remove the use of ctx_arg_info->ref_obj_id when acquiring referenced kptr from struct_ops arg
Improve type comparison when checking kptr return from struct_ops
Simplify selftests with test_loader and nomerge attribute
Patch 5
Remove redundant checks in qdisc_init
Disallow tail_call
Patch 6
Improve kfunc ops availability filter by
i) Checking struct_ops->type
ii) Defining op-specific kfunc set
Patch 7
Search and add bpf_kfunc_desc after gen_prologue/epilogue
Patch 8
Use gen_prologue/epilogue to init/cancel watchdog timer
Patch 12
Mark read-only func arg and struct member const in libbpf
Link: https://lore.kernel.org/bpf/20241220195619.2022866-1-amery.hung@gmail.com/
v1:
Fix struct_ops referenced kptr acquire/return mechanisms
Allow creating dynptr from skb
Add bpf qdisc kfunc filter
Support updating bstats and qstats
Update qdiscs in selftest to update stats
Add gc, handle hash collision and fix bugs in fq_bpf
Link: https://lore.kernel.org/bpf/20241213232958.2388301-1-amery.hung@bytedance.com/
past RFCs
v9: Drop classful qdisc operations and kfuncs
Drop support of enqueuing skb directly to bpf_rbtree/list
Link: https://lore.kernel.org/bpf/20240714175130.4051012-1-amery.hung@bytedance.com/
v8: Implement support of bpf qdisc using struct_ops
Allow struct_ops to acquire referenced kptr via argument
Allow struct_ops to release and return referenced kptr
Support enqueuing sk_buff to bpf_rbtree/list
Move examples from samples to selftests
Add a classful qdisc selftest
Link: https://lore.kernel.org/netdev/20240510192412.3297104-15-amery.hung@bytedance.com/
v7: Reference skb using kptr to sk_buff instead of __sk_buff
Use the new bpf rbtree/link to for skb queues
Add reset and init programs
Add a bpf fq qdisc sample
Add a bpf netem qdisc sample
Link: https://lore.kernel.org/netdev/cover.1705432850.git.amery.hung@bytedance.com/
v6: switch to kptr based approach
v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
implement flow map as map-in-map
rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
clean up eBPF qdisc program context
v4: get rid of PIFO, use rbtree directly
v3: move priority queue from sch_bpf to skb map
introduce skb map and its helpers
introduce bpf_skb_classify()
use netdevice notifier to reset skb's
Rebase on latest bpf-next
v2: Rebase on latest net-next
Make the code more complete (but still incomplete)
Amery Hung (10):
bpf: Prepare to reuse get_ctx_arg_idx
bpf: net_sched: Support implementation of Qdisc_ops in bpf
bpf: net_sched: Add basic bpf qdisc kfuncs
bpf: net_sched: Add a qdisc watchdog timer
bpf: net_sched: Support updating bstats
bpf: net_sched: Disable attaching bpf qdisc to non root
libbpf: Support creating and destroying qdisc
selftests/bpf: Add a basic fifo qdisc test
selftests/bpf: Add a bpf fq qdisc to selftest
selftests/bpf: Test attaching bpf qdisc to mq and non root
include/linux/btf.h | 1 +
kernel/bpf/btf.c | 6 +-
net/sched/Kconfig | 12 +
net/sched/Makefile | 1 +
net/sched/bpf_qdisc.c | 477 +++++++++++
net/sched/sch_api.c | 7 +-
net/sched/sch_generic.c | 3 +-
tools/lib/bpf/libbpf.h | 5 +-
tools/lib/bpf/netlink.c | 20 +-
tools/testing/selftests/bpf/config | 2 +
.../selftests/bpf/prog_tests/bpf_qdisc.c | 180 +++++
.../selftests/bpf/progs/bpf_qdisc_common.h | 31 +
.../selftests/bpf/progs/bpf_qdisc_fifo.c | 117 +++
.../selftests/bpf/progs/bpf_qdisc_fq.c | 750 ++++++++++++++++++
14 files changed, 1601 insertions(+), 11 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
--
2.47.1
* [PATCH bpf-next v7 01/10] bpf: Prepare to reuse get_ctx_arg_idx
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
@ 2025-04-09 21:45 ` Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
` (9 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:45 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
Rename get_ctx_arg_idx to btf_ctx_arg_idx, and allow others to call it.
No functional change.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
include/linux/btf.h | 1 +
kernel/bpf/btf.c | 6 +++---
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index ebc0c0c9b944..b2983706292f 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -522,6 +522,7 @@ bool btf_param_match_suffix(const struct btf *btf,
const char *suffix);
int btf_ctx_arg_offset(const struct btf *btf, const struct btf_type *func_proto,
u32 arg_no);
+u32 btf_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto, int off);
struct bpf_verifier_log;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 16ba36f34dfa..ac6c8ed08de3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6391,8 +6391,8 @@ static bool is_int_ptr(struct btf *btf, const struct btf_type *t)
return btf_type_is_int(t);
}
-static u32 get_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
- int off)
+u32 btf_ctx_arg_idx(struct btf *btf, const struct btf_type *func_proto,
+ int off)
{
const struct btf_param *args;
const struct btf_type *t;
@@ -6671,7 +6671,7 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type,
tname, off);
return false;
}
- arg = get_ctx_arg_idx(btf, t, off);
+ arg = btf_ctx_arg_idx(btf, t, off);
args = (const struct btf_param *)(t + 1);
/* if (t == NULL) Fall back to default BPF prog with
* MAX_BPF_FUNC_REG_ARGS u64 arguments.
--
2.47.1
* [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 01/10] bpf: Prepare to reuse get_ctx_arg_idx Amery Hung
@ 2025-04-09 21:45 ` Amery Hung
2025-04-10 23:49 ` Martin KaFai Lau
2025-04-09 21:45 ` [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
` (8 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:45 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
Recent advancements in bpf, such as allocated objects, bpf list and
bpf rbtree, have provided powerful and flexible building blocks to
realize sophisticated packet scheduling algorithms. As struct_ops now
supports core operators in Qdisc_ops, this patch starts allowing
qdiscs to be implemented using bpf struct_ops. Users can implement
Qdisc_ops.{enqueue, dequeue, init, reset, destroy} in bpf and register
the qdisc dynamically into the kernel.
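Once such a Qdisc_ops map is loaded, registering the qdisc from
userspace is just a struct_ops attach. A minimal sketch, mirroring the
fifo selftest added later in the series:

  struct bpf_qdisc_fifo *skel;
  struct bpf_link *link;

  skel = bpf_qdisc_fifo__open_and_load();
  link = bpf_map__attach_struct_ops(skel->maps.fifo);
  /* the qdisc is now registered under Qdisc_ops.id ("bpf_fifo") and
   * can be added to an interface until the link is destroyed
   */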
Co-developed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
net/sched/Kconfig | 12 +++
net/sched/Makefile | 1 +
net/sched/bpf_qdisc.c | 232 ++++++++++++++++++++++++++++++++++++++++
net/sched/sch_api.c | 7 +-
net/sched/sch_generic.c | 3 +-
5 files changed, 251 insertions(+), 4 deletions(-)
create mode 100644 net/sched/bpf_qdisc.c
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..ccd0255da5a5 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,18 @@ config NET_SCH_ETS
If unsure, say N.
+config NET_SCH_BPF
+ bool "BPF-based Qdisc"
+ depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
+ help
+ This option allows BPF-based queueing disciplines. With BPF struct_ops,
+ users can implement supported operators in Qdisc_ops using BPF programs.
+ The queue holding skb can be built with BPF maps or graphs.
+
+ Say Y here if you want to use BPF-based Qdisc.
+
+ If unsure, say N.
+
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..904d784902d1 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o
obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o
obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o
obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF) += bpf_qdisc.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
new file mode 100644
index 000000000000..7eca556a3782
--- /dev/null
+++ b/net/sched/bpf_qdisc.c
@@ -0,0 +1,232 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/types.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/filter.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+static struct bpf_struct_ops bpf_Qdisc_ops;
+
+struct bpf_sk_buff_ptr {
+ struct sk_buff *skb;
+};
+
+static int bpf_qdisc_init(struct btf *btf)
+{
+ return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_qdisc_ids, struct, Qdisc)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ids, struct, sk_buff)
+BTF_ID_LIST_SINGLE(bpf_sk_buff_ptr_ids, struct, bpf_sk_buff_ptr)
+
+static bool bpf_qdisc_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ struct btf *btf = prog->aux->attach_btf;
+ u32 arg;
+
+ arg = btf_ctx_arg_idx(btf, prog->aux->attach_func_proto, off);
+ if (prog->aux->attach_st_ops_member_off == offsetof(struct Qdisc_ops, enqueue)) {
+ if (arg == 2 && type == BPF_READ) {
+ info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+ info->btf = btf;
+ info->btf_id = bpf_sk_buff_ptr_ids[0];
+ return true;
+ }
+ }
+
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int bpf_qdisc_qdisc_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, size_t *end)
+{
+ switch (off) {
+ case offsetof(struct Qdisc, limit):
+ *end = offsetofend(struct Qdisc, limit);
+ break;
+ case offsetof(struct Qdisc, q) + offsetof(struct qdisc_skb_head, qlen):
+ *end = offsetof(struct Qdisc, q) + offsetofend(struct qdisc_skb_head, qlen);
+ break;
+ case offsetof(struct Qdisc, qstats) ... offsetofend(struct Qdisc, qstats) - 1:
+ *end = offsetofend(struct Qdisc, qstats);
+ break;
+ default:
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static int bpf_qdisc_sk_buff_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, size_t *end)
+{
+ switch (off) {
+ case offsetof(struct sk_buff, tstamp):
+ *end = offsetofend(struct sk_buff, tstamp);
+ break;
+ case offsetof(struct sk_buff, priority):
+ *end = offsetofend(struct sk_buff, priority);
+ break;
+ case offsetof(struct sk_buff, mark):
+ *end = offsetofend(struct sk_buff, mark);
+ break;
+ case offsetof(struct sk_buff, queue_mapping):
+ *end = offsetofend(struct sk_buff, queue_mapping);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
+ *end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, tc_classid);
+ break;
+ case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
+ offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
+ data[QDISC_CB_PRIV_LEN - 1]):
+ *end = offsetof(struct sk_buff, cb) +
+ offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
+ break;
+ case offsetof(struct sk_buff, tc_index):
+ *end = offsetofend(struct sk_buff, tc_index);
+ break;
+ default:
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ const struct btf_type *t, *skbt, *qdisct;
+ size_t end;
+ int err;
+
+ skbt = btf_type_by_id(reg->btf, bpf_sk_buff_ids[0]);
+ qdisct = btf_type_by_id(reg->btf, bpf_qdisc_ids[0]);
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+
+ if (t == skbt) {
+ err = bpf_qdisc_sk_buff_access(log, reg, off, &end);
+ } else if (t == qdisct) {
+ err = bpf_qdisc_qdisc_access(log, reg, off, &end);
+ } else {
+ bpf_log(log, "only read is supported\n");
+ return -EACCES;
+ }
+
+ if (err) {
+ bpf_log(log, "no write support to %s at off %d\n",
+ btf_name_by_offset(reg->btf, t->name_off), off);
+ return -EACCES;
+ }
+
+ if (off + size > end) {
+ bpf_log(log,
+ "write access at off %d with size %d beyond the member of %s ended at %zu\n",
+ off, size, btf_name_by_offset(reg->btf, t->name_off), end);
+ return -EACCES;
+ }
+
+ return 0;
+}
+
+static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .is_valid_access = bpf_qdisc_is_valid_access,
+ .btf_struct_access = bpf_qdisc_btf_struct_access,
+};
+
+static int bpf_qdisc_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct Qdisc_ops *uqdisc_ops;
+ struct Qdisc_ops *qdisc_ops;
+ u32 moff;
+
+ uqdisc_ops = (const struct Qdisc_ops *)udata;
+ qdisc_ops = (struct Qdisc_ops *)kdata;
+
+ moff = __btf_member_bit_offset(t, member) / 8;
+ switch (moff) {
+ case offsetof(struct Qdisc_ops, peek):
+ qdisc_ops->peek = qdisc_peek_dequeued;
+ return 0;
+ case offsetof(struct Qdisc_ops, id):
+ if (bpf_obj_name_cpy(qdisc_ops->id, uqdisc_ops->id,
+ sizeof(qdisc_ops->id)) <= 0)
+ return -EINVAL;
+ return 1;
+ }
+
+ return 0;
+}
+
+static int bpf_qdisc_reg(void *kdata, struct bpf_link *link)
+{
+ return register_qdisc(kdata);
+}
+
+static void bpf_qdisc_unreg(void *kdata, struct bpf_link *link)
+{
+ return unregister_qdisc(kdata);
+}
+
+static int Qdisc_ops__enqueue(struct sk_buff *skb__ref, struct Qdisc *sch,
+ struct sk_buff **to_free)
+{
+ return 0;
+}
+
+static struct sk_buff *Qdisc_ops__dequeue(struct Qdisc *sch)
+{
+ return NULL;
+}
+
+static int Qdisc_ops__init(struct Qdisc *sch, struct nlattr *arg,
+ struct netlink_ext_ack *extack)
+{
+ return 0;
+}
+
+static void Qdisc_ops__reset(struct Qdisc *sch)
+{
+}
+
+static void Qdisc_ops__destroy(struct Qdisc *sch)
+{
+}
+
+static struct Qdisc_ops __bpf_ops_qdisc_ops = {
+ .enqueue = Qdisc_ops__enqueue,
+ .dequeue = Qdisc_ops__dequeue,
+ .init = Qdisc_ops__init,
+ .reset = Qdisc_ops__reset,
+ .destroy = Qdisc_ops__destroy,
+};
+
+static struct bpf_struct_ops bpf_Qdisc_ops = {
+ .verifier_ops = &bpf_qdisc_verifier_ops,
+ .reg = bpf_qdisc_reg,
+ .unreg = bpf_qdisc_unreg,
+ .init_member = bpf_qdisc_init_member,
+ .init = bpf_qdisc_init,
+ .name = "Qdisc_ops",
+ .cfi_stubs = &__bpf_ops_qdisc_ops,
+ .owner = THIS_MODULE,
+};
+
+static int __init bpf_qdisc_kfunc_init(void)
+{
+ return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+}
+late_initcall(bpf_qdisc_kfunc_init);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f74a097f54ae..db6330258dda 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -25,6 +25,7 @@
#include <linux/hrtimer.h>
#include <linux/slab.h>
#include <linux/hashtable.h>
+#include <linux/bpf.h>
#include <net/netdev_lock.h>
#include <net/net_namespace.h>
@@ -359,7 +360,7 @@ static struct Qdisc_ops *qdisc_lookup_ops(struct nlattr *kind)
read_lock(&qdisc_mod_lock);
for (q = qdisc_base; q; q = q->next) {
if (nla_strcmp(kind, q->id) == 0) {
- if (!try_module_get(q->owner))
+ if (!bpf_try_module_get(q, q->owner))
q = NULL;
break;
}
@@ -1370,7 +1371,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
netdev_put(dev, &sch->dev_tracker);
qdisc_free(sch);
err_out2:
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
err_out:
*errp = err;
return NULL;
@@ -1782,7 +1783,7 @@ static void request_qdisc_module(struct nlattr *kind)
ops = qdisc_lookup_ops(kind);
if (ops) {
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
return;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 14ab2f4c190a..e6fda9f20272 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -24,6 +24,7 @@
#include <linux/if_vlan.h>
#include <linux/skb_array.h>
#include <linux/if_macvlan.h>
+#include <linux/bpf.h>
#include <net/sch_generic.h>
#include <net/pkt_sched.h>
#include <net/dst.h>
@@ -1078,7 +1079,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
ops->destroy(qdisc);
lockdep_unregister_key(&qdisc->root_lock_key);
- module_put(ops->owner);
+ bpf_module_put(ops, ops->owner);
netdev_put(dev, &qdisc->dev_tracker);
trace_qdisc_destroy(qdisc);
--
2.47.1
* [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 01/10] bpf: Prepare to reuse get_ctx_arg_idx Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2025-04-09 21:45 ` Amery Hung
2025-04-11 13:32 ` Kumar Kartikeya Dwivedi
2025-04-09 21:46 ` [PATCH bpf-next v7 04/10] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
` (7 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:45 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
Add basic kfuncs for working on skbs in bpf qdisc.
Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
in .enqueue, where a to_free skb list is available from the kernel to
defer the release. bpf_kfree_skb() should be used elsewhere. It is also
used in bpf_obj_free_fields() when cleaning up skbs in maps and
collections.
bpf_skb_get_hash() returns the flow hash of an skb, which can be used
to build flow-based queueing algorithms.
Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
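Putting these together, a hedged sketch of an .enqueue program follows
(the stashing of the skb into a per-flow queue selected by the hash is
elided; see the fq selftest later in the series for a full version):

  SEC("struct_ops/bpf_enqueue")
  int BPF_PROG(bpf_enqueue, struct sk_buff *skb, struct Qdisc *sch,
               struct bpf_sk_buff_ptr *to_free)
  {
          u32 hash = bpf_skb_get_hash(skb); /* flow key for per-flow queues */

          if (sch->q.qlen == sch->limit) {
                  /* only valid in .enqueue: defers the free via to_free */
                  bpf_qdisc_skb_drop(skb, to_free);
                  return NET_XMIT_DROP;
          }

          /* stash skb (a referenced kptr) into the flow queue keyed by
           * hash and bump sch->q.qlen (elided)
           */
          return NET_XMIT_SUCCESS;
  }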
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
net/sched/bpf_qdisc.c | 111 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 110 insertions(+), 1 deletion(-)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 7eca556a3782..d812a72ca032 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -8,6 +8,9 @@
#include <net/pkt_sched.h>
#include <net/pkt_cls.h>
+#define QDISC_OP_IDX(op) (offsetof(struct Qdisc_ops, op) / sizeof(void (*)(void)))
+#define QDISC_MOFF_IDX(moff) (moff / sizeof(void (*)(void)))
+
static struct bpf_struct_ops bpf_Qdisc_ops;
struct bpf_sk_buff_ptr {
@@ -139,6 +142,95 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
return 0;
}
+__bpf_kfunc_start_defs();
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+ return skb_get_hash(skb);
+}
+
+/* bpf_kfree_skb - Release an skb's reference and drop it immediately.
+ * @skb: The skb whose reference to be released and dropped.
+ */
+__bpf_kfunc void bpf_kfree_skb(struct sk_buff *skb)
+{
+ kfree_skb(skb);
+}
+
+/* bpf_qdisc_skb_drop - Drop an skb by adding it to a deferred free list.
+ * @skb: The skb whose reference to be released and dropped.
+ * @to_free_list: The list of skbs to be dropped.
+ */
+__bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
+ struct bpf_sk_buff_ptr *to_free_list)
+{
+ __qdisc_drop(skb, (struct sk_buff **)to_free_list);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(qdisc_kfunc_ids)
+BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(qdisc_kfunc_ids)
+
+BTF_SET_START(qdisc_common_kfunc_set)
+BTF_ID(func, bpf_skb_get_hash)
+BTF_ID(func, bpf_kfree_skb)
+BTF_ID(func, bpf_dynptr_from_skb)
+BTF_SET_END(qdisc_common_kfunc_set)
+
+BTF_SET_START(qdisc_enqueue_kfunc_set)
+BTF_ID(func, bpf_qdisc_skb_drop)
+BTF_SET_END(qdisc_enqueue_kfunc_set)
+
+enum qdisc_ops_kf_flags {
+ QDISC_OPS_KF_COMMON = 0,
+ QDISC_OPS_KF_ENQUEUE = 1 << 0,
+};
+
+static const u32 qdisc_ops_context_flags[] = {
+ [QDISC_OP_IDX(enqueue)] = QDISC_OPS_KF_ENQUEUE,
+ [QDISC_OP_IDX(dequeue)] = QDISC_OPS_KF_COMMON,
+ [QDISC_OP_IDX(init)] = QDISC_OPS_KF_COMMON,
+ [QDISC_OP_IDX(reset)] = QDISC_OPS_KF_COMMON,
+ [QDISC_OP_IDX(destroy)] = QDISC_OPS_KF_COMMON,
+};
+
+static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+ u32 moff, flags;
+
+ if (!btf_id_set8_contains(&qdisc_kfunc_ids, kfunc_id))
+ return 0;
+
+ if (prog->aux->st_ops != &bpf_Qdisc_ops)
+ return -EACCES;
+
+ moff = prog->aux->attach_st_ops_member_off;
+ flags = qdisc_ops_context_flags[QDISC_MOFF_IDX(moff)];
+
+ if ((flags & QDISC_OPS_KF_ENQUEUE) &&
+ btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
+ return 0;
+
+ if (btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id))
+ return 0;
+
+ return -EACCES;
+}
+
+static const struct btf_kfunc_id_set bpf_qdisc_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &qdisc_kfunc_ids,
+ .filter = bpf_qdisc_kfunc_filter,
+};
+
static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
.get_func_proto = bpf_base_func_proto,
.is_valid_access = bpf_qdisc_is_valid_access,
@@ -225,8 +317,25 @@ static struct bpf_struct_ops bpf_Qdisc_ops = {
.owner = THIS_MODULE,
};
+BTF_ID_LIST(bpf_sk_buff_dtor_ids)
+BTF_ID(func, bpf_kfree_skb)
+
static int __init bpf_qdisc_kfunc_init(void)
{
- return register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+ int ret;
+ const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+ {
+ .btf_id = bpf_sk_buff_ids[0],
+ .kfunc_btf_id = bpf_sk_buff_dtor_ids[0]
+ },
+ };
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_qdisc_kfunc_set);
+ ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+ ARRAY_SIZE(skb_kfunc_dtors),
+ THIS_MODULE);
+ ret = ret ?: register_bpf_struct_ops(&bpf_Qdisc_ops, Qdisc_ops);
+
+ return ret;
}
late_initcall(bpf_qdisc_kfunc_init);
--
2.47.1
* [PATCH bpf-next v7 04/10] bpf: net_sched: Add a qdisc watchdog timer
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (2 preceding siblings ...)
2025-04-09 21:45 ` [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 05/10] bpf: net_sched: Support updating bstats Amery Hung
` (6 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
Add a watchdog timer to bpf qdisc. The watchdog can be used to schedule
the execution of the qdisc through the kfunc
bpf_qdisc_watchdog_schedule(). It can be useful for building traffic
shaping algorithms, where the time the next packet will be dequeued is
known.
The implementation relies on struct_ops gen_prologue/epilogue to patch
bpf programs provided by users. Operator-specific prologue/epilogue
kfuncs are introduced instead of watchdog kfuncs so that it is easier
to extend the prologue/epilogue in the future (writing C vs BPF
bytecode).
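A typical .dequeue use, loosely sketched after the fq selftest
(next_tstamp is an illustrative global tracking the earliest eligible
transmit time, not an API added by this patch):

  u64 next_tstamp; /* illustrative: maintained by .enqueue */

  SEC("struct_ops/bpf_dequeue")
  struct sk_buff *BPF_PROG(bpf_dequeue, struct Qdisc *sch)
  {
          u64 now = bpf_ktime_get_ns();

          if (next_tstamp > now) {
                  /* nothing eligible yet; re-run the qdisc then */
                  bpf_qdisc_watchdog_schedule(sch, next_tstamp, 0);
                  return NULL;
          }

          /* otherwise pop and return an skb (elided) */
          return NULL;
  }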
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
net/sched/bpf_qdisc.c | 106 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 105 insertions(+), 1 deletion(-)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index d812a72ca032..5f4ab4877535 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -13,6 +13,10 @@
static struct bpf_struct_ops bpf_Qdisc_ops;
+struct bpf_sched_data {
+ struct qdisc_watchdog watchdog;
+};
+
struct bpf_sk_buff_ptr {
struct sk_buff *skb;
};
@@ -142,6 +146,56 @@ static int bpf_qdisc_btf_struct_access(struct bpf_verifier_log *log,
return 0;
}
+BTF_ID_LIST(bpf_qdisc_init_prologue_ids)
+BTF_ID(func, bpf_qdisc_init_prologue)
+
+static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
+ const struct bpf_prog *prog)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ if (prog->aux->attach_st_ops_member_off != offsetof(struct Qdisc_ops, init))
+ return 0;
+
+ /* r6 = r1; // r6 will be "u64 *ctx". r1 is "u64 *ctx".
+ * r1 = r1[0]; // r1 will be "struct Qdisc *sch"
+ * r0 = bpf_qdisc_init_prologue(r1);
+ * r1 = r6; // r1 will be "u64 *ctx".
+ */
+ *insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
+ *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
+ *insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_init_prologue_ids[0]);
+ *insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
+ *insn++ = prog->insnsi[0];
+
+ return insn - insn_buf;
+}
+
+BTF_ID_LIST(bpf_qdisc_reset_destroy_epilogue_ids)
+BTF_ID(func, bpf_qdisc_reset_destroy_epilogue)
+
+static int bpf_qdisc_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog,
+ s16 ctx_stack_off)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ if (prog->aux->attach_st_ops_member_off != offsetof(struct Qdisc_ops, reset) &&
+ prog->aux->attach_st_ops_member_off != offsetof(struct Qdisc_ops, destroy))
+ return 0;
+
+ /* r1 = stack[ctx_stack_off]; // r1 will be "u64 *ctx"
+ * r1 = r1[0]; // r1 will be "struct Qdisc *sch"
+ * r0 = bpf_qdisc_reset_destroy_epilogue(r1);
+ * BPF_EXIT;
+ */
+ *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off);
+ *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
+ *insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_reset_destroy_epilogue_ids[0]);
+ *insn++ = BPF_EXIT_INSN();
+
+ return insn - insn_buf;
+}
+
__bpf_kfunc_start_defs();
/* bpf_skb_get_hash - Get the flow hash of an skb.
@@ -170,6 +224,36 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
__qdisc_drop(skb, (struct sk_buff **)to_free_list);
}
+/* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
+ * @sch: The qdisc to be scheduled.
+ * @expire: The expiry time of the timer.
+ * @delta_ns: The slack range of the timer.
+ */
+__bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_schedule_range_ns(&q->watchdog, expire, delta_ns);
+}
+
+/* bpf_qdisc_init_prologue - Hidden kfunc called in prologue of .init. */
+__bpf_kfunc void bpf_qdisc_init_prologue(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_init(&q->watchdog, sch);
+}
+
+/* bpf_qdisc_reset_destroy_epilogue - Hidden kfunc called in epilogue of .reset
+ * and .destroy
+ */
+__bpf_kfunc void bpf_qdisc_reset_destroy_epilogue(struct Qdisc *sch)
+{
+ struct bpf_sched_data *q = qdisc_priv(sch);
+
+ qdisc_watchdog_cancel(&q->watchdog);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(qdisc_kfunc_ids)
@@ -177,6 +261,9 @@ BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_init_prologue, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_reset_destroy_epilogue, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(qdisc_kfunc_ids)
BTF_SET_START(qdisc_common_kfunc_set)
@@ -187,16 +274,22 @@ BTF_SET_END(qdisc_common_kfunc_set)
BTF_SET_START(qdisc_enqueue_kfunc_set)
BTF_ID(func, bpf_qdisc_skb_drop)
+BTF_ID(func, bpf_qdisc_watchdog_schedule)
BTF_SET_END(qdisc_enqueue_kfunc_set)
+BTF_SET_START(qdisc_dequeue_kfunc_set)
+BTF_ID(func, bpf_qdisc_watchdog_schedule)
+BTF_SET_END(qdisc_dequeue_kfunc_set)
+
enum qdisc_ops_kf_flags {
QDISC_OPS_KF_COMMON = 0,
QDISC_OPS_KF_ENQUEUE = 1 << 0,
+ QDISC_OPS_KF_DEQUEUE = 1 << 1,
};
static const u32 qdisc_ops_context_flags[] = {
[QDISC_OP_IDX(enqueue)] = QDISC_OPS_KF_ENQUEUE,
- [QDISC_OP_IDX(dequeue)] = QDISC_OPS_KF_COMMON,
+ [QDISC_OP_IDX(dequeue)] = QDISC_OPS_KF_DEQUEUE,
[QDISC_OP_IDX(init)] = QDISC_OPS_KF_COMMON,
[QDISC_OP_IDX(reset)] = QDISC_OPS_KF_COMMON,
[QDISC_OP_IDX(destroy)] = QDISC_OPS_KF_COMMON,
@@ -219,6 +312,10 @@ static int bpf_qdisc_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
btf_id_set_contains(&qdisc_enqueue_kfunc_set, kfunc_id))
return 0;
+ if ((flags & QDISC_OPS_KF_DEQUEUE) &&
+ btf_id_set_contains(&qdisc_dequeue_kfunc_set, kfunc_id))
+ return 0;
+
if (btf_id_set_contains(&qdisc_common_kfunc_set, kfunc_id))
return 0;
@@ -235,6 +332,8 @@ static const struct bpf_verifier_ops bpf_qdisc_verifier_ops = {
.get_func_proto = bpf_base_func_proto,
.is_valid_access = bpf_qdisc_is_valid_access,
.btf_struct_access = bpf_qdisc_btf_struct_access,
+ .gen_prologue = bpf_qdisc_gen_prologue,
+ .gen_epilogue = bpf_qdisc_gen_epilogue,
};
static int bpf_qdisc_init_member(const struct btf_type *t,
@@ -250,6 +349,11 @@ static int bpf_qdisc_init_member(const struct btf_type *t,
moff = __btf_member_bit_offset(t, member) / 8;
switch (moff) {
+ case offsetof(struct Qdisc_ops, priv_size):
+ if (uqdisc_ops->priv_size)
+ return -EINVAL;
+ qdisc_ops->priv_size = sizeof(struct bpf_sched_data);
+ return 1;
case offsetof(struct Qdisc_ops, peek):
qdisc_ops->peek = qdisc_peek_dequeued;
return 0;
--
2.47.1
* [PATCH bpf-next v7 05/10] bpf: net_sched: Support updating bstats
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (3 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 04/10] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 06/10] bpf: net_sched: Disable attaching bpf qdisc to non root Amery Hung
` (5 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
Add a kfunc to update Qdisc bstats when an skb is dequeued. The kfunc is
only available in .dequeue programs.
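The expected call site is the dequeue fast path, as in the fifo
selftest:

  sch->qstats.backlog -= qdisc_pkt_len(skb);
  bpf_qdisc_bstats_update(sch, skb); /* account bytes/packets sent */
  sch->q.qlen--;
  return skb;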
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
net/sched/bpf_qdisc.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 5f4ab4877535..5aff83d7d1d8 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -254,6 +254,15 @@ __bpf_kfunc void bpf_qdisc_reset_destroy_epilogue(struct Qdisc *sch)
qdisc_watchdog_cancel(&q->watchdog);
}
+/* bpf_qdisc_bstats_update - Update Qdisc basic statistics
+ * @sch: The qdisc from which an skb is dequeued.
+ * @skb: The skb to be dequeued.
+ */
+__bpf_kfunc void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb)
+{
+ bstats_update(&sch->bstats, skb);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(qdisc_kfunc_ids)
@@ -264,6 +273,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_qdisc_init_prologue, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_qdisc_reset_destroy_epilogue, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_qdisc_bstats_update, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(qdisc_kfunc_ids)
BTF_SET_START(qdisc_common_kfunc_set)
@@ -279,6 +289,7 @@ BTF_SET_END(qdisc_enqueue_kfunc_set)
BTF_SET_START(qdisc_dequeue_kfunc_set)
BTF_ID(func, bpf_qdisc_watchdog_schedule)
+BTF_ID(func, bpf_qdisc_bstats_update)
BTF_SET_END(qdisc_dequeue_kfunc_set)
enum qdisc_ops_kf_flags {
--
2.47.1
* [PATCH bpf-next v7 06/10] bpf: net_sched: Disable attaching bpf qdisc to non root
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (4 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 05/10] bpf: net_sched: Support updating bstats Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 07/10] libbpf: Support creating and destroying qdisc Amery Hung
` (4 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
Do not allow users to attach bpf qdiscs to classful qdiscs. This is to
prevent accidentally breaking existing classful qdiscs if they rely on
some data in the child qdisc. This restriction can potentially be
lifted in the future. Note that we still allow bpf qdisc to be
attached to mq.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
net/sched/bpf_qdisc.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 5aff83d7d1d8..cb158c8c433e 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -158,13 +158,19 @@ static int bpf_qdisc_gen_prologue(struct bpf_insn *insn_buf, bool direct_write,
return 0;
/* r6 = r1; // r6 will be "u64 *ctx". r1 is "u64 *ctx".
+ * r2 = r1[16]; // r2 will be "struct netlink_ext_ack *extack"
* r1 = r1[0]; // r1 will be "struct Qdisc *sch"
- * r0 = bpf_qdisc_init_prologue(r1);
+ * r0 = bpf_qdisc_init_prologue(r1, r2);
+ * if r0 == 0 goto pc+1;
+ * BPF_EXIT;
* r1 = r6; // r1 will be "u64 *ctx".
*/
*insn++ = BPF_MOV64_REG(BPF_REG_6, BPF_REG_1);
+ *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 16);
*insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0);
*insn++ = BPF_CALL_KFUNC(0, bpf_qdisc_init_prologue_ids[0]);
+ *insn++ = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1);
+ *insn++ = BPF_EXIT_INSN();
*insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_6);
*insn++ = prog->insnsi[0];
@@ -237,11 +243,26 @@ __bpf_kfunc void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64
}
/* bpf_qdisc_init_prologue - Hidden kfunc called in prologue of .init. */
-__bpf_kfunc void bpf_qdisc_init_prologue(struct Qdisc *sch)
+__bpf_kfunc int bpf_qdisc_init_prologue(struct Qdisc *sch,
+ struct netlink_ext_ack *extack)
{
struct bpf_sched_data *q = qdisc_priv(sch);
+ struct net_device *dev = qdisc_dev(sch);
+ struct Qdisc *p;
+
+ if (sch->parent != TC_H_ROOT) {
+ p = qdisc_lookup(dev, TC_H_MAJ(sch->parent));
+ if (!p)
+ return -ENOENT;
+
+ if (!(p->flags & TCQ_F_MQROOT)) {
+ NL_SET_ERR_MSG(extack, "BPF qdisc only supported on root or mq");
+ return -EINVAL;
+ }
+ }
qdisc_watchdog_init(&q->watchdog, sch);
+ return 0;
}
/* bpf_qdisc_reset_destroy_epilogue - Hidden kfunc called in epilogue of .reset
--
2.47.1
* [PATCH bpf-next v7 07/10] libbpf: Support creating and destroying qdisc
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (5 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 06/10] bpf: net_sched: Disable attaching bpf qdisc to non root Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test Amery Hung
` (3 subsequent siblings)
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
Extend struct bpf_tc_hook with a handle, a qdisc name and a new attach
type, BPF_TC_QDISC, to allow users to add or remove any specified
qdisc in addition to clsact.
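Usage then looks like the following sketch (mirroring the selftest in
this series; the handle and parent values are up to the caller):

  DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
                      .ifindex = ifindex,
                      .attach_point = BPF_TC_QDISC,
                      .parent = TC_H_ROOT,
                      .handle = 0x8000000,
                      .qdisc = "bpf_fifo");

  err = bpf_tc_hook_create(&hook);    /* tc qdisc add ... */
  /* ... traffic runs through the bpf qdisc ... */
  err = bpf_tc_hook_destroy(&hook);   /* tc qdisc del ... */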
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
tools/lib/bpf/libbpf.h | 5 ++++-
tools/lib/bpf/netlink.c | 20 +++++++++++++++++---
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e0605403f977..fdcee6a71e0f 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1283,6 +1283,7 @@ enum bpf_tc_attach_point {
BPF_TC_INGRESS = 1 << 0,
BPF_TC_EGRESS = 1 << 1,
BPF_TC_CUSTOM = 1 << 2,
+ BPF_TC_QDISC = 1 << 3,
};
#define BPF_TC_PARENT(a, b) \
@@ -1297,9 +1298,11 @@ struct bpf_tc_hook {
int ifindex;
enum bpf_tc_attach_point attach_point;
__u32 parent;
+ __u32 handle;
+ const char *qdisc;
size_t :0;
};
-#define bpf_tc_hook__last_field parent
+#define bpf_tc_hook__last_field qdisc
struct bpf_tc_opts {
size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 68a2def17175..c997e69d507f 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -529,9 +529,9 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
}
-typedef int (*qdisc_config_t)(struct libbpf_nla_req *req);
+typedef int (*qdisc_config_t)(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook);
-static int clsact_config(struct libbpf_nla_req *req)
+static int clsact_config(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook)
{
req->tc.tcm_parent = TC_H_CLSACT;
req->tc.tcm_handle = TC_H_MAKE(TC_H_CLSACT, 0);
@@ -539,6 +539,16 @@ static int clsact_config(struct libbpf_nla_req *req)
return nlattr_add(req, TCA_KIND, "clsact", sizeof("clsact"));
}
+static int qdisc_config(struct libbpf_nla_req *req, const struct bpf_tc_hook *hook)
+{
+ const char *qdisc = OPTS_GET(hook, qdisc, NULL);
+
+ req->tc.tcm_parent = OPTS_GET(hook, parent, TC_H_ROOT);
+ req->tc.tcm_handle = OPTS_GET(hook, handle, 0);
+
+ return nlattr_add(req, TCA_KIND, qdisc, strlen(qdisc) + 1);
+}
+
static int attach_point_to_config(struct bpf_tc_hook *hook,
qdisc_config_t *config)
{
@@ -552,6 +562,9 @@ static int attach_point_to_config(struct bpf_tc_hook *hook,
return 0;
case BPF_TC_CUSTOM:
return -EOPNOTSUPP;
+ case BPF_TC_QDISC:
+ *config = &qdisc_config;
+ return 0;
default:
return -EINVAL;
}
@@ -596,7 +609,7 @@ static int tc_qdisc_modify(struct bpf_tc_hook *hook, int cmd, int flags)
req.tc.tcm_family = AF_UNSPEC;
req.tc.tcm_ifindex = OPTS_GET(hook, ifindex, 0);
- ret = config(&req);
+ ret = config(&req, hook);
if (ret < 0)
return ret;
@@ -639,6 +652,7 @@ int bpf_tc_hook_destroy(struct bpf_tc_hook *hook)
case BPF_TC_INGRESS:
case BPF_TC_EGRESS:
return libbpf_err(__bpf_tc_detach(hook, NULL, true));
+ case BPF_TC_QDISC:
case BPF_TC_INGRESS | BPF_TC_EGRESS:
return libbpf_err(tc_qdisc_delete(hook));
case BPF_TC_CUSTOM:
--
2.47.1
* [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (6 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 07/10] libbpf: Support creating and destroying qdisc Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-10 23:59 ` Martin KaFai Lau
2025-04-09 21:46 ` [PATCH bpf-next v7 09/10] selftests/bpf: Add a bpf fq qdisc to selftest Amery Hung
` (2 subsequent siblings)
10 siblings, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
This selftest includes a bare minimum fifo qdisc, which simply enqueues
sk_buffs into the back of a bpf list and dequeues from the front of the
list.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/bpf_qdisc.c | 81 ++++++++++++
.../selftests/bpf/progs/bpf_qdisc_common.h | 31 +++++
.../selftests/bpf/progs/bpf_qdisc_fifo.c | 117 ++++++++++++++++++
4 files changed, 230 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c378d5d07e02..6b0cab55bd2d 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -71,6 +71,7 @@ CONFIG_NET_IPGRE=y
CONFIG_NET_IPGRE_DEMUX=y
CONFIG_NET_IPIP=y
CONFIG_NET_MPLS_GSO=y
+CONFIG_NET_SCH_BPF=y
CONFIG_NET_SCH_FQ=y
CONFIG_NET_SCH_INGRESS=y
CONFIG_NET_SCHED=y
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
new file mode 100644
index 000000000000..1ec321eb089f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <test_progs.h>
+
+#include "network_helpers.h"
+#include "bpf_qdisc_fifo.skel.h"
+
+#define LO_IFINDEX 1
+
+static const unsigned int total_bytes = 10 * 1024 * 1024;
+
+static void do_test(char *qdisc)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+ .attach_point = BPF_TC_QDISC,
+ .parent = TC_H_ROOT,
+ .handle = 0x8000000,
+ .qdisc = qdisc);
+ int srv_fd = -1, cli_fd = -1;
+ int err;
+
+ err = bpf_tc_hook_create(&hook);
+ if (!ASSERT_OK(err, "attach qdisc"))
+ return;
+
+ srv_fd = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+ if (!ASSERT_OK_FD(srv_fd, "start server"))
+ goto done;
+
+ cli_fd = connect_to_fd(srv_fd, 0);
+ if (!ASSERT_OK_FD(cli_fd, "connect to client"))
+ goto done;
+
+ err = send_recv_data(srv_fd, cli_fd, total_bytes);
+ ASSERT_OK(err, "send_recv_data");
+
+done:
+ if (srv_fd != -1)
+ close(srv_fd);
+ if (cli_fd != -1)
+ close(cli_fd);
+
+ bpf_tc_hook_destroy(&hook);
+}
+
+static void test_fifo(void)
+{
+ struct bpf_qdisc_fifo *fifo_skel;
+ struct bpf_link *link;
+
+ fifo_skel = bpf_qdisc_fifo__open_and_load();
+ if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fifo__destroy(fifo_skel);
+ return;
+ }
+
+ do_test("bpf_fifo");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+void test_bpf_qdisc(void)
+{
+ struct netns_obj *netns;
+
+ netns = netns_new("bpf_qdisc_ns", true);
+ if (!ASSERT_OK_PTR(netns, "netns_new"))
+ return;
+
+ if (test__start_subtest("fifo"))
+ test_fifo();
+
+ netns_free(netns);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
new file mode 100644
index 000000000000..65a2c561c0bb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BPF_QDISC_COMMON_H
+#define _BPF_QDISC_COMMON_H
+
+#define NET_XMIT_SUCCESS 0x00
+#define NET_XMIT_DROP 0x01 /* skb dropped */
+#define NET_XMIT_CN 0x02 /* congestion notification */
+
+#define TC_PRIO_CONTROL 7
+#define TC_PRIO_MAX 15
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_kfree_skb(struct sk_buff *p) __ksym;
+void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
+void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
+void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
+
+static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
+{
+ return (struct qdisc_skb_cb *)skb->cb;
+}
+
+static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb)
+{
+ return qdisc_skb_cb(skb)->pkt_len;
+}
+
+#endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
new file mode 100644
index 000000000000..0c7cfb82dae1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fifo.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct skb_node {
+ struct sk_buff __kptr * skb;
+ struct bpf_list_node node;
+};
+
+private(A) struct bpf_spin_lock q_fifo_lock;
+private(A) struct bpf_list_head q_fifo __contains(skb_node, node);
+
+SEC("struct_ops/bpf_fifo_enqueue")
+int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct skb_node *skbn;
+ u32 pkt_len;
+
+ if (sch->q.qlen == sch->limit)
+ goto drop;
+
+ skbn = bpf_obj_new(typeof(*skbn));
+ if (!skbn)
+ goto drop;
+
+ pkt_len = qdisc_pkt_len(skb);
+
+ sch->q.qlen++;
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_qdisc_skb_drop(skb, to_free);
+
+ bpf_spin_lock(&q_fifo_lock);
+ bpf_list_push_back(&q_fifo, &skbn->node);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ sch->qstats.backlog += pkt_len;
+ return NET_XMIT_SUCCESS;
+drop:
+ bpf_qdisc_skb_drop(skb, to_free);
+ return NET_XMIT_DROP;
+}
+
+SEC("struct_ops/bpf_fifo_dequeue")
+struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
+{
+ struct bpf_list_node *node;
+ struct sk_buff *skb = NULL;
+ struct skb_node *skbn;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+ if (!node)
+ return NULL;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+ if (!skb)
+ return NULL;
+
+ sch->qstats.backlog -= qdisc_pkt_len(skb);
+ bpf_qdisc_bstats_update(sch, skb);
+ sch->q.qlen--;
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_fifo_init")
+int BPF_PROG(bpf_fifo_init, struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ sch->limit = 1000;
+ return 0;
+}
+
+SEC("struct_ops/bpf_fifo_reset")
+void BPF_PROG(bpf_fifo_reset, struct Qdisc *sch)
+{
+ struct bpf_list_node *node;
+ struct skb_node *skbn;
+ int i;
+
+ bpf_for(i, 0, sch->q.qlen) {
+ struct sk_buff *skb = NULL;
+
+ bpf_spin_lock(&q_fifo_lock);
+ node = bpf_list_pop_front(&q_fifo);
+ bpf_spin_unlock(&q_fifo_lock);
+
+ if (!node)
+ break;
+
+ skbn = container_of(node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_kfree_skb(skb);
+ bpf_obj_drop(skbn);
+ }
+ sch->q.qlen = 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fifo = {
+ .enqueue = (void *)bpf_fifo_enqueue,
+ .dequeue = (void *)bpf_fifo_dequeue,
+ .init = (void *)bpf_fifo_init,
+ .reset = (void *)bpf_fifo_reset,
+ .id = "bpf_fifo",
+};
+
--
2.47.1
* [PATCH bpf-next v7 09/10] selftests/bpf: Add a bpf fq qdisc to selftest
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (7 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 10/10] selftests/bpf: Test attaching bpf qdisc to mq and non root Amery Hung
2025-04-10 23:50 ` [PATCH bpf-next v7 00/10] bpf qdisc patchwork-bot+netdevbpf
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
From: Amery Hung <amery.hung@bytedance.com>
This test implements a more sophisticated qdisc using bpf. The bpf
fair-queueing (fq) qdisc gives each flow an equal chance to transmit
data. It also respects skb->tstamp for rate limiting.
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
.../selftests/bpf/prog_tests/bpf_qdisc.c | 24 +
.../selftests/bpf/progs/bpf_qdisc_fq.c | 750 ++++++++++++++++++
2 files changed, 774 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 1ec321eb089f..230d8f935303 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -6,6 +6,7 @@
#include "network_helpers.h"
#include "bpf_qdisc_fifo.skel.h"
+#include "bpf_qdisc_fq.skel.h"
#define LO_IFINDEX 1
@@ -66,6 +67,27 @@ static void test_fifo(void)
bpf_qdisc_fifo__destroy(fifo_skel);
}
+static void test_fq(void)
+{
+ struct bpf_qdisc_fq *fq_skel;
+ struct bpf_link *link;
+
+ fq_skel = bpf_qdisc_fq__open_and_load();
+ if (!ASSERT_OK_PTR(fq_skel, "bpf_qdisc_fq__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fq_skel->maps.fq);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fq__destroy(fq_skel);
+ return;
+ }
+
+ do_test("bpf_fq");
+
+ bpf_link__destroy(link);
+ bpf_qdisc_fq__destroy(fq_skel);
+}
+
void test_bpf_qdisc(void)
{
struct netns_obj *netns;
@@ -76,6 +98,8 @@ void test_bpf_qdisc(void)
if (test__start_subtest("fifo"))
test_fifo();
+ if (test__start_subtest("fq"))
+ test_fq();
netns_free(netns);
}
diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
new file mode 100644
index 000000000000..7c110a156224
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c
@@ -0,0 +1,750 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* bpf_fq is intended for testing the bpf qdisc infrastructure and not a direct
+ * copy of sch_fq. bpf_fq implements the scheduling algorithm of sch_fq before
+ * 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling") was
+ * introduced. It gives each flow a fair chance to transmit packets in a
+ * round-robin fashion. Note that for flow pacing, bpf_fq currently only
+ * respects skb->tstamp but not skb->sk->sk_pacing_rate. In addition, if there
+ * are multiple bpf_fq instances, they will have a shared view of flows and
+ * configuration since some key data structures such as fq_prio_flows,
+ * fq_nonprio_flows, and fq_bpf_data are global.
+ *
+ * To use bpf_fq alone without running selftests, use the following commands.
+ *
+ * 1. Register bpf_fq to the kernel
+ * bpftool struct_ops register bpf_qdisc_fq.bpf.o /sys/fs/bpf
+ * 2. Add bpf_fq to an interface
+ * tc qdisc add dev <interface name> root handle <handle> bpf_fq
+ * 3. Delete bpf_fq attached to the interface
+ * tc qdisc delete dev <interface name> root
+ * 4. Unregister bpf_fq
+ * bpftool struct_ops unregister name fq
+ *
+ * The qdisc name, bpf_fq, used in tc commands is defined by Qdisc_ops.id.
+ * The struct_ops_map_name, fq, used in the bpftool command is the name of the
+ * Qdisc_ops.
+ *
+ * SEC(".struct_ops")
+ * struct Qdisc_ops fq = {
+ * ...
+ * .id = "bpf_fq",
+ * };
+ */
+
+#include <vmlinux.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_experimental.h"
+#include "bpf_qdisc_common.h"
+
+char _license[] SEC("license") = "GPL";
+
+#define NSEC_PER_USEC 1000L
+#define NSEC_PER_SEC 1000000000L
+
+#define NUM_QUEUE (1 << 20)
+
+struct fq_bpf_data {
+ u32 quantum;
+ u32 initial_quantum;
+ u32 flow_refill_delay;
+ u32 flow_plimit;
+ u64 horizon;
+ u32 orphan_mask;
+ u32 timer_slack;
+ u64 time_next_delayed_flow;
+ u64 unthrottle_latency_ns;
+ u8 horizon_drop;
+ u32 new_flow_cnt;
+ u32 old_flow_cnt;
+ u64 ktime_cache;
+};
+
+enum {
+ CLS_RET_PRIO = 0,
+ CLS_RET_NONPRIO = 1,
+ CLS_RET_ERR = 2,
+};
+
+struct skb_node {
+ u64 tstamp;
+ struct sk_buff __kptr * skb;
+ struct bpf_rb_node node;
+};
+
+struct fq_flow_node {
+ int credit;
+ u32 qlen;
+ u64 age;
+ u64 time_next_packet;
+ struct bpf_list_node list_node;
+ struct bpf_rb_node rb_node;
+ struct bpf_rb_root queue __contains(skb_node, node);
+ struct bpf_spin_lock lock;
+ struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+ bool stop_iter;
+ u64 expire;
+ u64 now;
+};
+
+struct remove_flows_ctx {
+ bool gc_only;
+ u32 reset_cnt;
+ u32 reset_max;
+};
+
+struct unset_throttled_flows_ctx {
+ bool unset_all;
+ u64 now;
+};
+
+struct fq_stashed_flow {
+ struct fq_flow_node __kptr * flow;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, struct fq_stashed_flow);
+ __uint(max_entries, NUM_QUEUE);
+} fq_nonprio_flows SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, __u64);
+ __type(value, struct fq_stashed_flow);
+ __uint(max_entries, 1);
+} fq_prio_flows SEC(".maps");
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+private(D) struct fq_bpf_data q;
+
+/* Wrapper for bpf_kptr_xchg that expects NULL dst */
+static void bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+ void *ret;
+
+ ret = bpf_kptr_xchg(map_val, ptr);
+ if (ret)
+ bpf_obj_drop(ret);
+}
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct skb_node *skbn_a;
+ struct skb_node *skbn_b;
+
+ skbn_a = container_of(a, struct skb_node, node);
+ skbn_b = container_of(b, struct skb_node, node);
+
+ return skbn_a->tstamp < skbn_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct fq_flow_node *flow_a;
+ struct fq_flow_node *flow_b;
+
+ flow_a = container_of(a, struct fq_flow_node, rb_node);
+ flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+ return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow, u32 *flow_cnt)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_front(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+ *flow_cnt += 1;
+}
+
+static void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct fq_flow_node *flow, u32 *flow_cnt)
+{
+ bpf_spin_lock(lock);
+ bpf_list_push_back(head, &flow->list_node);
+ bpf_spin_unlock(lock);
+ *flow_cnt += 1;
+}
+
+static void
+fq_flows_remove_front(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct bpf_list_node **node, u32 *flow_cnt)
+{
+ bpf_spin_lock(lock);
+ *node = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ *flow_cnt -= 1;
+}
+
+static bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+ struct bpf_list_node *node;
+
+ bpf_spin_lock(lock);
+ node = bpf_list_pop_front(head);
+ if (node) {
+ bpf_list_push_front(head, node);
+ bpf_spin_unlock(lock);
+ return false;
+ }
+ bpf_spin_unlock(lock);
+
+ return true;
+}
+
+/* flow->age is used to denote the state of the flow (not-detached, detached, throttled)
+ * as well as the timestamp when the flow is detached.
+ *
+ * 0: not-detached
+ * 1 - (~0ULL-1): detached
+ * ~0ULL: throttled
+ */
+static void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+ flow->age = bpf_jiffies64();
+}
+
+static bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+ return flow->age != 0 && flow->age != ~0ULL;
+}
+
+static bool sk_listener(struct sock *sk)
+{
+ return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static void fq_gc(void);
+
+static int fq_new_flow(void *flow_map, struct fq_stashed_flow **sflow, u64 hash)
+{
+ struct fq_stashed_flow tmp = {};
+ struct fq_flow_node *flow;
+ int ret;
+
+ flow = bpf_obj_new(typeof(*flow));
+ if (!flow)
+ return -ENOMEM;
+
+ flow->credit = q.initial_quantum;
+ flow->qlen = 0;
+ flow->age = 1;
+ flow->time_next_packet = 0;
+
+ ret = bpf_map_update_elem(flow_map, &hash, &tmp, 0);
+ if (ret == -ENOMEM || ret == -E2BIG) {
+ fq_gc();
+ bpf_map_update_elem(&fq_nonprio_flows, &hash, &tmp, 0);
+ }
+
+ *sflow = bpf_map_lookup_elem(flow_map, &hash);
+ if (!*sflow) {
+ bpf_obj_drop(flow);
+ return -ENOMEM;
+ }
+
+ bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+ return 0;
+}
+
+static int
+fq_classify(struct sk_buff *skb, struct fq_stashed_flow **sflow)
+{
+ struct sock *sk = skb->sk;
+ int ret = CLS_RET_NONPRIO;
+ u64 hash = 0;
+
+ if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+ *sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+ ret = CLS_RET_PRIO;
+ } else {
+ if (!sk || sk_listener(sk)) {
+ hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+ /* Avoid collision with an existing flow hash, which
+ * only uses the lower 32 bits of hash, by setting the
+ * upper half of hash to 1.
+ */
+ hash |= (1ULL << 32);
+ } else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+ hash = bpf_skb_get_hash(skb) & q.orphan_mask;
+ hash |= (1ULL << 32);
+ } else {
+ hash = sk->__sk_common.skc_hash;
+ }
+ *sflow = bpf_map_lookup_elem(&fq_nonprio_flows, &hash);
+ }
+
+ if (!*sflow)
+ ret = fq_new_flow(&fq_nonprio_flows, sflow, hash) < 0 ?
+ CLS_RET_ERR : CLS_RET_NONPRIO;
+
+ return ret;
+}
+
+static bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+ return (s64)skb->tstamp > (s64)(q.ktime_cache + q.horizon);
+}
+
+SEC("struct_ops/bpf_fq_enqueue")
+int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch,
+ struct bpf_sk_buff_ptr *to_free)
+{
+ struct fq_flow_node *flow = NULL, *flow_copy;
+ struct fq_stashed_flow *sflow;
+ u64 time_to_send, jiffies;
+ struct skb_node *skbn;
+ int ret;
+
+ if (sch->q.qlen >= sch->limit)
+ goto drop;
+
+ if (!skb->tstamp) {
+ time_to_send = q.ktime_cache = bpf_ktime_get_ns();
+ } else {
+ if (fq_packet_beyond_horizon(skb)) {
+ q.ktime_cache = bpf_ktime_get_ns();
+ if (fq_packet_beyond_horizon(skb)) {
+ if (q.horizon_drop)
+ goto drop;
+
+ skb->tstamp = q.ktime_cache + q.horizon;
+ }
+ }
+ time_to_send = skb->tstamp;
+ }
+
+ ret = fq_classify(skb, &sflow);
+ if (ret == CLS_RET_ERR)
+ goto drop;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ goto drop;
+
+ if (ret == CLS_RET_NONPRIO) {
+ if (flow->qlen >= q.flow_plimit) {
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+ goto drop;
+ }
+
+ if (fq_flow_is_detached(flow)) {
+ flow_copy = bpf_refcount_acquire(flow);
+
+ jiffies = bpf_jiffies64();
+ if ((s64)(jiffies - (flow_copy->age + q.flow_refill_delay)) > 0) {
+ if (flow_copy->credit < q.quantum)
+ flow_copy->credit = q.quantum;
+ }
+ flow_copy->age = 0;
+ fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy,
+ &q.new_flow_cnt);
+ }
+ }
+
+ skbn = bpf_obj_new(typeof(*skbn));
+ if (!skbn) {
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+ goto drop;
+ }
+
+ skbn->tstamp = skb->tstamp = time_to_send;
+
+ sch->qstats.backlog += qdisc_pkt_len(skb);
+
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ if (skb)
+ bpf_qdisc_skb_drop(skb, to_free);
+
+ bpf_spin_lock(&flow->lock);
+ bpf_rbtree_add(&flow->queue, &skbn->node, skbn_tstamp_less);
+ bpf_spin_unlock(&flow->lock);
+
+ flow->qlen++;
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ sch->q.qlen++;
+ return NET_XMIT_SUCCESS;
+
+drop:
+ bpf_qdisc_skb_drop(skb, to_free);
+ sch->qstats.drops++;
+ return NET_XMIT_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, struct unset_throttled_flows_ctx *ctx)
+{
+ struct bpf_rb_node *node = NULL;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_delayed_lock);
+
+ node = bpf_rbtree_first(&fq_delayed);
+ if (!node) {
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+ if (!ctx->unset_all && flow->time_next_packet > ctx->now) {
+ q.time_next_delayed_flow = flow->time_next_packet;
+ bpf_spin_unlock(&fq_delayed_lock);
+ return 1;
+ }
+
+ node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+ bpf_spin_unlock(&fq_delayed_lock);
+
+ if (!node)
+ return 1;
+
+ flow = container_of(node, struct fq_flow_node, rb_node);
+ flow->age = 0;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+
+ return 0;
+}
+
+static void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+ flow->age = ~0ULL;
+
+ if (q.time_next_delayed_flow > flow->time_next_packet)
+ q.time_next_delayed_flow = flow->time_next_packet;
+
+ bpf_spin_lock(&fq_delayed_lock);
+ bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+ bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static void fq_check_throttled(u64 now)
+{
+ struct unset_throttled_flows_ctx ctx = {
+ .unset_all = false,
+ .now = now,
+ };
+ unsigned long sample;
+
+ if (q.time_next_delayed_flow > now)
+ return;
+
+ sample = (unsigned long)(now - q.time_next_delayed_flow);
+ q.unthrottle_latency_ns -= q.unthrottle_latency_ns >> 3;
+ q.unthrottle_latency_ns += sample >> 3;
+
+ q.time_next_delayed_flow = ~0ULL;
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &ctx, 0);
+}
+
+static struct sk_buff*
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+ u64 time_next_packet, time_to_send;
+ struct bpf_rb_node *rb_node;
+ struct sk_buff *skb = NULL;
+ struct bpf_list_head *head;
+ struct bpf_list_node *node;
+ struct bpf_spin_lock *lock;
+ struct fq_flow_node *flow;
+ struct skb_node *skbn;
+ bool is_empty;
+ u32 *cnt;
+
+ if (q.new_flow_cnt) {
+ head = &fq_new_flows;
+ lock = &fq_new_flows_lock;
+ cnt = &q.new_flow_cnt;
+ } else if (q.old_flow_cnt) {
+ head = &fq_old_flows;
+ lock = &fq_old_flows_lock;
+ cnt = &q.old_flow_cnt;
+ } else {
+ if (q.time_next_delayed_flow != ~0ULL)
+ ctx->expire = q.time_next_delayed_flow;
+ goto break_loop;
+ }
+
+ fq_flows_remove_front(head, lock, &node, cnt);
+ if (!node)
+ goto break_loop;
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ if (flow->credit <= 0) {
+ flow->credit += q.quantum;
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+ return NULL;
+ }
+
+ bpf_spin_lock(&flow->lock);
+ rb_node = bpf_rbtree_first(&flow->queue);
+ if (!rb_node) {
+ bpf_spin_unlock(&flow->lock);
+ is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+ if (head == &fq_new_flows && !is_empty) {
+ fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow, &q.old_flow_cnt);
+ } else {
+ fq_flow_set_detached(flow);
+ bpf_obj_drop(flow);
+ }
+ return NULL;
+ }
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ time_to_send = skbn->tstamp;
+
+ time_next_packet = (time_to_send > flow->time_next_packet) ?
+ time_to_send : flow->time_next_packet;
+ if (ctx->now < time_next_packet) {
+ bpf_spin_unlock(&flow->lock);
+ flow->time_next_packet = time_next_packet;
+ fq_flow_set_throttled(flow);
+ return NULL;
+ }
+
+ rb_node = bpf_rbtree_remove(&flow->queue, rb_node);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!rb_node)
+ goto add_flow_and_break;
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+
+ if (!skb)
+ goto add_flow_and_break;
+
+ flow->credit -= qdisc_skb_cb(skb)->pkt_len;
+ flow->qlen--;
+
+add_flow_and_break:
+ fq_flows_add_head(head, lock, flow, cnt);
+
+break_loop:
+ ctx->stop_iter = true;
+ return skb;
+}
+
+static struct sk_buff *fq_dequeue_prio(void)
+{
+ struct fq_flow_node *flow = NULL;
+ struct fq_stashed_flow *sflow;
+ struct bpf_rb_node *rb_node;
+ struct sk_buff *skb = NULL;
+ struct skb_node *skbn;
+ u64 hash = 0;
+
+ sflow = bpf_map_lookup_elem(&fq_prio_flows, &hash);
+ if (!sflow)
+ return NULL;
+
+ flow = bpf_kptr_xchg(&sflow->flow, flow);
+ if (!flow)
+ return NULL;
+
+ bpf_spin_lock(&flow->lock);
+ rb_node = bpf_rbtree_first(&flow->queue);
+ if (!rb_node) {
+ bpf_spin_unlock(&flow->lock);
+ goto out;
+ }
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+ bpf_spin_unlock(&flow->lock);
+
+ if (!rb_node)
+ goto out;
+
+ skbn = container_of(rb_node, struct skb_node, node);
+ skb = bpf_kptr_xchg(&skbn->skb, skb);
+ bpf_obj_drop(skbn);
+
+out:
+ bpf_kptr_xchg_back(&sflow->flow, flow);
+
+ return skb;
+}
+
+SEC("struct_ops/bpf_fq_dequeue")
+struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch)
+{
+ struct dequeue_nonprio_ctx cb_ctx = {};
+ struct sk_buff *skb = NULL;
+ int i;
+
+ if (!sch->q.qlen)
+ goto out;
+
+ skb = fq_dequeue_prio();
+ if (skb)
+ goto dequeue;
+
+ q.ktime_cache = cb_ctx.now = bpf_ktime_get_ns();
+ fq_check_throttled(q.ktime_cache);
+ bpf_for(i, 0, sch->limit) {
+ skb = fq_dequeue_nonprio_flows(i, &cb_ctx);
+ if (cb_ctx.stop_iter)
+ break;
+ }
+
+ if (skb) {
+dequeue:
+ sch->q.qlen--;
+ sch->qstats.backlog -= qdisc_pkt_len(skb);
+ bpf_qdisc_bstats_update(sch, skb);
+ return skb;
+ }
+
+ if (cb_ctx.expire)
+ bpf_qdisc_watchdog_schedule(sch, cb_ctx.expire, q.timer_slack);
+out:
+ return NULL;
+}
+
+static int fq_remove_flows_in_list(u32 index, void *ctx)
+{
+ struct bpf_list_node *node;
+ struct fq_flow_node *flow;
+
+ bpf_spin_lock(&fq_new_flows_lock);
+ node = bpf_list_pop_front(&fq_new_flows);
+ bpf_spin_unlock(&fq_new_flows_lock);
+ if (!node) {
+ bpf_spin_lock(&fq_old_flows_lock);
+ node = bpf_list_pop_front(&fq_old_flows);
+ bpf_spin_unlock(&fq_old_flows_lock);
+ if (!node)
+ return 1;
+ }
+
+ flow = container_of(node, struct fq_flow_node, list_node);
+ bpf_obj_drop(flow);
+
+ return 0;
+}
+
+extern unsigned CONFIG_HZ __kconfig;
+
+/* limit number of collected flows per round */
+#define FQ_GC_MAX 8
+#define FQ_GC_AGE (3*CONFIG_HZ)
+
+static bool fq_gc_candidate(struct fq_flow_node *flow)
+{
+ u64 jiffies = bpf_jiffies64();
+
+ return fq_flow_is_detached(flow) &&
+ ((s64)(jiffies - (flow->age + FQ_GC_AGE)) > 0);
+}
+
+static int
+fq_remove_flows(struct bpf_map *flow_map, u64 *hash,
+ struct fq_stashed_flow *sflow, struct remove_flows_ctx *ctx)
+{
+ if (sflow->flow &&
+ (!ctx->gc_only || fq_gc_candidate(sflow->flow))) {
+ bpf_map_delete_elem(flow_map, hash);
+ ctx->reset_cnt++;
+ }
+
+ return ctx->reset_cnt < ctx->reset_max ? 0 : 1;
+}
+
+static void fq_gc(void)
+{
+ struct remove_flows_ctx cb_ctx = {
+ .gc_only = true,
+ .reset_cnt = 0,
+ .reset_max = FQ_GC_MAX,
+ };
+
+ bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &cb_ctx, 0);
+}
+
+SEC("struct_ops/bpf_fq_reset")
+void BPF_PROG(bpf_fq_reset, struct Qdisc *sch)
+{
+ struct unset_throttled_flows_ctx utf_ctx = {
+ .unset_all = true,
+ };
+ struct remove_flows_ctx rf_ctx = {
+ .gc_only = false,
+ .reset_cnt = 0,
+ .reset_max = NUM_QUEUE,
+ };
+ struct fq_stashed_flow *sflow;
+ u64 hash = 0;
+
+ sch->q.qlen = 0;
+ sch->qstats.backlog = 0;
+
+ bpf_for_each_map_elem(&fq_nonprio_flows, fq_remove_flows, &rf_ctx, 0);
+
+ rf_ctx.reset_cnt = 0;
+ bpf_for_each_map_elem(&fq_prio_flows, fq_remove_flows, &rf_ctx, 0);
+ fq_new_flow(&fq_prio_flows, &sflow, hash);
+
+ bpf_loop(NUM_QUEUE, fq_remove_flows_in_list, NULL, 0);
+ q.new_flow_cnt = 0;
+ q.old_flow_cnt = 0;
+
+ bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &utf_ctx, 0);
+}
+
+SEC("struct_ops/bpf_fq_init")
+int BPF_PROG(bpf_fq_init, struct Qdisc *sch, struct nlattr *opt,
+ struct netlink_ext_ack *extack)
+{
+ struct net_device *dev = sch->dev_queue->dev;
+ u32 psched_mtu = dev->mtu + dev->hard_header_len;
+ struct fq_stashed_flow *sflow;
+ u64 hash = 0;
+
+ if (fq_new_flow(&fq_prio_flows, &sflow, hash) < 0)
+ return -ENOMEM;
+
+ sch->limit = 10000;
+ q.initial_quantum = 10 * psched_mtu;
+ q.quantum = 2 * psched_mtu;
+ q.flow_refill_delay = 40;
+ q.flow_plimit = 100;
+ q.horizon = 10ULL * NSEC_PER_SEC;
+ q.horizon_drop = 1;
+ q.orphan_mask = 1024 - 1;
+ q.timer_slack = 10 * NSEC_PER_USEC;
+ q.time_next_delayed_flow = ~0ULL;
+ q.unthrottle_latency_ns = 0ULL;
+ q.new_flow_cnt = 0;
+ q.old_flow_cnt = 0;
+
+ return 0;
+}
+
+SEC(".struct_ops")
+struct Qdisc_ops fq = {
+ .enqueue = (void *)bpf_fq_enqueue,
+ .dequeue = (void *)bpf_fq_dequeue,
+ .reset = (void *)bpf_fq_reset,
+ .init = (void *)bpf_fq_init,
+ .id = "bpf_fq",
+};
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH bpf-next v7 10/10] selftests/bpf: Test attaching bpf qdisc to mq and non root
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (8 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 09/10] selftests/bpf: Add a bpf fq qdisc to selftest Amery Hung
@ 2025-04-09 21:46 ` Amery Hung
2025-04-10 23:50 ` [PATCH bpf-next v7 00/10] bpf qdisc patchwork-bot+netdevbpf
10 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-09 21:46 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
Until we are certain that existing classful qdiscs work with bpf qdisc,
make sure we don't allow attaching a bpf qdisc to a non-root parent.
Attaching to mq, however, is still allowed.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/bpf_qdisc.c | 75 +++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 6b0cab55bd2d..3201a962b3dc 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -74,6 +74,7 @@ CONFIG_NET_MPLS_GSO=y
CONFIG_NET_SCH_BPF=y
CONFIG_NET_SCH_FQ=y
CONFIG_NET_SCH_INGRESS=y
+CONFIG_NET_SCH_HTB=y
CONFIG_NET_SCHED=y
CONFIG_NETDEVSIM=y
CONFIG_NETFILTER=y
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
index 230d8f935303..c9a54177c84e 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c
@@ -88,6 +88,77 @@ static void test_fq(void)
bpf_qdisc_fq__destroy(fq_skel);
}
+static void test_qdisc_attach_to_mq(void)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook,
+ .attach_point = BPF_TC_QDISC,
+ .parent = TC_H_MAKE(1 << 16, 1),
+ .handle = 0x11 << 16,
+ .qdisc = "bpf_fifo");
+ struct bpf_qdisc_fifo *fifo_skel;
+ struct bpf_link *link;
+ int err;
+
+ fifo_skel = bpf_qdisc_fifo__open_and_load();
+ if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fifo__destroy(fifo_skel);
+ return;
+ }
+
+ SYS(out, "ip link add veth0 type veth peer veth1");
+ hook.ifindex = if_nametoindex("veth0");
+ SYS(out, "tc qdisc add dev veth0 root handle 1: mq");
+
+ err = bpf_tc_hook_create(&hook);
+ ASSERT_OK(err, "attach qdisc");
+
+ bpf_tc_hook_destroy(&hook);
+
+ SYS(out, "tc qdisc delete dev veth0 root mq");
+out:
+ bpf_link__destroy(link);
+ bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
+static void test_qdisc_attach_to_non_root(void)
+{
+ DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX,
+ .attach_point = BPF_TC_QDISC,
+ .parent = TC_H_MAKE(1 << 16, 1),
+ .handle = 0x11 << 16,
+ .qdisc = "bpf_fifo");
+ struct bpf_qdisc_fifo *fifo_skel;
+ struct bpf_link *link;
+ int err;
+
+ fifo_skel = bpf_qdisc_fifo__open_and_load();
+ if (!ASSERT_OK_PTR(fifo_skel, "bpf_qdisc_fifo__open_and_load"))
+ return;
+
+ link = bpf_map__attach_struct_ops(fifo_skel->maps.fifo);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) {
+ bpf_qdisc_fifo__destroy(fifo_skel);
+ return;
+ }
+
+ SYS(out, "tc qdisc add dev lo root handle 1: htb");
+ SYS(out_del_htb, "tc class add dev lo parent 1: classid 1:1 htb rate 75Kbit");
+
+ err = bpf_tc_hook_create(&hook);
+ if (!ASSERT_ERR(err, "attach qdisc"))
+ bpf_tc_hook_destroy(&hook);
+
+out_del_htb:
+ SYS(out, "tc qdisc delete dev lo root htb");
+out:
+ bpf_link__destroy(link);
+ bpf_qdisc_fifo__destroy(fifo_skel);
+}
+
void test_bpf_qdisc(void)
{
struct netns_obj *netns;
@@ -100,6 +171,10 @@ void test_bpf_qdisc(void)
test_fifo();
if (test__start_subtest("fq"))
test_fq();
+ if (test__start_subtest("attach to mq"))
+ test_qdisc_attach_to_mq();
+ if (test__start_subtest("attach to non root"))
+ test_qdisc_attach_to_non_root();
netns_free(netns);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf
2025-04-09 21:45 ` [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
@ 2025-04-10 23:49 ` Martin KaFai Lau
0 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2025-04-10 23:49 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Wed, Apr 9, 2025 at 2:47 PM Amery Hung <ameryhung@gmail.com> wrote:
> +static int bpf_qdisc_sk_buff_access(struct bpf_verifier_log *log,
> + const struct bpf_reg_state *reg,
> + int off, size_t *end)
> +{
> + switch (off) {
> + case offsetof(struct sk_buff, tstamp):
> + *end = offsetofend(struct sk_buff, tstamp);
> + break;
> + case offsetof(struct sk_buff, priority):
> + *end = offsetofend(struct sk_buff, priority);
> + break;
> + case offsetof(struct sk_buff, mark):
> + *end = offsetofend(struct sk_buff, mark);
> + break;
> + case offsetof(struct sk_buff, queue_mapping):
> + *end = offsetofend(struct sk_buff, queue_mapping);
> + break;
> + case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, tc_classid):
> + *end = offsetof(struct sk_buff, cb) +
> + offsetofend(struct qdisc_skb_cb, tc_classid);
> + break;
> + case offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb, data[0]) ...
> + offsetof(struct sk_buff, cb) + offsetof(struct qdisc_skb_cb,
> + data[QDISC_CB_PRIV_LEN - 1]):
> + *end = offsetof(struct sk_buff, cb) +
> + offsetofend(struct qdisc_skb_cb, data[QDISC_CB_PRIV_LEN - 1]);
> + break;
> + case offsetof(struct sk_buff, tc_index):
> + *end = offsetofend(struct sk_buff, tc_index);
> + break;
I only kept the WRITE access to tstamp and data[] and removed everything
else. The removed fields can be revisited when they are really needed in
the future.
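In other words, roughly (illustration only, not the final code;
next_tx_time is a placeholder):

	/* still writable from a bpf qdisc program */
	skb->tstamp = next_tx_time;

	/* writes like these are now rejected by the verifier */
	skb->priority = 0;
	skb->mark = 0;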
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 00/10] bpf qdisc
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
` (9 preceding siblings ...)
2025-04-09 21:46 ` [PATCH bpf-next v7 10/10] selftests/bpf: Test attaching bpf qdisc to mq and non root Amery Hung
@ 2025-04-10 23:50 ` patchwork-bot+netdevbpf
10 siblings, 0 replies; 23+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-04-10 23:50 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
Hello:
This series was applied to bpf/bpf-next.git (net)
by Martin KaFai Lau <martin.lau@kernel.org>:
On Wed, 9 Apr 2025 14:45:56 -0700 you wrote:
> Hi all,
>
> This patchset aims to support implementing qdisc using bpf struct_ops.
> This version takes a step back and only implements the minimum support
> for bpf qdisc. 1) support of adding skb to bpf_list and bpf_rbtree
> directly and 2) classful qdisc are deferred to future patchsets. In
> addition, we only allow attaching bpf qdisc to root or mq for now.
> This is to prevent accidentally breaking exisiting classful qdiscs
> that rely on data in a child qdisc. This limit may be lifted in the
> future after careful inspection.
>
> [...]
Here is the summary with links:
- [bpf-next,v7,01/10] bpf: Prepare to reuse get_ctx_arg_idx
https://git.kernel.org/bpf/bpf-next/c/a5c234d9bac6
- [bpf-next,v7,02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf
(no matching commit)
- [bpf-next,v7,03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
https://git.kernel.org/bpf/bpf-next/c/81bb1c2c3e8e
- [bpf-next,v7,04/10] bpf: net_sched: Add a qdisc watchdog timer
https://git.kernel.org/bpf/bpf-next/c/f2ccce960e92
- [bpf-next,v7,05/10] bpf: net_sched: Support updating bstats
https://git.kernel.org/bpf/bpf-next/c/9658e4249727
- [bpf-next,v7,06/10] bpf: net_sched: Disable attaching bpf qdisc to non root
https://git.kernel.org/bpf/bpf-next/c/0ddc97ad5337
- [bpf-next,v7,07/10] libbpf: Support creating and destroying qdisc
https://git.kernel.org/bpf/bpf-next/c/8c52471cd771
- [bpf-next,v7,08/10] selftests/bpf: Add a basic fifo qdisc test
https://git.kernel.org/bpf/bpf-next/c/cb7e598ef547
- [bpf-next,v7,09/10] selftests/bpf: Add a bpf fq qdisc to selftest
https://git.kernel.org/bpf/bpf-next/c/69dfd77a5adc
- [bpf-next,v7,10/10] selftests/bpf: Test attaching bpf qdisc to mq and non root
https://git.kernel.org/bpf/bpf-next/c/93cb5375fb16
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test
2025-04-09 21:46 ` [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test Amery Hung
@ 2025-04-10 23:59 ` Martin KaFai Lau
2025-04-11 17:20 ` Amery Hung
0 siblings, 1 reply; 23+ messages in thread
From: Martin KaFai Lau @ 2025-04-10 23:59 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Wed, Apr 9, 2025 at 2:49 PM Amery Hung <ameryhung@gmail.com> wrote:
> +void test_bpf_qdisc(void)
nit. re-name to "test_ns_bpf_qdisc"....
> +{
> + struct netns_obj *netns;
> +
> + netns = netns_new("bpf_qdisc_ns", true);
... then this can be saved.
> + if (!ASSERT_OK_PTR(netns, "netns_new"))
> + return;
> +
> + if (test__start_subtest("fifo"))
> + test_fifo();
> +
> + netns_free(netns);
> +}
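i.e., something like this (a sketch, assuming test_progs' convention of
creating a fresh netns for tests whose function name carries the "ns_"
prefix):

	void test_ns_bpf_qdisc(void)
	{
		if (test__start_subtest("fifo"))
			test_fifo();
	}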
> diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> new file mode 100644
> index 000000000000..65a2c561c0bb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _BPF_QDISC_COMMON_H
> +#define _BPF_QDISC_COMMON_H
> +
> +#define NET_XMIT_SUCCESS 0x00
> +#define NET_XMIT_DROP 0x01 /* skb dropped */
> +#define NET_XMIT_CN 0x02 /* congestion notification */
> +
> +#define TC_PRIO_CONTROL 7
> +#define TC_PRIO_MAX 15
> +
> +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> +
> +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
> +void bpf_kfree_skb(struct sk_buff *p) __ksym;
> +void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
> +void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
> +void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
nit. These kfunc declarations should no longer be needed. vmlinux.h
should already have them; update pahole if vmlinux.h does not have
them.
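(i.e., a vmlinux.h regenerated with a recent pahole/bpftool, e.g.

	bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

should already carry these kfunc prototypes, so the manual __ksym
declarations can be dropped.)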
The set has been applied. Please consider following up the nits in
selftests. Thanks.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-09 21:45 ` [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
@ 2025-04-11 13:32 ` Kumar Kartikeya Dwivedi
2025-04-11 16:58 ` Amery Hung
0 siblings, 1 reply; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-11 13:32 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
>
> From: Amery Hung <amery.hung@bytedance.com>
>
> Add basic kfuncs for working on skb in qdisc.
>
> Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> in .enqueue where a to_free skb list is available from kernel to defer
> the release. bpf_kfree_skb() should be used elsewhere. It is also used
> in bpf_obj_free_fields() when cleaning up skb in maps and collections.
>
> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> to build flow-based queueing algorithms.
>
> Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
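In other words, a (hypothetical) sequence like the following would seem
to be accepted even though the dynptr still points into freed skb data:

	struct bpf_dynptr ptr;
	u8 buf[4];

	bpf_dynptr_from_skb(skb, 0, &ptr);	/* dynptr backed by the skb */
	bpf_kfree_skb(skb);			/* skb reference released */
	bpf_dynptr_read(buf, sizeof(buf), &ptr, 0, 0);	/* use-after-free? */

(Sketch only; assumes the kfunc signatures added in this patch.)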
> [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 13:32 ` Kumar Kartikeya Dwivedi
@ 2025-04-11 16:58 ` Amery Hung
2025-04-11 17:08 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-11 16:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > From: Amery Hung <amery.hung@bytedance.com>
> >
> > Add basic kfuncs for working on skb in qdisc.
> >
> > Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > in .enqueue where a to_free skb list is available from kernel to defer
> > the release. bpf_kfree_skb() should be used elsewhere. It is also used
> > in bpf_obj_free_fields() when cleaning up skb in maps and collections.
> >
> > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > to build flow-based queueing algorithms.
> >
> > Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > ---
>
> How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
>
Good question...
Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
the ref_obj_id of the object underlying a dynptr?
Then, in release_reference(), in addition to finding ref_obj_id in
registers, the verifier would also search stack slots and invalidate all
dynptrs with a matching ref_obj_id.
Does this sound like a feasible solution?
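Roughly (a sketch of the idea, not a tested patch):

	/* bpf_verifier.h: remember the object backing a dynptr */
	struct {
		enum bpf_dynptr_type type;
		bool first_slot;
		u32 ref_obj_id;	/* new: ref of the backing object, if any */
	} dynptr;

	/* verifier.c, release_reference(): besides marking matching
	 * registers unknown, walk the stack slots of every frame and
	 * invalidate any dynptr whose dynptr.ref_obj_id matches the
	 * released ref_obj_id.
	 */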
> > [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 16:58 ` Amery Hung
@ 2025-04-11 17:08 ` Kumar Kartikeya Dwivedi
2025-04-11 17:17 ` Amery Hung
2025-04-11 18:36 ` Martin KaFai Lau
0 siblings, 2 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-11 17:08 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Fri, 11 Apr 2025 at 18:59, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > From: Amery Hung <amery.hung@bytedance.com>
> > >
> > > Add basic kfuncs for working on skb in qdisc.
> > >
> > > Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> > > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > > in .enqueue where a to_free skb list is available from kernel to defer
> > > the release. bpf_kfree_skb() should be used elsewhere. It is also used
> > > in bpf_obj_free_fields() when cleaning up skb in maps and collections.
> > >
> > > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > > to build flow-based queueing algorithms.
> > >
> > > Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> > >
> > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > > ---
> >
> > How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
> >
>
> Good question...
>
> Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
> the ref_obj_id of the object underlying a dynptr?
>
> Then, in release_reference(), in addition to finding ref_obj_id in
> registers, verifier will also search stack slots and invalidate all
> dynptrs with the ref_obj_id.
>
> Does this sound like a feasible solution?
Yes, though I talked with Andrii and he has better ideas for doing
this generically; for now I think we can use this fix as a stopgap.
I will add a Fixes tag. I asked the question because I hit the same
issue when implementing a similar pattern in my patch and was
wondering how you solved it.
I made a similar fix to what you described for now.
>
> > > [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 17:08 ` Kumar Kartikeya Dwivedi
@ 2025-04-11 17:17 ` Amery Hung
2025-04-11 17:22 ` Kumar Kartikeya Dwivedi
2025-04-11 18:36 ` Martin KaFai Lau
1 sibling, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-11 17:17 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Fri, Apr 11, 2025 at 10:08 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 11 Apr 2025 at 18:59, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
> > > >
> > > > From: Amery Hung <amery.hung@bytedance.com>
> > > >
> > > > Add basic kfuncs for working on skb in qdisc.
> > > >
> > > > Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> > > > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > > > in .enqueue where a to_free skb list is available from kernel to defer
> > > > the release. bpf_kfree_skb() should be used elsewhere. It is also used
> > > > in bpf_obj_free_fields() when cleaning up skb in maps and collections.
> > > >
> > > > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > > > to build flow-based queueing algorithms.
> > > >
> > > > Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> > > >
> > > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > > Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > > > ---
> > >
> > > How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
> > >
> >
> > Good question...
> >
> > Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
> > the ref_obj_id of the object underlying a dynptr?
> >
> > Then, in release_reference(), in addition to finding ref_obj_id in
> > registers, verifier will also search stack slots and invalidate all
> > dynptrs with the ref_obj_id.
> >
> > Does this sound like a feasible solution?
>
> Yes, though I talked with Andrii and he has better ideas for doing
> this generically, but for now I think we can make this fix as a
> stopgap.
Sounds good. Just making sure I am not doing redundant work: you will
send the fix you made, right?
Thanks,
Amery
> I will add a fixes tag, asked the question because I had the same
> question when implementing a similar pattern for my patch, and was
> wondering how you solved it.
>
> I made a similar fix to what you described for now.
>
> >
> > > > [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test
2025-04-10 23:59 ` Martin KaFai Lau
@ 2025-04-11 17:20 ` Amery Hung
0 siblings, 0 replies; 23+ messages in thread
From: Amery Hung @ 2025-04-11 17:20 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Thu, Apr 10, 2025 at 4:59 PM Martin KaFai Lau <iamkafai@gmail.com> wrote:
>
> On Wed, Apr 9, 2025 at 2:49 PM Amery Hung <ameryhung@gmail.com> wrote:
> > +void test_bpf_qdisc(void)
>
> nit. re-name to "test_ns_bpf_qdisc"....
>
> > +{
> > + struct netns_obj *netns;
> > +
> > + netns = netns_new("bpf_qdisc_ns", true);
>
> ... then this can be saved.
>
> > + if (!ASSERT_OK_PTR(netns, "netns_new"))
> > + return;
> > +
> > + if (test__start_subtest("fifo"))
> > + test_fifo();
> > +
> > + netns_free(netns);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > new file mode 100644
> > index 000000000000..65a2c561c0bb
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h
> > @@ -0,0 +1,31 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _BPF_QDISC_COMMON_H
> > +#define _BPF_QDISC_COMMON_H
> > +
> > +#define NET_XMIT_SUCCESS 0x00
> > +#define NET_XMIT_DROP 0x01 /* skb dropped */
> > +#define NET_XMIT_CN 0x02 /* congestion notification */
> > +
> > +#define TC_PRIO_CONTROL 7
> > +#define TC_PRIO_MAX 15
> > +
> > +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
> > +
> > +u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
> > +void bpf_kfree_skb(struct sk_buff *p) __ksym;
> > +void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym;
> > +void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym;
> > +void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym;
>
> nit. These kfunc declarations should be no longer needed. vmlinux.h
> should already have them. Update pahole if vmlinux.h does not have
> them.
>
> The set has been applied. Please consider following up the nits in
> selftests. Thanks.
I will send a follow up addressing two things you mentioned. Thanks
for the suggestions.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 17:17 ` Amery Hung
@ 2025-04-11 17:22 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 23+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-11 17:22 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On Fri, 11 Apr 2025 at 19:18, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Fri, Apr 11, 2025 at 10:08 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 11 Apr 2025 at 18:59, Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
> > > > >
> > > > > From: Amery Hung <amery.hung@bytedance.com>
> > > > >
> > > > > Add basic kfuncs for working on skb in qdisc.
> > > > >
> > > > > Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> > > > > a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> > > > > in .enqueue where a to_free skb list is available from kernel to defer
> > > > > the release. bpf_kfree_skb() should be used elsewhere. It is also used
> > > > > in bpf_obj_free_fields() when cleaning up skb in maps and collections.
> > > > >
> > > > > bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> > > > > to build flow-based queueing algorithms.
> > > > >
> > > > > Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> > > > >
> > > > > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > > > > Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > > > > ---
> > > >
> > > > How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
> > > >
> > >
> > > Good question...
> > >
> > > Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
> > > the ref_obj_id of the object underlying a dynptr?
> > >
> > > Then, in release_reference(), in addition to finding ref_obj_id in
> > > registers, verifier will also search stack slots and invalidate all
> > > dynptrs with the ref_obj_id.
> > >
> > > Does this sound like a feasible solution?
> >
> > Yes, though I talked with Andrii and he has better ideas for doing
> > this generically, but for now I think we can make this fix as a
> > stopgap.
>
> Sounds good. Just making sure I am not doing redundant work, you will
> send the fix you made, right?
>
Yes.
> Thanks,
> Amery
>
> > I will add a fixes tag, asked the question because I had the same
> > question when implementing a similar pattern for my patch, and was
> > wondering how you solved it.
> >
> > I made a similar fix to what you described for now.
> >
> > >
> > > > > [...]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 17:08 ` Kumar Kartikeya Dwivedi
2025-04-11 17:17 ` Amery Hung
@ 2025-04-11 18:36 ` Martin KaFai Lau
2025-04-11 20:03 ` Amery Hung
1 sibling, 1 reply; 23+ messages in thread
From: Martin KaFai Lau @ 2025-04-11 18:36 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Amery Hung
Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, edumazet, kuba,
xiyou.wangcong, jhs, martin.lau, jiri, stfomichev, toke,
sinquersw, ekarani.silvestre, yangpeihao, yepeilin.cs,
kernel-team
On 4/11/25 10:08 AM, Kumar Kartikeya Dwivedi wrote:
> On Fri, 11 Apr 2025 at 18:59, Amery Hung <ameryhung@gmail.com> wrote:
>>
>> On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
>> <memxor@gmail.com> wrote:
>>>
>>> On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
>>>>
>>>> From: Amery Hung <amery.hung@bytedance.com>
>>>>
>>>> Add basic kfuncs for working on skb in qdisc.
>>>>
>>>> Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
>>>> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
>>>> in .enqueue where a to_free skb list is available from kernel to defer
>>>> the release. bpf_kfree_skb() should be used elsewhere. It is also used
>>>> in bpf_obj_free_fields() when cleaning up skb in maps and collections.
>>>>
>>>> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
>>>> to build flow-based queueing algorithms.
>>>>
>>>> Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
>>>>
>>>> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
>>>> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>> ---
>>>
>>> How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
>>>
>>
>> Good question...
>>
>> Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
>> the ref_obj_id of the object underlying a dynptr?
>>
>> Then, in release_reference(), in addition to finding ref_obj_id in
>> registers, verifier will also search stack slots and invalidate all
>> dynptrs with the ref_obj_id.
>>
>> Does this sound like a feasible solution?
>
> Yes, though I talked with Andrii and he has better ideas for doing
> this generically, but for now I think we can make this fix as a
> stopgap.
In case the better fix takes longer, just want to mention that an option is
to remove bpf_dynptr_from_skb() from bpf qdisc. I don't see an urgent need
for a bpf qdisc to be able to directly access skb->data. btw, I don't
think a bpf qdisc should write to skb->data.
The same goes for bpf_kfree_skb(). I was wondering whether it is useful at all
considering there is already a bpf_qdisc_skb_drop(). I kept it there because it
is a little more intuitive in case .reset/.destroy wanted to do a "skb =
bpf_kptr_xchg(&skbn->skb, NULL);" and then explicitly free the skb with
bpf_kfree_skb(skb). However, the bpf prog can also directly do
bpf_obj_drop(skbn), and then bpf_kfree_skb() is not needed, right?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 18:36 ` Martin KaFai Lau
@ 2025-04-11 20:03 ` Amery Hung
2025-04-11 20:11 ` Martin KaFai Lau
0 siblings, 1 reply; 23+ messages in thread
From: Amery Hung @ 2025-04-11 20:03 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Kumar Kartikeya Dwivedi, bpf, netdev, alexei.starovoitov, andrii,
daniel, edumazet, kuba, xiyou.wangcong, jhs, martin.lau, jiri,
stfomichev, toke, sinquersw, ekarani.silvestre, yangpeihao,
yepeilin.cs, kernel-team
On Fri, Apr 11, 2025 at 11:37 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 4/11/25 10:08 AM, Kumar Kartikeya Dwivedi wrote:
> > On Fri, 11 Apr 2025 at 18:59, Amery Hung <ameryhung@gmail.com> wrote:
> >>
> >> On Fri, Apr 11, 2025 at 6:32 AM Kumar Kartikeya Dwivedi
> >> <memxor@gmail.com> wrote:
> >>>
> >>> On Wed, 9 Apr 2025 at 23:46, Amery Hung <ameryhung@gmail.com> wrote:
> >>>>
> >>>> From: Amery Hung <amery.hung@bytedance.com>
> >>>>
> >>>> Add basic kfuncs for working on skb in qdisc.
> >>>>
> >>>> Both bpf_qdisc_skb_drop() and bpf_kfree_skb() can be used to release
> >>>> a reference to an skb. However, bpf_qdisc_skb_drop() can only be called
> >>>> in .enqueue where a to_free skb list is available from kernel to defer
> >>>> the release. bpf_kfree_skb() should be used elsewhere. It is also used
> >>>> in bpf_obj_free_fields() when cleaning up skb in maps and collections.
> >>>>
> >>>> bpf_skb_get_hash() returns the flow hash of an skb, which can be used
> >>>> to build flow-based queueing algorithms.
> >>>>
> >>>> Finally, allow users to create read-only dynptr via bpf_dynptr_from_skb().
> >>>>
> >>>> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> >>>> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >>>> ---
> >>>
> >>> How do we prevent UAF when dynptr is accessed after bpf_kfree_skb?
> >>>
> >>
> >> Good question...
> >>
> >> Maybe we can add a ref_obj_id field to bpf_reg_state->dynptr to track
> >> the ref_obj_id of the object underlying a dynptr?
> >>
> >> Then, in release_reference(), in addition to finding ref_obj_id in
> >> registers, verifier will also search stack slots and invalidate all
> >> dynptrs with the ref_obj_id.
> >>
> >> Does this sound like a feasible solution?
> >
> > Yes, though I talked with Andrii and he has better ideas for doing
> > this generically, but for now I think we can make this fix as a
> > stopgap.
>
> In case the better fix will take longer, just want to mention that an option is
> to remove the bpf_dynptr_from_skb() from bpf qdisc. I don't see an urgent need
> for the bpf qdisc to be able to directly access the skb->data. btw, I don't
> think bpf qdisc should write to the skb->data.
>
> The same goes for the bpf_kfree_skb(). I was thinking if it is useful at all
> considering there is already a bpf_qdisc_skb_drop(). I kept it there because it
> is a little more intuitive in case the .reset/.destroy wanted to do a "skb =
> bpf_kptr_xchg(&skbn->skb, NULL);" and then explicitly free the
> bpf_kfree_skb(skb). However, the bpf prog can also directly do the
> bpf_obj_drop(skbn) and then bpf_kfree_skb() is not needed, right?
>
>
My rationale for keeping two skb-releasing kfuncs: bpf_kfree_skb() is
the dtor, and since a dtor can only take one argument,
bpf_qdisc_skb_drop() cannot replace it. Since bpf_kfree_skb() is here
to stay, I allow users to call it directly for convenience. Only
exposing bpf_qdisc_skb_drop() and calling kfree_skb() in
bpf_qdisc_skb_drop() when to_free is NULL would also work. I don’t have a
strong opinion.
Yes, bpf_kfree_skb() will not be needed if doing bpf_obj_drop(skbn).
bpf_obj_drop() internally calls the dtor of any kptr (in this case,
bpf_kfree_skb()) in the allocated object.
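For completeness, the two equivalent cleanup patterns (a sketch based on
the skb_node layout used in the selftests):

	/* explicit: take the skb out of the node, then free both */
	skb = bpf_kptr_xchg(&skbn->skb, NULL);
	if (skb)
		bpf_kfree_skb(skb);
	bpf_obj_drop(skbn);

	/* implicit: dropping the node runs the kptr dtor (bpf_kfree_skb) */
	bpf_obj_drop(skbn);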
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs
2025-04-11 20:03 ` Amery Hung
@ 2025-04-11 20:11 ` Martin KaFai Lau
0 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2025-04-11 20:11 UTC (permalink / raw)
To: Amery Hung
Cc: Kumar Kartikeya Dwivedi, bpf, netdev, alexei.starovoitov, andrii,
daniel, edumazet, kuba, xiyou.wangcong, jhs, martin.lau, jiri,
stfomichev, toke, sinquersw, ekarani.silvestre, yangpeihao,
yepeilin.cs, kernel-team
On 4/11/25 1:03 PM, Amery Hung wrote:
>> The same goes for the bpf_kfree_skb(). I was thinking if it is useful at all
>> considering there is already a bpf_qdisc_skb_drop(). I kept it there because it
>> is a little more intuitive in case the .reset/.destroy wanted to do a "skb =
>> bpf_kptr_xchg(&skbn->skb, NULL);" and then explicitly free the
>> bpf_kfree_skb(skb). However, the bpf prog can also directly do the
>> bpf_obj_drop(skbn) and then bpf_kfree_skb() is not needed, right?
>>
>>
>
> My rationale for keeping two skb releasing kfuncs: bpf_kfree_skb() is
> the dtor and since dtor can only have one argument, so
> bpf_qdisc_skb_drop() can not replace it. Since bpf_kfree_skb() is here
> to stay, I allow users to call it directly for convenience. Only
> exposing bpf_qdisc_skb_drop() and calling kfree_skb() in
> bpf_qdisc_skb_drop() when to_free is NULL will also do. I don’t have a
> strong opinion.
>
> Yes, bpf_kfree_skb() will not be needed if doing bpf_obj_drop(skbn).
> bpf_obj_drop() internally will call the dtor of a kptr (i.e., in this
> case, bpf_kfree_skb()) in an allocated object.
Makes sense. Let's keep bpf_kfree_skb() as-is instead of complicating
bpf_qdisc_skb_drop() with a NULL vs non-NULL argument.
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-04-11 20:11 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-09 21:45 [PATCH bpf-next v7 00/10] bpf qdisc Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 01/10] bpf: Prepare to reuse get_ctx_arg_idx Amery Hung
2025-04-09 21:45 ` [PATCH bpf-next v7 02/10] bpf: net_sched: Support implementation of Qdisc_ops in bpf Amery Hung
2025-04-10 23:49 ` Martin KaFai Lau
2025-04-09 21:45 ` [PATCH bpf-next v7 03/10] bpf: net_sched: Add basic bpf qdisc kfuncs Amery Hung
2025-04-11 13:32 ` Kumar Kartikeya Dwivedi
2025-04-11 16:58 ` Amery Hung
2025-04-11 17:08 ` Kumar Kartikeya Dwivedi
2025-04-11 17:17 ` Amery Hung
2025-04-11 17:22 ` Kumar Kartikeya Dwivedi
2025-04-11 18:36 ` Martin KaFai Lau
2025-04-11 20:03 ` Amery Hung
2025-04-11 20:11 ` Martin KaFai Lau
2025-04-09 21:46 ` [PATCH bpf-next v7 04/10] bpf: net_sched: Add a qdisc watchdog timer Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 05/10] bpf: net_sched: Support updating bstats Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 06/10] bpf: net_sched: Disable attaching bpf qdisc to non root Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 07/10] libbpf: Support creating and destroying qdisc Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 08/10] selftests/bpf: Add a basic fifo qdisc test Amery Hung
2025-04-10 23:59 ` Martin KaFai Lau
2025-04-11 17:20 ` Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 09/10] selftests/bpf: Add a bpf fq qdisc to selftest Amery Hung
2025-04-09 21:46 ` [PATCH bpf-next v7 10/10] selftests/bpf: Test attaching bpf qdisc to mq and non root Amery Hung
2025-04-10 23:50 ` [PATCH bpf-next v7 00/10] bpf qdisc patchwork-bot+netdevbpf