netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data
@ 2025-09-05 17:33 Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

RFC v1 -> v2
  Rebase onto bpf-next

  Will rebase onto Christoph's mlx5 patchset that avoids copying the
  payload to the linear part, if it is merged first, or separate the
  mlx5 fix from this set.

  patch 1
  - Remove the unnecessary head frag search (Dragos)
  - Rewind the end frag pointer to simplify the change (Dragos)
  - Rewind the end frag pointer and recalculate truesize only when the
    number of frags changed (Dragos)

  patch 3
  - Fix len == zero behavior. To mirror bpf_skb_pull_data() correctly,
    the kfunc should do nothing (Stanislav)
  - Fix a pointer wrap around bug (Jakub)
  - Use memmove() when moving sinfo->frags (Jakub)
  
---

Hi all,

This patchset introduces a new kfunc bpf_xdp_pull_data() to allow
pulling nonlinear xdp data. This may be useful when a driver places
headers in fragments. When an xdp program would like to keep parsing
packet headers using direct packet access, it can call
bpf_xdp_pull_data() to make the header available in the linear data
area. The kfunc can also be used to decapsulate the header in the
nonlinear data, as currently there is no easy way to do this.
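
The flow described above can be sketched as follows. This is only a
userspace model for illustration: struct xdp_md here carries real
pointers and bpf_xdp_pull_data() is stubbed over one contiguous buffer
(the real kfunc copies bytes out of fragments), so only the calling
pattern — pull, then re-derive and re-check packet pointers — mirrors
the real thing. A little-endian host is assumed for the protocol check.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct xdp_md { uintptr_t data; uintptr_t data_end; };

static uintptr_t g_data_hard_end;	/* models xdp_data_hard_end(xdp) */

enum { XDP_DROP = 1, XDP_PASS = 2 };

/* Stub of the kfunc: pretend all bytes live in one buffer, so "pulling"
 * is just moving data_end forward (bounded by the hard end). */
static int bpf_xdp_pull_data(struct xdp_md *xdp, uint32_t len, uint64_t flags)
{
	if (flags || xdp->data + len > g_data_hard_end)
		return -22;				/* -EINVAL */
	if (xdp->data + len > xdp->data_end)
		xdp->data_end = xdp->data + len;	/* bytes now "linear" */
	return 0;
}

struct eth { uint8_t dst[6], src[6]; uint16_t proto; };

/* The XDP program body: with mlx5 striding RQ the linear area may be
 * empty, so pull the Ethernet header in before dereferencing it. */
static int xdp_prog(struct xdp_md *xdp)
{
	struct eth *eth;

	if (bpf_xdp_pull_data(xdp, sizeof(*eth), 0))
		return XDP_DROP;

	/* packet pointers must be re-derived and re-checked after the call */
	eth = (struct eth *)xdp->data;
	if (xdp->data + sizeof(*eth) > xdp->data_end)
		return XDP_DROP;

	return eth->proto == 0x0008 ? XDP_PASS : XDP_DROP; /* 0x0800 on wire */
}
```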

This patchset also fixes an issue in the mlx5e driver. The driver
currently assumes the packet layout to be unchanged after an xdp
program runs and may generate packets with corrupted data or trigger a
kernel warning if the xdp program calls a layout-changing kfunc such as
bpf_xdp_adjust_tail(), bpf_xdp_adjust_head() or the bpf_xdp_pull_data()
introduced in this set.

Tested with the added bpf selftest using bpf test_run and also on mlx5
with tools/testing/selftests/drivers/net/{xdp.py, ping.py}. mlx5 with
striding RQ always passes an xdp_buff with empty linear data to xdp
programs, so xdp.test_xdp_native_pass_mb failed to parse headers before
this patchset.

Thanks!
Amery

---

Amery Hung (7):
  net/mlx5e: Fix generating skb from nonlinear xdp_buff
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Support pulling non-linear xdp data
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support specifying linear xdp packet data size in test_run
  selftests/bpf: Test bpf_xdp_pull_data
  selftests: drv-net: Pull data before parsing headers

 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  38 ++++++-
 include/net/xdp_sock_drv.h                    |  21 +++-
 kernel/bpf/verifier.c                         |  13 +++
 net/bpf/test_run.c                            |   9 +-
 net/core/filter.c                             | 104 ++++++++++++++++--
 .../bpf/prog_tests/xdp_context_test_run.c     |   4 +-
 .../selftests/bpf/prog_tests/xdp_pull_data.c  |  96 ++++++++++++++++
 .../selftests/bpf/progs/test_xdp_pull_data.c  |  36 ++++++
 .../selftests/net/lib/xdp_native.bpf.c        |  90 ++++++++++++---
 9 files changed, 372 insertions(+), 39 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pull_data.c

-- 
2.47.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-08 14:41   ` Dragos Tatulea
  2025-09-05 17:33 ` [PATCH bpf-next v2 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

xdp programs can change the layout of an xdp_buff through
bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(), so the driver cannot
assume the size of the linear data area nor of the fragments. Fix the
bug in mlx5 by generating the skb according to the xdp_buff after xdp
programs run.

Currently, when handling multi-buf xdp, the mlx5 driver assumes the
layout of an xdp_buff to be unchanged. That is, the linear data area
continues to be empty and the fragments remain the same. This may cause
the driver to generate an erroneous skb or trigger a kernel warning.
When an xdp program adds linear data through bpf_xdp_adjust_head(), the
linear data will be ignored, as mlx5e_build_linear_skb() builds an skb
without linear data and then pulls data from fragments to fill the
linear data area. When an xdp program has shrunk the non-linear data
through bpf_xdp_adjust_tail(), the delta passed to __pskb_pull_tail()
may exceed the actual nonlinear data size and trigger the BUG_ON() in
it.

To fix the issue, first record the original number of fragments. If the
number of fragments changes after the xdp program runs, rewind the end
fragment pointer by the difference and recalculate the truesize. Then,
build the skb with linear data area matching the xdp_buff. Finally, only
pull data in if there is non-linear data and fill the linear part up to
256 bytes.
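
The rewind described above is simple cursor and truesize arithmetic; a
minimal userspace model (field names are illustrative, not the driver's
actual structs — in the driver, wi is a struct mlx5e_wqe_frag_info
cursor and frag_stride comes from the RQ parameters):

```c
#include <assert.h>
#include <stdint.h>

/* Models the fix: if XDP released trailing fragments, step the
 * fragment cursor back by the difference and drop the freed strides
 * from truesize. */
struct rewind { uint8_t wi_idx; uint32_t truesize; };

static void rewind_frags(struct rewind *r, uint8_t old_nr_frags,
			 uint8_t nr_frags, uint32_t frag_stride)
{
	uint8_t nr_frags_free = old_nr_frags - nr_frags;

	r->wi_idx -= nr_frags_free;		/* rewind the cursor */
	r->truesize -= nr_frags_free * frag_stride; /* drop freed strides */
}
```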

Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 38 +++++++++++++++++--
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11..6b6bb90cf003 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1729,6 +1729,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	struct mlx5e_wqe_frag_info *head_wi = wi;
 	u16 rx_headroom = rq->buff.headroom;
 	struct mlx5e_frag_page *frag_page;
+	u8 nr_frags_free, old_nr_frags;
 	struct skb_shared_info *sinfo;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
@@ -1772,17 +1773,27 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		wi++;
 	}
 
+	old_nr_frags = sinfo->nr_frags;
+
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			struct mlx5e_wqe_frag_info *pwi;
 
+			wi -= old_nr_frags - sinfo->nr_frags;
+
 			for (pwi = head_wi; pwi < wi; pwi++)
 				pwi->frag_page->frags++;
 		}
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
+	nr_frags_free = old_nr_frags - sinfo->nr_frags;
+	if (unlikely(nr_frags_free)) {
+		wi -= nr_frags_free;
+		truesize -= nr_frags_free * frag_info->frag_stride;
+	}
+
 	skb = mlx5e_build_linear_skb(
 		rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
 		mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
@@ -2004,6 +2015,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	u32 byte_cnt       = cqe_bcnt;
 	struct skb_shared_info *sinfo;
 	unsigned int truesize = 0;
+	u32 pg_consumed_bytes;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	u32 linear_frame_sz;
@@ -2057,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 	while (byte_cnt) {
 		/* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
-		u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
+		pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
 
 		if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
 			truesize += pg_consumed_bytes;
@@ -2073,10 +2085,15 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	}
 
 	if (prog) {
+		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+		u32 len;
+
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 				struct mlx5e_frag_page *pfp;
 
+				frag_page -= old_nr_frags - sinfo->nr_frags;
+
 				for (pfp = head_page; pfp < frag_page; pfp++)
 					pfp->frags++;
 
@@ -2087,9 +2104,22 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
+		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
+
+		nr_frags_free = old_nr_frags - sinfo->nr_frags;
+		if (unlikely(nr_frags_free)) {
+			frag_page -= nr_frags_free;
+
+			/* the last frag is always freed first */
+			truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
+			while (--nr_frags_free)
+				truesize -= nr_frags_free *
+					    ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz));
+		}
+
 		skb = mlx5e_build_linear_skb(
 			rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
-			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, len,
 			mxbuf->xdp.data - mxbuf->xdp.data_meta);
 		if (unlikely(!skb)) {
 			mlx5e_page_release_fragmented(rq->page_pool,
@@ -2114,8 +2144,10 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			do
 				pagep->frags++;
 			while (++pagep < frag_page);
+
+			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
+			__pskb_pull_tail(skb, headlen);
 		}
-		__pskb_pull_tail(skb, headlen);
 	} else {
 		dma_addr_t addr;
 
-- 
2.47.3



* [PATCH bpf-next v2 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data Amery Hung
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.

Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.
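
As a rough model of the extended helper (ignoring XSK buffers and page
release, which the real bpf_xdp_shrink_data() also handles), the
head/tail distinction only changes whether the fragment offset advances:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified skb_frag_t: shrinking from the tail only reduces the
 * size; shrinking from the head also advances the offset. Consuming
 * the whole fragment reports "release" so the caller can drop it from
 * the fragment array. */
struct frag { uint32_t off, size; };

static int shrink_data(struct frag *frag, uint32_t shrink, int tail)
{
	int release = (frag->size == shrink);

	if (!release) {
		if (!tail)
			frag->off += shrink;	/* head shrink: skip bytes */
		frag->size -= shrink;
	}
	return release;
}
```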

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 include/net/xdp_sock_drv.h | 21 ++++++++++++++++++---
 net/core/filter.c          | 28 +++++++++++++++++-----------
 2 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 513c8e9704f6..4f2d3268a676 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -160,13 +160,23 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
 	return ret;
 }
 
-static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
 {
-	struct xdp_buff_xsk *xskb = container_of(tail, struct xdp_buff_xsk, xdp);
+	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
 
 	list_del(&xskb->list_node);
 }
 
+static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
+{
+	struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp);
+	struct xdp_buff_xsk *frag;
+
+	frag = list_first_entry(&xskb->pool->xskb_list, struct xdp_buff_xsk,
+				list_node);
+	return &frag->xdp;
+}
+
 static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
 {
 	struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp);
@@ -389,8 +399,13 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
 	return NULL;
 }
 
-static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
+{
+}
+
+static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
 {
+	return NULL;
 }
 
 static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..0b82cb348ce0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4153,27 +4153,31 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
 	return 0;
 }
 
-static void bpf_xdp_shrink_data_zc(struct xdp_buff *xdp, int shrink,
+static void bpf_xdp_shrink_data_zc(struct xdp_buff *xdp, int shrink, bool tail,
 				   enum xdp_mem_type mem_type, bool release)
 {
-	struct xdp_buff *zc_frag = xsk_buff_get_tail(xdp);
+	struct xdp_buff *zc_frag = tail ? xsk_buff_get_tail(xdp) :
+					  xsk_buff_get_head(xdp);
 
 	if (release) {
-		xsk_buff_del_tail(zc_frag);
+		xsk_buff_del_frag(zc_frag);
 		__xdp_return(0, mem_type, false, zc_frag);
 	} else {
-		zc_frag->data_end -= shrink;
+		if (tail)
+			zc_frag->data_end -= shrink;
+		else
+			zc_frag->data += shrink;
 	}
 }
 
 static bool bpf_xdp_shrink_data(struct xdp_buff *xdp, skb_frag_t *frag,
-				int shrink)
+				int shrink, bool tail)
 {
 	enum xdp_mem_type mem_type = xdp->rxq->mem.type;
 	bool release = skb_frag_size(frag) == shrink;
 
 	if (mem_type == MEM_TYPE_XSK_BUFF_POOL) {
-		bpf_xdp_shrink_data_zc(xdp, shrink, mem_type, release);
+		bpf_xdp_shrink_data_zc(xdp, shrink, tail, mem_type, release);
 		goto out;
 	}
 
@@ -4181,6 +4185,12 @@ static bool bpf_xdp_shrink_data(struct xdp_buff *xdp, skb_frag_t *frag,
 		__xdp_return(skb_frag_netmem(frag), mem_type, false, NULL);
 
 out:
+	if (!release) {
+		if (!tail)
+			skb_frag_off_add(frag, shrink);
+		skb_frag_size_sub(frag, shrink);
+	}
+
 	return release;
 }
 
@@ -4198,12 +4208,8 @@ static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
 
 		len_free += shrink;
 		offset -= shrink;
-		if (bpf_xdp_shrink_data(xdp, frag, shrink)) {
+		if (bpf_xdp_shrink_data(xdp, frag, shrink, true))
 			n_frags_free++;
-		} else {
-			skb_frag_size_sub(frag, shrink);
-			break;
-		}
 	}
 	sinfo->nr_frags -= n_frags_free;
 	sinfo->xdp_frags_size -= len_free;
-- 
2.47.3



* [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-08 19:27   ` Martin KaFai Lau
  2025-09-09  1:54   ` Jakub Kicinski
  2025-09-05 17:33 ` [PATCH bpf-next v2 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

Add a kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
the first len bytes of data directly readable and writable in bpf
programs. If the "len" argument is larger than the linear data size,
data in fragments will be copied to the linear region, provided there
is enough room between xdp->data_end and xdp_data_hard_end(xdp), which
is subject to the driver implementation.

A use case of the kfunc is to decapsulate headers residing in xdp
fragments. It is possible for a NIC driver to place headers in xdp
fragments. To keep using direct packet access for parsing and
decapsulating headers, users can pull headers into the linear data
area by calling bpf_xdp_pull_data() and then pop the header with
bpf_xdp_adjust_head().

An unused argument, flags, is reserved for future extension (e.g.,
tossing the data instead of copying it to the linear data area).
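
The copy loop at the heart of the kfunc can be modeled in userspace
roughly as follows (hypothetical struct names; the kernel version works
on struct xdp_buff, skb_frag_t and the skb_shared_info fragment array,
and additionally handles XSK buffers and page release):

```c
#include <assert.h>
#include <string.h>

struct frag { const unsigned char *data; unsigned int size; };

struct pkt {
	unsigned char *linear;	/* linear data area */
	unsigned int linear_len; /* bytes currently in the linear area */
	unsigned int linear_cap; /* room up to data_hard_end */
	struct frag frags[8];
	unsigned int nr_frags;
	unsigned int frags_size; /* total bytes in fragments */
};

/* Copy bytes from the fragments into the tail of the linear area until
 * `len` bytes are linear, then compact the fragment array with
 * memmove(), as the patch does for sinfo->frags. */
static int pkt_pull_data(struct pkt *p, unsigned int len)
{
	unsigned int i, n_frags_free = 0, len_free = 0;
	int delta = (int)len - (int)p->linear_len;

	if (len > p->linear_len + p->frags_size)
		return -1;	/* larger than the whole packet */
	if (len > p->linear_cap)
		return -1;	/* no room before data_hard_end */
	if (delta <= 0)
		return 0;	/* already linear; covers len == 0 too */

	for (i = 0; i < p->nr_frags && delta; i++) {
		struct frag *f = &p->frags[i];
		unsigned int shrink = (unsigned int)delta < f->size ?
				      (unsigned int)delta : f->size;

		memcpy(p->linear + p->linear_len + len_free, f->data, shrink);
		len_free += shrink;
		delta -= shrink;
		if (shrink == f->size) {
			n_frags_free++;		/* fragment fully consumed */
		} else {
			f->data += shrink;	/* shrink from the head */
			f->size -= shrink;
		}
	}

	/* compact the fragment array, like memmove() on sinfo->frags */
	memmove(p->frags, p->frags + n_frags_free,
		(p->nr_frags - n_frags_free) * sizeof(struct frag));
	p->nr_frags -= n_frags_free;
	p->frags_size -= len_free;
	p->linear_len += len_free;
	return 0;
}
```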

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 net/core/filter.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 0b82cb348ce0..0b161106debd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12212,6 +12212,81 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 	return 0;
 }
 
+/**
+ * bpf_xdp_pull_data() - Pull in non-linear xdp data.
+ * @x: &xdp_md associated with the XDP buffer
+ * @len: length of data to be made directly accessible in the linear part
+ * @flags: future use, must be zero
+ *
+ * Pull in non-linear data in case the XDP buffer associated with @x is
+ * non-linear and not all @len bytes are in the linear data area.
+ *
+ * Direct packet access allows reading and writing linear XDP data through
+ * packet pointers (i.e., &xdp_md->data + offsets). When an eBPF program wants
+ * to directly access data that may be in the non-linear area, call this kfunc
+ * to make sure the data is available in the linear area.
+ *
+ * This kfunc can also be used with bpf_xdp_adjust_head() to decapsulate
+ * headers in the non-linear data area.
+ *
+ * A call to this kfunc may change the underlying packet buffer.
+ * Therefore, at load time, all checks on pointers previously done by the
+ * verifier are invalidated and must be performed again, if the kfunc is used
+ * in combination with direct packet access.
+ *
+ * Return:
+ * * %0         - success
+ * * %-EINVAL   - invalid len or flags
+ */
+__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)x;
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	void *data_hard_end = xdp_data_hard_end(xdp);
+	void *data_end = xdp->data + len;
+	int i, delta, n_frags_free = 0, len_free = 0;
+
+	if (flags)
+		return -EINVAL;
+
+	if (unlikely(len > xdp_get_buff_len(xdp)))
+		return -EINVAL;
+
+	if (unlikely(data_end < xdp->data || data_end > data_hard_end))
+		return -EINVAL;
+
+	delta = data_end - xdp->data_end;
+	if (delta <= 0)
+		return 0;
+
+	for (i = 0; i < sinfo->nr_frags && delta; i++) {
+		skb_frag_t *frag = &sinfo->frags[i];
+		u32 shrink = min_t(u32, delta, skb_frag_size(frag));
+
+		memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
+
+		len_free += shrink;
+		delta -= shrink;
+		if (bpf_xdp_shrink_data(xdp, frag, shrink, false))
+			n_frags_free++;
+	}
+
+	if (unlikely(n_frags_free)) {
+		memmove(sinfo->frags, sinfo->frags + n_frags_free,
+			(sinfo->nr_frags - n_frags_free) * sizeof(skb_frag_t));
+
+		sinfo->nr_frags -= n_frags_free;
+
+		if (!sinfo->nr_frags)
+			xdp_buff_clear_frags_flag(xdp);
+	}
+
+	sinfo->xdp_frags_size -= len_free;
+	xdp->data_end = data_end;
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12239,6 +12314,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_skb_meta)
 
 BTF_KFUNCS_START(bpf_kfunc_check_set_xdp)
 BTF_ID_FLAGS(func, bpf_dynptr_from_xdp)
+BTF_ID_FLAGS(func, bpf_xdp_pull_data)
 BTF_KFUNCS_END(bpf_kfunc_check_set_xdp)
 
 BTF_KFUNCS_START(bpf_kfunc_check_set_sock_addr)
-- 
2.47.3



* [PATCH bpf-next v2 4/7] bpf: Clear packet pointers after changing packet data in kfuncs
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (2 preceding siblings ...)
  2025-09-05 17:33 ` [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.
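
As a userspace analogy for why the verifier must clear packet pointers
(names illustrative; realloc() stands in for the kfunc's possible
rewrite of the packet layout):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct pkt { unsigned char *data; size_t len; };

/* Like bpf_xdp_pull_data(), this call may move the underlying storage,
 * so any pointer derived from p->data before the call is stale
 * afterwards and must be re-derived. */
static int pull(struct pkt *p, size_t len)
{
	unsigned char *n;

	if (len <= p->len)
		return 0;		/* nothing to do, like delta <= 0 */
	n = realloc(p->data, len);
	if (!n)
		return -1;
	memset(n + p->len, 0, len - p->len);
	p->data = n;			/* old p->data pointers now dangle */
	p->len = len;
	return 0;
}
```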

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 kernel/bpf/verifier.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b9394f8fac0e..2355f2999773 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12235,6 +12235,7 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_from_skb,
 	KF_bpf_dynptr_from_xdp,
 	KF_bpf_dynptr_from_skb_meta,
+	KF_bpf_xdp_pull_data,
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
@@ -12285,10 +12286,12 @@ BTF_ID(func, bpf_rbtree_right)
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_from_skb_meta)
+BTF_ID(func, bpf_xdp_pull_data)
 #else
 BTF_ID_UNUSED
 BTF_ID_UNUSED
 BTF_ID_UNUSED
+BTF_ID_UNUSED
 #endif
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
@@ -12358,6 +12361,11 @@ static bool is_kfunc_bpf_preempt_enable(struct bpf_kfunc_call_arg_meta *meta)
 	return meta->func_id == special_kfunc_list[KF_bpf_preempt_enable];
 }
 
+static bool is_kfunc_pkt_changing(struct bpf_kfunc_call_arg_meta *meta)
+{
+	return meta->func_id == special_kfunc_list[KF_bpf_xdp_pull_data];
+}
+
 static enum kfunc_ptr_arg_type
 get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 		       struct bpf_kfunc_call_arg_meta *meta,
@@ -14077,6 +14085,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (is_kfunc_pkt_changing(&meta))
+		clear_all_pkt_pointers(env);
+
 	nargs = btf_type_vlen(meta.func_proto);
 	args = (const struct btf_param *)(meta.func_proto + 1);
 	for (i = 0; i < nargs; i++) {
@@ -17798,6 +17809,8 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
 			 */
 			if (ret == 0 && is_kfunc_sleepable(&meta))
 				mark_subprog_might_sleep(env, t);
+			if (ret == 0 && is_kfunc_pkt_changing(&meta))
+				mark_subprog_changes_pkt_data(env, t);
 		}
 		return visit_func_call_insn(t, insns, env, insn->src_reg == BPF_PSEUDO_CALL);
 
-- 
2.47.3



* [PATCH bpf-next v2 5/7] bpf: Support specifying linear xdp packet data size in test_run
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (3 preceding siblings ...)
  2025-09-05 17:33 ` [PATCH bpf-next v2 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
  6 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

To test bpf_xdp_pull_data(), an xdp packet containing fragments as well
as free linear data area after xdp->data_end needs to be created.
However, bpf_prog_test_run_xdp() always fills the linear area with
data_in before creating fragments, leaving no space to pull data. This
patch will allow users to specify the linear data size through
ctx->data_end.

Currently, ctx_in->data_end must match data_size_in and will not be the
final ctx->data_end seen by xdp programs. This is because ctx->data_end
is populated according to the xdp_buff passed to test_run. The linear
data area available in an xdp_buff, max_data_sz, is always filled up
before copying data_in into fragments.

This patch will allow users to specify the size of data that goes into
the linear area. When ctx_in->data_end is different from data_size_in,
only ctx_in->data_end bytes of data will be put into the linear area when
creating the xdp_buff.

While ctx_in->data_end will be allowed to differ from data_size_in, it
cannot be larger than data_size_in, as there would be no data to copy
from user space. If it is larger than the maximum linear data area
size, the layout suggested by the user will not be honored: data beyond
max_data_sz bytes will still be copied into fragments.
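
The resulting sizing rule condenses to a small pure function
(hypothetical name, mirroring the test_run logic in this patch):

```c
#include <assert.h>
#include <stdint.h>

/* The linear area receives min(ctx->data_end, max_data_sz) bytes when
 * a ctx is given, and never more than data_size_in, since only that
 * many bytes exist to copy from user space. */
static uint32_t linear_copy_len(uint32_t data_size_in, uint32_t max_data_sz,
				int have_ctx, uint32_t ctx_data_end)
{
	uint32_t data_sz = (have_ctx && ctx_data_end < max_data_sz) ?
			   ctx_data_end : max_data_sz;

	return data_size_in > data_sz ? data_sz : data_size_in;
}
```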

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 net/bpf/test_run.c                                       | 9 +++++----
 .../selftests/bpf/prog_tests/xdp_context_test_run.c      | 4 +---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 4a862d605386..1a0d0bc35ad5 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1207,8 +1207,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 {
 	bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES);
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	u32 retval = 0, duration, max_data_sz, data_sz;
 	u32 batch_size = kattr->test.batch_size;
-	u32 retval = 0, duration, max_data_sz;
 	u32 size = kattr->test.data_size_in;
 	u32 headroom = XDP_PACKET_HEADROOM;
 	u32 repeat = kattr->test.repeat;
@@ -1246,7 +1246,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 
 	if (ctx) {
 		/* There can't be user provided data before the meta data */
-		if (ctx->data_meta || ctx->data_end != size ||
+		if (ctx->data_meta || ctx->data_end > size ||
 		    ctx->data > ctx->data_end ||
 		    unlikely(xdp_metalen_invalid(ctx->data)) ||
 		    (do_live && (kattr->test.data_out || kattr->test.ctx_out)))
@@ -1256,11 +1256,12 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	}
 
 	max_data_sz = PAGE_SIZE - headroom - tailroom;
-	if (size > max_data_sz) {
+	data_sz = (ctx && ctx->data_end < max_data_sz) ? ctx->data_end : max_data_sz;
+	if (size > data_sz) {
 		/* disallow live data mode for jumbo frames */
 		if (do_live)
 			goto free_ctx;
-		size = max_data_sz;
+		size = data_sz;
 	}
 
 	data = bpf_test_init(kattr, size, max_data_sz, headroom, tailroom);
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
index 46e0730174ed..178292d1251a 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
@@ -97,9 +97,7 @@ void test_xdp_context_test_run(void)
 	/* Meta data must be 255 bytes or smaller */
 	test_xdp_context_error(prog_fd, opts, 0, 256, sizeof(data), 0, 0, 0);
 
-	/* Total size of data must match data_end - data_meta */
-	test_xdp_context_error(prog_fd, opts, 0, sizeof(__u32),
-			       sizeof(data) - 1, 0, 0, 0);
+	/* Total size of data must be data_end - data_meta or larger */
 	test_xdp_context_error(prog_fd, opts, 0, sizeof(__u32),
 			       sizeof(data) + 1, 0, 0, 0);
 
-- 
2.47.3



* [PATCH bpf-next v2 6/7] selftests/bpf: Test bpf_xdp_pull_data
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (4 preceding siblings ...)
  2025-09-05 17:33 ` [PATCH bpf-next v2 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  2025-09-05 17:33 ` [PATCH bpf-next v2 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
  6 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

Test bpf_xdp_pull_data() with xdp packets of different layouts. The
xdp bpf program first checks whether the layout is as expected. Then,
it calls bpf_xdp_pull_data(). Finally, it checks the 0xbb marker at
offset 1024 using direct packet access.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../selftests/bpf/prog_tests/xdp_pull_data.c  | 96 +++++++++++++++++++
 .../selftests/bpf/progs/test_xdp_pull_data.c  | 36 +++++++
 2 files changed, 132 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pull_data.c

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c b/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
new file mode 100644
index 000000000000..2cd18e15d47e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_xdp_pull_data.skel.h"
+
+/* xdp_pull_data_prog will directly read a marker 0xbb stored at buf[1024]
+ * so caller expecting XDP_PASS should always pass pull_len no less than 1024
+ */
+void test_xdp_pull_data_common(struct test_xdp_pull_data *skel,
+			       int buf_len, int linear_len,
+			       int pull_len, int retval)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	struct xdp_md ctx = {};
+	int prog_fd, err;
+	__u8 *buf;
+
+	buf = calloc(buf_len, sizeof(__u8));
+	if (!ASSERT_OK_PTR(buf, "calloc buf"))
+		return;
+
+	buf[1023] = 0xaa;
+	buf[1024] = 0xbb;
+	buf[1025] = 0xcc;
+
+	topts.data_in = buf;
+	topts.data_out = buf;
+	topts.data_size_in = buf_len;
+	topts.data_size_out = buf_len;
+	ctx.data_end = linear_len;
+	topts.ctx_in = &ctx;
+	topts.ctx_out = &ctx;
+	topts.ctx_size_in = sizeof(ctx);
+	topts.ctx_size_out = sizeof(ctx);
+
+	skel->bss->linear_len = linear_len;
+	skel->bss->pull_len = pull_len;
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_pull_data_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+	ASSERT_EQ(topts.retval, retval, "xdp_pull_data_prog retval");
+
+	if (retval == XDP_DROP)
+		goto out;
+
+	ASSERT_EQ(ctx.data_end, pull_len, "linear data size");
+	ASSERT_EQ(topts.data_size_out, buf_len, "linear + non-linear data size");
+	/* Make sure data around xdp->data_end was not messed up by
+	 * bpf_xdp_pull_data()
+	 */
+	ASSERT_EQ(buf[1023], 0xaa, "buf[1023]");
+	ASSERT_EQ(buf[1024], 0xbb, "buf[1024]");
+	ASSERT_EQ(buf[1025], 0xcc, "buf[1025]");
+out:
+	free(buf);
+}
+
+static void test_xdp_pull_data_basic(void)
+{
+	struct test_xdp_pull_data *skel;
+	u32 page_size;
+
+	skel = test_xdp_pull_data__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_xdp_pull_data__open_and_load"))
+		return;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+
+	/* linear xdp pkt, pull 0 byte */
+	test_xdp_pull_data_common(skel, 2048, 2048, 2048, XDP_PASS);
+	/* multi-buf pkt, pull results in linear xdp pkt */
+	test_xdp_pull_data_common(skel, 2048, 1024, 2048, XDP_PASS);
+	/* multi-buf pkt, pull 1 byte to linear data area */
+	test_xdp_pull_data_common(skel, 9000, 1024, 1025, XDP_PASS);
+	/* multi-buf pkt, pull 0 byte to linear data area */
+	test_xdp_pull_data_common(skel, 9000, 1025, 1025, XDP_PASS);
+
+	/* linear xdp pkt, pull more than total data len */
+	test_xdp_pull_data_common(skel, 2048, 2048, 2049, XDP_DROP);
+	/* multi-buf pkt with no space left in linear data area.
+	 * Since ctx.data_end (4096) > max_data_sz, bpf_prog_test_run_xdp()
+	 * will fill the whole linear data area and put the rest into a
+	 * fragment.
+	 */
+	test_xdp_pull_data_common(skel, page_size, page_size, page_size, XDP_DROP);
+
+	test_xdp_pull_data__destroy(skel);
+}
+
+void test_xdp_pull_data(void)
+{
+	if (test__start_subtest("xdp_pull_data"))
+		test_xdp_pull_data_basic();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c b/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c
new file mode 100644
index 000000000000..f32e6b4a79f5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+int _version SEC("version") = 1;
+
+int linear_len;
+int pull_len;
+
+SEC("xdp.frags")
+int xdp_pull_data_prog(struct xdp_md *xdp)
+{
+	__u8 *data_end = (void *)(long)xdp->data_end;
+	__u8 *data = (void *)(long)xdp->data;
+	__u8 *val_p;
+	int err;
+
+	if (linear_len != data_end - data)
+		return XDP_DROP;
+
+	err = bpf_xdp_pull_data(xdp, pull_len, 0);
+	if (err)
+		return XDP_DROP;
+
+	val_p = (void *)(long)xdp->data + 1024;
+	if (val_p + 1 > (void *)(long)xdp->data_end)
+		return XDP_DROP;
+
+	if (*val_p != 0xbb)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next v2 7/7] selftests: drv-net: Pull data before parsing headers
  2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (5 preceding siblings ...)
  2025-09-05 17:33 ` [PATCH bpf-next v2 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
@ 2025-09-05 17:33 ` Amery Hung
  6 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-05 17:33 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

It is possible for drivers to generate xdp packets with data residing
entirely in fragments. To keep parsing headers using direct packet
access, call bpf_xdp_pull_data() to pull headers into the linear data
area.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../selftests/net/lib/xdp_native.bpf.c        | 90 +++++++++++++++----
 1 file changed, 75 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index 521ba38f2ddd..68b2a08055ce 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -14,6 +14,9 @@
 #define MAX_PAYLOAD_LEN 5000
 #define MAX_HDR_LEN 64
 
+extern int bpf_xdp_pull_data(struct xdp_md *xdp, __u32 len,
+			     __u64 flags) __ksym __weak;
+
 enum {
 	XDP_MODE = 0,
 	XDP_PORT = 1,
@@ -68,30 +71,57 @@ static void record_stats(struct xdp_md *ctx, __u32 stat_type)
 
 static struct udphdr *filter_udphdr(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	void *data, *data_end;
+	struct ethhdr *eth;
+	int err;
+
+	err = bpf_xdp_pull_data(ctx, sizeof(*eth), 0);
+	if (err)
+		return NULL;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = eth = (void *)(long)ctx->data;
 
 	if (data + sizeof(*eth) > data_end)
 		return NULL;
 
 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
+		struct iphdr *iph;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*iph) +
+					     sizeof(*udph), 0);
+		if (err)
+			return NULL;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		iph = data + sizeof(*eth);
 
 		if (iph + 1 > (struct iphdr *)data_end ||
 		    iph->protocol != IPPROTO_UDP)
 			return NULL;
 
-		udph = (void *)eth + sizeof(*iph) + sizeof(*eth);
-	} else if (eth->h_proto  == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+		udph = data + sizeof(*iph) + sizeof(*eth);
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ipv6h;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*ipv6h) +
+					     sizeof(*udph), 0);
+		if (err)
+			return NULL;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		ipv6h = data + sizeof(*eth);
 
 		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
 		    ipv6h->nexthdr != IPPROTO_UDP)
 			return NULL;
 
-		udph = (void *)eth + sizeof(*ipv6h) + sizeof(*eth);
+		udph = data + sizeof(*ipv6h) + sizeof(*eth);
 	} else {
 		return NULL;
 	}
@@ -145,17 +175,34 @@ static void swap_machdr(void *data)
 
 static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	void *data, *data_end;
+	struct ethhdr *eth;
+	int err;
+
+	err = bpf_xdp_pull_data(ctx, sizeof(*eth), 0);
+	if (err)
+		return XDP_PASS;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = eth = (void *)(long)ctx->data;
 
 	if (data + sizeof(*eth) > data_end)
 		return XDP_PASS;
 
 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
-		__be32 tmp_ip = iph->saddr;
+		struct iphdr *iph;
+		__be32 tmp_ip;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*iph) +
+					     sizeof(*udph), 0);
+		if (err)
+			return XDP_PASS;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		iph = data + sizeof(*eth);
 
 		if (iph + 1 > (struct iphdr *)data_end ||
 		    iph->protocol != IPPROTO_UDP)
@@ -169,8 +216,10 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 			return XDP_PASS;
 
 		record_stats(ctx, STATS_RX);
+		eth = data;
 		swap_machdr((void *)eth);
 
+		tmp_ip = iph->saddr;
 		iph->saddr = iph->daddr;
 		iph->daddr = tmp_ip;
 
@@ -178,9 +227,19 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 
 		return XDP_TX;
 
-	} else if (eth->h_proto  == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
 		struct in6_addr tmp_ipv6;
+		struct ipv6hdr *ipv6h;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*ipv6h) +
+					     sizeof(*udph), 0);
+		if (err)
+			return XDP_PASS;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		ipv6h = data + sizeof(*eth);
 
 		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
 		    ipv6h->nexthdr != IPPROTO_UDP)
@@ -194,6 +253,7 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 			return XDP_PASS;
 
 		record_stats(ctx, STATS_RX);
+		eth = data;
 		swap_machdr((void *)eth);
 
 		__builtin_memcpy(&tmp_ipv6, &ipv6h->saddr, sizeof(tmp_ipv6));
-- 
2.47.3



* Re: [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-09-05 17:33 ` [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
@ 2025-09-08 14:41   ` Dragos Tatulea
  2025-09-08 17:23     ` Amery Hung
  0 siblings, 1 reply; 16+ messages in thread
From: Dragos Tatulea @ 2025-09-08 14:41 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Fri, Sep 05, 2025 at 10:33:45AM -0700, Amery Hung wrote:
> xdp programs can change the layout of an xdp_buff through
> bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
> cannot assume the size of the linear data area nor fragments. Fix the
> bug in mlx5 by generating skb according to xdp_buff after xdp programs
> run.
>
Shouldn't this patch be a fix for net then?

> Currently, when handling multi-buf xdp, the mlx5 driver assumes the
> layout of an xdp_buff to be unchanged. That is, the linear data area
> continues to be empty and the fragments remain the same. This may cause
> the driver to generate an erroneous skb or trigger a kernel
> warning. When an xdp program adds linear data through
> bpf_xdp_adjust_head(), the linear data will be ignored, as
> mlx5e_build_linear_skb() builds an skb without linear data and then
> pulls data from fragments to fill the linear data area. When an xdp
> program has shrunk the non-linear data through bpf_xdp_adjust_tail(),
> the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
> data size and trigger the BUG_ON in it.
> 
> To fix the issue, first record the original number of fragments. If the
> number of fragments changes after the xdp program runs, rewind the end
> fragment pointer by the difference and recalculate the truesize. Then,
> build the skb with linear data area matching the xdp_buff. Finally, only
> pull data in if there is non-linear data and fill the linear part up to
> 256 bytes.
> 
> Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
Your fix covers both Legacy RQ and Striding RQ. So the tag is only 1/2
correct. Normally we have separate patches for each mode.


> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 38 +++++++++++++++++--
>  1 file changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b8c609d91d11..6b6bb90cf003 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1729,6 +1729,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>  	struct mlx5e_wqe_frag_info *head_wi = wi;
>  	u16 rx_headroom = rq->buff.headroom;
>  	struct mlx5e_frag_page *frag_page;
> +	u8 nr_frags_free, old_nr_frags;
>  	struct skb_shared_info *sinfo;
>  	u32 frag_consumed_bytes;
>  	struct bpf_prog *prog;
> @@ -1772,17 +1773,27 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>  		wi++;
>  	}
>  
> +	old_nr_frags = sinfo->nr_frags;
> +
>  	prog = rcu_dereference(rq->xdp_prog);
>  	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
>  		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>  			struct mlx5e_wqe_frag_info *pwi;
>  
> +			wi -= old_nr_frags - sinfo->nr_frags;
> +
>  			for (pwi = head_wi; pwi < wi; pwi++)
>  				pwi->frag_page->frags++;
>  		}
>  		return NULL; /* page/packet was consumed by XDP */
>  	}
>  
> +	nr_frags_free = old_nr_frags - sinfo->nr_frags;
> +	if (unlikely(nr_frags_free)) {
Even with a branch prediction hint, is it really worth it?


> +		wi -= nr_frags_free;
> +		truesize -= nr_frags_free * frag_info->frag_stride;
> +	}
> +
>  	skb = mlx5e_build_linear_skb(
>  		rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
>  		mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
> @@ -2004,6 +2015,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  	u32 byte_cnt       = cqe_bcnt;
>  	struct skb_shared_info *sinfo;
>  	unsigned int truesize = 0;
> +	u32 pg_consumed_bytes;
>  	struct bpf_prog *prog;
>  	struct sk_buff *skb;
>  	u32 linear_frame_sz;
> @@ -2057,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  
>  	while (byte_cnt) {
>  		/* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
> -		u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
> +		pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
>  
>  		if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
>  			truesize += pg_consumed_bytes;
> @@ -2073,10 +2085,15 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  	}
>  
>  	if (prog) {
> +		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
> +		u32 len;
> +
>  		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
>  			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>  				struct mlx5e_frag_page *pfp;
>  
> +				frag_page -= old_nr_frags - sinfo->nr_frags;
> +
>  				for (pfp = head_page; pfp < frag_page; pfp++)
>  					pfp->frags++;
>  
> @@ -2087,9 +2104,22 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  			return NULL; /* page/packet was consumed by XDP */
>  		}
>  
> +		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
> +
> +		nr_frags_free = old_nr_frags - sinfo->nr_frags;
> +		if (unlikely(nr_frags_free)) {
Same question about the if.

> +			frag_page -= nr_frags_free;
> +
> +			/* the last frag is always freed first */
> +			truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
> +			while (--nr_frags_free)
> +				truesize -= nr_frags_free *
> +					    ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz));
> +		}
> +
This doesn't seem correct. It seems to remove too much from truesize
when nr_frags_free > 2. I think it should be:

truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) -
	    (nr_frags_free - 1) * ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz));

And PAGE_SIZE is aligned to stride size so you can shorten it to:

truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) -
	    (nr_frags_free - 1) * PAGE_SIZE;

Thanks,
Dragos



* Re: [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-09-08 14:41   ` Dragos Tatulea
@ 2025-09-08 17:23     ` Amery Hung
  0 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-08 17:23 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, Sep 8, 2025 at 7:42 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:

Resending the reply to the list again as some html stuff accidentally
got mixed in

>
> On Fri, Sep 05, 2025 at 10:33:45AM -0700, Amery Hung wrote:
> > xdp programs can change the layout of an xdp_buff through
> > bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
> > cannot assume the size of the linear data area nor fragments. Fix the
> > bug in mlx5 by generating skb according to xdp_buff after xdp programs
> > run.
> >
> Shouldn't this patch be a fix for net then?

Makes sense. I will separate the mlx5 patch from this set and target net.

>
> > Currently, when handling multi-buf xdp, the mlx5 driver assumes the
> > layout of an xdp_buff to be unchanged. That is, the linear data area
> > continues to be empty and the fragments remain the same. This may cause
> > the driver to generate an erroneous skb or trigger a kernel
> > warning. When an xdp program adds linear data through
> > bpf_xdp_adjust_head(), the linear data will be ignored, as
> > mlx5e_build_linear_skb() builds an skb without linear data and then
> > pulls data from fragments to fill the linear data area. When an xdp
> > program has shrunk the non-linear data through bpf_xdp_adjust_tail(),
> > the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
> > data size and trigger the BUG_ON in it.
> >
> > To fix the issue, first record the original number of fragments. If the
> > number of fragments changes after the xdp program runs, rewind the end
> > fragment pointer by the difference and recalculate the truesize. Then,
> > build the skb with linear data area matching the xdp_buff. Finally, only
> > pull data in if there is non-linear data and fill the linear part up to
> > 256 bytes.
> >
> > Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
> Your fix covers both Legacy RQ and Striding RQ. So the tag is only 1/2
> correct. Normally we have separate patches for each mode.

Will split the patch into two.

>
>
>
> > Signed-off-by: Amery Hung <ameryhung@gmail.com>
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 38 +++++++++++++++++--
> >  1 file changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index b8c609d91d11..6b6bb90cf003 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1729,6 +1729,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       struct mlx5e_wqe_frag_info *head_wi = wi;
> >       u16 rx_headroom = rq->buff.headroom;
> >       struct mlx5e_frag_page *frag_page;
> > +     u8 nr_frags_free, old_nr_frags;
> >       struct skb_shared_info *sinfo;
> >       u32 frag_consumed_bytes;
> >       struct bpf_prog *prog;
> > @@ -1772,17 +1773,27 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >               wi++;
> >       }
> >
> > +     old_nr_frags = sinfo->nr_frags;
> > +
> >       prog = rcu_dereference(rq->xdp_prog);
> >       if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
> >               if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> >                       struct mlx5e_wqe_frag_info *pwi;
> >
> > +                     wi -= old_nr_frags - sinfo->nr_frags;
> > +
> >                       for (pwi = head_wi; pwi < wi; pwi++)
> >                               pwi->frag_page->frags++;
> >               }
> >               return NULL; /* page/packet was consumed by XDP */
> >       }
> >
> > +     nr_frags_free = old_nr_frags - sinfo->nr_frags;
> > +     if (unlikely(nr_frags_free)) {
> Even with a branch prediction hint, is it really worth it?
>

[...]

>
> > +             wi -= nr_frags_free;
> > +             truesize -= nr_frags_free * frag_info->frag_stride;
> > +     }
> > +
> >       skb = mlx5e_build_linear_skb(
> >               rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
> >               mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
> > @@ -2004,6 +2015,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       u32 byte_cnt       = cqe_bcnt;
> >       struct skb_shared_info *sinfo;
> >       unsigned int truesize = 0;
> > +     u32 pg_consumed_bytes;
> >       struct bpf_prog *prog;
> >       struct sk_buff *skb;
> >       u32 linear_frame_sz;
> > @@ -2057,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >
> >       while (byte_cnt) {
> >               /* Non-linear mode, hence non-XSK, which always uses PAGE_SIZE. */
> > -             u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
> > +             pg_consumed_bytes = min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
> >
> >               if (test_bit(MLX5E_RQ_STATE_SHAMPO, &rq->state))
> >                       truesize += pg_consumed_bytes;
> > @@ -2073,10 +2085,15 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       }
> >
> >       if (prog) {
> > +             u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
> > +             u32 len;
> > +
> >               if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
> >                       if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> >                               struct mlx5e_frag_page *pfp;
> >
> > +                             frag_page -= old_nr_frags - sinfo->nr_frags;
> > +
> >                               for (pfp = head_page; pfp < frag_page; pfp++)
> >                                       pfp->frags++;
> >
> > @@ -2087,9 +2104,22 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >                       return NULL; /* page/packet was consumed by XDP */
> >               }
> >
> > +             len = mxbuf->xdp.data_end - mxbuf->xdp.data;
> > +
> > +             nr_frags_free = old_nr_frags - sinfo->nr_frags;
> > +             if (unlikely(nr_frags_free)) {
> Same question about the if.

I see. I will make the recalculation unconditional.

>
> > +                     frag_page -= nr_frags_free;
> > +
> > +                     /* the last frag is always freed first */
> > +                     truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz));
> > +                     while (--nr_frags_free)
> > +                             truesize -= nr_frags_free *
> > +                                         ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz));
> > +             }
> > +
> This doesn't seem correct. It seems to remove too much from truesize
> when nr_frags_free > 2. I think it should be:
>
> truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) -
>             (nr_frags_free - 1) * ALIGN(PAGE_SIZE, BIT(rq->mpwqe.log_stride_sz));
>
> And PAGE_SIZE is aligned to stride size so you can shorten it to:
>
> truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) -
>             (nr_frags_free - 1) * PAGE_SIZE;

Sorry that I was being sloppy here. You are correct, and I think you
probably meant "+" instead of "-".

truesize -= ALIGN(pg_consumed_bytes, BIT(rq->mpwqe.log_stride_sz)) +
             (nr_frags_free - 1) * PAGE_SIZE;

>
> Thanks,
> Dragos
>


* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-05 17:33 ` [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data Amery Hung
@ 2025-09-08 19:27   ` Martin KaFai Lau
  2025-09-08 22:28     ` Amery Hung
  2025-09-09  1:54   ` Jakub Kicinski
  1 sibling, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2025-09-08 19:27 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On 9/5/25 10:33 AM, Amery Hung wrote:
> An unused argument, flags is reserved for future extension (e.g.,
> tossing the data instead of copying it to the linear data area).

> +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)

I was thinking the flag may be needed to avoid the copy. I think we have 
recently concluded that bpf_xdp_adjust_head() can also support shrinking on 
multi-buf. If that is the case, it is probably better to keep an API similar 
to bpf_skb_pull_data(), which does not have a flags argument.



* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-08 19:27   ` Martin KaFai Lau
@ 2025-09-08 22:28     ` Amery Hung
  0 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-08 22:28 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, Sep 8, 2025 at 12:27 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/5/25 10:33 AM, Amery Hung wrote:
> > An unused argument, flags is reserved for future extension (e.g.,
> > tossing the data instead of copying it to the linear data area).
>
> > +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
>
> I was thinking the flag may be needed to avoid copy. I think we have recently
> concluded that bpf_xdp_adjust_head can support shrink on multi buf also. If it
> is the case, it is probably better to keep a similar api as the
> bpf_skb_pull_data which does not have the flags argument.
>

Makes sense. I will remove flags.


* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-05 17:33 ` [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data Amery Hung
  2025-09-08 19:27   ` Martin KaFai Lau
@ 2025-09-09  1:54   ` Jakub Kicinski
  2025-09-10 15:17     ` Amery Hung
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-09-09  1:54 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Fri,  5 Sep 2025 10:33:47 -0700 Amery Hung wrote:
> + * Direct packet access allows reading and writing linear XDP data through
> + * packet pointers (i.e., &xdp_md->data + offsets). 

Add:
 The amount of data which ends up in the linear part of the xdp_buff
 depends on the NIC and its configuration. 

> When an eBPF program wants
> + * to directly access data that may be in the non-linear area, call this kfunc
                         ^^^^
          maybe s/data/headers

> + * to make sure the data is available in the linear area.

Should we add a mention here of the copy helpers and dynptr for
accessing data without pulling?

> + * This kfunc can also be used with bpf_xdp_adjust_head() to decapsulate
> + * headers in the non-linear data area.
> + *
> + * A call to this kfunc is susceptible to change the underlying packet buffer.

Maybe:
 A call to this kfunc will modify the buffer geometry.

> + * Therefore, at load time, all checks on pointers previously done by the
> + * verifier are invalidated and must be performed again, if the kfunc is used
> + * in combination with direct packet access.

>	void *data_end = xdp->data + len;

nit: I think the code would be easier to follow if we renamed this 
to "new_end"?


Larger note: I wonder if we should support "shifting the buffer down"
if there's insufficient tailroom. XDP has rather copious headroom,
but tailroom may be pretty tight, and it may depend on the length of
the headers. So if there's not enough tailroom but there's enough
headroom -- should we try to memmove the existing headers?


* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-09  1:54   ` Jakub Kicinski
@ 2025-09-10 15:17     ` Amery Hung
  2025-09-10 18:04       ` Jakub Kicinski
  0 siblings, 1 reply; 16+ messages in thread
From: Amery Hung @ 2025-09-10 15:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, Sep 8, 2025 at 9:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri,  5 Sep 2025 10:33:47 -0700 Amery Hung wrote:
> > + * Direct packet access allows reading and writing linear XDP data through
> > + * packet pointers (i.e., &xdp_md->data + offsets).
>
> Add:
>  The amount of data which ends up in the linear part of the xdp_buf
>  depends on the NIC and its configuration.

[...]

>
> > When an eBPF program wants
> > + * to directly access data that may be in the non-linear area, call this kfunc
>                          ^^^^
>           maybe s/data/headers
>
> > + * to make sure the data is available in the linear area.
>
> Should we add a mention here of the copy helpers and dynptr for
> accessing data without pulling?

[...]

>
> > + * This kfunc can also be used with bpf_xdp_adjust_head() to decapsulate
> > + * headers in the non-linear data area.
> > + *
> > + * A call to this kfunc is susceptible to change the underlying packet buffer.
>
> Maybe:
>  A call to this kfunc will modify the buffer geometry.

Will improve the comment based on the suggestions. Thanks!

>
> > + * Therefore, at load time, all checks on pointers previously done by the
> > + * verifier are invalidated and must be performed again, if the kfunc is used
> > + * in combination with direct packet access.
>
> >       void *data_end = xdp->data + len;
>
> nit: I think the code would be easier to follow if we renamed this
> to "new_end"?

I was following the common pattern in other XDP kfuncs, but can change
it for readability.

>
>
> Larger note: I wonder if we should support "shifting the buffer down"
> if there's insufficient tailroom. XDP has rather copious headroom,
> but tailroom may be pretty tight, and it may depend on the length of
> the headers. So if there's not enough tailroom but there's enough
> headroom -- should we try to memmove the existing headers?

I think it should. If users want to reserve space for metadata, they
can check the headroom before pulling data.

If the kfunc does not do memmove(), users are still able to do so in
XDP programs through bpf_xdp_adjust_head() and memmove(), but it feels
less easy to use IMO.


* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-10 15:17     ` Amery Hung
@ 2025-09-10 18:04       ` Jakub Kicinski
  2025-09-10 19:11         ` Amery Hung
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Kicinski @ 2025-09-10 18:04 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Wed, 10 Sep 2025 11:17:52 -0400 Amery Hung wrote:
> > Larger note: I wonder if we should support "shifting the buffer down"
> > if there's insufficient tailroom. XDP has rather copious headroom,
> > but tailroom may be pretty tight, and it may depend on the length of
> > the headers. So if there's not enough tailroom but there's enough
> > headroom -- should we try to memmove the existing headers?  
> 
> I think it should. If users want to reserve space for metadata, they
> can check the headroom before pulling data.
> 
> If the kfunc does not do memmove(), users are still able to do so in
> XDP programs through bpf_xdp_adjust_head() and memmove(), but it feels
> less easy to use IMO.

Actually, I don't think adjust_head() would even work. The program can
adjust head and memmove() the header, but there's no way to "punch out"
the end of the head buffer. We can only grow and shrink start of packet
and end of packet. After adjust_head + memmove in the prog, the buffer
would look something like:

  _ _ _ _ __________ _____ _ _ _ _      ________
   hroom |  headers | old | troom      |  frag0 |
  - - - - ---------- ----- - - - -      --------

and the program has no way to "free" the "old" to let pull grab data
from frag0 in its place...

The skb pull helper can allocate a completely fresh buffer, but IDK if
drivers are ready to have the head buffer swapped out under their feet.
So I think the best we can do is have the pull() helper automatically
memmove the headers.


* Re: [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data
  2025-09-10 18:04       ` Jakub Kicinski
@ 2025-09-10 19:11         ` Amery Hung
  0 siblings, 0 replies; 16+ messages in thread
From: Amery Hung @ 2025-09-10 19:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, stfomichev,
	martin.lau, mohsin.bashr, noren, dtatulea, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Wed, Sep 10, 2025 at 2:04 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 10 Sep 2025 11:17:52 -0400 Amery Hung wrote:
> > > Larger note: I wonder if we should support "shifting the buffer down"
> > > if there's insufficient tailroom. XDP has rather copious headroom,
> > > but tailroom may be pretty tight, and it may depend on the length of
> > > the headers. So if there's not enough tailroom but there's enough
> > > headroom -- should we try to memmove the existing headers?
> >
> > I think it should. If users want to reserve space for metadata, they
> > can check the headroom before pulling data.
> >
> > If the kfunc does not do memmove(), users are still able to do so in
> > XDP programs through bpf_xdp_adjust_head() and memmove(), but it feels
> > less easy to use IMO.
>
> Actually, I don't think adjust_head() would even work. The program can
> adjust head and memmove() the header, but there's no way to "punch out"
> the end of the head buffer. We can only grow and shrink the start and
> end of the packet. After adjust_head + memmove in the prog, the buffer
> would look something like:

Ahh. You are right.

>
>   _ _ _ _ __________ _____ _ _ _ _      ________
>    hroom |  headers | old | troom      |  frag0 |
>   - - - - ---------- ----- - - - -      --------
>
> and the program has no way to "free" the "old" to let pull grab data
> from frag0 in its place...
>
> skb pull helper can allocate a completely fresh buffer, but IDK if
> drivers are ready to have the head buffer swapped under their feet.
> So I think the best we can do is have the pull() helper automatically
> memmove the headers.

Agree. Will make the kfunc memmove headers if more space is needed.


end of thread, other threads:[~2025-09-10 19:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-05 17:33 [PATCH bpf-next v2 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
2025-09-08 14:41   ` Dragos Tatulea
2025-09-08 17:23     ` Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 3/7] bpf: Support pulling non-linear xdp data Amery Hung
2025-09-08 19:27   ` Martin KaFai Lau
2025-09-08 22:28     ` Amery Hung
2025-09-09  1:54   ` Jakub Kicinski
2025-09-10 15:17     ` Amery Hung
2025-09-10 18:04       ` Jakub Kicinski
2025-09-10 19:11         ` Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
2025-09-05 17:33 ` [PATCH bpf-next v2 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
