bpf.vger.kernel.org archive mirror
* [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
@ 2025-08-25 19:39 Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

Hi all,

This patchset introduces a new kfunc, bpf_xdp_pull_data(), to allow
pulling nonlinear xdp data. This may be useful when a driver places
headers in fragments. When an xdp program would like to keep parsing
packet headers using direct packet access, it can call
bpf_xdp_pull_data() to make the headers available in the linear data
area. The kfunc can also be used to decapsulate headers residing in
the nonlinear data, as currently there is no easy way to do this.
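The pull semantics can be modeled in plain userspace C at a glance.
This is a toy sketch, not the kernel implementation: the struct, the
field names, and the single-fragment layout are all made up for
illustration.

```c
#include <assert.h>
#include <string.h>

/* Toy model of an xdp_buff: one linear area plus one fragment. */
struct toy_xdp {
	unsigned char lin[256];  /* linear data area */
	size_t lin_len;          /* bytes currently in the linear area */
	unsigned char frag[256]; /* single fragment */
	size_t frag_len;
};

/* Sketch of bpf_xdp_pull_data() semantics: make the first @len bytes
 * directly accessible in the linear area, consuming fragment bytes.
 */
static int toy_pull(struct toy_xdp *x, size_t len)
{
	size_t need;

	if (len > x->lin_len + x->frag_len)
		return -1;      /* larger than the whole packet */
	if (len <= x->lin_len)
		return 0;       /* already linear, nothing to do */
	need = len - x->lin_len;
	if (x->lin_len + need > sizeof(x->lin))
		return -1;      /* no room left in the linear area */
	memcpy(x->lin + x->lin_len, x->frag, need);    /* pull from frag head */
	memmove(x->frag, x->frag + need, x->frag_len - need);
	x->frag_len -= need;
	x->lin_len += need;
	return 0;
}
```

With an empty linear area and a header sitting in the fragment,
toy_pull(&x, header_len) leaves the header contiguous in lin[], which
is the property the kfunc provides for direct packet access.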

This patchset also tries to fix an issue in the mlx5e driver. The
driver currently assumes the packet layout is unchanged after an xdp
program runs, and may generate packets with corrupted data or trigger
a kernel warning if the xdp program calls a layout-changing kfunc such
as bpf_xdp_adjust_tail(), bpf_xdp_adjust_head(), or the
bpf_xdp_pull_data() introduced in this set.

Tested with the added bpf selftest using bpf test_run, and also on
mlx5e with tools/testing/selftests/drivers/net/xdp.py. mlx5e with
striding RQ produces xdp_buffs with empty linear data, so
xdp.test_xdp_native_pass_mb would fail to parse the header before this
patchset.

Grateful for any feedback (especially the driver part).

Thanks!
Amery

Amery Hung (7):
  net/mlx5e: Fix generating skb from nonlinear xdp_buff
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Support pulling non-linear xdp data
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support specifying linear xdp packet data size in test_run
  selftests/bpf: Test bpf_xdp_pull_data
  selftests: drv-net: Pull data before parsing headers

 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 59 ++++++++----
 include/net/xdp_sock_drv.h                    | 21 +++-
 kernel/bpf/verifier.c                         | 13 +++
 net/bpf/test_run.c                            |  9 +-
 net/core/filter.c                             | 81 +++++++++++++---
 .../bpf/prog_tests/xdp_context_test_run.c     |  4 +-
 .../selftests/bpf/prog_tests/xdp_pull_data.c  | 96 +++++++++++++++++++
 .../selftests/bpf/progs/test_xdp_pull_data.c  | 36 +++++++
 .../selftests/net/lib/xdp_native.bpf.c        | 90 ++++++++++++++---
 9 files changed, 356 insertions(+), 53 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pull_data.c

-- 
2.47.3


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-27 13:45   ` Dragos Tatulea
  2025-08-28 13:41   ` Nimrod Oren
  2025-08-25 19:39 ` [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

xdp programs can change the layout of an xdp_buff through
bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
cannot assume the size of the linear data area nor the fragments. Fix
the bug in the mlx5e driver by generating the skb according to the
xdp_buff layout.

Currently, when handling multi-buf xdp, the mlx5e driver assumes the
layout of an xdp_buff to be unchanged. That is, the linear data area
continues to be empty and the fragments remain the same. This may
cause the driver to generate an erroneous skb or trigger a kernel
warning. When an xdp program adds linear data through
bpf_xdp_adjust_head(), the linear data will be ignored, as
mlx5e_build_linear_skb() builds an skb with empty linear data and then
pulls data from fragments to fill the linear data area. When an xdp
program has shrunk the nonlinear data through bpf_xdp_adjust_tail(),
the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
data size and trigger the BUG_ON() in it.

To fix the issue, first build the skb with a linear data area matching
the xdp_buff. Then, call __pskb_pull_tail() to fill the linear data
area with up to MLX5E_RX_MAX_HEAD bytes. In addition, recalculate
nr_frags and truesize after the xdp program runs.
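The amount pulled by __pskb_pull_tail() in this scheme can be sketched
as below. This is an illustrative userspace model: the helper name is
made up, and the MLX5E_RX_MAX_HEAD value here is an assumption for the
sketch, not necessarily the driver's actual constant.

```c
#include <assert.h>

#define MLX5E_RX_MAX_HEAD 256 /* illustrative; defined by the driver */

/* After the xdp program runs, @len bytes already sit in the linear
 * area; pull at most enough fragment bytes (@frags_size available) to
 * fill the linear area up to MLX5E_RX_MAX_HEAD.
 */
static unsigned int pull_headlen(unsigned int len, unsigned int frags_size)
{
	unsigned int room = len < MLX5E_RX_MAX_HEAD ?
			    MLX5E_RX_MAX_HEAD - len : 0;

	return room < frags_size ? room : frags_size;
}
```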

Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 59 ++++++++++++++-----
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b8c609d91d11..c5173f1ccb4e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
+	struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
 	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
-	struct mlx5e_wqe_frag_info *head_wi = wi;
 	u16 rx_headroom = rq->buff.headroom;
 	struct mlx5e_frag_page *frag_page;
 	struct skb_shared_info *sinfo;
-	u32 frag_consumed_bytes;
+	u32 frag_consumed_bytes, i;
 	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	dma_addr_t addr;
 	u32 truesize;
+	u8 nr_frags;
 	void *va;
 
 	frag_page = wi->frag_page;
@@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
-			struct mlx5e_wqe_frag_info *pwi;
+			pwi = head_wi;
+			while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
+				pwi++;
 
-			for (pwi = head_wi; pwi < wi; pwi++)
+			for (i = 0; i < sinfo->nr_frags; i++, pwi++)
 				pwi->frag_page->frags++;
 		}
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
+	nr_frags = sinfo->nr_frags;
+	pwi = head_wi + 1;
+
+	if (prog) {
+		truesize = sinfo->nr_frags * frag_info->frag_stride;
+
+		while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
+			pwi++;
+	}
+
 	skb = mlx5e_build_linear_skb(
 		rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
 		mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
@@ -1796,12 +1809,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 
 	if (xdp_buff_has_frags(&mxbuf->xdp)) {
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
-		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
+		xdp_update_skb_shared_info(skb, nr_frags,
 					   sinfo->xdp_frags_size, truesize,
 					   xdp_buff_is_frag_pfmemalloc(
 						&mxbuf->xdp));
 
-		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
+		for (i = 0; i < nr_frags; i++, pwi++)
 			pwi->frag_page->frags++;
 	}
 
@@ -2073,12 +2086,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 	}
 
 	if (prog) {
+		u8 nr_frags;
+		u32 len, i;
+
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
-				struct mlx5e_frag_page *pfp;
+				struct mlx5e_frag_page *pagep = head_page;
+
+				while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
+					pagep++;
 
-				for (pfp = head_page; pfp < frag_page; pfp++)
-					pfp->frags++;
+				for (i = 0; i < sinfo->nr_frags; i++)
+					pagep->frags++;
 
 				wi->linear_page.frags++;
 			}
@@ -2087,9 +2106,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
+		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
+		nr_frags = sinfo->nr_frags;
+
 		skb = mlx5e_build_linear_skb(
 			rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
-			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
+			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, len,
 			mxbuf->xdp.data - mxbuf->xdp.data_meta);
 		if (unlikely(!skb)) {
 			mlx5e_page_release_fragmented(rq->page_pool,
@@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 		mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);
 
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
-			struct mlx5e_frag_page *pagep;
+			struct mlx5e_frag_page *pagep = head_page;
+
+			truesize = nr_frags * PAGE_SIZE;
 
 			/* sinfo->nr_frags is reset by build_skb, calculate again. */
-			xdp_update_skb_shared_info(skb, frag_page - head_page,
+			xdp_update_skb_shared_info(skb, nr_frags,
 						   sinfo->xdp_frags_size, truesize,
 						   xdp_buff_is_frag_pfmemalloc(
 							&mxbuf->xdp));
 
-			pagep = head_page;
-			do
+			while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
+				pagep++;
+
+			for (i = 0; i < nr_frags; i++, pagep++)
 				pagep->frags++;
-			while (++pagep < frag_page);
+
+			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
+			__pskb_pull_tail(skb, headlen);
 		}
-		__pskb_pull_tail(skb, headlen);
 	} else {
 		dma_addr_t addr;
 
-- 
2.47.3



* [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-28 13:43   ` Nimrod Oren
  2025-08-25 19:39 ` [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data Amery Hung
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

Move the skb_frag_t adjustment into bpf_xdp_shrink_data() and extend
it to be able to shrink an xdp fragment from both the head and the
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from the head.

Additionally, in bpf_xdp_frags_shrink_tail(), breaking out of the loop
when bpf_xdp_shrink_data() returns false (i.e., the current fragment
is not released) is not necessary, as the loop condition, offset > 0,
has the same effect. Remove the else branch to simplify the code.
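The head/tail shrink rule can be modeled on a toy fragment descriptor.
This is a userspace sketch of the logic only, not the kernel code; the
struct and function names are invented.

```c
#include <assert.h>

/* Toy skb_frag_t: an (offset, size) window into a page. */
struct toy_frag {
	unsigned int off, size;
};

/* Sketch of the extended bpf_xdp_shrink_data(): shrink from the tail
 * (size only) or from the head (advance the offset too). Returns 1
 * when the frag is fully consumed and should be released by the
 * caller.
 */
static int toy_shrink(struct toy_frag *f, unsigned int shrink, int tail)
{
	if (f->size == shrink)
		return 1;            /* caller releases the frag */
	if (!tail)
		f->off += shrink;    /* head shrink keeps the window's end fixed */
	f->size -= shrink;
	return 0;
}
```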

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 include/net/xdp_sock_drv.h | 21 ++++++++++++++++++---
 net/core/filter.c          | 29 +++++++++++++++++------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 513c8e9704f6..4f2d3268a676 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -160,13 +160,23 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
 	return ret;
 }
 
-static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
 {
-	struct xdp_buff_xsk *xskb = container_of(tail, struct xdp_buff_xsk, xdp);
+	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
 
 	list_del(&xskb->list_node);
 }
 
+static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
+{
+	struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp);
+	struct xdp_buff_xsk *frag;
+
+	frag = list_first_entry(&xskb->pool->xskb_list, struct xdp_buff_xsk,
+				list_node);
+	return &frag->xdp;
+}
+
 static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
 {
 	struct xdp_buff_xsk *xskb = container_of(first, struct xdp_buff_xsk, xdp);
@@ -389,8 +399,13 @@ static inline struct xdp_buff *xsk_buff_get_frag(const struct xdp_buff *first)
 	return NULL;
 }
 
-static inline void xsk_buff_del_tail(struct xdp_buff *tail)
+static inline void xsk_buff_del_frag(struct xdp_buff *xdp)
+{
+}
+
+static inline struct xdp_buff *xsk_buff_get_head(struct xdp_buff *first)
 {
+	return NULL;
 }
 
 static inline struct xdp_buff *xsk_buff_get_tail(struct xdp_buff *first)
diff --git a/net/core/filter.c b/net/core/filter.c
index 63f3baee2daf..f0ee5aec7977 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4153,34 +4153,43 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
 	return 0;
 }
 
-static void bpf_xdp_shrink_data_zc(struct xdp_buff *xdp, int shrink,
+static void bpf_xdp_shrink_data_zc(struct xdp_buff *xdp, int shrink, bool tail,
 				   enum xdp_mem_type mem_type, bool release)
 {
-	struct xdp_buff *zc_frag = xsk_buff_get_tail(xdp);
+	struct xdp_buff *zc_frag = tail ? xsk_buff_get_tail(xdp) :
+					  xsk_buff_get_head(xdp);
 
 	if (release) {
-		xsk_buff_del_tail(zc_frag);
+		xsk_buff_del_frag(zc_frag);
 		__xdp_return(0, mem_type, false, zc_frag);
 	} else {
-		zc_frag->data_end -= shrink;
+		if (tail)
+			zc_frag->data_end -= shrink;
+		else
+			zc_frag->data += shrink;
 	}
 }
 
 static bool bpf_xdp_shrink_data(struct xdp_buff *xdp, skb_frag_t *frag,
-				int shrink)
+				int shrink, bool tail)
 {
 	enum xdp_mem_type mem_type = xdp->rxq->mem.type;
 	bool release = skb_frag_size(frag) == shrink;
 
 	if (mem_type == MEM_TYPE_XSK_BUFF_POOL) {
-		bpf_xdp_shrink_data_zc(xdp, shrink, mem_type, release);
+		bpf_xdp_shrink_data_zc(xdp, shrink, tail, mem_type, release);
 		goto out;
 	}
 
 	if (release)
 		__xdp_return(skb_frag_netmem(frag), mem_type, false, NULL);
-
 out:
+	if (!release) {
+		if (!tail)
+			skb_frag_off_add(frag, shrink);
+		skb_frag_size_sub(frag, shrink);
+	}
+
 	return release;
 }
 
@@ -4198,12 +4207,8 @@ static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
 
 		len_free += shrink;
 		offset -= shrink;
-		if (bpf_xdp_shrink_data(xdp, frag, shrink)) {
+		if (bpf_xdp_shrink_data(xdp, frag, shrink, true))
 			n_frags_free++;
-		} else {
-			skb_frag_size_sub(frag, shrink);
-			break;
-		}
 	}
 	sinfo->nr_frags -= n_frags_free;
 	sinfo->xdp_frags_size -= len_free;
-- 
2.47.3



* [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-25 21:29   ` Stanislav Fomichev
  2025-08-25 22:39   ` Jakub Kicinski
  2025-08-25 19:39 ` [RFC bpf-next v1 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

Add a kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
the first "len" bytes of data directly readable and writable in bpf
programs. If the "len" argument is larger than the linear data size,
data in fragments will be copied to the linear region, provided there
is enough room between xdp->data_end and xdp_data_hard_end(xdp), which
depends on the driver implementation.

A use case of the kfunc is to decapsulate headers residing in xdp
fragments. It is possible for a NIC driver to place headers in xdp
fragments. To keep using direct packet access for parsing and
decapsulating headers, users can pull headers into the linear data
area by calling bpf_xdp_pull_data() and then pop the header with
bpf_xdp_adjust_head().

An unused argument, flags, is reserved for future extensions (e.g.,
tossing the data instead of copying it to the linear data area).
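One detail of the implementation is compacting the frag array after
the leading fragments have been freed. That step can be sketched in
userspace C (hypothetical helper, with plain ints standing in for
skb_frag_t entries):

```c
#include <assert.h>

/* After freeing the first @n_freed of @nr frags, shift the remaining
 * frags forward so they start at index 0, mirroring the copy loop in
 * the patch. Returns the new frag count.
 */
static unsigned int compact_frags(int *frags, unsigned int nr,
				  unsigned int n_freed)
{
	unsigned int i;

	for (i = 0; i < nr - n_freed; i++)
		frags[i] = frags[i + n_freed];
	return nr - n_freed;
}
```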

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 net/core/filter.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index f0ee5aec7977..82d953e077ac 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -12211,6 +12211,57 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 	return 0;
 }
 
+__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)x;
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
+	int i, delta, buff_len, n_frags_free = 0, len_free = 0;
+
+	buff_len = xdp_get_buff_len(xdp);
+
+	if (unlikely(len > buff_len))
+		return -EINVAL;
+
+	if (!len)
+		len = xdp_get_buff_len(xdp);
+
+	data_end = xdp->data + len;
+	delta = data_end - xdp->data_end;
+
+	if (delta <= 0)
+		return 0;
+
+	if (unlikely(data_end > data_hard_end))
+		return -EINVAL;
+
+	for (i = 0; i < sinfo->nr_frags && delta; i++) {
+		skb_frag_t *frag = &sinfo->frags[i];
+		u32 shrink = min_t(u32, delta, skb_frag_size(frag));
+
+		memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
+
+		len_free += shrink;
+		delta -= shrink;
+		if (bpf_xdp_shrink_data(xdp, frag, shrink, false))
+			n_frags_free++;
+	}
+
+	for (i = 0; i < sinfo->nr_frags - n_frags_free; i++) {
+		memcpy(&sinfo->frags[i], &sinfo->frags[i + n_frags_free],
+		       sizeof(skb_frag_t));
+	}
+
+	sinfo->nr_frags -= n_frags_free;
+	sinfo->xdp_frags_size -= len_free;
+	xdp->data_end = data_end;
+
+	if (unlikely(!sinfo->nr_frags))
+		xdp_buff_clear_frags_flag(xdp);
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12238,6 +12289,7 @@ BTF_KFUNCS_END(bpf_kfunc_check_set_skb_meta)
 
 BTF_KFUNCS_START(bpf_kfunc_check_set_xdp)
 BTF_ID_FLAGS(func, bpf_dynptr_from_xdp)
+BTF_ID_FLAGS(func, bpf_xdp_pull_data)
 BTF_KFUNCS_END(bpf_kfunc_check_set_xdp)
 
 BTF_KFUNCS_START(bpf_kfunc_check_set_sock_addr)
-- 
2.47.3



* [RFC bpf-next v1 4/7] bpf: Clear packet pointers after changing packet data in kfuncs
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (2 preceding siblings ...)
  2025-08-25 19:39 ` [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

bpf_xdp_pull_data() may change packet data, so packet pointers need to
be invalidated. Add bpf_xdp_pull_data() to the special kfunc list
instead of introducing a new KF_ flag until there are more kfuncs that
change packet data.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 kernel/bpf/verifier.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4e47992361ea..80c64b42596c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12235,6 +12235,7 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_from_skb,
 	KF_bpf_dynptr_from_xdp,
 	KF_bpf_dynptr_from_skb_meta,
+	KF_bpf_xdp_pull_data,
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
@@ -12285,10 +12286,12 @@ BTF_ID(func, bpf_rbtree_right)
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_from_skb_meta)
+BTF_ID(func, bpf_xdp_pull_data)
 #else
 BTF_ID_UNUSED
 BTF_ID_UNUSED
 BTF_ID_UNUSED
+BTF_ID_UNUSED
 #endif
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
@@ -12358,6 +12361,11 @@ static bool is_kfunc_bpf_preempt_enable(struct bpf_kfunc_call_arg_meta *meta)
 	return meta->func_id == special_kfunc_list[KF_bpf_preempt_enable];
 }
 
+static bool is_kfunc_pkt_changing(struct bpf_kfunc_call_arg_meta *meta)
+{
+	return meta->func_id == special_kfunc_list[KF_bpf_xdp_pull_data];
+}
+
 static enum kfunc_ptr_arg_type
 get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 		       struct bpf_kfunc_call_arg_meta *meta,
@@ -14077,6 +14085,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (is_kfunc_pkt_changing(&meta))
+		clear_all_pkt_pointers(env);
+
 	nargs = btf_type_vlen(meta.func_proto);
 	args = (const struct btf_param *)(meta.func_proto + 1);
 	for (i = 0; i < nargs; i++) {
@@ -17794,6 +17805,8 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
 			 */
 			if (ret == 0 && is_kfunc_sleepable(&meta))
 				mark_subprog_might_sleep(env, t);
+			if (ret == 0 && is_kfunc_pkt_changing(&meta))
+				mark_subprog_changes_pkt_data(env, t);
 		}
 		return visit_func_call_insn(t, insns, env, insn->src_reg == BPF_PSEUDO_CALL);
 
-- 
2.47.3



* [RFC bpf-next v1 5/7] bpf: Support specifying linear xdp packet data size in test_run
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (3 preceding siblings ...)
  2025-08-25 19:39 ` [RFC bpf-next v1 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

To test bpf_xdp_pull_data(), an xdp packet containing fragments, as
well as free linear data area after xdp->data_end, needs to be
created. However, bpf_prog_test_run_xdp() always fills the linear area
with data_in before creating fragments, leaving no space to pull data.
This patch allows users to specify the linear data size through
ctx->data_end.

Currently, ctx_in->data_end must match data_size_in and will not be
the final ctx->data_end seen by xdp programs. This is because
ctx->data_end is populated according to the xdp_buff passed to
test_run: the linear data area available in an xdp_buff, max_data_sz,
is always filled up before data_in is copied into fragments.

This patch allows users to specify the size of data that goes into the
linear area. When ctx_in->data_end differs from data_size_in, only
ctx_in->data_end bytes of data will be put into the linear area when
creating the xdp_buff.

While ctx_in->data_end will be allowed to differ from data_size_in, it
cannot be larger than data_size_in, as there would be no data to copy
from user space. If it is larger than the maximum linear data area
size, the layout suggested by the user will not be honored: data
beyond max_data_sz bytes will still be copied into fragments.
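The sizing rule above can be condensed into a small userspace model.
The helper name and standalone form are made up for illustration; the
kernel code operates on kattr and ctx directly.

```c
#include <assert.h>

/* Sketch of the test_run sizing rule in the patch: the linear area
 * gets min(ctx->data_end, max_data_sz) bytes, capped by the amount of
 * user data available; the rest of data_in goes into fragments.
 */
static unsigned int linear_size(unsigned int ctx_data_end,
				unsigned int max_data_sz,
				unsigned int data_size_in)
{
	unsigned int data_sz = ctx_data_end < max_data_sz ?
			       ctx_data_end : max_data_sz;

	return data_size_in < data_sz ? data_size_in : data_sz;
}
```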

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 net/bpf/test_run.c                                       | 9 +++++----
 .../selftests/bpf/prog_tests/xdp_context_test_run.c      | 4 +---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 4a862d605386..1a0d0bc35ad5 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1207,8 +1207,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 {
 	bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES);
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	u32 retval = 0, duration, max_data_sz, data_sz;
 	u32 batch_size = kattr->test.batch_size;
-	u32 retval = 0, duration, max_data_sz;
 	u32 size = kattr->test.data_size_in;
 	u32 headroom = XDP_PACKET_HEADROOM;
 	u32 repeat = kattr->test.repeat;
@@ -1246,7 +1246,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 
 	if (ctx) {
 		/* There can't be user provided data before the meta data */
-		if (ctx->data_meta || ctx->data_end != size ||
+		if (ctx->data_meta || ctx->data_end > size ||
 		    ctx->data > ctx->data_end ||
 		    unlikely(xdp_metalen_invalid(ctx->data)) ||
 		    (do_live && (kattr->test.data_out || kattr->test.ctx_out)))
@@ -1256,11 +1256,12 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	}
 
 	max_data_sz = PAGE_SIZE - headroom - tailroom;
-	if (size > max_data_sz) {
+	data_sz = (ctx && ctx->data_end < max_data_sz) ? ctx->data_end : max_data_sz;
+	if (size > data_sz) {
 		/* disallow live data mode for jumbo frames */
 		if (do_live)
 			goto free_ctx;
-		size = max_data_sz;
+		size = data_sz;
 	}
 
 	data = bpf_test_init(kattr, size, max_data_sz, headroom, tailroom);
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
index 46e0730174ed..178292d1251a 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_context_test_run.c
@@ -97,9 +97,7 @@ void test_xdp_context_test_run(void)
 	/* Meta data must be 255 bytes or smaller */
 	test_xdp_context_error(prog_fd, opts, 0, 256, sizeof(data), 0, 0, 0);
 
-	/* Total size of data must match data_end - data_meta */
-	test_xdp_context_error(prog_fd, opts, 0, sizeof(__u32),
-			       sizeof(data) - 1, 0, 0, 0);
+	/* Total size of data must be data_end - data_meta or larger */
 	test_xdp_context_error(prog_fd, opts, 0, sizeof(__u32),
 			       sizeof(data) + 1, 0, 0, 0);
 
-- 
2.47.3



* [RFC bpf-next v1 6/7] selftests/bpf: Test bpf_xdp_pull_data
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (4 preceding siblings ...)
  2025-08-25 19:39 ` [RFC bpf-next v1 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-25 19:39 ` [RFC bpf-next v1 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

Test bpf_xdp_pull_data() with xdp packets of different layouts. The
xdp bpf program first checks whether the layout is as expected. Then,
it calls bpf_xdp_pull_data(). Finally, it checks the 0xbb marker at
offset 1024 using direct packet access.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../selftests/bpf/prog_tests/xdp_pull_data.c  | 96 +++++++++++++++++++
 .../selftests/bpf/progs/test_xdp_pull_data.c  | 36 +++++++
 2 files changed, 132 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pull_data.c

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c b/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
new file mode 100644
index 000000000000..2cd18e15d47e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_pull_data.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_xdp_pull_data.skel.h"
+
+/* xdp_pull_data_prog will directly read a marker 0xbb stored at buf[1024]
+ * so caller expecting XDP_PASS should always pass pull_len no less than 1024
+ */
+void test_xdp_pull_data_common(struct test_xdp_pull_data *skel,
+			       int buf_len, int linear_len,
+			       int pull_len, int retval)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+	struct xdp_md ctx = {};
+	int prog_fd, err;
+	__u8 *buf;
+
+	buf = calloc(buf_len, sizeof(__u8));
+	if (!ASSERT_OK_PTR(buf, "calloc buf"))
+		return;
+
+	buf[1023] = 0xaa;
+	buf[1024] = 0xbb;
+	buf[1025] = 0xcc;
+
+	topts.data_in = buf;
+	topts.data_out = buf;
+	topts.data_size_in = buf_len;
+	topts.data_size_out = buf_len;
+	ctx.data_end = linear_len;
+	topts.ctx_in = &ctx;
+	topts.ctx_out = &ctx;
+	topts.ctx_size_in = sizeof(ctx);
+	topts.ctx_size_out = sizeof(ctx);
+
+	skel->bss->linear_len = linear_len;
+	skel->bss->pull_len = pull_len;
+
+	prog_fd = bpf_program__fd(skel->progs.xdp_pull_data_prog);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+	ASSERT_EQ(topts.retval, retval, "xdp_pull_data_prog retval");
+
+	if (retval == XDP_DROP)
+		goto out;
+
+	ASSERT_EQ(ctx.data_end, pull_len, "linear data size");
+	ASSERT_EQ(topts.data_size_out, buf_len, "linear + non-linear data size");
+	/* Make sure data around xdp->data_end was not messed up by
+	 * bpf_xdp_pull_data()
+	 */
+	ASSERT_EQ(buf[1023], 0xaa, "buf[1023]");
+	ASSERT_EQ(buf[1024], 0xbb, "buf[1024]");
+	ASSERT_EQ(buf[1025], 0xcc, "buf[1025]");
+out:
+	free(buf);
+}
+
+static void test_xdp_pull_data_basic(void)
+{
+	struct test_xdp_pull_data *skel;
+	u32 page_size;
+
+	skel = test_xdp_pull_data__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "test_xdp_pull_data__open_and_load"))
+		return;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+
+	/* linear xdp pkt, pull 0 byte */
+	test_xdp_pull_data_common(skel, 2048, 2048, 2048, XDP_PASS);
+	/* multi-buf pkt, pull results in linear xdp pkt */
+	test_xdp_pull_data_common(skel, 2048, 1024, 2048, XDP_PASS);
+	/* multi-buf pkt, pull 1 byte to linear data area */
+	test_xdp_pull_data_common(skel, 9000, 1024, 1025, XDP_PASS);
+	/* multi-buf pkt, pull 0 byte to linear data area */
+	test_xdp_pull_data_common(skel, 9000, 1025, 1025, XDP_PASS);
+
+	/* linear xdp pkt, pull more than total data len */
+	test_xdp_pull_data_common(skel, 2048, 2048, 2049, XDP_DROP);
+	/* multi-buf pkt with no space left in linear data area.
+	 * Since ctx.data_end (4096) > max_data_sz, bpf_prog_test_run_xdp()
+	 * will fill the whole linear data area and put the rest into a
+	 * fragment.
+	 */
+	test_xdp_pull_data_common(skel, page_size, page_size, page_size, XDP_DROP);
+
+	test_xdp_pull_data__destroy(skel);
+}
+
+void test_xdp_pull_data(void)
+{
+	if (test__start_subtest("xdp_pull_data"))
+		test_xdp_pull_data_basic();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c b/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c
new file mode 100644
index 000000000000..f32e6b4a79f5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_pull_data.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include  "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+int _version SEC("version") = 1;
+
+int linear_len;
+int pull_len;
+
+SEC("xdp.frags")
+int xdp_pull_data_prog(struct xdp_md *xdp)
+{
+	__u8 *data_end = (void *)(long)xdp->data_end;
+	__u8 *data = (void *)(long)xdp->data;
+	__u8 *val_p;
+	int err;
+
+	if (linear_len != data_end - data)
+		return XDP_DROP;
+
+	err = bpf_xdp_pull_data(xdp, pull_len, 0);
+	if (err)
+		return XDP_DROP;
+
+	val_p = (void *)(long)xdp->data + 1024;
+	if (val_p + 1 > (void *)(long)xdp->data_end)
+		return XDP_DROP;
+
+	if (*val_p != 0xbb)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC bpf-next v1 7/7] selftests: drv-net: Pull data before parsing headers
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (5 preceding siblings ...)
  2025-08-25 19:39 ` [RFC bpf-next v1 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
@ 2025-08-25 19:39 ` Amery Hung
  2025-08-25 22:41 ` [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Jakub Kicinski
  2025-08-28 13:39 ` Nimrod Oren
  8 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 19:39 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

It is possible for drivers to generate xdp packets with data residing
entirely in fragments. To keep parsing headers using direct packet
access, call bpf_xdp_pull_data() to pull headers into the linear data
area.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../selftests/net/lib/xdp_native.bpf.c        | 90 +++++++++++++++----
 1 file changed, 75 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/net/lib/xdp_native.bpf.c b/tools/testing/selftests/net/lib/xdp_native.bpf.c
index 521ba38f2ddd..68b2a08055ce 100644
--- a/tools/testing/selftests/net/lib/xdp_native.bpf.c
+++ b/tools/testing/selftests/net/lib/xdp_native.bpf.c
@@ -14,6 +14,9 @@
 #define MAX_PAYLOAD_LEN 5000
 #define MAX_HDR_LEN 64
 
+extern int bpf_xdp_pull_data(struct xdp_md *xdp, __u32 len,
+			     __u64 flags) __ksym __weak;
+
 enum {
 	XDP_MODE = 0,
 	XDP_PORT = 1,
@@ -68,30 +71,57 @@ static void record_stats(struct xdp_md *ctx, __u32 stat_type)
 
 static struct udphdr *filter_udphdr(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	void *data, *data_end;
+	struct ethhdr *eth;
+	int err;
+
+	err = bpf_xdp_pull_data(ctx, sizeof(*eth), 0);
+	if (err)
+		return NULL;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = eth = (void *)(long)ctx->data;
 
 	if (data + sizeof(*eth) > data_end)
 		return NULL;
 
 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
+		struct iphdr *iph;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*iph) +
+					     sizeof(*udph), 0);
+		if (err)
+			return NULL;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		iph = data + sizeof(*eth);
 
 		if (iph + 1 > (struct iphdr *)data_end ||
 		    iph->protocol != IPPROTO_UDP)
 			return NULL;
 
-		udph = (void *)eth + sizeof(*iph) + sizeof(*eth);
-	} else if (eth->h_proto  == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+		udph = data + sizeof(*iph) + sizeof(*eth);
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ipv6h;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*ipv6h) +
+					     sizeof(*udph), 0);
+		if (err)
+			return NULL;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		ipv6h = data + sizeof(*eth);
 
 		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
 		    ipv6h->nexthdr != IPPROTO_UDP)
 			return NULL;
 
-		udph = (void *)eth + sizeof(*ipv6h) + sizeof(*eth);
+		udph = data + sizeof(*ipv6h) + sizeof(*eth);
 	} else {
 		return NULL;
 	}
@@ -145,17 +175,34 @@ static void swap_machdr(void *data)
 
 static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
 	struct udphdr *udph = NULL;
-	struct ethhdr *eth = data;
+	void *data, *data_end;
+	struct ethhdr *eth;
+	int err;
+
+	err = bpf_xdp_pull_data(ctx, sizeof(*eth), 0);
+	if (err)
+		return XDP_PASS;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = eth = (void *)(long)ctx->data;
 
 	if (data + sizeof(*eth) > data_end)
 		return XDP_PASS;
 
 	if (eth->h_proto == bpf_htons(ETH_P_IP)) {
-		struct iphdr *iph = data + sizeof(*eth);
-		__be32 tmp_ip = iph->saddr;
+		struct iphdr *iph;
+		__be32 tmp_ip;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*iph) +
+					     sizeof(*udph), 0);
+		if (err)
+			return XDP_PASS;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		iph = data + sizeof(*eth);
 
 		if (iph + 1 > (struct iphdr *)data_end ||
 		    iph->protocol != IPPROTO_UDP)
@@ -169,8 +216,10 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 			return XDP_PASS;
 
 		record_stats(ctx, STATS_RX);
+		eth = data;
 		swap_machdr((void *)eth);
 
+		tmp_ip = iph->saddr;
 		iph->saddr = iph->daddr;
 		iph->daddr = tmp_ip;
 
@@ -178,9 +227,19 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 
 		return XDP_TX;
 
-	} else if (eth->h_proto  == bpf_htons(ETH_P_IPV6)) {
-		struct ipv6hdr *ipv6h = data + sizeof(*eth);
+	} else if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
 		struct in6_addr tmp_ipv6;
+		struct ipv6hdr *ipv6h;
+
+		err = bpf_xdp_pull_data(ctx, sizeof(*eth) + sizeof(*ipv6h) +
+					     sizeof(*udph), 0);
+		if (err)
+			return XDP_PASS;
+
+		data_end = (void *)(long)ctx->data_end;
+		data = (void *)(long)ctx->data;
+
+		ipv6h = data + sizeof(*eth);
 
 		if (ipv6h + 1 > (struct ipv6hdr *)data_end ||
 		    ipv6h->nexthdr != IPPROTO_UDP)
@@ -194,6 +253,7 @@ static int xdp_mode_tx_handler(struct xdp_md *ctx, __u16 port)
 			return XDP_PASS;
 
 		record_stats(ctx, STATS_RX);
+		eth = data;
 		swap_machdr((void *)eth);
 
 		__builtin_memcpy(&tmp_ipv6, &ipv6h->saddr, sizeof(tmp_ipv6));
-- 
2.47.3



* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 19:39 ` [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data Amery Hung
@ 2025-08-25 21:29   ` Stanislav Fomichev
  2025-08-25 22:23     ` Amery Hung
  2025-08-25 22:39   ` Jakub Kicinski
  1 sibling, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2025-08-25 21:29 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On 08/25, Amery Hung wrote:
> Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
> fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
> the first len bytes of data directly readable and writable in bpf
> programs. If the "len" argument is larger than the linear data size,
> data in fragments will be copied to the linear region when there
> is enough room between xdp->data_end and xdp_data_hard_end(xdp),
> which is subject to driver implementation.
> 
> A use case of the kfunc is to decapsulate headers residing in xdp
> fragments. It is possible for a NIC driver to place headers in xdp
> fragments. To keep using direct packet access for parsing and
> decapsulating headers, users can pull headers into the linear data
> area by calling bpf_xdp_pull_data() and then pop the header with
> bpf_xdp_adjust_head().
> 
> An unused argument, flags is reserved for future extension (e.g.,
> tossing the data instead of copying it to the linear data area).
> 
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
>  net/core/filter.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index f0ee5aec7977..82d953e077ac 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -12211,6 +12211,57 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>  	return 0;
>  }
>  
> +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
> +{
> +	struct xdp_buff *xdp = (struct xdp_buff *)x;
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> +	void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
> +	int i, delta, buff_len, n_frags_free = 0, len_free = 0;
> +
> +	buff_len = xdp_get_buff_len(xdp);
> +
> +	if (unlikely(len > buff_len))
> +		return -EINVAL;
> +
> +	if (!len)
> +		len = xdp_get_buff_len(xdp);

Why not return -EINVAL here for len=0?

> +
> +	data_end = xdp->data + len;
> +	delta = data_end - xdp->data_end;
> +
> +	if (delta <= 0)
> +		return 0;
> +
> +	if (unlikely(data_end > data_hard_end))
> +		return -EINVAL;
> +
> +	for (i = 0; i < sinfo->nr_frags && delta; i++) {
> +		skb_frag_t *frag = &sinfo->frags[i];
> +		u32 shrink = min_t(u32, delta, skb_frag_size(frag));
> +
> +		memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);

skb_frag_address can return NULL for unreadable frags.


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 21:29   ` Stanislav Fomichev
@ 2025-08-25 22:23     ` Amery Hung
  2025-08-25 22:29       ` Jakub Kicinski
  2025-08-25 22:46       ` Stanislav Fomichev
  0 siblings, 2 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 22:23 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On Mon, Aug 25, 2025 at 2:29 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 08/25, Amery Hung wrote:
> > Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
> > fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
> > the first len bytes of data directly readable and writable in bpf
> > programs. If the "len" argument is larger than the linear data size,
> > data in fragments will be copied to the linear region when there
> > is enough room between xdp->data_end and xdp_data_hard_end(xdp),
> > which is subject to driver implementation.
> >
> > A use case of the kfunc is to decapsulate headers residing in xdp
> > fragments. It is possible for a NIC driver to place headers in xdp
> > fragments. To keep using direct packet access for parsing and
> > decapsulating headers, users can pull headers into the linear data
> > area by calling bpf_xdp_pull_data() and then pop the header with
> > bpf_xdp_adjust_head().
> >
> > An unused argument, flags is reserved for future extension (e.g.,
> > tossing the data instead of copying it to the linear data area).
> >
> > Signed-off-by: Amery Hung <ameryhung@gmail.com>
> > ---
> >  net/core/filter.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 52 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index f0ee5aec7977..82d953e077ac 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -12211,6 +12211,57 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> >       return 0;
> >  }
> >
> > +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
> > +{
> > +     struct xdp_buff *xdp = (struct xdp_buff *)x;
> > +     struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > +     void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
> > +     int i, delta, buff_len, n_frags_free = 0, len_free = 0;
> > +
> > +     buff_len = xdp_get_buff_len(xdp);
> > +
> > +     if (unlikely(len > buff_len))
> > +             return -EINVAL;
> > +
> > +     if (!len)
> > +             len = xdp_get_buff_len(xdp);
>
> Why not return -EINVAL here for len=0?
>

I try to mirror the behavior of bpf_skb_pull_data() to reduce confusion here.

> > +
> > +     data_end = xdp->data + len;
> > +     delta = data_end - xdp->data_end;
> > +
> > +     if (delta <= 0)
> > +             return 0;
> > +
> > +     if (unlikely(data_end > data_hard_end))
> > +             return -EINVAL;
> > +
> > +     for (i = 0; i < sinfo->nr_frags && delta; i++) {
> > +             skb_frag_t *frag = &sinfo->frags[i];
> > +             u32 shrink = min_t(u32, delta, skb_frag_size(frag));
> > +
> > +             memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
>
> skb_frag_address can return NULL for unreadable frags.

Is it safe to assume that drivers will ensure frags to be readable? It
seems at least mlx5 does.

I did a quick check and found other xdp kfuncs using
skb_frag_address() without checking the return.

Thanks for reviewing!


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:23     ` Amery Hung
@ 2025-08-25 22:29       ` Jakub Kicinski
  2025-08-25 22:36         ` Amery Hung
  2025-08-25 22:46       ` Stanislav Fomichev
  1 sibling, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-25 22:29 UTC (permalink / raw)
  To: Amery Hung
  Cc: Stanislav Fomichev, bpf, netdev, alexei.starovoitov, andrii,
	daniel, martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, 25 Aug 2025 15:23:23 -0700 Amery Hung wrote:
> > > +     if (!len)
> > > +             len = xdp_get_buff_len(xdp);  
> >
> > Why not return -EINVAL here for len=0?
> >  
> 
> I try to mirror the behavior of bpf_skb_pull_data() to reduce confusion here.

Different question for the same LoC :)

Why not

	if (!len)
		len = buff_len;
?
Or perhaps:

	len = len ?: buff_len;


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:29       ` Jakub Kicinski
@ 2025-08-25 22:36         ` Amery Hung
  0 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-25 22:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, netdev, alexei.starovoitov, andrii,
	daniel, martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, Aug 25, 2025 at 3:29 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Aug 2025 15:23:23 -0700 Amery Hung wrote:
> > > > +     if (!len)
> > > > +             len = xdp_get_buff_len(xdp);
> > >
> > > Why not return -EINVAL here for len=0?
> > >
> >
> > I try to mirror the behavior of bpf_skb_pull_data() to reduce confusion here.
>
> Different question for the same LoC :)
>
> Why not
>
>         if (!len)
>                 len = buff_len;
> ?
> Or perhaps:
>
>         len = len ?: buff_len;

... Ope. Thanks for catching this. I will fix it in the next iteration.


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 19:39 ` [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data Amery Hung
  2025-08-25 21:29   ` Stanislav Fomichev
@ 2025-08-25 22:39   ` Jakub Kicinski
  2025-08-26  5:12     ` Amery Hung
  1 sibling, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-25 22:39 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On Mon, 25 Aug 2025 12:39:14 -0700 Amery Hung wrote:
> +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
> +{
> +	struct xdp_buff *xdp = (struct xdp_buff *)x;
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> +	void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
> +	int i, delta, buff_len, n_frags_free = 0, len_free = 0;
> +
> +	buff_len = xdp_get_buff_len(xdp);
> +
> +	if (unlikely(len > buff_len))
> +		return -EINVAL;
> +
> +	if (!len)
> +		len = xdp_get_buff_len(xdp);
> +
> +	data_end = xdp->data + len;
> +	delta = data_end - xdp->data_end;
> +
> +	if (delta <= 0)
> +		return 0;
> +
> +	if (unlikely(data_end > data_hard_end))
> +		return -EINVAL;

Is this safe against pointers wrapping on 32b systems?

Maybe it's better to do:

	 if (unlikely(data_hard_end - xdp->data_end < delta))

?

> +	for (i = 0; i < sinfo->nr_frags && delta; i++) {
> +		skb_frag_t *frag = &sinfo->frags[i];
> +		u32 shrink = min_t(u32, delta, skb_frag_size(frag));
> +
> +		memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
> +
> +		len_free += shrink;
> +		delta -= shrink;
> +		if (bpf_xdp_shrink_data(xdp, frag, shrink, false))
> +			n_frags_free++;

possibly

		else
			break;

and then you don't have to check delta in the for loop condition?

> +	}
> +
> +	for (i = 0; i < sinfo->nr_frags - n_frags_free; i++) {
> +		memcpy(&sinfo->frags[i], &sinfo->frags[i + n_frags_free],
> +		       sizeof(skb_frag_t));

This feels like it'd really want to be a memmove(), no?

> +	}
> +
> +	sinfo->nr_frags -= n_frags_free;
> +	sinfo->xdp_frags_size -= len_free;
> +	xdp->data_end = data_end;
> +
> +	if (unlikely(!sinfo->nr_frags))
> +		xdp_buff_clear_frags_flag(xdp);
> +
> +	return 0;
> +}


* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (6 preceding siblings ...)
  2025-08-25 19:39 ` [RFC bpf-next v1 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
@ 2025-08-25 22:41 ` Jakub Kicinski
  2025-08-26 19:38   ` Gal Pressman
  2025-08-28 13:39 ` Nimrod Oren
  8 siblings, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-25 22:41 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Gal Pressman

On Mon, 25 Aug 2025 12:39:11 -0700 Amery Hung wrote:
> Hi all,
> 
> This patchset introduces a new kfunc bpf_xdp_pull_data() to allow
> pulling nonlinear xdp data. This may be useful when a driver places
> headers in fragments. When an xdp program would like to keep parsing
> packet headers using direct packet access, it can call
> bpf_xdp_pull_data() to make the header available in the linear data
> area. The kfunc can also be used to decapsulate the header in the
> nonlinear data, as currently there is no easy way to do this.
> 
> This patchset also tries to fix an issue in the mlx5e driver. The driver
> currently assumes the packet layout to be unchanged after the xdp program
> runs and may generate packets with corrupted data or trigger a kernel warning
> if xdp programs call layout-changing kfuncs such as bpf_xdp_adjust_tail(),
> bpf_xdp_adjust_head() or bpf_xdp_pull_data() introduced in this set.
> 
> Tested with the added bpf selftest using bpf test_run and also on
> mlx5e with the tools/testing/selftests/drivers/net/xdp.py. mlx5e with
> striding RQ will produce xdp_buff with empty linear data.
> xdp.test_xdp_native_pass_mb would fail to parse the header before this
> patchset.
> 
> Grateful for any feedback (especially the driver part).

CC: Gal, this is the correct way to resolve the XDP not having headers
in the first frag for mlx5.


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:23     ` Amery Hung
  2025-08-25 22:29       ` Jakub Kicinski
@ 2025-08-25 22:46       ` Stanislav Fomichev
  2025-08-25 22:58         ` Jakub Kicinski
  1 sibling, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2025-08-25 22:46 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On 08/25, Amery Hung wrote:
> On Mon, Aug 25, 2025 at 2:29 PM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 08/25, Amery Hung wrote:
> > > Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
> > > fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
> > > the first len bytes of data directly readable and writable in bpf
> > > programs. If the "len" argument is larger than the linear data size,
> > > data in fragments will be copied to the linear region when there
> > > is enough room between xdp->data_end and xdp_data_hard_end(xdp),
> > > which is subject to driver implementation.
> > >
> > > A use case of the kfunc is to decapsulate headers residing in xdp
> > > fragments. It is possible for a NIC driver to place headers in xdp
> > > fragments. To keep using direct packet access for parsing and
> > > decapsulating headers, users can pull headers into the linear data
> > > area by calling bpf_xdp_pull_data() and then pop the header with
> > > bpf_xdp_adjust_head().
> > >
> > > An unused argument, flags is reserved for future extension (e.g.,
> > > tossing the data instead of copying it to the linear data area).
> > >
> > > Signed-off-by: Amery Hung <ameryhung@gmail.com>
> > > ---
> > >  net/core/filter.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 52 insertions(+)
> > >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index f0ee5aec7977..82d953e077ac 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -12211,6 +12211,57 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> > >       return 0;
> > >  }
> > >
> > > +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
> > > +{
> > > +     struct xdp_buff *xdp = (struct xdp_buff *)x;
> > > +     struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > > +     void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
> > > +     int i, delta, buff_len, n_frags_free = 0, len_free = 0;
> > > +
> > > +     buff_len = xdp_get_buff_len(xdp);
> > > +
> > > +     if (unlikely(len > buff_len))
> > > +             return -EINVAL;
> > > +
> > > +     if (!len)
> > > +             len = xdp_get_buff_len(xdp);
> >
> > Why not return -EINVAL here for len=0?
> >
> 
> I try to mirror the behavior of bpf_skb_pull_data() to reduce confusion here.

Ah, makes sense!

> > > +
> > > +     data_end = xdp->data + len;
> > > +     delta = data_end - xdp->data_end;
> > > +
> > > +     if (delta <= 0)
> > > +             return 0;
> > > +
> > > +     if (unlikely(data_end > data_hard_end))
> > > +             return -EINVAL;
> > > +
> > > +     for (i = 0; i < sinfo->nr_frags && delta; i++) {
> > > +             skb_frag_t *frag = &sinfo->frags[i];
> > > +             u32 shrink = min_t(u32, delta, skb_frag_size(frag));
> > > +
> > > +             memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
> >
> > skb_frag_address can return NULL for unreadable frags.
> 
> Is it safe to assume that drivers will ensure frags to be readable? It
> seems at least mlx5 does.
> 
> I did a quick check and found other xdp kfuncs using
> skb_frag_address() without checking the return.

The unreadable frags will always be unreadable to the host. This is TCP
device memory, the memory on the accelerators that is not mapped onto
the CPU. Any attempts to read that memory should gracefully error out.

Can you also pls fix that other one? (not as part of the series should
be ok)


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:46       ` Stanislav Fomichev
@ 2025-08-25 22:58         ` Jakub Kicinski
  2025-08-26  0:12           ` Stanislav Fomichev
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-25 22:58 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Amery Hung, bpf, netdev, alexei.starovoitov, andrii, daniel,
	martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, 25 Aug 2025 15:46:02 -0700 Stanislav Fomichev wrote:
> > > skb_frag_address can return NULL for unreadable frags.  
> > 
> > Is it safe to assume that drivers will ensure frags to be readable? It
> > seems at least mlx5 does.
> > 
> > I did a quick check and found other xdp kfuncs using
> > skb_frag_address() without checking the return.  
> 
> The unreadable frags will always be unredabale to the host. This is TCP
> device memory, the memory on the accelerators that is not mapped onto
> the CPU. Any attempts to read that memory should gracefully error out.
> 
> Can you also pls fix that other one? (not as part of the series should
> be ok)

But we don't support mixing XDP with unreadable mem today.
Is the concern just for future proofing?


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:58         ` Jakub Kicinski
@ 2025-08-26  0:12           ` Stanislav Fomichev
  2025-08-26  0:30             ` Jakub Kicinski
  0 siblings, 1 reply; 31+ messages in thread
From: Stanislav Fomichev @ 2025-08-26  0:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Amery Hung, bpf, netdev, alexei.starovoitov, andrii, daniel,
	martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On 08/25, Jakub Kicinski wrote:
> On Mon, 25 Aug 2025 15:46:02 -0700 Stanislav Fomichev wrote:
> > > > skb_frag_address can return NULL for unreadable frags.  
> > > 
> > > Is it safe to assume that drivers will ensure frags to be readable? It
> > > seems at least mlx5 does.
> > > 
> > > I did a quick check and found other xdp kfuncs using
> > > skb_frag_address() without checking the return.  
> > 
> > The unreadable frags will always be unreadable to the host. This is TCP
> > device memory, the memory on the accelerators that is not mapped onto
> > the CPU. Any attempts to read that memory should gracefully error out.
> > 
> > Can you also pls fix that other one? (not as part of the series should
> > be ok)
> 
> But we don't support mixing XDP with unreadable mem today.
> Is the concern just for future proofing?

Good point. I thought we did add that proofing during the initial rx
patch series, but I don't see any. Ignore me!


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-26  0:12           ` Stanislav Fomichev
@ 2025-08-26  0:30             ` Jakub Kicinski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-26  0:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Amery Hung, bpf, netdev, alexei.starovoitov, andrii, daniel,
	martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team

On Mon, 25 Aug 2025 17:12:13 -0700 Stanislav Fomichev wrote:
> > > The unreadable frags will always be unredabale to the host. This is TCP
> > > device memory, the memory on the accelerators that is not mapped onto
> > > the CPU. Any attempts to read that memory should gracefully error out.
> > > 
> > > Can you also pls fix that other one? (not as part of the series should
> > > be ok)  
> > 
> > But we don't support mixing XDP with unreadable mem today.
> > Is the concern just for future proofing?  
> 
> Good point. I though we did add that proofing during the initial rx
> patch series, but I don't see any. Ignore me!

I could be wrong, but I think we did for skb paths only.
Perhaps you're thinking about those helpers.


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-25 22:39   ` Jakub Kicinski
@ 2025-08-26  5:12     ` Amery Hung
  2025-08-26 13:20       ` Jakub Kicinski
  0 siblings, 1 reply; 31+ messages in thread
From: Amery Hung @ 2025-08-26  5:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On Mon, Aug 25, 2025 at 3:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Aug 2025 12:39:14 -0700 Amery Hung wrote:
> > +__bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len, u64 flags)
> > +{
> > +     struct xdp_buff *xdp = (struct xdp_buff *)x;
> > +     struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> > +     void *data_end, *data_hard_end = xdp_data_hard_end(xdp);
> > +     int i, delta, buff_len, n_frags_free = 0, len_free = 0;
> > +
> > +     buff_len = xdp_get_buff_len(xdp);
> > +
> > +     if (unlikely(len > buff_len))
> > +             return -EINVAL;
> > +
> > +     if (!len)
> > +             len = xdp_get_buff_len(xdp);
> > +
> > +     data_end = xdp->data + len;
> > +     delta = data_end - xdp->data_end;
> > +
> > +     if (delta <= 0)
> > +             return 0;
> > +
> > +     if (unlikely(data_end > data_hard_end))
> > +             return -EINVAL;
>
> Is this safe against pointers wrapping on 32b systems?
>

You are right. This may be a problem.

> Maybe it's better to do:
>
>          if (unlikely(data_hard_end - xdp->data_end < delta))
>
> ?

But delta may be negative if the pointer wraps around, and the function
would then still continue. How about adding a data_end < xdp->data
check and reorganizing the checks like this?

        buff_len = xdp_get_buff_len(xdp);

        /* cannot pull more than the packet size */
        if (unlikely(len > buff_len))
                return -EINVAL;

        len = len ?: buff_len;

        data_end = xdp->data + len;

        /* pointer wraps around */
        if (unlikely(data_end < xdp->data))
                return -EINVAL;

        /* cannot pull without enough tailroom in the linear area */
        if (unlikely(data_end > data_hard_end))
                return -EINVAL;

        /* len bytes of data already in the linear area */
        delta = data_end - xdp->data_end;
        if (delta <= 0)
                return 0;

>
> > +     for (i = 0; i < sinfo->nr_frags && delta; i++) {
> > +             skb_frag_t *frag = &sinfo->frags[i];
> > +             u32 shrink = min_t(u32, delta, skb_frag_size(frag));
> > +
> > +             memcpy(xdp->data_end + len_free, skb_frag_address(frag), shrink);
> > +
> > +             len_free += shrink;
> > +             delta -= shrink;
> > +             if (bpf_xdp_shrink_data(xdp, frag, shrink, false))
> > +                     n_frags_free++;
>
> possibly
>
>                 else
>                         break;
>
> and then you don't have to check delta in the for loop condition?
>

I will drop the delta check and add the else branch. I will also make
the bpf_xdp_shrink_data() refactor in patch 2 consistent with this.

> > +     }
> > +
> > +     for (i = 0; i < sinfo->nr_frags - n_frags_free; i++) {
> > +             memcpy(&sinfo->frags[i], &sinfo->frags[i + n_frags_free],
> > +                    sizeof(skb_frag_t));
>
> This feels like it'd really want to be a memmove(), no?
>

Right. Thanks for the suggestion!


> > +     }
> > +
> > +     sinfo->nr_frags -= n_frags_free;
> > +     sinfo->xdp_frags_size -= len_free;
> > +     xdp->data_end = data_end;
> > +
> > +     if (unlikely(!sinfo->nr_frags))
> > +             xdp_buff_clear_frags_flag(xdp);
> > +
> > +     return 0;
> > +}


* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-26  5:12     ` Amery Hung
@ 2025-08-26 13:20       ` Jakub Kicinski
  2025-08-26 13:44         ` Amery Hung
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-26 13:20 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On Mon, 25 Aug 2025 22:12:21 -0700 Amery Hung wrote:
> > > +     data_end = xdp->data + len;
> > > +     delta = data_end - xdp->data_end;
> > > +
> > > +     if (delta <= 0)
> > > +             return 0;
> > > +
> > > +     if (unlikely(data_end > data_hard_end))
> > > +             return -EINVAL;  
> >
> > Is this safe against pointers wrapping on 32b systems?
> >  
> 
> You are right. This may be a problem.
> 
> > Maybe it's better to do:
> >
> >          if (unlikely(data_hard_end - xdp->data_end < delta))
> >
> > ?  
> 
> But delta may be negative if the pointer wraps around, and then the
> function would still continue. How about adding a data_end < xdp->data
> check and reorganizing the checks like this?

You already checked that delta is positive in the previous if (),
so I think it's safe. Admittedly having 3 separate conditions is 
more readable but it's not strictly necessary. Up to you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data
  2025-08-26 13:20       ` Jakub Kicinski
@ 2025-08-26 13:44         ` Amery Hung
  0 siblings, 0 replies; 31+ messages in thread
From: Amery Hung @ 2025-08-26 13:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team

On Tue, Aug 26, 2025 at 6:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Aug 2025 22:12:21 -0700 Amery Hung wrote:
> > > > +     data_end = xdp->data + len;
> > > > +     delta = data_end - xdp->data_end;
> > > > +
> > > > +     if (delta <= 0)
> > > > +             return 0;
> > > > +
> > > > +     if (unlikely(data_end > data_hard_end))
> > > > +             return -EINVAL;
> > >
> > > Is this safe against pointers wrapping on 32b systems?
> > >
> >
> > You are right. This may be a problem.
> >
> > > Maybe it's better to do:
> > >
> > >          if (unlikely(data_hard_end - xdp->data_end < delta))
> > >
> > > ?
> >
> > But delta may be negative if the pointer wraps around, and then the
> > function would still continue. How about adding a data_end < xdp->data
> > check and reorganizing the checks like this?
>
> You already checked that delta is positive in the previous if (),
> so I think it's safe. Admittedly having 3 separate conditions is
> more readable but it's not strictly necessary. Up to you.

Got it. I will change to the new set of checks. The original kfunc
would return 0 when pointer wrapping happens and delta <= 0.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-25 22:41 ` [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Jakub Kicinski
@ 2025-08-26 19:38   ` Gal Pressman
  0 siblings, 0 replies; 31+ messages in thread
From: Gal Pressman @ 2025-08-26 19:38 UTC (permalink / raw)
  To: Jakub Kicinski, Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea, Nimrod Oren

On 26/08/2025 1:41, Jakub Kicinski wrote:
> On Mon, 25 Aug 2025 12:39:11 -0700 Amery Hung wrote:
>> Hi all,
>>
>> This patchset introduces a new kfunc bpf_xdp_pull_data() to allow
>> pulling nonlinear xdp data. This may be useful when a driver places
>> headers in fragments. When an xdp program would like to keep parsing
>> packet headers using direct packet access, it can call
>> bpf_xdp_pull_data() to make the header available in the linear data
>> area. The kfunc can also be used to decapsulate the header in the
>> nonlinear data, as currently there is no easy way to do this.
>>
>> This patchset also tries to fix an issue in the mlx5e driver. The driver
>> currently assumes the packet layout to be unchanged after the xdp program
>> runs and may generate packets with corrupted data or trigger a kernel warning
>> if xdp programs call layout-changing kfuncs such as bpf_xdp_adjust_tail(),
>> bpf_xdp_adjust_head() or bpf_xdp_pull_data() introduced in this set.
>>
>> Tested with the added bpf selftest using bpf test_run and also on
>> mlx5e with the tools/testing/selftests/drivers/net/xdp.py. mlx5e with
>> striding RQ will produce xdp_buff with empty linear data.
>> xdp.test_xdp_native_pass_mb would fail to parse the header before this
>> patchset.
>>
>> Grateful for any feedback (especially the driver part).
> 
> CC: Gal, this is the correct way to resolve the XDP not having headers
> in the first frag for mlx5.

Thanks for the heads up Jakub, CC'd Dragos and Nimrod to review.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
@ 2025-08-27 13:45   ` Dragos Tatulea
  2025-08-28  3:44     ` Amery Hung
  2025-08-28 13:41   ` Nimrod Oren
  1 sibling, 1 reply; 31+ messages in thread
From: Dragos Tatulea @ 2025-08-27 13:45 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, noren

On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote:
> xdp programs can change the layout of an xdp_buff through
> bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
> cannot assume the size of the linear data area or the fragments. Fix the
> bug in the mlx5e driver by generating the skb according to the xdp_buff layout.
>
Good find! Thanks for tackling this Amery.

> Currently, when handling multi-buf xdp, the mlx5e driver assumes the
> layout of an xdp_buff to be unchanged. That is, the linear data area
> continues to be empty and the fragments remains the same.
This is true only for striding rq xdp. Legacy rq xdp puts the header
in the linear part.

> This may
> cause the driver to generate an erroneous skb or trigger a kernel
> warning. When an xdp program adds linear data through
> bpf_xdp_adjust_head(), the linear data will be ignored, as
> mlx5e_build_linear_skb() builds an skb with empty linear data and then
> pulls data from fragments to fill the linear data area. When an xdp
> program has shrunk the nonlinear data through bpf_xdp_adjust_tail(),
> the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
> data size and trigger the BUG_ON in it.
> 
> To fix the issue, first build the skb with linear data area matching
> the xdp_buff. Then, call __pskb_pull_tail() to fill the linear data for
> up to MLX5E_RX_MAX_HEAD bytes. In addition, recalculate nr_frags and
> truesize after xdp program runs.
>
The ordering here seems misleading. AFAIU recalculating nr_frags happens
first.

> Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 59 ++++++++++++++-----
>  1 file changed, 43 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b8c609d91d11..c5173f1ccb4e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>  			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
>  {
>  	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
> +	struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
>  	struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> -	struct mlx5e_wqe_frag_info *head_wi = wi;
>  	u16 rx_headroom = rq->buff.headroom;
>  	struct mlx5e_frag_page *frag_page;
>  	struct skb_shared_info *sinfo;
> -	u32 frag_consumed_bytes;
> +	u32 frag_consumed_bytes, i;
>  	struct bpf_prog *prog;
>  	struct sk_buff *skb;
>  	dma_addr_t addr;
>  	u32 truesize;
> +	u8 nr_frags;
>  	void *va;
>  
>  	frag_page = wi->frag_page;
> @@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>  	prog = rcu_dereference(rq->xdp_prog);
>  	if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
>  		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> -			struct mlx5e_wqe_frag_info *pwi;
> +			pwi = head_wi;
> +			while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
> +				pwi++;
>
Is this trying to skip counting the frags for the linear part? If yes,
I don't understand the reasoning. If not, I don't follow the code.

AFAIU frags have to be counted for the linear part + sinfo->nr_frags.
There could be fewer frags after xdp program execution, but the linear
part is still there.

> -			for (pwi = head_wi; pwi < wi; pwi++)
> +			for (i = 0; i < sinfo->nr_frags; i++, pwi++)
>  				pwi->frag_page->frags++;
Why not:

	pwi = head_wi;
	for (int i = 0; i < (sinfo->nr_frags + 1); i++, pwi++)
		pwi->frag_page->frags++;

>  		}
>  		return NULL; /* page/packet was consumed by XDP */
>  	}
>  
> +	nr_frags = sinfo->nr_frags;
This makes sense. You are using this in xdp_update_skb_shared_info()
below.

> +	pwi = head_wi + 1;
> +
> +	if (prog) {
You could do this here: if (unlikely(sinfo->nr_frags != nr_frags)).

> +		truesize = sinfo->nr_frags * frag_info->frag_stride;
> +
Ack. Recalculating truesize.

> +		while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
> +			pwi++;
Why is this needed here?
> +	}

>  	skb = mlx5e_build_linear_skb(
>  		rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
>  		mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
> @@ -1796,12 +1809,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>  
>  	if (xdp_buff_has_frags(&mxbuf->xdp)) {
>  		/* sinfo->nr_frags is reset by build_skb, calculate again. */
> -		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
> +		xdp_update_skb_shared_info(skb, nr_frags,
>  					   sinfo->xdp_frags_size, truesize,
>  					   xdp_buff_is_frag_pfmemalloc(
>  						&mxbuf->xdp));
>  
> -		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
> +		for (i = 0; i < nr_frags; i++, pwi++)
>  			pwi->frag_page->frags++;
Why not pull the pwi assignment to head_wi + 1 up from the for scope and use i
with an i < nr_frags condition?

>  	}
>  
> @@ -2073,12 +2086,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  	}
>  
>  	if (prog) {
> +		u8 nr_frags;
> +		u32 len, i;
> +
>  		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
>  			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> -				struct mlx5e_frag_page *pfp;
> +				struct mlx5e_frag_page *pagep = head_page;
> +
> +				while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
> +					pagep++;
>
Why do you need this?

> -				for (pfp = head_page; pfp < frag_page; pfp++)
> -					pfp->frags++;
> +				for (i = 0; i < sinfo->nr_frags; i++)
> +					pagep->frags++;
This looks good here but with pfp = head_page. head_page should point to the first
frag. The linear part is in wi->linear_page.


>  				wi->linear_page.frags++;
>  			}
> @@ -2087,9 +2106,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  			return NULL; /* page/packet was consumed by XDP */
>  		}
>  
> +		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
> +		nr_frags = sinfo->nr_frags;
> +
>  		skb = mlx5e_build_linear_skb(
>  			rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
> -			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
> +			mxbuf->xdp.data - mxbuf->xdp.data_hard_start, len,
>  			mxbuf->xdp.data - mxbuf->xdp.data_meta);
This makes sense.

>  		if (unlikely(!skb)) {
>  			mlx5e_page_release_fragmented(rq->page_pool,
> @@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  		mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);
>  
>  		if (xdp_buff_has_frags(&mxbuf->xdp)) {
> -			struct mlx5e_frag_page *pagep;
> +			struct mlx5e_frag_page *pagep = head_page;
> +
> +			truesize = nr_frags * PAGE_SIZE;
I am not sure that this is accurate. The last fragment might be smaller
than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz).

>  
>  			/* sinfo->nr_frags is reset by build_skb, calculate again. */
> -			xdp_update_skb_shared_info(skb, frag_page - head_page,
> +			xdp_update_skb_shared_info(skb, nr_frags,
>  						   sinfo->xdp_frags_size, truesize,
>  						   xdp_buff_is_frag_pfmemalloc(
>  							&mxbuf->xdp));
>  
> -			pagep = head_page;
> -			do
> +			while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
> +				pagep++;
> +
> +			for (i = 0; i < nr_frags; i++, pagep++)
>  				pagep->frags++;
> -			while (++pagep < frag_page);
> +
> +			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
> +			__pskb_pull_tail(skb, headlen);
>  		}
> -		__pskb_pull_tail(skb, headlen);
What happens when there are no more frags? (bpf_xdp_frags_shrink_tail()
shrank them out). Is that at all possible?

In general, I think the code would be nicer if it would do a rewind of
the end pointer based on the diff between the old and new nr_frags.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-08-27 13:45   ` Dragos Tatulea
@ 2025-08-28  3:44     ` Amery Hung
  2025-08-28 16:23       ` Dragos Tatulea
  0 siblings, 1 reply; 31+ messages in thread
From: Amery Hung @ 2025-08-28  3:44 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, noren

On Wed, Aug 27, 2025 at 6:45 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote:
> > xdp programs can change the layout of an xdp_buff through
> > bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
> > cannot assume the size of the linear data area or the fragments. Fix the
> > bug in the mlx5e driver by generating the skb according to the xdp_buff layout.
> >
> Good find! Thanks for tackling this Amery.
>
> > Currently, when handling multi-buf xdp, the mlx5e driver assumes the
> > layout of an xdp_buff to be unchanged. That is, the linear data area
> > continues to be empty and the fragments remains the same.
> This is true only for striding rq xdp. Legacy rq xdp puts the header
> in the linear part.
>
> > This may
> > cause the driver to generate an erroneous skb or trigger a kernel
> > warning. When an xdp program adds linear data through
> > bpf_xdp_adjust_head(), the linear data will be ignored, as
> > mlx5e_build_linear_skb() builds an skb with empty linear data and then
> > pulls data from fragments to fill the linear data area. When an xdp
> > program has shrunk the nonlinear data through bpf_xdp_adjust_tail(),
> > the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
> > data size and trigger the BUG_ON in it.
> >
> > To fix the issue, first build the skb with linear data area matching
> > the xdp_buff. Then, call __pskb_pull_tail() to fill the linear data for
> > up to MLX5E_RX_MAX_HEAD bytes. In addition, recalculate nr_frags and
> > truesize after xdp program runs.
> >
> The ordering here seems misleading. AFAIU recalculating nr_frags happens
> first.
>
> > Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ")
> > Signed-off-by: Amery Hung <ameryhung@gmail.com>
> > ---
> >  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 59 ++++++++++++++-----
> >  1 file changed, 43 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index b8c609d91d11..c5173f1ccb4e 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >                            struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
> >  {
> >       struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
> > +     struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
> >       struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> > -     struct mlx5e_wqe_frag_info *head_wi = wi;
> >       u16 rx_headroom = rq->buff.headroom;
> >       struct mlx5e_frag_page *frag_page;
> >       struct skb_shared_info *sinfo;
> > -     u32 frag_consumed_bytes;
> > +     u32 frag_consumed_bytes, i;
> >       struct bpf_prog *prog;
> >       struct sk_buff *skb;
> >       dma_addr_t addr;
> >       u32 truesize;
> > +     u8 nr_frags;
> >       void *va;
> >
> >       frag_page = wi->frag_page;
> > @@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       prog = rcu_dereference(rq->xdp_prog);
> >       if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
> >               if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> > -                     struct mlx5e_wqe_frag_info *pwi;
> > +                     pwi = head_wi;
> > +                     while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
> > +                             pwi++;
> >
> Is this trying to skip counting the frags for the linear part? If yes,
> don't understand the reasoning. If not, I don't follow the code.
>
> AFAIU frags have to be counted for the linear part + sinfo->nr_frags.
> Frags could be less after xdp program execution, but the linear part is
> still there.
>

This is to search for the first frag after xdp runs, because I thought
it was possible that the first frag (head_wi+1) might be released by
bpf_xdp_pull_data(), and then the frags would start from head_wi+2.

After sleeping on it a bit, it seems this is not possible, as there is
not enough room in the linear area to completely pull PAGE_SIZE bytes
of data out of the first frag. Is this correct?

> > -                     for (pwi = head_wi; pwi < wi; pwi++)
> > +                     for (i = 0; i < sinfo->nr_frags; i++, pwi++)
> >                               pwi->frag_page->frags++;
> Why not:

Will fix it as well as other similar places.

>
>         pwi = head_wi;
>         for (int i = 0; i < (sinfo->nr_frags + 1); i++, pwi++)
>                 pwi->frag_page->frags++;
>
> >               }
> >               return NULL; /* page/packet was consumed by XDP */
> >       }
> >
> > +     nr_frags = sinfo->nr_frags;
> This makes sense. You are using this in xdp_update_skb_shared_info()
> below.
>
> > +     pwi = head_wi + 1;
> > +
> > +     if (prog) {
> You could do this here: if (unlikely(sinfo->nr_frags != nr_frags)).

Got it.

>
> > +             truesize = sinfo->nr_frags * frag_info->frag_stride;
> > +
> Ack. Recalculating truesize.
>
> > +             while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
> > +                     pwi++;
> Why is this needed here?
> > +     }
>
> >       skb = mlx5e_build_linear_skb(
> >               rq, mxbuf->xdp.data_hard_start, rq->buff.frame0_sz,
> >               mxbuf->xdp.data - mxbuf->xdp.data_hard_start,
> > @@ -1796,12 +1809,12 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >
> >       if (xdp_buff_has_frags(&mxbuf->xdp)) {
> >               /* sinfo->nr_frags is reset by build_skb, calculate again. */
> > -             xdp_update_skb_shared_info(skb, wi - head_wi - 1,
> > +             xdp_update_skb_shared_info(skb, nr_frags,
> >                                          sinfo->xdp_frags_size, truesize,
> >                                          xdp_buff_is_frag_pfmemalloc(
> >                                               &mxbuf->xdp));
> >
> > -             for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
> > +             for (i = 0; i < nr_frags; i++, pwi++)
> >                       pwi->frag_page->frags++;
> Why not pull the pwi assignment to head_wi + 1 up from the for scope and use i
> with an i < nr_frags condition?
>
> >       }
> >
> > @@ -2073,12 +2086,18 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >       }
> >
> >       if (prog) {
> > +             u8 nr_frags;
> > +             u32 len, i;
> > +
> >               if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
> >                       if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> > -                             struct mlx5e_frag_page *pfp;
> > +                             struct mlx5e_frag_page *pagep = head_page;
> > +
> > +                             while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
> > +                                     pagep++;
> >
> Why do you need this?
>
> > -                             for (pfp = head_page; pfp < frag_page; pfp++)
> > -                                     pfp->frags++;
> > +                             for (i = 0; i < sinfo->nr_frags; i++)
> > +                                     pagep->frags++;
> This looks good here but with pfp = head_page. head_page should point to the first
> frag. The linear part is in wi->linear_page.
>
>
> >                               wi->linear_page.frags++;
> >                       }
> > @@ -2087,9 +2106,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >                       return NULL; /* page/packet was consumed by XDP */
> >               }
> >
> > +             len = mxbuf->xdp.data_end - mxbuf->xdp.data;
> > +             nr_frags = sinfo->nr_frags;
> > +
> >               skb = mlx5e_build_linear_skb(
> >                       rq, mxbuf->xdp.data_hard_start, linear_frame_sz,
> > -                     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, 0,
> > +                     mxbuf->xdp.data - mxbuf->xdp.data_hard_start, len,
> >                       mxbuf->xdp.data - mxbuf->xdp.data_meta);
> This makes sense.
>
> >               if (unlikely(!skb)) {
> >                       mlx5e_page_release_fragmented(rq->page_pool,
> > @@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >               mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);
> >
> >               if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > -                     struct mlx5e_frag_page *pagep;
> > +                     struct mlx5e_frag_page *pagep = head_page;
> > +
> > +                     truesize = nr_frags * PAGE_SIZE;
> I am not sure that this is accurate. The last fragment might be smaller
> than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz).
>

According to the truesize calculation in
mlx5e_skb_from_cqe_mpwrq_nonlinear(), just before mlx5e_xdp_handle(),
after the first frag, frag_offset is always 0 and pg_consumed_bytes
will be PAGE_SIZE. Therefore the last page also consumes a full page,
no?

If the last page has a variable size, I wonder how
bpf_xdp_adjust_tail() could handle a dynamic tailroom.
bpf_xdp_adjust_tail() requires a driver to specify a static frag size
(the maximum size a frag can grow to) when calling
__xdp_rxq_info_reg(), which seems to be a page in mlx5.


> >
> >                       /* sinfo->nr_frags is reset by build_skb, calculate again. */
> > -                     xdp_update_skb_shared_info(skb, frag_page - head_page,
> > +                     xdp_update_skb_shared_info(skb, nr_frags,
> >                                                  sinfo->xdp_frags_size, truesize,
> >                                                  xdp_buff_is_frag_pfmemalloc(
> >                                                       &mxbuf->xdp));
> >
> > -                     pagep = head_page;
> > -                     do
> > +                     while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
> > +                             pagep++;
> > +
> > +                     for (i = 0; i < nr_frags; i++, pagep++)
> >                               pagep->frags++;
> > -                     while (++pagep < frag_page);
> > +
> > +                     headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
> > +                     __pskb_pull_tail(skb, headlen);
> >               }
> > -             __pskb_pull_tail(skb, headlen);
> What happens when there are no more frags? (bpf_xdp_frags_shrink_tail()
> shrank them out). Is that at all possible?

It is possible for bpf_xdp_frags_shrink_tail() to release all frags.
There is no limit on how much they can shrink. If there is linear
data, the kfunc allows shrinking data_end down to ETH_HLEN. Before this
patchset, it could trigger a BUG_ON in __pskb_pull_tail(). After this
set, the driver will pass an empty skb to the upper layer.

For bpf_xdp_pull_data(), in the case of mlx5, I think it is only
possible to release all frags when the first and only frag contains
less than 256 bytes, which is the free space in the linear page.

>
> In general, I think the code would be nicer if it would do a rewind of
> the end pointer based on the diff between the old and new nr_frags.
>

Not sure if I get this. Do you mean calling __pskb_pull_tail() somehow
based on the difference between sinfo->nr_frags and nr_frags?

Thanks for reviewing the patch!

> Thanks,
> Dragos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
                   ` (7 preceding siblings ...)
  2025-08-25 22:41 ` [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Jakub Kicinski
@ 2025-08-28 13:39 ` Nimrod Oren
  2025-08-29  7:26   ` Amery Hung
  2025-08-29 18:21   ` Martin KaFai Lau
  8 siblings, 2 replies; 31+ messages in thread
From: Nimrod Oren @ 2025-08-28 13:39 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea

On 25/08/2025 22:39, Amery Hung wrote:
> This patchset introduces a new kfunc bpf_xdp_pull_data() to allow
> pulling nonlinear xdp data. This may be useful when a driver places
> headers in fragments. When an xdp program would like to keep parsing
> packet headers using direct packet access, it can call
> bpf_xdp_pull_data() to make the header available in the linear data
> area. The kfunc can also be used to decapsulate the header in the
> nonlinear data, as currently there is no easy way to do this.

I'm currently working on a series that converts the xdp_native program
to use dynptr for accessing header data. If accepted, it should provide
better performance, since dynptr can access the data without copying it.

> This patchset also tries to fix an issue in the mlx5e driver. The driver
> currently assumes the packet layout to be unchanged after the xdp program
> runs and may generate packets with corrupted data or trigger a kernel warning
> if xdp programs call layout-changing kfuncs such as bpf_xdp_adjust_tail(),
> bpf_xdp_adjust_head() or bpf_xdp_pull_data() introduced in this set.

Thanks for working on this!

> Tested with the added bpf selftest using bpf test_run and also on
> mlx5e with the tools/testing/selftests/drivers/net/xdp.py. mlx5e with
> striding RQ will produce xdp_buff with empty linear data.
> xdp.test_xdp_native_pass_mb would fail to parse the header before this
> patchset.

I got a crash when testing this series with the xdp_dummy program from
tools/testing/selftests/net/lib/. Need to make sure we're not breaking
compatibility for programs that keep the linear part empty.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
  2025-08-27 13:45   ` Dragos Tatulea
@ 2025-08-28 13:41   ` Nimrod Oren
  1 sibling, 0 replies; 31+ messages in thread
From: Nimrod Oren @ 2025-08-28 13:41 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea, Gal Pressman

On 25/08/2025 22:39, Amery Hung wrote:
> +			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);

I suspect that this is what caused the crash I mentioned earlier.
It seems that sinfo->xdp_frags_size is already zeroed at this point.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  2025-08-25 19:39 ` [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
@ 2025-08-28 13:43   ` Nimrod Oren
  0 siblings, 0 replies; 31+ messages in thread
From: Nimrod Oren @ 2025-08-28 13:43 UTC (permalink / raw)
  To: Amery Hung, bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea, Gal Pressman

On 25/08/2025 22:39, Amery Hung wrote:
> Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
> functionality to be able to shrink an xdp fragment from both head and
> tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
> xdp fragment from head.
I had assumed that XDP multi-buffer frags must always be the same size,
except for the last one. If that’s the case, shrinking from the head
seems to break this rule.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff
  2025-08-28  3:44     ` Amery Hung
@ 2025-08-28 16:23       ` Dragos Tatulea
  0 siblings, 0 replies; 31+ messages in thread
From: Dragos Tatulea @ 2025-08-28 16:23 UTC (permalink / raw)
  To: Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, noren

On Wed, Aug 27, 2025 at 08:44:24PM -0700, Amery Hung wrote:
> On Wed, Aug 27, 2025 at 6:45 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 12:39:12PM -0700, Amery Hung wrote:
> > > [...]
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index b8c609d91d11..c5173f1ccb4e 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -1725,16 +1725,17 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> > >                            struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
> > >  {
> > >       struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
> > > +     struct mlx5e_wqe_frag_info *pwi, *head_wi = wi;
> > >       struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> > > -     struct mlx5e_wqe_frag_info *head_wi = wi;
> > >       u16 rx_headroom = rq->buff.headroom;
> > >       struct mlx5e_frag_page *frag_page;
> > >       struct skb_shared_info *sinfo;
> > > -     u32 frag_consumed_bytes;
> > > +     u32 frag_consumed_bytes, i;
> > >       struct bpf_prog *prog;
> > >       struct sk_buff *skb;
> > >       dma_addr_t addr;
> > >       u32 truesize;
> > > +     u8 nr_frags;
> > >       void *va;
> > >
> > >       frag_page = wi->frag_page;
> > > @@ -1775,14 +1776,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> > >       prog = rcu_dereference(rq->xdp_prog);
> > >       if (prog && mlx5e_xdp_handle(rq, prog, mxbuf)) {
> > >               if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> > > -                     struct mlx5e_wqe_frag_info *pwi;
> > > +                     pwi = head_wi;
> > > +                     while (pwi->frag_page->netmem != sinfo->frags[0].netmem && pwi < wi)
> > > +                             pwi++;
> > >
> > Is this trying to skip counting the frags for the linear part? If yes,
> > don't understand the reasoning. If not, I don't follow the code.
> >
> > AFAIU frags have to be counted for the linear part + sinfo->nr_frags.
> > Frags could be less after xdp program execution, but the linear part is
> > still there.
> >
> 
> This is to search the first frag after xdp runs because I thought it
> is possible that the first frag (head_wi+1) might be released by
> bpf_xdp_pull_data() and then the frag will start from head_wi+2.
> 
> After sleeping on it a bit, it seems it is not possible as there is
> not enough room in the linear to completely pull PAGE_SIZE byte of
> data from the first frag to the linear area. Is this correct?
> 
Right. AFAIU the usable linear part is smaller due to headroom and
tailroom.

[...]
> > >               if (unlikely(!skb)) {
> > >                       mlx5e_page_release_fragmented(rq->page_pool,
> > > @@ -2102,20 +2124,25 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > >               mlx5e_page_release_fragmented(rq->page_pool, &wi->linear_page);
> > >
> > >               if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > > -                     struct mlx5e_frag_page *pagep;
> > > +                     struct mlx5e_frag_page *pagep = head_page;
> > > +
> > > +                     truesize = nr_frags * PAGE_SIZE;
> > I am not sure that this is accurate. The last fragment might be smaller
> > than page size. It should be aligned to BIT(rq->mpwqe.log_stride_sz).
> >
> 
> This follows the truesize calculation in
> mlx5e_skb_from_cqe_mpwrq_nonlinear() just before mlx5e_xdp_handle():
> after the first frag, frag_offset is always 0 and pg_consumed_bytes
> will be PAGE_SIZE. Therefore the last page also consumes a full page,
> no?
>
My understanding is that the last pg_consumed_bytes will be a byte_cnt
that is smaller than PAGE_SIZE as there is a min operation.

> If the last page has variable size, I wonder how can
> bpf_xdp_adjust_tail() handle a dynamic tailroom. 
That is a good point. So this can stay as is I guess.

> bpf_xdp_adjust_tail()
> requires a driver to specify a static frag size (the maximum size a
> frag can grow) when calling __xdp_rxq_info_reg(), which seems to be a
> page in mlx5.
>
This is an issue raised by Nimrod as well. Currently striding rq sets
rxq->frag_size to 0. It is set to PAGE_SIZE only in legacy rq mode.

> 
> > >
> > >                       /* sinfo->nr_frags is reset by build_skb, calculate again. */
> > > -                     xdp_update_skb_shared_info(skb, frag_page - head_page,
> > > +                     xdp_update_skb_shared_info(skb, nr_frags,
> > >                                                  sinfo->xdp_frags_size, truesize,
> > >                                                  xdp_buff_is_frag_pfmemalloc(
> > >                                                       &mxbuf->xdp));
> > >
> > > -                     pagep = head_page;
> > > -                     do
> > > +                     while (pagep->netmem != sinfo->frags[0].netmem && pagep < frag_page)
> > > +                             pagep++;
> > > +
> > > +                     for (i = 0; i < nr_frags; i++, pagep++)
> > >                               pagep->frags++;
> > > -                     while (++pagep < frag_page);
> > > +
> > > +                     headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len, sinfo->xdp_frags_size);
> > > +                     __pskb_pull_tail(skb, headlen);
> > >               }
> > > -             __pskb_pull_tail(skb, headlen);
> > What happens when there are no more frags? (bpf_xdp_frags_shrink_tail()
> > shrinked them out). Is that at all possible?
> 
> It is possible for bpf_xdp_frags_shrink_tail() to release all frags.
> There is no limit on how much they can shrink. If there is linear
> data, the kfunc allows shrinking data_end down to ETH_HLEN. Before this
> patchset, it could trigger a BUG_ON in __pskb_pull_tail(). After this
> set, the driver will pass an empty skb to the upper layer.
> 
I see what you mean.

> For bpf_xdp_pull_data(), in the case of mlx5, I think it is only
> possible to release all frags when the first and only frag contains
> less than 256 bytes, which is the free space in the linear page.
>
Why would only 256 bytes be free in the linear area? My understanding
is that we have PAGE_SIZE - headroom - tailroom, which should be more?

> >
> > In general, I think the code would be nicer if it would do a rewind of
> > the end pointer based on the diff between the old and new nr_frags.
> >
> 
> Not sure if I get this. Do you mean calling __pskb_pull_tail()
> somehow based on the difference between sinfo->nr_frags and nr_frags?
> 
> Thanks for reviewing the patch!
>
I was suggesting an approach for the whole patch that might be cleaner.

Roll back frag_page to the last used fragment after program execution:

	frag_page -= old_nr_frags - new_nr_frags;

... and after that you won't need to touch the frag counting loops
and the xdp_update_skb_shared_info().

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-28 13:39 ` Nimrod Oren
@ 2025-08-29  7:26   ` Amery Hung
  2025-08-30  0:09     ` Jakub Kicinski
  2025-08-29 18:21   ` Martin KaFai Lau
  1 sibling, 1 reply; 31+ messages in thread
From: Amery Hung @ 2025-08-29  7:26 UTC (permalink / raw)
  To: Nimrod Oren
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea

On Thu, Aug 28, 2025 at 6:39 AM Nimrod Oren <noren@nvidia.com> wrote:
>
> On 25/08/2025 22:39, Amery Hung wrote:
> > This patchset introduces a new kfunc bpf_xdp_pull_data() to allow
> > pulling nonlinear xdp data. This may be useful when a driver places
> > headers in fragments. When an xdp program would like to keep parsing
> > packet headers using direct packet access, it can call
> > bpf_xdp_pull_data() to make the header available in the linear data
> > area. The kfunc can also be used to decapsulate the header in the
> > nonlinear data, as currently there is no easy way to do this.
>
> I'm currently working on a series that converts the xdp_native program
> to use dynptr for accessing header data. If accepted, it should provide
> better performance, since dynptr can access without copying the data.
>

I feel that bpf_xdp_pull_data() is a more generic approach, but yeah
dynptr may yield better performance. Looking forward to seeing the
numbers.

It will also be great if the dynptr approach doesn't require
xdp_native to make assumptions about the xdp_buff layout (headers are
in frags if linear data is empty) or to maintain two versions of the
header parsing code.

> > This patchset also tries to fix an issue in the mlx5e driver. The driver
> > currently assumes the packet layout to be unchanged after the xdp program
> > runs, and may generate packets with corrupted data or trigger a kernel
> > warning if an xdp program calls a layout-changing kfunc such as
> > bpf_xdp_adjust_tail(), bpf_xdp_adjust_head() or the bpf_xdp_pull_data()
> > introduced in this set.
>
> Thanks for working on this!
>
> > Tested with the added bpf selftest using bpf test_run and also on
> > mlx5e with the tools/testing/selftests/drivers/net/xdp.py. mlx5e with
> > striding RQ will produce xdp_buff with empty linear data.
> > xdp.test_xdp_native_pass_mb would fail to parse the header before this
> > patchset.
>
> I got a crash when testing this series with the xdp_dummy program from
> tools/testing/selftests/net/lib/. Need to make sure we're not breaking
> compatibility for programs that keep the linear part empty.

Appreciate the detailed review! I will make sure to test xdp_dummy as
well. I am taking some time off and will get back to this patchset
next week.


* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-28 13:39 ` Nimrod Oren
  2025-08-29  7:26   ` Amery Hung
@ 2025-08-29 18:21   ` Martin KaFai Lau
  1 sibling, 0 replies; 31+ messages in thread
From: Martin KaFai Lau @ 2025-08-29 18:21 UTC (permalink / raw)
  To: Nimrod Oren, Amery Hung
  Cc: bpf, netdev, alexei.starovoitov, andrii, daniel, kuba, martin.lau,
	mohsin.bashr, saeedm, tariqt, mbloch, maciej.fijalkowski,
	kernel-team, Dragos Tatulea

On 8/28/25 6:39 AM, Nimrod Oren wrote:
> I'm currently working on a series that converts the xdp_native program
> to use dynptr for accessing header data. If accepted, it should provide
> better performance, since dynptr can access without copying the data.

The bpf_xdp_adjust_tail is aware of xdp_buff_has_frags. Is there a reason that 
bpf_xdp_adjust_head cannot handle frags also?


* Re: [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data
  2025-08-29  7:26   ` Amery Hung
@ 2025-08-30  0:09     ` Jakub Kicinski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Kicinski @ 2025-08-30  0:09 UTC (permalink / raw)
  To: Amery Hung
  Cc: Nimrod Oren, bpf, netdev, alexei.starovoitov, andrii, daniel,
	martin.lau, mohsin.bashr, saeedm, tariqt, mbloch,
	maciej.fijalkowski, kernel-team, Dragos Tatulea

On Fri, 29 Aug 2025 00:26:29 -0700 Amery Hung wrote:
> > I'm currently working on a series that converts the xdp_native program
> > to use dynptr for accessing header data. If accepted, it should provide
> > better performance, since dynptr can access without copying the data.
> 
> I feel that bpf_xdp_pull_data() is a more generic approach, but yeah
> dynptr may yield better performance. Looking forward to seeing the
> numbers.

To be 100% clear, being able to push and pull custom UDP encap headers
(that the NIC may not even be able to parse) is one of the main use
cases for XDP, e.g. for L3 load balancers. dynptr may be slower or
faster for reading and writing to the packet, but it can't push or pull.
So this is a bit of a side conversation. Sorry if I'm stating the
obvious.


end of thread, other threads:[~2025-08-30  0:09 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-25 19:39 [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Amery Hung
2025-08-25 19:39 ` [RFC bpf-next v1 1/7] net/mlx5e: Fix generating skb from nonlinear xdp_buff Amery Hung
2025-08-27 13:45   ` Dragos Tatulea
2025-08-28  3:44     ` Amery Hung
2025-08-28 16:23       ` Dragos Tatulea
2025-08-28 13:41   ` Nimrod Oren
2025-08-25 19:39 ` [RFC bpf-next v1 2/7] bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail Amery Hung
2025-08-28 13:43   ` Nimrod Oren
2025-08-25 19:39 ` [RFC bpf-next v1 3/7] bpf: Support pulling non-linear xdp data Amery Hung
2025-08-25 21:29   ` Stanislav Fomichev
2025-08-25 22:23     ` Amery Hung
2025-08-25 22:29       ` Jakub Kicinski
2025-08-25 22:36         ` Amery Hung
2025-08-25 22:46       ` Stanislav Fomichev
2025-08-25 22:58         ` Jakub Kicinski
2025-08-26  0:12           ` Stanislav Fomichev
2025-08-26  0:30             ` Jakub Kicinski
2025-08-25 22:39   ` Jakub Kicinski
2025-08-26  5:12     ` Amery Hung
2025-08-26 13:20       ` Jakub Kicinski
2025-08-26 13:44         ` Amery Hung
2025-08-25 19:39 ` [RFC bpf-next v1 4/7] bpf: Clear packet pointers after changing packet data in kfuncs Amery Hung
2025-08-25 19:39 ` [RFC bpf-next v1 5/7] bpf: Support specifying linear xdp packet data size in test_run Amery Hung
2025-08-25 19:39 ` [RFC bpf-next v1 6/7] selftests/bpf: Test bpf_xdp_pull_data Amery Hung
2025-08-25 19:39 ` [RFC bpf-next v1 7/7] selftests: drv-net: Pull data before parsing headers Amery Hung
2025-08-25 22:41 ` [RFC bpf-next v1 0/7] Add kfunc bpf_xdp_pull_data Jakub Kicinski
2025-08-26 19:38   ` Gal Pressman
2025-08-28 13:39 ` Nimrod Oren
2025-08-29  7:26   ` Amery Hung
2025-08-30  0:09     ` Jakub Kicinski
2025-08-29 18:21   ` Martin KaFai Lau
