public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/6] xsk: tailroom reservation and MTU validation
@ 2026-03-16 17:45 Maciej Fijalkowski
  2026-03-16 17:45 ` [PATCH net 1/6] xsk: respect tailroom for ZC setups Maciej Fijalkowski
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Hi,

here we fix a long-standing issue with the multi-buffer scenario in ZC
mode - we have not been reserving space at the end of the buffer for the
skb_shared_info that multi-buffer XDP operates on. This was brought to
our attention via [0].

Unaligned mode does not get any specific treatment; it is the user's
responsibility to properly handle XSK addresses in the queues.

With the two xskxceiver adjustments included in this set, I have been
able to pass the full test suite on ice.

Thanks,
Maciej

[0]: https://community.intel.com/t5/Ethernet-Products/X710-XDP-Packet-Corruption-Issue-DRV-MODE-Zero-Copy-Multi-Buffer/m-p/1724208


Maciej Fijalkowski (6):
  xsk: respect tailroom for ZC setups
  ice: do not round up result of dbuff calculation for xsk pool
  i40e: do not round up result of dbuff calculation for xsk pool
  xsk: validate MTU against usable frame size on bind
  selftests: bpf: fix pkt grow tests
  selftests: bpf: have a separate variable for drop test

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 ++++---
 drivers/net/ethernet/intel/ice/ice_base.c     | 10 +++++++--
 include/net/xdp_sock_drv.h                    | 21 ++++++++++++++++++-
 net/xdp/xsk_buff_pool.c                       | 16 +++++++++++---
 .../selftests/bpf/prog_tests/test_xsk.c       | 21 ++++++++++++++++---
 .../selftests/bpf/progs/xsk_xdp_progs.c       |  3 ++-
 6 files changed, 66 insertions(+), 13 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net 1/6] xsk: respect tailroom for ZC setups
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  2026-03-16 22:53   ` Stanislav Fomichev
  2026-03-16 17:45 ` [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool Maciej Fijalkowski
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():

	((xdp)->data_hard_start + (xdp)->frame_sz -	\
	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

and then we refer to it via macro below:

static inline struct skb_shared_info *
xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
{
        return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}

Currently we do not respect this tailroom space in the multi-buffer
AF_XDP ZC scenario. To address this, introduce xsk_pool_get_tailroom()
and use it within xsk_pool_get_rx_frame_size(), which ZC drivers use to
configure the length of the HW Rx buffer.

xsk_pool_get_tailroom() reserves the necessary space only when the pool
is zc and the underlying netdev supports zc multi-buffer. Since this
function relies on the pool->umem->zc setting, set it before ndo_bpf
during zc configuration, so that a driver calling
xsk_pool_get_rx_frame_size() inside ndo_bpf gets the correct tailroom
value.
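
The resulting per-chunk layout can be modeled in plain C. This is an
illustrative userspace sketch, not the kernel code: the 256-byte
headroom matches XDP_PACKET_HEADROOM, and 320 is the assumed
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) value on a typical
64-bit build.

```c
/* Userspace model of the chunk layout after this patch; the real
 * helpers live in include/net/xdp_sock_drv.h. The 320-byte tailroom
 * is SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) and is an
 * assumption for a typical 64-bit build.
 */
#define XDP_PACKET_HEADROOM	256
#define SHARED_INFO_TAILROOM	320

/* Usable HW Rx buffer length for a chunk; tailroom is reserved only
 * for zero-copy pools that opted into multi-buffer (SG).
 */
static unsigned int rx_frame_size(unsigned int chunk_size,
				  unsigned int pool_headroom,
				  int zc, int sg)
{
	unsigned int tailroom = (zc && sg) ? SHARED_INFO_TAILROOM : 0;

	return chunk_size - (XDP_PACKET_HEADROOM + pool_headroom) - tailroom;
}
```

With a 4k chunk and no extra pool headroom this yields 3520 usable
bytes for a zc multi-buffer pool, versus 3840 before this change.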

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xdp_sock_drv.h | 21 ++++++++++++++++++++-
 net/xdp/xsk_buff_pool.c    |  3 ++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 6b9ebae2dc95..13b2aae00737 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
 	return XDP_PACKET_HEADROOM + pool->headroom;
 }
 
+static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
+{
+	struct xdp_umem *umem = pool->umem;
+
+	/* Reserve tailroom only for zero-copy pools that opted into
+	 * multi-buffer. The reserved area is used for skb_shared_info,
+	 * matching the XDP core's xdp_data_hard_end() layout.
+	 */
+	if (umem->zc && (umem->flags & XDP_UMEM_SG_FLAG))
+		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	return 0;
+}
+
 static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
 {
 	return pool->chunk_size;
@@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
 
 static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
 {
-	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
+	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
+	       xsk_pool_get_tailroom(pool);
 }
 
 static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
@@ -332,6 +346,11 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
 	return 0;
 }
 
+static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
+{
+	return 0;
+}
+
 static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
 {
 	return 0;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..2cfc19e363e3 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -213,6 +213,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	bpf.command = XDP_SETUP_XSK_POOL;
 	bpf.xsk.pool = pool;
 	bpf.xsk.queue_id = queue_id;
+	pool->umem->zc = true;
 
 	netdev_ops_assert_locked(netdev);
 	err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
@@ -224,13 +225,13 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 		err = -EINVAL;
 		goto err_unreg_xsk;
 	}
-	pool->umem->zc = true;
 	pool->xdp_zc_max_segs = netdev->xdp_zc_max_segs;
 	return 0;
 
 err_unreg_xsk:
 	xp_disable_drv_zc(pool);
 err_unreg_pool:
+	pool->umem->zc = false;
 	if (!force_zc)
 		err = 0; /* fallback to copy mode */
 	if (err) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
  2026-03-16 17:45 ` [PATCH net 1/6] xsk: respect tailroom for ZC setups Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  2026-03-17  9:21   ` Björn Töpel
  2026-03-16 17:45 ` [PATCH net 3/6] i40e: " Maciej Fijalkowski
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

When programming dbuf in the Rx queue context, avoid rounding the
division up, as it actually corrupts the tailroom for AF_XDP ZC. Below
is an example based on a 4k chunk size when the xsk pool pointer is
valid on a given Rx ring:

chunk_size = 4096
headroom = 256
tailroom = 320

ring->rx_buf_len = 4096 - 256 - 320 = 3520

rx_ctx.dbuff = DIV_ROUND_UP(3520, 128) ->
3520 / 128 = 27.5 -> round up results in 28

The dbuf programming unit is 128 bytes, so 28 units tell HW it may use
128 * 28 = 3584 bytes, corrupting 64 bytes of the tailroom. Plain
division floors the result and does not exceed the space reserved at
the end.

Also, restore the ::rx_buf_len setting via xsk_pool_get_rx_frame_size(),
as it now respects the tailroom.
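
The rounding difference can be checked with a small standalone snippet
(illustrative only; DIV_ROUND_UP mirrors the kernel macro from
include/linux/math.h, and 128 is the ice/i40e dbuf unit):

```c
/* Kernel-style round-up division, as in include/linux/math.h. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define DBUF_UNIT	128	/* dbuf is programmed in 128-byte units */

/* Buffer length HW is told it may write, with the old (round up) and
 * new (floor) policies respectively.
 */
static unsigned int hw_len_round_up(unsigned int rx_buf_len)
{
	return DIV_ROUND_UP(rx_buf_len, DBUF_UNIT) * DBUF_UNIT;
}

static unsigned int hw_len_floor(unsigned int rx_buf_len)
{
	return (rx_buf_len / DBUF_UNIT) * DBUF_UNIT;
}
```

For rx_buf_len = 3520, rounding up programs 3584 bytes (64 bytes into
the tailroom), while flooring programs 3456 bytes and stays inside the
reserved area.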

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1667f686ff75..930688bd3476 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -498,8 +498,11 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
 	/* Receive Packet Data Buffer Size.
 	 * The Packet Data Buffer Size is defined in 128 byte units.
 	 */
-	rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
-				     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
+	if (ring->xsk_pool)
+		rlan_ctx.dbuf = ring->rx_buf_len / BIT_ULL(ICE_RLAN_CTX_DBUF_S);
+	else
+		rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
+					     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
 
 	/* use 32 byte descriptors */
 	rlan_ctx.dsize = 1;
@@ -673,6 +676,9 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 		if (ring->xsk_pool) {
 			u32 frag_size =
 				xsk_pool_get_rx_frag_step(ring->xsk_pool);
+
+			ring->rx_buf_len =
+				xsk_pool_get_rx_frame_size(ring->xsk_pool);
 			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
 						 ring->q_index,
 						 ring->q_vector->napi.napi_id,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net 3/6] i40e: do not round up result of dbuff calculation for xsk pool
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
  2026-03-16 17:45 ` [PATCH net 1/6] xsk: respect tailroom for ZC setups Maciej Fijalkowski
  2026-03-16 17:45 ` [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  2026-03-17  9:21   ` Björn Töpel
  2026-03-16 17:45 ` [PATCH net 4/6] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

When programming dbuf in the Rx queue context, avoid rounding the
division up, as it actually corrupts the tailroom for AF_XDP ZC. Below
is an example based on a 4k chunk size when the xsk pool pointer is
valid on a given Rx ring:

chunk_size = 4096
headroom = 256
tailroom = 320

ring->rx_buf_len = 4096 - 256 - 320 = 3520

rx_ctx.dbuff = DIV_ROUND_UP(3520, 128) ->
3520 / 128 = 27.5 -> round up results in 28

The dbuf programming unit is 128 bytes, so 28 units tell HW it may use
128 * 28 = 3584 bytes, corrupting 64 bytes of the tailroom. Plain
division floors the result and does not exceed the space reserved at
the end.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 926d001b2150..f37cf6ab4461 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3621,9 +3621,11 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 skip:
 	xdp_init_buff(&ring->xdp, xdp_frame_sz, &ring->xdp_rxq);
 
-	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
-				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
-
+	if (ring->xsk_pool)
+		rx_ctx.dbuff = ring->rx_buf_len / BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT);
+	else
+		rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
+					    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
 	rx_ctx.base = (ring->dma / 128);
 	rx_ctx.qlen = ring->count;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net 4/6] xsk: validate MTU against usable frame size on bind
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2026-03-16 17:45 ` [PATCH net 3/6] i40e: " Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  2026-03-17  9:30   ` Björn Töpel
  2026-03-18 16:46   ` Alexander Lobakin
  2026-03-16 17:45 ` [PATCH net 5/6] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
  2026-03-16 17:45 ` [PATCH net 6/6] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
  5 siblings, 2 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

AF_XDP bind currently accepts zero-copy pool configurations without
verifying that the device MTU fits into the usable frame space provided
by the UMEM chunk.

This becomes a problem now that we respect tailroom, which is
subtracted from chunk_size (along with headroom). A 2k chunk size might
not provide enough space for the standard 1500 MTU, so let us catch
such settings at bind time.

This prevents creating an already-invalid setup and complements the
MTU change restriction for devices with an attached XSK pool.
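
The bind-time check boils down to simple arithmetic; a hedged sketch
(1514 = 1500-byte MTU + 14-byte Ethernet header; the 256/320
headroom/tailroom values are assumptions matching the earlier patches
in this series for a typical 64-bit build):

```c
/* Model of the bind-time validation: the device MTU plus the
 * link-layer header must fit into the usable frame size. The 256-byte
 * headroom and 320-byte tailroom are assumed values, not taken from
 * kernel headers.
 */
static int xsk_frame_fits_mtu(unsigned int chunk_size, unsigned int mtu,
			      unsigned int hard_header_len)
{
	unsigned int usable = chunk_size - 256 - 320;

	return mtu + hard_header_len <= usable;
}
```

A 2k chunk leaves 2048 - 256 - 320 = 1472 usable bytes, which cannot
hold a 1514-byte frame, so such a bind is rejected; a 4k chunk (3520
usable bytes) passes.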

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk_buff_pool.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 2cfc19e363e3..0a6d76df97dc 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 int xp_assign_dev(struct xsk_buff_pool *pool,
 		  struct net_device *netdev, u16 queue_id, u16 flags)
 {
+	bool mbuf = flags & XDP_USE_SG;
 	bool force_zc, force_copy;
 	struct netdev_bpf bpf;
 	int err = 0;
@@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		return err;
 
-	if (flags & XDP_USE_SG)
+	if (mbuf)
 		pool->umem->flags |= XDP_UMEM_SG_FLAG;
 
 	if (flags & XDP_USE_NEED_WAKEUP)
@@ -200,11 +201,19 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 		goto err_unreg_pool;
 	}
 
-	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
+	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
 		err = -EOPNOTSUPP;
 		goto err_unreg_pool;
 	}
 
+	if (!mbuf) {
+		if (netdev->mtu + netdev->hard_header_len >
+				xsk_pool_get_rx_frame_size(pool)) {
+			err = -EOPNOTSUPP;
+			goto err_unreg_pool;
+		}
+	}
+
 	if (dev_get_min_mp_channel_count(netdev)) {
 		err = -EBUSY;
 		goto err_unreg_pool;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net 5/6] selftests: bpf: fix pkt grow tests
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2026-03-16 17:45 ` [PATCH net 4/6] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  2026-03-17  9:27   ` Björn Töpel
  2026-03-16 17:45 ` [PATCH net 6/6] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
  5 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Skip the tail adjust tests in xskxceiver for SKB mode, as that mode is
not very friendly to them. The multi-buffer case does not work because
the xdp_rxq_info registered for generic XDP does not report
::frag_size. The non-mbuf path copies the packet via
skb_pp_cow_data(), which only accounts for headroom, leaving us with no
tailroom and therefore causing the underlying XDP prog to drop packets.

For the multi-buffer test on the other modes, change the number of
bytes we grow by: assume the worst-case scenario and account for both
headroom and tailroom.
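
The worst-case grow budget used in the diff below can be computed
directly. The constants are assumptions mirroring the xskxceiver
defaults (XSK_UMEM__MAX_FRAME_SIZE = 4096, XSK_UMEM__LARGE_FRAME_SIZE =
3072), plus the 256/320 headroom/tailroom values from this series:

```c
/* Worst-case tail-grow budget for the mbuf test: each desc carries a
 * 3k payload inside a 4k buffer, minus 256 bytes of headroom and 320
 * bytes of tailroom. Constants are assumptions mirroring test_xsk.c,
 * not taken from the selftest headers.
 */
#define XSK_UMEM__MAX_FRAME_SIZE	4096
#define XSK_UMEM__LARGE_FRAME_SIZE	3072

static unsigned int mb_grow_size(void)
{
	return XSK_UMEM__MAX_FRAME_SIZE - XSK_UMEM__LARGE_FRAME_SIZE -
	       256 - 320;
}
```

That leaves 448 bytes of growth that is safe even when the setup works
on 3k buffers.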

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 21 ++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7e38ec6e656b..a496f1784b66 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2551,16 +2551,31 @@ int testapp_adjust_tail_shrink_mb(struct test_spec *test)
 
 int testapp_adjust_tail_grow(struct test_spec *test)
 {
+	if (test->mode == TEST_MODE_SKB)
+		return TEST_SKIP;
+
 	/* Grow by 4 bytes for testing purpose */
 	return testapp_adjust_tail(test, 4, MIN_PKT_SIZE * 2);
 }
 
 int testapp_adjust_tail_grow_mb(struct test_spec *test)
 {
+	u32 grow_size;
+
+	if (test->mode == TEST_MODE_SKB)
+		return TEST_SKIP;
+
+	/* worst case scenario is when underlying setup will work on 3k
+	 * buffers, let us account for it; given that we will use 6k as
+	 * pkt_len, expect that it will be broken down to 2 descs each
+	 * with 3k payload;
+	 *
+	 * 4k is truesize, 3k payload, 256 HR, 320 TR;
+	 */
+	grow_size = XSK_UMEM__MAX_FRAME_SIZE - XSK_UMEM__LARGE_FRAME_SIZE - 256 - 320;
 	test->mtu = MAX_ETH_JUMBO_SIZE;
-	/* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
-	return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
-				   XSK_UMEM__LARGE_FRAME_SIZE * 2);
+
+	return testapp_adjust_tail(test, grow_size, XSK_UMEM__LARGE_FRAME_SIZE * 2);
 }
 
 int testapp_tx_queue_consumer(struct test_spec *test)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net 6/6] selftests: bpf: have a separate variable for drop test
  2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2026-03-16 17:45 ` [PATCH net 5/6] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
@ 2026-03-16 17:45 ` Maciej Fijalkowski
  5 siblings, 0 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-16 17:45 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Currently two different XDP programs share a static variable for
different purposes (picking where to redirect in the shared umem test
and deciding whether to drop a packet). This can be a problem when
running the full test suite - idx can be written by the shared umem
test and its value can then cause false behavior in the XDP drop-half
test.

Introduce a dedicated variable for the drop-half test so that the two
don't step on each other's toes.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 tools/testing/selftests/bpf/progs/xsk_xdp_progs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c b/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
index 683306db8594..40609d8c2471 100644
--- a/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
+++ b/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
@@ -15,6 +15,7 @@ struct {
 	__uint(value_size, sizeof(int));
 } xsk SEC(".maps");
 
+static unsigned int drop_idx;
 static unsigned int idx;
 int adjust_value = 0;
 int count = 0;
@@ -27,7 +28,7 @@ SEC("xdp.frags") int xsk_def_prog(struct xdp_md *xdp)
 SEC("xdp.frags") int xsk_xdp_drop(struct xdp_md *xdp)
 {
 	/* Drop every other packet */
-	if (idx++ % 2)
+	if (drop_idx++ % 2)
 		return XDP_DROP;
 
 	return bpf_redirect_map(&xsk, 0, XDP_DROP);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net 1/6] xsk: respect tailroom for ZC setups
  2026-03-16 17:45 ` [PATCH net 1/6] xsk: respect tailroom for ZC setups Maciej Fijalkowski
@ 2026-03-16 22:53   ` Stanislav Fomichev
  2026-03-17  9:19     ` Björn Töpel
  0 siblings, 1 reply; 17+ messages in thread
From: Stanislav Fomichev @ 2026-03-16 22:53 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, kuba, pabeni, horms, larysa.zaremba,
	aleksander.lobakin

On 03/16, Maciej Fijalkowski wrote:
> Multi-buffer XDP stores information about frags in skb_shared_info that
> sits at the tailroom of a packet. The storage space is reserved via
> xdp_data_hard_end():
> 
> 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
> 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> 
> and then we refer to it via macro below:
> 
> static inline struct skb_shared_info *
> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
> {
>         return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> }
> 
> Currently we do not respect this tailroom space in multi-buffer AF_XDP
> ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
> it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
> configure length of HW Rx buffer.
> 
> xsk_pool_get_tailroom() is only reserving necessary space when pool is
> zc and underlying netdev supports zc multi-buffer. Since this function
> relies on pool->umem->zc setting, set it before ndo_bpf during zc
> configuration, so that driver that actually calls
> xsk_pool_get_rx_frame_size() inside ndo_bpf will get correct tailroom
> value.
> 
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  include/net/xdp_sock_drv.h | 21 ++++++++++++++++++++-
>  net/xdp/xsk_buff_pool.c    |  3 ++-
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 6b9ebae2dc95..13b2aae00737 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>  	return XDP_PACKET_HEADROOM + pool->headroom;
>  }
>  
> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
> +{
> +	struct xdp_umem *umem = pool->umem;
> +
> +	/* Reserve tailroom only for zero-copy pools that opted into
> +	 * multi-buffer. The reserved area is used for skb_shared_info,
> +	 * matching the XDP core's xdp_data_hard_end() layout.
> +	 */
> +	if (umem->zc && (umem->flags & XDP_UMEM_SG_FLAG))
> +		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	return 0;
> +}
> +
>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>  {
>  	return pool->chunk_size;
> @@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>  
>  static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
>  {
> -	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
> +	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
> +	       xsk_pool_get_tailroom(pool);
>  }
>  
>  static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
> @@ -332,6 +346,11 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>  	return 0;
>  }

[..]
 
> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
> +{
> +	return 0;
> +}

Not sure it's needed? xsk_pool_get_tailroom is only used by
CONFIG_XDP_SOCKETS' version of xsk_pool_get_rx_frame_size.

>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>  {
>  	return 0;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 37b7a68b89b3..2cfc19e363e3 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -213,6 +213,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	bpf.command = XDP_SETUP_XSK_POOL;
>  	bpf.xsk.pool = pool;
>  	bpf.xsk.queue_id = queue_id;
> +	pool->umem->zc = true;
>  
>  	netdev_ops_assert_locked(netdev);
>  	err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
> @@ -224,13 +225,13 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		err = -EINVAL;
>  		goto err_unreg_xsk;
>  	}
> -	pool->umem->zc = true;
>  	pool->xdp_zc_max_segs = netdev->xdp_zc_max_segs;
>  	return 0;
>  
>  err_unreg_xsk:
>  	xp_disable_drv_zc(pool);
>  err_unreg_pool:
> +	pool->umem->zc = false;
>  	if (!force_zc)
>  		err = 0; /* fallback to copy mode */
>  	if (err) {

I'm not super familiar with the shared umem patch, but is it safe to
unconditionally undo pool->umem->zc = false here? xp_assign_dev_shared
looks at this umem->zc flag.. Presumably other places do as well on
teardown?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net 1/6] xsk: respect tailroom for ZC setups
  2026-03-16 22:53   ` Stanislav Fomichev
@ 2026-03-17  9:19     ` Björn Töpel
  2026-03-17 11:08       ` Maciej Fijalkowski
  0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2026-03-17  9:19 UTC (permalink / raw)
  To: Stanislav Fomichev, Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, kuba, pabeni, horms, larysa.zaremba,
	aleksander.lobakin

Stanislav Fomichev <stfomichev@gmail.com> writes:

> On 03/16, Maciej Fijalkowski wrote:
>> Multi-buffer XDP stores information about frags in skb_shared_info that
>> sits at the tailroom of a packet. The storage space is reserved via
>> xdp_data_hard_end():
>> 
>> 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
>> 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>> 
>> and then we refer to it via macro below:
>> 
>> static inline struct skb_shared_info *
>> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
>> {
>>         return (struct skb_shared_info *)xdp_data_hard_end(xdp);
>> }
>> 
>> Currently we do not respect this tailroom space in multi-buffer AF_XDP
>> ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
>> it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
>> configure length of HW Rx buffer.
>> 
>> xsk_pool_get_tailroom() is only reserving necessary space when pool is
>> zc and underlying netdev supports zc multi-buffer. Since this function
>> relies on pool->umem->zc setting, set it before ndo_bpf during zc
>> configuration, so that driver that actually calls
>> xsk_pool_get_rx_frame_size() inside ndo_bpf will get correct tailroom
>> value.
>> 
>> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
>> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
>> ---
>>  include/net/xdp_sock_drv.h | 21 ++++++++++++++++++++-
>>  net/xdp/xsk_buff_pool.c    |  3 ++-
>>  2 files changed, 22 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
>> index 6b9ebae2dc95..13b2aae00737 100644
>> --- a/include/net/xdp_sock_drv.h
>> +++ b/include/net/xdp_sock_drv.h
>> @@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>>  	return XDP_PACKET_HEADROOM + pool->headroom;
>>  }
>>  
>> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
>> +{
>> +	struct xdp_umem *umem = pool->umem;
>> +
>> +	/* Reserve tailroom only for zero-copy pools that opted into
>> +	 * multi-buffer. The reserved area is used for skb_shared_info,
>> +	 * matching the XDP core's xdp_data_hard_end() layout.
>> +	 */
>> +	if (umem->zc && (umem->flags & XDP_UMEM_SG_FLAG))
>> +		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	return 0;
>> +}
>> +
>>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>>  {
>>  	return pool->chunk_size;
>> @@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>>  
>>  static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
>>  {
>> -	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
>> +	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
>> +	       xsk_pool_get_tailroom(pool);
>>  }
>>  
>>  static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
>> @@ -332,6 +346,11 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
>>  	return 0;
>>  }
>
> [..]
>  
>> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
>> +{
>> +	return 0;
>> +}
>
> Not sure it's needed? xsk_pool_get_tailroom is only used by
> CONFIG_XDP_SOCKETS' version of xsk_pool_get_rx_frame_size.
>
>>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
>>  {
>>  	return 0;
>> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
>> index 37b7a68b89b3..2cfc19e363e3 100644
>> --- a/net/xdp/xsk_buff_pool.c
>> +++ b/net/xdp/xsk_buff_pool.c
>> @@ -213,6 +213,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>  	bpf.command = XDP_SETUP_XSK_POOL;
>>  	bpf.xsk.pool = pool;
>>  	bpf.xsk.queue_id = queue_id;
>> +	pool->umem->zc = true;
>>  
>>  	netdev_ops_assert_locked(netdev);
>>  	err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
>> @@ -224,13 +225,13 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>  		err = -EINVAL;
>>  		goto err_unreg_xsk;
>>  	}
>> -	pool->umem->zc = true;
>>  	pool->xdp_zc_max_segs = netdev->xdp_zc_max_segs;
>>  	return 0;
>>  
>>  err_unreg_xsk:
>>  	xp_disable_drv_zc(pool);
>>  err_unreg_pool:
>> +	pool->umem->zc = false;
>>  	if (!force_zc)
>>  		err = 0; /* fallback to copy mode */
>>  	if (err) {
>
> I'm not super familiar with the shared umem patch, but is it safe to
> unconditionally undo pool->umem->zc = false here? xp_assign_dev_shared
> looks at this umem->zc flag.. Presumably other places do as well on
> teardown?

Good catch!

I can elaborate a bit; the zero-copy property of the umem is shared
between all users (sockets) of that umem. IOW, all sockets sharing an
umem inherit whatever the first socket negotiated.

So, we could get into something like:

1. Socket A binds queue 0, ndo_bpf OK (umem->zc = true)
2. Socket B binds queue 1 via xp_assign_dev_shared()
   reads umem->zc == true, so flags = XDP_ZEROCOPY
   xp_assign_dev() sets umem->zc = true
   ndo_bpf() NOK for queue 1 -> error path: umem->zc = false (oops)
3. Socket A is still active on queue 0 in ZC mode, but umem->zc is now
   false

...and we'll have a bunch of checks on umem->zc that now has incorrect
state.

From this it follows that the zc flag shouldn't be toggled on a shared
resource without checking whether other consumers exist. I think a
per-pool zc flag is needed here or smth. :-(


Björn

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool
  2026-03-16 17:45 ` [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool Maciej Fijalkowski
@ 2026-03-17  9:21   ` Björn Töpel
  0 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2026-03-17  9:21 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> When programming dbuff on rx queue context, avoid division round up as
> it causes to actually corrupt the tailroom for AF_XDP ZC. Below is an
> example based on 4k chunk size when xsk pool pointer is valid on given
> rx ring:
>
> chunk_size = 4096
> headroom = 256
> tailroom = 320
>
> ring->rx_buf_len = 4096 - 256 - 320 = 3520
>
> rx_ctx.dbuff = DIV_ROUND_UP(3520, 128) ->
> 3520 / 128 = 27.5 -> round up results in 28
>
> dbuff programming unit is 128. If we give 128 * 28 = 3584. So HW will
> corrupt 64 bytes from tailroom. Doing plain division will floor the
> result and would not exceed reserved space at the end.
>
> Also, restore ::rx_buf_len setting via xsk_pool_get_rx_frame_size() as
> of now it respects the tailroom.
>
> Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_base.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 1667f686ff75..930688bd3476 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -498,8 +498,11 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	/* Receive Packet Data Buffer Size.
>  	 * The Packet Data Buffer Size is defined in 128 byte units.
>  	 */
> -	rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
> -				     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
> +	if (ring->xsk_pool)
> +		rlan_ctx.dbuf = ring->rx_buf_len / BIT_ULL(ICE_RLAN_CTX_DBUF_S);
> +	else
> +		rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
> +					     BIT_ULL(ICE_RLAN_CTX_DBUF_S));

nit: Maybe put the BIT_ULL() in a variable (steps/unit?) and use ?: 


* Re: [PATCH net 3/6] i40e: do not round up result of dbuff calculation for xsk pool
  2026-03-16 17:45 ` [PATCH net 3/6] i40e: " Maciej Fijalkowski
@ 2026-03-17  9:21   ` Björn Töpel
  0 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2026-03-17  9:21 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> When programming dbuff in the rx queue context, avoid rounding up the
> division result, as it corrupts the tailroom for AF_XDP ZC. Below is an
> example based on a 4k chunk size when the xsk pool pointer is valid on a
> given rx ring:
>
> chunk_size = 4096
> headroom = 256
> tailroom = 320
>
> ring->rx_buf_len = 4096 - 256 - 320 = 3520
>
> rx_ctx.dbuff = DIV_ROUND_UP(3520, 128) ->
> 3520 / 128 = 27.5 -> round up results in 28
>
> The dbuff programming unit is 128 bytes, so programming 28 gives
> 128 * 28 = 3584 and HW will corrupt 64 bytes of the tailroom. Plain
> division floors the result and does not exceed the space reserved at the
> end.
>
> Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_main.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 926d001b2150..f37cf6ab4461 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -3621,9 +3621,11 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
>  skip:
>  	xdp_init_buff(&ring->xdp, xdp_frame_sz, &ring->xdp_rxq);
>  
> -	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
> -				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
> -
> +	if (ring->xsk_pool)
> +		rx_ctx.dbuff = ring->rx_buf_len / BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT);
> +	else
> +		rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
> +					    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));

Same nit as ice.


* Re: [PATCH net 5/6] selftests: bpf: fix pkt grow tests
  2026-03-16 17:45 ` [PATCH net 5/6] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
@ 2026-03-17  9:27   ` Björn Töpel
  2026-03-17 10:57     ` Maciej Fijalkowski
  0 siblings, 1 reply; 17+ messages in thread
From: Björn Töpel @ 2026-03-17  9:27 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Skip the tail adjust tests in xskxceiver for SKB mode, as it is not very
> friendly to them. The multi-buffer case does not work because the
> xdp_rxq_info registered for generic XDP does not report ::frag_size. The
> non-mbuf path copies the packet via skb_pp_cow_data(), which only
> accounts for headroom, leaving us with no tailroom and therefore causing
> the underlying XDP prog to drop packets.
>
> For the multi-buffer test on other modes, change the number of bytes we
> use for growth: assume the worst-case scenario and account for headroom
> and tailroom.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  .../selftests/bpf/prog_tests/test_xsk.c       | 21 ++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index 7e38ec6e656b..a496f1784b66 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> @@ -2551,16 +2551,31 @@ int testapp_adjust_tail_shrink_mb(struct test_spec *test)
>  
>  int testapp_adjust_tail_grow(struct test_spec *test)
>  {
> +	if (test->mode == TEST_MODE_SKB)
> +		return TEST_SKIP;
> +
>  	/* Grow by 4 bytes for testing purpose */
>  	return testapp_adjust_tail(test, 4, MIN_PKT_SIZE * 2);
>  }
>  
>  int testapp_adjust_tail_grow_mb(struct test_spec *test)
>  {
> +	u32 grow_size;
> +
> +	if (test->mode == TEST_MODE_SKB)
> +		return TEST_SKIP;
> +
> +	/* worst case scenario is when underlying setup will work on 3k
> +	 * buffers, let us account for it; given that we will use 6k as
> +	 * pkt_len, expect that it will be broken down to 2 descs each
> +	 * with 3k payload;
> +	 *
> +	 * 4k is truesize, 3k payload, 256 HR, 320 TR;
> +	 */
> +	grow_size = XSK_UMEM__MAX_FRAME_SIZE - XSK_UMEM__LARGE_FRAME_SIZE - 256 - 320;

nit: Use defines, XDP_PACKET_HEADROOM, and maybe
SKB_DATA_ALIGN(sizeof(struct skb_shared_info))?


* Re: [PATCH net 4/6] xsk: validate MTU against usable frame size on bind
  2026-03-16 17:45 ` [PATCH net 4/6] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
@ 2026-03-17  9:30   ` Björn Töpel
  2026-03-18 16:46   ` Alexander Lobakin
  1 sibling, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2026-03-17  9:30 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> AF_XDP bind currently accepts zero-copy pool configurations without
> verifying that the device MTU fits into the usable frame space provided
> by the UMEM chunk.
>
> This becomes a problem since we started to respect the tailroom that is
> subtracted from chunk_size (along with the headroom). A 2k chunk size
> might not provide enough space for the standard 1500 MTU, so let us
> catch such settings at bind time.
>
> This prevents creating an already-invalid setup and complements the
> MTU change restriction for devices with an attached XSK pool.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  net/xdp/xsk_buff_pool.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 2cfc19e363e3..0a6d76df97dc 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  int xp_assign_dev(struct xsk_buff_pool *pool,
>  		  struct net_device *netdev, u16 queue_id, u16 flags)
>  {
> +	bool mbuf = flags & XDP_USE_SG;
>  	bool force_zc, force_copy;
>  	struct netdev_bpf bpf;
>  	int err = 0;
> @@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (err)
>  		return err;
>  
> -	if (flags & XDP_USE_SG)
> +	if (mbuf)
>  		pool->umem->flags |= XDP_UMEM_SG_FLAG;
>  
>  	if (flags & XDP_USE_NEED_WAKEUP)
> @@ -200,11 +201,19 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		goto err_unreg_pool;
>  	}
>  
> -	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> +	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
>  		err = -EOPNOTSUPP;
>  		goto err_unreg_pool;
>  	}
>  
> +	if (!mbuf) {
> +		if (netdev->mtu + netdev->hard_header_len >
> +				xsk_pool_get_rx_frame_size(pool)) {
> +			err = -EOPNOTSUPP;

Hmm, EINVAL?


* Re: [PATCH net 5/6] selftests: bpf: fix pkt grow tests
  2026-03-17  9:27   ` Björn Töpel
@ 2026-03-17 10:57     ` Maciej Fijalkowski
  2026-03-17 12:13       ` Björn Töpel
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-17 10:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin

On Tue, Mar 17, 2026 at 10:27:11AM +0100, Björn Töpel wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> 
> > Skip the tail adjust tests in xskxceiver for SKB mode, as it is not very
> > friendly to them. The multi-buffer case does not work because the
> > xdp_rxq_info registered for generic XDP does not report ::frag_size. The
> > non-mbuf path copies the packet via skb_pp_cow_data(), which only
> > accounts for headroom, leaving us with no tailroom and therefore causing
> > the underlying XDP prog to drop packets.
> >
> > For the multi-buffer test on other modes, change the number of bytes we
> > use for growth: assume the worst-case scenario and account for headroom
> > and tailroom.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  .../selftests/bpf/prog_tests/test_xsk.c       | 21 ++++++++++++++++---
> >  1 file changed, 18 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > index 7e38ec6e656b..a496f1784b66 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > @@ -2551,16 +2551,31 @@ int testapp_adjust_tail_shrink_mb(struct test_spec *test)
> >  
> >  int testapp_adjust_tail_grow(struct test_spec *test)
> >  {
> > +	if (test->mode == TEST_MODE_SKB)
> > +		return TEST_SKIP;
> > +
> >  	/* Grow by 4 bytes for testing purpose */
> >  	return testapp_adjust_tail(test, 4, MIN_PKT_SIZE * 2);
> >  }
> >  
> >  int testapp_adjust_tail_grow_mb(struct test_spec *test)
> >  {
> > +	u32 grow_size;
> > +
> > +	if (test->mode == TEST_MODE_SKB)
> > +		return TEST_SKIP;
> > +
> > +	/* worst case scenario is when underlying setup will work on 3k
> > +	 * buffers, let us account for it; given that we will use 6k as
> > +	 * pkt_len, expect that it will be broken down to 2 descs each
> > +	 * with 3k payload;
> > +	 *
> > +	 * 4k is truesize, 3k payload, 256 HR, 320 TR;
> > +	 */
> > +	grow_size = XSK_UMEM__MAX_FRAME_SIZE - XSK_UMEM__LARGE_FRAME_SIZE - 256 - 320;
> 
> nit: Use defines, XDP_PACKET_HEADROOM, and maybe
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info))?

Umm right, but the latter would have to be exposed via UAPI?


* Re: [PATCH net 1/6] xsk: respect tailroom for ZC setups
  2026-03-17  9:19     ` Björn Töpel
@ 2026-03-17 11:08       ` Maciej Fijalkowski
  0 siblings, 0 replies; 17+ messages in thread
From: Maciej Fijalkowski @ 2026-03-17 11:08 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Stanislav Fomichev, netdev, bpf, magnus.karlsson, kuba, pabeni,
	horms, larysa.zaremba, aleksander.lobakin

On Tue, Mar 17, 2026 at 10:19:25AM +0100, Björn Töpel wrote:
> Stanislav Fomichev <stfomichev@gmail.com> writes:
> 
> > On 03/16, Maciej Fijalkowski wrote:
> >> Multi-buffer XDP stores information about frags in skb_shared_info that
> >> sits at the tailroom of a packet. The storage space is reserved via
> >> xdp_data_hard_end():
> >> 
> >> 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
> >> 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> >> 
> >> and then we refer to it via macro below:
> >> 
> >> static inline struct skb_shared_info *
> >> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
> >> {
> >>         return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> >> }
> >> 
> >> Currently we do not respect this tailroom space in multi-buffer AF_XDP
> >> ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
> >> it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
> >> configure length of HW Rx buffer.
> >> 
> >> xsk_pool_get_tailroom() is only reserving necessary space when pool is
> >> zc and underlying netdev supports zc multi-buffer. Since this function
> >> relies on pool->umem->zc setting, set it before ndo_bpf during zc
> >> configuration, so that driver that actually calls
> >> xsk_pool_get_rx_frame_size() inside ndo_bpf will get correct tailroom
> >> value.
> >> 
> >> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> >> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> >> ---
> >>  include/net/xdp_sock_drv.h | 21 ++++++++++++++++++++-
> >>  net/xdp/xsk_buff_pool.c    |  3 ++-
> >>  2 files changed, 22 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> >> index 6b9ebae2dc95..13b2aae00737 100644
> >> --- a/include/net/xdp_sock_drv.h
> >> +++ b/include/net/xdp_sock_drv.h
> >> @@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
> >>  	return XDP_PACKET_HEADROOM + pool->headroom;
> >>  }
> >>  
> >> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
> >> +{
> >> +	struct xdp_umem *umem = pool->umem;
> >> +
> >> +	/* Reserve tailroom only for zero-copy pools that opted into
> >> +	 * multi-buffer. The reserved area is used for skb_shared_info,
> >> +	 * matching the XDP core's xdp_data_hard_end() layout.
> >> +	 */
> >> +	if (umem->zc && (umem->flags & XDP_UMEM_SG_FLAG))
> >> +		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >> +	return 0;
> >> +}
> >> +
> >>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
> >>  {
> >>  	return pool->chunk_size;
> >> @@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
> >>  
> >>  static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
> >>  {
> >> -	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
> >> +	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
> >> +	       xsk_pool_get_tailroom(pool);
> >>  }
> >>  
> >>  static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
> >> @@ -332,6 +346,11 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
> >>  	return 0;
> >>  }
> >
> > [..]
> >  
> >> +static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
> >> +{
> >> +	return 0;
> >> +}
> >
> > Not sure it's needed? xsk_pool_get_tailroom is only used by
> > CONFIG_XDP_SOCKETS' version of xsk_pool_get_rx_frame_size.

To Stan - we probably could live without this, right.

> >
> >>  static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
> >>  {
> >>  	return 0;
> >> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> >> index 37b7a68b89b3..2cfc19e363e3 100644
> >> --- a/net/xdp/xsk_buff_pool.c
> >> +++ b/net/xdp/xsk_buff_pool.c
> >> @@ -213,6 +213,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >>  	bpf.command = XDP_SETUP_XSK_POOL;
> >>  	bpf.xsk.pool = pool;
> >>  	bpf.xsk.queue_id = queue_id;
> >> +	pool->umem->zc = true;
> >>  
> >>  	netdev_ops_assert_locked(netdev);
> >>  	err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
> >> @@ -224,13 +225,13 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >>  		err = -EINVAL;
> >>  		goto err_unreg_xsk;
> >>  	}
> >> -	pool->umem->zc = true;
> >>  	pool->xdp_zc_max_segs = netdev->xdp_zc_max_segs;
> >>  	return 0;
> >>  
> >>  err_unreg_xsk:
> >>  	xp_disable_drv_zc(pool);
> >>  err_unreg_pool:
> >> +	pool->umem->zc = false;
> >>  	if (!force_zc)
> >>  		err = 0; /* fallback to copy mode */
> >>  	if (err) {
> >
> > I'm not super familiar with the shared umem patch, but is it safe to
> > unconditionally undo pool->umem->zc = false here? xp_assign_dev_shared
> > looks at this umem->zc flag.. Presumably other places do as well on
> > teardown?
> 
> Good catch!
> 
> I can elaborate a bit; the zero-copy property of umem is shared between
> all users (sockets) of that umem. IOW, all sockets sharing an umem
> inherit whatever the first socket negotiated.
> 
> So, we could get into something like:
> 
> 1. Socket A binds queue 0, ndo_bpf OK (umem->zc = true)
> 2. Socket B binds queue 1 via xp_assign_dev_shared()
>    reads umem->zc == true, so flags = XDP_ZEROCOPY
>    xp_assign_dev() sets umem->zc = true
>    ndo_bpf() NOK for queue 1 -> error path: umem->zc = false (oops)
> 3. Socket A is still active on queue 0 in ZC mode, but umem->zc is now
>    false
> 
> ...and we'll have a bunch of checks on umem->zc that now has incorrect
> state.
> 
> From this follows that the zc flag shouldn't be toggled on a shared
> resource without checking if other consumers exist. I think a per-pool
> zc flag is needed here or smth. :-(

Larysa suggested using the pool->dev ptr instead of touching umem->zc, let
me see if that will be sufficient. Besides, that is currently broken, as
you guys are saying!

> 
> 
> Björn


* Re: [PATCH net 5/6] selftests: bpf: fix pkt grow tests
  2026-03-17 10:57     ` Maciej Fijalkowski
@ 2026-03-17 12:13       ` Björn Töpel
  0 siblings, 0 replies; 17+ messages in thread
From: Björn Töpel @ 2026-03-17 12:13 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin

On Tue, 17 Mar 2026 at 11:57, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Tue, Mar 17, 2026 at 10:27:11AM +0100, Björn Töpel wrote:
> > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> >
> > > Skip the tail adjust tests in xskxceiver for SKB mode, as it is not very
> > > friendly to them. The multi-buffer case does not work because the
> > > xdp_rxq_info registered for generic XDP does not report ::frag_size. The
> > > non-mbuf path copies the packet via skb_pp_cow_data(), which only
> > > accounts for headroom, leaving us with no tailroom and therefore causing
> > > the underlying XDP prog to drop packets.
> > >
> > > For the multi-buffer test on other modes, change the number of bytes we
> > > use for growth: assume the worst-case scenario and account for headroom
> > > and tailroom.
> > >
> > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > > ---
> > >  .../selftests/bpf/prog_tests/test_xsk.c       | 21 ++++++++++++++++---
> > >  1 file changed, 18 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > > index 7e38ec6e656b..a496f1784b66 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> > > @@ -2551,16 +2551,31 @@ int testapp_adjust_tail_shrink_mb(struct test_spec *test)
> > >
> > >  int testapp_adjust_tail_grow(struct test_spec *test)
> > >  {
> > > +   if (test->mode == TEST_MODE_SKB)
> > > +           return TEST_SKIP;
> > > +
> > >     /* Grow by 4 bytes for testing purpose */
> > >     return testapp_adjust_tail(test, 4, MIN_PKT_SIZE * 2);
> > >  }
> > >
> > >  int testapp_adjust_tail_grow_mb(struct test_spec *test)
> > >  {
> > > +   u32 grow_size;
> > > +
> > > +   if (test->mode == TEST_MODE_SKB)
> > > +           return TEST_SKIP;
> > > +
> > > +   /* worst case scenario is when underlying setup will work on 3k
> > > +    * buffers, let us account for it; given that we will use 6k as
> > > +    * pkt_len, expect that it will be broken down to 2 descs each
> > > +    * with 3k payload;
> > > +    *
> > > +    * 4k is truesize, 3k payload, 256 HR, 320 TR;
> > > +    */
> > > +   grow_size = XSK_UMEM__MAX_FRAME_SIZE - XSK_UMEM__LARGE_FRAME_SIZE - 256 - 320;
> >
> > nit: Use defines, XDP_PACKET_HEADROOM, and maybe
> > SKB_DATA_ALIGN(sizeof(struct skb_shared_info))?
>
> umm right, but the latter would have to be exposed via UAPI ?

Ah, good point. Another option would be to document in the comments where
the magic numbers come from.


* Re: [PATCH net 4/6] xsk: validate MTU against usable frame size on bind
  2026-03-16 17:45 ` [PATCH net 4/6] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
  2026-03-17  9:30   ` Björn Töpel
@ 2026-03-18 16:46   ` Alexander Lobakin
  1 sibling, 0 replies; 17+ messages in thread
From: Alexander Lobakin @ 2026-03-18 16:46 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba

From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Date: Mon, 16 Mar 2026 18:45:48 +0100

> AF_XDP bind currently accepts zero-copy pool configurations without
> verifying that the device MTU fits into the usable frame space provided
> by the UMEM chunk.
> 
> This becomes a problem since we started to respect the tailroom that is
> subtracted from chunk_size (along with the headroom). A 2k chunk size
> might not provide enough space for the standard 1500 MTU, so let us
> catch such settings at bind time.
> 
> This prevents creating an already-invalid setup and complements the
> MTU change restriction for devices with an attached XSK pool.
> 
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  net/xdp/xsk_buff_pool.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 2cfc19e363e3..0a6d76df97dc 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  int xp_assign_dev(struct xsk_buff_pool *pool,
>  		  struct net_device *netdev, u16 queue_id, u16 flags)
>  {
> +	bool mbuf = flags & XDP_USE_SG;
>  	bool force_zc, force_copy;
>  	struct netdev_bpf bpf;
>  	int err = 0;
> @@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (err)
>  		return err;
>  
> -	if (flags & XDP_USE_SG)
> +	if (mbuf)
>  		pool->umem->flags |= XDP_UMEM_SG_FLAG;
>  
>  	if (flags & XDP_USE_NEED_WAKEUP)
> @@ -200,11 +201,19 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		goto err_unreg_pool;
>  	}
>  
> -	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> +	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
>  		err = -EOPNOTSUPP;
>  		goto err_unreg_pool;
>  	}
>  
> +	if (!mbuf) {
> +		if (netdev->mtu + netdev->hard_header_len >
> +				xsk_pool_get_rx_frame_size(pool)) {

Misses a corner case with VLAN(s).
hard_header_len is ETH_HLEN for Ethernet devices.

Should be something like

		if (READ_ONCE(netdev->mtu) + ETH_HLEN + 2 * VLAN_HLEN >
		    xsk_pool_get_rx_frame_size(pool))

(note that you also need READ_ONCE(netdev->mtu))

There's also a netdev feature which disables CRC stripping, so
ETH_FCS_LEN could also be included. But the feature is not popular,
so that one is up to you.

> +			err = -EOPNOTSUPP;
> +			goto err_unreg_pool;
> +		}
> +	}
> +
>  	if (dev_get_min_mp_channel_count(netdev)) {
>  		err = -EBUSY;
>  		goto err_unreg_pool;

Thanks,
Olek


end of thread, other threads:[~2026-03-18 16:47 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-16 17:45 [PATCH net 0/6] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
2026-03-16 17:45 ` [PATCH net 1/6] xsk: respect tailroom for ZC setups Maciej Fijalkowski
2026-03-16 22:53   ` Stanislav Fomichev
2026-03-17  9:19     ` Björn Töpel
2026-03-17 11:08       ` Maciej Fijalkowski
2026-03-16 17:45 ` [PATCH net 2/6] ice: do not round up result of dbuff calculation for xsk pool Maciej Fijalkowski
2026-03-17  9:21   ` Björn Töpel
2026-03-16 17:45 ` [PATCH net 3/6] i40e: " Maciej Fijalkowski
2026-03-17  9:21   ` Björn Töpel
2026-03-16 17:45 ` [PATCH net 4/6] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
2026-03-17  9:30   ` Björn Töpel
2026-03-18 16:46   ` Alexander Lobakin
2026-03-16 17:45 ` [PATCH net 5/6] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
2026-03-17  9:27   ` Björn Töpel
2026-03-17 10:57     ` Maciej Fijalkowski
2026-03-17 12:13       ` Björn Töpel
2026-03-16 17:45 ` [PATCH net 6/6] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox