public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation
@ 2026-03-19 17:55 Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

v1->v2:
- remove xsk_pool_get_tailroom() definition for !CONFIG_XDP_SOCKETS
  (Stan)
- do not rely on pool->umem->zc when configuring tailroom (Stan, Bjorn)
- simplify dbuff setting in ZC drivers (Bjorn)
- use defines for {head,tail}room in tests (Bjorn)
- return EINVAL instead of EOPNOTSUPP when mtu setting is wrong (Bjorn)
- include vlan headers and fcs length when validating mtu (Olek)
- tighten umem headroom validation when registering umem (Sashiko AI)
- set XDP_USE_SG in xp_assign_dev_shared() (Sashiko AI)
- adjust rx dropped xskxceiver test

Hi,

here we fix a long-standing issue with the multi-buffer scenario in ZC
mode - we have not been reserving space at the end of the buffer where
multi-buffer XDP stores the skb_shared_info. This was brought to our
attention via [0].

Unaligned mode does not get any specific treatment; it is the user's
responsibility to properly handle XSK addresses in queues.

With the two adjustments against xskxceiver included in this set, I have
been able to pass the full test suite on ice.

Thanks,
Maciej

[0]: https://community.intel.com/t5/Ethernet-Products/X710-XDP-Packet-Corruption-Issue-DRV-MODE-Zero-Copy-Multi-Buffer/m-p/1724208


Maciej Fijalkowski (8):
  xsk: tighten UMEM headroom validation to account for tailroom and min
    frame
  xsk: respect tailroom for ZC setups
  ice: do not round up result of dbuf calculation for xsk pool
  i40e: do not round up result of dbuff calculation for xsk pool
  xsk: validate MTU against usable frame size on bind
  selftests: bpf: fix pkt grow tests
  selftests: bpf: have a separate variable for drop test
  selftests: bpf: adjust rx_dropped xskxceiver's test to respect
    tailroom

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 ++
 drivers/net/ethernet/intel/ice/ice_base.c     |  5 ++++
 include/net/xdp_sock_drv.h                    | 16 +++++++++-
 net/xdp/xdp_umem.c                            |  3 +-
 net/xdp/xsk_buff_pool.c                       | 16 ++++++++--
 .../selftests/bpf/prog_tests/test_xsk.c       | 29 +++++++++++++++----
 .../selftests/bpf/progs/xsk_xdp_progs.c       |  3 +-
 7 files changed, 64 insertions(+), 10 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:13   ` Björn Töpel
  2026-03-20 15:01   ` Stanislav Fomichev
  2026-03-19 17:55 ` [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups Maciej Fijalkowski
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

The current headroom validation in xdp_umem_reg() could leave us with
insufficient space to receive even a minimum-sized Ethernet frame.
Furthermore, if multi-buffer came into play, the skb_shared_info stored
at the end of the XSK frame would be corrupted.

The multi-buffer setting is known later in the configuration process,
so besides accounting for ETH_ZLEN, let us also take care of the
tailroom space upfront.

Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xdp_umem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 066ce07c506d..50963c079f85 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -203,7 +203,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	if (!unaligned_chunks && chunks_rem)
 		return -EINVAL;
 
-	if (headroom >= chunk_size - XDP_PACKET_HEADROOM)
+	if (headroom >= chunk_size - XDP_PACKET_HEADROOM -
+			SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) - ETH_ZLEN)
 		return -EINVAL;
 
 	if (mr->flags & XDP_UMEM_TX_METADATA_LEN) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:14   ` Björn Töpel
  2026-03-20 15:01   ` Stanislav Fomichev
  2026-03-19 17:55 ` [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool Maciej Fijalkowski
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():

	((xdp)->data_hard_start + (xdp)->frame_sz -	\
	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

and it is then accessed via the helper below:

static inline struct skb_shared_info *
xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
{
        return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}

Currently we do not respect this tailroom space in the multi-buffer
AF_XDP ZC scenario. To address this, introduce xsk_pool_get_tailroom()
and use it within xsk_pool_get_rx_frame_size(), which ZC drivers use to
configure the length of the HW Rx buffer.

xsk_pool_get_tailroom() reserves the necessary space only when the pool
is bound to a device and the UMEM opted into multi-buffer; rely on the
pool->dev state when configuring the tailroom. Within ndo_bpf,
xsk_pool_get_rx_frame_size() is usually called when bringing up the
queues and before the xsk DMA mappings have been configured, which makes
it valid to rely on pool->dev.

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xdp_sock_drv.h | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 6b9ebae2dc95..bef4e1b91034 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -41,6 +41,19 @@ static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool)
 	return XDP_PACKET_HEADROOM + pool->headroom;
 }
 
+static inline u32 xsk_pool_get_tailroom(struct xsk_buff_pool *pool)
+{
+	struct xdp_umem *umem = pool->umem;
+
+	/* Reserve tailroom only for zero-copy pools that opted into
+	 * multi-buffer. The reserved area is used for skb_shared_info,
+	 * matching the XDP core's xdp_data_hard_end() layout.
+	 */
+	if (pool->dev && (umem->flags & XDP_UMEM_SG_FLAG))
+		return SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	return 0;
+}
+
 static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
 {
 	return pool->chunk_size;
@@ -48,7 +61,8 @@ static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool)
 
 static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool)
 {
-	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool);
+	return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool) -
+	       xsk_pool_get_tailroom(pool);
 }
 
 static inline u32 xsk_pool_get_rx_frag_step(struct xsk_buff_pool *pool)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:18   ` Björn Töpel
  2026-03-19 17:55 ` [PATCH v2 net 4/8] i40e: do not round up result of dbuff " Maciej Fijalkowski
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

When programming dbuf in the Rx queue context, avoid rounding up the
result of the division, as it actually corrupts the tailroom for AF_XDP
ZC. Below is an example based on a 4k chunk size when the xsk pool
pointer is valid on a given Rx ring:

chunk_size = 4096
headroom = 256
tailroom = 320

ring->rx_buf_len = 4096 - 256 - 320 = 3520

rx_ctx.dbuf = DIV_ROUND_UP(3520, 128) ->
3520 / 128 = 27.5 -> round up results in 28

The dbuf programming unit is 128 bytes, so a value of 28 gives
128 * 28 = 3584 and HW will corrupt 64 bytes of the tailroom. Decrement
dbuf by 1 when an xsk_pool is present on the given ice_rx_ring.

Also, restore the ::rx_buf_len setting via xsk_pool_get_rx_frame_size(),
as it now respects the tailroom.

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 1667f686ff75..f9514d7bb83c 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -500,6 +500,8 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
 	 */
 	rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
 				     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
+	if (ring->xsk_pool)
+		rlan_ctx.dbuf--;
 
 	/* use 32 byte descriptors */
 	rlan_ctx.dsize = 1;
@@ -673,6 +675,9 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 		if (ring->xsk_pool) {
 			u32 frag_size =
 				xsk_pool_get_rx_frag_step(ring->xsk_pool);
+
+			ring->rx_buf_len =
+				xsk_pool_get_rx_frame_size(ring->xsk_pool);
 			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
 						 ring->q_index,
 						 ring->q_vector->napi.napi_id,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 4/8] i40e: do not round up result of dbuff calculation for xsk pool
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2026-03-19 17:55 ` [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

When programming dbuff in the Rx queue context, avoid rounding up the
result of the division, as it actually corrupts the tailroom for AF_XDP
ZC. Below is an example based on a 4k chunk size when the xsk pool
pointer is valid on a given Rx ring:

chunk_size = 4096
headroom = 256
tailroom = 320

ring->rx_buf_len = 4096 - 256 - 320 = 3520

rx_ctx.dbuff = DIV_ROUND_UP(3520, 128) ->
3520 / 128 = 27.5 -> round up results in 28

The dbuff programming unit is 128 bytes, so a value of 28 gives
128 * 28 = 3584 and HW will corrupt 64 bytes of the tailroom. Decrement
the dbuff value when an xsk_pool is present on i40e's Rx ring.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 926d001b2150..9fd55145eeee 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3623,6 +3623,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
+	if (ring->xsk_pool)
+		rx_ctx.dbuff--;
 
 	rx_ctx.base = (ring->dma / 128);
 	rx_ctx.qlen = ring->count;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2026-03-19 17:55 ` [PATCH v2 net 4/8] i40e: do not round up result of dbuff " Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:38   ` Björn Töpel
  2026-03-19 17:55 ` [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

AF_XDP bind currently accepts zero-copy pool configurations without
verifying that the device MTU fits into the usable frame space provided
by the UMEM chunk.

This becomes a problem now that we respect the tailroom, which is
subtracted from chunk_size (along with the headroom). A 2k chunk size
might not provide enough space for the standard 1500-byte MTU, so let
us catch such settings at bind time.

This prevents creating an already-invalid setup and complements the
MTU change restriction for devices with an attached XSK pool.

Currently xp_assign_dev_shared() does not propagate XDP_USE_SG to
flags, so set it there in order to preserve the MTU check, which is
supposed to be done only when no multi-buffer setup is in the picture.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk_buff_pool.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 37b7a68b89b3..e9377b05118b 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 int xp_assign_dev(struct xsk_buff_pool *pool,
 		  struct net_device *netdev, u16 queue_id, u16 flags)
 {
+	bool mbuf = flags & XDP_USE_SG;
 	bool force_zc, force_copy;
 	struct netdev_bpf bpf;
 	int err = 0;
@@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		return err;
 
-	if (flags & XDP_USE_SG)
+	if (mbuf)
 		pool->umem->flags |= XDP_UMEM_SG_FLAG;
 
 	if (flags & XDP_USE_NEED_WAKEUP)
@@ -200,10 +201,18 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 		goto err_unreg_pool;
 	}
 
-	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
+	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
 		err = -EOPNOTSUPP;
 		goto err_unreg_pool;
 	}
+#define ETH_PAD_LEN (ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)
+	if (!mbuf) {
+		if (READ_ONCE(netdev->mtu) + ETH_PAD_LEN >
+		    xsk_pool_get_rx_frame_size(pool)) {
+			err = -EINVAL;
+			goto err_unreg_pool;
+		}
+	}
 
 	if (dev_get_min_mp_channel_count(netdev)) {
 		err = -EBUSY;
@@ -247,6 +256,9 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_sock *umem_xs,
 	struct xdp_umem *umem = umem_xs->umem;
 
 	flags = umem->zc ? XDP_ZEROCOPY : XDP_COPY;
+	if (umem->flags & XDP_UMEM_SG_FLAG)
+		flags |= XDP_USE_SG;
+
 	if (umem_xs->pool->uses_need_wakeup)
 		flags |= XDP_USE_NEED_WAKEUP;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2026-03-19 17:55 ` [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:40   ` Björn Töpel
  2026-03-19 17:55 ` [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
  2026-03-19 17:55 ` [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom Maciej Fijalkowski
  7 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

Skip the tail adjust tests in xskxceiver for SKB mode, as that mode is
not very friendly to them. The multi-buffer case does not work because
the xdp_rxq_info registered for generic XDP does not report ::frag_size.
The non-mbuf path copies the packet via skb_pp_cow_data(), which only
accounts for headroom, leaving us with no tailroom and therefore causing
the underlying XDP program to drop packets.

For the multi-buffer test in the other modes, change the number of
bytes used for growth; assume the worst-case scenario and take care of
both headroom and tailroom.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 25 ++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 7e38ec6e656b..95cbbf425e9a 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -35,6 +35,7 @@
 #define XSK_UMEM__INVALID_FRAME_SIZE	(MAX_ETH_JUMBO_SIZE + 1)
 #define XSK_UMEM__LARGE_FRAME_SIZE	(3 * 1024)
 #define XSK_UMEM__MAX_FRAME_SIZE	(4 * 1024)
+#define XSK_UMEM__PACKET_TAILROOM 320
 
 static const u8 g_mac[ETH_ALEN] = {0x55, 0x44, 0x33, 0x22, 0x11, 0x00};
 
@@ -2551,16 +2552,34 @@ int testapp_adjust_tail_shrink_mb(struct test_spec *test)
 
 int testapp_adjust_tail_grow(struct test_spec *test)
 {
+	if (test->mode == TEST_MODE_SKB)
+		return TEST_SKIP;
+
 	/* Grow by 4 bytes for testing purpose */
 	return testapp_adjust_tail(test, 4, MIN_PKT_SIZE * 2);
 }
 
 int testapp_adjust_tail_grow_mb(struct test_spec *test)
 {
+	u32 grow_size;
+
+	if (test->mode == TEST_MODE_SKB)
+		return TEST_SKIP;
+
+	/* worst case scenario is when underlying setup will work on 3k
+	 * buffers, let us account for it; given that we will use 6k as
+	 * pkt_len, expect that it will be broken down to 2 descs each
+	 * with 3k payload;
+	 *
+	 * 4k is truesize, 3k payload, 256 HR, 320 TR;
+	 */
+	grow_size = XSK_UMEM__MAX_FRAME_SIZE -
+		    XSK_UMEM__LARGE_FRAME_SIZE -
+		    XDP_PACKET_HEADROOM -
+		    XSK_UMEM__PACKET_TAILROOM;
 	test->mtu = MAX_ETH_JUMBO_SIZE;
-	/* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
-	return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
-				   XSK_UMEM__LARGE_FRAME_SIZE * 2);
+
+	return testapp_adjust_tail(test, grow_size, XSK_UMEM__LARGE_FRAME_SIZE * 2);
 }
 
 int testapp_tx_queue_consumer(struct test_spec *test)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (5 preceding siblings ...)
  2026-03-19 17:55 ` [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:41   ` Björn Töpel
  2026-03-19 17:55 ` [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom Maciej Fijalkowski
  7 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

Currently two different XDP programs share a static variable for two
different purposes (picking where to redirect in the shared umem test
and whether to drop a packet). This can be a problem when running the
full test suite - idx can be written by the shared umem test and that
value can cause false behavior in the XDP drop-half test.

Introduce a dedicated variable for the drop-half test so that these two
don't step on each other's toes.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 tools/testing/selftests/bpf/progs/xsk_xdp_progs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c b/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
index 683306db8594..40609d8c2471 100644
--- a/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
+++ b/tools/testing/selftests/bpf/progs/xsk_xdp_progs.c
@@ -15,6 +15,7 @@ struct {
 	__uint(value_size, sizeof(int));
 } xsk SEC(".maps");
 
+static unsigned int drop_idx;
 static unsigned int idx;
 int adjust_value = 0;
 int count = 0;
@@ -27,7 +28,7 @@ SEC("xdp.frags") int xsk_def_prog(struct xdp_md *xdp)
 SEC("xdp.frags") int xsk_xdp_drop(struct xdp_md *xdp)
 {
 	/* Drop every other packet */
-	if (idx++ % 2)
+	if (drop_idx++ % 2)
 		return XDP_DROP;
 
 	return bpf_redirect_map(&xsk, 0, XDP_DROP);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom
  2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
                   ` (6 preceding siblings ...)
  2026-03-19 17:55 ` [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
@ 2026-03-19 17:55 ` Maciej Fijalkowski
  2026-03-20  8:42   ` Björn Töpel
  7 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-19 17:55 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, bjorn, Maciej Fijalkowski

Since we have changed how big the user-defined headroom in a UMEM can
be, change the logic in testapp_stats_rx_dropped() so that we pass the
updated headroom validation in xdp_umem_reg() and still drop half of
the frames.

The test works on a non-mbuf setup, so xsk_pool_get_rx_frame_size()
called from xsk_rcv_check() will not account for the skb_shared_info
size. Taking the tailroom size into account in the test being fixed is
needed, as xdp_umem_reg() respects it by default.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 tools/testing/selftests/bpf/prog_tests/test_xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 95cbbf425e9a..270fd0b6dc22 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -1984,10 +1984,10 @@ int testapp_stats_rx_dropped(struct test_spec *test)
 		return TEST_SKIP;
 	}
 
-	if (pkt_stream_replace_half(test, MIN_PKT_SIZE * 4, 0))
+	if (pkt_stream_replace_half(test, MIN_PKT_SIZE + XSK_UMEM__PACKET_TAILROOM, 0))
 		return TEST_FAILURE;
 	test->ifobj_rx->umem->frame_headroom = test->ifobj_rx->umem->frame_size -
-		XDP_PACKET_HEADROOM - MIN_PKT_SIZE * 3;
+		XDP_PACKET_HEADROOM - MIN_PKT_SIZE - (XSK_UMEM__PACKET_TAILROOM - 1);
 	if (pkt_stream_receive_half(test))
 		return TEST_FAILURE;
 	test->ifobj_rx->validation_func = validate_rx_dropped;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame
  2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
@ 2026-03-20  8:13   ` Björn Töpel
  2026-03-20 15:01   ` Stanislav Fomichev
  1 sibling, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:13 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> The current headroom validation in xdp_umem_reg() could leave us with
> insufficient space to receive even a minimum-sized Ethernet frame.
> Furthermore, if multi-buffer came into play, the skb_shared_info stored
> at the end of the XSK frame would be corrupted.
>
> The multi-buffer setting is known later in the configuration process,
> so besides accounting for ETH_ZLEN, let us also take care of the
> tailroom space upfront.
>
> Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Björn Töpel <bjorn@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups
  2026-03-19 17:55 ` [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups Maciej Fijalkowski
@ 2026-03-20  8:14   ` Björn Töpel
  2026-03-20 15:01   ` Stanislav Fomichev
  1 sibling, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:14 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Multi-buffer XDP stores information about frags in skb_shared_info that
> sits at the tailroom of a packet. The storage space is reserved via
> xdp_data_hard_end():
>
> 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
> 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>
> and it is then accessed via the helper below:
>
> static inline struct skb_shared_info *
> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
> {
>         return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> }
>
> Currently we do not respect this tailroom space in the multi-buffer
> AF_XDP ZC scenario. To address this, introduce xsk_pool_get_tailroom()
> and use it within xsk_pool_get_rx_frame_size(), which ZC drivers use to
> configure the length of the HW Rx buffer.
>
> xsk_pool_get_tailroom() reserves the necessary space only when the pool
> is bound to a device and the UMEM opted into multi-buffer; rely on the
> pool->dev state when configuring the tailroom. Within ndo_bpf,
> xsk_pool_get_rx_frame_size() is usually called when bringing up the
> queues and before the xsk DMA mappings have been configured, which makes
> it valid to rely on pool->dev.
>
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Björn Töpel <bjorn@kernel.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool
  2026-03-19 17:55 ` [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool Maciej Fijalkowski
@ 2026-03-20  8:18   ` Björn Töpel
  2026-03-20 15:57     ` Maciej Fijalkowski
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:18 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> When programming dbuf in the Rx queue context, avoid rounding up the
> result of the division, as it actually corrupts the tailroom for AF_XDP
> ZC. Below is an example based on a 4k chunk size when the xsk pool
> pointer is valid on a given Rx ring:
>
> chunk_size = 4096
> headroom = 256
> tailroom = 320
>
> ring->rx_buf_len = 4096 - 256 - 320 = 3520
>
> rx_ctx.dbuf = DIV_ROUND_UP(3520, 128) ->
> 3520 / 128 = 27.5 -> round up results in 28
>
> The dbuf programming unit is 128 bytes, so a value of 28 gives
> 128 * 28 = 3584 and HW will corrupt 64 bytes of the tailroom. Decrement
> dbuf by 1 when an xsk_pool is present on the given ice_rx_ring.
>
> Also, restore the ::rx_buf_len setting via xsk_pool_get_rx_frame_size(),
> as it now respects the tailroom.
>
> Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_base.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 1667f686ff75..f9514d7bb83c 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -500,6 +500,8 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	 */
>  	rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
>  				     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
> +	if (ring->xsk_pool)
> +		rlan_ctx.dbuf--;

Hmm, won't this be overly pessimistic? Smth like

  if (ring->xsk_pool)
    rlan_ctx.dbuf = ring->rx_buf_len >> ICE_RLAN_CTX_DBUF_S;
  // else round up?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind
  2026-03-19 17:55 ` [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
@ 2026-03-20  8:38   ` Björn Töpel
  2026-03-20 15:51     ` Maciej Fijalkowski
  0 siblings, 1 reply; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:38 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> AF_XDP bind currently accepts zero-copy pool configurations without
> verifying that the device MTU fits into the usable frame space provided
> by the UMEM chunk.
>
> This becomes a problem now that we respect the tailroom, which is
> subtracted from chunk_size (along with the headroom). A 2k chunk size
> might not provide enough space for the standard 1500-byte MTU, so let
> us catch such settings at bind time.
>
> This prevents creating an already-invalid setup and complements the
> MTU change restriction for devices with an attached XSK pool.
>
> Currently xp_assign_dev_shared() does not propagate XDP_USE_SG to
> flags, so set it there in order to preserve the MTU check, which is
> supposed to be done only when no multi-buffer setup is in the picture.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Fixes tag?

This got me thinking; Nothing in the xsk core prevents MTU from being
changes while xsk is runing? Some drivers do! Not for this patch, but
maybe xsk should listen to MTU notifiers?

Seems like we're in a bind-time TOCTOU gap...

> ---
>  net/xdp/xsk_buff_pool.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 37b7a68b89b3..e9377b05118b 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  int xp_assign_dev(struct xsk_buff_pool *pool,
>  		  struct net_device *netdev, u16 queue_id, u16 flags)
>  {
> +	bool mbuf = flags & XDP_USE_SG;
>  	bool force_zc, force_copy;
>  	struct netdev_bpf bpf;
>  	int err = 0;
> @@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (err)
>  		return err;
>  
> -	if (flags & XDP_USE_SG)
> +	if (mbuf)
>  		pool->umem->flags |= XDP_UMEM_SG_FLAG;
>  
>  	if (flags & XDP_USE_NEED_WAKEUP)
> @@ -200,10 +201,18 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		goto err_unreg_pool;
>  	}
>  
> -	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> +	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
>  		err = -EOPNOTSUPP;
>  		goto err_unreg_pool;
>  	}
> +#define ETH_PAD_LEN (ETH_HLEN + 2 * VLAN_HLEN  + ETH_FCS_LEN)

Yuk! Move this somewhere else.

> +	if (!mbuf) {
> +		if (READ_ONCE(netdev->mtu) +  ETH_PAD_LEN >

I think READ_ONCE sends wrong signal to readers. We're in an
ASSERT_RTNL() region.

> +		    xsk_pool_get_rx_frame_size(pool)) {
> +			err = -EINVAL;
> +			goto err_unreg_pool;
> +		}
> +	}
>  
>  	if (dev_get_min_mp_channel_count(netdev)) {
>  		err = -EBUSY;
> @@ -247,6 +256,9 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_sock *umem_xs,
>  	struct xdp_umem *umem = umem_xs->umem;
>  
>  	flags = umem->zc ? XDP_ZEROCOPY : XDP_COPY;
> +	if (umem->flags & XDP_UMEM_SG_FLAG)
> +		flags |= XDP_USE_SG;
> +
>  	if (umem_xs->pool->uses_need_wakeup)
>  		flags |= XDP_USE_NEED_WAKEUP;
>  
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 21+ messages in thread
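
The arithmetic behind the proposed bind-time check can be modeled in
plain userspace C. This is only an illustrative sketch: the 256-byte
headroom and 320-byte tailroom values are taken from the example used
later in this thread (patch 3/8), while in the kernel the tailroom is
derived from SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN    14	/* ethernet header */
#define VLAN_HLEN    4	/* one VLAN tag */
#define ETH_FCS_LEN  4	/* frame check sequence */
#define ETH_PAD_LEN (ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)

/* usable payload space in one UMEM chunk once head- and tailroom
 * have been carved out */
static unsigned int rx_frame_size(unsigned int chunk_size,
				  unsigned int headroom,
				  unsigned int tailroom)
{
	return chunk_size - headroom - tailroom;
}

/* the proposed bind-time rule: without XDP_USE_SG, a full MTU-sized
 * frame plus L2 overhead must fit into a single chunk */
static bool mtu_fits(unsigned int mtu, unsigned int frame_size)
{
	return mtu + ETH_PAD_LEN <= frame_size;
}
```

With the numbers above, rx_frame_size(2048, 256, 320) is 1472, so a
1500-byte MTU (1526 bytes with two VLAN tags and FCS) must be rejected,
while a 4k chunk leaves 3520 bytes and passes.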

* Re: [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests
  2026-03-19 17:55 ` [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
@ 2026-03-20  8:40   ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:40 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Skip tail adjust tests in xskxceiver for SKB mode as it is not very
> friendly for them. The multi-buffer case does not work because the
> xdp_rxq_info registered for generic XDP does not report ::frag_size.
> The non-mbuf path copies the packet via skb_pp_cow_data(), which only
> accounts for headroom, leaving us with no tailroom and therefore
> causing the underlying XDP prog to drop packets.
>
> For multi-buffer test on other modes, change the amount of bytes we use
> for growth, assume worst-case scenario and take care of headroom and
> tailroom.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Björn Töpel <bjorn@kernel.org>


* Re: [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test
  2026-03-19 17:55 ` [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
@ 2026-03-20  8:41   ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:41 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Currently two different XDP programs share a static variable for
> different purposes (picking where to redirect on the shared umem test &
> whether to drop a packet). This can be a problem when running the full
> test suite - idx can be written by the shared umem test and this value
> can cause false behavior within the XDP drop-half test.
>
> Introduce a dedicated variable for the drop-half test so that these two
> don't step on each other's toes.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Björn Töpel <bjorn@kernel.org>


* Re: [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom
  2026-03-19 17:55 ` [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom Maciej Fijalkowski
@ 2026-03-20  8:42   ` Björn Töpel
  0 siblings, 0 replies; 21+ messages in thread
From: Björn Töpel @ 2026-03-20  8:42 UTC (permalink / raw)
  To: Maciej Fijalkowski, netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin, Maciej Fijalkowski

Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> Since we have changed how big the user-defined headroom in umem can be,
> change the logic in testapp_stats_rx_dropped() so we pass the updated
> headroom validation in xdp_umem_reg() and still drop half of the
> frames.
>
> The test works on a non-mbuf setup, so xsk_pool_get_rx_frame_size()
> called from xsk_rcv_check() will not account for the skb_shared_info
> size. Taking the tailroom size into account in the test being fixed is
> needed as xdp_umem_reg() respects it by default.
>
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reviewed-by: Björn Töpel <bjorn@kernel.org>


* Re: [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame
  2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
  2026-03-20  8:13   ` Björn Töpel
@ 2026-03-20 15:01   ` Stanislav Fomichev
  1 sibling, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-03-20 15:01 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, kuba, pabeni, horms, larysa.zaremba,
	aleksander.lobakin, bjorn

On 03/19, Maciej Fijalkowski wrote:
> The current headroom validation in xdp_umem_reg() could leave us with
> insufficient space to receive even a minimum-sized ethernet frame.
> Furthermore, if multi-buffer were to come into play, the
> skb_shared_info stored at the end of the XSK frame would be corrupted.
> 
> The multi-buffer setting is only known later in the configuration
> process, so besides accounting for ETH_ZLEN, let us also take care of
> the tailroom space upfront.
> 
> Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

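
The tightened validation itself is not quoted in this thread, so the
following is only an illustrative model of the rule the commit message
describes: the user-supplied headroom plus the tailroom may not eat into
the space needed for a minimum-sized (ETH_ZLEN, 60-byte) frame in each
chunk. The function name is hypothetical, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_ZLEN 60	/* minimum ethernet frame length */

/* Illustrative model of the tightened xdp_umem_reg() check: after
 * reserving the user headroom and the (worst-case) tailroom, a
 * minimum-sized frame must still fit into one chunk. */
static bool umem_headroom_ok(unsigned int chunk_size,
			     unsigned int headroom,
			     unsigned int tailroom)
{
	return headroom + tailroom + ETH_ZLEN <= chunk_size;
}
```

For a 2k chunk, a 256-byte headroom with a 320-byte tailroom still
leaves room (636 <= 2048), while e.g. an 1800-byte headroom no longer
does - exactly the kind of setup the old validation let through.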

* Re: [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups
  2026-03-19 17:55 ` [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups Maciej Fijalkowski
  2026-03-20  8:14   ` Björn Töpel
@ 2026-03-20 15:01   ` Stanislav Fomichev
  1 sibling, 0 replies; 21+ messages in thread
From: Stanislav Fomichev @ 2026-03-20 15:01 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, kuba, pabeni, horms, larysa.zaremba,
	aleksander.lobakin, bjorn

On 03/19, Maciej Fijalkowski wrote:
> Multi-buffer XDP stores information about frags in skb_shared_info that
> sits at the tailroom of a packet. The storage space is reserved via
> xdp_data_hard_end():
> 
> 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
> 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> 
> and then we refer to it via macro below:
> 
> static inline struct skb_shared_info *
> xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
> {
>         return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> }
> 
> Currently we do not respect this tailroom space in the multi-buffer
> AF_XDP ZC scenario. To address this, introduce xsk_pool_get_tailroom()
> and use it within xsk_pool_get_rx_frame_size(), which is used by ZC
> drivers to configure the length of the HW Rx buffer.
> 
> xsk_pool_get_tailroom() only reserves the necessary space when the pool
> is ZC and the underlying netdev supports ZC multi-buffer. Rely on the
> pool->dev state when configuring the tailroom.
> xsk_pool_get_rx_frame_size() inside ndo_bpf is usually called when
> bringing up queues and before xsk's DMA mappings have been configured,
> which makes it valid to rely on pool->dev.
> 
> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

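
The tailroom reserved here is the cache-line-aligned skb_shared_info
from the xdp_data_hard_end() definition quoted above. A userspace model
of that arithmetic - the 64-byte cache line (SMP_CACHE_BYTES) and the
300-byte raw skb_shared_info size are assumptions for illustration
only, since the real size is architecture-dependent:

```c
#include <assert.h>

/* align x up to the next multiple of a (a power of two); the same
 * arithmetic as the kernel's ALIGN()/SKB_DATA_ALIGN() macros */
static unsigned int align_up(unsigned int x, unsigned int a)
{
	return (x + a - 1) & ~(a - 1);
}

/* usable rx frame once head- and tailroom are subtracted, mirroring
 * what xsk_pool_get_rx_frame_size() would hand to a ZC driver */
static unsigned int zc_rx_frame_size(unsigned int chunk_size,
				     unsigned int headroom,
				     unsigned int shinfo_size,
				     unsigned int cacheline)
{
	return chunk_size - headroom - align_up(shinfo_size, cacheline);
}
```

Under these assumptions a 300-byte skb_shared_info aligns up to 320
bytes of tailroom, and a 4k chunk with 256 bytes of headroom yields the
3520-byte rx_buf_len that patch 3/8 in this thread works through.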

* Re: [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind
  2026-03-20  8:38   ` Björn Töpel
@ 2026-03-20 15:51     ` Maciej Fijalkowski
  2026-03-21 12:21       ` Maciej Fijalkowski
  0 siblings, 1 reply; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-20 15:51 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin

On Fri, Mar 20, 2026 at 09:38:00AM +0100, Björn Töpel wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> 
> > AF_XDP bind currently accepts zero-copy pool configurations without
> > verifying that the device MTU fits into the usable frame space provided
> > by the UMEM chunk.
> >
> > This becomes a problem since we started to respect the tailroom,
> > which is subtracted from chunk_size (along with the headroom). A 2k
> > chunk size might not provide enough space for the standard 1500 MTU,
> > so let us catch such settings at bind time.
> >
> > This prevents creating an already-invalid setup and complements the
> > MTU change restriction for devices with an attached XSK pool.
> >
> > Currently xp_assign_dev_shared() does not propagate XDP_USE_SG to
> > flags, so set it there in order to preserve the MTU check, which is
> > supposed to be done only when no multi-buffer setup is in the picture.
> >
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> 
> Fixes tag?

My reasoning was that since this came up due to respecting the tailroom,
I went with no Fixes tag, but the missing mbuf flag setting for shared
umem was an existing bug, so we could pick something now.

> 
> This got me thinking; nothing in the xsk core prevents the MTU from
> being changed while xsk is running? Some drivers do! Not for this
> patch, but maybe xsk should listen to MTU notifiers?

Nice, I had the exact patch locally, attached at the bottom [0].
However, I withdrew it as this would require a rework of xskxceiver -
the test suite actually changes the MTU while keeping XSK resources
alive.

We can get back to this idea but I didn't want to stir up the pot too much
in this set.

> 
> Seems like we're in a bind-time TOCTOU gap...
> 
> > ---
> >  net/xdp/xsk_buff_pool.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 37b7a68b89b3..e9377b05118b 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
> >  int xp_assign_dev(struct xsk_buff_pool *pool,
> >  		  struct net_device *netdev, u16 queue_id, u16 flags)
> >  {
> > +	bool mbuf = flags & XDP_USE_SG;
> >  	bool force_zc, force_copy;
> >  	struct netdev_bpf bpf;
> >  	int err = 0;
> > @@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >  	if (err)
> >  		return err;
> >  
> > -	if (flags & XDP_USE_SG)
> > +	if (mbuf)
> >  		pool->umem->flags |= XDP_UMEM_SG_FLAG;
> >  
> >  	if (flags & XDP_USE_NEED_WAKEUP)
> > @@ -200,10 +201,18 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> >  		goto err_unreg_pool;
> >  	}
> >  
> > -	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> > +	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
> >  		err = -EOPNOTSUPP;
> >  		goto err_unreg_pool;
> >  	}
> > +#define ETH_PAD_LEN (ETH_HLEN + 2 * VLAN_HLEN  + ETH_FCS_LEN)
> 
> Yuk! Move this somewhere else.
> 
> > +	if (!mbuf) {
> > +		if (READ_ONCE(netdev->mtu) +  ETH_PAD_LEN >
> 
> I think READ_ONCE sends wrong signal to readers. We're in an
> ASSERT_RTNL() region.

Good point!

> 
> > +		    xsk_pool_get_rx_frame_size(pool)) {
> > +			err = -EINVAL;
> > +			goto err_unreg_pool;
> > +		}
> > +	}
> >  
> >  	if (dev_get_min_mp_channel_count(netdev)) {
> >  		err = -EBUSY;
> > @@ -247,6 +256,9 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_sock *umem_xs,
> >  	struct xdp_umem *umem = umem_xs->umem;
> >  
> >  	flags = umem->zc ? XDP_ZEROCOPY : XDP_COPY;
> > +	if (umem->flags & XDP_UMEM_SG_FLAG)
> > +		flags |= XDP_USE_SG;
> > +
> >  	if (umem_xs->pool->uses_need_wakeup)
> >  		flags |= XDP_USE_NEED_WAKEUP;
> >  
> > -- 
> > 2.43.0

[0]:
Subject: [PATCH net 4/5] xsk: forbid MTU changes while an XSK pool is attached

AF_XDP pool setup depends on the netdev configuration that is in effect
at bind time.

In particular, the usable frame size seen by zero-copy drivers is
derived from the UMEM chunk layout. Changing MTU after a pool has been
attached can invalidate that setup and leave the device and pool with
incompatible packet geometry.

Reject NETDEV_PRECHANGEMTU when an XSK pool is attached to the device.
This keeps the policy in the AF_XDP code, avoids touching individual
ndo_change_mtu() implementations, and stops the MTU change before the
driver callback is reached.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e6530996053b..73cbd5774031 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1790,6 +1790,18 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	return remap_vmalloc_range(vma, q->ring, 0);
 }
 
+static bool xsk_dev_has_pool(struct net_device *dev)
+{
+	u32 i, n;
+
+	n = max_t(u32, dev->real_num_rx_queues, dev->real_num_tx_queues);
+	for (i = 0; i < n; i++)
+		if (xsk_get_pool_from_qid(dev, i))
+			return true;
+
+	return false;
+}
+
 static int xsk_notifier(struct notifier_block *this,
 			unsigned long msg, void *ptr)
 {
@@ -1798,6 +1810,11 @@ static int xsk_notifier(struct notifier_block *this,
 	struct sock *sk;
 
 	switch (msg) {
+	case NETDEV_PRECHANGEMTU:
+		if (xsk_dev_has_pool(dev))
+			return notifier_from_errno(-EBUSY);
+		break;
+
 	case NETDEV_UNREGISTER:
 		mutex_lock(&net->xdp.lock);
 		sk_for_each(sk, &net->xdp.list) {
-- 
2.43.0



* Re: [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool
  2026-03-20  8:18   ` Björn Töpel
@ 2026-03-20 15:57     ` Maciej Fijalkowski
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-20 15:57 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin

On Fri, Mar 20, 2026 at 09:18:19AM +0100, Björn Töpel wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> 
> > When programming dbuf in the rx queue context, avoid rounding up the
> > division result as it actually corrupts the tailroom for AF_XDP ZC.
> > Below is an example based on a 4k chunk size when the xsk pool
> > pointer is valid on a given rx ring:
> >
> > chunk_size = 4096
> > headroom = 256
> > tailroom = 320
> >
> > ring->rx_buf_len = 4096 - 256 - 320 = 3520
> >
> > rx_ctx.dbuf = DIV_ROUND_UP(3520, 128) ->
> > 3520 / 128 = 27.5 -> round up results in 28
> >
> > The dbuf programming unit is 128, so 28 gives 128 * 28 = 3584 and HW
> > will corrupt 64 bytes of the tailroom. Decrement dbuf by 1 when
> > xsk_pool is present on the given ice_rx_ring.
> >
> > Also, restore the ::rx_buf_len setting via
> > xsk_pool_get_rx_frame_size() as it now respects the tailroom.
> >
> > Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice_base.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> > index 1667f686ff75..f9514d7bb83c 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_base.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> > @@ -500,6 +500,8 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
> >  	 */
> >  	rlan_ctx.dbuf = DIV_ROUND_UP(ring->rx_buf_len,
> >  				     BIT_ULL(ICE_RLAN_CTX_DBUF_S));
> > +	if (ring->xsk_pool)
> > +		rlan_ctx.dbuf--;
> 
> Hmm, wont this be overly pessimistic? Smth like
> 
>   if (ring->xsk_pool)
>     rlan_ctx.dbuf = ring->rx_buf_len >> ICE_RLAN_CTX_DBUF_S;
>   // else round up?

Ok!

> 
> 

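
The rounding problem in the commit message above is pure integer
arithmetic and can be checked standalone; 128 here stands for
BIT(ICE_RLAN_CTX_DBUF_S) from the quoted driver code:

```c
#include <assert.h>

#define DBUF_UNIT 128	/* ice programs rx buffer length in 128B units */

/* DIV_ROUND_UP-style rounding, as in the current driver code */
static unsigned int dbuf_round_up(unsigned int rx_buf_len)
{
	return (rx_buf_len + DBUF_UNIT - 1) / DBUF_UNIT;
}

/* plain round-down, as suggested for the xsk_pool case */
static unsigned int dbuf_round_down(unsigned int rx_buf_len)
{
	return rx_buf_len / DBUF_UNIT;
}
```

For rx_buf_len = 3520 (4096 - 256 - 320), rounding up yields 28, i.e.
28 * 128 = 3584 bytes programmed into HW - 64 bytes past the buffer and
into the tailroom - while rounding down yields 27, i.e. 3456 bytes,
which stays inside the buffer.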

* Re: [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind
  2026-03-20 15:51     ` Maciej Fijalkowski
@ 2026-03-21 12:21       ` Maciej Fijalkowski
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej Fijalkowski @ 2026-03-21 12:21 UTC (permalink / raw)
  To: Björn Töpel
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	larysa.zaremba, aleksander.lobakin

On Fri, Mar 20, 2026 at 04:51:16PM +0100, Maciej Fijalkowski wrote:
> On Fri, Mar 20, 2026 at 09:38:00AM +0100, Björn Töpel wrote:
> > Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> > 

[...]

> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > index 37b7a68b89b3..e9377b05118b 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -157,6 +157,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
> > >  int xp_assign_dev(struct xsk_buff_pool *pool,
> > >  		  struct net_device *netdev, u16 queue_id, u16 flags)
> > >  {
> > > +	bool mbuf = flags & XDP_USE_SG;
> > >  	bool force_zc, force_copy;
> > >  	struct netdev_bpf bpf;
> > >  	int err = 0;
> > > @@ -178,7 +179,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > >  	if (err)
> > >  		return err;
> > >  
> > > -	if (flags & XDP_USE_SG)
> > > +	if (mbuf)
> > >  		pool->umem->flags |= XDP_UMEM_SG_FLAG;
> > >  
> > >  	if (flags & XDP_USE_NEED_WAKEUP)
> > > @@ -200,10 +201,18 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > >  		goto err_unreg_pool;
> > >  	}
> > >  
> > > -	if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> > > +	if (netdev->xdp_zc_max_segs == 1 && mbuf) {
> > >  		err = -EOPNOTSUPP;
> > >  		goto err_unreg_pool;
> > >  	}
> > > +#define ETH_PAD_LEN (ETH_HLEN + 2 * VLAN_HLEN  + ETH_FCS_LEN)
> > 
> > Yuk! Move this somewhere else.

Forgot to respond here. I'm going to just place it at the top of the
file, but to properly address this we could introduce a common define
for this calculation, since we already define it three times, at least
from the Intel source code PoV, and I assume I could easily dig up more
open-coded examples:

drivers/net/ethernet/intel/ice/ice_txrx.h:73
#define ICE_ETH_PKT_HDR_PAD	(ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))

drivers/net/ethernet/intel/i40e/i40e_txrx.h:115
#define I40E_PACKET_HDR_PAD (ETH_HLEN + ETH_FCS_LEN + (VLAN_HLEN * 2))

include/net/libeth/rx.h:21
#define LIBETH_RX_LL_LEN	(ETH_HLEN + 2 * VLAN_HLEN + ETH_FCS_LEN)

But such refactor is obviously a -next material.

[...]


Thread overview: 21+ messages
2026-03-19 17:55 [PATCH v2 net 0/8] xsk: tailroom reservation and MTU validation Maciej Fijalkowski
2026-03-19 17:55 ` [PATCH v2 net 1/8] xsk: tighten UMEM headroom validation to account for tailroom and min frame Maciej Fijalkowski
2026-03-20  8:13   ` Björn Töpel
2026-03-20 15:01   ` Stanislav Fomichev
2026-03-19 17:55 ` [PATCH v2 net 2/8] xsk: respect tailroom for ZC setups Maciej Fijalkowski
2026-03-20  8:14   ` Björn Töpel
2026-03-20 15:01   ` Stanislav Fomichev
2026-03-19 17:55 ` [PATCH v2 net 3/8] ice: do not round up result of dbuf calculation for xsk pool Maciej Fijalkowski
2026-03-20  8:18   ` Björn Töpel
2026-03-20 15:57     ` Maciej Fijalkowski
2026-03-19 17:55 ` [PATCH v2 net 4/8] i40e: do not round up result of dbuff " Maciej Fijalkowski
2026-03-19 17:55 ` [PATCH v2 net 5/8] xsk: validate MTU against usable frame size on bind Maciej Fijalkowski
2026-03-20  8:38   ` Björn Töpel
2026-03-20 15:51     ` Maciej Fijalkowski
2026-03-21 12:21       ` Maciej Fijalkowski
2026-03-19 17:55 ` [PATCH v2 net 6/8] selftests: bpf: fix pkt grow tests Maciej Fijalkowski
2026-03-20  8:40   ` Björn Töpel
2026-03-19 17:55 ` [PATCH v2 net 7/8] selftests: bpf: have a separate variable for drop test Maciej Fijalkowski
2026-03-20  8:41   ` Björn Töpel
2026-03-19 17:55 ` [PATCH v2 net 8/8] selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom Maciej Fijalkowski
2026-03-20  8:42   ` Björn Töpel
