[PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim

* [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
@ 2026-06-23 13:32 Maciej Fijalkowski
  2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

Hi,

This series fixes several AF_XDP multi-buffer Tx paths where descriptors
consumed from the Tx ring are not consistently returned to userspace
through the completion ring when the packet is later dropped as invalid.

The affected cases are invalid or oversized multi-buffer Tx packets in
both the generic and zero-copy paths. In these cases, the kernel can
consume one or more Tx descriptors while building or validating a
multi-buffer packet, then drop the packet before it reaches the device.
Userspace regains ownership of UMEM buffers only after the corresponding
addresses are returned through the CQ. Missing completions therefore
make userspace lose track of those buffers.
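
The ownership rule above can be sketched from the application's point
of view. This is a toy model only: struct frame_acct and the acct_*()
helpers are hypothetical names made up for illustration, not part of
the kernel or of any AF_XDP library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace bookkeeping: frames handed to the kernel via
 * the Tx ring are "in flight" until their addresses appear in the CQ.
 */
struct frame_acct {
	uint32_t total;     /* frames in the UMEM */
	uint32_t in_flight; /* submitted to Tx, not yet completed */
};

/* Frames the application may still fill and submit. */
static uint32_t acct_free_frames(const struct frame_acct *a)
{
	return a->total - a->in_flight;
}

/* Application placed n descriptors on the Tx ring. */
static void acct_submit_tx(struct frame_acct *a, uint32_t n)
{
	assert(n <= acct_free_frames(a));
	a->in_flight += n;
}

/* Application read n addresses back from the completion ring. */
static void acct_complete(struct frame_acct *a, uint32_t n)
{
	assert(n <= a->in_flight);
	a->in_flight -= n;
}
```

In this model, if the kernel consumes three descriptors but completes
only two, in_flight never returns to zero and one frame stays
unavailable to the application.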

The generic path fixes cover three related cases:
* partially built multi-buffer skbs dropped by xsk_drop_skb();
* continuation descriptors left in the Tx ring after xsk_build_skb()
  reports overflow;
* invalid descriptors encountered in the middle of a multi-buffer
  packet, including the offending invalid descriptor itself.

The zero-copy path is handled separately. The batched Tx parser now
distinguishes descriptors that can be passed to the driver from
descriptors that are consumed only because they belong to an invalid
multi-buffer packet. Reclaim-only descriptors are written to the CQ
address area and published in completion order, after any earlier
driver-visible Tx descriptors.

The ZC batching path can also retain drain state when userspace has not
yet provided the end of an invalid multi-buffer packet. To keep this
state local to the singular batched path, the series prevents a second
Tx socket from joining the same pool while such drain state exists.
During the singular-to-shared transition, Tx batching is gated,
pre-existing readers are waited out, and bind fails with -EAGAIN if the
existing socket still has pending drain state. This avoids adding
multi-buffer drain handling to the shared-UMEM fallback path.

The last two patches update xskxceiver so the tests account invalid
multi-buffer Tx packets as descriptors that must be reclaimed, while
still not expecting those invalid packets on the Rx side.

This is a follow-up to Jason's changes [0], which addressed the generic
xmit path only. With this set applied, the full xskxceiver test suite
passes against the ice driver.

Thanks,
Maciej

[0]: https://lore.kernel.org/netdev/20260520004244.55663-1-kerneljasonxing@gmail.com/

Jason Xing (3):
  xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  xsk: drain continuation descs after overflow in xsk_build_skb()
  xsk: drain continuation descs on invalid descriptor in
    __xsk_generic_xmit()

Maciej Fijalkowski (4):
  xsk: reclaim offending invalid desc in generic multi-buffer Tx
  xsk: reclaim invalid multi-buffer Tx descs in ZC path
  selftests/xsk: fix too-many-frags multi-buffer Tx test
  selftests/xsk: account invalid multi-buffer Tx descriptors

 include/net/xdp_sock.h                        |   1 +
 include/net/xsk_buff_pool.h                   |   6 +
 net/xdp/xsk.c                                 | 114 ++++++++++++++++--
 net/xdp/xsk_buff_pool.c                       |  66 ++++++++++
 net/xdp/xsk_queue.h                           |  66 +++++++---
 .../selftests/bpf/prog_tests/test_xsk.c       |  44 ++++---
 6 files changed, 254 insertions(+), 43 deletions(-)

-- 
2.43.0


* [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check [1] from sashiko, which points out
that when overflow happens, the address to be published to the cq is
invalid. The more severe problem is that the whole process of publishing
cq addresses is wrong in this particular case: the kernel should truly
publish the addresses and advance cached_prod in the cq for every
descriptor it has read from the txq.

The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
   the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().
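
The difference between cancelling and submitting the reserved CQ slots
can be sketched with a toy ring. Everything below (struct toy_cq and
the toy_cq_*() helpers) is a hypothetical simplification written for
this illustration; it is not the kernel's struct xsk_queue and it
ignores locking.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CQ_SIZE 8u /* power of two, chosen for the example */

/* Toy completion queue: cached_prod tracks reserved slots, producer is
 * the index userspace can observe.
 */
struct toy_cq {
	uint32_t cached_prod;
	uint32_t producer;
	uint64_t ring[TOY_CQ_SIZE];
};

/* Reserve one slot per descriptor consumed from the Tx ring. */
static void toy_cq_reserve(struct toy_cq *cq, uint32_t n)
{
	assert(cq->cached_prod - cq->producer + n <= TOY_CQ_SIZE);
	cq->cached_prod += n;
}

/* Old drop behavior: hand the reservations back, publish nothing. */
static void toy_cq_cancel(struct toy_cq *cq, uint32_t n)
{
	cq->cached_prod -= n;
}

/* New drop behavior: write the addresses and make them visible. */
static void toy_cq_submit(struct toy_cq *cq, const uint64_t *addrs,
			  uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		cq->ring[(cq->producer + i) & (TOY_CQ_SIZE - 1)] = addrs[i];
	cq->producer += n;
}
```

With toy_cq_cancel() the producer index never moves, so userspace has
nothing to read back; with toy_cq_submit() the three addresses of a
dropped three-fragment packet become visible.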

Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not an appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which would obviously deadlock the TX path.

Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b970f30ea9b9..a7a83dc4546a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)

 static void xsk_drop_skb(struct sk_buff *skb)
 {
-	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-	xsk_consume_skb(skb);
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	xs->tx->invalid_descs += xsk_get_num_desc(skb);
+	consume_skb(skb);
+	xs->skb = NULL;
 }

 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(-ENOMEM);

 		/* in case of -EOVERFLOW that could happen below,
-		 * xsk_consume_skb() will release this node as whole skb
+		 * xsk_drop_skb() will release this node as whole skb
 		 * would be dropped, which implies freeing all list elements
 		 */
 		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 			}

+			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
 			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
 				err = -EOVERFLOW;
 				goto free_err;
@@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,

 			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
 			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
 		}
 	}

-- 
2.43.0


* [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
  2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing, Maciej Fijalkowski

From: Jason Xing <kernelxing@tencent.com>

When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
only the current descriptor is released from the TX ring. The remaining
continuation descriptors of the same packet stay in the ring. Since
xs->skb is set to NULL after the drop, the TX loop picks up these
leftover frags and misinterprets each one as the beginning of a new
packet, corrupting the packet stream.

Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised so
that the remaining descriptors of the overflowed packet can be examined
and handled.

When the last fragment (without XDP_PKT_CONTD) is processed, the flag
is cleared and the loop continues to process subsequent descriptors
with the remaining budget. This behavior matches how the xmit path
previously treated overflowed packets.
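
The drain behavior can be sketched as a small state machine over a
descriptor array. This is a simplified model under stated assumptions:
TOY_MAX_DESCS stands in for the real fragment limit, the toy_* names
are hypothetical, and CQ backpressure and budget handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PKT_CONTD 1u /* stand-in for XDP_PKT_CONTD */
#define TOY_MAX_DESCS 3u /* toy per-packet descriptor limit */

struct toy_desc {
	uint64_t addr;
	uint32_t options;
};

struct toy_stats {
	uint32_t sent_pkts;     /* complete packets handed to the device */
	uint32_t reclaimed;     /* dropped descriptors returned via the CQ */
	uint32_t invalid_descs; /* mirrors the invalid_descs counter */
};

static struct toy_stats toy_xmit(const struct toy_desc *descs, uint32_t n)
{
	struct toy_stats st = {0};
	bool drain_cont = false;
	uint32_t frags = 0; /* descriptors in the packet being built */

	assert(descs || n == 0);
	for (uint32_t i = 0; i < n; i++) {
		bool contd = descs[i].options & TOY_PKT_CONTD;

		if (drain_cont) {
			/* Leftover frag of a dropped packet: reclaim it
			 * and keep draining until the last frag is seen.
			 */
			st.reclaimed++;
			st.invalid_descs++;
			drain_cont = contd;
			continue;
		}

		frags++;
		if (frags == TOY_MAX_DESCS && contd) {
			/* Overflow: drop what was gathered so far and
			 * arm the drain for the rest of the packet.
			 */
			st.reclaimed += frags;
			st.invalid_descs += frags;
			frags = 0;
			drain_cont = true;
			continue;
		}

		if (!contd) {
			st.sent_pkts++;
			frags = 0;
		}
	}
	return st;
}
```

Without the drain_cont branch, the two trailing fragments of a
five-fragment packet would be parsed as the start of a new packet.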

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # wrapped cq addr submission onto routine
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  1 +
 net/xdp/xsk.c          | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ebac60a3d8a1..8b51876efbed 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -80,6 +80,7 @@ struct xdp_sock {
 	 * call of __xsk_generic_xmit().
 	 */
 	struct sk_buff *skb;
+	bool drain_cont;
 
 	struct list_head map_list;
 	/* Protects map_list */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a7a83dc4546a..e80c035a7af5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -737,6 +737,19 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 	spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
 }
 
+static void xsk_cq_submit_addr_single_locked(struct xsk_buff_pool *pool,
+					     struct xdp_desc *desc)
+{
+	unsigned long flags;
+	u32 idx;
+
+	spin_lock_irqsave(&pool->cq_prod_lock, flags);
+	idx = xskq_get_prod(pool->cq);
+	xskq_prod_write_addr(pool->cq, idx, desc->addr);
+	xskq_prod_submit_n(pool->cq, 1);
+	spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
+}
+
 static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
 {
 	spin_lock(&pool->cq->cq_cached_prod_lock);
@@ -1063,11 +1076,22 @@ static int __xsk_generic_xmit(struct sock *sk)
 			goto out;
 		}
 
+		if (unlikely(xs->drain_cont)) {
+			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
+			xs->tx->invalid_descs++;
+			xskq_cons_release(xs->tx);
+			xs->drain_cont = xp_mb_desc(&desc);
+			continue;
+		}
+
 		skb = xsk_build_skb(xs, &desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
 			if (err != -EOVERFLOW)
 				goto out;
+			if (xp_mb_desc(&desc))
+				xs->drain_cont = true;
 			err = 0;
 			continue;
 		}
-- 
2.43.0



* [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
  2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
  2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing

From: Jason Xing <kernelxing@tencent.com>

When the TX loop in __xsk_generic_xmit() encounters an invalid
descriptor mid-packet (e.g. an out-of-bounds address), the partial
skb is dropped and the offending descriptor is released. However,
remaining continuation descriptors belonging to the same multi-buffer
packet still sit in the TX ring. Since xs->skb becomes NULL after the
drop, the next iteration treats the leftover continuation fragment as
a brand-new packet, corrupting the packet stream.

Fix this by setting the drain_cont flag when the released descriptor
has XDP_PKT_CONTD set. On the next call to __xsk_generic_xmit(), the
drain logic introduced in the previous patch handles the remaining
fragments with normal CQ backpressure.

There is one subtle case: if a subsequent continuation descriptor also
has an invalid address, xskq_cons_peek_desc() rejects it and the
while loop is never entered, so the in-loop drain path cannot clear
drain_cont. The post-loop code already handles this: it sees
xskq_has_descs() is true (the failed descriptor was read but not
released by peek), releases it, and checks its XDP_PKT_CONTD flag.
Assign drain_cont from that flag so that when the released descriptor
is the last fragment (no XDP_PKT_CONTD), drain_cont is cleared. This
prevents the next valid packet from being incorrectly drained.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e80c035a7af5..c489fadc3608 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1128,6 +1128,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
 		xskq_cons_release(xs->tx);
+		xs->drain_cont = xp_mb_desc(&desc);
 	}
 
 out:
-- 
2.43.0



* [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
                   ` (2 preceding siblings ...)
  2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

After an invalid descriptor is found in __xsk_generic_xmit(),
xskq_cons_peek_desc() returns false and the loop body is not entered.
Jason's drain fixes reclaim descriptors already attached to xs->skb and
later continuation descriptors handled through drain_cont, but the
offending descriptor that made peek fail is only released from the Tx
ring.

This loses one completion for each invalid multi-buffer packet in the
generic path. Userspace then waits forever for a descriptor that has
already been consumed by the kernel.

If the failed descriptor belongs to an already-started or already-draining
multi-buffer packet, publish its address to the completion ring before
releasing it. Standalone invalid descriptors keep the existing behavior.
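
The decision described above can be sketched as follows. This is a toy
model, not the kernel code: struct toy_tx_state and
toy_handle_failed_desc() are hypothetical names, and the CQ is reduced
to two counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_tx_state {
	bool has_partial_skb;  /* a multi-buffer packet was started */
	bool drain_cont;       /* an invalid packet is being drained */
	uint32_t cq_free;      /* free completion ring slots */
	uint32_t cq_submitted; /* addresses published for failed descs */
};

/* Handle the descriptor that made peek fail. desc_contd says whether
 * that descriptor carries the continuation flag.
 */
static int toy_handle_failed_desc(struct toy_tx_state *s, bool desc_contd)
{
	bool reclaim = s->has_partial_skb || s->drain_cont;

	if (reclaim) {
		/* A CQ slot is needed first; retry later if none is free. */
		if (!s->cq_free)
			return -EAGAIN;
		s->cq_free--;
	}

	/* The partial packet is dropped; completions for its own
	 * descriptors are not modelled here.
	 */
	s->has_partial_skb = false;

	if (reclaim)
		s->cq_submitted++;

	s->drain_cont = desc_contd;
	return 0;
}
```

A standalone invalid descriptor leaves cq_submitted untouched, while
one found mid-packet or mid-drain produces exactly one completion.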

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c489fadc3608..43791647cf18 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1125,8 +1125,22 @@ static int __xsk_generic_xmit(struct sock *sk)
 	}
 
 	if (xskq_has_descs(xs->tx)) {
+		bool reclaim_desc = xs->skb || xs->drain_cont;
+
+		if (reclaim_desc) {
+			err = xsk_cq_reserve_locked(xs->pool);
+			if (err) {
+				err = -EAGAIN;
+				goto out;
+			}
+		}
+
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
+
+		if (reclaim_desc)
+			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
 		xskq_cons_release(xs->tx);
 		xs->drain_cont = xp_mb_desc(&desc);
 	}
-- 
2.43.0



* [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
                   ` (3 preceding siblings ...)
  2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

Currently, the zero-copy Tx batching path stops when it encounters an
invalid descriptor. For multi-buffer packets this can leave descriptors
consumed from the Tx ring without returning their buffers to userspace
through the completion ring.

Handle invalid multi-buffer packets as a packet-sized unit. Keep
descriptors that are valid for transmission separate from descriptors
that are consumed only because they belong to an invalid multi-buffer
packet. The former are returned to the driver as Tx work, while the
latter are written to the CQ address area so they can be reclaimed by
userspace.

The batched path can retain drain state when the producer has not yet
supplied the end of an invalid packet. Do not allow a second Tx socket to
join the pool while such state exists. Gate the batched data path while a
same-pool bind waits for pre-existing readers, then either add the new
socket or fail the bind with -EAGAIN. This guarantees that drain state is
handled only by the singular batched path and avoids teaching the shared
UMEM fallback path about multi-buffer packet draining.
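
The gate can be sketched in userspace C11 atomics. This is only a
single-threaded model of the ordering described above: struct toy_gate
and the toy_*() helpers are hypothetical, and the wait for pre-existing
readers is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_gate {
	atomic_bool tx_share_pending; /* bind is converting to shared */
	bool drain_cont;              /* existing socket is mid-drain */
};

static void toy_gate_init(struct toy_gate *g)
{
	atomic_init(&g->tx_share_pending, false);
	g->drain_cont = false;
}

/* Data path: skip the singular batched path while the gate is armed. */
static bool toy_batch_may_run(struct toy_gate *g)
{
	return !atomic_load_explicit(&g->tx_share_pending,
				     memory_order_acquire);
}

/* Bind path: arm the gate, then refuse to share if a drain is pending. */
static int toy_prepare_share(struct toy_gate *g)
{
	if (atomic_load_explicit(&g->tx_share_pending, memory_order_relaxed))
		return -EAGAIN; /* another bind is already in progress */

	atomic_store_explicit(&g->tx_share_pending, true,
			      memory_order_release);
	/* The kernel waits for pre-existing readers at this point; this
	 * single-threaded toy has none to wait for.
	 */
	if (g->drain_cont) {
		atomic_store_explicit(&g->tx_share_pending, false,
				      memory_order_release);
		return -EAGAIN;
	}
	return 0;
}

/* Bind path: the new socket was added (or bind failed later). */
static void toy_finish_share(struct toy_gate *g)
{
	/* Finishing only makes sense while the gate is armed. */
	assert(atomic_load_explicit(&g->tx_share_pending,
				    memory_order_relaxed));
	atomic_store_explicit(&g->tx_share_pending, false,
			      memory_order_release);
}
```

A failed prepare clears the gate itself, so Tx can resume; a successful
prepare keeps batching gated until toy_finish_share() is called.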

The reclaim-only descriptors must not be submitted to the completion
ring immediately when they follow real Tx descriptors in the same batch.
Drivers may complete only part of the Tx work returned by
xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
too early would also publish earlier real Tx descriptors that the driver
has not completed yet.

Track the number of driver-visible Tx descriptors that precede pending
reclaim descriptors. xsk_tx_completed() first advances through the real
Tx completions and submits the reclaim descriptors only after all earlier
Tx descriptors in the CQ address order have been completed. If a batch
contains only reclaim descriptors, complete them immediately because
there is no driver-visible Tx work in front of them.

This preserves CQ ordering while ensuring that every descriptor consumed
as part of an invalid multi-buffer packet is eventually returned to
userspace.
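
The completion ordering can be sketched with plain counters. The
branch structure follows the xsk_tx_completed() hunk in this patch;
struct toy_pool, its cq_written field, and the assertion are
assumptions added for the illustration.

```c
#include <assert.h>
#include <stdint.h>

struct toy_pool {
	uint32_t cq_written;          /* addresses written to the CQ area */
	uint32_t cq_producer;         /* entries visible to userspace */
	uint32_t tx_zc_pending_descs; /* real Tx descs ahead of reclaim */
	uint32_t reclaim_descs;       /* reclaim-only descs waiting */
};

static void toy_cq_publish(struct toy_pool *p, uint32_t n)
{
	p->cq_producer += n;
	/* Never publish more entries than were written. */
	assert(p->cq_producer <= p->cq_written);
}

/* Driver reports nb_entries completed Tx descriptors. */
static void toy_tx_completed(struct toy_pool *p, uint32_t nb_entries)
{
	if (p->reclaim_descs) {
		if (nb_entries < p->tx_zc_pending_descs) {
			/* Earlier Tx work is still outstanding: publish
			 * only what the driver completed.
			 */
			p->tx_zc_pending_descs -= nb_entries;
			toy_cq_publish(p, nb_entries);
			return;
		}

		/* All earlier Tx work is done: append the reclaim-only
		 * descriptors to this completion.
		 */
		p->tx_zc_pending_descs = 0;
		nb_entries += p->reclaim_descs;
		p->reclaim_descs = 0;
	}

	toy_cq_publish(p, nb_entries);
}
```

For a batch of four Tx descriptors followed by three reclaim-only
descriptors, two partial completions of two entries each publish two
entries first and the remaining five together.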

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xsk_buff_pool.h |  6 ++++
 net/xdp/xsk.c               | 62 +++++++++++++++++++++++++++++++---
 net/xdp/xsk_buff_pool.c     | 66 +++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 66 +++++++++++++++++++++++++++----------
 4 files changed, 177 insertions(+), 23 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..4e5abacfcbb7 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -78,9 +78,12 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u32 reclaim_descs;
+	u32 tx_zc_pending_descs;
 	u32 xdp_zc_max_segs;
 	u8 tx_metadata_len; /* inherited from umem */
 	u8 cached_need_wakeup;
+	bool tx_share_pending;
 	bool uses_need_wakeup;
 	bool unaligned;
 	bool tx_sw_csum;
@@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
 bool xp_put_pool(struct xsk_buff_pool *pool);
 void xp_clear_dev(struct xsk_buff_pool *pool);
 void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending);
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
 
 /* AF_XDP, and XDP core. */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 43791647cf18..2dda854c6590 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
 {
+	if (unlikely(pool->reclaim_descs)) {
+		if (nb_entries < pool->tx_zc_pending_descs) {
+			pool->tx_zc_pending_descs -= nb_entries;
+			xskq_prod_submit_n(pool->cq, nb_entries);
+			return;
+		}
+
+		pool->tx_zc_pending_descs = 0;
+		nb_entries += pool->reclaim_descs;
+		pool->reclaim_descs = 0;
+	}
+
 	xskq_prod_submit_n(pool->cq, nb_entries);
 }
 EXPORT_SYMBOL(xsk_tx_completed);
@@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
 
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 {
+	struct xsk_tx_batch batch = {};
 	struct xdp_sock *xs;
+	u32 cq_cached_prod;
 
 	rcu_read_lock();
+
+	/* Pairs with the release stores in xp_prepare_xsk_tx_share() and
+	 * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
+	 * to shared, do not enter the singular batched path.
+	 */
+	if (smp_load_acquire(&pool->tx_share_pending))
+		goto out;
+	if (unlikely(pool->reclaim_descs))
+		goto out;
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
@@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
-	if (!xs) {
-		nb_pkts = 0;
+	if (!xs)
 		goto out;
-	}
 
 	nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
 
@@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	if (!nb_pkts)
 		goto out;
 
-	nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
+	batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
+	nb_pkts = xsk_tx_batch_cq_descs(&batch);
 	if (!nb_pkts) {
 		xs->tx->queue_empty_descs++;
 		goto out;
 	}
 
 	__xskq_cons_release(xs->tx);
+	cq_cached_prod = pool->cq->cached_prod;
+
 	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
+
+	if (unlikely(batch.reclaim_descs)) {
+		u32 cq_pending_descs;
+
+		/* CQ is positional. Descriptors already written but not
+		 * submitted must complete before any reclaim-only descriptors
+		 * appended below.
+		 */
+		cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
+
+		pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
+		pool->reclaim_descs = batch.reclaim_descs;
+		if (unlikely(!pool->tx_zc_pending_descs))
+			xsk_tx_completed(pool, 0);
+	}
+
 	xs->sk.sk_write_space(&xs->sk);
 
 out:
 	rcu_read_unlock();
-	return nb_pkts;
+	return batch.tx_descs;
 }
 EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
 
@@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
+	bool tx_share_pending = false;
 	struct net_device *dev;
 	int bound_dev_if;
 	u32 flags, qid;
@@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 				goto out_unlock;
 			}
 
+			err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
+						      &tx_share_pending);
+			if (err) {
+				sockfd_put(sock);
+				goto out_unlock;
+			}
+
 			xp_get_pool(umem_xs->pool);
 			xs->pool = umem_xs->pool;
 
@@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 			if (xs->tx && !xs->pool->tx_descs) {
 				err = xp_alloc_tx_descs(xs->pool, xs);
 				if (err) {
+					if (tx_share_pending)
+						xp_finish_xsk_tx_share(xs->pool);
 					xp_put_pool(xs->pool);
 					xs->pool = NULL;
 					sockfd_put(sock);
@@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
 	xs->queue_id = qid;
 	xp_add_xsk(xs->pool, xs);
+	if (tx_share_pending)
+		xp_finish_xsk_tx_share(xs->pool);
 
 	if (qid < dev->real_num_rx_queues) {
 		struct netdev_rx_queue *rxq;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f28a9641571..6fa732a843a9 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 	spin_unlock(&pool->xsk_tx_list_lock);
 }
 
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending)
+{
+	struct xdp_sock *tmp;
+	int err = 0;
+
+	*pending = false;
+	if (!xs->tx)
+		return 0;
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	if (!list_is_singular(&pool->xsk_tx_list)) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return 0;
+	}
+
+	if (pool->tx_share_pending) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return -EAGAIN;
+	}
+
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Stop new singular batched Tx readers before synchronize_net()
+	 * waits for readers that may already have observed a singular list.
+	 */
+	smp_store_release(&pool->tx_share_pending, true);
+	*pending = true;
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	/* A batch that observed a singular Tx socket list before the gate was
+	 * armed may set drain_cont. Wait for all such readers before checking
+	 * whether the pool can safely become shared.
+	 */
+	synchronize_net();
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
+		if (READ_ONCE(tmp->drain_cont)) {
+			err = -EAGAIN;
+			break;
+		}
+	}
+
+	if (err) {
+		/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+		 * No socket was added; clear the gate so Tx can resume.
+		 */
+		smp_store_release(&pool->tx_share_pending, false);
+		*pending = false;
+	}
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	return err;
+}
+
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
+{
+	spin_lock(&pool->xsk_tx_list_lock);
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Publish the preceding xp_add_xsk() list update before allowing Tx
+	 * to observe that the share transition has finished.
+	 */
+	smp_store_release(&pool->tx_share_pending, false);
+	spin_unlock(&pool->xsk_tx_list_lock);
+}
+
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 {
 	if (!xs->tx)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3e3fbb73d23e..99fa62e0d337 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -58,6 +58,16 @@ struct parsed_desc {
 	u32 valid;
 };
 
+struct xsk_tx_batch {
+	u32 tx_descs;
+	u32 reclaim_descs;
+};
+
+static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
+{
+	return batch->tx_descs + batch->reclaim_descs;
+}
+
 /* The structure of the shared state of the rings are a simple
  * circular buffer, as outlined in
  * Documentation/core-api/circular-buffers.rst. For the Rx and
@@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
 	parsed->mb = xp_mb_desc(desc);
 }
 
-static inline
-u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
-			      u32 max)
+static inline struct xsk_tx_batch
+xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
+			  u32 max)
 {
-	u32 cached_cons = q->cached_cons, nb_entries = 0;
 	struct xdp_desc *descs = pool->tx_descs;
-	u32 total_descs = 0, nr_frags = 0;
+	bool drain = READ_ONCE(xs->drain_cont);
+	u32 cached_cons, nb_entries = 0;
+	struct xsk_tx_batch batch = {};
+	struct xsk_queue *q = xs->tx;
+	u32 nr_frags = 0;
+
+	cached_cons = q->cached_cons;
 
-	/* track first entry, if stumble upon *any* invalid descriptor, rewind
-	 * current packet that consists of frags and stop the processing
-	 */
 	while (cached_cons != q->cached_prod && nb_entries < max) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 		u32 idx = cached_cons & q->ring_mask;
@@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 		descs[nb_entries] = ring->desc[idx];
 		cached_cons++;
 		parse_desc(q, pool, &descs[nb_entries], &parsed);
-		if (unlikely(!parsed.valid))
-			break;
+		if (unlikely(!parsed.valid)) {
+			if (!drain && !nr_frags && !parsed.mb)
+				break;
+
+			drain = true;
+		}
+
+		nr_frags++;
+		nb_entries++;
 
 		if (likely(!parsed.mb)) {
-			total_descs += (nr_frags + 1);
-			nr_frags = 0;
-		} else {
-			nr_frags++;
-			if (nr_frags == pool->xdp_zc_max_segs) {
+			if (unlikely(drain)) {
+				batch.reclaim_descs = nr_frags;
+				WRITE_ONCE(xs->drain_cont, false);
 				nr_frags = 0;
 				break;
 			}
+
+			batch.tx_descs += nr_frags;
+			nr_frags = 0;
+			continue;
 		}
-		nb_entries++;
+
+		if (nr_frags == pool->xdp_zc_max_segs)
+			drain = true;
 	}
 
-	cached_cons -= nr_frags;
+	if (nr_frags) {
+		if (drain) {
+			batch.reclaim_descs = nr_frags;
+			WRITE_ONCE(xs->drain_cont, true);
+		} else {
+			cached_cons -= nr_frags;
+		}
+	}
 	/* Release valid plus any invalid entries */
 	xskq_cons_release_n(q, cached_cons - q->cached_cons);
-	return total_descs;
+	return batch;
 }
 
 /* Functions for consumers */
-- 
2.43.0



* [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
                   ` (4 preceding siblings ...)
  2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
  2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

The too-many-frags test describes a packet that is valid from the Tx
ring ownership point of view, but invalid for transmission because it
exceeds the supported number of fragments.

Keep the generated Tx descriptors valid so that __send_pkts() accounts
them as outstanding descriptors that must be reclaimed through the CQ.
Then mark the corresponding Rx packet invalid so the test still does
not expect the oversized packet to appear on the receive side.

Add a valid synchronization packet after the oversized packet so the
test can verify that the Tx path drains the bad packet and resumes at
the next packet boundary.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 20 +++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 72875071d4f1..de1e63c3fdf6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2258,7 +2258,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(2 * max_frags + 3, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2279,21 +2279,29 @@ int testapp_too_many_frags(struct test_spec *test)
 	/* An invalid packet with the max amount of frags but signals packet
 	 * continues on the last frag
 	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
 		pkts[i].len = MIN_PKT_SIZE;
 		pkts[i].options = XDP_PKT_CONTD;
-		pkts[i].valid = false;
+		pkts[i].valid = true;
 	}
+	pkts[2 * max_frags + 1].options = 0;
 
 	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[2 * max_frags + 2].len = MIN_PKT_SIZE;
+	pkts[2 * max_frags + 2].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 3)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}
 
+	/* The generated Tx stream must keep the too-big packet valid so that
+	 * __send_pkts() accounts its descriptors in outstanding_tx. The Rx
+	 * stream, however, must not expect this packet on the wire.
+	 */
+	test->ifobj_rx->xsk->pkt_stream->pkts[2].valid = false;
+	test->ifobj_rx->xsk->pkt_stream->nb_valid_entries--;
+
 	ret = testapp_validate_traffic(test);
 	free(pkts);
 	return ret;
-- 
2.43.0



* [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
                   ` (5 preceding siblings ...)
  2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
@ 2026-06-23 13:32 ` Maciej Fijalkowski
  2026-06-24 13:33   ` sashiko-bot
  2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
  7 siblings, 1 reply; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

Invalid descriptors in the middle of a multi-buffer packet still belong
to the packet being consumed from the Tx ring. The tests should therefore
count the whole invalid packet as outstanding in verbatim mode, even
though the packet must not be expected on the Rx side.

Make fragment counting follow the packet boundary instead of stopping at
the first invalid fragment. Update custom stream generation so invalid
middle fragments terminate the generated Rx packet while Tx accounting
still covers all descriptors consumed from the invalid multi-buffer
packet.

Also add explicit end fragments after invalid middle descriptors. This
exercises the kernel drain logic and verifies that subsequent valid
packets are not interpreted as continuations of the invalid packet.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 24 ++++++++++++-------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index de1e63c3fdf6..d8a1c0d40e5a 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -433,14 +433,14 @@ static u32 pkt_nb_frags(u32 frame_size, struct pkt_stream *pkt_stream, struct pk
 	}
 
 	/* Search for the end of the packet in verbatim mode */
-	if (!pkt_continues(pkt->options) || !pkt->valid)
+	if (!pkt_continues(pkt->options))
 		return nb_frags;
 
 	next_frag = pkt_stream->current_pkt_nb;
 	pkt++;
 	while (next_frag++ < pkt_stream->nb_pkts) {
 		nb_frags++;
-		if (!pkt_continues(pkt->options) || !pkt->valid)
+		if (!pkt_continues(pkt->options))
 			break;
 		pkt++;
 	}
@@ -671,11 +671,11 @@ static struct pkt_stream *__pkt_stream_generate_custom(struct ifobject *ifobj, s
 			if (!frame->valid || !pkt_continues(frame->options))
 				payload++;
 		} else {
-			if (frame->valid)
+			if (frame->valid) {
 				len += frame->len;
-			if (frame->valid && pkt_continues(frame->options))
-				continue;
-
+				if (pkt_continues(frame->options))
+					continue;
+			}
 			pkt->pkt_nb = pkt_nb;
 			pkt->len = len;
 			pkt->valid = frame->valid;
@@ -1214,6 +1214,7 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	for (i = 0; i < xsk->batch_size; i++) {
 		struct pkt *pkt = pkt_stream_get_next_tx_pkt(pkt_stream);
 		u32 nb_frags_left, nb_frags, bytes_written = 0;
+		struct pkt *first_pkt = pkt;
 
 		if (!pkt)
 			break;
@@ -1258,6 +1259,8 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 		if (pkt && pkt->valid) {
 			valid_pkts++;
 			valid_frags += nb_frags;
+		} else if (pkt_stream->verbatim && pkt_continues(first_pkt->options)) {
+			valid_frags += nb_frags;
 		}
 	}
 
@@ -2104,13 +2107,16 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		{0, 0, 0, false, 0},
 		/* Invalid address in the second frame */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_sz, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{umem_sz * 2, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Invalid len in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Invalid options in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Transmit 2 frags, receive 3 */
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, XDP_PKT_CONTD},
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, 0},
@@ -2122,8 +2128,8 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 
 	if (umem->unaligned_mode) {
 		/* Crossing a chunk boundary allowed */
-		pkts[12].valid = true;
-		pkts[13].valid = true;
+		pkts[15].valid = true;
+		pkts[16].valid = true;
 	}
 
 	test->mtu = MAX_ETH_JUMBO_SIZE;
-- 
2.43.0



* Re: [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
  2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Heap Buffer Overflow (OOB Write) in `xsk_tx_peek_release_fallback()` due to array size mismatch in shared UMEM configuration.
--

commit 6ddf04bf3e0869241f35a9093870aefb3a5e86f2
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Subject: xsk: reclaim invalid multi-buffer Tx descs in ZC path

This commit addresses an issue in the zero-copy Tx batching path where
multi-buffer packets with an invalid descriptor could leave earlier
descriptors consumed but unreturned to userspace. It handles invalid
multi-buffer packets as a unit, queuing reclaim-only descriptors to the
completion ring, and gates the batched path while a same-pool bind waits
for pre-existing readers.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
>  
>  u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
>  {
> +	struct xsk_tx_batch batch = {};
>  	struct xdp_sock *xs;
> +	u32 cq_cached_prod;
>  
>  	rcu_read_lock();
> +
> +	/* Pairs with the release stores in xp_prepare_xsk_tx_share() and
> +	 * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
> +	 * to shared, do not enter the singular batched path.
> +	 */
> +	if (smp_load_acquire(&pool->tx_share_pending))
> +		goto out;
> +	if (unlikely(pool->reclaim_descs))
> +		goto out;
>  	if (!list_is_singular(&pool->xsk_tx_list)) {
>  		/* Fallback to the non-batched version */
>  		rcu_read_unlock();

[Severity: High]
This is a pre-existing issue, but can the fallback path overwrite the
pool->tx_descs array?

In xsk_bind(), when a shared pool is created, pool->tx_descs is allocated
using xp_alloc_tx_descs() based on the first socket's nentries. If the first
socket is configured with a very small TX ring (for example, 2 entries),
pool->tx_descs is sized to match.

If a second socket with a larger TX ring is then bound to the same shared
pool, pool->tx_descs is not reallocated to accommodate the larger capacity.

When the batched path sees multiple sockets and falls back to the non-batched
version:

static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entries)
{
	struct xdp_desc *descs = pool->tx_descs;
	u32 nb_pkts = 0;

	while (nb_pkts < max_entries && xsk_tx_peek_desc(pool, &descs[nb_pkts]))
		nb_pkts++;
    ...
}

Does this loop write up to max_entries (the driver's budget, typically 64)
into the descs array, causing an out-of-bounds write past the initial small
allocation limit?
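
The concern can be modeled in plain userspace C. This is only a sketch under
stated assumptions: `toy_pool`, `toy_desc`, `toy_peek_desc()` and
`toy_fallback()` are hypothetical stand-ins and not kernel API, and clamping
the budget to the array capacity is just one conceivable mitigation, not
something this series or the kernel is known to do.

```c
#include <stdint.h>

/* Hypothetical stand-in for a Tx descriptor; not the kernel's struct xdp_desc. */
struct toy_desc {
	uint64_t addr;
	uint32_t len;
};

/* Hypothetical pool: tx_descs has room for tx_descs_cap entries only. */
struct toy_pool {
	struct toy_desc *tx_descs;
	uint32_t tx_descs_cap;	/* sized from the first socket's Tx ring */
	uint32_t pending;	/* descriptors still waiting in the Tx rings */
};

/* Pretend to peek one descriptor; returns 0 when nothing is pending. */
static int toy_peek_desc(struct toy_pool *pool, struct toy_desc *desc)
{
	if (!pool->pending)
		return 0;
	pool->pending--;
	desc->addr = 0;
	desc->len = 64;
	return 1;
}

/* Fallback loop with the caller's budget clamped to the array capacity. */
static uint32_t toy_fallback(struct toy_pool *pool, uint32_t max_entries)
{
	uint32_t nb_pkts = 0;

	if (max_entries > pool->tx_descs_cap)
		max_entries = pool->tx_descs_cap;

	while (nb_pkts < max_entries &&
	       toy_peek_desc(pool, &pool->tx_descs[nb_pkts]))
		nb_pkts++;

	return nb_pkts;
}
```

Without the clamp, a budget of 64 against a two-entry array would index past
the end of `tx_descs` as soon as a third descriptor is pending.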

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=5


* Re: [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
  2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] The `invalid_descs` statistic is repeatedly incremented for the same invalid descriptor if the Completion Queue (CQ) is full.

Pre-existing issues:
- [High] A malformed descriptor (e.g., invalid TX metadata) permanently stalls the TX ring because it is never consumed.
--

commit b076f5f0d28ebb521e9e57462363bc0f2250e923
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()

When the TX loop in __xsk_generic_xmit() encounters an invalid descriptor
mid-packet, the partial skb is dropped and the offending descriptor is
released. This commit ensures remaining continuation descriptors belonging
to the same multi-buffer packet are correctly drained on the next iteration
instead of being treated as a brand-new packet.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index e80c035a7af54..c489fadc36087 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[Severity: High]
This is a pre-existing issue, but does encountering a non-transient error
during skb building permanently stall the TX ring?

In __xsk_generic_xmit(), if xsk_build_skb() returns a non-transient error
like -EINVAL (for example, from an invalid XDP_TX_METADATA payload), it
intentionally skips releasing the descriptor:

net/xdp/xsk.c:xsk_build_skb() {
    ...
    } else {
        /* Let application retry */
        xsk_cq_cancel_locked(xs->pool, 1);
    }
}

Since the error is not -EOVERFLOW, __xsk_generic_xmit() jumps to out
without releasing the descriptor:

net/xdp/xsk.c:__xsk_generic_xmit() {
    ...
    skb = xsk_build_skb(xs, &desc);
    if (IS_ERR(skb)) {
        err = PTR_ERR(skb);
        if (err != -EOVERFLOW)
            goto out;
    ...
}

Because the descriptor remains at the head of the TX ring and the error
is non-transient, will every subsequent call peek the same descriptor,
fail with -EINVAL, and exit without ever advancing?

[ ... ]

> @@ -1128,6 +1128,7 @@ static int __xsk_generic_xmit(struct sock *sk)
>  		if (xs->skb)
>  			xsk_drop_skb(xs->skb);
>  		xskq_cons_release(xs->tx);
> +		xs->drain_cont = xp_mb_desc(&desc);
>  	}
>  
>  out:

[Severity: Medium]
If the completion queue is full, does this post-loop cleanup jump to out
without consuming the invalid descriptor from the TX ring?

Looking earlier in this same xskq_has_descs() block in __xsk_generic_xmit(),
if reclaim_desc is true, it attempts to reserve a CQ slot:

net/xdp/xsk.c:__xsk_generic_xmit() {
    ...
    if (xskq_has_descs(xs->tx)) {
        bool reclaim_desc = xs->skb || xs->drain_cont;

        if (reclaim_desc) {
            err = xsk_cq_reserve_locked(xs->pool);
            if (err) {
                err = -EAGAIN;
                goto out;
            }
        }
    ...
}

If xsk_cq_reserve_locked() returns -EAGAIN, we jump to out before reaching
the xskq_cons_release() call in the diff above.

Would the next invocation (via poll or sendmsg) then peek the exact same
unconsumed descriptor?

Since xskq_cons_is_valid_desc() increments invalid_descs every time a
malformed descriptor is peeked:

net/xdp/xsk_queue.h:xskq_cons_is_valid_desc() {
    ...
    if (!xp_validate_desc(pool, d)) {
        q->invalid_descs++;
        return false;
    }
    ...
}

Could this arbitrarily inflate the invalid_descs statistic while the
CQ remains full?
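
A small userspace model makes the retry behaviour easier to see. It is a
sketch only: `toy_txq`, `toy_peek()` and `toy_xmit_once()` are hypothetical
names, and the model assumes the socket is already mid-packet so that a CQ
slot is required before the invalid descriptor can be released.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Tx queue state; not the kernel's struct xsk_queue. */
struct toy_txq {
	bool head_is_invalid;	/* descriptor at the ring head fails validation */
	uint64_t invalid_descs;	/* statistic bumped on every failed peek */
	uint32_t released;	/* descriptors consumed from the ring */
};

/* Peeking validates the head descriptor and counts it when invalid. */
static bool toy_peek(struct toy_txq *q)
{
	if (q->head_is_invalid) {
		q->invalid_descs++;
		return false;
	}
	return true;
}

/* One xmit attempt for a socket that needs a CQ slot to reclaim. */
static int toy_xmit_once(struct toy_txq *q, bool cq_full)
{
	if (toy_peek(q))
		return 0;	/* valid descriptor: nothing to model here */
	if (cq_full)
		return -EAGAIN;	/* descriptor stays at the head of the ring */
	q->released++;
	q->head_is_invalid = false;
	return 0;
}
```

In this model, three attempts against a full CQ followed by one successful
attempt leave `invalid_descs` at 4 for a single released descriptor.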

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=3


* Re: [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb()
  2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The post-loop logic in `__xsk_generic_xmit()` leaves the invalid descriptor in the TX ring if the Completion Queue (CQ) is full, so the `invalid_descs` statistic is incremented again on every retry, without bound.
--

commit c4ba06aa2b6f82804cdabd955a9ab94cfe185d39
Author: Jason Xing <kerneljasonxing@gmail.com>

xsk: drain continuation descs after overflow in xsk_build_skb()

This patch adds a drain_cont flag to xdp_sock to handle multi-buffer packets
that exceed the maximum allowed fragments. It ensures the remaining
descriptors of a dropped packet are drained instead of being misinterpreted
as the start of new packets.

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c

[ ... ]

> @@ -1063,11 +1076,22 @@ static int __xsk_generic_xmit(struct sock *sk)
>  			goto out;
>  		}
>  
> +		if (unlikely(xs->drain_cont)) {
> +			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
> +
> +			xs->tx->invalid_descs++;
> +			xskq_cons_release(xs->tx);
> +			xs->drain_cont = xp_mb_desc(&desc);
> +			continue;
> +		}
> +
>  		skb = xsk_build_skb(xs, &desc);
>  		if (IS_ERR(skb)) {
>  			err = PTR_ERR(skb);
>  			if (err != -EOVERFLOW)
>  				goto out;
> +			if (xp_mb_desc(&desc))
> +				xs->drain_cont = true;
>  			err = 0;
>  			continue;
>  		}

[Severity: Medium]
When an invalid descriptor belonging to a multi-buffer packet is encountered,
xskq_cons_peek_desc() increments invalid_descs and terminates the while loop.

After the loop exits, if xsk_cq_reserve_locked() fails because the completion
queue is full, __xsk_generic_xmit() goes to out without calling
xskq_cons_release(). Does this leave the invalid descriptor in the TX ring?

On subsequent retries, xskq_cons_peek_desc() will evaluate the exact same
descriptor again, which increments invalid_descs repeatedly.

There were prior concerns raised about this code path, but some of those
appear to be incorrect:

It was suggested this causes a livelock, but returning -EAGAIN correctly
yields back to user-space without infinite busy-looping inside the kernel.

It was also suggested this exposes garbage addresses. However, returning the
invalid descriptor's address to the completion queue is an intentional
design choice to prevent user-space buffer leaks, as established in prior
commits.

Finally, there was a concern about corrupting drain_cont. Using
xp_mb_desc() on an invalid descriptor is intentional to maintain
multi-buffer framing. If a descriptor is invalid but has the continuation
flag set, subsequent valid descriptors belong to the same dropped packet
and must be drained.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=2


* Re: [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx
  2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Incomplete reclaim of the first invalid descriptor in a multi-buffer packet, leaking a completion.
- [Medium] Artificial inflation of the tx_invalid_descs counter when the completion queue is full.
--

commit 94ab243e4d16baca076b39e394196d66b2624dbb
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

xsk: reclaim offending invalid desc in generic multi-buffer Tx

The patch aims to reclaim invalid descriptors found in __xsk_generic_xmit()
by adding them to the completion ring before releasing them from the Tx ring.
This targets failed descriptors belonging to already-started or
already-draining multi-buffer packets.

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index c489fadc36087..43791647cf182 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -1125,8 +1125,22 @@ static int __xsk_generic_xmit(struct sock *sk)
>  	}
>  
>  	if (xskq_has_descs(xs->tx)) {
> +		bool reclaim_desc = xs->skb || xs->drain_cont;

[Severity: High]
Does this logic correctly handle the first invalid descriptor of a
multi-buffer packet?

If the very first descriptor of a multi-buffer packet is invalid, both
xs->skb and xs->drain_cont will be false, meaning reclaim_desc evaluates
to false. The invalid descriptor will then be dropped without submitting
an address to the completion queue.

However, xs->drain_cont is subsequently set to true below because
xp_mb_desc(&desc) evaluates to true. Subsequent valid descriptors of the
same packet will then be processed in the drain path and have completions
generated, leaving userspace with N-1 completions for an N-fragment packet.

> +
> +		if (reclaim_desc) {
> +			err = xsk_cq_reserve_locked(xs->pool);
> +			if (err) {
> +				err = -EAGAIN;
> +				goto out;

[Severity: Medium]
Could returning -EAGAIN here artificially inflate the tx_invalid_descs
statistic?

If xsk_cq_reserve_locked() fails, the function returns -EAGAIN without
consuming the descriptor since xskq_cons_release(xs->tx) is skipped.

On subsequent retries, when xskq_cons_peek_desc() evaluates this exact same
invalid descriptor, it unconditionally increments q->invalid_descs (in
xskq_cons_is_valid_desc() in net/xdp/xsk_queue.h). Since the descriptor
is never consumed when the completion queue is full, the counter will be
incremented repeatedly on each retry.

> +			}
> +		}
> +
>  		if (xs->skb)
>  			xsk_drop_skb(xs->skb);
> +
> +		if (reclaim_desc)
> +			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
> +
>  		xskq_cons_release(xs->tx);
>  		xs->drain_cont = xp_mb_desc(&desc);
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=4


* Re: [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors
  2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Buffer leak in the generic (SKB) TX path when the first fragment of an invalid AF_XDP multi-buffer packet is rejected.
--

commit aba0a0e03b19503a4e6b7748da982e4082aebe45
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

selftests/xsk: account invalid multi-buffer Tx descriptors

This commit updates the af_xdp selftests to correctly account for invalid
middle fragments in multi-buffer tx packets. It ensures the whole invalid
packet is counted as outstanding in verbatim mode and appends explicit end
fragments to verify the kernel's drain logic.

> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> index de1e63c3fdf63..d8a1c0d40e5af 100644
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
[ ... ]
> @@ -2104,13 +2107,16 @@ int testapp_invalid_desc_mb(struct test_spec *test)
>  		{0, 0, 0, false, 0},
>  		/* Invalid address in the second frame */
>  		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
> -		{umem_sz, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
> +		{umem_sz * 2, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
> +		{0, MIN_PKT_SIZE, 0, false, 0},

[Severity: High]
This isn't a regression introduced by this patch, but testing invalid
multi-buffer packets like this appears to expose a pre-existing buffer leak
in the generic (skb) tx path.

If an application sends a multi-buffer packet where the first fragment is
invalid (e.g. an out-of-bounds address), does __xsk_generic_xmit() in
net/xdp/xsk.c permanently leak the buffer?

In net/xdp/xsk.c:__xsk_generic_xmit():

        if (xskq_has_descs(xs->tx)) {
                bool reclaim_desc = xs->skb || xs->drain_cont;
                ...
                if (reclaim_desc)
                        xsk_cq_submit_addr_single_locked(xs->pool, &desc);

                xskq_cons_release(xs->tx);
                xs->drain_cont = xp_mb_desc(&desc);
        }

When the first fragment is invalid, xs->skb is null and xs->drain_cont is
false, making reclaim_desc evaluate to false.

The descriptor is then consumed from the tx ring via xskq_cons_release(), but
since it is never submitted to the completion queue, the buffer seems to be
permanently leaked, potentially leading to umem exhaustion.
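
The accounting gap can be reproduced with a userspace model of the quoted
condition. This is a sketch under assumptions: `toy_desc` and
`toy_completions()` are hypothetical, the walk covers a single snapshot of
the Tx ring, and it assumes that fragments held by a dropped partial skb and
fragments of a transmitted packet each yield one completion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor view: only the two properties the model needs. */
struct toy_desc {
	bool valid;	/* passes descriptor validation */
	bool contd;	/* continuation flag: more frags follow */
};

/* Count completions produced while walking one Tx ring snapshot. */
static size_t toy_completions(const struct toy_desc *d, size_t n)
{
	size_t completions = 0;
	size_t skb_frags = 0;		/* frags held by the partial skb */
	bool drain_cont = false;	/* rest of current packet is dropped */

	for (size_t i = 0; i < n; i++) {
		if (!d[i].valid) {
			/* Mirrors: reclaim_desc = xs->skb || xs->drain_cont */
			bool reclaim_desc = skb_frags > 0 || drain_cont;

			completions += skb_frags;	/* dropped partial skb */
			skb_frags = 0;
			if (reclaim_desc)
				completions++;		/* offending descriptor */
			drain_cont = d[i].contd;
		} else if (drain_cont) {
			completions++;			/* drained continuation */
			drain_cont = d[i].contd;
		} else {
			skb_frags++;
			if (!d[i].contd) {
				completions += skb_frags;	/* packet sent */
				skb_frags = 0;
			}
		}
	}
	return completions;
}
```

For a three-fragment packet the model returns 3 when the invalid descriptor
sits in the middle, but only 2 when it is the first fragment.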

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=7


* Re: [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test
  2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
@ 2026-06-24 13:33   ` sashiko-bot
  0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-24 13:33 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] The comment describing the invalid packet construction contradicts the new code logic.
--

commit 55a7ead387fcdab87dd100f7febd5012bd365ab8
Author: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

selftests/xsk: fix too-many-frags multi-buffer Tx test

This patch updates the too-many-frags test to keep the oversized packet's
Tx descriptors valid for accounting purposes while ignoring the packet on
the receive side. It also appends a synchronization packet to verify
proper queue draining.

> diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> --- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
> +++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
[ ... ]
> @@ -2279,21 +2279,29 @@ int testapp_too_many_frags(struct test_spec *test)
>  	/* An invalid packet with the max amount of frags but signals packet
>  	 * continues on the last frag
>  	 */

[Severity: Low]
Does this comment need to be updated to reflect the new code logic?

The comment explicitly states the packet has the max amount of frags and
signals continuation on the last fragment. However, the modified loop now
appends an extra fragment (resulting in max_frags + 1 total fragments)
and explicitly clears the XDP_PKT_CONTD flag on that final fragment.

> -	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
> +	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
>  		pkts[i].len = MIN_PKT_SIZE;
>  		pkts[i].options = XDP_PKT_CONTD;
> -		pkts[i].valid = false;
> +		pkts[i].valid = true;
>  	}
> +	pkts[2 * max_frags + 1].options = 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623133240.1048434-1-maciej.fijalkowski@intel.com?part=6


* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
  2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
                   ` (6 preceding siblings ...)
  2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
@ 2026-06-24 15:38 ` Stanislav Fomichev
  2026-06-24 16:37   ` Maciej Fijalkowski
  7 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2026-06-24 15:38 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn

On 06/23, Maciej Fijalkowski wrote:
> Hi,
> 
> This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> consumed from the Tx ring are not consistently returned to userspace
> through the completion ring when the packet is later dropped as invalid.
> 
> The affected cases are invalid or oversized multi-buffer Tx packets in
> both the generic and zero-copy paths. In these cases, the kernel can
> consume one or more Tx descriptors while building or validating a
> multi-buffer packet, then drop the packet before it reaches the device.
> Userspace regains ownership of the UMEM buffers only after the
> corresponding addresses are returned through the CQ. Missing completions
> therefore make userspace lose track of those buffers.
> 
> The generic path fixes cover three related cases:
> * partially built multi-buffer skbs dropped by xsk_drop_skb();
> * continuation descriptors left in the Tx ring after xsk_build_skb()
>   reports overflow;
> * invalid descriptors encountered in the middle of a multi-buffer
>   packet, including the offending invalid descriptor itself.
> 
> The zero-copy path is handled separately. The batched Tx parser now
> distinguishes descriptors that can be passed to the driver from
> descriptors that are consumed only because they belong to an invalid
> multi-buffer packet. Reclaim-only descriptors are written to the CQ
> address area and published in completion order, after any earlier
> driver-visible Tx descriptors.
> 
> The ZC batching path can also retain drain state when userspace has not
> yet provided the end of an invalid multi-buffer packet. To keep this
> state local to the singular batched path, the series prevents a second
> Tx socket from joining the same pool while such drain state exists.
> During the singular-to-shared transition, Tx batching is gated,
> pre-existing readers are waited out, and bind fails with -EAGAIN if the
> existing socket still has pending drain state. This avoids adding
> multi-buffer drain handling to the shared-UMEM fallback path.
> 
> The last two patches update xskxceiver so the tests account invalid
> multi-buffer Tx packets as descriptors that must be reclaimed, while
> still not expecting those invalid packets on the Rx side.
> 
> This is a follow-up to Jason's changes [0], which addressed generic xmit
> only. With this set I can pass the full xskxceiver test suite against the
> ice driver.

There is a fair amount of feedback from sashiko already :-( So the meta
question from me is: is it time to scrap our current approach where
we parse descriptor by descriptor? (and maintain half-baked skb and
half-consumed descriptor queues)

Should we:

1. fill desc[MAX_SKB_FRAGS] via xskq_cons_peek_desc() until XDP_PKT_CONTD
is exhausted (if the last descriptor still has XDP_PKT_CONTD set, return
EOVERFLOW to userspace and do a full stop here)
2. now that we really know the number of valid descriptors -> reserve
the cq space (if not -> EAGAIN)
3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
locally, don't ever create semi-initialized skb)
4. construct the skb
5. xmit

If at any point there is an issue, the cleanup is straightforward. That
whole xs->skb goes away, no state between syscalls. Thoughts?
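
Steps 1 and 2 of this proposal could look roughly like the userspace sketch
below. Everything in it is an assumption made for illustration: `toy_ring`,
`toy_desc`, `toy_xmit_one()` and `TOY_MAX_FRAGS` are hypothetical, returning
-EAGAIN for a packet whose end has not been posted yet is a guess, and the
allocate, build and xmit steps are elided.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FRAGS 4	/* stand-in for the real fragment limit */

struct toy_desc {
	bool contd;	/* continuation flag: more frags follow */
};

struct toy_ring {
	const struct toy_desc *descs;
	size_t nb;	/* descriptors available in the Tx ring */
	size_t cons;	/* consumer index */
	size_t cq_free;	/* free completion-queue slots */
};

/*
 * Gather one whole packet before touching any state. Returns the number
 * of descriptors consumed, or a negative errno with the ring untouched.
 */
static int toy_xmit_one(struct toy_ring *r)
{
	size_t n = 0;

	/* Step 1: peek, without consuming, up to the end-of-packet desc. */
	for (;;) {
		if (n == TOY_MAX_FRAGS)
			return -EOVERFLOW;	/* still continuing at the limit */
		if (r->cons + n >= r->nb)
			return -EAGAIN;		/* assumption: packet incomplete */
		if (!r->descs[r->cons + n++].contd)
			break;
	}

	/* Step 2: the descriptor count is known, so reserve CQ space now. */
	if (r->cq_free < n)
		return -EAGAIN;
	r->cq_free -= n;

	/* Steps 3-5 (allocate, build, xmit) elided; consume on success. */
	r->cons += n;
	return (int)n;
}
```

Because nothing is consumed until the whole packet has been seen and CQ space
is reserved, every error return leaves the ring and the CQ exactly as they
were, which is the "no state between syscalls" property described above.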


* Re: [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
  2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
@ 2026-06-24 16:37   ` Maciej Fijalkowski
  0 siblings, 0 replies; 16+ messages in thread
From: Maciej Fijalkowski @ 2026-06-24 16:37 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn

On Wed, Jun 24, 2026 at 08:38:20AM -0700, Stanislav Fomichev wrote:
> On 06/23, Maciej Fijalkowski wrote:
> > Hi,
> > 
> > This series fixes several AF_XDP multi-buffer Tx paths where descriptors
> > consumed from the Tx ring are not consistently returned to userspace
> > through the completion ring when the packet is later dropped as invalid.
> > 
> > The affected cases are invalid or oversized multi-buffer Tx packets in
> > both the generic and zero-copy paths. In these cases, the kernel can
> > consume one or more Tx descriptors while building or validating a
> > multi-buffer packet, then drop the packet before it reaches the device.
> > Userspace regains ownership of the UMEM buffers only after the
> > corresponding addresses are returned through the CQ. Missing completions
> > therefore make userspace lose track of those buffers.
> > 
> > The generic path fixes cover three related cases:
> > * partially built multi-buffer skbs dropped by xsk_drop_skb();
> > * continuation descriptors left in the Tx ring after xsk_build_skb()
> >   reports overflow;
> > * invalid descriptors encountered in the middle of a multi-buffer
> >   packet, including the offending invalid descriptor itself.
> > 
> > The zero-copy path is handled separately. The batched Tx parser now
> > distinguishes descriptors that can be passed to the driver from
> > descriptors that are consumed only because they belong to an invalid
> > multi-buffer packet. Reclaim-only descriptors are written to the CQ
> > address area and published in completion order, after any earlier
> > driver-visible Tx descriptors.
> > 
> > The ZC batching path can also retain drain state when userspace has not
> > yet provided the end of an invalid multi-buffer packet. To keep this
> > state local to the singular batched path, the series prevents a second
> > Tx socket from joining the same pool while such drain state exists.
> > During the singular-to-shared transition, Tx batching is gated,
> > pre-existing readers are waited out, and bind fails with -EAGAIN if the
> > existing socket still has pending drain state. This avoids adding
> > multi-buffer drain handling to the shared-UMEM fallback path.
> > 
> > The last two patches update xskxceiver so the tests account invalid
> > multi-buffer Tx packets as descriptors that must be reclaimed, while
> > still not expecting those invalid packets on the Rx side.
> > 
> > This is a follow-up to Jason's changes [0] which were addressing generic
> > xmit only and this set allows me to pass full xskxceiver test suite run
> > against ice driver.
> 
> There is a fair amount of feedback from sashiko already :-( So the meta
> question from me is: is it time to scrap our current approach where
> we parse descriptor by descriptor? (and maintain half-baked skb and
> half-consumed descriptor queues)
> 
> Should we:
> 
> 1. do desc[MAX_SKB_FRAGS] and xskq_cons_peek_desc until we exhaust
> PKT_CONT (if the last packet has PKT_CONT, return EOVERFLOW to userspace
> and do a full stop here)
> 2. now that we really know the number of valid descriptors -> reserve
> the cq space (if not -> EAGAIN)
> 3. pre-allocate everything here (if at any point we have ENOMEM -> cleanup
> locally, don't ever create semi-initialized skb)
> 4. construct the skb
> 5. xmit

Yeah, generic xmit has become utterly horrible. I haven't gone through the
sashiko reviews yet, but bear in mind that this set also aligns the ZC side
with what Jason previously addressed for generic xmit.

I believe the plan was to get these fixes into net first. Jason has an
implementation of batching for generic xmit targeted at -next, and that's
where we could rework the current flow.

> 
> If at any point there is an issue, the cleanup is straightforward. That
> whole xk->skb goes away, no state between syscalls. Thoughts?

Thread overview: 16+ messages
2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
2026-06-24 13:33   ` sashiko-bot
2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
2026-06-24 16:37   ` Maciej Fijalkowski
