Netdev List
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, magnus.karlsson@intel.com,
	stfomichev@gmail.com, kuba@kernel.org, pabeni@redhat.com,
	horms@kernel.org, kerneljasonxing@gmail.com, bjorn@kernel.org,
	Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Subject: [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
Date: Tue, 23 Jun 2026 15:32:38 +0200	[thread overview]
Message-ID: <20260623133240.1048434-6-maciej.fijalkowski@intel.com> (raw)
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

Currently, the zero-copy Tx batching path stops when it encounters an
invalid descriptor. For multi-buffer packets this can leave descriptors
consumed from the Tx ring without returning their buffers to userspace
through the completion ring.

Handle invalid multi-buffer packets as a packet-sized unit. Keep
descriptors that are valid for transmission separate from descriptors
that are consumed only because they belong to an invalid multi-buffer
packet. The former are returned to the driver as Tx work, while the
latter are written to the CQ address area so they can be reclaimed by
userspace.

The batched path can retain drain state when the producer has not yet
supplied the end of an invalid packet. Do not allow a second Tx socket to
join the pool while such state exists. Gate the batched data path while a
same-pool bind waits for pre-existing readers, then either add the new
socket or fail the bind with -EAGAIN. This guarantees that drain state is
handled only by the singular batched path and avoids teaching the shared
UMEM fallback path about multi-buffer packet draining.

The reclaim-only descriptors must not be submitted to the completion
ring immediately when they follow real Tx descriptors in the same batch.
Drivers may complete only part of the Tx work returned by
xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
too early would also publish earlier real Tx descriptors that the driver
has not completed yet.

Track the number of driver-visible Tx descriptors that precede pending
reclaim descriptors. xsk_tx_completed() first advances through the real
Tx completions and submits the reclaim descriptors only after all earlier
Tx descriptors in the CQ address order have been completed. If a batch
contains only reclaim descriptors, complete them immediately because
there is no driver-visible Tx work in front of them.

This preserves CQ ordering while ensuring that every descriptor consumed
as part of an invalid multi-buffer packet is eventually returned to
userspace.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xsk_buff_pool.h |  6 ++++
 net/xdp/xsk.c               | 62 +++++++++++++++++++++++++++++++---
 net/xdp/xsk_buff_pool.c     | 66 +++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 66 +++++++++++++++++++++++++++----------
 4 files changed, 177 insertions(+), 23 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..4e5abacfcbb7 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -78,9 +78,12 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u32 reclaim_descs;
+	u32 tx_zc_pending_descs;
 	u32 xdp_zc_max_segs;
 	u8 tx_metadata_len; /* inherited from umem */
 	u8 cached_need_wakeup;
+	bool tx_share_pending;
 	bool uses_need_wakeup;
 	bool unaligned;
 	bool tx_sw_csum;
@@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
 bool xp_put_pool(struct xsk_buff_pool *pool);
 void xp_clear_dev(struct xsk_buff_pool *pool);
 void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending);
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
 
 /* AF_XDP, and XDP core. */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 43791647cf18..2dda854c6590 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
 {
+	if (unlikely(pool->reclaim_descs)) {
+		if (nb_entries < pool->tx_zc_pending_descs) {
+			pool->tx_zc_pending_descs -= nb_entries;
+			xskq_prod_submit_n(pool->cq, nb_entries);
+			return;
+		}
+
+		pool->tx_zc_pending_descs = 0;
+		nb_entries += pool->reclaim_descs;
+		pool->reclaim_descs = 0;
+	}
+
 	xskq_prod_submit_n(pool->cq, nb_entries);
 }
 EXPORT_SYMBOL(xsk_tx_completed);
@@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
 
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 {
+	struct xsk_tx_batch batch = {};
 	struct xdp_sock *xs;
+	u32 cq_cached_prod;
 
 	rcu_read_lock();
+
+	/* Pairs with the release stores in xp_prepare_xsk_tx_share() and
+	 * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
+	 * to shared, do not enter the singular batched path.
+	 */
+	if (smp_load_acquire(&pool->tx_share_pending))
+		goto out;
+	if (unlikely(pool->reclaim_descs))
+		goto out;
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
@@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
-	if (!xs) {
-		nb_pkts = 0;
+	if (!xs)
 		goto out;
-	}
 
 	nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
 
@@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	if (!nb_pkts)
 		goto out;
 
-	nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
+	batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
+	nb_pkts = xsk_tx_batch_cq_descs(&batch);
 	if (!nb_pkts) {
 		xs->tx->queue_empty_descs++;
 		goto out;
 	}
 
 	__xskq_cons_release(xs->tx);
+	cq_cached_prod = pool->cq->cached_prod;
+
 	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
+
+	if (unlikely(batch.reclaim_descs)) {
+		u32 cq_pending_descs;
+
+		/* CQ is positional. Descriptors already written but not
+		 * submitted must complete before any reclaim-only descriptors
+		 * appended below.
+		 */
+		cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
+
+		pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
+		pool->reclaim_descs = batch.reclaim_descs;
+		if (unlikely(!pool->tx_zc_pending_descs))
+			xsk_tx_completed(pool, 0);
+	}
+
 	xs->sk.sk_write_space(&xs->sk);
 
 out:
 	rcu_read_unlock();
-	return nb_pkts;
+	return batch.tx_descs;
 }
 EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
 
@@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
+	bool tx_share_pending = false;
 	struct net_device *dev;
 	int bound_dev_if;
 	u32 flags, qid;
@@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 				goto out_unlock;
 			}
 
+			err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
+						      &tx_share_pending);
+			if (err) {
+				sockfd_put(sock);
+				goto out_unlock;
+			}
+
 			xp_get_pool(umem_xs->pool);
 			xs->pool = umem_xs->pool;
 
@@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 			if (xs->tx && !xs->pool->tx_descs) {
 				err = xp_alloc_tx_descs(xs->pool, xs);
 				if (err) {
+					if (tx_share_pending)
+						xp_finish_xsk_tx_share(xs->pool);
 					xp_put_pool(xs->pool);
 					xs->pool = NULL;
 					sockfd_put(sock);
@@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
 	xs->queue_id = qid;
 	xp_add_xsk(xs->pool, xs);
+	if (tx_share_pending)
+		xp_finish_xsk_tx_share(xs->pool);
 
 	if (qid < dev->real_num_rx_queues) {
 		struct netdev_rx_queue *rxq;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f28a9641571..6fa732a843a9 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 	spin_unlock(&pool->xsk_tx_list_lock);
 }
 
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending)
+{
+	struct xdp_sock *tmp;
+	int err = 0;
+
+	*pending = false;
+	if (!xs->tx)
+		return 0;
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	if (!list_is_singular(&pool->xsk_tx_list)) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return 0;
+	}
+
+	if (pool->tx_share_pending) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return -EAGAIN;
+	}
+
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Stop new singular batched Tx readers before synchronize_net()
+	 * waits for readers that may already have observed a singular list.
+	 */
+	smp_store_release(&pool->tx_share_pending, true);
+	*pending = true;
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	/* A batch that observed a singular Tx socket list before the gate was
+	 * armed may set drain_cont. Wait for all such readers before checking
+	 * whether the pool can safely become shared.
+	 */
+	synchronize_net();
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
+		if (READ_ONCE(tmp->drain_cont)) {
+			err = -EAGAIN;
+			break;
+		}
+	}
+
+	if (err) {
+		/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+		 * No socket was added; clear the gate so Tx can resume.
+		 */
+		smp_store_release(&pool->tx_share_pending, false);
+		*pending = false;
+	}
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	return err;
+}
+
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
+{
+	spin_lock(&pool->xsk_tx_list_lock);
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Publish the preceding xp_add_xsk() list update before allowing Tx
+	 * to observe that the share transition has finished.
+	 */
+	smp_store_release(&pool->tx_share_pending, false);
+	spin_unlock(&pool->xsk_tx_list_lock);
+}
+
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 {
 	if (!xs->tx)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3e3fbb73d23e..99fa62e0d337 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -58,6 +58,16 @@ struct parsed_desc {
 	u32 valid;
 };
 
+struct xsk_tx_batch {
+	u32 tx_descs;
+	u32 reclaim_descs;
+};
+
+static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
+{
+	return batch->tx_descs + batch->reclaim_descs;
+}
+
 /* The structure of the shared state of the rings are a simple
  * circular buffer, as outlined in
  * Documentation/core-api/circular-buffers.rst. For the Rx and
@@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
 	parsed->mb = xp_mb_desc(desc);
 }
 
-static inline
-u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
-			      u32 max)
+static inline struct xsk_tx_batch
+xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
+			  u32 max)
 {
-	u32 cached_cons = q->cached_cons, nb_entries = 0;
 	struct xdp_desc *descs = pool->tx_descs;
-	u32 total_descs = 0, nr_frags = 0;
+	bool drain = READ_ONCE(xs->drain_cont);
+	u32 cached_cons, nb_entries = 0;
+	struct xsk_tx_batch batch = {};
+	struct xsk_queue *q = xs->tx;
+	u32 nr_frags = 0;
+
+	cached_cons = q->cached_cons;
 
-	/* track first entry, if stumble upon *any* invalid descriptor, rewind
-	 * current packet that consists of frags and stop the processing
-	 */
 	while (cached_cons != q->cached_prod && nb_entries < max) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 		u32 idx = cached_cons & q->ring_mask;
@@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 		descs[nb_entries] = ring->desc[idx];
 		cached_cons++;
 		parse_desc(q, pool, &descs[nb_entries], &parsed);
-		if (unlikely(!parsed.valid))
-			break;
+		if (unlikely(!parsed.valid)) {
+			if (!drain && !nr_frags && !parsed.mb)
+				break;
+
+			drain = true;
+		}
+
+		nr_frags++;
+		nb_entries++;
 
 		if (likely(!parsed.mb)) {
-			total_descs += (nr_frags + 1);
-			nr_frags = 0;
-		} else {
-			nr_frags++;
-			if (nr_frags == pool->xdp_zc_max_segs) {
+			if (unlikely(drain)) {
+				batch.reclaim_descs = nr_frags;
+				WRITE_ONCE(xs->drain_cont, false);
 				nr_frags = 0;
 				break;
 			}
+
+			batch.tx_descs += nr_frags;
+			nr_frags = 0;
+			continue;
 		}
-		nb_entries++;
+
+		if (nr_frags == pool->xdp_zc_max_segs)
+			drain = true;
 	}
 
-	cached_cons -= nr_frags;
+	if (nr_frags) {
+		if (drain) {
+			batch.reclaim_descs = nr_frags;
+			WRITE_ONCE(xs->drain_cont, true);
+		} else {
+			cached_cons -= nr_frags;
+		}
+	}
 	/* Release valid plus any invalid entries */
 	xskq_cons_release_n(q, cached_cons - q->cached_cons);
-	return total_descs;
+	return batch;
 }
 
 /* Functions for consumers */
-- 
2.43.0


  parent reply	other threads:[~2026-06-23 13:33 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
2026-06-23 13:32 ` Maciej Fijalkowski [this message]
2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
2026-06-24 16:37   ` Maciej Fijalkowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260623133240.1048434-6-maciej.fijalkowski@intel.com \
    --to=maciej.fijalkowski@intel.com \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=horms@kernel.org \
    --cc=kerneljasonxing@gmail.com \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=stfomichev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox