Netdev List
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>, <magnus.karlsson@intel.com>,
	<stfomichev@gmail.com>, <kuba@kernel.org>, <pabeni@redhat.com>,
	<horms@kernel.org>, <kerneljasonxing@gmail.com>,
	<bjorn@kernel.org>
Subject: Re: [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
Date: Mon, 29 Jun 2026 21:06:25 +0200	[thread overview]
Message-ID: <akLCMYoVilZJZEC1@boxer> (raw)
In-Reply-To: <20260623133240.1048434-6-maciej.fijalkowski@intel.com>

On Tue, Jun 23, 2026 at 03:32:38PM +0200, Maciej Fijalkowski wrote:
> Currently, the zero-copy Tx batching path stops when it encounters an
> invalid descriptor. For multi-buffer packets this can leave descriptors
> consumed from the Tx ring without returning their buffers to userspace
> through the completion ring.
> 
> Handle invalid multi-buffer packets as a packet-sized unit. Keep
> descriptors that are valid for transmission separate from descriptors
> that are consumed only because they belong to an invalid multi-buffer
> packet. The former are returned to the driver as Tx work, while the
> latter are written to the CQ address area so they can be reclaimed by
> userspace.
> 
> The batched path can retain drain state when the producer has not yet
> supplied the end of an invalid packet. Do not allow a second Tx socket to
> join the pool while such state exists. Gate the batched data path while a
> same-pool bind waits for pre-existing readers, then either add the new
> socket or fail the bind with -EAGAIN. This guarantees that drain state is
> handled only by the singular batched path and avoids teaching the shared
> UMEM fallback path about multi-buffer packet draining.

Well I think this approach is broken unfortunately. Second socket still
can submit too big packet or invalid descriptor within multi-buffer
packet. Then fallback path would not handle it correctly.

Seems we need to teach it how to play with these corner cases.

> 
> The reclaim-only descriptors must not be submitted to the completion
> ring immediately when they follow real Tx descriptors in the same batch.
> Drivers may complete only part of the Tx work returned by
> xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
> too early would also publish earlier real Tx descriptors that the driver
> has not completed yet.
> 
> Track the number of driver-visible Tx descriptors that precede pending
> reclaim descriptors. xsk_tx_completed() first advances through the real
> Tx completions and submits the reclaim descriptors only after all earlier
> Tx descriptors in the CQ address order have been completed. If a batch
> contains only reclaim descriptors, complete them immediately because
> there is no driver-visible Tx work in front of them.
> 
> This preserves CQ ordering while ensuring that every descriptor consumed
> as part of an invalid multi-buffer packet is eventually returned to
> userspace.
> 
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  include/net/xsk_buff_pool.h |  6 ++++
>  net/xdp/xsk.c               | 62 +++++++++++++++++++++++++++++++---
>  net/xdp/xsk_buff_pool.c     | 66 +++++++++++++++++++++++++++++++++++++
>  net/xdp/xsk_queue.h         | 66 +++++++++++++++++++++++++++----------
>  4 files changed, 177 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index ccb3b350001f..4e5abacfcbb7 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -78,9 +78,12 @@ struct xsk_buff_pool {
>  	u32 chunk_size;
>  	u32 chunk_shift;
>  	u32 frame_len;
> +	u32 reclaim_descs;
> +	u32 tx_zc_pending_descs;
>  	u32 xdp_zc_max_segs;
>  	u8 tx_metadata_len; /* inherited from umem */
>  	u8 cached_need_wakeup;
> +	bool tx_share_pending;
>  	bool uses_need_wakeup;
>  	bool unaligned;
>  	bool tx_sw_csum;
> @@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
>  bool xp_put_pool(struct xsk_buff_pool *pool);
>  void xp_clear_dev(struct xsk_buff_pool *pool);
>  void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
> +int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
> +			    bool *pending);
> +void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
>  void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
>  
>  /* AF_XDP, and XDP core. */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 43791647cf18..2dda854c6590 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
>  
>  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
>  {
> +	if (unlikely(pool->reclaim_descs)) {
> +		if (nb_entries < pool->tx_zc_pending_descs) {
> +			pool->tx_zc_pending_descs -= nb_entries;
> +			xskq_prod_submit_n(pool->cq, nb_entries);
> +			return;
> +		}
> +
> +		pool->tx_zc_pending_descs = 0;
> +		nb_entries += pool->reclaim_descs;
> +		pool->reclaim_descs = 0;
> +	}
> +
>  	xskq_prod_submit_n(pool->cq, nb_entries);
>  }
>  EXPORT_SYMBOL(xsk_tx_completed);
> @@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
>  
>  u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
>  {
> +	struct xsk_tx_batch batch = {};
>  	struct xdp_sock *xs;
> +	u32 cq_cached_prod;
>  
>  	rcu_read_lock();
> +
> +	/* Pairs with the release stores in xp_prepare_xsk_tx_share() and
> +	 * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
> +	 * to shared, do not enter the singular batched path.
> +	 */
> +	if (smp_load_acquire(&pool->tx_share_pending))
> +		goto out;
> +	if (unlikely(pool->reclaim_descs))
> +		goto out;
>  	if (!list_is_singular(&pool->xsk_tx_list)) {
>  		/* Fallback to the non-batched version */
>  		rcu_read_unlock();
> @@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
>  	}
>  
>  	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
> -	if (!xs) {
> -		nb_pkts = 0;
> +	if (!xs)
>  		goto out;
> -	}
>  
>  	nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
>  
> @@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
>  	if (!nb_pkts)
>  		goto out;
>  
> -	nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
> +	batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
> +	nb_pkts = xsk_tx_batch_cq_descs(&batch);
>  	if (!nb_pkts) {
>  		xs->tx->queue_empty_descs++;
>  		goto out;
>  	}
>  
>  	__xskq_cons_release(xs->tx);
> +	cq_cached_prod = pool->cq->cached_prod;
> +
>  	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
> +
> +	if (unlikely(batch.reclaim_descs)) {
> +		u32 cq_pending_descs;
> +
> +		/* CQ is positional. Descriptors already written but not
> +		 * submitted must complete before any reclaim-only descriptors
> +		 * appended below.
> +		 */
> +		cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
> +
> +		pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
> +		pool->reclaim_descs = batch.reclaim_descs;
> +		if (unlikely(!pool->tx_zc_pending_descs))
> +			xsk_tx_completed(pool, 0);
> +	}
> +
>  	xs->sk.sk_write_space(&xs->sk);
>  
>  out:
>  	rcu_read_unlock();
> -	return nb_pkts;
> +	return batch.tx_descs;
>  }
>  EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
>  
> @@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
>  	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
>  	struct sock *sk = sock->sk;
>  	struct xdp_sock *xs = xdp_sk(sk);
> +	bool tx_share_pending = false;
>  	struct net_device *dev;
>  	int bound_dev_if;
>  	u32 flags, qid;
> @@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
>  				goto out_unlock;
>  			}
>  
> +			err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
> +						      &tx_share_pending);
> +			if (err) {
> +				sockfd_put(sock);
> +				goto out_unlock;
> +			}
> +
>  			xp_get_pool(umem_xs->pool);
>  			xs->pool = umem_xs->pool;
>  
> @@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
>  			if (xs->tx && !xs->pool->tx_descs) {
>  				err = xp_alloc_tx_descs(xs->pool, xs);
>  				if (err) {
> +					if (tx_share_pending)
> +						xp_finish_xsk_tx_share(xs->pool);
>  					xp_put_pool(xs->pool);
>  					xs->pool = NULL;
>  					sockfd_put(sock);
> @@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
>  	xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
>  	xs->queue_id = qid;
>  	xp_add_xsk(xs->pool, xs);
> +	if (tx_share_pending)
> +		xp_finish_xsk_tx_share(xs->pool);
>  
>  	if (qid < dev->real_num_rx_queues) {
>  		struct netdev_rx_queue *rxq;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 1f28a9641571..6fa732a843a9 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
>  	spin_unlock(&pool->xsk_tx_list_lock);
>  }
>  
> +int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
> +			    bool *pending)
> +{
> +	struct xdp_sock *tmp;
> +	int err = 0;
> +
> +	*pending = false;
> +	if (!xs->tx)
> +		return 0;
> +
> +	spin_lock(&pool->xsk_tx_list_lock);
> +	if (!list_is_singular(&pool->xsk_tx_list)) {
> +		spin_unlock(&pool->xsk_tx_list_lock);
> +		return 0;
> +	}
> +
> +	if (pool->tx_share_pending) {
> +		spin_unlock(&pool->xsk_tx_list_lock);
> +		return -EAGAIN;
> +	}
> +
> +	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> +	 * Stop new singular batched Tx readers before synchronize_net()
> +	 * waits for readers that may already have observed a singular list.
> +	 */
> +	smp_store_release(&pool->tx_share_pending, true);
> +	*pending = true;
> +	spin_unlock(&pool->xsk_tx_list_lock);
> +
> +	/* A batch that observed a singular Tx socket list before the gate was
> +	 * armed may set drain_cont. Wait for all such readers before checking
> +	 * whether the pool can safely become shared.
> +	 */
> +	synchronize_net();
> +
> +	spin_lock(&pool->xsk_tx_list_lock);
> +	list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
> +		if (READ_ONCE(tmp->drain_cont)) {
> +			err = -EAGAIN;
> +			break;
> +		}
> +	}
> +
> +	if (err) {
> +		/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> +		 * No socket was added; clear the gate so Tx can resume.
> +		 */
> +		smp_store_release(&pool->tx_share_pending, false);
> +		*pending = false;
> +	}
> +	spin_unlock(&pool->xsk_tx_list_lock);
> +
> +	return err;
> +}
> +
> +void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
> +{
> +	spin_lock(&pool->xsk_tx_list_lock);
> +	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> +	 * Publish the preceding xp_add_xsk() list update before allowing Tx
> +	 * to observe that the share transition has finished.
> +	 */
> +	smp_store_release(&pool->tx_share_pending, false);
> +	spin_unlock(&pool->xsk_tx_list_lock);
> +}
> +
>  void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
>  {
>  	if (!xs->tx)
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 3e3fbb73d23e..99fa62e0d337 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -58,6 +58,16 @@ struct parsed_desc {
>  	u32 valid;
>  };
>  
> +struct xsk_tx_batch {
> +	u32 tx_descs;
> +	u32 reclaim_descs;
> +};
> +
> +static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
> +{
> +	return batch->tx_descs + batch->reclaim_descs;
> +}
> +
>  /* The structure of the shared state of the rings are a simple
>   * circular buffer, as outlined in
>   * Documentation/core-api/circular-buffers.rst. For the Rx and
> @@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
>  	parsed->mb = xp_mb_desc(desc);
>  }
>  
> -static inline
> -u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
> -			      u32 max)
> +static inline struct xsk_tx_batch
> +xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
> +			  u32 max)
>  {
> -	u32 cached_cons = q->cached_cons, nb_entries = 0;
>  	struct xdp_desc *descs = pool->tx_descs;
> -	u32 total_descs = 0, nr_frags = 0;
> +	bool drain = READ_ONCE(xs->drain_cont);
> +	u32 cached_cons, nb_entries = 0;
> +	struct xsk_tx_batch batch = {};
> +	struct xsk_queue *q = xs->tx;
> +	u32 nr_frags = 0;
> +
> +	cached_cons = q->cached_cons;
>  
> -	/* track first entry, if stumble upon *any* invalid descriptor, rewind
> -	 * current packet that consists of frags and stop the processing
> -	 */
>  	while (cached_cons != q->cached_prod && nb_entries < max) {
>  		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>  		u32 idx = cached_cons & q->ring_mask;
> @@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
>  		descs[nb_entries] = ring->desc[idx];
>  		cached_cons++;
>  		parse_desc(q, pool, &descs[nb_entries], &parsed);
> -		if (unlikely(!parsed.valid))
> -			break;
> +		if (unlikely(!parsed.valid)) {
> +			if (!drain && !nr_frags && !parsed.mb)
> +				break;
> +
> +			drain = true;
> +		}
> +
> +		nr_frags++;
> +		nb_entries++;
>  
>  		if (likely(!parsed.mb)) {
> -			total_descs += (nr_frags + 1);
> -			nr_frags = 0;
> -		} else {
> -			nr_frags++;
> -			if (nr_frags == pool->xdp_zc_max_segs) {
> +			if (unlikely(drain)) {
> +				batch.reclaim_descs = nr_frags;
> +				WRITE_ONCE(xs->drain_cont, false);
>  				nr_frags = 0;
>  				break;
>  			}
> +
> +			batch.tx_descs += nr_frags;
> +			nr_frags = 0;
> +			continue;
>  		}
> -		nb_entries++;
> +
> +		if (nr_frags == pool->xdp_zc_max_segs)
> +			drain = true;
>  	}
>  
> -	cached_cons -= nr_frags;
> +	if (nr_frags) {
> +		if (drain) {
> +			batch.reclaim_descs = nr_frags;
> +			WRITE_ONCE(xs->drain_cont, true);
> +		} else {
> +			cached_cons -= nr_frags;
> +		}
> +	}
>  	/* Release valid plus any invalid entries */
>  	xskq_cons_release_n(q, cached_cons - q->cached_cons);
> -	return total_descs;
> +	return batch;
>  }
>  
>  /* Functions for consumers */
> -- 
> 2.43.0
> 

  reply	other threads:[~2026-06-29 19:06 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
2026-06-26 11:12   ` Larysa Zaremba
2026-06-26 12:24     ` Jason Xing
2026-06-26 13:43       ` Fijalkowski, Maciej
2026-06-27  0:36         ` Jason Xing
2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
2026-06-25 14:16   ` Jason Xing
2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
2026-06-29 19:06   ` Maciej Fijalkowski [this message]
2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
2026-06-24 16:37   ` Maciej Fijalkowski
2026-06-25  1:33     ` Jason Xing
2026-06-25 16:05       ` Stanislav Fomichev
2026-06-26  0:24         ` Jason Xing
2026-06-26 16:25           ` Stanislav Fomichev
2026-06-27  0:47             ` Jason Xing
2026-06-26  7:42         ` Maciej Fijalkowski
2026-06-26 16:30           ` Stanislav Fomichev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=akLCMYoVilZJZEC1@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=horms@kernel.org \
    --cc=kerneljasonxing@gmail.com \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=stfomichev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox