From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>, <magnus.karlsson@intel.com>,
<stfomichev@gmail.com>, <kuba@kernel.org>, <pabeni@redhat.com>,
<horms@kernel.org>, <kerneljasonxing@gmail.com>,
<bjorn@kernel.org>
Subject: Re: [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
Date: Mon, 29 Jun 2026 21:06:25 +0200 [thread overview]
Message-ID: <akLCMYoVilZJZEC1@boxer> (raw)
In-Reply-To: <20260623133240.1048434-6-maciej.fijalkowski@intel.com>
On Tue, Jun 23, 2026 at 03:32:38PM +0200, Maciej Fijalkowski wrote:
> Currently, the zero-copy Tx batching path stops when it encounters an
> invalid descriptor. For multi-buffer packets this can leave descriptors
> consumed from the Tx ring without returning their buffers to userspace
> through the completion ring.
>
> Handle invalid multi-buffer packets as a packet-sized unit. Keep
> descriptors that are valid for transmission separate from descriptors
> that are consumed only because they belong to an invalid multi-buffer
> packet. The former are returned to the driver as Tx work, while the
> latter are written to the CQ address area so they can be reclaimed by
> userspace.
>
> The batched path can retain drain state when the producer has not yet
> supplied the end of an invalid packet. Do not allow a second Tx socket to
> join the pool while such state exists. Gate the batched data path while a
> same-pool bind waits for pre-existing readers, then either add the new
> socket or fail the bind with -EAGAIN. This guarantees that drain state is
> handled only by the singular batched path and avoids teaching the shared
> UMEM fallback path about multi-buffer packet draining.
Well I think this approach is broken unfortunately. Second socket still
can submit too big packet or invalid descriptor within multi-buffer
packet. Then fallback path would not handle it correctly.
Seems we need to teach it how to play with these corner cases.
>
> The reclaim-only descriptors must not be submitted to the completion
> ring immediately when they follow real Tx descriptors in the same batch.
> Drivers may complete only part of the Tx work returned by
> xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
> too early would also publish earlier real Tx descriptors that the driver
> has not completed yet.
>
> Track the number of driver-visible Tx descriptors that precede pending
> reclaim descriptors. xsk_tx_completed() first advances through the real
> Tx completions and submits the reclaim descriptors only after all earlier
> Tx descriptors in the CQ address order have been completed. If a batch
> contains only reclaim descriptors, complete them immediately because
> there is no driver-visible Tx work in front of them.
>
> This preserves CQ ordering while ensuring that every descriptor consumed
> as part of an invalid multi-buffer packet is eventually returned to
> userspace.
>
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
> include/net/xsk_buff_pool.h | 6 ++++
> net/xdp/xsk.c | 62 +++++++++++++++++++++++++++++++---
> net/xdp/xsk_buff_pool.c | 66 +++++++++++++++++++++++++++++++++++++
> net/xdp/xsk_queue.h | 66 +++++++++++++++++++++++++++----------
> 4 files changed, 177 insertions(+), 23 deletions(-)
>
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index ccb3b350001f..4e5abacfcbb7 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -78,9 +78,12 @@ struct xsk_buff_pool {
> u32 chunk_size;
> u32 chunk_shift;
> u32 frame_len;
> + u32 reclaim_descs;
> + u32 tx_zc_pending_descs;
> u32 xdp_zc_max_segs;
> u8 tx_metadata_len; /* inherited from umem */
> u8 cached_need_wakeup;
> + bool tx_share_pending;
> bool uses_need_wakeup;
> bool unaligned;
> bool tx_sw_csum;
> @@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
> bool xp_put_pool(struct xsk_buff_pool *pool);
> void xp_clear_dev(struct xsk_buff_pool *pool);
> void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
> +int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
> + bool *pending);
> +void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
> void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
>
> /* AF_XDP, and XDP core. */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 43791647cf18..2dda854c6590 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
>
> void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
> {
> + if (unlikely(pool->reclaim_descs)) {
> + if (nb_entries < pool->tx_zc_pending_descs) {
> + pool->tx_zc_pending_descs -= nb_entries;
> + xskq_prod_submit_n(pool->cq, nb_entries);
> + return;
> + }
> +
> + pool->tx_zc_pending_descs = 0;
> + nb_entries += pool->reclaim_descs;
> + pool->reclaim_descs = 0;
> + }
> +
> xskq_prod_submit_n(pool->cq, nb_entries);
> }
> EXPORT_SYMBOL(xsk_tx_completed);
> @@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
>
> u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
> {
> + struct xsk_tx_batch batch = {};
> struct xdp_sock *xs;
> + u32 cq_cached_prod;
>
> rcu_read_lock();
> +
> + /* Pairs with the release stores in xp_prepare_xsk_tx_share() and
> + * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
> + * to shared, do not enter the singular batched path.
> + */
> + if (smp_load_acquire(&pool->tx_share_pending))
> + goto out;
> + if (unlikely(pool->reclaim_descs))
> + goto out;
> if (!list_is_singular(&pool->xsk_tx_list)) {
> /* Fallback to the non-batched version */
> rcu_read_unlock();
> @@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
> }
>
> xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
> - if (!xs) {
> - nb_pkts = 0;
> + if (!xs)
> goto out;
> - }
>
> nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
>
> @@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
> if (!nb_pkts)
> goto out;
>
> - nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
> + batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
> + nb_pkts = xsk_tx_batch_cq_descs(&batch);
> if (!nb_pkts) {
> xs->tx->queue_empty_descs++;
> goto out;
> }
>
> __xskq_cons_release(xs->tx);
> + cq_cached_prod = pool->cq->cached_prod;
> +
> xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
> +
> + if (unlikely(batch.reclaim_descs)) {
> + u32 cq_pending_descs;
> +
> + /* CQ is positional. Descriptors already written but not
> + * submitted must complete before any reclaim-only descriptors
> + * appended below.
> + */
> + cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
> +
> + pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
> + pool->reclaim_descs = batch.reclaim_descs;
> + if (unlikely(!pool->tx_zc_pending_descs))
> + xsk_tx_completed(pool, 0);
> + }
> +
> xs->sk.sk_write_space(&xs->sk);
>
> out:
> rcu_read_unlock();
> - return nb_pkts;
> + return batch.tx_descs;
> }
> EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
>
> @@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
> struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> struct sock *sk = sock->sk;
> struct xdp_sock *xs = xdp_sk(sk);
> + bool tx_share_pending = false;
> struct net_device *dev;
> int bound_dev_if;
> u32 flags, qid;
> @@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
> goto out_unlock;
> }
>
> + err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
> + &tx_share_pending);
> + if (err) {
> + sockfd_put(sock);
> + goto out_unlock;
> + }
> +
> xp_get_pool(umem_xs->pool);
> xs->pool = umem_xs->pool;
>
> @@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
> if (xs->tx && !xs->pool->tx_descs) {
> err = xp_alloc_tx_descs(xs->pool, xs);
> if (err) {
> + if (tx_share_pending)
> + xp_finish_xsk_tx_share(xs->pool);
> xp_put_pool(xs->pool);
> xs->pool = NULL;
> sockfd_put(sock);
> @@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
> xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
> xs->queue_id = qid;
> xp_add_xsk(xs->pool, xs);
> + if (tx_share_pending)
> + xp_finish_xsk_tx_share(xs->pool);
>
> if (qid < dev->real_num_rx_queues) {
> struct netdev_rx_queue *rxq;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 1f28a9641571..6fa732a843a9 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
> spin_unlock(&pool->xsk_tx_list_lock);
> }
>
> +int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
> + bool *pending)
> +{
> + struct xdp_sock *tmp;
> + int err = 0;
> +
> + *pending = false;
> + if (!xs->tx)
> + return 0;
> +
> + spin_lock(&pool->xsk_tx_list_lock);
> + if (!list_is_singular(&pool->xsk_tx_list)) {
> + spin_unlock(&pool->xsk_tx_list_lock);
> + return 0;
> + }
> +
> + if (pool->tx_share_pending) {
> + spin_unlock(&pool->xsk_tx_list_lock);
> + return -EAGAIN;
> + }
> +
> + /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> + * Stop new singular batched Tx readers before synchronize_net()
> + * waits for readers that may already have observed a singular list.
> + */
> + smp_store_release(&pool->tx_share_pending, true);
> + *pending = true;
> + spin_unlock(&pool->xsk_tx_list_lock);
> +
> + /* A batch that observed a singular Tx socket list before the gate was
> + * armed may set drain_cont. Wait for all such readers before checking
> + * whether the pool can safely become shared.
> + */
> + synchronize_net();
> +
> + spin_lock(&pool->xsk_tx_list_lock);
> + list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
> + if (READ_ONCE(tmp->drain_cont)) {
> + err = -EAGAIN;
> + break;
> + }
> + }
> +
> + if (err) {
> + /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> + * No socket was added; clear the gate so Tx can resume.
> + */
> + smp_store_release(&pool->tx_share_pending, false);
> + *pending = false;
> + }
> + spin_unlock(&pool->xsk_tx_list_lock);
> +
> + return err;
> +}
> +
> +void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
> +{
> + spin_lock(&pool->xsk_tx_list_lock);
> + /* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
> + * Publish the preceding xp_add_xsk() list update before allowing Tx
> + * to observe that the share transition has finished.
> + */
> + smp_store_release(&pool->tx_share_pending, false);
> + spin_unlock(&pool->xsk_tx_list_lock);
> +}
> +
> void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
> {
> if (!xs->tx)
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 3e3fbb73d23e..99fa62e0d337 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -58,6 +58,16 @@ struct parsed_desc {
> u32 valid;
> };
>
> +struct xsk_tx_batch {
> + u32 tx_descs;
> + u32 reclaim_descs;
> +};
> +
> +static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
> +{
> + return batch->tx_descs + batch->reclaim_descs;
> +}
> +
> /* The structure of the shared state of the rings are a simple
> * circular buffer, as outlined in
> * Documentation/core-api/circular-buffers.rst. For the Rx and
> @@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
> parsed->mb = xp_mb_desc(desc);
> }
>
> -static inline
> -u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
> - u32 max)
> +static inline struct xsk_tx_batch
> +xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
> + u32 max)
> {
> - u32 cached_cons = q->cached_cons, nb_entries = 0;
> struct xdp_desc *descs = pool->tx_descs;
> - u32 total_descs = 0, nr_frags = 0;
> + bool drain = READ_ONCE(xs->drain_cont);
> + u32 cached_cons, nb_entries = 0;
> + struct xsk_tx_batch batch = {};
> + struct xsk_queue *q = xs->tx;
> + u32 nr_frags = 0;
> +
> + cached_cons = q->cached_cons;
>
> - /* track first entry, if stumble upon *any* invalid descriptor, rewind
> - * current packet that consists of frags and stop the processing
> - */
> while (cached_cons != q->cached_prod && nb_entries < max) {
> struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
> u32 idx = cached_cons & q->ring_mask;
> @@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
> descs[nb_entries] = ring->desc[idx];
> cached_cons++;
> parse_desc(q, pool, &descs[nb_entries], &parsed);
> - if (unlikely(!parsed.valid))
> - break;
> + if (unlikely(!parsed.valid)) {
> + if (!drain && !nr_frags && !parsed.mb)
> + break;
> +
> + drain = true;
> + }
> +
> + nr_frags++;
> + nb_entries++;
>
> if (likely(!parsed.mb)) {
> - total_descs += (nr_frags + 1);
> - nr_frags = 0;
> - } else {
> - nr_frags++;
> - if (nr_frags == pool->xdp_zc_max_segs) {
> + if (unlikely(drain)) {
> + batch.reclaim_descs = nr_frags;
> + WRITE_ONCE(xs->drain_cont, false);
> nr_frags = 0;
> break;
> }
> +
> + batch.tx_descs += nr_frags;
> + nr_frags = 0;
> + continue;
> }
> - nb_entries++;
> +
> + if (nr_frags == pool->xdp_zc_max_segs)
> + drain = true;
> }
>
> - cached_cons -= nr_frags;
> + if (nr_frags) {
> + if (drain) {
> + batch.reclaim_descs = nr_frags;
> + WRITE_ONCE(xs->drain_cont, true);
> + } else {
> + cached_cons -= nr_frags;
> + }
> + }
> /* Release valid plus any invalid entries */
> xskq_cons_release_n(q, cached_cons - q->cached_cons);
> - return total_descs;
> + return batch;
> }
>
> /* Functions for consumers */
> --
> 2.43.0
>
next prev parent reply other threads:[~2026-06-29 19:06 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-23 13:32 [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Maciej Fijalkowski
2026-06-26 11:12 ` Larysa Zaremba
2026-06-26 12:24 ` Jason Xing
2026-06-26 13:43 ` Fijalkowski, Maciej
2026-06-27 0:36 ` Jason Xing
2026-06-23 13:32 ` [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx Maciej Fijalkowski
2026-06-25 14:16 ` Jason Xing
2026-06-23 13:32 ` [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path Maciej Fijalkowski
2026-06-29 19:06 ` Maciej Fijalkowski [this message]
2026-06-23 13:32 ` [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test Maciej Fijalkowski
2026-06-23 13:32 ` [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors Maciej Fijalkowski
2026-06-24 15:38 ` [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim Stanislav Fomichev
2026-06-24 16:37 ` Maciej Fijalkowski
2026-06-25 1:33 ` Jason Xing
2026-06-25 16:05 ` Stanislav Fomichev
2026-06-26 0:24 ` Jason Xing
2026-06-26 16:25 ` Stanislav Fomichev
2026-06-27 0:47 ` Jason Xing
2026-06-26 7:42 ` Maciej Fijalkowski
2026-06-26 16:30 ` Stanislav Fomichev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=akLCMYoVilZJZEC1@boxer \
--to=maciej.fijalkowski@intel.com \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=horms@kernel.org \
--cc=kerneljasonxing@gmail.com \
--cc=kuba@kernel.org \
--cc=magnus.karlsson@intel.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=stfomichev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox