All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	 pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
	 maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
	sdf@fomichev.me, ast@kernel.org,  daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, horms@kernel.org,
	 andrew+netdev@lunn.ch, bpf@vger.kernel.org,
	netdev@vger.kernel.org,  Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb()
Date: Wed, 13 May 2026 09:27:46 -0700	[thread overview]
Message-ID: <agSk6FPCDTBWMGpv@devvm7509.cco0.facebook.com> (raw)
In-Reply-To: <20260510012310.88570-4-kerneljasonxing@gmail.com>

On 05/10, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> only the current descriptor is released from the TX ring. The remaining
> continuation descriptors of the same packet stay in the ring. Since
> xs->skb is set to NULL after the drop, the TX loop picks up these
> leftover frags and misinterprets each one as the beginning of a new
> packet, corrupting the packet stream.
> 
> Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> The main TX loop in __xsk_generic_xmit() then handles continuation
> descriptors one at a time: each gets a normal CQ reservation (with
> backpressure), its address is submitted to the completion queue, and
> the descriptor is released from the TX ring. When the last fragment
> (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> function returns -EOVERFLOW so the next call starts with a fresh
> budget for normal packets.
> 
> This reuses the existing CQ backpressure and budget mechanisms, so if
> the CQ is full the function returns -EAGAIN and userspace drains the
> CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> 
> Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  include/net/xdp_sock.h |  1 +
>  net/xdp/xsk.c          | 22 ++++++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 23e8861e8b25..1958d19d9925 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -80,6 +80,7 @@ struct xdp_sock {
>  	 * call of __xsk_generic_xmit().
>  	 */
>  	struct sk_buff *skb;
> +	bool drain_cont;
>  
>  	struct list_head map_list;
>  	/* Protects map_list */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 3f1e590c855d..232dd7126905 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -936,6 +936,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>  			xs->tx->invalid_descs++;
>  		}
>  		xskq_cons_release(xs->tx);
> +		if (xp_mb_desc(desc))
> +			xs->drain_cont = true;
>  	} else {
>  		/* Let application retry */
>  		xsk_cq_cancel_locked(xs->pool, 1);
> @@ -982,6 +984,26 @@ static int __xsk_generic_xmit(struct sock *sk)
>  			goto out;
>  		}
>  
> +		if (unlikely(xs->drain_cont)) {
> +			unsigned long flags;
> +			u32 idx;
> +

[..]

> +			spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> +			idx = xskq_get_prod(xs->pool->cq);
> +			xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> +			xskq_prod_submit_n(xs->pool->cq, 1);
> +			spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);

Not sure I understand why you want this if you're still marking the desc
invalid.

> +
> +			xs->tx->invalid_descs++;
> +			xskq_cons_release(xs->tx);
> +			if (!xp_mb_desc(&desc)) {
> +				xs->drain_cont = false;
> +				err = -EOVERFLOW;
> +				goto out;
> +			}

I also don't understand why you want to return -EOVERFLOW again? Why not
(quietly) swallow these invalid xp_mb_desc from the previous packet and move
on?

Also, have you considered doing this drain in xsk_build_skb error path?
Is it more messy? Carrying this drain_cont across the calls seems a bit
too complicated (but idk, maybe the alternative drain in xsk_build_skb
is even messier).

  reply	other threads:[~2026-05-13 16:27 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-10  1:23 [PATCH net 0/4] xsk: fix meta and publish of cq issues Jason Xing
2026-05-10  1:23 ` [PATCH net 1/4] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-11 15:03   ` Stanislav Fomichev
2026-05-12 14:32     ` Jason Xing
2026-05-12 22:34       ` Stanislav Fomichev
2026-05-13 14:21         ` Jason Xing
2026-05-13 15:37           ` Stanislav Fomichev
2026-05-14  0:11             ` Jason Xing
2026-05-10  1:23 ` [PATCH net 2/4] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-10  1:23 ` [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-13 16:27   ` Stanislav Fomichev [this message]
2026-05-14  0:21     ` Jason Xing
2026-05-15  0:29       ` Stanislav Fomichev
2026-05-15  2:36         ` Jason Xing
2026-05-10  1:23 ` [PATCH net 4/4] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-11 14:16 ` [PATCH net 0/4] xsk: fix meta and publish of cq issues Jakub Kicinski
2026-05-12 14:29   ` Jason Xing

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agSk6FPCDTBWMGpv@devvm7509.cco0.facebook.com \
    --to=sdf.kernel@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.