Re: [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb()

Netdev List
 help / color / mirror / Atom feed

From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	 pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
	 maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
	sdf@fomichev.me, ast@kernel.org,  daniel@iogearbox.net,
	hawk@kernel.org, john.fastabend@gmail.com, horms@kernel.org,
	 andrew+netdev@lunn.ch, bpf@vger.kernel.org,
	netdev@vger.kernel.org,  Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb()
Date: Thu, 14 May 2026 17:29:54 -0700	[thread overview]
Message-ID: <agZnf78JR9pcLL3v@devvm7509.cco0.facebook.com> (raw)
In-Reply-To: <CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com>

On 05/14, Jason Xing wrote:
> On Thu, May 14, 2026 at 12:27 AM Stanislav Fomichev
> <sdf.kernel@gmail.com> wrote:
> >
> > On 05/10, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > only the current descriptor is released from the TX ring. The remaining
> > > continuation descriptors of the same packet stay in the ring. Since
> > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > leftover frags and misinterprets each one as the beginning of a new
> > > packet, corrupting the packet stream.
> > >
> > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised.
> > > The main TX loop in __xsk_generic_xmit() then handles continuation
> > > descriptors one at a time: each gets a normal CQ reservation (with
> > > backpressure), its address is submitted to the completion queue, and
> > > the descriptor is released from the TX ring. When the last fragment
> > > (without XDP_PKT_CONTD) is processed, the flag is cleared and the
> > > function returns -EOVERFLOW so the next call starts with a fresh
> > > budget for normal packets.
> > >
> > > This reuses the existing CQ backpressure and budget mechanisms, so if
> > > the CQ is full the function returns -EAGAIN and userspace drains the
> > > CQ before retrying. Zero buffer leakage, zero packet stream corruption.
> > >
> > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  include/net/xdp_sock.h |  1 +
> > >  net/xdp/xsk.c          | 22 ++++++++++++++++++++++
> > >  2 files changed, 23 insertions(+)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index 23e8861e8b25..1958d19d9925 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > >        * call of __xsk_generic_xmit().
> > >        */
> > >       struct sk_buff *skb;
> > > +     bool drain_cont;
> > >
> > >       struct list_head map_list;
> > >       /* Protects map_list */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 3f1e590c855d..232dd7126905 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -936,6 +936,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > >                       xs->tx->invalid_descs++;
> > >               }
> > >               xskq_cons_release(xs->tx);
> > > +             if (xp_mb_desc(desc))
> > > +                     xs->drain_cont = true;
> > >       } else {
> > >               /* Let application retry */
> > >               xsk_cq_cancel_locked(xs->pool, 1);
> > > @@ -982,6 +984,26 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >                       goto out;
> > >               }
> > >
> > > +             if (unlikely(xs->drain_cont)) {
> > > +                     unsigned long flags;
> > > +                     u32 idx;
> > > +
> >
> > [..]
> >
> > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> >
> > Not sure I understand why you want this if you're still marking the desc
> > invalid.
> 
> The key point is that as long as we read the desc from txq, the desc
> should be either put back to txq again or publish it in the cq (for
> application to keep track of this) at this point. Or else, the
> application will lose track of this desc, which breaks the whole
> logic.

Yes, makes sense!
 
> > > +
> > > +                     xs->tx->invalid_descs++;
> > > +                     xskq_cons_release(xs->tx);
> > > +                     if (!xp_mb_desc(&desc)) {
> > > +                             xs->drain_cont = false;
> > > +                             err = -EOVERFLOW;
> > > +                             goto out;
> > > +                     }
> >
> > I also don't understand why you want to return -EOVERFLOW again? Why not
> > (quietly) swallow these invalid xp_mb_desc from the previous packet and move
> > on?
> 
> This particular desc is really one of overflow cases, right? We should
> warn users to handle this with this error code.

Right, but didn't we already return -EOVERFLOW to the user once?
The part where you set drain_count=true will -EOVERFLOW. Then this
remainder will also return -EOVERFLOW?

But let's maybe step back, what's the expectation on the user
when we return -EOVERFLOW? In theory, the user can re-submit new
shorter packet (at the current prod index), right? And then your
!xp_mb_desc logic will break/swallow/-EOVERFLOW it?

> >
> > Also, have you considered doing this drain in xsk_build_skb error path?
> > Is it more messy? Carrying this drain_cont across the calls seems a bit
> > too complicated (but idk, maybe the alternative drain in xsk_build_skb
> > is even messier).
> 
> Sure, it was my first version actually. One of the primary reasons is
> how to handle it gracefully when CQ is full? Then those descriptors
> can not be processed in this loop at this time and will be delayed
> until CQ has available slots. Another important reason is that I want
> to put the whole read process under the previous clear logic in
> __xsk_generic_xmit(). Some protections like budget/batch still take
> effect in this case. In this way, there is no strange/exceptional read
> process in copy mode :)

Agreed, given the fact that we need to handle prod backpressure, your
approach is good (but see above comment, we should probably clarify
what's the expectation on the user side).

next prev parent reply	other threads:[~2026-05-15  0:29 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-10  1:23 [PATCH net 0/4] xsk: fix meta and publish of cq issues Jason Xing
2026-05-10  1:23 ` [PATCH net 1/4] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-11 15:03   ` Stanislav Fomichev
2026-05-12 14:32     ` Jason Xing
2026-05-12 22:34       ` Stanislav Fomichev
2026-05-13 14:21         ` Jason Xing
2026-05-13 15:37           ` Stanislav Fomichev
2026-05-14  0:11             ` Jason Xing
2026-05-10  1:23 ` [PATCH net 2/4] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-10  1:23 ` [PATCH net 3/4] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-13 16:27   ` Stanislav Fomichev
2026-05-14  0:21     ` Jason Xing
2026-05-15  0:29       ` Stanislav Fomichev [this message]
2026-05-15  2:36         ` Jason Xing
2026-05-10  1:23 ` [PATCH net 4/4] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-11 14:16 ` [PATCH net 0/4] xsk: fix meta and publish of cq issues Jakub Kicinski
2026-05-12 14:29   ` Jason Xing

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agZnf78JR9pcLL3v@devvm7509.cco0.facebook.com \
    --to=sdf.kernel@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox