Netdev List
 help / color / mirror / Atom feed
From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Jason Xing <kerneljasonxing@gmail.com>
Cc: <davem@davemloft.net>, <edumazet@google.com>, <kuba@kernel.org>,
	<pabeni@redhat.com>, <bjorn@kernel.org>,
	<magnus.karlsson@intel.com>, <jonathan.lemon@gmail.com>,
	<sdf@fomichev.me>, <ast@kernel.org>, <daniel@iogearbox.net>,
	<hawk@kernel.org>, <john.fastabend@gmail.com>, <horms@kernel.org>,
	<andrew+netdev@lunn.ch>, <bpf@vger.kernel.org>,
	<netdev@vger.kernel.org>, Jason Xing <kernelxing@tencent.com>
Subject: Re: [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb()
Date: Thu, 21 May 2026 14:02:14 +0200	[thread overview]
Message-ID: <ag70RrU4nGy0Nlna@boxer> (raw)
In-Reply-To: <CAL+tcoAvbEZ7U2oXBWwLAKeuC==gAHi=GXLZcrEPxbb4mQ9sfA@mail.gmail.com>

On Thu, May 21, 2026 at 07:53:27AM +0800, Jason Xing wrote:
> On Thu, May 21, 2026 at 12:10 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, May 20, 2026 at 08:42:42AM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
> > > only the current descriptor is released from the TX ring. The remaining
> > > continuation descriptors of the same packet stay in the ring. Since
> > > xs->skb is set to NULL after the drop, the TX loop picks up these
> > > leftover frags and misinterprets each one as the beginning of a new
> > > packet, corrupting the packet stream.
> > >
> > > Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
> > > and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
> > > so we have a chance to examine and handle the potential remaining descs
> > > of this big overflow'ed skb.
> > >
> > > When the last fragment (without XDP_PKT_CONTD) is processed, the flag
> > > is cleared and the loop continues to process subsequent descriptors
> > > with the remaining budget. This behavior follows how previous xmit path
> > > treats overflow packets.
> > >
> > > Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
> > > Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  include/net/xdp_sock.h |  1 +
> > >  net/xdp/xsk.c          | 19 +++++++++++++++++++
> > >  2 files changed, 20 insertions(+)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index ebac60a3d8a1..8b51876efbed 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -80,6 +80,7 @@ struct xdp_sock {
> > >        * call of __xsk_generic_xmit().
> > >        */
> > >       struct sk_buff *skb;
> > > +     bool drain_cont;
> > >
> > >       struct list_head map_list;
> > >       /* Protects map_list */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 0a6203c42576..f4add7be8c93 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -1062,11 +1062,30 @@ static int __xsk_generic_xmit(struct sock *sk)
> > >                       goto out;
> > >               }
> > >
> > > +             if (unlikely(xs->drain_cont)) {
> >
> > Hi Jason,
> >
> > could we get away from introducing dedicated boolean to xdp_sock for
> > handling this rare case?
> 
> I've tried several ways, but they didn't work. Please see the short
> conversation with Stan:
> https://lore.kernel.org/all/CAL+tcoA6Cu3P-y2wYjJ_R9cSeRkW2tswY1LveqH=uxd2McuR-w@mail.gmail.com/
> 
> That means we always rely on a flag (that is either static or like
> this patch) to recognize how we treat this abnormal descs.
> 
> >
> > > +                     unsigned long flags;
> > > +                     u32 idx;
> > > +
> > > +                     spin_lock_irqsave(&xs->pool->cq_prod_lock, flags);
> > > +                     idx = xskq_get_prod(xs->pool->cq);
> > > +                     xskq_prod_write_addr(xs->pool->cq, idx, desc.addr);
> > > +                     xskq_prod_submit_n(xs->pool->cq, 1);
> > > +                     spin_unlock_irqrestore(&xs->pool->cq_prod_lock, flags);
> >
> > It feels a bit off to do this cleanup one-by-one.
> 
> Actually no, it adheres to the basic send logic in copy mode.

Not really. Here you are producer to cq which we have batched variants for
usage.

But after second thoughts I don't think it would be worth to put any
further effort onto optimizations here. Anyways, even though generic xmit
is a whole bunch of mess, maybe you could pull this out onto a helper
function?

> 
> >
> > I wonder if it could be covered by xsk_drop_skb() where you would walk the
> > tx ring and store addrs at xsk_addrs and then use xsk_cq_submit_addr_locked() ?
> >
> > OTOH, in theory the amount of frags after we hit the overflow case could
> > be well over MAX_SKB_FRAGS again so xsk_addrs wouldn't be able to cover
> > them.
> 
> Right, this is why we now put the logic into the xmit path just like
> other kinds of descriptors. We need to check and send it under a
> series of protections like the check of cq pressure and budget. I
> don't think it's an exception to escape the check logic. The whole
> process should be fast as there is no need to build skb and send it to
> the driver as long as this case is recognized.
> 
> Thanks,
> Jason
> 
> >
> > > +
> > > +                     xs->tx->invalid_descs++;
> > > +                     xskq_cons_release(xs->tx);
> > > +                     if (!xp_mb_desc(&desc))
> > > +                             xs->drain_cont = false;
> > > +                     continue;
> > > +             }
> > > +
> > >               skb = xsk_build_skb(xs, &desc);
> > >               if (IS_ERR(skb)) {
> > >                       err = PTR_ERR(skb);
> > >                       if (err != -EOVERFLOW)
> > >                               goto out;
> > > +                     if (xp_mb_desc(&desc))
> > > +                             xs->drain_cont = true;
> > >                       err = 0;
> > >                       continue;
> > >               }
> > > --
> > > 2.43.7
> > >

  reply	other threads:[~2026-05-21 12:02 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20  0:42 [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Jason Xing
2026-05-20  0:42 ` [PATCH net v4 1/5] xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata() Jason Xing
2026-05-21 12:04   ` Maciej Fijalkowski
2026-05-20  0:42 ` [PATCH net v4 2/5] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx Jason Xing
2026-05-21 12:05   ` Maciej Fijalkowski
2026-05-20  0:42 ` [PATCH net v4 3/5] xsk: drain continuation descs after overflow in xsk_build_skb() Jason Xing
2026-05-20 16:10   ` Maciej Fijalkowski
2026-05-20 23:53     ` Jason Xing
2026-05-21 12:02       ` Maciej Fijalkowski [this message]
2026-05-21 13:10         ` Jason Xing
2026-05-22  9:06           ` Magnus Karlsson
2026-05-22  9:22             ` Jason Xing
2026-05-20  0:42 ` [PATCH net v4 4/5] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit() Jason Xing
2026-05-20  0:42 ` [PATCH net v4 5/5] selftests/xsk: drain CQ to wait for TX completion Jason Xing
2026-05-21 12:23 ` [PATCH net v4 0/5] xsk: fix meta and publish of cq issues Maciej Fijalkowski
2026-05-21 12:41   ` Jason Xing
2026-05-21 12:59     ` Maciej Fijalkowski
2026-05-21 13:07       ` Jason Xing
2026-05-21 14:24         ` Maciej Fijalkowski
2026-05-22  8:55           ` Jason Xing
2026-05-22 13:48             ` Jason Xing
2026-05-22 18:33               ` Maciej Fijalkowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag70RrU4nGy0Nlna@boxer \
    --to=maciej.fijalkowski@intel.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jonathan.lemon@gmail.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@intel.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sdf@fomichev.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox